Local Transcription for Streamers
A local speech-to-text application designed for streamers that provides real-time transcription using Whisper or similar models. Multiple users can run the application locally and sync their transcriptions to a centralized web stream that can be easily captured in OBS or other streaming software.
Features
- Standalone Desktop Application: Use locally with built-in GUI display - no server required
- Local Transcription: Run Whisper (or compatible models) locally on your machine
- CPU/GPU Support: Choose between CPU or GPU processing based on your hardware
- Real-time Processing: Live audio transcription with minimal latency
- Noise Suppression: Built-in audio preprocessing to reduce background noise
- User Configuration: Set your display name and preferences through the GUI
- Optional Multi-user Sync: Connect to a server to sync transcriptions with other users
- OBS Integration: Web-based output designed for easy browser source capture
- Privacy-First: All processing happens locally; only transcription text is shared
- Customizable: Configure model size, language, and streaming settings
Quick Start
Running from Source
# Install dependencies
uv sync
# Run the application
uv run python main.py
Building Standalone Executables
To create standalone executables for distribution:
Linux:
./build.sh
Windows:
build.bat
For detailed build instructions, see BUILD.md.
Architecture Overview
The application can run in two modes:
Standalone Mode (No Server Required):
- Desktop Application: Captures audio, performs speech-to-text, and displays transcriptions locally in a GUI window
Multi-user Sync Mode (Optional):
- Local Transcription Client: Captures audio, performs speech-to-text, and sends results to the web server
- Centralized Web Server: Aggregates transcriptions from multiple clients and serves a web stream
- Web Stream Interface: Browser-accessible page displaying synchronized transcriptions (for OBS capture)
Use Cases
- Multi-language Streams: Multiple translators transcribing in different languages
- Accessibility: Provide real-time captions for viewers
- Collaborative Podcasts: Multiple hosts with separate transcriptions
- Gaming Commentary: Track who said what in multiplayer sessions
Implementation Plan
Phase 1: Standalone Desktop Application
Objective: Build a fully functional standalone transcription app with GUI that works without any server
Components:
- Audio Capture Module
  - Capture system audio or microphone input
  - Support multiple audio sources (virtual audio cables, physical devices)
  - Real-time audio buffering with configurable chunk sizes
  - Noise suppression preprocessing to reduce background noise
  - Libraries: pyaudio, sounddevice, noisereduce, webrtcvad
- Noise Suppression Engine
  - Real-time noise reduction using RNNoise or noisereduce
  - Adjustable noise reduction strength
  - Optional VAD (Voice Activity Detection) to skip silent segments
  - Libraries: noisereduce, rnnoise-python, webrtcvad
- Transcription Engine
  - Integrate OpenAI Whisper (or alternatives: faster-whisper, whisper.cpp)
  - Support multiple model sizes (tiny, base, small, medium, large)
  - CPU and GPU inference options
  - Model management and automatic downloading
  - Libraries: openai-whisper, faster-whisper, torch (a pipeline sketch follows this list)
- Device Selection
  - Auto-detect available compute devices (CPU, CUDA, MPS for Mac)
  - Allow user to specify preferred device via GUI
  - Graceful fallback if GPU is unavailable (sketched after the task list below)
  - Display device status and performance metrics
- Desktop GUI Application
  - Cross-platform GUI using PyQt6, Tkinter, or CustomTkinter
  - Main transcription display window (scrolling text area)
  - Settings panel for configuration
  - User name input field
  - Audio input device selector
  - Model size selector
  - CPU/GPU toggle
  - Start/Stop transcription button
  - Optional: system tray integration
  - Libraries: PyQt6, customtkinter, or tkinter
- Local Display
  - Real-time transcription display in GUI window
  - Scrolling text with timestamps
  - User name/label shown with transcriptions
  - Copy transcription to clipboard
  - Optional: save transcription to file (TXT, SRT, VTT)
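To make the flow concrete, here is a minimal capture-denoise-transcribe loop. It is a sketch, assuming sounddevice, noisereduce, and openai-whisper are installed and the default input device is used; the chunk length and model size mirror the example config later in this README.

```python
# Sketch: capture a chunk, reduce noise, and transcribe it with Whisper.
# Chunk length and model size are illustrative defaults, not fixed choices.
import numpy as np
import sounddevice as sd
import noisereduce as nr
import whisper

SAMPLE_RATE = 16000   # Whisper expects 16 kHz mono audio
CHUNK_SECONDS = 2.0   # matches chunk_duration in the example config

model = whisper.load_model("base")  # auto-downloads on first use

def capture_chunk() -> np.ndarray:
    """Record one mono float32 chunk from the default input device."""
    frames = int(SAMPLE_RATE * CHUNK_SECONDS)
    audio = sd.rec(frames, samplerate=SAMPLE_RATE, channels=1, dtype="float32")
    sd.wait()  # block until recording finishes
    return audio.flatten()

def transcribe_chunk(audio: np.ndarray) -> str:
    """Denoise a chunk, then run Whisper inference on it."""
    denoised = nr.reduce_noise(y=audio, sr=SAMPLE_RATE)
    result = model.transcribe(denoised.astype(np.float32), fp16=False)
    return result["text"].strip()

if __name__ == "__main__":
    while True:  # naive loop: real code would capture and infer concurrently
        text = transcribe_chunk(capture_chunk())
        if text:
            print(text)
```

Blocking capture keeps the sketch readable; the actual pipeline would record and infer on separate threads (or overlap chunks) so audio is never dropped while the model runs.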
Tasks:
- Set up project structure and dependencies
- Implement audio capture with device selection
- Add noise suppression and VAD preprocessing
- Integrate Whisper model loading and inference
- Add CPU/GPU device detection and selection logic
- Create real-time audio buffer processing pipeline
- Design and implement GUI layout (main window)
- Add settings panel with user name configuration
- Implement local transcription display area
- Add start/stop controls and status indicators
- Test transcription accuracy and latency
- Test noise suppression effectiveness
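Before moving on, here is a minimal sketch of the device-detection task above, assuming PyTorch is installed; the helper name pick_device is illustrative.

```python
# Sketch: pick a compute device with graceful CPU fallback.
# The function name and "auto" convention are illustrative, not fixed API.
from typing import Optional
import torch

def pick_device(preferred: Optional[str] = None) -> str:
    """Return a usable torch device string, honoring a user preference."""
    if preferred == "cuda" and torch.cuda.is_available():
        return "cuda"
    if preferred == "mps" and torch.backends.mps.is_available():
        return "mps"  # Apple Silicon GPU
    if preferred in (None, "auto"):
        if torch.cuda.is_available():
            return "cuda"
        if torch.backends.mps.is_available():
            return "mps"
    return "cpu"  # graceful fallback when no accelerator is usable

# e.g. whisper.load_model("base", device=pick_device("auto"))
```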
Phase 2: Web Server and Sync System
Objective: Create a centralized server to aggregate and serve transcriptions
Components:
- Web Server
  - FastAPI or Flask-based REST API
  - WebSocket support for real-time updates
  - User/client registration and management
  - Libraries: fastapi, uvicorn, websockets
- Transcription Aggregator
  - Receive transcription chunks from multiple clients
  - Associate transcriptions with user IDs/names
  - Timestamp management and synchronization
  - Buffer management for smooth streaming
- Database/Storage (Optional)
  - Store transcription history (SQLite for simplicity)
  - Session management
  - Export functionality (SRT, VTT, TXT formats)
API Endpoints:
- POST /api/register - Register a new client
- POST /api/transcription - Submit a transcription chunk
- WS /api/stream - WebSocket for the real-time transcription stream
- GET /stream - Web page for the OBS browser source
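A minimal FastAPI sketch of the first three endpoints; the payload fields (user_id, text, timestamp) and the in-memory storage are assumptions, not a fixed schema.

```python
# Sketch of the register/submit/stream endpoints using FastAPI.
# In-memory storage and the payload fields are illustrative assumptions.
import uuid
from typing import Dict, List

from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from pydantic import BaseModel

app = FastAPI()
clients: Dict[str, str] = {}    # user_id -> display name
viewers: List[WebSocket] = []   # sockets subscribed to the stream

class Chunk(BaseModel):
    user_id: str
    text: str
    timestamp: float

@app.post("/api/register")
def register(name: str):
    """Register a client and hand back a generated user id."""
    user_id = str(uuid.uuid4())
    clients[user_id] = name
    return {"user_id": user_id}

@app.post("/api/transcription")
async def submit(chunk: Chunk):
    """Fan a transcription chunk out to every connected viewer."""
    for ws in list(viewers):
        try:
            await ws.send_json(chunk.model_dump())  # pydantic v2; .dict() on v1
        except Exception:
            viewers.remove(ws)  # drop sockets that went away
    return {"ok": True}

@app.websocket("/api/stream")
async def stream(ws: WebSocket):
    await ws.accept()
    viewers.append(ws)
    try:
        while True:
            await ws.receive_text()  # keep the connection alive
    except WebSocketDisconnect:
        viewers.remove(ws)
```

The GET /stream page itself would be static HTML served alongside this app (not shown here).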
Tasks:
- Set up FastAPI server with CORS support
- Implement WebSocket handler for real-time streaming
- Create client registration system
- Build transcription aggregation logic
- Add timestamp synchronization
- Create data models for clients and transcriptions
Phase 3: Client-Server Communication (Optional Multi-user Mode)
Objective: Add optional server connectivity to enable multi-user transcription sync
Components:
- HTTP/WebSocket Client
  - Register client with server on startup
  - Send transcription chunks as they're generated
  - Handle connection drops and reconnection (a reconnect sketch follows this list)
  - Libraries: requests, websockets
- Configuration System
  - Config file for server URL, API keys, user settings
  - Model preferences (size, language)
  - Audio input settings
  - Format: YAML or JSON
- Status Monitoring
  - Connection status indicator
  - Transcription queue health
  - Error handling and logging
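A minimal sketch of the reconnect behavior described above, using the websockets library; the URL and message shape are illustrative.

```python
# Sketch: push transcription chunks to the server, reconnecting on drops.
import asyncio
import json

import websockets

SERVER_URL = "ws://localhost:8000/api/stream"  # would come from the client config

async def sync_worker(queue: "asyncio.Queue[dict]") -> None:
    """Drain queued chunks to the server; back off and retry on failure."""
    while True:
        try:
            async with websockets.connect(SERVER_URL) as ws:
                while True:
                    chunk = await queue.get()
                    await ws.send(json.dumps(chunk))
        except (OSError, websockets.ConnectionClosed):
            await asyncio.sleep(2)  # brief backoff before reconnecting
```

A production version would re-queue the chunk that was in flight when the connection dropped, so nothing is lost across a reconnect.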
Tasks:
- Add "Enable Server Sync" toggle to GUI
- Add server URL configuration field in settings
- Implement WebSocket client for sending transcriptions
- Add configuration file support (YAML/JSON)
- Create connection management with auto-reconnect
- Add local logging and error handling
- Add server connection status indicator to GUI
- Allow app to function normally if server is unavailable
Phase 4: Web Stream Interface (OBS Integration)
Objective: Create a web page that displays synchronized transcriptions for OBS
Components:
- Web Frontend
  - HTML/CSS/JavaScript page for displaying transcriptions
  - Responsive design with customizable styling
  - Auto-scroll with configurable retention window
  - Libraries: vanilla JS or a lightweight framework (Alpine.js, htmx)
- Styling Options
  - Customizable fonts, colors, sizes
  - Background transparency for OBS chroma key
  - User name/ID display options
  - Timestamp display (optional)
- Display Modes
  - Scrolling captions (like live TV captions)
  - Multi-user panel view (separate sections per user)
  - Overlay mode (minimal UI for transparency)
Tasks:
- Create HTML template for transcription display
- Implement WebSocket client in JavaScript
- Add CSS styling with OBS-friendly transparency
- Create customization controls (URL parameters or UI)
- Test with OBS browser source
- Add configurable retention/scroll behavior
Phase 5: Advanced Features
Objective: Enhance functionality and user experience
Features:
- Language Detection
  - Auto-detect spoken language
  - Multi-language support in a single stream
  - Language selector in GUI
- Speaker Diarization (Optional)
  - Identify different speakers
  - Label transcriptions by speaker
  - Useful for multi-host streams
- Profanity Filtering
  - Optional word filtering/replacement
  - Customizable filter lists
  - Toggle in GUI settings
- Advanced Noise Profiles
  - Save and load custom noise profiles
  - Adaptive noise suppression
  - Different profiles for different environments
- Export Functionality
  - Save transcriptions in multiple formats (TXT, SRT, VTT, JSON); an SRT sketch follows this list
  - Export button in GUI
  - Automatic session saving
- Hotkey Support
  - Global hotkeys to start/stop transcription (a sketch follows the task list below)
  - Mute/unmute hotkey
  - Quick save hotkey
- Docker Support
  - Containerized server deployment
  - Docker Compose for easy multi-component setup
  - Pre-built images for easy deployment
- Themes and Customization
  - Dark/light theme toggle
  - Customizable font sizes and colors for display
  - OBS-friendly transparent overlay mode
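For the export feature, SRT is simple enough to generate by hand. A sketch, assuming segments shaped like Whisper's result["segments"] (start, end, text):

```python
# Sketch: render transcription segments as an SRT subtitle file.
# Assumes Whisper-style segment dicts with start/end times in seconds.
def to_srt(segments) -> str:
    def fmt(t: float) -> str:
        """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
        h, rem = divmod(int(t), 3600)
        m, s = divmod(rem, 60)
        ms = int((t - int(t)) * 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    cues = []
    for i, seg in enumerate(segments, start=1):
        cues.append(f"{i}\n{fmt(seg['start'])} --> {fmt(seg['end'])}\n{seg['text'].strip()}\n")
    return "\n".join(cues)
```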
Tasks:
- Add language detection and multi-language support
- Implement speaker diarization
- Create optional profanity filter
- Add export functionality (SRT, VTT, plain text, JSON)
- Implement global hotkey support
- Create Docker containers for server component
- Add theme customization options
- Create advanced noise profile management
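One way the global-hotkey task could look, using the third-party keyboard package (an assumption; it requires root on Linux, and the GUI toolkit's own shortcut system is an alternative):

```python
# Sketch: toggle transcription with a global hotkey via the `keyboard` package.
# The key combination and callback wiring are illustrative.
import keyboard

def toggle_transcription() -> None:
    print("start/stop toggled")  # wire this to the real start/stop control

keyboard.add_hotkey("ctrl+alt+t", toggle_transcription)
keyboard.wait()  # block, listening for hotkeys, until interrupted
```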
Technology Stack
Local Client:
- Python 3.9+
- GUI: PyQt6 / CustomTkinter / tkinter
- Audio: PyAudio / sounddevice
- Noise Suppression: noisereduce / rnnoise-python
- VAD: webrtcvad
- ML Framework: PyTorch (for Whisper)
- Transcription: openai-whisper / faster-whisper
- Networking: websockets, requests (optional for server sync)
- Config: PyYAML / json
Server:
- Backend: FastAPI / Flask
- WebSocket: python-websockets / FastAPI WebSockets
- Server: Uvicorn / Gunicorn
- Database (optional): SQLite / PostgreSQL
- CORS: CORSMiddleware (built into FastAPI/Starlette)
Web Interface:
- Frontend: HTML5, CSS3, JavaScript (ES6+)
- Real-time: WebSocket API
- Styling: CSS Grid/Flexbox for layout
Project Structure
local-transcription/
  client/                    # Local transcription client
    __init__.py
    audio_capture.py         # Audio input handling
    transcription_engine.py  # Whisper integration
    network_client.py        # Server communication
    config.py                # Configuration management
    main.py                  # Client entry point
  server/                    # Centralized web server
    __init__.py
    api.py                   # FastAPI routes
    websocket_handler.py     # WebSocket management
    models.py                # Data models
    database.py              # Optional DB layer
    main.py                  # Server entry point
  web/                       # Web stream interface
    index.html               # OBS browser source page
    styles.css               # Customizable styling
    app.js                   # WebSocket client & UI logic
  config/
    client_config.example.yaml
    server_config.example.yaml
  tests/
    test_audio.py
    test_transcription.py
    test_server.py
  requirements.txt           # Python dependencies
  README.md
  main.py                    # Combined launcher (optional)
Installation (Planned)
Prerequisites:
- Python 3.9 or higher
- CUDA-capable GPU (optional, for GPU acceleration)
- FFmpeg (required by Whisper)
Steps:
1. Clone the repository:
   git clone <repository-url>
   cd local-transcription
2. Install dependencies:
   pip install -r requirements.txt
3. Download Whisper models:
   # Models will be auto-downloaded on first run
   # Or manually download:
   python -c "import whisper; whisper.load_model('base')"
4. Configure the client:
   cp config/client_config.example.yaml config/client_config.yaml
   # Edit config/client_config.yaml with your settings
5. Run the server (one instance):
   python server/main.py
6. Run the client (on each user's machine):
   python client/main.py
7. Add to OBS:
   - Add a Browser Source
   - URL: http://<server-ip>:8000/stream
   - Set width/height as needed
   - Check "Shutdown source when not visible" for performance
Configuration (Planned)
Client Configuration:
user:
  name: "Streamer1"          # Display name for transcriptions
  id: "unique-user-id"       # Optional unique identifier

audio:
  input_device: "default"    # or specific device index
  sample_rate: 16000
  chunk_duration: 2.0        # seconds

noise_suppression:
  enabled: true              # Enable/disable noise reduction
  strength: 0.7              # 0.0 to 1.0 - reduction strength
  method: "noisereduce"      # "noisereduce" or "rnnoise"

transcription:
  model: "base"              # tiny, base, small, medium, large
  device: "cuda"             # cpu, cuda, mps
  language: "en"             # or "auto" for detection
  task: "transcribe"         # or "translate"

processing:
  use_vad: true              # Voice Activity Detection
  min_confidence: 0.5        # Minimum transcription confidence

server_sync:
  enabled: false             # Enable multi-user server sync
  url: "ws://localhost:8000" # Server URL (when enabled)
  api_key: ""                # Optional API key

display:
  show_timestamps: true      # Show timestamps in local display
  max_lines: 100             # Maximum lines to keep in display
  font_size: 12              # GUI font size
Server Configuration:
server:
  host: "0.0.0.0"
  port: 8000
  api_key_required: false

stream:
  max_clients: 10
  buffer_size: 100      # messages to buffer
  retention_time: 300   # seconds

database:
  enabled: false
  path: "transcriptions.db"
Roadmap
- Project planning and architecture design
- Phase 1: Standalone desktop application with GUI
- Phase 2: Web server and sync system (optional multi-user mode)
- Phase 3: Client-server communication (optional)
- Phase 4: Web stream interface for OBS (optional)
- Phase 5: Advanced features (hotkeys, themes, Docker, etc.)
Contributing
Contributions are welcome! Please feel free to submit issues or pull requests.
License
[Choose appropriate license - MIT, Apache 2.0, etc.]
Acknowledgments
- OpenAI Whisper for the excellent speech recognition model
- The streaming community for inspiration and use cases