# Local Transcription for Streamers

A local speech-to-text application designed for streamers that provides real-time transcription using Whisper or similar models. Multiple users can run the application locally and sync their transcriptions to a centralized web stream that can be easily captured in OBS or other streaming software.

## Features

- **Standalone Desktop Application**: Use locally with a built-in GUI display - no server required
- **Local Transcription**: Run Whisper (or compatible models) locally on your machine
- **CPU/GPU Support**: Choose between CPU or GPU processing based on your hardware
- **Real-time Processing**: Live audio transcription with minimal latency
- **Noise Suppression**: Built-in audio preprocessing to reduce background noise
- **User Configuration**: Set your display name and preferences through the GUI
- **Optional Multi-user Sync**: Connect to a server to sync transcriptions with other users
- **OBS Integration**: Web-based output designed for easy browser source capture
- **Privacy-First**: All processing happens locally; only transcription text is shared
- **Customizable**: Configure model size, language, and streaming settings

## Quick Start

### Running from Source

```bash
# Install dependencies
uv sync

# Run the application
uv run python main.py
```

### Building Standalone Executables

To create standalone executables for distribution:

**Linux:**

```bash
./build.sh
```

**Windows:**

```cmd
build.bat
```

For detailed build instructions, see [BUILD.md](BUILD.md).

## Architecture Overview

The application can run in two modes:

### Standalone Mode (No Server Required):

1. **Desktop Application**: Captures audio, performs speech-to-text, and displays transcriptions locally in a GUI window

### Multi-user Sync Mode (Optional):

1. **Local Transcription Client**: Captures audio, performs speech-to-text, and sends results to the web server
2. **Centralized Web Server**: Aggregates transcriptions from multiple clients and serves a web stream
3. **Web Stream Interface**: Browser-accessible page displaying synchronized transcriptions (for OBS capture)

## Use Cases

- **Multi-language Streams**: Multiple translators transcribing in different languages
- **Accessibility**: Provide real-time captions for viewers
- **Collaborative Podcasts**: Multiple hosts with separate transcriptions
- **Gaming Commentary**: Track who said what in multiplayer sessions

---

## Implementation Plan

### Phase 1: Standalone Desktop Application

**Objective**: Build a fully functional standalone transcription app with a GUI that works without any server.

#### Components:

1. **Audio Capture Module**
   - Capture system audio or microphone input
   - Support multiple audio sources (virtual audio cables, physical devices)
   - Real-time audio buffering with configurable chunk sizes
   - **Noise Suppression**: Preprocess audio to reduce background noise
   - Libraries: `pyaudio`, `sounddevice`, `noisereduce`, `webrtcvad`
2. **Noise Suppression Engine**
   - Real-time noise reduction using RNNoise or noisereduce
   - Adjustable noise reduction strength
   - Optional VAD (Voice Activity Detection) to skip silent segments
   - Libraries: `noisereduce`, `rnnoise-python`, `webrtcvad`
3. **Transcription Engine**
   - Integrate OpenAI Whisper (or alternatives: faster-whisper, whisper.cpp)
   - Support multiple model sizes (tiny, base, small, medium, large)
   - CPU and GPU inference options
   - Model management and automatic downloading
   - Libraries: `openai-whisper`, `faster-whisper`, `torch`
   - (a minimal pipeline sketch follows this list)
4. **Device Selection**
   - Auto-detect available compute devices (CPU, CUDA, MPS for Mac)
   - Allow the user to specify a preferred device via the GUI
   - Graceful fallback if the GPU is unavailable
   - Display device status and performance metrics
5. **Desktop GUI Application**
   - Cross-platform GUI using PyQt6, Tkinter, or CustomTkinter
   - Main transcription display window (scrolling text area)
   - Settings panel for configuration
     - User name input field
     - Audio input device selector
     - Model size selector
     - CPU/GPU toggle
   - Start/Stop transcription button
   - Optional: System tray integration
   - Libraries: `PyQt6`, `customtkinter`, or `tkinter`
6. **Local Display**
   - Real-time transcription display in the GUI window
   - Scrolling text with timestamps
   - User name/label shown with transcriptions
   - Copy transcription to clipboard
   - Optional: Save transcription to file (TXT, SRT, VTT)
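Before any GUI work, the capture → denoise → transcribe path can be validated on its own. Below is a minimal sketch assuming `sounddevice`, `noisereduce`, and `faster-whisper`; the constants mirror the planned config keys (`sample_rate`, `chunk_duration`, `strength`) and are illustrative, not final API.

```python
# Minimal capture -> denoise -> transcribe sketch (not the final pipeline).
import sounddevice as sd
import noisereduce as nr
from faster_whisper import WhisperModel

SAMPLE_RATE = 16000   # Whisper models expect 16 kHz mono audio
CHUNK_SECONDS = 2.0   # mirrors the planned `chunk_duration` setting

# "base" model on CPU; swap device="cuda" if a GPU is available.
model = WhisperModel("base", device="cpu", compute_type="int8")

# Record one chunk from the default input device.
chunk = sd.rec(int(SAMPLE_RATE * CHUNK_SECONDS), samplerate=SAMPLE_RATE,
               channels=1, dtype="float32")
sd.wait()
audio = chunk.flatten()

# Reduce stationary background noise before transcription
# (prop_decrease=0.7 mirrors the planned `strength` setting).
audio = nr.reduce_noise(y=audio, sr=SAMPLE_RATE, prop_decrease=0.7)

segments, _info = model.transcribe(audio, language="en")
for segment in segments:
    print(f"[{segment.start:5.1f}s] {segment.text.strip()}")
```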
#### Tasks:

- [ ] Set up project structure and dependencies
- [ ] Implement audio capture with device selection
- [ ] Add noise suppression and VAD preprocessing
- [ ] Integrate Whisper model loading and inference
- [ ] Add CPU/GPU device detection and selection logic
- [ ] Create real-time audio buffer processing pipeline
- [ ] Design and implement GUI layout (main window)
- [ ] Add settings panel with user name configuration
- [ ] Implement local transcription display area
- [ ] Add start/stop controls and status indicators
- [ ] Test transcription accuracy and latency
- [ ] Test noise suppression effectiveness

---

### Phase 2: Web Server and Sync System

**Objective**: Create a centralized server to aggregate and serve transcriptions.

#### Components:

1. **Web Server**
   - FastAPI or Flask-based REST API
   - WebSocket support for real-time updates
   - User/client registration and management
   - Libraries: `fastapi`, `uvicorn`, `websockets`
2. **Transcription Aggregator**
   - Receive transcription chunks from multiple clients
   - Associate transcriptions with user IDs/names
   - Timestamp management and synchronization
   - Buffer management for smooth streaming
3. **Database/Storage** (Optional)
   - Store transcription history (SQLite for simplicity)
   - Session management
   - Export functionality (SRT, VTT, TXT formats)

#### API Endpoints:

- `POST /api/register` - Register a new client
- `POST /api/transcription` - Submit a transcription chunk
- `WS /api/stream` - WebSocket for the real-time transcription stream
- `GET /stream` - Web page for the OBS browser source
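As a rough sketch of how the submit and stream endpoints could interact (assuming FastAPI; the in-memory viewer set and the free-form `chunk` payload are placeholder choices, not the final data model):

```python
# Sketch: fan submitted transcription chunks out to connected viewers.
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()
viewers: set[WebSocket] = set()  # sockets connected to /api/stream

@app.post("/api/transcription")
async def submit_transcription(chunk: dict):
    """Accept a chunk from a client and broadcast it to all viewers."""
    dead = set()
    for ws in viewers:
        try:
            await ws.send_json(chunk)
        except Exception:
            dead.add(ws)  # drop sockets that failed mid-send
    viewers.difference_update(dead)
    return {"status": "ok"}

@app.websocket("/api/stream")
async def stream(ws: WebSocket):
    await ws.accept()
    viewers.add(ws)
    try:
        while True:
            await ws.receive_text()  # only serves to detect disconnects
    except WebSocketDisconnect:
        viewers.discard(ws)
```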
#### Tasks:

- [ ] Set up FastAPI server with CORS support
- [ ] Implement WebSocket handler for real-time streaming
- [ ] Create client registration system
- [ ] Build transcription aggregation logic
- [ ] Add timestamp synchronization
- [ ] Create data models for clients and transcriptions

---

### Phase 3: Client-Server Communication (Optional Multi-user Mode)

**Objective**: Add optional server connectivity to enable multi-user transcription sync.

#### Components:

1. **HTTP/WebSocket Client**
   - Register the client with the server on startup
   - Send transcription chunks as they are generated
   - Handle connection drops and reconnection (see the sketch after the task list below)
   - Libraries: `requests`, `websockets`
2. **Configuration System**
   - Config file for server URL, API keys, and user settings
   - Model preferences (size, language)
   - Audio input settings
   - Format: YAML or JSON
3. **Status Monitoring**
   - Connection status indicator
   - Transcription queue health
   - Error handling and logging

#### Tasks:

- [ ] Add "Enable Server Sync" toggle to GUI
- [ ] Add server URL configuration field in settings
- [ ] Implement WebSocket client for sending transcriptions
- [ ] Add configuration file support (YAML/JSON)
- [ ] Create connection management with auto-reconnect
- [ ] Add local logging and error handling
- [ ] Add server connection status indicator to GUI
- [ ] Allow the app to function normally if the server is unavailable
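The reconnect behavior from component 1 could look roughly like this (a sketch assuming the `websockets` library; the queue, the hypothetical ingest URL, and the JSON payload are placeholders, since the final client-to-server protocol is not settled):

```python
# Sketch: send transcription chunks, reconnecting if the server drops.
import asyncio
import json
import websockets

# Hypothetical ingest endpoint; the final protocol may differ.
SERVER_URL = "ws://localhost:8000/api/ingest"

async def sync_worker(queue: asyncio.Queue) -> None:
    while True:
        try:
            async with websockets.connect(SERVER_URL) as ws:
                while True:
                    chunk = await queue.get()  # produced by the transcriber
                    await ws.send(json.dumps(chunk))
        except (OSError, websockets.ConnectionClosed):
            # Server unreachable or connection dropped: back off, retry.
            # Local transcription keeps running regardless (see tasks above).
            await asyncio.sleep(2)
```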
---

### Phase 4: Web Stream Interface (OBS Integration)

**Objective**: Create a web page that displays synchronized transcriptions for OBS.

#### Components:

1. **Web Frontend**
   - HTML/CSS/JavaScript page for displaying transcriptions
   - Responsive design with customizable styling
   - Auto-scroll with a configurable retention window
   - Libraries: Vanilla JS or a lightweight framework (Alpine.js, htmx)
2. **Styling Options**
   - Customizable fonts, colors, and sizes
   - Background transparency for OBS chroma key
   - User name/ID display options
   - Timestamp display (optional)
3. **Display Modes**
   - Scrolling captions (like live TV captions)
   - Multi-user panel view (separate sections per user)
   - Overlay mode (minimal UI for transparency)

#### Tasks:

- [ ] Create HTML template for transcription display
- [ ] Implement WebSocket client in JavaScript
- [ ] Add CSS styling with OBS-friendly transparency
- [ ] Create customization controls (URL parameters or UI)
- [ ] Test with OBS browser source
- [ ] Add configurable retention/scroll behavior

---

### Phase 5: Advanced Features

**Objective**: Enhance functionality and user experience.

#### Features:

1. **Language Detection**
   - Auto-detect the spoken language
   - Multi-language support in a single stream
   - Language selector in the GUI
2. **Speaker Diarization** (Optional)
   - Identify different speakers
   - Label transcriptions by speaker
   - Useful for multi-host streams
3. **Profanity Filtering**
   - Optional word filtering/replacement
   - Customizable filter lists
   - Toggle in GUI settings
4. **Advanced Noise Profiles**
   - Save and load custom noise profiles
   - Adaptive noise suppression
   - Different profiles for different environments
5. **Export Functionality**
   - Save transcriptions in multiple formats (TXT, SRT, VTT, JSON)
   - Export button in the GUI
   - Automatic session saving
6. **Hotkey Support**
   - Global hotkeys to start/stop transcription
   - Mute/unmute hotkey
   - Quick save hotkey
7. **Docker Support**
   - Containerized server deployment
   - Docker Compose for easy multi-component setup
   - Pre-built images for easy deployment
8. **Themes and Customization**
   - Dark/light theme toggle
   - Customizable font sizes and colors for the display
   - OBS-friendly transparent overlay mode

#### Tasks:

- [ ] Add language detection and multi-language support
- [ ] Implement speaker diarization
- [ ] Create optional profanity filter
- [ ] Add export functionality (SRT, VTT, plain text, JSON)
- [ ] Implement global hotkey support
- [ ] Create Docker containers for server component
- [ ] Add theme customization options
- [ ] Create advanced noise profile management

---

## Technology Stack

### Local Client:

- **Python 3.9+**
- **GUI**: PyQt6 / CustomTkinter / tkinter
- **Audio**: PyAudio / sounddevice
- **Noise Suppression**: noisereduce / rnnoise-python
- **VAD**: webrtcvad
- **ML Framework**: PyTorch (for Whisper)
- **Transcription**: openai-whisper / faster-whisper
- **Networking**: websockets, requests (optional, for server sync)
- **Config**: PyYAML / json

### Server:

- **Backend**: FastAPI / Flask
- **WebSocket**: python-websockets / FastAPI WebSockets
- **Server**: Uvicorn / Gunicorn
- **Database** (optional): SQLite / PostgreSQL
- **CORS**: fastapi-cors

### Web Interface:

- **Frontend**: HTML5, CSS3, JavaScript (ES6+)
- **Real-time**: WebSocket API
- **Styling**: CSS Grid/Flexbox for layout

---

## Project Structure

```
local-transcription/
├── client/                      # Local transcription client
│   ├── __init__.py
│   ├── audio_capture.py         # Audio input handling
│   ├── transcription_engine.py  # Whisper integration
│   ├── network_client.py        # Server communication
│   ├── config.py                # Configuration management
│   └── main.py                  # Client entry point
├── server/                      # Centralized web server
│   ├── __init__.py
│   ├── api.py                   # FastAPI routes
│   ├── websocket_handler.py     # WebSocket management
│   ├── models.py                # Data models
│   ├── database.py              # Optional DB layer
│   └── main.py                  # Server entry point
├── web/                         # Web stream interface
│   ├── index.html               # OBS browser source page
│   ├── styles.css               # Customizable styling
│   └── app.js                   # WebSocket client & UI logic
├── config/
│   ├── client_config.example.yaml
│   └── server_config.example.yaml
├── tests/
│   ├── test_audio.py
│   ├── test_transcription.py
│   └── test_server.py
├── requirements.txt             # Python dependencies
├── README.md
└── main.py                      # Combined launcher (optional)
```

---

## Installation (Planned)

### Prerequisites:

- Python 3.9 or higher
- CUDA-capable GPU (optional, for GPU acceleration)
- FFmpeg (required by Whisper)

### Steps:

1. **Clone the repository**

   ```bash
   git clone <repository-url>
   cd local-transcription
   ```

2. **Install dependencies**

   ```bash
   pip install -r requirements.txt
   ```

3. **Download Whisper models**

   ```bash
   # Models will be auto-downloaded on first run
   # Or manually download:
   python -c "import whisper; whisper.load_model('base')"
   ```

4. **Configure the client**

   ```bash
   cp config/client_config.example.yaml config/client_config.yaml
   # Edit config/client_config.yaml with your settings
   ```

5. **Run the server** (one instance)

   ```bash
   python server/main.py
   ```

6. **Run the client** (on each user's machine)

   ```bash
   python client/main.py
   ```

7. **Add to OBS**
   - Add a Browser Source
   - URL: `http://<server-ip>:8000/stream`
   - Set width/height as needed
   - Check "Shutdown source when not visible" for performance

---

## Configuration (Planned)

### Client Configuration:

```yaml
user:
  name: "Streamer1"            # Display name for transcriptions
  id: "unique-user-id"         # Optional unique identifier

audio:
  input_device: "default"      # or a specific device index
  sample_rate: 16000
  chunk_duration: 2.0          # seconds

noise_suppression:
  enabled: true                # Enable/disable noise reduction
  strength: 0.7                # 0.0 to 1.0 - reduction strength
  method: "noisereduce"        # "noisereduce" or "rnnoise"

transcription:
  model: "base"                # tiny, base, small, medium, large
  device: "cuda"               # cpu, cuda, mps
  language: "en"               # or "auto" for detection
  task: "transcribe"           # or "translate"

processing:
  use_vad: true                # Voice Activity Detection
  min_confidence: 0.5          # Minimum transcription confidence

server_sync:
  enabled: false               # Enable multi-user server sync
  url: "ws://localhost:8000"   # Server URL (when enabled)
  api_key: ""                  # Optional API key

display:
  show_timestamps: true        # Show timestamps in local display
  max_lines: 100               # Maximum lines to keep in display
  font_size: 12                # GUI font size
```
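A sketch of how the client might load this file and apply the graceful GPU fallback from Phase 1 (assuming PyYAML; the path and key names mirror the example above and are not final):

```python
# Sketch: load the client config and fall back to CPU if CUDA is missing.
import yaml

with open("config/client_config.yaml", encoding="utf-8") as f:
    config = yaml.safe_load(f)

user_name = config["user"]["name"]
model_size = config["transcription"]["model"]   # e.g. "base"
device = config["transcription"]["device"]      # "cpu", "cuda", or "mps"

# Graceful fallback (Phase 1, component 4): honor the configured device
# only if it is actually available on this machine.
try:
    import torch
    if device == "cuda" and not torch.cuda.is_available():
        device = "cpu"
except ImportError:
    device = "cpu"

print(f"{user_name}: loading Whisper '{model_size}' on {device}")
```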
### Server Configuration:

```yaml
server:
  host: "0.0.0.0"
  port: 8000
  api_key_required: false

stream:
  max_clients: 10
  buffer_size: 100        # messages to buffer
  retention_time: 300     # seconds

database:
  enabled: false
  path: "transcriptions.db"
```

---

## Roadmap

- [x] Project planning and architecture design
- [ ] Phase 1: Standalone desktop application with GUI
- [ ] Phase 2: Web server and sync system (optional multi-user mode)
- [ ] Phase 3: Client-server communication (optional)
- [ ] Phase 4: Web stream interface for OBS (optional)
- [ ] Phase 5: Advanced features (hotkeys, themes, Docker, etc.)

---

## Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

---

## License

[Choose appropriate license - MIT, Apache 2.0, etc.]

---

## Acknowledgments

- OpenAI Whisper for the excellent speech recognition model
- The streaming community for inspiration and use cases