Local Transcription for Streamers
A local speech-to-text application designed for streamers that provides real-time transcription using Whisper or similar models. Multiple users can run the application locally and sync their transcriptions to a centralized web stream that can be easily captured in OBS or other streaming software.
Features
- Standalone Desktop Application: Use locally with built-in GUI display - no server required
- Local Transcription: Run Whisper (or compatible models) locally on your machine
- CPU/GPU Support: Choose between CPU or GPU processing based on your hardware
- Real-time Processing: Live audio transcription with minimal latency
- Noise Suppression: Built-in audio preprocessing to reduce background noise
- User Configuration: Set your display name and preferences through the GUI
- Optional Multi-user Sync: Connect to a server to sync transcriptions with other users
- OBS Integration: Web-based output designed for easy browser source capture
- Privacy-First: All processing happens locally; only transcription text is shared
- Customizable: Configure model size, language, and streaming settings
Quick Start
Running from Source
# Install dependencies
uv sync
# Run the application
uv run python main.py
Building Standalone Executables
To create standalone executables for distribution:
Linux:
./build.sh
Windows:
build.bat
For detailed build instructions, see BUILD.md.
Architecture Overview
The application can run in two modes:
Standalone Mode (No Server Required):
- Desktop Application: Captures audio, performs speech-to-text, and displays transcriptions locally in a GUI window
Multi-user Sync Mode (Optional):
- Local Transcription Client: Captures audio, performs speech-to-text, and sends results to the web server
- Centralized Web Server: Aggregates transcriptions from multiple clients and serves a web stream
- Web Stream Interface: Browser-accessible page displaying synchronized transcriptions (for OBS capture)
Use Cases
- Multi-language Streams: Multiple translators transcribing in different languages
- Accessibility: Provide real-time captions for viewers
- Collaborative Podcasts: Multiple hosts with separate transcriptions
- Gaming Commentary: Track who said what in multiplayer sessions
Implementation Plan
Phase 1: Standalone Desktop Application
Objective: Build a fully functional standalone transcription app with GUI that works without any server
Components:
- Audio Capture Module
- Capture system audio or microphone input
- Support multiple audio sources (virtual audio cables, physical devices)
- Real-time audio buffering with configurable chunk sizes
- Noise Suppression: Preprocess audio to reduce background noise
- Libraries: pyaudio, sounddevice, noisereduce, webrtcvad (see the pipeline sketch after the Phase 1 task list)
- Noise Suppression Engine
- Real-time noise reduction using RNNoise or noisereduce
- Adjustable noise reduction strength
- Optional VAD (Voice Activity Detection) to skip silent segments
- Libraries: noisereduce, rnnoise-python, webrtcvad
- Transcription Engine
- Integrate OpenAI Whisper (or alternatives: faster-whisper, whisper.cpp)
- Support multiple model sizes (tiny, base, small, medium, large)
- CPU and GPU inference options
- Model management and automatic downloading
- Libraries: openai-whisper, faster-whisper, torch
- Device Selection
- Auto-detect available compute devices (CPU, CUDA, MPS for Mac)
- Allow user to specify preferred device via GUI
- Graceful fallback if GPU unavailable
- Display device status and performance metrics
- Desktop GUI Application
- Cross-platform GUI using PyQt6, Tkinter, or CustomTkinter
- Main transcription display window (scrolling text area)
- Settings panel for configuration
- User name input field
- Audio input device selector
- Model size selector
- CPU/GPU toggle
- Start/Stop transcription button
- Optional: System tray integration
- Libraries: PyQt6, customtkinter, or tkinter
- Local Display
- Real-time transcription display in GUI window
- Scrolling text with timestamps
- User name/label shown with transcriptions
- Copy transcription to clipboard
- Optional: Save transcription to file (TXT, SRT, VTT)
Tasks:
- Set up project structure and dependencies
- Implement audio capture with device selection
- Add noise suppression and VAD preprocessing
- Integrate Whisper model loading and inference
- Add CPU/GPU device detection and selection logic
- Create real-time audio buffer processing pipeline
- Design and implement GUI layout (main window)
- Add settings panel with user name configuration
- Implement local transcription display area
- Add start/stop controls and status indicators
- Test transcription accuracy and latency
- Test noise suppression effectiveness
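To make the pipeline concrete, here is a minimal sketch of how the Phase 1 components could fit together, assuming the sounddevice, webrtcvad, noisereduce, and faster-whisper libraries listed above. It is an illustration, not the final implementation; the function names, chunking parameters, and VAD aggressiveness are placeholders.

```python
import queue

import noisereduce as nr
import numpy as np
import sounddevice as sd
import torch
import webrtcvad
from faster_whisper import WhisperModel

SAMPLE_RATE = 16000                    # Whisper expects 16 kHz mono
FRAME_MS = 30                          # webrtcvad accepts 10/20/30 ms frames
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000
CHUNK_SECONDS = 2.0                    # mirrors chunk_duration in the config

# Graceful fallback: prefer CUDA when present, otherwise run on CPU.
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

vad = webrtcvad.Vad(2)                 # aggressiveness 0 (loose) to 3 (strict)
model = WhisperModel("base", device=DEVICE,
                     compute_type="float16" if DEVICE == "cuda" else "int8")
frames: "queue.Queue[bytes]" = queue.Queue()

def on_audio(indata, frame_count, time_info, status):
    # Runs on the audio thread; hand the raw PCM bytes to the worker loop.
    frames.put(bytes(indata))

def transcribe_loop():
    buffered = []
    with sd.RawInputStream(samplerate=SAMPLE_RATE, blocksize=FRAME_SAMPLES,
                           dtype="int16", channels=1, callback=on_audio):
        while True:
            frame = frames.get()
            if vad.is_speech(frame, SAMPLE_RATE):   # skip silent frames
                buffered.append(frame)
            if len(buffered) * FRAME_MS < CHUNK_SECONDS * 1000:
                continue
            pcm = np.frombuffer(b"".join(buffered), np.int16)
            audio = pcm.astype(np.float32) / 32768.0
            audio = nr.reduce_noise(y=audio, sr=SAMPLE_RATE)  # noise suppression
            segments, _ = model.transcribe(audio, language="en")
            for seg in segments:
                print(f"[{seg.start:5.1f}s] {seg.text.strip()}")
            buffered.clear()

if __name__ == "__main__":
    transcribe_loop()
```

In the actual application the same loop would append to the GUI's transcription display instead of printing.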
Phase 2: Web Server and Sync System
Objective: Create a centralized server to aggregate and serve transcriptions
Components:
- Web Server
- FastAPI or Flask-based REST API
- WebSocket support for real-time updates
- User/client registration and management
- Libraries: fastapi, uvicorn, websockets
- Transcription Aggregator
- Receive transcription chunks from multiple clients
- Associate transcriptions with user IDs/names
- Timestamp management and synchronization
- Buffer management for smooth streaming
- Database/Storage (Optional)
- Store transcription history (SQLite for simplicity)
- Session management
- Export functionality (SRT, VTT, TXT formats)
API Endpoints:
- POST /api/register - Register a new client
- POST /api/transcription - Submit a transcription chunk
- WS /api/stream - WebSocket for the real-time transcription stream
- GET /stream - Web page for OBS browser source
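Assuming those endpoints, submitting a chunk from a client could look like this (the JSON fields are illustrative, not a fixed schema); a fuller server-side sketch follows the Phase 2 task list below.

```python
import requests

BASE = "http://localhost:8000"

# Register once on startup, then submit chunks as they are produced.
requests.post(f"{BASE}/api/register", json={"name": "Streamer1"})
requests.post(f"{BASE}/api/transcription",
              json={"user": "Streamer1", "text": "Hello chat!", "timestamp": 12.3})
```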
Tasks:
- Set up FastAPI server with CORS support
- Implement WebSocket handler for real-time streaming
- Create client registration system
- Build transcription aggregation logic
- Add timestamp synchronization
- Create data models for clients and transcriptions
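A minimal FastAPI sketch of the aggregation side, covering only the submit and stream endpoints from the list above; the chunk schema and the in-memory viewer set are assumptions:

```python
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from pydantic import BaseModel

app = FastAPI()
viewers: set = set()                 # connected /api/stream sockets

class TranscriptionChunk(BaseModel):
    user: str
    text: str
    timestamp: float

@app.post("/api/transcription")
async def submit_chunk(chunk: TranscriptionChunk):
    # Fan each chunk out to every connected stream viewer.
    dead = set()
    for ws in list(viewers):
        try:
            await ws.send_json(chunk.model_dump())
        except Exception:
            dead.add(ws)
    viewers.difference_update(dead)
    return {"ok": True}

@app.websocket("/api/stream")
async def stream(ws: WebSocket):
    await ws.accept()
    viewers.add(ws)
    try:
        while True:
            await ws.receive_text()  # viewers only listen; this keeps the socket open
    except WebSocketDisconnect:
        viewers.discard(ws)
```

Served with uvicorn, for example `uvicorn api:app --host 0.0.0.0 --port 8000` (module path depending on the final layout).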
Phase 3: Client-Server Communication (Optional Multi-user Mode)
Objective: Add optional server connectivity to enable multi-user transcription sync
Components:
- HTTP/WebSocket Client
- Register client with server on startup
- Send transcription chunks as they're generated
- Handle connection drops and reconnection
- Libraries: requests, websockets (see the reconnect sketch after the Phase 3 task list)
- Configuration System
- Config file for server URL, API keys, user settings
- Model preferences (size, language)
- Audio input settings
- Format: YAML or JSON
- Status Monitoring
- Connection status indicator
- Transcription queue health
- Error handling and logging
Tasks:
- Add "Enable Server Sync" toggle to GUI
- Add server URL configuration field in settings
- Implement WebSocket client for sending transcriptions
- Add configuration file support (YAML/JSON)
- Create connection management with auto-reconnect
- Add local logging and error handling
- Add server connection status indicator to GUI
- Allow app to function normally if server is unavailable
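One way the send loop with auto-reconnect could look, using the websockets library (the endpoint path and the two-second backoff are assumptions):

```python
import asyncio
import json

import websockets

async def sync_loop(url: str, outgoing: asyncio.Queue):
    """Forward transcription chunks to the server, reconnecting on drops."""
    while True:
        try:
            async with websockets.connect(url) as ws:
                while True:
                    chunk = await outgoing.get()      # dicts from the engine
                    await ws.send(json.dumps(chunk))
        except (OSError, websockets.ConnectionClosed):
            # Server unreachable or connection dropped: back off and retry.
            # Local transcription keeps running the whole time.
            await asyncio.sleep(2)
```

The GUI would push finished chunks onto `outgoing` and run `sync_loop` as a background task only while the "Enable Server Sync" toggle is on.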
Phase 4: Web Stream Interface (OBS Integration)
Objective: Create a web page that displays synchronized transcriptions for OBS
Components:
- Web Frontend
- HTML/CSS/JavaScript page for displaying transcriptions
- Responsive design with customizable styling
- Auto-scroll with configurable retention window
- Libraries: vanilla JS or a lightweight framework (Alpine.js, htmx)
- Styling Options
- Customizable fonts, colors, sizes
- Background transparency for OBS chroma key
- User name/ID display options
- Timestamp display (optional)
- Display Modes
- Scrolling captions (like live TV captions)
- Multi-user panel view (separate sections per user)
- Overlay mode (minimal UI for transparency)
Tasks:
- Create HTML template for transcription display
- Implement WebSocket client in JavaScript
- Add CSS styling with OBS-friendly transparency
- Create customization controls (URL parameters or UI)
- Test with OBS browser source
- Add configurable retention/scroll behavior
Phase 5: Advanced Features
Objective: Enhance functionality and user experience
Features:
- Language Detection
- Auto-detect spoken language
- Multi-language support in single stream
- Language selector in GUI
- Speaker Diarization (Optional)
- Identify different speakers
- Label transcriptions by speaker
- Useful for multi-host streams
- Profanity Filtering
- Optional word filtering/replacement
- Customizable filter lists
- Toggle in GUI settings
- Advanced Noise Profiles
- Save and load custom noise profiles
- Adaptive noise suppression
- Different profiles for different environments
- Export Functionality
- Save transcriptions in multiple formats (TXT, SRT, VTT, JSON)
- Export button in GUI
- Automatic session saving
- Hotkey Support
- Global hotkeys to start/stop transcription
- Mute/unmute hotkey
- Quick save hotkey
- Docker Support
- Containerized server deployment
- Docker Compose for easy multi-component setup
- Pre-built images for easy deployment
- Themes and Customization
- Dark/light theme toggle
- Customizable font sizes and colors for display
- OBS-friendly transparent overlay mode
Tasks:
- Add language detection and multi-language support
- Implement speaker diarization
- Create optional profanity filter
- Add export functionality (SRT, VTT, plain text, JSON)
- Implement global hotkey support
- Create Docker containers for server component
- Add theme customization options
- Create advanced noise profile management
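As one concrete piece of the export feature, an SRT writer is small enough to sketch here (illustrative; the entry format is an assumption):

```python
def to_srt(entries):
    """Render [(start_sec, end_sec, text), ...] as an SRT document."""
    def ts(t: float) -> str:
        h, rem = divmod(int(t), 3600)
        m, s = divmod(rem, 60)
        return f"{h:02}:{m:02}:{s:02},{int(t * 1000) % 1000:03}"
    blocks = []
    for i, (start, end, text) in enumerate(entries, 1):
        blocks.append(f"{i}\n{ts(start)} --> {ts(end)}\n{text}\n")
    return "\n".join(blocks)

# Example: to_srt([(0.0, 2.5, "Hello chat!")])
# -> "1\n00:00:00,000 --> 00:00:02,500\nHello chat!\n"
```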
Technology Stack
Local Client:
- Python 3.9+
- GUI: PyQt6 / CustomTkinter / tkinter
- Audio: PyAudio / sounddevice
- Noise Suppression: noisereduce / rnnoise-python
- VAD: webrtcvad
- ML Framework: PyTorch (for Whisper)
- Transcription: openai-whisper / faster-whisper
- Networking: websockets, requests (optional for server sync)
- Config: PyYAML / json
Server:
- Backend: FastAPI / Flask
- WebSocket: python-websockets / FastAPI WebSockets
- Server: Uvicorn / Gunicorn
- Database (optional): SQLite / PostgreSQL
- CORS: FastAPI's built-in CORSMiddleware
Web Interface:
- Frontend: HTML5, CSS3, JavaScript (ES6+)
- Real-time: WebSocket API
- Styling: CSS Grid/Flexbox for layout
Project Structure
local-transcription/
client/ # Local transcription client
__init__.py
audio_capture.py # Audio input handling
transcription_engine.py # Whisper integration
network_client.py # Server communication
config.py # Configuration management
main.py # Client entry point
server/ # Centralized web server
__init__.py
api.py # FastAPI routes
websocket_handler.py # WebSocket management
models.py # Data models
database.py # Optional DB layer
main.py # Server entry point
web/ # Web stream interface
index.html # OBS browser source page
styles.css # Customizable styling
app.js # WebSocket client & UI logic
config/
client_config.example.yaml
server_config.example.yaml
tests/
test_audio.py
test_transcription.py
test_server.py
requirements.txt # Python dependencies
README.md
main.py # Combined launcher (optional)
Installation (Planned)
Prerequisites:
- Python 3.9 or higher
- CUDA-capable GPU (optional, for GPU acceleration)
- FFmpeg (required by Whisper)
Steps:
- Clone the repository:
  git clone <repository-url>
  cd local-transcription
- Install dependencies:
  pip install -r requirements.txt
- Download Whisper models:
  # Models are auto-downloaded on first run; or manually:
  python -c "import whisper; whisper.load_model('base')"
- Configure the client:
  cp config/client_config.example.yaml config/client_config.yaml
  # Edit config/client_config.yaml with your settings
- Run the server (one instance):
  python server/main.py
- Run the client (on each user's machine):
  python client/main.py
- Add to OBS:
  - Add a Browser Source
  - URL: http://<server-ip>:8000/stream
  - Set width/height as needed
  - Check "Shutdown source when not visible" for performance
Configuration (Planned)
Client Configuration:
user:
name: "Streamer1" # Display name for transcriptions
id: "unique-user-id" # Optional unique identifier
audio:
input_device: "default" # or specific device index
sample_rate: 16000
chunk_duration: 2.0 # seconds
noise_suppression:
enabled: true # Enable/disable noise reduction
strength: 0.7 # 0.0 to 1.0 - reduction strength
method: "noisereduce" # "noisereduce" or "rnnoise"
transcription:
model: "base" # tiny, base, small, medium, large
device: "cuda" # cpu, cuda, mps
language: "en" # or "auto" for detection
task: "transcribe" # or "translate"
processing:
use_vad: true # Voice Activity Detection
min_confidence: 0.5 # Minimum transcription confidence
server_sync:
enabled: false # Enable multi-user server sync
url: "ws://localhost:8000" # Server URL (when enabled)
api_key: "" # Optional API key
display:
show_timestamps: true # Show timestamps in local display
max_lines: 100 # Maximum lines to keep in display
font_size: 12 # GUI font size
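Reading this file with PyYAML (planned under Phase 3) takes a few lines; a sketch, assuming the keys above:

```python
import yaml

with open("config/client_config.yaml") as fh:
    cfg = yaml.safe_load(fh)

model_size = cfg["transcription"]["model"]    # e.g. "base"
device = cfg["transcription"]["device"]       # "cpu", "cuda", or "mps"
sync_enabled = cfg["server_sync"]["enabled"]  # standalone mode when False
```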
Server Configuration:
server:
host: "0.0.0.0"
port: 8000
api_key_required: false
stream:
max_clients: 10
buffer_size: 100 # messages to buffer
retention_time: 300 # seconds
database:
enabled: false
path: "transcriptions.db"
Roadmap
- Project planning and architecture design
- Phase 1: Standalone desktop application with GUI
- Phase 2: Web server and sync system (optional multi-user mode)
- Phase 3: Client-server communication (optional)
- Phase 4: Web stream interface for OBS (optional)
- Phase 5: Advanced features (hotkeys, themes, Docker, etc.)
Contributing
Contributions are welcome! Please feel free to submit issues or pull requests.
License
[Choose appropriate license - MIT, Apache 2.0, etc.]
Acknowledgments
- OpenAI Whisper for the excellent speech recognition model
- The streaming community for inspiration and use cases