# Local Transcription
A real-time speech-to-text desktop application for streamers. Run locally on your machine with GPU or CPU, display transcriptions via OBS browser source, and optionally sync with other users through a multi-user server.
Version 1.4.0
## Features
- Real-Time Transcription: Live speech-to-text using Whisper models with minimal latency
- Standalone Desktop App: PySide6/Qt GUI that works without any server
- CPU & GPU Support: Automatic detection of CUDA (NVIDIA), MPS (Apple Silicon), or CPU fallback
- Advanced Voice Detection: Dual-layer VAD (WebRTC + Silero) for accurate speech detection
- OBS Integration: Built-in web server for browser source capture at http://localhost:8080
- Multi-User Sync: Optional Node.js server to sync transcriptions across multiple users
- Custom Fonts: Support for system fonts, web-safe fonts, Google Fonts, and custom font files
- Customizable Colors: User-configurable colors for name, text, and background
- Noise Suppression: Built-in audio preprocessing to reduce background noise
- Auto-Updates: Automatic update checking with release notes display
- Cross-Platform: Builds available for Windows and Linux
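The dual-layer VAD listed above can be pictured as two gates in series: a cheap WebRTC-style check rejects silent frames quickly, and a Silero-style probability score confirms real speech. The sketch below is purely illustrative — the class and method names are hypothetical, not the app's actual API — but the `silero_sensitivity` threshold mirrors the setting documented in the Configuration section.

```python
from dataclasses import dataclass

# Illustrative sketch of a dual-layer VAD gate (names are hypothetical,
# not the app's actual API). Layer 1 models a cheap WebRTC-style energy
# check; layer 2 models a Silero-style speech-probability score with a
# configurable sensitivity threshold.

@dataclass
class DualVad:
    silero_sensitivity: float = 0.4  # mirrors transcription.silero_sensitivity

    def webrtc_pass(self, frame_energy: float) -> bool:
        # Cheap first stage: reject obviously silent frames.
        return frame_energy > 0.01

    def silero_pass(self, speech_prob: float) -> bool:
        # Second stage: a lower sensitivity value means a lower
        # probability is enough to count as speech (more sensitive).
        return speech_prob >= self.silero_sensitivity

    def is_speech(self, frame_energy: float, speech_prob: float) -> bool:
        # A frame counts as speech only if BOTH layers agree, which
        # suppresses false triggers from background noise.
        return self.webrtc_pass(frame_energy) and self.silero_pass(speech_prob)

vad = DualVad(silero_sensitivity=0.4)
print(vad.is_speech(frame_energy=0.2, speech_prob=0.9))    # loud, confident speech
print(vad.is_speech(frame_energy=0.2, speech_prob=0.1))    # noise: layer 2 rejects
print(vad.is_speech(frame_energy=0.001, speech_prob=0.9))  # too quiet: layer 1 rejects
```

Requiring both detectors to agree is what lets the app keep sensitivity high without captioning keyboard clatter or fan noise.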
## Quick Start

### Running from Source

```bash
# Install dependencies
uv sync

# Run the application
uv run python main.py
```
### Using Pre-Built Executables
Download the latest release from the releases page and run the executable for your platform.
### Building from Source

**Linux:**

```bash
./build.sh
# Output: dist/LocalTranscription/LocalTranscription
```

**Windows:**

```bat
build.bat
REM Output: dist\LocalTranscription\LocalTranscription.exe
```
For detailed build instructions, see BUILD.md.
## Usage

### Standalone Mode
1. Launch the application
2. Select your microphone from the audio device dropdown
3. Choose a Whisper model (smaller = faster, larger = more accurate):
   - `tiny.en` / `tiny` - Fastest, good for quick captions
   - `base.en` / `base` - Balanced speed and accuracy
   - `small.en` / `small` - Better accuracy
   - `medium.en` / `medium` - High accuracy
   - `large-v3` - Best accuracy (requires more resources)
4. Click Start to begin transcription
5. Transcriptions appear in the main window and at http://localhost:8080
### OBS Browser Source Setup

1. Start the Local Transcription app
2. In OBS, add a Browser source
3. Set URL to http://localhost:8080
4. Set dimensions (e.g., 1920x300)
5. Check "Shutdown source when not visible" for performance
### Multi-User Mode (Optional)

For syncing transcriptions across multiple users (e.g., multi-host streams or translation teams):

1. Deploy the Node.js server (see server/nodejs/README.md)
2. In the app settings, enable Server Sync
3. Enter the server URL (e.g., http://your-server:3000/api/send)
4. Set a room name and passphrase (shared with other users)
5. In OBS, use the server's display URL with your room name:
   http://your-server:3000/display?room=YOURROOM&timestamps=true&maxlines=50
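To make the sync flow concrete, here is a rough sketch of what a minimal client posting to the server's send endpoint could look like. The wire format is defined by the Node.js server (see server/nodejs/README.md); every field name below (`room`, `passphrase`, `name`, `text`) is an assumption for illustration, not the server's documented API.

```python
import json
import urllib.request

# Illustrative sketch only: the real wire format is defined by the
# Node.js server (see server/nodejs/README.md). All field names here
# (room, passphrase, name, text) are assumptions, not a documented API.

def build_payload(room: str, passphrase: str, name: str, text: str) -> dict:
    """Assemble one transcription event for the sync server."""
    return {"room": room, "passphrase": passphrase, "name": name, "text": text}

def send(url: str, payload: dict) -> None:
    # POST the event as JSON; the /api/send path comes from the
    # "server URL" setting shown in step 3 above.
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        resp.read()

if __name__ == "__main__":
    payload = build_payload("YOURROOM", "secret", "Streamer", "Hello chat!")
    # send("http://your-server:3000/api/send", payload)  # network call, disabled here
    print(json.dumps(payload))
```

The desktop app handles all of this internally via `client/server_sync.py`; the sketch is only meant to show the shape of the interaction.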
## Configuration

Settings are stored at `~/.local-transcription/config.yaml` and can be modified through the GUI settings panel.

### Key Settings
| Setting | Description | Default |
|---|---|---|
| `transcription.model` | Whisper model to use | `base.en` |
| `transcription.device` | Processing device (auto/cuda/cpu) | `auto` |
| `transcription.enable_realtime_transcription` | Show preview while speaking | `false` |
| `transcription.silero_sensitivity` | VAD sensitivity (0-1, lower = more sensitive) | `0.4` |
| `transcription.post_speech_silence_duration` | Silence before finalizing (seconds) | `0.3` |
| `transcription.continuous_mode` | Fast speaker mode for quick talkers | `false` |
| `display.show_timestamps` | Show timestamps with transcriptions | `true` |
| `display.fade_after_seconds` | Fade-out time in seconds (0 = never) | `10` |
| `display.font_source` | Font type (System Font / Web-Safe / Google Font / Custom File) | `System Font` |
| `web_server.port` | Local web server port | `8080` |
See `config/default_config.yaml` for all available options.
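Putting the key settings together, a config tuned for a fast talker on an NVIDIA GPU might look like the fragment below. The key names come from the table above, but the nesting and chosen values are illustrative; `config/default_config.yaml` is the authoritative reference.

```yaml
# ~/.local-transcription/config.yaml (illustrative values)
transcription:
  model: small.en                       # better accuracy than the base.en default
  device: cuda                          # force GPU instead of auto-detection
  enable_realtime_transcription: true   # show preview while speaking
  silero_sensitivity: 0.3               # lower = more sensitive VAD
  post_speech_silence_duration: 0.3
  continuous_mode: true                 # fast speaker mode

display:
  show_timestamps: true
  fade_after_seconds: 10

web_server:
  port: 8080
```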
## Project Structure

```
local-transcription/
├── client/                               # Core transcription modules
│   ├── audio_capture.py                  # Audio input handling
│   ├── transcription_engine_realtime.py  # RealtimeSTT integration
│   ├── noise_suppression.py              # VAD and noise reduction
│   ├── device_utils.py                   # CPU/GPU detection
│   ├── config.py                         # Configuration management
│   ├── server_sync.py                    # Multi-user server client
│   └── update_checker.py                 # Auto-update functionality
├── gui/                                  # Desktop application UI
│   ├── main_window_qt.py                 # Main application window
│   ├── settings_dialog_qt.py             # Settings dialog
│   └── transcription_display_qt.py       # Display widget
├── server/                               # Web servers
│   ├── web_display.py                    # Local FastAPI server for OBS
│   └── nodejs/                           # Multi-user sync server
│       ├── server.js                     # Express + WebSocket server
│       └── README.md                     # Deployment instructions
├── config/
│   └── default_config.yaml               # Default settings template
├── main.py                               # GUI entry point
├── main_cli.py                           # CLI version (for testing)
├── build.sh                              # Linux build script
├── build.bat                             # Windows build script
└── local-transcription.spec              # PyInstaller configuration
```
## Technology Stack

### Desktop Application
- Python 3.9+
- PySide6 - Qt6 GUI framework
- RealtimeSTT - Real-time speech-to-text with advanced VAD
- faster-whisper - Optimized Whisper model inference
- PyTorch - ML framework (CUDA-enabled)
- sounddevice - Cross-platform audio capture
- webrtcvad + silero_vad - Voice activity detection
- noisereduce - Noise suppression
### Web Servers
- FastAPI + Uvicorn - Local web display server
- Node.js + Express + WebSocket - Multi-user sync server
### Build Tools
- PyInstaller - Executable packaging
- uv - Fast Python package manager
## System Requirements

### Minimum
- Python 3.9+
- 4GB RAM
- Any modern CPU
### Recommended (for real-time performance)
- 8GB+ RAM
- NVIDIA GPU with CUDA support (for GPU acceleration)
- FFmpeg (installed automatically with dependencies)
### For Building
- Linux: gcc, Python dev headers
- Windows: Visual Studio Build Tools, Python dev headers
## Troubleshooting

### Model Loading Issues

- Models download automatically on first use to `~/.cache/huggingface/`
- First run requires an internet connection
- Check disk space (models range from 75MB to 3GB)
### Audio Device Issues

```bash
# List available audio devices
uv run python main_cli.py --list-devices
```

- Ensure microphone permissions are granted
- Try different device indices in settings
### GPU Not Detected

```bash
# Check CUDA availability
uv run python -c "import torch; print(torch.cuda.is_available())"
```

- Install NVIDIA drivers (the CUDA toolkit is bundled with the app)
- The app automatically falls back to CPU if no GPU is available
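The automatic fallback follows a simple preference order. The sketch below illustrates that selection logic only — availability flags are passed in explicitly so the example runs anywhere, whereas the app itself probes PyTorch (e.g. `torch.cuda.is_available()`) in `client/device_utils.py`; the function name is hypothetical.

```python
# Illustrative sketch of the auto/cuda/cpu selection described above.
# The function name is hypothetical; the app probes PyTorch directly.

def pick_device(preference: str, cuda_ok: bool, mps_ok: bool) -> str:
    """Resolve transcription.device ('auto', 'cuda', or 'cpu') to a backend."""
    if preference == "cuda":
        # An explicit request still degrades gracefully if CUDA is missing.
        return "cuda" if cuda_ok else "cpu"
    if preference == "cpu":
        return "cpu"
    # 'auto': prefer CUDA, then Apple-Silicon MPS, then CPU.
    if cuda_ok:
        return "cuda"
    if mps_ok:
        return "mps"
    return "cpu"

print(pick_device("auto", cuda_ok=False, mps_ok=True))   # Apple Silicon machine
print(pick_device("cuda", cuda_ok=False, mps_ok=False))  # graceful CPU fallback
```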
### Web Server Port Conflicts

- Default port is 8080
- Change it in settings or edit the config file
- Check for conflicts: `lsof -i :8080` (Linux) or `netstat -ano | findstr :8080` (Windows)
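You can also check the port from Python with nothing but the standard library. This is a hypothetical helper, not part of the app:

```python
import socket

# Hypothetical helper (not part of the app): returns True if something
# is already listening on the given localhost port.

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        # connect_ex returns 0 on success, i.e. someone is listening.
        return s.connect_ex((host, port)) == 0

if port_in_use(8080):
    print("Port 8080 is taken - pick another port in settings")
```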
## Use Cases
- Live Streaming Captions: Add real-time captions to your Twitch/YouTube streams
- Multi-Language Translation: Multiple translators transcribing in different languages
- Accessibility: Provide captions for hearing-impaired viewers
- Podcast Recording: Real-time transcription for multi-host shows
- Gaming Commentary: Track who said what in multiplayer sessions
## Contributing
Contributions are welcome! Please feel free to submit issues or pull requests at the repository.
## License
MIT License
## Acknowledgments
- OpenAI Whisper for the speech recognition model
- RealtimeSTT for real-time transcription capabilities
- faster-whisper for optimized inference