Local Transcription

A real-time speech-to-text desktop application for streamers. It runs locally on your machine (GPU or CPU), displays transcriptions via an OBS browser source, and can optionally sync with other users through a multi-user server.

Version 1.4.0

Features

  • Real-Time Transcription: Live speech-to-text using Whisper models with minimal latency
  • Standalone Desktop App: PySide6/Qt GUI that works without any server
  • CPU & GPU Support: Automatic detection of CUDA (NVIDIA), MPS (Apple Silicon), or CPU fallback
  • Advanced Voice Detection: Dual-layer VAD (WebRTC + Silero) for accurate speech detection
  • OBS Integration: Built-in web server for browser source capture at http://localhost:8080
  • Multi-User Sync: Optional Node.js server to sync transcriptions across multiple users
  • Custom Fonts: Support for system fonts, web-safe fonts, Google Fonts, and custom font files
  • Customizable Colors: User-configurable colors for name, text, and background
  • Noise Suppression: Built-in audio preprocessing to reduce background noise
  • Auto-Updates: Automatic update checking with release notes display
  • Cross-Platform: Builds available for Windows and Linux

Quick Start

Running from Source

# Install dependencies
uv sync

# Run the application
uv run python main.py

Using Pre-Built Executables

Download the latest release from the releases page and run the executable for your platform.

Building from Source

Linux:

./build.sh
# Output: dist/LocalTranscription/LocalTranscription

Windows:

build.bat
# Output: dist\LocalTranscription\LocalTranscription.exe

For detailed build instructions, see BUILD.md.

Usage

Standalone Mode

  1. Launch the application
  2. Select your microphone from the audio device dropdown
  3. Choose a Whisper model (smaller = faster, larger = more accurate):
    • tiny.en / tiny - Fastest, good for quick captions
    • base.en / base - Balanced speed and accuracy
    • small.en / small - Better accuracy
    • medium.en / medium - High accuracy
    • large-v3 - Best accuracy (requires more resources)
  4. Click Start to begin transcription
  5. Transcriptions appear in the main window and at http://localhost:8080
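The model tiers in step 3 trade speed for accuracy. As a rough illustration, model choice can be expressed as a lookup from a desired profile (this helper is hypothetical, not part of the app's code):

```python
# Hypothetical helper illustrating the model tiers listed above.
# Smaller models transcribe faster; larger ones are more accurate.
MODEL_TIERS = {
    "fastest":  "tiny.en",    # quick captions, lowest latency
    "balanced": "base.en",    # default: good speed/accuracy trade-off
    "accurate": "small.en",
    "high":     "medium.en",
    "best":     "large-v3",   # best accuracy, needs the most resources
}

def choose_model(profile: str, english_only: bool = True) -> str:
    """Map a speed/accuracy profile to a Whisper model name."""
    name = MODEL_TIERS[profile]
    # Multilingual variants drop the ".en" suffix (large-v3 has none).
    if not english_only and name.endswith(".en"):
        name = name[:-3]
    return name
```

For example, choose_model("fastest") yields tiny.en, while choose_model("balanced", english_only=False) yields the multilingual base model.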

OBS Browser Source Setup

  1. Start the Local Transcription app
  2. In OBS, add a Browser source
  3. Set URL to http://localhost:8080
  4. Set dimensions (e.g., 1920x300)
  5. Check "Shutdown source when not visible" for performance

Multi-User Mode (Optional)

For syncing transcriptions across multiple users (e.g., multi-host streams or translation teams):

  1. Deploy the Node.js server (see server/nodejs/README.md)
  2. In the app settings, enable Server Sync
  3. Enter the server URL (e.g., http://your-server:3000/api/send)
  4. Set a room name and passphrase (shared with other users)
  5. In OBS, use the server's display URL with your room name:
    http://your-server:3000/display?room=YOURROOM&timestamps=true&maxlines=50
    
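The display URL in step 5 is just a query string over the room settings. A stdlib-only sketch of how it could be assembled (parameter names are taken from the example URL above; the helper itself is hypothetical):

```python
from urllib.parse import urlencode

def build_display_url(server: str, room: str,
                      timestamps: bool = True, maxlines: int = 50) -> str:
    """Build the OBS display URL for a multi-user room."""
    query = urlencode({
        "room": room,
        "timestamps": str(timestamps).lower(),  # "true" / "false"
        "maxlines": maxlines,
    })
    return f"{server}/display?{query}"
```

Calling build_display_url("http://your-server:3000", "YOURROOM") reproduces the example URL shown above.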

Configuration

Settings are stored at ~/.local-transcription/config.yaml and can be modified through the GUI settings panel.

Key Settings

| Setting | Description | Default |
|---------|-------------|---------|
| transcription.model | Whisper model to use | base.en |
| transcription.device | Processing device (auto/cuda/cpu) | auto |
| transcription.enable_realtime_transcription | Show preview while speaking | false |
| transcription.silero_sensitivity | VAD sensitivity (0-1, higher = more sensitive) | 0.4 |
| transcription.post_speech_silence_duration | Silence before finalizing (seconds) | 0.3 |
| transcription.continuous_mode | Fast speaker mode for quick talkers | false |
| display.show_timestamps | Show timestamps with transcriptions | true |
| display.fade_after_seconds | Fade-out time in seconds (0 = never) | 10 |
| display.font_source | Font type (System Font/Web-Safe/Google Font/Custom File) | System Font |
| web_server.port | Local web server port | 8080 |

See config/default_config.yaml for all available options.
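Dotted setting names like transcription.model map onto nested keys in config.yaml. A stdlib-only sketch of that lookup (the function name and the fallback behavior are illustrative; the app's actual config code in client/config.py may differ):

```python
def get_setting(config: dict, dotted_key: str, default=None):
    """Resolve a dotted key like 'transcription.model' in a nested dict."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node

# Example config shaped like the settings table above.
config = {
    "transcription": {"model": "base.en", "device": "auto"},
    "web_server": {"port": 8080},
}
```

With this shape, get_setting(config, "web_server.port") returns 8080, and a missing key falls back to the supplied default.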

Project Structure

local-transcription/
├── client/                      # Core transcription modules
│   ├── audio_capture.py         # Audio input handling
│   ├── transcription_engine_realtime.py  # RealtimeSTT integration
│   ├── noise_suppression.py     # VAD and noise reduction
│   ├── device_utils.py          # CPU/GPU detection
│   ├── config.py                # Configuration management
│   ├── server_sync.py           # Multi-user server client
│   └── update_checker.py        # Auto-update functionality
├── gui/                         # Desktop application UI
│   ├── main_window_qt.py        # Main application window
│   ├── settings_dialog_qt.py    # Settings dialog
│   └── transcription_display_qt.py  # Display widget
├── server/                      # Web servers
│   ├── web_display.py           # Local FastAPI server for OBS
│   └── nodejs/                  # Multi-user sync server
│       ├── server.js            # Express + WebSocket server
│       └── README.md            # Deployment instructions
├── config/
│   └── default_config.yaml      # Default settings template
├── main.py                      # GUI entry point
├── main_cli.py                  # CLI version (for testing)
├── build.sh                     # Linux build script
├── build.bat                    # Windows build script
└── local-transcription.spec     # PyInstaller configuration

Technology Stack

Desktop Application

  • Python 3.9+
  • PySide6 - Qt6 GUI framework
  • RealtimeSTT - Real-time speech-to-text with advanced VAD
  • faster-whisper - Optimized Whisper model inference
  • PyTorch - ML framework (CUDA-enabled)
  • sounddevice - Cross-platform audio capture
  • webrtcvad + silero_vad - Voice activity detection
  • noisereduce - Noise suppression

Web Servers

  • FastAPI + Uvicorn - Local web display server
  • Node.js + Express + WebSocket - Multi-user sync server

Build Tools

  • PyInstaller - Executable packaging
  • uv - Fast Python package manager

System Requirements

Minimum

  • Python 3.9+
  • 4GB RAM
  • Any modern CPU
  • FFmpeg (installed automatically with dependencies)

Recommended (GPU Acceleration)

  • 8GB+ RAM
  • NVIDIA GPU with CUDA support

For Building

  • Linux: gcc, Python dev headers
  • Windows: Visual Studio Build Tools, Python dev headers

Troubleshooting

Model Loading Issues

  • Models download automatically on first use to ~/.cache/huggingface/
  • First run requires internet connection
  • Check disk space (models range from 75MB to 3GB)

Audio Device Issues

# List available audio devices
uv run python main_cli.py --list-devices

  • Ensure microphone permissions are granted
  • Try different device indices in settings

GPU Not Detected

# Check CUDA availability
uv run python -c "import torch; print(torch.cuda.is_available())"

  • Install NVIDIA drivers (CUDA toolkit is bundled)
  • The app automatically falls back to CPU if no GPU is available
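The automatic fallback can be pictured as: prefer CUDA, then MPS on Apple Silicon, then CPU. A stdlib-only sketch of that logic (the real client/device_utils.py may differ; torch is probed only if it is installed):

```python
import importlib.util

def select_device(preference: str = "auto") -> str:
    """Pick a processing device, falling back to CPU when needed."""
    if preference in ("cuda", "cpu"):
        # Explicit user choice; simplification for illustration.
        return preference
    # "auto": probe torch only if the package is present at all.
    if importlib.util.find_spec("torch") is not None:
        import torch
        if torch.cuda.is_available():
            return "cuda"
        mps = getattr(torch.backends, "mps", None)
        if mps is not None and mps.is_available():
            return "mps"
    return "cpu"
```

On a machine without torch or a GPU, select_device("auto") simply returns "cpu".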

Web Server Port Conflicts

  • Default port is 8080
  • Change in settings or edit config file
  • Check for conflicts: lsof -i :8080 (Linux) or netstat -ano | findstr :8080 (Windows)
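A portable way to test whether the configured port is free, from Python (stdlib only; equivalent to the lsof/netstat checks above):

```python
import socket

def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            sock.bind((host, port))
            return True
        except OSError:
            return False
```

If port_is_free(8080) returns False while the app is not running, some other process holds the port and web_server.port should be changed.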

Use Cases

  • Live Streaming Captions: Add real-time captions to your Twitch/YouTube streams
  • Multi-Language Translation: Multiple translators transcribing in different languages
  • Accessibility: Provide captions for hearing-impaired viewers
  • Podcast Recording: Real-time transcription for multi-host shows
  • Gaming Commentary: Track who said what in multiplayer sessions

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests at the repository.

License

MIT License

Acknowledgments