Local Transcription

A real-time speech-to-text desktop application for streamers. Run locally on your machine with GPU or CPU, display transcriptions via OBS browser source, and optionally sync with other users through a multi-user server.

Version 1.4.0

Features

  • Real-Time Transcription: Live speech-to-text using Whisper models with minimal latency
  • Standalone Desktop App: PySide6/Qt GUI that works without any server
  • CPU & GPU Support: Automatic detection of CUDA (NVIDIA), MPS (Apple Silicon), or CPU fallback
  • Advanced Voice Detection: Dual-layer VAD (WebRTC + Silero) for accurate speech detection
  • OBS Integration: Built-in web server for browser source capture at http://localhost:8080
  • Multi-User Sync: Optional Node.js server to sync transcriptions across multiple users
  • Custom Fonts: Support for system fonts, web-safe fonts, Google Fonts, and custom font files
  • Customizable Colors: User-configurable colors for name, text, and background
  • Noise Suppression: Built-in audio preprocessing to reduce background noise
  • Auto-Updates: Automatic update checking with release notes display
  • Cross-Platform: Builds available for Windows and Linux

Quick Start

Running from Source

# Install dependencies
uv sync

# Run the application
uv run python main.py

Using Pre-Built Executables

Download the latest release from the releases page and run the executable for your platform.

Building from Source

Linux:

./build.sh
# Output: dist/LocalTranscription/LocalTranscription

Windows:

build.bat
# Output: dist\LocalTranscription\LocalTranscription.exe

For detailed build instructions, see BUILD.md.

Usage

Standalone Mode

  1. Launch the application
  2. Select your microphone from the audio device dropdown
  3. Choose a Whisper model (smaller = faster, larger = more accurate):
    • tiny.en / tiny - Fastest, good for quick captions
    • base.en / base - Balanced speed and accuracy
    • small.en / small - Better accuracy
    • medium.en / medium - High accuracy
    • large-v3 - Best accuracy (requires more resources)
  4. Click Start to begin transcription
  5. Transcriptions appear in the main window and at http://localhost:8080

OBS Browser Source Setup

  1. Start the Local Transcription app
  2. In OBS, add a Browser source
  3. Set URL to http://localhost:8080
  4. Set dimensions (e.g., 1920x300)
  5. Check "Shutdown source when not visible" for performance
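The app's actual display server is a FastAPI app (server/web_display.py). As a rough stdlib-only sketch of the idea behind a browser-source endpoint — serving an HTML page OBS can capture — something like this works; the page markup here is illustrative, not the app's real output:

```python
import http.server
import threading

# Illustrative placeholder page; the real server renders live transcriptions.
PAGE = b"<html><body><div id='captions'>Captions appear here</div></body></html>"

class DisplayHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve the caption page for any path, as a browser source requests it.
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, fmt, *args):
        pass  # keep the console quiet

def serve(port=8080):
    # ThreadingHTTPServer handles each browser request on its own thread.
    srv = http.server.ThreadingHTTPServer(("127.0.0.1", port), DisplayHandler)
    threading.Thread(target=srv.serve_forever, daemon=True).start()
    return srv
```

While serve() is running, pointing an OBS Browser source at http://localhost:8080 captures the page.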

Multi-User Mode (Optional)

For syncing transcriptions across multiple users (e.g., multi-host streams or translation teams):

  1. Deploy the Node.js server (see server/nodejs/README.md)
  2. In the app settings, enable Server Sync
  3. Enter the server URL (e.g., http://your-server:3000/api/send)
  4. Set a room name and passphrase (shared with other users)
  5. In OBS, use the server's display URL with your room name:
    http://your-server:3000/display?room=YOURROOM&timestamps=true&maxlines=50
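The request schema for /api/send is defined by the Node.js server (see server/nodejs/README.md). As a rough sketch, a client might assemble a JSON body like the following — all field names here (room, passphrase, name, text) are assumptions for illustration, not the confirmed API contract:

```python
import json

def build_payload(room, passphrase, name, text):
    # Hypothetical body for POST /api/send; field names are assumptions.
    return {"room": room, "passphrase": passphrase, "name": name, "text": text}

body = json.dumps(build_payload("YOURROOM", "hunter2", "Host", "Hello, chat!"))
# An HTTP client (e.g. urllib.request) would then POST `body` to
# http://your-server:3000/api/send with Content-Type: application/json.
```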
    

Configuration

Settings are stored at ~/.local-transcription/config.yaml and can be modified through the GUI settings panel.

Key Settings

| Setting | Description | Default |
| --- | --- | --- |
| transcription.model | Whisper model to use | base.en |
| transcription.device | Processing device (auto/cuda/cpu) | auto |
| transcription.enable_realtime_transcription | Show preview while speaking | false |
| transcription.silero_sensitivity | VAD sensitivity (0-1, lower = more sensitive) | 0.4 |
| transcription.post_speech_silence_duration | Silence before finalizing (seconds) | 0.3 |
| transcription.continuous_mode | Fast speaker mode for quick talkers | false |
| display.show_timestamps | Show timestamps with transcriptions | true |
| display.fade_after_seconds | Fade-out time in seconds (0 = never) | 10 |
| display.font_source | Font type (System Font/Web-Safe/Google Font/Custom File) | System Font |
| web_server.port | Local web server port | 8080 |

See config/default_config.yaml for all available options.
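Putting a few of the settings above together, a hand-edited ~/.local-transcription/config.yaml might look like this. Values mirror the documented defaults except where noted; the exact key nesting should be checked against config/default_config.yaml:

```yaml
transcription:
  model: base.en            # any listed Whisper model, e.g. small.en or large-v3
  device: auto              # auto / cuda / cpu
  enable_realtime_transcription: true   # changed from the default (false)
  silero_sensitivity: 0.4
  post_speech_silence_duration: 0.3

display:
  show_timestamps: true
  fade_after_seconds: 10    # 0 = never fade

web_server:
  port: 8080
```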

Project Structure

local-transcription/
├── client/                      # Core transcription modules
│   ├── audio_capture.py         # Audio input handling
│   ├── transcription_engine_realtime.py  # RealtimeSTT integration
│   ├── noise_suppression.py     # VAD and noise reduction
│   ├── device_utils.py          # CPU/GPU detection
│   ├── config.py                # Configuration management
│   ├── server_sync.py           # Multi-user server client
│   └── update_checker.py        # Auto-update functionality
├── gui/                         # Desktop application UI
│   ├── main_window_qt.py        # Main application window
│   ├── settings_dialog_qt.py    # Settings dialog
│   └── transcription_display_qt.py  # Display widget
├── server/                      # Web servers
│   ├── web_display.py           # Local FastAPI server for OBS
│   └── nodejs/                  # Multi-user sync server
│       ├── server.js            # Express + WebSocket server
│       └── README.md            # Deployment instructions
├── config/
│   └── default_config.yaml      # Default settings template
├── main.py                      # GUI entry point
├── main_cli.py                  # CLI version (for testing)
├── build.sh                     # Linux build script
├── build.bat                    # Windows build script
└── local-transcription.spec     # PyInstaller configuration

Technology Stack

Desktop Application

  • Python 3.9+
  • PySide6 - Qt6 GUI framework
  • RealtimeSTT - Real-time speech-to-text with advanced VAD
  • faster-whisper - Optimized Whisper model inference
  • PyTorch - ML framework (CUDA-enabled)
  • sounddevice - Cross-platform audio capture
  • webrtcvad + silero_vad - Voice activity detection
  • noisereduce - Noise suppression

Web Servers

  • FastAPI + Uvicorn - Local web display server
  • Node.js + Express + WebSocket - Multi-user sync server

Build Tools

  • PyInstaller - Executable packaging
  • uv - Fast Python package manager

System Requirements

Minimum

  • Python 3.9+
  • 4GB RAM
  • Any modern CPU
  • FFmpeg (installed automatically with dependencies)

Recommended (for GPU acceleration)

  • 8GB+ RAM
  • NVIDIA GPU with CUDA support

For Building

  • Linux: gcc, Python dev headers
  • Windows: Visual Studio Build Tools, Python dev headers

Troubleshooting

Model Loading Issues

  • Models download automatically on first use to ~/.cache/huggingface/
  • First run requires internet connection
  • Check disk space (models range from 75MB to 3GB)

Audio Device Issues

# List available audio devices
uv run python main_cli.py --list-devices
  • Ensure microphone permissions are granted
  • Try different device indices in settings

GPU Not Detected

# Check CUDA availability
uv run python -c "import torch; print(torch.cuda.is_available())"
  • Install NVIDIA drivers (CUDA toolkit is bundled)
  • The app automatically falls back to CPU if no GPU is available
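The fallback amounts to checking torch.cuda.is_available() and dropping to CPU otherwise. A simplified sketch — the real logic lives in client/device_utils.py and also covers Apple Silicon (MPS); the availability flag is a parameter here only so the example runs without PyTorch installed:

```python
def pick_device(preference="auto", cuda_available=False):
    # cuda_available would normally come from torch.cuda.is_available();
    # it is injected here so this sketch has no PyTorch dependency.
    if preference == "cuda":
        return "cuda" if cuda_available else "cpu"  # graceful fallback
    if preference == "auto" and cuda_available:
        return "cuda"
    return "cpu"
```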

Web Server Port Conflicts

  • Default port is 8080
  • Change in settings or edit config file
  • Check for conflicts: lsof -i :8080 (Linux) or netstat -ano | findstr :8080 (Windows)
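A quick way to probe the port from Python before changing settings (a generic check, not part of the app):

```python
import socket

def port_in_use(port, host="127.0.0.1"):
    # connect_ex returns 0 when something is already listening on host:port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        return s.connect_ex((host, port)) == 0

# Example: print(port_in_use(8080))
```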

Use Cases

  • Live Streaming Captions: Add real-time captions to your Twitch/YouTube streams
  • Multi-Language Translation: Multiple translators transcribing in different languages
  • Accessibility: Provide captions for hearing-impaired viewers
  • Podcast Recording: Real-time transcription for multi-host shows
  • Gaming Commentary: Track who said what in multiplayer sessions

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests at the repository.

License

MIT License

Acknowledgments
