# Local Transcription
A real-time speech-to-text desktop application for streamers. Run locally on your machine with GPU or CPU, display transcriptions via OBS browser source, and optionally sync with other users through a multi-user server.
Version 1.4.0
## Features
- Real-Time Transcription: Live speech-to-text using Whisper models with minimal latency
- Standalone Desktop App: PySide6/Qt GUI that works without any server
- CPU & GPU Support: Automatic detection of CUDA (NVIDIA), MPS (Apple Silicon), or CPU fallback
- Advanced Voice Detection: Dual-layer VAD (WebRTC + Silero) for accurate speech detection
- OBS Integration: Built-in web server for browser source capture at http://localhost:8080
- Multi-User Sync: Optional Node.js server to sync transcriptions across multiple users
- Custom Fonts: Support for system fonts, web-safe fonts, Google Fonts, and custom font files
- Customizable Colors: User-configurable colors for name, text, and background
- Noise Suppression: Built-in audio preprocessing to reduce background noise
- Auto-Updates: Automatic update checking with release notes display
- Cross-Platform: Builds available for Windows and Linux
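The dual-layer VAD listed above can be pictured as two gates in series: a cheap WebRTC-style check rejects silent frames quickly, and a Silero-style probability score confirms real speech. The sketch below is purely illustrative — the class and method names are hypothetical, not the app's actual API — but the `silero_sensitivity` threshold mirrors the setting documented in the Configuration section.

```python
from dataclasses import dataclass

# Illustrative sketch of a dual-layer VAD gate (names are hypothetical,
# not the app's actual API). Layer 1 models a cheap WebRTC-style energy
# check; layer 2 models a Silero-style speech-probability score with a
# configurable sensitivity threshold.

@dataclass
class DualVad:
    silero_sensitivity: float = 0.4  # mirrors transcription.silero_sensitivity

    def webrtc_pass(self, frame_energy: float) -> bool:
        # Cheap first stage: reject obviously silent frames.
        return frame_energy > 0.01

    def silero_pass(self, speech_prob: float) -> bool:
        # Second stage: a lower sensitivity value means a lower
        # probability is enough to count as speech (more sensitive).
        return speech_prob >= self.silero_sensitivity

    def is_speech(self, frame_energy: float, speech_prob: float) -> bool:
        # A frame counts as speech only if BOTH layers agree, which
        # suppresses false triggers from background noise.
        return self.webrtc_pass(frame_energy) and self.silero_pass(speech_prob)

vad = DualVad(silero_sensitivity=0.4)
print(vad.is_speech(frame_energy=0.2, speech_prob=0.9))    # loud, confident speech
print(vad.is_speech(frame_energy=0.2, speech_prob=0.1))    # noise: layer 2 rejects
print(vad.is_speech(frame_energy=0.001, speech_prob=0.9))  # too quiet: layer 1 rejects
```

Requiring both detectors to agree is what lets the app keep sensitivity high without captioning keyboard clatter or fan noise.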
## Quick Start

### Running from Source

```bash
# Install dependencies
uv sync

# Run the application
uv run python main.py
```
### Using Pre-Built Executables
Download the latest release from the releases page and run the executable for your platform.
### Building from Source

**Linux:**

```bash
./build.sh
# Output: dist/LocalTranscription/LocalTranscription
```

**Windows:**

```bat
build.bat
REM Output: dist\LocalTranscription\LocalTranscription.exe
```
For detailed build instructions, see BUILD.md.
## Usage

### Standalone Mode
1. Launch the application
2. Select your microphone from the audio device dropdown
3. Choose a Whisper model (smaller = faster, larger = more accurate):
   - `tiny.en` / `tiny` - Fastest, good for quick captions
   - `base.en` / `base` - Balanced speed and accuracy
   - `small.en` / `small` - Better accuracy
   - `medium.en` / `medium` - High accuracy
   - `large-v3` - Best accuracy (requires more resources)
4. Click Start to begin transcription
5. Transcriptions appear in the main window and at http://localhost:8080
### OBS Browser Source Setup

1. Start the Local Transcription app
2. In OBS, add a Browser source
3. Set URL to http://localhost:8080
4. Set dimensions (e.g., 1920x300)
5. Check "Shutdown source when not visible" for performance
### Multi-User Mode (Optional)

For syncing transcriptions across multiple users (e.g., multi-host streams or translation teams):

1. Deploy the Node.js server (see server/nodejs/README.md)
2. In the app settings, enable Server Sync
3. Enter the server URL (e.g., http://your-server:3000/api/send)
4. Set a room name and passphrase (shared with other users)
5. In OBS, use the server's display URL with your room name:
   http://your-server:3000/display?room=YOURROOM&timestamps=true&maxlines=50
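To make the sync flow concrete, here is a rough sketch of what a minimal client posting to the server's send endpoint could look like. The wire format is defined by the Node.js server (see server/nodejs/README.md); every field name below (`room`, `passphrase`, `name`, `text`) is an assumption for illustration, not the server's documented API.

```python
import json
import urllib.request

# Illustrative sketch only: the real wire format is defined by the
# Node.js server (see server/nodejs/README.md). All field names here
# (room, passphrase, name, text) are assumptions, not a documented API.

def build_payload(room: str, passphrase: str, name: str, text: str) -> dict:
    """Assemble one transcription event for the sync server."""
    return {"room": room, "passphrase": passphrase, "name": name, "text": text}

def send(url: str, payload: dict) -> None:
    # POST the event as JSON; the /api/send path comes from the
    # "server URL" setting shown in step 3 above.
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        resp.read()

if __name__ == "__main__":
    payload = build_payload("YOURROOM", "secret", "Streamer", "Hello chat!")
    # send("http://your-server:3000/api/send", payload)  # network call, disabled here
    print(json.dumps(payload))
```

The desktop app handles all of this internally via `client/server_sync.py`; the sketch is only meant to show the shape of the interaction.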
## Configuration

Settings are stored at `~/.local-transcription/config.yaml` and can be modified through the GUI settings panel.

### Key Settings
| Setting | Description | Default |
|---|---|---|
| `transcription.model` | Whisper model to use | `base.en` |
| `transcription.device` | Processing device (auto/cuda/cpu) | `auto` |
| `transcription.enable_realtime_transcription` | Show preview while speaking | `false` |
| `transcription.silero_sensitivity` | VAD sensitivity (0-1, lower = more sensitive) | `0.4` |
| `transcription.post_speech_silence_duration` | Silence before finalizing (seconds) | `0.3` |
| `transcription.continuous_mode` | Fast speaker mode for quick talkers | `false` |
| `display.show_timestamps` | Show timestamps with transcriptions | `true` |
| `display.fade_after_seconds` | Fade-out time in seconds (0 = never) | `10` |
| `display.font_source` | Font type (System Font / Web-Safe / Google Font / Custom File) | `System Font` |
| `web_server.port` | Local web server port | `8080` |
See `config/default_config.yaml` for all available options.
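Putting the key settings together, a config tuned for a fast talker on an NVIDIA GPU might look like the fragment below. The key names come from the table above, but the nesting and chosen values are illustrative; `config/default_config.yaml` is the authoritative reference.

```yaml
# ~/.local-transcription/config.yaml (illustrative values)
transcription:
  model: small.en                       # better accuracy than the base.en default
  device: cuda                          # force GPU instead of auto-detection
  enable_realtime_transcription: true   # show preview while speaking
  silero_sensitivity: 0.3               # lower = more sensitive VAD
  post_speech_silence_duration: 0.3
  continuous_mode: true                 # fast speaker mode

display:
  show_timestamps: true
  fade_after_seconds: 10

web_server:
  port: 8080
```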
## Project Structure

```
local-transcription/
├── client/                               # Core transcription modules
│   ├── audio_capture.py                  # Audio input handling
│   ├── transcription_engine_realtime.py  # RealtimeSTT integration
│   ├── noise_suppression.py              # VAD and noise reduction
│   ├── device_utils.py                   # CPU/GPU detection
│   ├── config.py                         # Configuration management
│   ├── server_sync.py                    # Multi-user server client
│   └── update_checker.py                 # Auto-update functionality
├── gui/                                  # Desktop application UI
│   ├── main_window_qt.py                 # Main application window
│   ├── settings_dialog_qt.py             # Settings dialog
│   └── transcription_display_qt.py       # Display widget
├── server/                               # Web servers
│   ├── web_display.py                    # Local FastAPI server for OBS
│   └── nodejs/                           # Multi-user sync server
│       ├── server.js                     # Express + WebSocket server
│       └── README.md                     # Deployment instructions
├── config/
│   └── default_config.yaml               # Default settings template
├── main.py                               # GUI entry point
├── main_cli.py                           # CLI version (for testing)
├── build.sh                              # Linux build script
├── build.bat                             # Windows build script
└── local-transcription.spec              # PyInstaller configuration
```
## Technology Stack

### Desktop Application
- Python 3.9+
- PySide6 - Qt6 GUI framework
- RealtimeSTT - Real-time speech-to-text with advanced VAD
- faster-whisper - Optimized Whisper model inference
- PyTorch - ML framework (CUDA-enabled)
- sounddevice - Cross-platform audio capture
- webrtcvad + silero_vad - Voice activity detection
- noisereduce - Noise suppression
### Web Servers
- FastAPI + Uvicorn - Local web display server
- Node.js + Express + WebSocket - Multi-user sync server
### Build Tools
- PyInstaller - Executable packaging
- uv - Fast Python package manager
## System Requirements

### Minimum
- Python 3.9+
- 4GB RAM
- Any modern CPU
### Recommended (for real-time performance)
- 8GB+ RAM
- NVIDIA GPU with CUDA support (for GPU acceleration)
- FFmpeg (installed automatically with dependencies)
### For Building
- Linux: gcc, Python dev headers
- Windows: Visual Studio Build Tools, Python dev headers
## Troubleshooting

### Model Loading Issues

- Models download automatically on first use to `~/.cache/huggingface/`
- First run requires an internet connection
- Check disk space (models range from 75MB to 3GB)
### Audio Device Issues

```bash
# List available audio devices
uv run python main_cli.py --list-devices
```

- Ensure microphone permissions are granted
- Try different device indices in settings
### GPU Not Detected

```bash
# Check CUDA availability
uv run python -c "import torch; print(torch.cuda.is_available())"
```

- Install NVIDIA drivers (the CUDA toolkit is bundled with the app)
- The app automatically falls back to CPU if no GPU is available
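The automatic fallback follows a simple preference order. The sketch below illustrates that selection logic only — availability flags are passed in explicitly so the example runs anywhere, whereas the app itself probes PyTorch (e.g. `torch.cuda.is_available()`) in `client/device_utils.py`; the function name is hypothetical.

```python
# Illustrative sketch of the auto/cuda/cpu selection described above.
# The function name is hypothetical; the app probes PyTorch directly.

def pick_device(preference: str, cuda_ok: bool, mps_ok: bool) -> str:
    """Resolve transcription.device ('auto', 'cuda', or 'cpu') to a backend."""
    if preference == "cuda":
        # An explicit request still degrades gracefully if CUDA is missing.
        return "cuda" if cuda_ok else "cpu"
    if preference == "cpu":
        return "cpu"
    # 'auto': prefer CUDA, then Apple-Silicon MPS, then CPU.
    if cuda_ok:
        return "cuda"
    if mps_ok:
        return "mps"
    return "cpu"

print(pick_device("auto", cuda_ok=False, mps_ok=True))   # Apple Silicon machine
print(pick_device("cuda", cuda_ok=False, mps_ok=False))  # graceful CPU fallback
```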
### Web Server Port Conflicts

- Default port is 8080
- Change it in settings or edit the config file
- Check for conflicts: `lsof -i :8080` (Linux) or `netstat -ano | findstr :8080` (Windows)
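You can also check the port from Python with nothing but the standard library. This is a hypothetical helper, not part of the app:

```python
import socket

# Hypothetical helper (not part of the app): returns True if something
# is already listening on the given localhost port.

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        # connect_ex returns 0 on success, i.e. someone is listening.
        return s.connect_ex((host, port)) == 0

if port_in_use(8080):
    print("Port 8080 is taken - pick another port in settings")
```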
## Use Cases
- Live Streaming Captions: Add real-time captions to your Twitch/YouTube streams
- Multi-Language Translation: Multiple translators transcribing in different languages
- Accessibility: Provide captions for hearing-impaired viewers
- Podcast Recording: Real-time transcription for multi-host shows
- Gaming Commentary: Track who said what in multiplayer sessions
## Contributing
Contributions are welcome! Please feel free to submit issues or pull requests at the repository.
## License
MIT License
## Acknowledgments
- OpenAI Whisper for the speech recognition model
- RealtimeSTT for real-time transcription capabilities
- faster-whisper for optimized inference