# Local Transcription
A real-time speech-to-text desktop application for streamers. Runs locally on your machine with GPU or CPU, displays transcriptions via OBS browser source, and optionally syncs with other users through a multi-user server.
**Version 1.4.0**
## Features
- Real-Time Transcription: Live speech-to-text using Whisper models with minimal latency
- Cross-Platform: Native desktop app for Windows, macOS, and Linux via Tauri
- Dual Transcription Modes: Local (Whisper) or cloud (Deepgram) with managed billing or BYOK
- CPU & GPU Support: Automatic detection of CUDA (NVIDIA), MPS (Apple Silicon), or CPU fallback
- Advanced Voice Detection: Dual-layer VAD (WebRTC + Silero) for accurate speech detection
- OBS Integration: Built-in web server for browser source capture at `http://localhost:8080`
- Multi-User Sync: Optional Node.js server to sync transcriptions across multiple users
- Custom Fonts: Support for system fonts, web-safe fonts, Google Fonts, and custom font files
- Customizable Colors: User-configurable colors for name, text, and background
- Noise Suppression: Built-in audio preprocessing to reduce background noise
- Auto-Updates: Automatic update checking with release notes display
## Architecture
The application uses a two-process architecture:
- Tauri Shell (Svelte 5 frontend) — lightweight native window (~50MB) rendering the UI
- Python Backend (sidecar) — headless process running transcription, audio capture, and the OBS web server
The Tauri frontend communicates with the Python backend via REST API and WebSocket, following the same pattern as voice-to-notes.
```
Tauri App (user launches this)
└─ Spawns Python backend as sidecar
   ├─ FastAPI REST API (control endpoints)
   ├─ WebSocket /ws/control (real-time state + transcriptions)
   ├─ OBS web display at http://localhost:8080
   └─ Transcription engine (Whisper or Deepgram)
```
Legacy GUI: The original PySide6/Qt desktop GUI (`main.py`) still works alongside the new Tauri frontend during the transition period.
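Control traffic between the shell and the sidecar travels as JSON over the REST API and the `/ws/control` WebSocket. As a rough sketch of what that looks like on the wire (the field names below are illustrative assumptions, not the actual schema, which lives in `backend/api_server.py`):

```python
import json

# Hypothetical command the Tauri frontend might send over /ws/control.
# Field names are assumptions for illustration only.
start_command = {"type": "command", "action": "start_transcription"}

# Hypothetical event the backend might push back on the same socket.
transcription_event = {
    "type": "transcription",
    "text": "hello world",
    "final": True,
}

# Messages travel as JSON text frames, so they must round-trip cleanly.
wire = json.dumps(start_command)
assert json.loads(wire) == start_command
print(wire)
```

The same message shapes flow over both channels; the WebSocket simply lets the backend push state and transcriptions without polling.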
## Quick Start
### Running from Source
```bash
# Install Python dependencies
uv sync

# Run the Tauri app (frontend + backend)
npm install
npm run tauri dev

# Or run just the headless backend (for development)
uv run python -m backend.main_headless

# Or run the legacy PySide6 GUI
uv run python main.py
```
### Using Pre-Built Executables
Download the latest release from the releases page:

- App installer (Tauri shell): `.msi` (Windows), `.dmg` (macOS), `.deb`/`.rpm`/`.AppImage` (Linux)
- Sidecar (Python backend): download the matching `sidecar-*.zip` for your platform (CUDA or CPU)
### Building from Source
```bash
# Build the Tauri app
npm install
npm run tauri build
# Output: src-tauri/target/release/bundle/

# Build the Python sidecar (headless, no Qt)
uv sync
uv run pyinstaller local-transcription-headless.spec
# Output: dist/local-transcription-backend/

# Build the legacy PySide6 app (Linux)
./build.sh

# Build the legacy PySide6 app (Windows)
build.bat
```
For detailed build instructions, see BUILD.md.
## Usage
### Standalone Mode
1. Launch the application
2. Select your microphone from the audio device dropdown
3. Choose a Whisper model (smaller = faster, larger = more accurate):
   - `tiny.en`/`tiny` — Fastest, good for quick captions
   - `base.en`/`base` — Balanced speed and accuracy
   - `small.en`/`small` — Better accuracy
   - `medium.en`/`medium` — High accuracy
   - `large-v3` — Best accuracy (requires more resources)
4. Click Start to begin transcription
5. Transcriptions appear in the main window and at `http://localhost:8080`
### Remote Transcription (Deepgram)
Instead of local Whisper models, you can use cloud-based transcription:
- Managed mode: Sign up via the transcription proxy for metered billing
- BYOK mode: Bring your own Deepgram API key for direct access
Configure in Settings > Remote Transcription.
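The three modes map onto distinct transcription paths. As a minimal sketch of that dispatch (the function and backend names here are illustrative; the app's real routing lives in the Python backend):

```python
def resolve_transcription_backend(config: dict) -> str:
    """Pick a transcription path from the remote.mode setting.

    Sketch only: mirrors the documented local/managed/byok modes;
    the backend's actual dispatch logic may differ.
    """
    mode = config.get("remote", {}).get("mode", "local")
    if mode == "local":
        return "whisper"          # local faster-whisper inference
    if mode == "managed":
        return "deepgram-proxy"   # metered billing via the transcription proxy
    if mode == "byok":
        return "deepgram-direct"  # user-supplied Deepgram API key
    raise ValueError(f"unknown remote.mode: {mode!r}")

print(resolve_transcription_backend({}))                            # → whisper
print(resolve_transcription_backend({"remote": {"mode": "byok"}}))  # → deepgram-direct
```

Unrecognized modes fail loudly rather than silently falling back, which keeps configuration typos visible.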
### OBS Browser Source Setup
1. Start the Local Transcription app
2. In OBS, add a Browser source
3. Set URL to `http://localhost:8080`
4. Set dimensions (e.g., 1920x300)
5. Check "Shutdown source when not visible" for performance
### Multi-User Mode (Optional)
For syncing transcriptions across multiple users (e.g., multi-host streams or translation teams):
1. Deploy the Node.js server (see server/nodejs/README.md)
2. In the app settings, enable Server Sync
3. Enter the server URL (e.g., `http://your-server:3000/api/send`)
4. Set a room name and passphrase (shared with other users)
5. In OBS, use the server's display URL with your room name:
   `http://your-server:3000/display?room=YOURROOM&timestamps=true&maxlines=50`
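The display URL is an ordinary query string, so it can be assembled programmatically. A small sketch using the parameter names from the example above (`room`, `timestamps`, `maxlines`; the server may accept others):

```python
from urllib.parse import urlencode

def build_display_url(base: str, room: str,
                      timestamps: bool = True, maxlines: int = 50) -> str:
    """Build the multi-user OBS display URL from its parts.

    Parameter names are taken from the documented example URL;
    this helper itself is illustrative, not part of the app.
    """
    query = urlencode({
        "room": room,
        "timestamps": str(timestamps).lower(),  # server expects true/false
        "maxlines": maxlines,
    })
    return f"{base}/display?{query}"

print(build_display_url("http://your-server:3000", "YOURROOM"))
# → http://your-server:3000/display?room=YOURROOM&timestamps=true&maxlines=50
```

Using `urlencode` also takes care of escaping room names that contain spaces or special characters.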
## Configuration
Settings are stored at `~/.local-transcription/config.yaml` and can be modified through the GUI settings panel or the REST API.
### Key Settings
| Setting | Description | Default |
|---|---|---|
| `transcription.model` | Whisper model to use | `base.en` |
| `transcription.device` | Processing device (auto/cuda/cpu) | `auto` |
| `transcription.enable_realtime_transcription` | Show preview while speaking | `false` |
| `transcription.silero_sensitivity` | VAD sensitivity (0-1, lower = more sensitive) | `0.4` |
| `transcription.post_speech_silence_duration` | Silence before finalizing (seconds) | `0.3` |
| `transcription.continuous_mode` | Fast speaker mode for quick talkers | `false` |
| `remote.mode` | Transcription mode (local/managed/byok) | `local` |
| `display.show_timestamps` | Show timestamps with transcriptions | `true` |
| `display.fade_after_seconds` | Fade out time (0 = never) | `10` |
| `display.font_source` | Font type (System Font/Web-Safe/Google Font/Custom File) | System Font |
| `web_server.port` | Local web server port | `8080` |
See `config/default_config.yaml` for all available options.
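Since the user file only needs to contain the settings that differ from the defaults, the two configs are effectively overlaid. A stdlib-only sketch of that overlay (after both YAML files are parsed into dicts; the app's actual merge logic lives in `client/config.py` and may differ):

```python
def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Overlay user settings on the defaults, recursing into nested sections."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)  # merge subsection
        else:
            merged[key] = value  # scalar or new key: user value wins
    return merged

defaults = {"transcription": {"model": "base.en", "device": "auto"},
            "web_server": {"port": 8080}}
user = {"transcription": {"model": "small.en"}}  # partial override only

print(deep_merge(defaults, user))
# → {'transcription': {'model': 'small.en', 'device': 'auto'}, 'web_server': {'port': 8080}}
```

The recursion matters: a user who overrides `transcription.model` keeps the default `transcription.device` instead of losing the whole section.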
## Project Structure
```
local-transcription/
├── src/                     # Svelte 5 frontend (Tauri UI)
│   ├── App.svelte           # Main app shell
│   ├── lib/components/      # UI components
│   │   ├── Header.svelte
│   │   ├── StatusBar.svelte
│   │   ├── Controls.svelte
│   │   ├── TranscriptionDisplay.svelte
│   │   └── Settings.svelte
│   └── lib/stores/          # Reactive state management
│       ├── backend.ts       # WebSocket + REST API client
│       ├── config.ts        # App configuration
│       └── transcriptions.ts # Transcription data
├── src-tauri/               # Tauri v2 Rust shell
│   ├── src/main.rs
│   └── tauri.conf.json
├── backend/                 # Headless Python backend (sidecar)
│   ├── app_controller.py    # Orchestration logic (engine, sync, config)
│   ├── api_server.py        # FastAPI REST + WebSocket control API
│   └── main_headless.py     # Headless entry point
├── client/                  # Core transcription modules
│   ├── audio_capture.py     # Audio input handling
│   ├── transcription_engine_realtime.py # RealtimeSTT / Whisper
│   ├── deepgram_transcription.py # Deepgram cloud transcription
│   ├── noise_suppression.py # VAD and noise reduction
│   ├── device_utils.py      # CPU/GPU/MPS detection
│   ├── config.py            # Configuration management
│   ├── server_sync.py       # Multi-user server client
│   └── update_checker.py    # Auto-update functionality
├── gui/                     # Legacy PySide6/Qt GUI
│   ├── main_window_qt.py
│   ├── settings_dialog_qt.py
│   └── transcription_display_qt.py
├── server/                  # Web servers
│   ├── web_display.py       # Local FastAPI server for OBS
│   └── nodejs/              # Multi-user sync server
├── .gitea/workflows/        # CI/CD
│   ├── release.yml          # Tauri app builds (all platforms)
│   └── build-sidecar.yml    # Python sidecar builds (CUDA + CPU)
├── config/
│   └── default_config.yaml  # Default settings template
├── main.py                  # Legacy GUI entry point
├── main_cli.py              # CLI version (for testing)
├── local-transcription.spec # PyInstaller config (legacy, with PySide6)
├── local-transcription-headless.spec # PyInstaller config (headless sidecar)
├── pyproject.toml           # Python dependencies
└── package.json             # Node.js / Tauri dependencies
```
## Technology Stack
### Frontend (Tauri)
- Tauri v2 — Native cross-platform shell (Rust)
- Svelte 5 — Reactive UI framework (TypeScript)
- Vite — Frontend build tool
### Backend (Python Sidecar)
- Python 3.9+
- FastAPI + Uvicorn — REST API and WebSocket server
- RealtimeSTT — Real-time speech-to-text with advanced VAD
- faster-whisper — Optimized Whisper model inference (CTranslate2)
- PyTorch — ML framework (CUDA-enabled builds available)
- sounddevice — Cross-platform audio capture
- webrtcvad + silero_vad — Voice activity detection
### Multi-User Server (Optional)
- Node.js + Express + WebSocket — Real-time sync server
### Build & CI/CD
- PyInstaller — Python sidecar packaging
- Tauri CLI — App bundling (.msi, .dmg, .deb, .rpm, .AppImage)
- Gitea Actions — Automated cross-platform builds
- uv — Fast Python package manager
## CI/CD
Two Gitea Actions workflows in `.gitea/workflows/`:
| Workflow | Trigger | Produces |
|---|---|---|
| `release.yml` | Push to `main` | Tauri app installers for all platforms |
| `build-sidecar.yml` | Changes to `client/`, `server/`, `backend/`, or `pyproject.toml` | Python sidecar zips (CUDA + CPU) |
Both workflows require a `BUILD_TOKEN` secret in the repo settings (a Gitea API token with release write access).
### Release Artifacts
| Platform | App Installer | Sidecar (CUDA) | Sidecar (CPU) |
|---|---|---|---|
| Linux x86_64 | `.deb`, `.rpm`, `.AppImage` | `sidecar-linux-x86_64-cuda.zip` | `sidecar-linux-x86_64-cpu.zip` |
| Windows x86_64 | `.msi`, `-setup.exe` | `sidecar-windows-x86_64-cuda.zip` | `sidecar-windows-x86_64-cpu.zip` |
| macOS ARM64 | `.dmg` | — | `sidecar-macos-aarch64-cpu.zip` |
## System Requirements
### Minimum
- 4GB RAM
- Any modern CPU
### Recommended (for local real-time transcription)
- 8GB+ RAM
- NVIDIA GPU with CUDA support (for GPU acceleration)
### For Building
- Tauri app: Node.js 20+, Rust stable, platform SDK (see Tauri prerequisites)
- Python sidecar: Python 3.9+, uv, PyInstaller
- Linux: `libgtk-3-dev`, `libwebkit2gtk-4.1-dev`, `libappindicator3-dev`, `librsvg2-dev`, `patchelf`
- Windows: Visual Studio Build Tools, WebView2
- macOS: Xcode Command Line Tools
## Troubleshooting
### Model Loading Issues
- Models download automatically on first use to `~/.cache/huggingface/`
- First run requires an internet connection
- Check disk space (models range from 75MB to 3GB)
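If downloads fail partway through, low disk space is a common culprit. A quick stdlib check you could run before selecting a large model (the per-model sizes below are rough figures for illustration; actual sizes vary with model revision):

```python
import shutil
from pathlib import Path

# Approximate download sizes in MB; illustrative figures only.
MODEL_SIZES_MB = {"tiny.en": 75, "base.en": 145, "small.en": 500,
                  "medium.en": 1500, "large-v3": 3000}

def enough_disk_for(model: str) -> bool:
    """True if the home volume (where ~/.cache/huggingface lives)
    has room for the model plus 2x headroom for temp files."""
    free_mb = shutil.disk_usage(Path.home()).free // 2**20
    return free_mb >= 2 * MODEL_SIZES_MB.get(model, 3000)

print(enough_disk_for("base.en"))
```

Unknown model names are treated as worst case (3GB) so the check errs on the safe side.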
### Audio Device Issues
```bash
# List available audio devices
uv run python main_cli.py --list-devices
```
- Ensure microphone permissions are granted (especially on macOS)
- Try different device indices in settings
### GPU Not Detected
```bash
# Check CUDA availability
uv run python -c "import torch; print(torch.cuda.is_available())"
```
- Install NVIDIA drivers (CUDA toolkit is bundled in CUDA sidecar builds)
- The app automatically falls back to CPU if no GPU is available
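The documented auto-detection order (CUDA, then Apple MPS, then CPU) can be sketched like this; the app's authoritative logic lives in `client/device_utils.py`, so treat this as an illustration of the fallback, not its implementation:

```python
def pick_device() -> str:
    """Return the best available processing device: cuda, mps, or cpu."""
    try:
        import torch  # optional dependency; absence means CPU-only
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"   # NVIDIA GPU
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"    # Apple Silicon
    return "cpu"        # safe fallback

print(pick_device())
```

Wrapping the `torch` import in `try/except` means the check degrades gracefully on machines without PyTorch installed rather than crashing.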
### Web Server Port Conflicts
- Default port is 8080; the app tries ports 8080-8084 automatically
- Change in settings or edit config file
- Check for conflicts: `lsof -i :8080` (Linux/macOS) or `netstat -ano | findstr :8080` (Windows)
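The 8080-8084 fallback behaviour amounts to probing each port until a bind succeeds. A stdlib sketch of that idea (illustrative only, not the app's actual code):

```python
import socket

def first_free_port(start: int = 8080, end: int = 8084) -> int:
    """Return the first port in [start, end] we can bind on localhost."""
    for port in range(start, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            try:
                sock.bind(("127.0.0.1", port))  # succeeds only if free
            except OSError:
                continue  # port in use; try the next one
            return port
    raise RuntimeError(f"ports {start}-{end} are all in use")

print(first_free_port())
```

Note the probe-then-release pattern has an inherent race (another process can grab the port between the check and the real bind), which is why the app still needs to handle bind failures at startup.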
## Use Cases
- Live Streaming Captions: Add real-time captions to your Twitch/YouTube streams
- Multi-Language Translation: Multiple translators transcribing in different languages
- Accessibility: Provide captions for hearing-impaired viewers
- Podcast Recording: Real-time transcription for multi-host shows
- Gaming Commentary: Track who said what in multiplayer sessions
## Contributing
Contributions are welcome! Please feel free to submit issues or pull requests at the repository.
## License
MIT License
## Acknowledgments
- OpenAI Whisper for the speech recognition model
- RealtimeSTT for real-time transcription capabilities
- faster-whisper for optimized inference
- Tauri for the cross-platform desktop framework
- Deepgram for cloud transcription API