# Local Transcription

A real-time speech-to-text desktop application for streamers. Runs locally on your machine with GPU or CPU, displays transcriptions via OBS browser source, and optionally syncs with other users through a multi-user server.

**Version 1.4.0**

## Features

- **Real-Time Transcription**: Live speech-to-text using Whisper models with minimal latency
- **Cross-Platform**: Native desktop app for Windows, macOS, and Linux via [Tauri](https://tauri.app/)
- **Dual Transcription Modes**: Local (Whisper) or cloud (Deepgram) with managed billing or BYOK
- **CPU & GPU Support**: Automatic detection of CUDA (NVIDIA), MPS (Apple Silicon), or CPU fallback
- **Advanced Voice Detection**: Dual-layer VAD (WebRTC + Silero) for accurate speech detection
- **OBS Integration**: Built-in web server for browser source capture at `http://localhost:8080`
- **Multi-User Sync**: Optional Node.js server to sync transcriptions across multiple users
- **Custom Fonts**: Support for system fonts, web-safe fonts, Google Fonts, and custom font files
- **Customizable Colors**: User-configurable colors for name, text, and background
- **Noise Suppression**: Built-in audio preprocessing to reduce background noise
- **Auto-Updates**: Automatic update checking with release notes display

## Architecture

The application uses a two-process architecture:

1. **Tauri Shell** (Svelte 5 frontend) — lightweight native window (~50MB) rendering the UI
2. **Python Backend** (sidecar) — headless process running transcription, audio capture, and the OBS web server

The Tauri frontend communicates with the Python backend via REST API and WebSocket, following the same pattern as [voice-to-notes](https://repo.anhonesthost.net/MacroPad/voice-to-notes).
```
Tauri App (user launches this)
└─ Spawns Python backend as sidecar
   ├─ FastAPI REST API (control endpoints)
   ├─ WebSocket /ws/control (real-time state + transcriptions)
   ├─ OBS web display at http://localhost:8080
   └─ Transcription engine (Whisper or Deepgram)
```

> **Legacy GUI**: The original PySide6/Qt desktop GUI (`main.py`) still works alongside the new Tauri frontend during the transition period.

## Quick Start

### Running from Source

```bash
# Install Python dependencies
uv sync

# Run the Tauri app (frontend + backend)
npm install
npm run tauri dev

# Or run just the headless backend (for development)
uv run python -m backend.main_headless

# Or run the legacy PySide6 GUI
uv run python main.py
```

### Using Pre-Built Executables

Download the latest release from the [releases page](https://repo.anhonesthost.net/streamer-tools/local-transcription/releases):

- **App installer** (Tauri shell): `.msi` (Windows), `.dmg` (macOS), `.deb`/`.rpm`/`.AppImage` (Linux)
- **Sidecar** (Python backend): Download the matching `sidecar-*` zip for your platform (CUDA or CPU)

### Building from Source

```bash
# Build the Tauri app
npm install
npm run tauri build
# Output: src-tauri/target/release/bundle/

# Build the Python sidecar (headless, no Qt)
uv sync
uv run pyinstaller local-transcription-headless.spec
# Output: dist/local-transcription-backend/

# Build the legacy PySide6 app (Linux)
./build.sh

# Build the legacy PySide6 app (Windows)
build.bat
```

For detailed build instructions, see [BUILD.md](BUILD.md).

## Usage

### Standalone Mode

1. Launch the application
2. Select your microphone from the audio device dropdown
3. Choose a Whisper model (smaller = faster, larger = more accurate):
   - `tiny.en` / `tiny` — Fastest, good for quick captions
   - `base.en` / `base` — Balanced speed and accuracy
   - `small.en` / `small` — Better accuracy
   - `medium.en` / `medium` — High accuracy
   - `large-v3` — Best accuracy (requires more resources)
4. Click **Start** to begin transcription
5. Transcriptions appear in the main window and at `http://localhost:8080`

### Remote Transcription (Deepgram)

Instead of local Whisper models, you can use cloud-based transcription:

- **Managed mode**: Sign up via the transcription proxy for metered billing
- **BYOK mode**: Bring your own Deepgram API key for direct access

Configure in Settings > Remote Transcription.

### OBS Browser Source Setup

1. Start the Local Transcription app
2. In OBS, add a **Browser** source
3. Set URL to `http://localhost:8080`
4. Set dimensions (e.g., 1920x300)
5. Check "Shutdown source when not visible" for performance

### Multi-User Mode (Optional)

For syncing transcriptions across multiple users (e.g., multi-host streams or translation teams):

1. Deploy the Node.js server (see [server/nodejs/README.md](server/nodejs/README.md))
2. In the app settings, enable **Server Sync**
3. Enter the server URL (e.g., `http://your-server:3000/api/send`)
4. Set a room name and passphrase (shared with other users)
5. In OBS, use the server's display URL with your room name:

```
http://your-server:3000/display?room=YOURROOM&timestamps=true&maxlines=50
```

## Configuration

Settings are stored at `~/.local-transcription/config.yaml` and can be modified through the GUI settings panel or the REST API.
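The layering of user settings from `~/.local-transcription/config.yaml` over the defaults template can be sketched with a small merge helper. This is illustrative only: the key names below come from the settings table, but the `deep_merge` helper is hypothetical, not the app's actual loader.

```python
# Hypothetical sketch: overlay user overrides onto default settings.
# The app's real config loader may differ; keys mirror the Key Settings table.

def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Recursively overlay `overrides` onto `defaults` without mutating either."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

defaults = {
    "transcription": {"model": "base.en", "device": "auto"},
    "web_server": {"port": 8080},
}
user_overrides = {"transcription": {"model": "small.en"}}

config = deep_merge(defaults, user_overrides)
# config["transcription"] == {"model": "small.en", "device": "auto"}
```

A nested merge like this lets a user file override only the keys it sets, so untouched defaults (here `transcription.device` and `web_server.port`) still apply.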
### Key Settings

| Setting | Description | Default |
|---------|-------------|---------|
| `transcription.model` | Whisper model to use | `base.en` |
| `transcription.device` | Processing device (auto/cuda/cpu) | `auto` |
| `transcription.enable_realtime_transcription` | Show preview while speaking | `false` |
| `transcription.silero_sensitivity` | VAD sensitivity (0-1, lower = more sensitive) | `0.4` |
| `transcription.post_speech_silence_duration` | Silence before finalizing (seconds) | `0.3` |
| `transcription.continuous_mode` | Fast speaker mode for quick talkers | `false` |
| `remote.mode` | Transcription mode (local/managed/byok) | `local` |
| `display.show_timestamps` | Show timestamps with transcriptions | `true` |
| `display.fade_after_seconds` | Fade out time (0 = never) | `10` |
| `display.font_source` | Font type (System Font/Web-Safe/Google Font/Custom File) | `System Font` |
| `web_server.port` | Local web server port | `8080` |

See [config/default_config.yaml](config/default_config.yaml) for all available options.
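The role of `post_speech_silence_duration` can be illustrated with a toy sketch: an utterance is finalized once the voice detector has reported silence for that long. Note the real pipeline uses WebRTC + Silero VAD; here the speech/silence flag is supplied by the caller, and the `UtteranceFinalizer` class is an invented stand-in, not app code.

```python
# Toy sketch (not the app's implementation) of silence-based finalization,
# mirroring the `post_speech_silence_duration` default of 0.3 s.
from dataclasses import dataclass, field

@dataclass
class UtteranceFinalizer:
    silence_duration: float = 0.3   # seconds of silence before finalizing
    frame_seconds: float = 0.02     # 20 ms frames, a common VAD frame size
    _silence: float = 0.0
    _buffer: list = field(default_factory=list)

    def feed(self, frame_text: str, is_speech: bool):
        """Feed one frame; return the finalized utterance, or None."""
        if is_speech:
            self._buffer.append(frame_text)
            self._silence = 0.0       # speech resets the silence timer
            return None
        self._silence += self.frame_seconds
        if self._buffer and self._silence >= self.silence_duration:
            utterance = " ".join(self._buffer)
            self._buffer = []         # start a fresh utterance
            return utterance
        return None

fin = UtteranceFinalizer()
for word in ("hello", "world"):
    fin.feed(word, True)
# After enough silent frames, the buffered words are finalized as one utterance.
result = next(filter(None, (fin.feed("", False) for _ in range(20))))
# result == "hello world"
```

Lowering `silence_duration` finalizes captions faster but risks splitting sentences at natural pauses, which is the trade-off the setting exposes.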
## Project Structure

```
local-transcription/
├── src/                                  # Svelte 5 frontend (Tauri UI)
│   ├── App.svelte                        # Main app shell
│   ├── lib/components/                   # UI components
│   │   ├── Header.svelte
│   │   ├── StatusBar.svelte
│   │   ├── Controls.svelte
│   │   ├── TranscriptionDisplay.svelte
│   │   └── Settings.svelte
│   └── lib/stores/                       # Reactive state management
│       ├── backend.ts                    # WebSocket + REST API client
│       ├── config.ts                     # App configuration
│       └── transcriptions.ts             # Transcription data
├── src-tauri/                            # Tauri v2 Rust shell
│   ├── src/main.rs
│   └── tauri.conf.json
├── backend/                              # Headless Python backend (sidecar)
│   ├── app_controller.py                 # Orchestration logic (engine, sync, config)
│   ├── api_server.py                     # FastAPI REST + WebSocket control API
│   └── main_headless.py                  # Headless entry point
├── client/                               # Core transcription modules
│   ├── audio_capture.py                  # Audio input handling
│   ├── transcription_engine_realtime.py  # RealtimeSTT / Whisper
│   ├── deepgram_transcription.py         # Deepgram cloud transcription
│   ├── noise_suppression.py              # VAD and noise reduction
│   ├── device_utils.py                   # CPU/GPU/MPS detection
│   ├── config.py                         # Configuration management
│   ├── server_sync.py                    # Multi-user server client
│   └── update_checker.py                 # Auto-update functionality
├── gui/                                  # Legacy PySide6/Qt GUI
│   ├── main_window_qt.py
│   ├── settings_dialog_qt.py
│   └── transcription_display_qt.py
├── server/                               # Web servers
│   ├── web_display.py                    # Local FastAPI server for OBS
│   └── nodejs/                           # Multi-user sync server
├── .gitea/workflows/                     # CI/CD
│   ├── release.yml                       # Tauri app builds (all platforms)
│   └── build-sidecar.yml                 # Python sidecar builds (CUDA + CPU)
├── config/
│   └── default_config.yaml               # Default settings template
├── main.py                               # Legacy GUI entry point
├── main_cli.py                           # CLI version (for testing)
├── local-transcription.spec              # PyInstaller config (legacy, with PySide6)
├── local-transcription-headless.spec     # PyInstaller config (headless sidecar)
├── pyproject.toml                        # Python dependencies
└── package.json                          # Node.js / Tauri dependencies
```

## Technology Stack

### Frontend (Tauri)

- **Tauri v2** — Native cross-platform shell (Rust)
- **Svelte 5** — Reactive UI framework (TypeScript)
- **Vite** — Frontend build tool

### Backend (Python Sidecar)

- **Python 3.9+**
- **FastAPI + Uvicorn** — REST API and WebSocket server
- **RealtimeSTT** — Real-time speech-to-text with advanced VAD
- **faster-whisper** — Optimized Whisper model inference (CTranslate2)
- **PyTorch** — ML framework (CUDA-enabled builds available)
- **sounddevice** — Cross-platform audio capture
- **webrtcvad + silero_vad** — Voice activity detection

### Multi-User Server (Optional)

- **Node.js + Express + WebSocket** — Real-time sync server

### Build & CI/CD

- **PyInstaller** — Python sidecar packaging
- **Tauri CLI** — App bundling (.msi, .dmg, .deb, .rpm, .AppImage)
- **Gitea Actions** — Automated cross-platform builds
- **uv** — Fast Python package manager

## CI/CD

Two Gitea Actions workflows in `.gitea/workflows/`:

| Workflow | Trigger | Produces |
|----------|---------|----------|
| `release.yml` | Push to `main` | Tauri app installers for all platforms |
| `build-sidecar.yml` | Changes to `client/`, `server/`, `backend/`, or `pyproject.toml` | Python sidecar zips (CUDA + CPU) |

Both workflows require a `BUILD_TOKEN` secret in the repository settings (a Gitea API token with release write access).
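As a rough illustration, the path-based trigger for `build-sidecar.yml` described in the table might look like the following. This is a sketch, not the repository's actual workflow file.

```yaml
# Illustrative trigger block only; paths taken from the table above.
on:
  push:
    paths:
      - "client/**"
      - "server/**"
      - "backend/**"
      - "pyproject.toml"
```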
### Release Artifacts

| Platform | App Installer | Sidecar (CUDA) | Sidecar (CPU) |
|----------|--------------|----------------|---------------|
| Linux x86_64 | `.deb`, `.rpm`, `.AppImage` | `sidecar-linux-x86_64-cuda.zip` | `sidecar-linux-x86_64-cpu.zip` |
| Windows x86_64 | `.msi`, `-setup.exe` | `sidecar-windows-x86_64-cuda.zip` | `sidecar-windows-x86_64-cpu.zip` |
| macOS ARM64 | `.dmg` | — | `sidecar-macos-aarch64-cpu.zip` |

## System Requirements

### Minimum

- 4GB RAM
- Any modern CPU

### Recommended (for local real-time transcription)

- 8GB+ RAM
- NVIDIA GPU with CUDA support (for GPU acceleration)

### For Building

- **Tauri app**: Node.js 20+, Rust stable, platform SDK (see [Tauri prerequisites](https://tauri.app/start/prerequisites/))
- **Python sidecar**: Python 3.9+, uv, PyInstaller
- **Linux**: `libgtk-3-dev`, `libwebkit2gtk-4.1-dev`, `libappindicator3-dev`, `librsvg2-dev`, `patchelf`
- **Windows**: Visual Studio Build Tools, WebView2
- **macOS**: Xcode Command Line Tools

## Troubleshooting

### Model Loading Issues

- Models download automatically on first use to `~/.cache/huggingface/`
- First run requires an internet connection
- Check disk space (models range from 75MB to 3GB)

### Audio Device Issues

```bash
# List available audio devices
uv run python main_cli.py --list-devices
```

- Ensure microphone permissions are granted (especially on macOS)
- Try different device indices in settings

### GPU Not Detected

```bash
# Check CUDA availability
uv run python -c "import torch; print(torch.cuda.is_available())"
```

- Install NVIDIA drivers (the CUDA toolkit is bundled in CUDA sidecar builds)
- The app automatically falls back to CPU if no GPU is available

### Web Server Port Conflicts

- Default port is 8080; the app tries ports 8080-8084 automatically
- Change the port in settings or edit the config file
- Check for conflicts: `lsof -i :8080` (Linux/macOS) or `netstat -ano | findstr :8080` (Windows)

## Use Cases

- **Live Streaming Captions**: Add real-time captions to your Twitch/YouTube streams
- **Multi-Language Translation**: Multiple translators transcribing in different languages
- **Accessibility**: Provide captions for hearing-impaired viewers
- **Podcast Recording**: Real-time transcription for multi-host shows
- **Gaming Commentary**: Track who said what in multiplayer sessions

## Contributing

Contributions are welcome! Please feel free to submit issues or pull requests at the [repository](https://repo.anhonesthost.net/streamer-tools/local-transcription).

## License

MIT License

## Acknowledgments

- [OpenAI Whisper](https://github.com/openai/whisper) for the speech recognition model
- [RealtimeSTT](https://github.com/KoljaB/RealtimeSTT) for real-time transcription capabilities
- [faster-whisper](https://github.com/guillaumekln/faster-whisper) for optimized inference
- [Tauri](https://tauri.app/) for the cross-platform desktop framework
- [Deepgram](https://deepgram.com/) for the cloud transcription API