# Local Transcription
A real-time speech-to-text desktop application for streamers. Runs locally on your machine with GPU or CPU, displays transcriptions via OBS browser source, and optionally syncs with other users through a multi-user server.

**Version 1.4.0**

## Features
- **Real-Time Transcription**: Live speech-to-text using Whisper models with minimal latency
- **Cross-Platform**: Native desktop app for Windows, macOS, and Linux via [Tauri](https://tauri.app/)
- **Dual Transcription Modes**: Local (Whisper) or cloud (Deepgram) with managed billing or BYOK
- **CPU & GPU Support**: Automatic detection of CUDA (NVIDIA), MPS (Apple Silicon), or CPU fallback
- **Advanced Voice Detection**: Dual-layer VAD (WebRTC + Silero) for accurate speech detection
- **OBS Integration**: Built-in web server for browser source capture at `http://localhost:8080`
- **Multi-User Sync**: Optional Node.js server to sync transcriptions across multiple users
- **Custom Fonts**: Support for system fonts, web-safe fonts, Google Fonts, and custom font files
- **Customizable Colors**: User-configurable colors for name, text, and background
- **Noise Suppression**: Built-in audio preprocessing to reduce background noise
- **Auto-Updates**: Automatic update checking with release notes display
## Architecture
The application uses a two-process architecture:
1. **Tauri Shell** (Svelte 5 frontend) — lightweight native window (~50MB) rendering the UI
2. **Python Backend** (sidecar) — headless process running transcription, audio capture, and the OBS web server

The Tauri frontend communicates with the Python backend via REST API and WebSocket, following the same pattern as [voice-to-notes](https://repo.anhonesthost.net/MacroPad/voice-to-notes).
```
Tauri App (user launches this)
  └─ Spawns Python backend as sidecar
       ├─ FastAPI REST API (control endpoints)
       ├─ WebSocket /ws/control (real-time state + transcriptions)
       ├─ OBS web display at http://localhost:8080
       └─ Transcription engine (Whisper or Deepgram)
```
> **Legacy GUI**: The original PySide6/Qt desktop GUI (`main.py`) still works alongside the new Tauri frontend during the transition period.
## Quick Start
### Running from Source
```bash
# Install Python dependencies
uv sync

# Run the Tauri app (frontend + backend)
npm install
npm run tauri dev

# Or run just the headless backend (for development)
uv run python -m backend.main_headless

# Or run the legacy PySide6 GUI
uv run python main.py
```
### Using Pre-Built Executables
Download the latest release from the [releases page](https://repo.anhonesthost.net/streamer-tools/local-transcription/releases):
- **App installer** (Tauri shell): `.msi` (Windows), `.dmg` (macOS), `.deb`/`.rpm`/`.AppImage` (Linux)
- **Sidecar** (Python backend): Download the matching `sidecar-*` zip for your platform (CUDA or CPU)
### Building from Source
```bash
# Build the Tauri app
npm install
npm run tauri build
# Output: src-tauri/target/release/bundle/

# Build the Python sidecar (headless, no Qt)
uv sync
uv run pyinstaller local-transcription-headless.spec
# Output: dist/local-transcription-backend/

# Build the legacy PySide6 app (Linux)
./build.sh

# Build the legacy PySide6 app (Windows)
build.bat
```
For detailed build instructions, see [BUILD.md](BUILD.md).
## Usage
### Standalone Mode
1. Launch the application
2. Select your microphone from the audio device dropdown
3. Choose a Whisper model (smaller = faster, larger = more accurate):
   - `tiny.en` / `tiny` — Fastest, good for quick captions
   - `base.en` / `base` — Balanced speed and accuracy
   - `small.en` / `small` — Better accuracy
   - `medium.en` / `medium` — High accuracy
   - `large-v3` — Best accuracy (requires more resources)
4. Click **Start** to begin transcription
5. Transcriptions appear in the main window and at `http://localhost:8080`
### Remote Transcription (Deepgram)
Instead of local Whisper models, you can use cloud-based transcription:
- **Managed mode**: Sign up via the transcription proxy for metered billing
- **BYOK mode**: Bring your own Deepgram API key for direct access

Configure in Settings > Remote Transcription.
### OBS Browser Source Setup
1. Start the Local Transcription app
2. In OBS, add a **Browser** source
3. Set URL to `http://localhost:8080`
4. Set dimensions (e.g., 1920x300)
5. Check "Shutdown source when not visible" for performance
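
If the browser source stays blank, it can help to confirm that the local web display is answering before debugging OBS itself. The following is a minimal sketch using only the Python standard library. The helper name `is_display_up` is hypothetical and not part of the app, and the default URL assumes the web server is on its default port.

```python
import urllib.error
import urllib.request


def is_display_up(url: str = "http://localhost:8080", timeout: float = 2.0) -> bool:
    """Return True if an HTTP GET to `url` succeeds with a 2xx/3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False
```

Calling `is_display_up()` while the app is running should return `True`; a `False` result suggests the backend is not running or is listening on a different port.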
### Multi-User Mode (Optional)
For syncing transcriptions across multiple users (e.g., multi-host streams or translation teams):
1. Deploy the Node.js server (see [server/nodejs/README.md](server/nodejs/README.md))
2. In the app settings, enable **Server Sync**
3. Enter the server URL (e.g., `http://your-server:3000/api/send`)
4. Set a room name and passphrase (shared with other users)
5. In OBS, use the server's display URL with your room name:
   ```
   http://your-server:3000/display?room=YOURROOM&timestamps=true&maxlines=50
   ```
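
Room names containing spaces or punctuation need URL encoding before they go into the display URL. As a rough sketch, assuming the query parameters are `room`, `timestamps`, and `maxlines`, the URL could be assembled with the standard library. The function name `build_display_url` is made up for this example.

```python
from urllib.parse import urlencode


def build_display_url(server: str, room: str,
                      timestamps: bool = True, maxlines: int = 50) -> str:
    """Build the multi-user display URL with a safely encoded query string."""
    query = urlencode({
        "room": room,
        "timestamps": str(timestamps).lower(),  # "true" / "false"
        "maxlines": maxlines,
    })
    return f"{server.rstrip('/')}/display?{query}"


print(build_display_url("http://your-server:3000", "YOURROOM"))
```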
## Configuration
Settings are stored at `~/.local-transcription/config.yaml` and can be modified through the GUI settings panel or the REST API.
### Key Settings
| Setting | Description | Default |
|---------|-------------|---------|
| `transcription.model` | Whisper model to use | `base.en` |
| `transcription.device` | Processing device (auto/cuda/cpu) | `auto` |
| `transcription.enable_realtime_transcription` | Show preview while speaking | `false` |
| `transcription.silero_sensitivity` | VAD sensitivity (0-1, lower = more sensitive) | `0.4` |
| `transcription.post_speech_silence_duration` | Silence before finalizing (seconds) | `0.3` |
| `transcription.continuous_mode` | Fast speaker mode for quick talkers | `false` |
| `remote.mode` | Transcription mode (local/managed/byok) | `local` |
| `display.show_timestamps` | Show timestamps with transcriptions | `true` |
| `display.fade_after_seconds` | Fade out time (0 = never) | `10` |
| `display.font_source` | Font type (System Font/Web-Safe/Google Font/Custom File) | `System Font` |
| `web_server.port` | Local web server port | `8080` |

See [config/default_config.yaml](config/default_config.yaml) for all available options.
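
The dotted setting names in the table map naturally onto nested keys. The sketch below illustrates that idea with plain dictionaries: user overrides are layered over defaults, then read back by dotted key. It is an illustration only and not the app's `client/config.py`; `DEFAULTS`, `deep_merge`, and `get_setting` are hypothetical names, and only a few defaults from the table are reproduced.

```python
from copy import deepcopy

# A small subset of the defaults listed in the table above.
DEFAULTS = {
    "transcription": {"model": "base.en", "device": "auto",
                      "post_speech_silence_duration": 0.3},
    "display": {"show_timestamps": True, "fade_after_seconds": 10},
    "web_server": {"port": 8080},
}


def deep_merge(base: dict, overrides: dict) -> dict:
    """Return a new dict where nested keys in `overrides` replace those in `base`."""
    merged = deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def get_setting(config: dict, dotted_key: str):
    """Look up a value such as 'transcription.model'; raises KeyError if absent."""
    node = config
    for part in dotted_key.split("."):
        node = node[part]
    return node


user_overrides = {"transcription": {"model": "small.en"}}
config = deep_merge(DEFAULTS, user_overrides)
print(get_setting(config, "transcription.model"))   # overridden value
print(get_setting(config, "transcription.device"))  # default retained
```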
## Project Structure
```
local-transcription/
├── src/ # Svelte 5 frontend (Tauri UI)
│ ├── App.svelte # Main app shell
│ ├── lib/components/ # UI components
│ │ ├── Header.svelte
│ │ ├── StatusBar.svelte
│ │ ├── Controls.svelte
│ │ ├── TranscriptionDisplay.svelte
│ │ └── Settings.svelte
│ └── lib/stores/ # Reactive state management
│ ├── backend.ts # WebSocket + REST API client
│ ├── config.ts # App configuration
│ └── transcriptions.ts # Transcription data
├── src-tauri/ # Tauri v2 Rust shell
│ ├── src/main.rs
│ └── tauri.conf.json
├── backend/ # Headless Python backend (sidecar)
│ ├── app_controller.py # Orchestration logic (engine, sync, config)
│ ├── api_server.py # FastAPI REST + WebSocket control API
│ └── main_headless.py # Headless entry point
├── client/ # Core transcription modules
│ ├── audio_capture.py # Audio input handling
│ ├── transcription_engine_realtime.py # RealtimeSTT / Whisper
│ ├── deepgram_transcription.py # Deepgram cloud transcription
│ ├── noise_suppression.py # VAD and noise reduction
│ ├── device_utils.py # CPU/GPU/MPS detection
│ ├── config.py # Configuration management
│ ├── server_sync.py # Multi-user server client
│ └── update_checker.py # Auto-update functionality
├── gui/ # Legacy PySide6/Qt GUI
│ ├── main_window_qt.py
│ ├── settings_dialog_qt.py
│ └── transcription_display_qt.py
├── server/ # Web servers
│ ├── web_display.py # Local FastAPI server for OBS
│ └── nodejs/ # Multi-user sync server
├── .gitea/workflows/ # CI/CD
│ ├── release.yml # Tauri app builds (all platforms)
│ └── build-sidecar.yml # Python sidecar builds (CUDA + CPU)
├── config/
│ └── default_config.yaml # Default settings template
├── main.py # Legacy GUI entry point
├── main_cli.py # CLI version (for testing)
├── local-transcription.spec # PyInstaller config (legacy, with PySide6)
├── local-transcription-headless.spec # PyInstaller config (headless sidecar)
├── pyproject.toml # Python dependencies
└── package.json # Node.js / Tauri dependencies
```
## Technology Stack
### Frontend (Tauri)
- **Tauri v2** — Native cross-platform shell (Rust)
- **Svelte 5** — Reactive UI framework (TypeScript)
- **Vite** — Frontend build tool
### Backend (Python Sidecar)
- **Python 3.9+**
- **FastAPI + Uvicorn** — REST API and WebSocket server
- **RealtimeSTT** — Real-time speech-to-text with advanced VAD
- **faster-whisper** — Optimized Whisper model inference (CTranslate2)
- **PyTorch** — ML framework (CUDA-enabled builds available)
- **sounddevice** — Cross-platform audio capture
- **webrtcvad + silero_vad** — Voice activity detection
### Multi-User Server (Optional)
- **Node.js + Express + WebSocket** — Real-time sync server
### Build & CI/CD
- **PyInstaller** — Python sidecar packaging
- **Tauri CLI** — App bundling (.msi, .dmg, .deb, .rpm, .AppImage)
- **Gitea Actions** — Automated cross-platform builds
- **uv** — Fast Python package manager
## CI/CD
Two Gitea Actions workflows in `.gitea/workflows/`:

| Workflow | Trigger | Produces |
|----------|---------|----------|
| `release.yml` | Push to `main` | Tauri app installers for all platforms |
| `build-sidecar.yml` | Changes to `client/`, `server/`, `backend/`, or `pyproject.toml` | Python sidecar zips (CUDA + CPU) |

Both workflows require a `BUILD_TOKEN` secret in the repo settings (Gitea API token with release write access).
### Release Artifacts
| Platform | App Installer | Sidecar (CUDA) | Sidecar (CPU) |
|----------|--------------|----------------|---------------|
| Linux x86_64 | `.deb`, `.rpm`, `.AppImage` | `sidecar-linux-x86_64-cuda.zip` | `sidecar-linux-x86_64-cpu.zip` |
| Windows x86_64 | `.msi`, `-setup.exe` | `sidecar-windows-x86_64-cuda.zip` | `sidecar-windows-x86_64-cpu.zip` |
| macOS ARM64 | `.dmg` | — | `sidecar-macos-aarch64-cpu.zip` |
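
The sidecar names in the table follow a `sidecar-<os>-<arch>-<variant>.zip` pattern. As a hedged sketch, a script could pick the matching name from the values reported by Python's `platform` module. The function `sidecar_zip` and the lookup tables are hypothetical helpers written for this example; they only cover the three platforms listed above.

```python
import platform
from typing import Optional

# Map platform.system() / platform.machine() values to the names used in the table.
_OS = {"Linux": "linux", "Windows": "windows", "Darwin": "macos"}
_ARCH = {"x86_64": "x86_64", "AMD64": "x86_64", "arm64": "aarch64", "aarch64": "aarch64"}
# Variants published per platform, per the release artifacts table.
_VARIANTS = {
    ("linux", "x86_64"): ("cuda", "cpu"),
    ("windows", "x86_64"): ("cuda", "cpu"),
    ("macos", "aarch64"): ("cpu",),
}


def sidecar_zip(system: str, machine: str, want_cuda: bool) -> Optional[str]:
    """Return the sidecar zip name for a platform, or None if none is listed."""
    key = (_OS.get(system), _ARCH.get(machine))
    variants = _VARIANTS.get(key)
    if variants is None:
        return None
    # Fall back to the CPU build when no CUDA build exists for the platform.
    variant = "cuda" if want_cuda and "cuda" in variants else "cpu"
    return f"sidecar-{key[0]}-{key[1]}-{variant}.zip"


print(sidecar_zip(platform.system(), platform.machine(), want_cuda=True))
```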
## System Requirements
### Minimum
- 4GB RAM
- Any modern CPU
### Recommended (for local real-time transcription)
- 8GB+ RAM
- NVIDIA GPU with CUDA support (for GPU acceleration)
### For Building
- **Tauri app**: Node.js 20+, Rust stable, platform SDK (see [Tauri prerequisites](https://tauri.app/start/prerequisites/))
- **Python sidecar**: Python 3.9+, uv, PyInstaller
- **Linux**: `libgtk-3-dev`, `libwebkit2gtk-4.1-dev`, `libappindicator3-dev`, `librsvg2-dev`, `patchelf`
- **Windows**: Visual Studio Build Tools, WebView2
- **macOS**: Xcode Command Line Tools
## Troubleshooting
### macOS: "App is damaged and can't be opened"
macOS Gatekeeper blocks unsigned applications. Since the app is not yet signed with an Apple Developer certificate, you need to remove the quarantine flag before opening:
```bash
xattr -cr "/Applications/Local Transcription.app"
```
Then open the app normally. You only need to do this once after downloading.
### Model Loading Issues
- Models download automatically on first use to `~/.cache/huggingface/`
- First run requires internet connection
- Check disk space (models range from 75MB to 3GB)
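
A quick way to check disk space is to compare the free bytes on the cache volume with the largest model size quoted above. This is a minimal standard-library sketch; `has_room_for` is a hypothetical helper, and the cache path is taken from the first bullet.

```python
import shutil
from pathlib import Path

LARGEST_MODEL_BYTES = 3 * 1024**3  # upper end of the 75MB to 3GB range above


def has_room_for(required_bytes: int, cache_dir=None) -> bool:
    """Return True if the volume holding the model cache has enough free space."""
    probe = Path(cache_dir) if cache_dir else Path.home() / ".cache" / "huggingface"
    # The cache may not exist before the first download; walk up to a real directory.
    while not probe.exists():
        probe = probe.parent
    return shutil.disk_usage(probe).free >= required_bytes


print("Enough space for the largest model:", has_room_for(LARGEST_MODEL_BYTES))
```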
### Audio Device Issues
```bash
# List available audio devices
uv run python main_cli.py --list-devices
```
- Ensure microphone permissions are granted (especially on macOS)
- Try different device indices in settings
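
To see which indices are worth trying, the device list can be narrowed to entries that actually accept input. The sketch below assumes the `sounddevice` package from the technology stack is installed and relies on the `name` and `max_input_channels` fields returned by `sounddevice.query_devices()`. The helpers `input_devices` and `list_system_inputs` are hypothetical and are not necessarily how `main_cli.py --list-devices` is implemented.

```python
def input_devices(devices):
    """Return (index, name) pairs for devices with at least one input channel."""
    return [
        (index, device["name"])
        for index, device in enumerate(devices)
        if device.get("max_input_channels", 0) > 0
    ]


def list_system_inputs():
    """Query the real audio devices; requires the third-party sounddevice package."""
    import sounddevice as sd  # imported lazily so the filter above works without it
    return input_devices(sd.query_devices())


# Example with made-up device data shaped like sounddevice's output:
sample = [
    {"name": "Built-in Output", "max_input_channels": 0},
    {"name": "USB Microphone", "max_input_channels": 1},
]
print(input_devices(sample))
```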
### GPU Not Detected
```bash
# Check CUDA availability
uv run python -c "import torch; print(torch.cuda.is_available())"
```
- Install NVIDIA drivers (CUDA toolkit is bundled in CUDA sidecar builds)
- The app automatically falls back to CPU if no GPU is available
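
The fallback order described in this README (CUDA, then MPS, then CPU) can be sketched as follows. This is an illustration under assumptions, not the contents of `client/device_utils.py`: `choose_device` and `detect_device` are hypothetical names, and the PyTorch calls used are `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.

```python
def choose_device(cuda_available: bool, mps_available: bool) -> str:
    """Pick a device name in priority order: CUDA, then MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Probe PyTorch for accelerators; fall back to CPU if PyTorch is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # Older PyTorch builds may not expose the MPS backend at all.
    mps = getattr(torch.backends, "mps", None)
    mps_available = bool(mps is not None and mps.is_available())
    return choose_device(torch.cuda.is_available(), mps_available)


print(detect_device())
```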
### Web Server Port Conflicts
- Default port is 8080; the app tries ports 8080-8084 automatically
- Change in settings or edit config file
- Check for conflicts: `lsof -i :8080` (Linux/macOS) or `netstat -ano | findstr :8080` (Windows)
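
For a cross-platform check that does not depend on `lsof` or `netstat`, a short script can try to bind each candidate port in turn, which mirrors the 8080-8084 fallback range described above. This is a sketch of the general technique, not the app's actual implementation; `find_free_port` is a hypothetical helper.

```python
import socket
from typing import Iterable, Optional


def find_free_port(candidates: Iterable[int] = range(8080, 8085),
                   host: str = "127.0.0.1") -> Optional[int]:
    """Return the first port in `candidates` that can be bound, or None."""
    for port in candidates:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            try:
                probe.bind((host, port))
            except OSError:
                continue  # port is in use or not permitted; try the next one
            return port
    return None


print("First free port in 8080-8084:", find_free_port())
```

A result other than 8080 means something else is already bound to the default port. There is an inherent race between probing and the app starting, so treat the result as a hint.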
## Use Cases
- **Live Streaming Captions**: Add real-time captions to your Twitch/YouTube streams
- **Multi-Language Translation**: Multiple translators transcribing in different languages
- **Accessibility**: Provide captions for hearing-impaired viewers
- **Podcast Recording**: Real-time transcription for multi-host shows
- **Gaming Commentary**: Track who said what in multiplayer sessions
## Contributing
Contributions are welcome! Please feel free to submit issues or pull requests at the [repository](https://repo.anhonesthost.net/streamer-tools/local-transcription).
## License
MIT License
## Acknowledgments
- [OpenAI Whisper](https://github.com/openai/whisper) for the speech recognition model
- [RealtimeSTT](https://github.com/KoljaB/RealtimeSTT) for real-time transcription capabilities
- [faster-whisper](https://github.com/guillaumekln/faster-whisper) for optimized inference
- [Tauri](https://tauri.app/) for the cross-platform desktop framework
- [Deepgram](https://deepgram.com/) for cloud transcription API