# Local Transcription

A real-time speech-to-text desktop application for streamers. Runs locally on your machine with GPU or CPU, displays transcriptions via OBS browser source, and optionally syncs with other users through a multi-user server.

**Version 1.4.0**

## Features

- **Real-Time Transcription**: Live speech-to-text using Whisper models with minimal latency
- **Cloud-First**: Defaults to Deepgram cloud transcription — get started with just an API key
- **Cross-Platform**: Native desktop app for Windows, macOS, and Linux via [Tauri](https://tauri.app/)
- **Dual Transcription Modes**: Cloud (Deepgram) or local (Whisper) with automatic GPU/CPU detection
- **Shared Captions**: Create a room and share a code so others can join — no server setup needed
- **OBS Integration**: Built-in web server for browser source capture at `http://localhost:8080`
- **Custom Fonts**: Support for system fonts, web-safe fonts, Google Fonts, and custom font files
- **Customizable Colors**: User-configurable colors for name, text, and background
- **Advanced Voice Detection**: Dual-layer VAD (WebRTC + Silero) for accurate speech detection
- **Noise Suppression**: Built-in audio preprocessing to reduce background noise
- **Auto-Updates**: Automatic update checking with release notes display

## Architecture

The application uses a two-process architecture:

1. **Tauri Shell** (Svelte 5 frontend) — lightweight native window (~50MB) rendering the UI
2. **Python Backend** (sidecar) — headless process running transcription, audio capture, and the OBS web server

The Tauri frontend communicates with the Python backend via REST API and WebSocket, following the same pattern as [voice-to-notes](https://repo.anhonesthost.net/MacroPad/voice-to-notes).

```
Tauri App (user launches this)
  └─ Spawns Python backend as sidecar
       ├─ FastAPI REST API (control endpoints)
       ├─ WebSocket /ws/control (real-time state + transcriptions)
       ├─ OBS web display at http://localhost:8080
       └─ Transcription engine (Whisper or Deepgram)
```

> **Legacy GUI**: The original PySide6/Qt desktop GUI (`main.py`) still works alongside the new Tauri frontend during the transition period.

## Quick Start

### Running from Source

```bash
# Install Python dependencies
uv sync

# Run the Tauri app (frontend + backend)
npm install
npm run tauri dev

# Or run just the headless backend (for development)
uv run python -m backend.main_headless

# Or run the legacy PySide6 GUI
uv run python main.py
```

### Using Pre-Built Executables

Download the latest release from the [releases page](https://repo.anhonesthost.net/streamer-tools/local-transcription/releases):

- **App installer** (Tauri shell): `.msi` (Windows), `.dmg` (macOS), `.deb`/`.rpm`/`.AppImage` (Linux)
- **Sidecar** (Python backend): Download the matching `sidecar-*` zip for your platform (CUDA or CPU)

### Building from Source

```bash
# Build the Tauri app
npm install
npm run tauri build
# Output: src-tauri/target/release/bundle/

# Build the Python sidecar (headless, no Qt)
uv sync
uv run pyinstaller local-transcription-headless.spec
# Output: dist/local-transcription-backend/

# Build the legacy PySide6 app (Linux)
./build.sh
# Build the legacy PySide6 app (Windows)
build.bat
```

For detailed build instructions, see [BUILD.md](BUILD.md).

## Usage

### Quick Setup (Cloud — Recommended)

1. Launch the application
2. Open **Settings** — the transcription mode defaults to **Cloud (Deepgram)**
3. Get a free API key at [console.deepgram.com](https://console.deepgram.com) and paste it in Settings
4. Select your microphone from the audio device dropdown
5. Click **Start Transcription**
6. Transcriptions appear in the main window and at `http://localhost:8080`

> The Start button is disabled until an API key is entered. Local-only settings (model, VAD, timing) are hidden in cloud mode to keep things simple.

### Local Mode (Whisper)

For offline/on-device transcription, switch to **Local (Whisper)** in Settings:

1. Choose a Whisper model (smaller = faster, larger = more accurate):
   - `tiny.en` / `tiny` — Fastest, good for quick captions
   - `base.en` / `base` — Balanced speed and accuracy
   - `small.en` / `small` — Better accuracy
   - `medium.en` / `medium` — High accuracy
   - `large-v3` — Best accuracy (requires more resources)
2. Select compute device (Auto/CUDA/CPU) and compute type
3. Tune VAD sensitivity and timing settings as needed
4. Click **Start Transcription**

### OBS Browser Source Setup

1. Start the Local Transcription app
2. In OBS, add a **Browser** source
3. Set URL to `http://localhost:8080`
4. Set dimensions (e.g., 1920x300)
5. Check "Shutdown source when not visible" for performance

### Shared Captions (Multi-User)

Share live captions across multiple users using the hosted service at `https://caption.shadowdao.com/` — no server setup required.

#### Creating a Room

1. Open **Settings** and enable **Shared Captions**
2. Click **Create Room** — this generates a room name and passphrase automatically
3. A **share code** is generated and copied to your clipboard
4. Send the share code to anyone who should join

#### Joining a Room

1. Open **Settings** and enable **Shared Captions**
2. Paste the share code you received into the **"Paste share code to join"** field
3. Click **Join** — the server URL, room, and passphrase are auto-filled
4. Click **Save**

#### Sharing an Existing Room

If you already have a room configured and want to invite others:

1. Open **Settings** and scroll to **Shared Captions**
2. Click **Share Current Room** — generates a share code from your current config and copies it to the clipboard
3. Send the code to others

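A share code bundles the server URL, room, and passphrase into one string that can be pasted. The actual encoding used by the app is not documented here, so the sketch below is only an illustration: it assumes a URL-safe base64 wrapper around a small JSON object, and the field names `server_url`, `room`, and `passphrase` as well as both function names are hypothetical.

```python
import base64
import json

# Hypothetical field names; the real share-code format may differ.
REQUIRED_FIELDS = {"server_url", "room", "passphrase"}


def encode_share_code(server_url: str, room: str, passphrase: str) -> str:
    """Pack the three connection settings into a single pasteable string."""
    payload = json.dumps(
        {"server_url": server_url, "room": room, "passphrase": passphrase},
        separators=(",", ":"),
    ).encode("utf-8")
    return base64.urlsafe_b64encode(payload).decode("ascii")


def decode_share_code(code: str) -> dict:
    """Unpack a share code and check that every expected field is present."""
    raw = base64.urlsafe_b64decode(code.strip().encode("ascii"))
    fields = json.loads(raw)
    if not isinstance(fields, dict):
        raise ValueError("share code does not contain an object")
    missing = REQUIRED_FIELDS - fields.keys()
    if missing:
        raise ValueError(f"share code is missing fields: {sorted(missing)}")
    return fields
```

Under this assumed format the code is only encoded, not encrypted, so anyone who holds it can read the passphrase. Send share codes over a channel you trust.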
#### OBS Display for Shared Rooms

In OBS, add a Browser source pointing to the server's display URL:
```
https://caption.shadowdao.com/display?room=YOURROOM&timestamps=true&maxlines=50
```

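If you generate display URLs for several rooms, building the query string programmatically avoids escaping mistakes in room names. The following is a minimal sketch using only the Python standard library. The parameter names `room`, `timestamps`, and `maxlines` come from the example URL above; the helper name `build_display_url` is hypothetical, and that the server accepts `false` for `timestamps` is an assumption.

```python
from urllib.parse import urlencode

# Hosted display endpoint, taken from the example URL above.
DISPLAY_BASE = "https://caption.shadowdao.com/display"


def build_display_url(room: str, timestamps: bool = True, maxlines: int = 50,
                      base: str = DISPLAY_BASE) -> str:
    """Return a display URL for `room` with an encoded query string."""
    query = urlencode({
        "room": room,
        # Assumption: the server reads lowercase "true"/"false".
        "timestamps": "true" if timestamps else "false",
        "maxlines": maxlines,
    })
    return f"{base}?{query}"


if __name__ == "__main__":
    print(build_display_url("YOURROOM"))
```

For a self-hosted server, pass your own endpoint as `base`.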
#### Self-Hosting

You can also self-host the sync server. See [server/nodejs/README.md](server/nodejs/README.md) for setup instructions, then enter your own server URL in the Shared Captions settings.

## Configuration

Settings are stored at `~/.local-transcription/config.yaml` and can be modified through the GUI settings panel or the REST API.

### Key Settings

| Setting | Description | Default |
|---------|-------------|---------|
| `transcription.model` | Whisper model to use | `base.en` |
| `transcription.device` | Processing device (auto/cuda/cpu) | `auto` |
| `transcription.enable_realtime_transcription` | Show preview while speaking | `false` |
| `transcription.silero_sensitivity` | VAD sensitivity (0-1, lower = more sensitive) | `0.4` |
| `transcription.post_speech_silence_duration` | Silence before finalizing (seconds) | `0.3` |
| `transcription.continuous_mode` | Fast speaker mode for quick talkers | `false` |
| `remote.mode` | Transcription mode (local/managed/byok) | `byok` |
| `display.show_timestamps` | Show timestamps with transcriptions | `true` |
| `display.fade_after_seconds` | Fade out time (0 = never) | `10` |
| `display.font_source` | Font type (System Font/Web-Safe/Google Font/Custom File) | `System Font` |
| `web_server.port` | Local web server port | `8080` |

See [config/default_config.yaml](config/default_config.yaml) for all available options.

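The dotted setting names in the table suggest a nested mapping in `config.yaml`, for example a `transcription` section containing a `model` key. That layout is an assumption, not something confirmed by the app's own loader. The sketch below shows one way to overlay user values on defaults and read a dotted key; `DEFAULTS` holds only a few values copied from the table, and `merge` and `get_setting` are hypothetical helpers. Reading the actual file would also need a YAML parser such as PyYAML's `yaml.safe_load`, which is left out to keep the example self-contained.

```python
from typing import Any

# A few defaults copied from the Key Settings table (not the full config).
DEFAULTS = {
    "transcription": {"model": "base.en", "device": "auto"},
    "remote": {"mode": "byok"},
    "web_server": {"port": 8080},
}


def merge(defaults: dict, overrides: dict) -> dict:
    """Recursively overlay user overrides on defaults without mutating either."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def get_setting(config: dict, dotted_key: str, default: Any = None) -> Any:
    """Walk a nested dict using a dotted key such as 'web_server.port'."""
    node: Any = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node


if __name__ == "__main__":
    user_config = {"transcription": {"model": "small.en"}}
    config = merge(DEFAULTS, user_config)
    print(get_setting(config, "transcription.model"))
    print(get_setting(config, "web_server.port"))
```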
## Project Structure

```
local-transcription/
├── src/                                  # Svelte 5 frontend (Tauri UI)
│   ├── App.svelte                        # Main app shell
│   ├── lib/components/                   # UI components
│   │   ├── Header.svelte
│   │   ├── StatusBar.svelte
│   │   ├── Controls.svelte
│   │   ├── TranscriptionDisplay.svelte
│   │   └── Settings.svelte
│   └── lib/stores/                       # Reactive state management
│       ├── backend.ts                    # WebSocket + REST API client
│       ├── config.ts                     # App configuration
│       └── transcriptions.ts             # Transcription data
├── src-tauri/                            # Tauri v2 Rust shell
│   ├── src/main.rs
│   └── tauri.conf.json
├── backend/                              # Headless Python backend (sidecar)
│   ├── app_controller.py                 # Orchestration logic (engine, sync, config)
│   ├── api_server.py                     # FastAPI REST + WebSocket control API
│   └── main_headless.py                  # Headless entry point
├── client/                               # Core transcription modules
│   ├── audio_capture.py                  # Audio input handling
│   ├── transcription_engine_realtime.py  # RealtimeSTT / Whisper
│   ├── deepgram_transcription.py         # Deepgram cloud transcription
│   ├── noise_suppression.py              # VAD and noise reduction
│   ├── device_utils.py                   # CPU/GPU/MPS detection
│   ├── config.py                         # Configuration management
│   ├── server_sync.py                    # Multi-user server client
│   └── update_checker.py                 # Auto-update functionality
├── gui/                                  # Legacy PySide6/Qt GUI
│   ├── main_window_qt.py
│   ├── settings_dialog_qt.py
│   └── transcription_display_qt.py
├── server/                               # Web servers
│   ├── web_display.py                    # Local FastAPI server for OBS
│   └── nodejs/                           # Multi-user sync server
├── .gitea/workflows/                     # CI/CD
│   ├── release.yml                       # Tauri app builds (all platforms)
│   └── build-sidecar.yml                 # Python sidecar builds (CUDA + CPU)
├── config/
│   └── default_config.yaml               # Default settings template
├── main.py                               # Legacy GUI entry point
├── main_cli.py                           # CLI version (for testing)
├── local-transcription.spec              # PyInstaller config (legacy, with PySide6)
├── local-transcription-headless.spec     # PyInstaller config (headless sidecar)
├── pyproject.toml                        # Python dependencies
└── package.json                          # Node.js / Tauri dependencies
```

## Technology Stack

### Frontend (Tauri)
- **Tauri v2** — Native cross-platform shell (Rust)
- **Svelte 5** — Reactive UI framework (TypeScript)
- **Vite** — Frontend build tool

### Backend (Python Sidecar)
- **Python 3.9+**
- **FastAPI + Uvicorn** — REST API and WebSocket server
- **RealtimeSTT** — Real-time speech-to-text with advanced VAD
- **faster-whisper** — Optimized Whisper model inference (CTranslate2)
- **PyTorch** — ML framework (CUDA-enabled builds available)
- **sounddevice** — Cross-platform audio capture
- **webrtcvad + silero_vad** — Voice activity detection

### Multi-User Server (Optional)
- **Node.js + Express + WebSocket** — Real-time sync server

### Build & CI/CD
- **PyInstaller** — Python sidecar packaging
- **Tauri CLI** — App bundling (.msi, .dmg, .deb, .rpm, .AppImage)
- **Gitea Actions** — Automated cross-platform builds
- **uv** — Fast Python package manager

## CI/CD

Two Gitea Actions workflows in `.gitea/workflows/`:

| Workflow | Trigger | Produces |
|----------|---------|----------|
| `release.yml` | Push to `main` | Tauri app installers for all platforms |
| `build-sidecar.yml` | Changes to `client/`, `server/`, `backend/`, or `pyproject.toml` | Python sidecar zips (CUDA + CPU) |

Both workflows require a `BUILD_TOKEN` secret in the repo settings (Gitea API token with release write access).

### Release Artifacts

| Platform | App Installer | Sidecar (CUDA) | Sidecar (CPU) |
|----------|--------------|----------------|---------------|
| Linux x86_64 | `.deb`, `.rpm`, `.AppImage` | `sidecar-linux-x86_64-cuda.zip` | `sidecar-linux-x86_64-cpu.zip` |
| Windows x86_64 | `.msi`, `-setup.exe` | `sidecar-windows-x86_64-cuda.zip` | `sidecar-windows-x86_64-cpu.zip` |
| macOS ARM64 | `.dmg` | — | `sidecar-macos-aarch64-cpu.zip` |

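The sidecar names in the table follow a `sidecar-<os>-<arch>-<variant>.zip` pattern. As a rough sketch of how a download script might pick the right archive, the hypothetical helpers below map the current machine to one of the listed names. The alias tables cover only common values returned by `platform.system()` and `platform.machine()`, and the helper names are not part of the project.

```python
import platform
from typing import Optional

# Platform/variant combinations listed in the Release Artifacts table.
ARTIFACTS = {
    ("linux", "x86_64"): {"cuda", "cpu"},
    ("windows", "x86_64"): {"cuda", "cpu"},
    ("macos", "aarch64"): {"cpu"},
}

OS_ALIASES = {"Linux": "linux", "Windows": "windows", "Darwin": "macos"}
ARCH_ALIASES = {"x86_64": "x86_64", "AMD64": "x86_64",
                "arm64": "aarch64", "aarch64": "aarch64"}


def sidecar_zip(os_name: str, arch: str, want_cuda: bool) -> Optional[str]:
    """Return the sidecar archive name, or None for an unlisted platform."""
    variants = ARTIFACTS.get((os_name, arch))
    if variants is None:
        return None
    # Use the CPU build when no CUDA build is published for this platform.
    variant = "cuda" if want_cuda and "cuda" in variants else "cpu"
    return f"sidecar-{os_name}-{arch}-{variant}.zip"


def sidecar_zip_for_this_machine(want_cuda: bool = False) -> Optional[str]:
    """Detect the current OS and architecture and pick a matching archive."""
    os_name = OS_ALIASES.get(platform.system())
    arch = ARCH_ALIASES.get(platform.machine())
    if os_name is None or arch is None:
        return None
    return sidecar_zip(os_name, arch, want_cuda)
```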
## System Requirements

### Minimum
- 4GB RAM
- Any modern CPU

### Recommended (for local real-time transcription)
- 8GB+ RAM
- NVIDIA GPU with CUDA support (for GPU acceleration)

### For Building
- **Tauri app**: Node.js 20+, Rust stable, platform SDK (see [Tauri prerequisites](https://tauri.app/start/prerequisites/))
- **Python sidecar**: Python 3.9+, uv, PyInstaller
- **Linux**: `libgtk-3-dev`, `libwebkit2gtk-4.1-dev`, `libappindicator3-dev`, `librsvg2-dev`, `patchelf`
- **Windows**: Visual Studio Build Tools, WebView2
- **macOS**: Xcode Command Line Tools

## Troubleshooting

### macOS: "App is damaged and can't be opened"
macOS Gatekeeper blocks unsigned applications. Since the app is not yet signed with an Apple Developer certificate, you need to remove the quarantine flag before opening:

```bash
xattr -cr "/Applications/Local Transcription.app"
```

Then open the app normally. You only need to do this once after downloading.

### Model Loading Issues
- Models download automatically on first use to `~/.cache/huggingface/`
- First run requires internet connection
- Check disk space (models range from 75MB to 3GB)

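To check free space before the first model download, a short standard-library script is enough. This is a sketch only: the 3 GB threshold is the upper end of the size range quoted above, and `nearest_existing` and `has_free_space` are hypothetical helper names. The cache directory may not exist before the first run, so the check falls back to the closest existing parent directory.

```python
import shutil
from pathlib import Path

CACHE_DIR = Path.home() / ".cache" / "huggingface"
REQUIRED_BYTES = 3 * 1024**3  # roughly the largest model size quoted above


def nearest_existing(path: Path) -> Path:
    """Return `path`, or its closest ancestor that exists on disk."""
    while not path.exists():
        if path.parent == path:  # reached the filesystem root
            break
        path = path.parent
    return path


def has_free_space(path: Path, required_bytes: int) -> bool:
    """True if the filesystem holding `path` has at least `required_bytes` free."""
    usage = shutil.disk_usage(nearest_existing(path))
    return usage.free >= required_bytes


if __name__ == "__main__":
    if has_free_space(CACHE_DIR, REQUIRED_BYTES):
        print("Enough free space for the largest model")
    else:
        print("Low disk space: free some room or pick a smaller model")
```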
### Audio Device Issues
```bash
# List available audio devices
uv run python main_cli.py --list-devices
```
- Ensure microphone permissions are granted (especially on macOS)
- Try different device indices in settings

### GPU Not Detected
```bash
# Check CUDA availability
uv run python -c "import torch; print(torch.cuda.is_available())"
```
- Install NVIDIA drivers (CUDA toolkit is bundled in CUDA sidecar builds)
- The app automatically falls back to CPU if no GPU is available

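The CPU fallback can be pictured as a small decision around the same `torch.cuda.is_available()` check. The sketch below is not the app's actual detection code (that lives in `client/device_utils.py` and may differ); `resolve_device` is a hypothetical helper that maps the `auto`/`cuda`/`cpu` setting to a usable device and treats a missing PyTorch install as "no GPU".

```python
from typing import Optional

VALID_SETTINGS = {"auto", "cuda", "cpu"}


def cuda_available() -> bool:
    """Report whether PyTorch can see a CUDA device; False if torch is absent."""
    try:
        import torch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


def resolve_device(setting: str = "auto", has_cuda: Optional[bool] = None) -> str:
    """Map a device setting to 'cuda' or 'cpu', falling back to CPU without a GPU.

    `has_cuda` lets callers inject the detection result, which keeps the
    function easy to exercise on machines without PyTorch.
    """
    if setting not in VALID_SETTINGS:
        raise ValueError(f"unknown device setting: {setting!r}")
    if setting == "cpu":
        return "cpu"
    if has_cuda is None:
        has_cuda = cuda_available()
    return "cuda" if has_cuda else "cpu"


if __name__ == "__main__":
    print(resolve_device("auto"))
```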
### Web Server Port Conflicts
- Default port is 8080; the app tries ports 8080-8084 automatically
- Change in settings or edit config file
- Check for conflicts: `lsof -i :8080` (Linux/macOS) or `netstat -ano | findstr :8080` (Windows)

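A port scan like the 8080-8084 fallback can be reproduced with the standard `socket` module, which is handy for checking which port the display is likely to end up on. This is a sketch of the general technique, not the app's own implementation; `first_free_port` is a hypothetical helper and it probes the loopback address only.

```python
import socket
from typing import Optional


def first_free_port(start: int = 8080, end: int = 8084,
                    host: str = "127.0.0.1") -> Optional[int]:
    """Return the first port in [start, end] that can be bound, or None."""
    for port in range(start, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            try:
                probe.bind((host, port))
            except OSError:
                continue  # port is in use or not permitted; try the next one
            return port
    return None


if __name__ == "__main__":
    port = first_free_port()
    if port is None:
        print("Ports 8080-8084 are all in use")
    else:
        print(f"First free port: {port}")
```

The probe socket is closed before the function returns, so another process could still claim the port in the meantime. Treat the result as a hint rather than a reservation.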
## Use Cases

- **Live Streaming Captions**: Add real-time captions to your Twitch/YouTube streams
- **Multi-Language Translation**: Multiple translators transcribing in different languages
- **Accessibility**: Provide captions for hearing-impaired viewers
- **Podcast Recording**: Real-time transcription for multi-host shows
- **Gaming Commentary**: Track who said what in multiplayer sessions

## Contributing

Contributions are welcome! Please feel free to submit issues or pull requests at the [repository](https://repo.anhonesthost.net/streamer-tools/local-transcription).

## License

MIT License

## Acknowledgments

- [OpenAI Whisper](https://github.com/openai/whisper) for the speech recognition model
- [RealtimeSTT](https://github.com/KoljaB/RealtimeSTT) for real-time transcription capabilities
- [faster-whisper](https://github.com/guillaumekln/faster-whisper) for optimized inference
- [Tauri](https://tauri.app/) for the cross-platform desktop framework
- [Deepgram](https://deepgram.com/) for cloud transcription API