local-transcription/README.md

# Local Transcription

A real-time speech-to-text desktop application for streamers. Runs locally on your machine with GPU or CPU, displays transcriptions via OBS browser source, and optionally syncs with other users through a multi-user server.

**Version 1.4.0**

## Features

- **Real-Time Transcription**: Live speech-to-text using Whisper models with minimal latency
- **Cross-Platform**: Native desktop app for Windows, macOS, and Linux via [Tauri](https://tauri.app/)
- **Dual Transcription Modes**: Local (Whisper) or cloud (Deepgram) with managed billing or BYOK
- **CPU & GPU Support**: Automatic detection of CUDA (NVIDIA), MPS (Apple Silicon), or CPU fallback
- **Advanced Voice Detection**: Dual-layer VAD (WebRTC + Silero) for accurate speech detection
- **OBS Integration**: Built-in web server for browser source capture at `http://localhost:8080`
- **Multi-User Sync**: Optional Node.js server to sync transcriptions across multiple users
- **Custom Fonts**: Support for system fonts, web-safe fonts, Google Fonts, and custom font files
- **Customizable Colors**: User-configurable colors for name, text, and background
- **Noise Suppression**: Built-in audio preprocessing to reduce background noise
- **Auto-Updates**: Automatic update checking with release notes display

## Architecture

The application uses a two-process architecture:

1. **Tauri Shell** (Svelte 5 frontend) — lightweight native window (~50MB) rendering the UI
2. **Python Backend** (sidecar) — headless process running transcription, audio capture, and the OBS web server

The Tauri frontend communicates with the Python backend via REST API and WebSocket, following the same pattern as [voice-to-notes](https://repo.anhonesthost.net/MacroPad/voice-to-notes).

```
Tauri App (user launches this)
  └─ Spawns Python backend as sidecar
       ├─ FastAPI REST API (control endpoints)
       ├─ WebSocket /ws/control (real-time state + transcriptions)
       ├─ OBS web display at http://localhost:8080
       └─ Transcription engine (Whisper or Deepgram)
```

> **Legacy GUI**: The original PySide6/Qt desktop GUI (`main.py`) still works alongside the new Tauri frontend during the transition period.

## Quick Start

### Running from Source

```bash
# Install Python dependencies
uv sync

# Run the Tauri app (frontend + backend)
npm install
npm run tauri dev

# Or run just the headless backend (for development)
uv run python -m backend.main_headless

# Or run the legacy PySide6 GUI
uv run python main.py
```

### Using Pre-Built Executables

Download the latest release from the [releases page](https://repo.anhonesthost.net/streamer-tools/local-transcription/releases):

- **App installer** (Tauri shell): `.msi` (Windows), `.dmg` (macOS), `.deb`/`.rpm`/`.AppImage` (Linux)
- **Sidecar** (Python backend): Download the matching `sidecar-*` zip for your platform (CUDA or CPU)

### Building from Source

```bash
# Build the Tauri app
npm install
npm run tauri build
# Output: src-tauri/target/release/bundle/

# Build the Python sidecar (headless, no Qt)
uv sync
uv run pyinstaller local-transcription-headless.spec
# Output: dist/local-transcription-backend/

# Build the legacy PySide6 app (Linux)
./build.sh
# Build the legacy PySide6 app (Windows)
build.bat
```

For detailed build instructions, see [BUILD.md](BUILD.md).

## Usage

### Standalone Mode

1. Launch the application
2. Select your microphone from the audio device dropdown
3. Choose a Whisper model (smaller = faster, larger = more accurate):
   - `tiny.en` / `tiny` — Fastest, good for quick captions
   - `base.en` / `base` — Balanced speed and accuracy
   - `small.en` / `small` — Better accuracy
   - `medium.en` / `medium` — High accuracy
   - `large-v3` — Best accuracy (requires more resources)
4. Click **Start** to begin transcription
5. Transcriptions appear in the main window and at `http://localhost:8080`

### Remote Transcription (Deepgram)

Instead of local Whisper models, you can use cloud-based transcription:

- **Managed mode**: Sign up via the transcription proxy for metered billing
- **BYOK mode**: Bring your own Deepgram API key for direct access

Configure in Settings > Remote Transcription.

### OBS Browser Source Setup

1. Start the Local Transcription app
2. In OBS, add a **Browser** source
3. Set URL to `http://localhost:8080`
4. Set dimensions (e.g., 1920x300)
5. Check "Shutdown source when not visible" for performance

### Multi-User Mode (Optional)

For syncing transcriptions across multiple users (e.g., multi-host streams or translation teams):

1. Deploy the Node.js server (see [server/nodejs/README.md](server/nodejs/README.md))
2. In the app settings, enable **Server Sync**
3. Enter the server URL (e.g., `http://your-server:3000/api/send`)
4. Set a room name and passphrase (shared with other users)
5. In OBS, use the server's display URL with your room name:
   ```
   http://your-server:3000/display?room=YOURROOM&timestamps=true&maxlines=50
   ```

## Configuration

Settings are stored at `~/.local-transcription/config.yaml` and can be modified through the GUI settings panel or the REST API.

### Key Settings

| Setting | Description | Default |
|---------|-------------|---------|
| `transcription.model` | Whisper model to use | `base.en` |
| `transcription.device` | Processing device (auto/cuda/cpu) | `auto` |
| `transcription.enable_realtime_transcription` | Show preview while speaking | `false` |
| `transcription.silero_sensitivity` | VAD sensitivity (0-1, lower = more sensitive) | `0.4` |
| `transcription.post_speech_silence_duration` | Silence before finalizing (seconds) | `0.3` |
| `transcription.continuous_mode` | Fast speaker mode for quick talkers | `false` |
| `remote.mode` | Transcription mode (local/managed/byok) | `local` |
| `display.show_timestamps` | Show timestamps with transcriptions | `true` |
| `display.fade_after_seconds` | Fade out time (0 = never) | `10` |
| `display.font_source` | Font type (System Font/Web-Safe/Google Font/Custom File) | `System Font` |
| `web_server.port` | Local web server port | `8080` |

See [config/default_config.yaml](config/default_config.yaml) for all available options.

## Project Structure

```
local-transcription/
├── src/                             # Svelte 5 frontend (Tauri UI)
│   ├── App.svelte                   # Main app shell
│   ├── lib/components/              # UI components
│   │   ├── Header.svelte
│   │   ├── StatusBar.svelte
│   │   ├── Controls.svelte
│   │   ├── TranscriptionDisplay.svelte
│   │   └── Settings.svelte
│   └── lib/stores/                  # Reactive state management
│       ├── backend.ts               # WebSocket + REST API client
│       ├── config.ts                # App configuration
│       └── transcriptions.ts        # Transcription data
├── src-tauri/                       # Tauri v2 Rust shell
│   ├── src/main.rs
│   └── tauri.conf.json
├── backend/                         # Headless Python backend (sidecar)
│   ├── app_controller.py            # Orchestration logic (engine, sync, config)
│   ├── api_server.py                # FastAPI REST + WebSocket control API
│   └── main_headless.py             # Headless entry point
├── client/                          # Core transcription modules
│   ├── audio_capture.py             # Audio input handling
│   ├── transcription_engine_realtime.py  # RealtimeSTT / Whisper
│   ├── deepgram_transcription.py    # Deepgram cloud transcription
│   ├── noise_suppression.py         # VAD and noise reduction
│   ├── device_utils.py              # CPU/GPU/MPS detection
│   ├── config.py                    # Configuration management
│   ├── server_sync.py               # Multi-user server client
│   └── update_checker.py            # Auto-update functionality
├── gui/                             # Legacy PySide6/Qt GUI
│   ├── main_window_qt.py
│   ├── settings_dialog_qt.py
│   └── transcription_display_qt.py
├── server/                          # Web servers
│   ├── web_display.py               # Local FastAPI server for OBS
│   └── nodejs/                      # Multi-user sync server
├── .gitea/workflows/                # CI/CD
│   ├── release.yml                  # Tauri app builds (all platforms)
│   └── build-sidecar.yml            # Python sidecar builds (CUDA + CPU)
├── config/
│   └── default_config.yaml          # Default settings template
├── main.py                          # Legacy GUI entry point
├── main_cli.py                      # CLI version (for testing)
├── local-transcription.spec         # PyInstaller config (legacy, with PySide6)
├── local-transcription-headless.spec # PyInstaller config (headless sidecar)
├── pyproject.toml                   # Python dependencies
└── package.json                     # Node.js / Tauri dependencies
```

## Technology Stack

### Frontend (Tauri)
- **Tauri v2** — Native cross-platform shell (Rust)
- **Svelte 5** — Reactive UI framework (TypeScript)
- **Vite** — Frontend build tool

### Backend (Python Sidecar)
- **Python 3.9+**
- **FastAPI + Uvicorn** — REST API and WebSocket server
- **RealtimeSTT** — Real-time speech-to-text with advanced VAD
- **faster-whisper** — Optimized Whisper model inference (CTranslate2)
- **PyTorch** — ML framework (CUDA-enabled builds available)
- **sounddevice** — Cross-platform audio capture
- **webrtcvad + silero_vad** — Voice activity detection

### Multi-User Server (Optional)
- **Node.js + Express + WebSocket** — Real-time sync server

### Build & CI/CD
- **PyInstaller** — Python sidecar packaging
- **Tauri CLI** — App bundling (.msi, .dmg, .deb, .rpm, .AppImage)
- **Gitea Actions** — Automated cross-platform builds
- **uv** — Fast Python package manager

## CI/CD

Two Gitea Actions workflows in `.gitea/workflows/`:

| Workflow | Trigger | Produces |
|----------|---------|----------|
| `release.yml` | Push to `main` | Tauri app installers for all platforms |
| `build-sidecar.yml` | Changes to `client/`, `server/`, `backend/`, or `pyproject.toml` | Python sidecar zips (CUDA + CPU) |

Both workflows require a `BUILD_TOKEN` secret in the repo settings (Gitea API token with release write access).

### Release Artifacts

| Platform | App Installer | Sidecar (CUDA) | Sidecar (CPU) |
|----------|--------------|----------------|---------------|
| Linux x86_64 | `.deb`, `.rpm`, `.AppImage` | `sidecar-linux-x86_64-cuda.zip` | `sidecar-linux-x86_64-cpu.zip` |
| Windows x86_64 | `.msi`, `-setup.exe` | `sidecar-windows-x86_64-cuda.zip` | `sidecar-windows-x86_64-cpu.zip` |
| macOS ARM64 | `.dmg` | — | `sidecar-macos-aarch64-cpu.zip` |

## System Requirements

### Minimum
- 4GB RAM
- Any modern CPU

### Recommended (for local real-time transcription)
- 8GB+ RAM
- NVIDIA GPU with CUDA support (for GPU acceleration)

### For Building
- **Tauri app**: Node.js 20+, Rust stable, platform SDK (see [Tauri prerequisites](https://tauri.app/start/prerequisites/))
- **Python sidecar**: Python 3.9+, uv, PyInstaller
- **Linux**: `libgtk-3-dev`, `libwebkit2gtk-4.1-dev`, `libappindicator3-dev`, `librsvg2-dev`, `patchelf`
- **Windows**: Visual Studio Build Tools, WebView2
- **macOS**: Xcode Command Line Tools

## Troubleshooting

### Model Loading Issues
- Models download automatically on first use to `~/.cache/huggingface/`
- First run requires internet connection
- Check disk space (models range from 75MB to 3GB)

### Audio Device Issues
```bash
# List available audio devices
uv run python main_cli.py --list-devices
```
- Ensure microphone permissions are granted (especially on macOS)
- Try different device indices in settings

### GPU Not Detected
```bash
# Check CUDA availability
uv run python -c "import torch; print(torch.cuda.is_available())"
```
- Install NVIDIA drivers (CUDA toolkit is bundled in CUDA sidecar builds)
- The app automatically falls back to CPU if no GPU is available

### Web Server Port Conflicts
- Default port is 8080; the app tries ports 8080-8084 automatically
- Change in settings or edit config file
- Check for conflicts: `lsof -i :8080` (Linux/macOS) or `netstat -ano | findstr :8080` (Windows)

## Use Cases

- **Live Streaming Captions**: Add real-time captions to your Twitch/YouTube streams
- **Multi-Language Translation**: Multiple translators transcribing in different languages
- **Accessibility**: Provide captions for hearing-impaired viewers
- **Podcast Recording**: Real-time transcription for multi-host shows
- **Gaming Commentary**: Track who said what in multiplayer sessions

## Contributing

Contributions are welcome! Please feel free to submit issues or pull requests at the [repository](https://repo.anhonesthost.net/streamer-tools/local-transcription).

## License

MIT License

## Acknowledgments

- [OpenAI Whisper](https://github.com/openai/whisper) for the speech recognition model
- [RealtimeSTT](https://github.com/KoljaB/RealtimeSTT) for real-time transcription capabilities
- [faster-whisper](https://github.com/guillaumekln/faster-whisper) for optimized inference
- [Tauri](https://tauri.app/) for the cross-platform desktop framework
- [Deepgram](https://deepgram.com/) for cloud transcription API