From 47ca74e75dc22386faa3ea2566664f67b9342e94 Mon Sep 17 00:00:00 2001 From: Developer Date: Mon, 6 Apr 2026 13:34:10 -0700 Subject: [PATCH] Update README and CLAUDE.md for Tauri rewrite Update both docs to reflect the new architecture: - Tauri v2 + Svelte 5 frontend replacing PySide6/Qt - Headless Python backend with FastAPI control API - Cross-platform support (Windows, macOS, Linux) - Deepgram remote transcription (managed/BYOK) - Gitea CI/CD workflows for automated builds - New project structure with backend/, src/, src-tauri/ - Updated development commands and build instructions Co-Authored-By: Claude Opus 4.6 (1M context) --- CLAUDE.md | 413 ++++++++++++++++++++++++------------------------------ README.md | 224 ++++++++++++++++++++--------- 2 files changed, 342 insertions(+), 295 deletions(-) diff --git a/CLAUDE.md b/CLAUDE.md index 3f3997f..f7b0683 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -4,52 +4,108 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co ## Project Overview -Local Transcription is a desktop application for real-time speech-to-text transcription designed for streamers. It uses Whisper models (via faster-whisper) to transcribe audio locally with optional multi-user server synchronization. +Local Transcription is a cross-platform desktop application for real-time speech-to-text transcription designed for streamers. It supports local Whisper models and cloud-based Deepgram transcription, with OBS browser source integration and optional multi-user sync. + +**Architecture:** Two-process model — a Tauri v2 shell (Svelte 5 frontend) communicates with a headless Python backend (sidecar) via REST API and WebSocket. 
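The backend announces readiness by printing a JSON event to stdout (see Backend Process Lifecycle in CLAUDE.md). A minimal sketch of how a launcher could detect that event — the function name and the tolerance for interleaved non-JSON log lines are illustrative, not part of the actual Tauri shell:

```python
import json
from typing import Optional

def parse_ready_event(line: str) -> Optional[int]:
    """Return the announced port if `line` is the backend's ready event,
    i.e. {"event": "ready", "port": 8080}; otherwise None."""
    try:
        msg = json.loads(line)
    except json.JSONDecodeError:
        # Ordinary log output, not a JSON event
        return None
    if isinstance(msg, dict) and msg.get("event") == "ready":
        return msg.get("port")
    return None
```

A shell process can read the sidecar's stdout line by line and treat the first non-`None` result as the readiness signal.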
**Key Features:** -- Standalone desktop GUI (PySide6/Qt) -- Local transcription with CPU/GPU support -- Built-in web server for OBS browser source integration -- Optional Node.js-based multi-user server for syncing transcriptions across users -- Noise suppression and Voice Activity Detection (VAD) -- Cross-platform builds (Linux/Windows) with PyInstaller +- Cross-platform desktop app (Windows, macOS, Linux) via Tauri v2 + Svelte 5 +- Headless Python backend with FastAPI control API +- Dual transcription modes: local Whisper or cloud Deepgram (managed/BYOK) +- Built-in web server for OBS browser source at `http://localhost:8080` +- Optional multi-user sync via Node.js server +- CUDA, MPS (Apple Silicon), and CPU support +- Auto-updates, custom fonts, configurable colors + +> **Legacy GUI:** The original PySide6/Qt GUI (`main.py`, `gui/`) still works during the transition. New features should target the Tauri frontend and headless backend. ## Project Structure ``` local-transcription/ -├── client/ # Core transcription logic -│ ├── audio_capture.py # Audio input and buffering -│ ├── transcription_engine.py # Whisper model integration -│ ├── noise_suppression.py # VAD and noise reduction -│ ├── device_utils.py # CPU/GPU device management -│ ├── config.py # Configuration management -│ └── server_sync.py # Multi-user server sync client -├── gui/ # Desktop application UI -│ ├── main_window_qt.py # Main application window (PySide6) -│ ├── settings_dialog_qt.py # Settings dialog (PySide6) -│ └── transcription_display_qt.py # Display widget -├── server/ # Web display servers -│ ├── web_display.py # FastAPI server for OBS browser source (local) -│ └── nodejs/ # Optional multi-user Node.js server -│ ├── server.js # Multi-user sync server with WebSocket -│ ├── package.json # Node.js dependencies -│ └── README.md # Server deployment documentation -├── config/ # Example configuration files -│ └── default_config.yaml # Default settings template -├── main.py # GUI application entry 
point -├── main_cli.py # CLI version for testing -└── pyproject.toml # Dependencies and build config +├── src/ # Svelte 5 frontend (Tauri UI) +│ ├── App.svelte # Main app shell +│ ├── app.css # Global dark theme styles +│ ├── main.ts # Svelte mount point +│ ├── lib/components/ # UI components +│ │ ├── Header.svelte # Title bar + settings button +│ │ ├── StatusBar.svelte # State indicator, device, user info +│ │ ├── Controls.svelte # Start/Stop, Clear, Save buttons +│ │ ├── TranscriptionDisplay.svelte # Scrolling transcript view +│ │ └── Settings.svelte # Full settings modal (all sections) +│ └── lib/stores/ # Svelte 5 reactive stores ($state/$derived) +│ ├── backend.ts # WebSocket + REST API client +│ ├── config.ts # App configuration fetch/update +│ └── transcriptions.ts # Transcript data management +├── src-tauri/ # Tauri v2 Rust shell +│ ├── src/lib.rs # Plugin registration (shell, dialog, process) +│ ├── src/main.rs # Entry point +│ ├── tauri.conf.json # Window, bundle, plugin config +│ └── Cargo.toml # Rust dependencies +├── backend/ # Headless Python backend (the sidecar) +│ ├── app_controller.py # Core orchestration (engine, sync, config) +│ ├── api_server.py # FastAPI REST endpoints + /ws/control +│ └── main_headless.py # Headless entry point (prints JSON to stdout) +├── client/ # Core transcription modules (used by backend) +│ ├── audio_capture.py # Audio input handling +│ ├── transcription_engine_realtime.py # RealtimeSTT / Whisper engine +│ ├── deepgram_transcription.py # Deepgram WebSocket cloud transcription +│ ├── noise_suppression.py # VAD and noise reduction +│ ├── device_utils.py # CPU/GPU/MPS detection +│ ├── config.py # YAML config management (~/.local-transcription/) +│ ├── server_sync.py # Multi-user server sync client +│ ├── instance_lock.py # Single-instance PID lock +│ └── update_checker.py # Gitea release update checker +├── gui/ # Legacy PySide6/Qt GUI (still functional) +│ ├── main_window_qt.py # Main window (orchestration lives here in 
legacy) +│ ├── settings_dialog_qt.py # Settings dialog +│ └── transcription_display_qt.py # Display widget +├── server/ +│ ├── web_display.py # FastAPI OBS display server (WebSocket + HTML) +│ └── nodejs/ # Optional multi-user sync server +├── .gitea/workflows/ # CI/CD +│ ├── release.yml # Tauri app builds (Linux/Windows/macOS) +│ └── build-sidecar.yml # Python sidecar builds (CUDA + CPU) +├── config/default_config.yaml # Default settings template +├── main.py # Legacy PySide6 GUI entry point +├── main_cli.py # CLI version for testing +├── version.py # Version string (__version__) +├── local-transcription.spec # PyInstaller config (legacy, includes PySide6) +├── local-transcription-headless.spec # PyInstaller config (headless sidecar, no Qt) +├── pyproject.toml # Python deps (uv, CUDA PyTorch index) +├── package.json # Node/Tauri deps +└── vite.config.ts # Vite build config ($lib alias) ``` ## Development Commands -### Installation and Setup +### Frontend (Tauri + Svelte) ```bash -# Install dependencies (creates .venv automatically) +# Install npm dependencies +npm install + +# Run Tauri in development mode (hot-reload) +npm run tauri dev + +# Build frontend only (for testing) +npx vite build + +# Type-check Svelte +npx svelte-check + +# Check Rust compiles +cd src-tauri && cargo check +``` + +### Backend (Python) +```bash +# Install Python dependencies uv sync -# Run the GUI application +# Run the headless backend standalone (for development) +uv run python -m backend.main_headless --port 8080 + +# Run the legacy PySide6 GUI uv run python main.py # Run CLI version (headless, for testing) @@ -57,257 +113,154 @@ uv run python main_cli.py # List available audio devices uv run python main_cli.py --list-devices - -# Install with CUDA support (if needed) -uv pip install torch --index-url https://download.pytorch.org/whl/cu121 ``` -### Building Executables +### Building ```bash -# Linux (includes CUDA support - works on both GPU and CPU systems) -./build.sh +# Build 
Tauri app (produces platform installer) +npm run tauri build -# Windows (includes CUDA support - works on both GPU and CPU systems) -build.bat +# Build headless Python sidecar (no PySide6) +uv run pyinstaller local-transcription-headless.spec +# Output: dist/local-transcription-backend/ -# Manual build with PyInstaller -uv sync # Install dependencies (includes CUDA PyTorch) -uv pip uninstall -q enum34 # Remove incompatible enum34 package +# Build legacy PySide6 app uv run pyinstaller local-transcription.spec +# Or use: ./build.sh (Linux) / build.bat (Windows) ``` -**Important:** All builds include CUDA support via `pyproject.toml` configuration. CUDA builds can be created on systems without NVIDIA GPUs. The PyTorch CUDA runtime is bundled, and the app automatically falls back to CPU if no GPU is available. - ### Testing ```bash -# Run component tests uv run python test_components.py - -# Check CUDA availability uv run python check_cuda.py - -# Test web server manually -uv run python -m uvicorn server.web_display:app --reload ``` -## Architecture +## Architecture Details -### Audio Processing Pipeline +### Communication: Tauri <-> Python Backend -1. **Audio Capture** ([client/audio_capture.py](client/audio_capture.py)) - - Captures audio from microphone/system using sounddevice - - Handles automatic sample rate detection and resampling - - Uses chunking with overlap for better transcription quality - - Default: 3-second chunks with 0.5s overlap +The Svelte frontend connects to the Python backend via two channels: -2. 
**Noise Suppression** ([client/noise_suppression.py](client/noise_suppression.py)) - - Applies noisereduce for background noise reduction - - Voice Activity Detection (VAD) using webrtcvad - - Skips silent segments to improve performance +**REST API** (on port 8081 by default): +- `GET /api/status` — app state, device info, version +- `POST /api/start` / `POST /api/stop` — transcription control +- `GET /api/config` / `PUT /api/config` — read/write settings (dot-notation keys) +- `GET /api/audio-devices` / `GET /api/compute-devices` — device enumeration +- `POST /api/reload-engine` — reload with new model/device +- `GET /api/transcriptions` / `POST /api/clear` — transcript management +- `POST /api/save-file` — write text to a file path +- `GET /api/check-update` / `POST /api/skip-version` — update management +- `POST /api/login` / `POST /api/register` / `GET /api/balance` — managed mode proxy -3. **Transcription** ([client/transcription_engine.py](client/transcription_engine.py)) - - Uses faster-whisper for efficient inference - - Supports CPU, CUDA, and Apple MPS (Mac) - - Models: tiny, base, small, medium, large - - Thread-safe model loading with locks +**WebSocket** `/ws/control`: +- Pushes real-time events: `state_changed`, `transcription`, `preview`, `error`, `credits_low` +- Client sends keepalive pings -4. **Display** ([gui/main_window_qt.py](gui/main_window_qt.py)) - - PySide6/Qt-based desktop GUI - - Real-time transcription display with scrolling - - Settings panel with live updates (no restart needed) +The OBS display server runs separately on port 8080 (`GET /` for HTML, `WebSocket /ws` for transcriptions). -### Web Server Architecture +### Backend Process Lifecycle -**Local Web Server** ([server/web_display.py](server/web_display.py)) -- Always runs when GUI starts (port 8080 by default) -- FastAPI with WebSocket for real-time updates -- Used for OBS browser source integration -- Single-user (displays only local transcriptions) +1. 
`main_headless.py` starts, acquires instance lock, creates `AppController` +2. `AppController.initialize()` starts the OBS web server (port 8080) and engine init thread +3. `APIServer` wraps the controller with FastAPI routes, runs on port 8081 +4. Backend prints `{"event": "ready", "port": 8080}` to stdout for Tauri to discover +5. On shutdown: engine stopped, web server stopped, lock released -**Multi-User Server** (Optional - for syncing across multiple users) +### Headless Backend vs Legacy GUI -**Node.js WebSocket Server** ([server/nodejs/](server/nodejs/)) - **RECOMMENDED** -- Real-time WebSocket support (< 100ms latency) -- Handles 100+ concurrent users -- Easy deployment to VPS/cloud hosting (Railway, Heroku, DigitalOcean, or any VPS) -- Configurable display options via URL parameters: - - `timestamps=true/false` - Show/hide timestamps - - `maxlines=50` - Maximum visible lines (prevents scroll bars in OBS) - - `fontsize=16` - Font size in pixels - - `fontfamily=Arial` - Font family - - `fade=10` - Seconds before text fades (0 = never) +The `AppController` class (`backend/app_controller.py`) extracts all orchestration logic from `gui/main_window_qt.py` into a Qt-free class. The mapping: -See [server/nodejs/README.md](server/nodejs/README.md) for deployment instructions +| Legacy (MainWindow) | Headless (AppController) | +|---------------------|--------------------------| +| `_initialize_components()` | `_initialize_engine()` | +| `_start_transcription()` | `start_transcription()` | +| `_stop_transcription()` | `stop_transcription()` | +| `_on_settings_saved()` | `apply_settings()` | +| `_reload_engine()` | `reload_engine()` | +| `_start_web_server_if_enabled()` | `_start_web_server()` | +| `_start_server_sync()` | `_start_server_sync()` | +| Qt signals | Callbacks (`on_state_changed`, `on_transcription`, etc.) 
| -### Configuration System +### Threading Model (Headless) -- Config stored at `~/.local-transcription/config.yaml` -- Managed by [client/config.py](client/config.py) -- Settings apply immediately without restart (except model changes) -- YAML format with nested keys (e.g., `transcription.model`) +- Main thread: Uvicorn (FastAPI) event loop +- Engine init thread: Downloads models, initializes VAD +- Web server thread: Separate asyncio loop for OBS display +- Audio capture: Runs in engine callback threads +- All results flow through `AppController` callbacks -> `APIServer` WebSocket broadcast -### Device Management +### Svelte Frontend -- [client/device_utils.py](client/device_utils.py) handles CPU/GPU detection -- Auto-detects CUDA, MPS (Mac), or falls back to CPU -- Compute types: float32 (best quality), float16 (GPU), int8 (fastest) -- Thread-safe device selection +Uses Svelte 5 runes throughout (`$state`, `$derived`, `$effect`, `$props`). No Svelte 4 patterns. -## Key Implementation Details +**Stores** (`src/lib/stores/`): +- `backend.ts` — WebSocket connection + REST helpers (`apiGet`, `apiPost`, `apiPut`), auto-reconnect +- `config.ts` — fetches/updates config from backend API +- `transcriptions.ts` — manages transcript list, listens for `CustomEvent`s from backend store -### PyInstaller Build Configuration +**Key patterns:** +- Backend store dispatches `CustomEvent`s on `window` for cross-store communication +- Settings component collects all changed values into a `Record` with dot-notation keys, sends via `PUT /api/config` +- Controls use Tauri dialog plugin for native file save, falls back to blob download -- [local-transcription.spec](local-transcription.spec) controls build -- UPX compression enabled for smaller executables -- Hidden imports required for PySide6, faster-whisper, torch -- Console mode enabled by default (set `console=False` to hide) +## CI/CD -### Threading Model +Two Gitea Actions workflows in `.gitea/workflows/`: -- Main thread: Qt GUI 
event loop -- Audio thread: Captures and processes audio chunks -- Web server thread: Runs FastAPI server -- Transcription: Runs in callback thread from audio capture -- All transcription results communicated via Qt signals +- **`release.yml`**: Triggers on push to `main`. Auto-bumps version, builds Tauri app on Linux/Windows/macOS, uploads `.deb`, `.rpm`, `.msi`, `.dmg` to Gitea release. +- **`build-sidecar.yml`**: Triggers on changes to `client/`, `server/`, `backend/`, `pyproject.toml`. Builds headless Python sidecar via PyInstaller. CUDA + CPU for Linux/Windows, CPU-only for macOS. -### Server Sync (Optional Multi-User Feature) - -- [client/server_sync.py](client/server_sync.py) handles server communication -- Toggle in Settings: "Enable Server Sync" -- Sends transcriptions to Node.js server via HTTP POST -- Real-time updates via WebSocket to display page -- Per-speaker font support (Web-Safe, Google Fonts, Custom uploads) -- Falls back gracefully if server unavailable +Both require a `BUILD_TOKEN` secret (Gitea API token with release write access). ## Common Patterns ### Adding a New Setting -1. Add to [config/default_config.yaml](config/default_config.yaml) -2. Update [client/config.py](client/config.py) if validation needed -3. Add UI control in [gui/settings_dialog_qt.py](gui/settings_dialog_qt.py) -4. Apply setting in relevant component (no restart if possible) -5. Emit signal to update display if needed +1. Add default to [config/default_config.yaml](config/default_config.yaml) +2. Add UI control in [src/lib/components/Settings.svelte](src/lib/components/Settings.svelte) +3. Ensure the setting is included in the save handler's config update +4. Apply in `AppController.apply_settings()` or the relevant component +5. For legacy GUI: also update [gui/settings_dialog_qt.py](gui/settings_dialog_qt.py) + +### Adding a New API Endpoint + +1. Add route in [backend/api_server.py](backend/api_server.py) `_setup_routes()` +2. 
Add supporting logic in [backend/app_controller.py](backend/app_controller.py) if needed +3. Call from Svelte via `backendStore.apiGet/apiPost/apiPut` ### Modifying Transcription Display -- Local GUI: [gui/transcription_display_qt.py](gui/transcription_display_qt.py) -- Local web display (OBS): [server/web_display.py](server/web_display.py) (HTML in `_get_html()`) +- Tauri UI: [src/lib/components/TranscriptionDisplay.svelte](src/lib/components/TranscriptionDisplay.svelte) +- OBS display: [server/web_display.py](server/web_display.py) (HTML in `_get_html()`) - Multi-user display: [server/nodejs/server.js](server/nodejs/server.js) (display page in `/display` route) -### Adding a New Model Size - -- Update [client/transcription_engine.py](client/transcription_engine.py) -- Add to model selector in [gui/settings_dialog_qt.py](gui/settings_dialog_qt.py) -- Update CLI argument choices in [main_cli.py](main_cli.py) - ## Dependencies -**Core:** -- `faster-whisper`: Optimized Whisper inference -- `torch`: ML framework (CUDA-enabled via special index) -- `PySide6`: Qt6 bindings for GUI -- `sounddevice`: Cross-platform audio I/O -- `noisereduce`, `webrtcvad`: Audio preprocessing - -**Web Server:** -- `fastapi`, `uvicorn`: Web server and ASGI -- `websockets`: Real-time communication - -**Build:** -- `pyinstaller`: Create standalone executables -- `uv`: Fast package manager - -**PyTorch CUDA Index:** -- Configured in [pyproject.toml](pyproject.toml) under `[[tool.uv.index]]` -- Uses PyTorch's custom wheel repository for CUDA builds -- Automatically installed with `uv sync` when using CUDA build scripts +**Frontend:** Tauri v2, Svelte 5, Vite, TypeScript +**Backend:** Python 3.9+, FastAPI, Uvicorn, RealtimeSTT, faster-whisper, PyTorch (CUDA), sounddevice +**Build:** PyInstaller (sidecar), Tauri CLI (app), uv (Python packages) +**CI:** Gitea Actions with platform-specific runners ## Platform-Specific Notes ### Linux -- Uses PulseAudio/ALSA for audio -- Build scripts use bash 
(`.sh` files) -- Executable: `dist/LocalTranscription/LocalTranscription` +- Tauri needs: `libgtk-3-dev`, `libwebkit2gtk-4.1-dev`, `libappindicator3-dev`, `librsvg2-dev`, `patchelf` +- Audio: PulseAudio/ALSA via sounddevice ### Windows -- Uses Windows Audio/WASAPI -- Build scripts use batch (`.bat` files) -- Executable: `dist\LocalTranscription\LocalTranscription.exe` -- Requires Visual C++ Redistributable on target systems +- Tauri needs: WebView2 (usually pre-installed on Windows 10+) +- Audio: WASAPI via sounddevice -### Cross-Building -- **Cannot cross-compile** - must build on target platform -- CI/CD should use platform-specific runners - -## Troubleshooting - -### Model Loading Issues -- Models download to `~/.cache/huggingface/` -- First run requires internet connection -- Check disk space (models: 75MB-3GB depending on size) - -### Audio Device Issues -- Run `uv run python main_cli.py --list-devices` -- Check permissions (microphone access) -- Try different device indices in settings - -### GPU Not Detected -- Run `uv run python check_cuda.py` -- Install CUDA drivers (not CUDA toolkit - bundled in build) -- Verify PyTorch sees GPU: `python -c "import torch; print(torch.cuda.is_available())"` - -### Web Server Port Conflicts -- Default port: 8080 -- Change in [gui/main_window_qt.py](gui/main_window_qt.py) or config -- Use `lsof -i :8080` (Linux) or `netstat -ano | findstr :8080` (Windows) - -## OBS Integration - -### Local Display (Single User) -1. Start Local Transcription app -2. In OBS: Add "Browser" source -3. URL: `http://localhost:8080` -4. Set dimensions (e.g., 1920x300) - -### Multi-User Display (Node.js Server) -1. Deploy Node.js server (see [server/nodejs/README.md](server/nodejs/README.md)) -2. Each user configures Server URL: `http://your-server:3000/api/send` -3. Enter same room name and passphrase -4. In OBS: Add "Browser" source -5. URL: `http://your-server:3000/display?room=ROOM&fade=10×tamps=true&maxlines=50&fontsize=16` -6. 
Customize URL parameters as needed: - - `timestamps=false` - Hide timestamps - - `maxlines=30` - Show max 30 lines (prevents scroll bars) - - `fontsize=18` - Larger font - - `fontfamily=Courier` - Different font - -## Performance Optimization - -**For Real-Time Transcription:** -- Use `tiny` or `base` model (faster) -- Enable GPU if available (5-10x faster) -- Increase chunk_duration for better accuracy (higher latency) -- Decrease chunk_duration for lower latency (less context) -- Enable VAD to skip silent audio - -**For Build Size Reduction:** -- Don't bundle models (download on demand) -- Use CPU-only build if no GPU users -- Enable UPX compression (already in spec) - -## Phase Status - -- ✅ **Phase 1**: Standalone desktop application (complete) -- ✅ **Web Server**: Local OBS integration (complete) -- ✅ **Builds**: PyInstaller executables (complete) -- ✅ **Phase 2**: Multi-user Node.js server (complete, optional) -- ⏸️ **Phase 3+**: Advanced features (see [NEXT_STEPS.md](NEXT_STEPS.md)) +### macOS +- Tauri needs: Xcode Command Line Tools +- Audio: CoreAudio via sounddevice +- GPU: MPS (Apple Silicon) detected by `device_utils.py` +- `Info.plist` must include `NSMicrophoneUsageDescription` for mic access +- No CUDA builds — CPU/MPS only ## Related Documentation -- [README.md](README.md) - User-facing documentation -- [BUILD.md](BUILD.md) - Detailed build instructions -- [INSTALL.md](INSTALL.md) - Installation guide -- [NEXT_STEPS.md](NEXT_STEPS.md) - Future enhancements -- [server/nodejs/README.md](server/nodejs/README.md) - Node.js server setup and deployment +- [README.md](README.md) — User-facing documentation +- [BUILD.md](BUILD.md) — Detailed build instructions +- [INSTALL.md](INSTALL.md) — Installation guide +- [server/nodejs/README.md](server/nodejs/README.md) — Node.js server setup diff --git a/README.md b/README.md index c77516a..085f0ef 100644 --- a/README.md +++ b/README.md @@ -1,13 +1,14 @@ # Local Transcription -A real-time speech-to-text desktop 
application for streamers. Run locally on your machine with GPU or CPU, display transcriptions via OBS browser source, and optionally sync with other users through a multi-user server. +A real-time speech-to-text desktop application for streamers. Runs locally on your machine with GPU or CPU, displays transcriptions via OBS browser source, and optionally syncs with other users through a multi-user server. **Version 1.4.0** ## Features - **Real-Time Transcription**: Live speech-to-text using Whisper models with minimal latency -- **Standalone Desktop App**: PySide6/Qt GUI that works without any server +- **Cross-Platform**: Native desktop app for Windows, macOS, and Linux via [Tauri](https://tauri.app/) +- **Dual Transcription Modes**: Local (Whisper) or cloud (Deepgram) with managed billing or BYOK - **CPU & GPU Support**: Automatic detection of CUDA (NVIDIA), MPS (Apple Silicon), or CPU fallback - **Advanced Voice Detection**: Dual-layer VAD (WebRTC + Silero) for accurate speech detection - **OBS Integration**: Built-in web server for browser source capture at `http://localhost:8080` @@ -16,36 +17,70 @@ A real-time speech-to-text desktop application for streamers. Run locally on you - **Customizable Colors**: User-configurable colors for name, text, and background - **Noise Suppression**: Built-in audio preprocessing to reduce background noise - **Auto-Updates**: Automatic update checking with release notes display -- **Cross-Platform**: Builds available for Windows and Linux + +## Architecture + +The application uses a two-process architecture: + +1. **Tauri Shell** (Svelte 5 frontend) — lightweight native window (~50MB) rendering the UI +2. **Python Backend** (sidecar) — headless process running transcription, audio capture, and the OBS web server + +The Tauri frontend communicates with the Python backend via REST API and WebSocket, following the same pattern as [voice-to-notes](https://repo.anhonesthost.net/MacroPad/voice-to-notes). 
+ +``` +Tauri App (user launches this) + └─ Spawns Python backend as sidecar + ├─ FastAPI REST API (control endpoints) + ├─ WebSocket /ws/control (real-time state + transcriptions) + ├─ OBS web display at http://localhost:8080 + └─ Transcription engine (Whisper or Deepgram) +``` + +> **Legacy GUI**: The original PySide6/Qt desktop GUI (`main.py`) still works alongside the new Tauri frontend during the transition period. ## Quick Start ### Running from Source ```bash -# Install dependencies +# Install Python dependencies uv sync -# Run the application +# Run the Tauri app (frontend + backend) +npm install +npm run tauri dev + +# Or run just the headless backend (for development) +uv run python -m backend.main_headless + +# Or run the legacy PySide6 GUI uv run python main.py ``` ### Using Pre-Built Executables -Download the latest release from the [releases page](https://repo.anhonesthost.net/streamer-tools/local-transcription/releases) and run the executable for your platform. +Download the latest release from the [releases page](https://repo.anhonesthost.net/streamer-tools/local-transcription/releases): + +- **App installer** (Tauri shell): `.msi` (Windows), `.dmg` (macOS), `.deb`/`.rpm`/`.AppImage` (Linux) +- **Sidecar** (Python backend): Download the matching `sidecar-*` zip for your platform (CUDA or CPU) ### Building from Source -**Linux:** ```bash -./build.sh -# Output: dist/LocalTranscription/LocalTranscription -``` +# Build the Tauri app +npm install +npm run tauri build +# Output: src-tauri/target/release/bundle/ -**Windows:** -```cmd +# Build the Python sidecar (headless, no Qt) +uv sync +uv run pyinstaller local-transcription-headless.spec +# Output: dist/local-transcription-backend/ + +# Build the legacy PySide6 app (Linux) +./build.sh +# Build the legacy PySide6 app (Windows) build.bat -# Output: dist\LocalTranscription\LocalTranscription.exe ``` For detailed build instructions, see [BUILD.md](BUILD.md). 
@@ -57,14 +92,23 @@ For detailed build instructions, see [BUILD.md](BUILD.md). 1. Launch the application 2. Select your microphone from the audio device dropdown 3. Choose a Whisper model (smaller = faster, larger = more accurate): - - `tiny.en` / `tiny` - Fastest, good for quick captions - - `base.en` / `base` - Balanced speed and accuracy - - `small.en` / `small` - Better accuracy - - `medium.en` / `medium` - High accuracy - - `large-v3` - Best accuracy (requires more resources) + - `tiny.en` / `tiny` — Fastest, good for quick captions + - `base.en` / `base` — Balanced speed and accuracy + - `small.en` / `small` — Better accuracy + - `medium.en` / `medium` — High accuracy + - `large-v3` — Best accuracy (requires more resources) 4. Click **Start** to begin transcription 5. Transcriptions appear in the main window and at `http://localhost:8080` +### Remote Transcription (Deepgram) + +Instead of local Whisper models, you can use cloud-based transcription: + +- **Managed mode**: Sign up via the transcription proxy for metered billing +- **BYOK mode**: Bring your own Deepgram API key for direct access + +Configure in Settings > Remote Transcription. + ### OBS Browser Source Setup 1. Start the Local Transcription app @@ -88,7 +132,7 @@ For syncing transcriptions across multiple users (e.g., multi-host streams or tr ## Configuration -Settings are stored at `~/.local-transcription/config.yaml` and can be modified through the GUI settings panel. +Settings are stored at `~/.local-transcription/config.yaml` and can be modified through the GUI settings panel or the REST API. 
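The REST API accepts settings as flat dot-notation keys, while `config.yaml` stores them nested. As a sketch of that mapping — assuming a simple split-on-dot merge; the backend's actual handling may differ:

```python
from typing import Any, Dict

def expand_dot_keys(flat: Dict[str, Any]) -> Dict[str, Any]:
    """Expand flat dot-notation settings into the nested structure
    used by ~/.local-transcription/config.yaml."""
    nested: Dict[str, Any] = {}
    for dotted, value in flat.items():
        node = nested
        *parents, leaf = dotted.split(".")
        for key in parents:
            # Descend, creating intermediate sections as needed
            node = node.setdefault(key, {})
        node[leaf] = value
    return nested

# e.g. {"display.fade_after_seconds": 15} -> {"display": {"fade_after_seconds": 15}}
```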
### Key Settings @@ -100,6 +144,7 @@ Settings are stored at `~/.local-transcription/config.yaml` and can be modified | `transcription.silero_sensitivity` | VAD sensitivity (0-1, lower = more sensitive) | `0.4` | | `transcription.post_speech_silence_duration` | Silence before finalizing (seconds) | `0.3` | | `transcription.continuous_mode` | Fast speaker mode for quick talkers | `false` | +| `remote.mode` | Transcription mode (local/managed/byok) | `local` | | `display.show_timestamps` | Show timestamps with transcriptions | `true` | | `display.fade_after_seconds` | Fade out time (0 = never) | `10` | | `display.font_source` | Font type (System Font/Web-Safe/Google Font/Custom File) | `System Font` | @@ -111,67 +156,114 @@ See [config/default_config.yaml](config/default_config.yaml) for all available o ``` local-transcription/ -├── client/ # Core transcription modules -│ ├── audio_capture.py # Audio input handling -│ ├── transcription_engine_realtime.py # RealtimeSTT integration -│ ├── noise_suppression.py # VAD and noise reduction -│ ├── device_utils.py # CPU/GPU detection -│ ├── config.py # Configuration management -│ ├── server_sync.py # Multi-user server client -│ └── update_checker.py # Auto-update functionality -├── gui/ # Desktop application UI -│ ├── main_window_qt.py # Main application window -│ ├── settings_dialog_qt.py # Settings dialog -│ └── transcription_display_qt.py # Display widget -├── server/ # Web servers -│ ├── web_display.py # Local FastAPI server for OBS -│ └── nodejs/ # Multi-user sync server -│ ├── server.js # Express + WebSocket server -│ └── README.md # Deployment instructions +├── src/ # Svelte 5 frontend (Tauri UI) +│ ├── App.svelte # Main app shell +│ ├── lib/components/ # UI components +│ │ ├── Header.svelte +│ │ ├── StatusBar.svelte +│ │ ├── Controls.svelte +│ │ ├── TranscriptionDisplay.svelte +│ │ └── Settings.svelte +│ └── lib/stores/ # Reactive state management +│ ├── backend.ts # WebSocket + REST API client +│ ├── config.ts # App 
configuration +│ └── transcriptions.ts # Transcription data +├── src-tauri/ # Tauri v2 Rust shell +│ ├── src/main.rs +│ └── tauri.conf.json +├── backend/ # Headless Python backend (sidecar) +│ ├── app_controller.py # Orchestration logic (engine, sync, config) +│ ├── api_server.py # FastAPI REST + WebSocket control API +│ └── main_headless.py # Headless entry point +├── client/ # Core transcription modules +│ ├── audio_capture.py # Audio input handling +│ ├── transcription_engine_realtime.py # RealtimeSTT / Whisper +│ ├── deepgram_transcription.py # Deepgram cloud transcription +│ ├── noise_suppression.py # VAD and noise reduction +│ ├── device_utils.py # CPU/GPU/MPS detection +│ ├── config.py # Configuration management +│ ├── server_sync.py # Multi-user server client +│ └── update_checker.py # Auto-update functionality +├── gui/ # Legacy PySide6/Qt GUI +│ ├── main_window_qt.py +│ ├── settings_dialog_qt.py +│ └── transcription_display_qt.py +├── server/ # Web servers +│ ├── web_display.py # Local FastAPI server for OBS +│ └── nodejs/ # Multi-user sync server +├── .gitea/workflows/ # CI/CD +│ ├── release.yml # Tauri app builds (all platforms) +│ └── build-sidecar.yml # Python sidecar builds (CUDA + CPU) ├── config/ -│ └── default_config.yaml # Default settings template -├── main.py # GUI entry point -├── main_cli.py # CLI version (for testing) -├── build.sh # Linux build script -├── build.bat # Windows build script -└── local-transcription.spec # PyInstaller configuration +│ └── default_config.yaml # Default settings template +├── main.py # Legacy GUI entry point +├── main_cli.py # CLI version (for testing) +├── local-transcription.spec # PyInstaller config (legacy, with PySide6) +├── local-transcription-headless.spec # PyInstaller config (headless sidecar) +├── pyproject.toml # Python dependencies +└── package.json # Node.js / Tauri dependencies ``` ## Technology Stack -### Desktop Application +### Frontend (Tauri) +- **Tauri v2** — Native cross-platform shell 
(Rust) +- **Svelte 5** — Reactive UI framework (TypeScript) +- **Vite** — Frontend build tool + +### Backend (Python Sidecar) - **Python 3.9+** -- **PySide6** - Qt6 GUI framework -- **RealtimeSTT** - Real-time speech-to-text with advanced VAD -- **faster-whisper** - Optimized Whisper model inference -- **PyTorch** - ML framework (CUDA-enabled) -- **sounddevice** - Cross-platform audio capture -- **webrtcvad + silero_vad** - Voice activity detection -- **noisereduce** - Noise suppression +- **FastAPI + Uvicorn** — REST API and WebSocket server +- **RealtimeSTT** — Real-time speech-to-text with advanced VAD +- **faster-whisper** — Optimized Whisper model inference (CTranslate2) +- **PyTorch** — ML framework (CUDA-enabled builds available) +- **sounddevice** — Cross-platform audio capture +- **webrtcvad + silero_vad** — Voice activity detection -### Web Servers -- **FastAPI + Uvicorn** - Local web display server -- **Node.js + Express + WebSocket** - Multi-user sync server +### Multi-User Server (Optional) +- **Node.js + Express + WebSocket** — Real-time sync server -### Build Tools -- **PyInstaller** - Executable packaging -- **uv** - Fast Python package manager +### Build & CI/CD +- **PyInstaller** — Python sidecar packaging +- **Tauri CLI** — App bundling (.msi, .dmg, .deb, .rpm, .AppImage) +- **Gitea Actions** — Automated cross-platform builds +- **uv** — Fast Python package manager + +## CI/CD + +Two Gitea Actions workflows in `.gitea/workflows/`: + +| Workflow | Trigger | Produces | +|----------|---------|----------| +| `release.yml` | Push to `main` | Tauri app installers for all platforms | +| `build-sidecar.yml` | Changes to `client/`, `server/`, `backend/`, or `pyproject.toml` | Python sidecar zips (CUDA + CPU) | + +Both workflows require a `BUILD_TOKEN` secret in the repo settings (Gitea API token with release write access). 
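
A sidecar workflow along these lines might look as follows. This is a sketch only — the step names, runner labels, and upload command are assumptions, not the repository's actual `build-sidecar.yml`; it is here to show where the `BUILD_TOKEN` secret plugs in:

```yaml
# Sketch only — illustrative, not the repository's actual workflow file.
name: build-sidecar
on:
  push:
    paths: ["client/**", "server/**", "backend/**", "pyproject.toml"]
jobs:
  sidecar:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install uv pyinstaller
      - run: uv sync && pyinstaller local-transcription-headless.spec
      - name: Upload release asset
        env:
          # Gitea API token with release write access (repo secret)
          BUILD_TOKEN: ${{ secrets.BUILD_TOKEN }}
        run: |
          curl -fsS -H "Authorization: token $BUILD_TOKEN" \
            -F attachment=@dist/sidecar.zip \
            "$GITEA_SERVER_URL/api/v1/repos/$GITHUB_REPOSITORY/releases/..."
```

Gitea Actions is largely GitHub Actions-compatible, so standard actions such as `actions/checkout` work on a configured act_runner.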
+ +### Release Artifacts + +| Platform | App Installer | Sidecar (CUDA) | Sidecar (CPU) | +|----------|--------------|----------------|---------------| +| Linux x86_64 | `.deb`, `.rpm`, `.AppImage` | `sidecar-linux-x86_64-cuda.zip` | `sidecar-linux-x86_64-cpu.zip` | +| Windows x86_64 | `.msi`, `-setup.exe` | `sidecar-windows-x86_64-cuda.zip` | `sidecar-windows-x86_64-cpu.zip` | +| macOS ARM64 | `.dmg` | — | `sidecar-macos-aarch64-cpu.zip` | ## System Requirements ### Minimum -- Python 3.9+ - 4GB RAM - Any modern CPU -### Recommended (for real-time performance) +### Recommended (for local real-time transcription) - 8GB+ RAM - NVIDIA GPU with CUDA support (for GPU acceleration) -- FFmpeg (installed automatically with dependencies) ### For Building -- **Linux**: gcc, Python dev headers -- **Windows**: Visual Studio Build Tools, Python dev headers +- **Tauri app**: Node.js 20+, Rust stable, platform SDK (see [Tauri prerequisites](https://tauri.app/start/prerequisites/)) +- **Python sidecar**: Python 3.9+, uv, PyInstaller +- **Linux**: `libgtk-3-dev`, `libwebkit2gtk-4.1-dev`, `libappindicator3-dev`, `librsvg2-dev`, `patchelf` +- **Windows**: Visual Studio Build Tools, WebView2 +- **macOS**: Xcode Command Line Tools ## Troubleshooting @@ -185,7 +277,7 @@ local-transcription/ # List available audio devices uv run python main_cli.py --list-devices ``` -- Ensure microphone permissions are granted +- Ensure microphone permissions are granted (especially on macOS) - Try different device indices in settings ### GPU Not Detected @@ -193,13 +285,13 @@ uv run python main_cli.py --list-devices # Check CUDA availability uv run python -c "import torch; print(torch.cuda.is_available())" ``` -- Install NVIDIA drivers (CUDA toolkit is bundled) +- Install NVIDIA drivers (CUDA toolkit is bundled in CUDA sidecar builds) - The app automatically falls back to CPU if no GPU is available ### Web Server Port Conflicts -- Default port is 8080 +- Default port is 8080; the app tries ports 
8080-8084 automatically - Change in settings or edit config file -- Check for conflicts: `lsof -i :8080` (Linux) or `netstat -ano | findstr :8080` (Windows) +- Check for conflicts: `lsof -i :8080` (Linux/macOS) or `netstat -ano | findstr :8080` (Windows) ## Use Cases @@ -222,3 +314,5 @@ MIT License - [OpenAI Whisper](https://github.com/openai/whisper) for the speech recognition model - [RealtimeSTT](https://github.com/KoljaB/RealtimeSTT) for real-time transcription capabilities - [faster-whisper](https://github.com/guillaumekln/faster-whisper) for optimized inference +- [Tauri](https://tauri.app/) for the cross-platform desktop framework +- [Deepgram](https://deepgram.com/) for cloud transcription API
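
The port fallback described under Troubleshooting (the app tries ports 8080-8084 automatically) can be sketched roughly like this — a simplified, self-contained illustration, not the app's actual implementation:

```python
import socket


def find_free_port(start: int = 8080, end: int = 8084) -> int:
    """Return the first port in start..end that can be bound on localhost.

    Simplified sketch of the fallback behaviour described in
    Troubleshooting; not the app's actual implementation.
    """
    for port in range(start, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            try:
                s.bind(("127.0.0.1", port))
            except OSError:
                continue  # port already in use, try the next one
            return port
    raise RuntimeError(f"no free port in {start}-{end}")


if __name__ == "__main__":
    print(find_free_port())
```

If all five ports are busy, the sketch raises instead of picking an arbitrary port, which mirrors the advice above to resolve the conflict or change the port in settings.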