# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Local Transcription is a desktop application for real-time speech-to-text transcription designed for streamers. It uses Whisper models (via faster-whisper) to transcribe audio locally, with optional multi-user server synchronization.

**Key Features:**
- Standalone desktop GUI (PySide6/Qt)
- Local transcription with CPU/GPU support
- Built-in web server for OBS browser source integration
- Optional PHP-based multi-user server for syncing transcriptions across users
- Noise suppression and Voice Activity Detection (VAD)
- Cross-platform builds (Linux/Windows) with PyInstaller

## Project Structure

```
local-transcription/
├── client/                      # Core transcription logic
│   ├── audio_capture.py         # Audio input and buffering
│   ├── transcription_engine.py  # Whisper model integration
│   ├── noise_suppression.py     # VAD and noise reduction
│   ├── device_utils.py          # CPU/GPU device management
│   ├── config.py                # Configuration management
│   └── server_sync.py           # Multi-user server sync client
├── gui/                         # Desktop application UI
│   ├── main_window_qt.py        # Main application window (PySide6)
│   ├── settings_dialog_qt.py    # Settings dialog (PySide6)
│   └── transcription_display_qt.py  # Display widget
├── server/                      # Web display server
│   ├── web_display.py           # FastAPI server for OBS browser source
│   └── php/                     # Optional multi-user PHP server
│       ├── server.php           # Multi-user sync server
│       ├── display.php          # Multi-user web display
│       └── README.md            # PHP server documentation
├── config/                      # Example configuration files
│   └── default_config.yaml      # Default settings template
├── main.py                      # GUI application entry point
├── main_cli.py                  # CLI version for testing
└── pyproject.toml               # Dependencies and build config
```

## Development Commands

### Installation and Setup

```bash
# Install dependencies (creates .venv automatically)
uv sync

# Run the GUI application
uv run python main.py

# Run CLI version (headless, for testing)
uv run python main_cli.py

# List available audio devices
uv run python main_cli.py --list-devices

# Install with CUDA support (if needed)
uv pip install torch --index-url https://download.pytorch.org/whl/cu121
```

### Building Executables

```bash
# Linux (CPU-only)
./build.sh

# Linux (with CUDA support - works on both GPU and CPU systems)
./build-cuda.sh

# Windows (CPU-only)
build.bat

# Windows (with CUDA support)
build-cuda.bat

# Manual build with PyInstaller
uv run pyinstaller local-transcription.spec
```

**Important:** CUDA builds can be created on systems without NVIDIA GPUs. The PyTorch CUDA runtime is bundled, and the app automatically falls back to CPU if no GPU is available.

### Testing

```bash
# Run component tests
uv run python test_components.py

# Check CUDA availability
uv run python check_cuda.py

# Test web server manually
uv run python -m uvicorn server.web_display:app --reload
```

## Architecture

### Audio Processing Pipeline

1. **Audio Capture** ([client/audio_capture.py](client/audio_capture.py))
   - Captures audio from microphone/system using sounddevice
   - Handles automatic sample rate detection and resampling
   - Uses chunking with overlap for better transcription quality
   - Default: 3-second chunks with 0.5s overlap

2. **Noise Suppression** ([client/noise_suppression.py](client/noise_suppression.py))
   - Applies noisereduce for background noise reduction
   - Voice Activity Detection (VAD) using webrtcvad
   - Skips silent segments to improve performance

3. **Transcription** ([client/transcription_engine.py](client/transcription_engine.py))
   - Uses faster-whisper for efficient inference
   - Supports CPU, CUDA, and Apple MPS (Mac)
   - Models: tiny, base, small, medium, large
   - Thread-safe model loading with locks
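The lock-guarded lazy loading mentioned above can be sketched roughly like this (a minimal illustration of the pattern — the class and names are hypothetical, not the actual `transcription_engine` API):

```python
import threading

class LazyModel:
    """Illustrative sketch of lock-guarded lazy model loading (not the real API)."""

    def __init__(self, loader):
        self._loader = loader          # e.g. lambda: WhisperModel("base")
        self._model = None
        self._lock = threading.Lock()
        self.load_count = 0            # for demonstration only

    def get(self):
        # Double-checked locking: the fast path skips the lock once loaded
        if self._model is None:
            with self._lock:
                if self._model is None:
                    self._model = self._loader()
                    self.load_count += 1
        return self._model
```

Concurrent `get()` calls from audio callback threads then trigger at most one expensive model load.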
4. **Display** ([gui/main_window_qt.py](gui/main_window_qt.py))
   - PySide6/Qt-based desktop GUI
   - Real-time transcription display with scrolling
   - Settings panel with live updates (no restart needed)

### Web Server Architecture

**Local Web Server** ([server/web_display.py](server/web_display.py))
- Always runs when the GUI starts (port 8080 by default)
- FastAPI with WebSocket for real-time updates
- Used for OBS browser source integration
- Single-user (displays only local transcriptions)

**Multi-User Servers** (optional - for syncing across multiple users)

Three options available:

1. **PHP with Polling** ([server/php/display-polling.php](server/php/display-polling.php)) - **RECOMMENDED for PHP**
   - Works on ANY shared hosting (no buffering issues)
   - Uses HTTP polling instead of SSE
   - 1-2 second latency, very reliable
   - File-based storage, no database needed

2. **Node.js WebSocket Server** ([server/nodejs/](server/nodejs/)) - **BEST PERFORMANCE**
   - Real-time WebSocket support (< 100ms latency)
   - Handles 100+ concurrent users
   - Requires VPS/cloud hosting (Railway, Heroku, DigitalOcean)
   - Much better than PHP for real-time applications
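Whichever server is used, each client syncs by POSTing its transcriptions. A rough sketch of assembling such a request body (the field names here are assumptions for illustration — check [client/server_sync.py](client/server_sync.py) for the real protocol):

```python
import json

def build_sync_payload(room: str, passphrase: str, text: str, user: str) -> bytes:
    """Assemble a JSON body for a hypothetical sync endpoint.

    Field names are illustrative, not the documented protocol.
    """
    payload = {
        "room": room,
        "passphrase": passphrase,
        "user": user,
        "text": text,
    }
    return json.dumps(payload).encode("utf-8")

# Posting would then be a standard HTTP call, e.g. with urllib:
# req = urllib.request.Request(url, data=build_sync_payload(...),
#                              headers={"Content-Type": "application/json"})
```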
3. **PHP with SSE** ([server/php/display.php](server/php/display.php)) - **NOT RECOMMENDED**
   - Has buffering issues on most shared hosting
   - PHP-FPM incompatibility
   - Use polling or Node.js instead

See [server/COMPARISON.md](server/COMPARISON.md) and [server/QUICK_FIX.md](server/QUICK_FIX.md) for details.

### Configuration System

- Config stored at `~/.local-transcription/config.yaml`
- Managed by [client/config.py](client/config.py)
- Settings apply immediately without restart (except model changes)
- YAML format with nested keys (e.g., `transcription.model`)

### Device Management

- [client/device_utils.py](client/device_utils.py) handles CPU/GPU detection
- Auto-detects CUDA, MPS (Mac), or falls back to CPU
- Compute types: float32 (best quality), float16 (GPU), int8 (fastest)
- Thread-safe device selection

## Key Implementation Details

### PyInstaller Build Configuration

- [local-transcription.spec](local-transcription.spec) controls the build
- UPX compression enabled for smaller executables
- Hidden imports required for PySide6, faster-whisper, torch
- Console mode enabled by default (set `console=False` to hide)

### Threading Model

- Main thread: Qt GUI event loop
- Audio thread: captures and processes audio chunks
- Web server thread: runs the FastAPI server
- Transcription: runs in a callback thread from audio capture
- All transcription results communicated via Qt signals

### Server Sync (Optional Multi-User Feature)

- [client/server_sync.py](client/server_sync.py) handles server communication
- Toggle in Settings: "Enable Server Sync"
- Sends transcriptions to the PHP server via POST
- A separate web display shows merged transcriptions from all users
- Falls back gracefully if the server is unavailable

## Common Patterns

### Adding a New Setting

1. Add to [config/default_config.yaml](config/default_config.yaml)
2. Update [client/config.py](client/config.py) if validation is needed
3. Add a UI control in [gui/settings_dialog_qt.py](gui/settings_dialog_qt.py)
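Nested keys like `transcription.model` presumably resolve by walking the parsed YAML dict; a minimal resolver sketch (a hypothetical helper, not the actual `client/config.py` API):

```python
def get_key(config: dict, dotted: str, default=None):
    """Walk a nested dict using a dotted key like 'transcription.model'."""
    node = config
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node

# Example against a config shaped like default_config.yaml:
cfg = {"transcription": {"model": "base", "language": "en"}}
model = get_key(cfg, "transcription.model")        # "base"
device = get_key(cfg, "audio.device_index", -1)    # falls back to -1
```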
4. Apply the setting in the relevant component (no restart if possible)
5. Emit a signal to update the display if needed

### Modifying Transcription Display

- Local GUI: [gui/transcription_display_qt.py](gui/transcription_display_qt.py)
- Web display (OBS): [server/web_display.py](server/web_display.py) (HTML in `_get_html()`)
- Multi-user display: [server/php/display.php](server/php/display.php)

### Adding a New Model Size

- Update [client/transcription_engine.py](client/transcription_engine.py)
- Add to the model selector in [gui/settings_dialog_qt.py](gui/settings_dialog_qt.py)
- Update CLI argument choices in [main_cli.py](main_cli.py)

## Dependencies

**Core:**
- `faster-whisper`: optimized Whisper inference
- `torch`: ML framework (CUDA-enabled via special index)
- `PySide6`: Qt6 bindings for GUI
- `sounddevice`: cross-platform audio I/O
- `noisereduce`, `webrtcvad`: audio preprocessing

**Web Server:**
- `fastapi`, `uvicorn`: web server and ASGI
- `websockets`: real-time communication

**Build:**
- `pyinstaller`: create standalone executables
- `uv`: fast package manager

**PyTorch CUDA Index:**
- Configured in [pyproject.toml](pyproject.toml) under `[[tool.uv.index]]`
- Uses PyTorch's custom wheel repository for CUDA builds
- Automatically installed with `uv sync` when using the CUDA build scripts

## Platform-Specific Notes

### Linux

- Uses PulseAudio/ALSA for audio
- Build scripts use bash (`.sh` files)
- Executable: `dist/LocalTranscription/LocalTranscription`

### Windows

- Uses Windows Audio/WASAPI
- Build scripts use batch (`.bat` files)
- Executable: `dist\LocalTranscription\LocalTranscription.exe`
- Requires Visual C++ Redistributable on target systems

### Cross-Building

- **Cannot cross-compile** - must build on the target platform
- CI/CD should use platform-specific runners

## Troubleshooting

### Model Loading Issues

- Models download to `~/.cache/huggingface/`
- First run requires an internet connection
- Check disk space (models: 75MB-3GB depending on size)
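To see how much disk the downloaded models actually occupy, a quick check along these lines works (the cache path matches the default above; the helper name is illustrative):

```python
from pathlib import Path

def dir_size_mb(root: str) -> float:
    """Total size in MB of all files under root (0.0 if it doesn't exist)."""
    base = Path(root).expanduser()
    if not base.exists():
        return 0.0
    total = sum(p.stat().st_size for p in base.rglob("*") if p.is_file())
    return total / (1024 * 1024)

print(f"model cache: {dir_size_mb('~/.cache/huggingface'):.1f} MB")
```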
### Audio Device Issues

- Run `uv run python main_cli.py --list-devices`
- Check permissions (microphone access)
- Try different device indices in settings

### GPU Not Detected

- Run `uv run python check_cuda.py`
- Install CUDA drivers (not the CUDA toolkit - that is bundled in the build)
- Verify PyTorch sees the GPU: `python -c "import torch; print(torch.cuda.is_available())"`

### Web Server Port Conflicts

- Default port: 8080
- Change in [gui/main_window_qt.py](gui/main_window_qt.py) or the config
- Use `lsof -i :8080` (Linux) or `netstat -ano | findstr :8080` (Windows)

## OBS Integration

### Local Display (Single User)

1. Start the Local Transcription app
2. In OBS: add a "Browser" source
3. URL: `http://localhost:8080`
4. Set dimensions (e.g., 1920x300)

### Multi-User Display (PHP Server - Polling)

1. Deploy the PHP server to web hosting
2. Each user enables "Server Sync" in settings
3. Enter the same room name and passphrase
4. In OBS: add a "Browser" source
5. URL: `https://your-domain.com/transcription/display-polling.php?room=ROOM&fade=10`

### Multi-User Display (Node.js Server)

1. Deploy the Node.js server (see [server/nodejs/README.md](server/nodejs/README.md))
2. Each user configures the Server URL: `http://your-server:3000/api/send`
3. Enter the same room name and passphrase
4. In OBS: add a "Browser" source
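When a browser source stays blank, it is worth confirming the display server's port is actually reachable before debugging OBS itself; a minimal sketch:

```python
import socket

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. port_open("localhost", 8080) for the local display server,
# or port_open("your-server", 3000) for the Node.js server
```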
5. URL: `http://your-server:3000/display?room=ROOM&fade=10`

## Performance Optimization

**For real-time transcription:**
- Use the `tiny` or `base` model (faster)
- Enable GPU if available (5-10x faster)
- Increase chunk_duration for better accuracy (higher latency)
- Decrease chunk_duration for lower latency (less context)
- Enable VAD to skip silent audio

**For build size reduction:**
- Don't bundle models (download on demand)
- Use the CPU-only build if no GPU users
- Enable UPX compression (already in the spec)

## Phase Status

- ✅ **Phase 1**: Standalone desktop application (complete)
- ✅ **Web Server**: Local OBS integration (complete)
- ✅ **Builds**: PyInstaller executables (complete)
- 🚧 **Phase 2**: Multi-user PHP server (functional, optional)
- ⏸️ **Phase 3+**: Advanced features (see [NEXT_STEPS.md](NEXT_STEPS.md))

## Related Documentation

- [README.md](README.md) - User-facing documentation
- [BUILD.md](BUILD.md) - Detailed build instructions
- [INSTALL.md](INSTALL.md) - Installation guide
- [NEXT_STEPS.md](NEXT_STEPS.md) - Future enhancements
- [server/php/README.md](server/php/README.md) - PHP server setup