Update to support sync captions

2025-12-26 16:15:52 -08:00
parent 2870d45bdc
commit c28679acb6
12 changed files with 4513 additions and 0 deletions

CLAUDE.md (new file, 326 lines)
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
Local Transcription is a desktop application for real-time speech-to-text transcription designed for streamers. It uses Whisper models (via faster-whisper) to transcribe audio locally with optional multi-user server synchronization.
**Key Features:**
- Standalone desktop GUI (PySide6/Qt)
- Local transcription with CPU/GPU support
- Built-in web server for OBS browser source integration
- Optional PHP-based multi-user server for syncing transcriptions across users
- Noise suppression and Voice Activity Detection (VAD)
- Cross-platform builds (Linux/Windows) with PyInstaller
## Project Structure
```
local-transcription/
├── client/ # Core transcription logic
│ ├── audio_capture.py # Audio input and buffering
│ ├── transcription_engine.py # Whisper model integration
│ ├── noise_suppression.py # VAD and noise reduction
│ ├── device_utils.py # CPU/GPU device management
│ ├── config.py # Configuration management
│ └── server_sync.py # Multi-user server sync client
├── gui/ # Desktop application UI
│ ├── main_window_qt.py # Main application window (PySide6)
│ ├── settings_dialog_qt.py # Settings dialog (PySide6)
│ └── transcription_display_qt.py # Display widget
├── server/ # Web display server
│ ├── web_display.py # FastAPI server for OBS browser source
│ ├── php/ # Optional multi-user PHP server
│ │ ├── server.php # Multi-user sync server
│ │ ├── display.php # Multi-user web display (SSE)
│ │ ├── display-polling.php # Multi-user web display (polling)
│ │ └── README.md # PHP server documentation
│ └── nodejs/ # Optional multi-user Node.js WebSocket server
├── config/ # Example configuration files
│ └── default_config.yaml # Default settings template
├── main.py # GUI application entry point
├── main_cli.py # CLI version for testing
└── pyproject.toml # Dependencies and build config
```
## Development Commands
### Installation and Setup
```bash
# Install dependencies (creates .venv automatically)
uv sync
# Run the GUI application
uv run python main.py
# Run CLI version (headless, for testing)
uv run python main_cli.py
# List available audio devices
uv run python main_cli.py --list-devices
# Install with CUDA support (if needed)
uv pip install torch --index-url https://download.pytorch.org/whl/cu121
```
### Building Executables
```bash
# Linux (CPU-only)
./build.sh
# Linux (with CUDA support - works on both GPU and CPU systems)
./build-cuda.sh
# Windows (CPU-only)
build.bat
# Windows (with CUDA support)
build-cuda.bat
# Manual build with PyInstaller
uv run pyinstaller local-transcription.spec
```
**Important:** CUDA builds can be created on systems without NVIDIA GPUs. The PyTorch CUDA runtime is bundled, and the app automatically falls back to CPU if no GPU is available.
### Testing
```bash
# Run component tests
uv run python test_components.py
# Check CUDA availability
uv run python check_cuda.py
# Test web server manually
uv run python -m uvicorn server.web_display:app --reload
```
## Architecture
### Audio Processing Pipeline
1. **Audio Capture** ([client/audio_capture.py](client/audio_capture.py))
- Captures audio from microphone/system using sounddevice
- Handles automatic sample rate detection and resampling
- Uses chunking with overlap for better transcription quality
- Default: 3-second chunks with 0.5s overlap
2. **Noise Suppression** ([client/noise_suppression.py](client/noise_suppression.py))
- Applies noisereduce for background noise reduction
- Voice Activity Detection (VAD) using webrtcvad
- Skips silent segments to improve performance
3. **Transcription** ([client/transcription_engine.py](client/transcription_engine.py))
- Uses faster-whisper for efficient inference
- Supports CPU, CUDA, and Apple MPS (Mac)
- Models: tiny, base, small, medium, large
- Thread-safe model loading with locks
4. **Display** ([gui/main_window_qt.py](gui/main_window_qt.py))
- PySide6/Qt-based desktop GUI
- Real-time transcription display with scrolling
- Settings panel with live updates (no restart needed)
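The overlapped chunking in step 1 can be sketched in a few lines. This is an illustrative model of the scheme described above (3-second chunks, 0.5 s overlap), not the actual code in `client/audio_capture.py`; the function name is hypothetical.

```python
# Sketch of overlapped chunking: each chunk shares `overlap` seconds
# with its predecessor, so the stream advances by (chunk - overlap).
def chunk_starts(total_seconds: float, chunk: float = 3.0, overlap: float = 0.5):
    """Yield start times for overlapping chunks covering the stream."""
    step = chunk - overlap  # default: advance 2.5 s per chunk
    t = 0.0
    while t < total_seconds:
        yield t
        t += step

# A 10-second buffer yields chunks starting at 0.0, 2.5, 5.0, 7.5 s
starts = list(chunk_starts(10.0))
```

The overlap gives the model context across chunk boundaries, which is why the docs above call it out as improving transcription quality at the cost of some redundant compute.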
### Web Server Architecture
**Local Web Server** ([server/web_display.py](server/web_display.py))
- Always runs when GUI starts (port 8080 by default)
- FastAPI with WebSocket for real-time updates
- Used for OBS browser source integration
- Single-user (displays only local transcriptions)

**Multi-User Servers** (Optional - for syncing across multiple users)
Three options are available:
1. **PHP with Polling** ([server/php/display-polling.php](server/php/display-polling.php)) - **RECOMMENDED for PHP**
- Works on ANY shared hosting (no buffering issues)
- Uses HTTP polling instead of SSE
- 1-2 second latency, very reliable
- File-based storage, no database needed
2. **Node.js WebSocket Server** ([server/nodejs/](server/nodejs/)) - **BEST PERFORMANCE**
- Real-time WebSocket support (< 100ms latency)
- Handles 100+ concurrent users
- Requires VPS/cloud hosting (Railway, Heroku, DigitalOcean)
- Much better than PHP for real-time applications
3. **PHP with SSE** ([server/php/display.php](server/php/display.php)) - **NOT RECOMMENDED**
- Has buffering issues on most shared hosting
- PHP-FPM incompatibility
- Use polling or Node.js instead
See [server/COMPARISON.md](server/COMPARISON.md) and [server/QUICK_FIX.md](server/QUICK_FIX.md) for details.
### Configuration System
- Config stored at `~/.local-transcription/config.yaml`
- Managed by [client/config.py](client/config.py)
- Settings apply immediately without restart (except model changes)
- YAML format with nested keys (e.g., `transcription.model`)
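Nested dotted keys like `transcription.model` can be resolved with a small helper. A minimal sketch, assuming the YAML loads into plain nested dicts; the function name is illustrative, not the actual API of `client/config.py`.

```python
# Resolve a dotted key such as "transcription.model" against nested dicts.
def get_setting(config: dict, dotted_key: str, default=None):
    """Walk 'a.b.c' through nested dicts, returning default if any part is missing."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node

config = {"transcription": {"model": "base", "language": "en"}}
model = get_setting(config, "transcription.model")             # "base"
missing = get_setting(config, "transcription.device", "auto")  # "auto"
```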
### Device Management
- [client/device_utils.py](client/device_utils.py) handles CPU/GPU detection
- Auto-detects CUDA, MPS (Mac), or falls back to CPU
- Compute types: float32 (best quality), float16 (GPU), int8 (fastest)
- Thread-safe device selection
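The fallback order described above can be sketched as a pure function. The `cuda_ok`/`mps_ok` flags stand in for `torch.cuda.is_available()` and the MPS check; the pairing of compute types with devices follows the docs above but the real logic lives in `client/device_utils.py`.

```python
# Hedged sketch of auto device selection: CUDA, then MPS, then CPU.
def pick_device(cuda_ok: bool, mps_ok: bool) -> tuple:
    """Return (device, compute_type): float16 on GPU paths, int8 on CPU."""
    if cuda_ok:
        return "cuda", "float16"
    if mps_ok:
        return "mps", "float16"  # MPS pairing assumed, not verified
    return "cpu", "int8"         # int8 is the fastest CPU compute type

device, compute = pick_device(cuda_ok=False, mps_ok=False)  # ("cpu", "int8")
```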
## Key Implementation Details
### PyInstaller Build Configuration
- [local-transcription.spec](local-transcription.spec) controls build
- UPX compression enabled for smaller executables
- Hidden imports required for PySide6, faster-whisper, torch
- Console mode enabled by default (set `console=False` to hide)
### Threading Model
- Main thread: Qt GUI event loop
- Audio thread: Captures and processes audio chunks
- Web server thread: Runs FastAPI server
- Transcription: Runs in callback thread from audio capture
- All transcription results communicated via Qt signals
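The thread handoff above can be modeled with a thread-safe queue standing in for Qt signals (signals are effectively a queued, thread-safe message channel into the GUI event loop). This is an analogy, not the app's actual code.

```python
import queue
import threading

# Stand-in for the Qt signal/slot handoff: the transcription callback
# enqueues results from its worker thread, the GUI thread drains them.
results = queue.Queue()

def transcription_callback(text: str) -> None:
    results.put(text)  # called from the audio/transcription thread

def drain() -> list:
    """What the GUI thread does when a 'signal' arrives."""
    out = []
    while not results.empty():
        out.append(results.get())
    return out

t = threading.Thread(target=transcription_callback, args=("hello world",))
t.start()
t.join()
collected = drain()  # ["hello world"]
```

In the real app, `Signal.emit()` from a worker thread and a slot connected in the main thread give the same decoupling without a hand-rolled queue.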
### Server Sync (Optional Multi-User Feature)
- [client/server_sync.py](client/server_sync.py) handles server communication
- Toggle in Settings: "Enable Server Sync"
- Sends transcriptions to PHP server via POST
- Separate web display shows merged transcriptions from all users
- Falls back gracefully if server unavailable
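The graceful fallback can be sketched as follows. The payload field names (`room`, `text`) and the injected `send` callable are illustrative assumptions; see `client/server_sync.py` for the real implementation, which would use an HTTP client such as `urllib.request`.

```python
import json

# Hedged sketch of the sync POST with graceful fallback: failures are
# swallowed and reported as False instead of crashing transcription.
def sync_transcription(send, url: str, room: str, text: str) -> bool:
    """POST one transcription; return False (never raise) on failure."""
    payload = json.dumps({"room": room, "text": text})
    try:
        send(url, payload)  # e.g. an HTTP POST in practice
        return True
    except OSError:
        return False        # server unreachable: keep transcribing locally

ok = sync_transcription(lambda u, p: None,
                        "https://example.com/server.php", "myroom", "hello")

def down(u, p):
    raise OSError("unreachable")

failed = sync_transcription(down,
                            "https://example.com/server.php", "myroom", "hello")
```

Injecting the sender keeps the fallback logic testable without a network.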
## Common Patterns
### Adding a New Setting
1. Add to [config/default_config.yaml](config/default_config.yaml)
2. Update [client/config.py](client/config.py) if validation needed
3. Add UI control in [gui/settings_dialog_qt.py](gui/settings_dialog_qt.py)
4. Apply setting in relevant component (no restart if possible)
5. Emit signal to update display if needed
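Steps 1-2 (register a default, validate on load) can be sketched like this. The setting name `display.font_size` and both helpers are hypothetical examples, not settings or functions the app actually defines.

```python
# Sketch of defaults + validation for a new setting (hypothetical names).
DEFAULTS = {"display": {"font_size": 24}}

def load_with_defaults(user_cfg: dict) -> dict:
    """Overlay user values on defaults (one level deep, for brevity)."""
    merged = {}
    for section, defaults in DEFAULTS.items():
        merged[section] = {**defaults, **user_cfg.get(section, {})}
    return merged

def validate(cfg: dict) -> dict:
    """Clamp out-of-range values back to the default."""
    size = cfg["display"]["font_size"]
    if not (8 <= size <= 96):
        cfg["display"]["font_size"] = 24
    return cfg

cfg = validate(load_with_defaults({"display": {"font_size": 200}}))
# cfg["display"]["font_size"] == 24 (200 was out of range)
```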
### Modifying Transcription Display
- Local GUI: [gui/transcription_display_qt.py](gui/transcription_display_qt.py)
- Web display (OBS): [server/web_display.py](server/web_display.py) (HTML in `_get_html()`)
- Multi-user display: [server/php/display.php](server/php/display.php)
### Adding a New Model Size
- Update [client/transcription_engine.py](client/transcription_engine.py)
- Add to model selector in [gui/settings_dialog_qt.py](gui/settings_dialog_qt.py)
- Update CLI argument choices in [main_cli.py](main_cli.py)
## Dependencies
**Core:**
- `faster-whisper`: Optimized Whisper inference
- `torch`: ML framework (CUDA-enabled via special index)
- `PySide6`: Qt6 bindings for GUI
- `sounddevice`: Cross-platform audio I/O
- `noisereduce`, `webrtcvad`: Audio preprocessing
**Web Server:**
- `fastapi`, `uvicorn`: Web server and ASGI
- `websockets`: Real-time communication
**Build:**
- `pyinstaller`: Create standalone executables
- `uv`: Fast package manager
**PyTorch CUDA Index:**
- Configured in [pyproject.toml](pyproject.toml) under `[[tool.uv.index]]`
- Uses PyTorch's custom wheel repository for CUDA builds
- Automatically installed with `uv sync` when using CUDA build scripts
## Platform-Specific Notes
### Linux
- Uses PulseAudio/ALSA for audio
- Build scripts use bash (`.sh` files)
- Executable: `dist/LocalTranscription/LocalTranscription`
### Windows
- Uses Windows Audio/WASAPI
- Build scripts use batch (`.bat` files)
- Executable: `dist\LocalTranscription\LocalTranscription.exe`
- Requires Visual C++ Redistributable on target systems
### Cross-Building
- **Cannot cross-compile** - must build on target platform
- CI/CD should use platform-specific runners
## Troubleshooting
### Model Loading Issues
- Models download to `~/.cache/huggingface/`
- First run requires internet connection
- Check disk space (models: 75MB-3GB depending on size)
### Audio Device Issues
- Run `uv run python main_cli.py --list-devices`
- Check permissions (microphone access)
- Try different device indices in settings
### GPU Not Detected
- Run `uv run python check_cuda.py`
- Install CUDA drivers (not CUDA toolkit - bundled in build)
- Verify PyTorch sees the GPU: `uv run python -c "import torch; print(torch.cuda.is_available())"`
### Web Server Port Conflicts
- Default port: 8080
- Change in [gui/main_window_qt.py](gui/main_window_qt.py) or config
- Use `lsof -i :8080` (Linux) or `netstat -ano | findstr :8080` (Windows)
## OBS Integration
### Local Display (Single User)
1. Start Local Transcription app
2. In OBS: Add "Browser" source
3. URL: `http://localhost:8080`
4. Set dimensions (e.g., 1920x300)
### Multi-User Display (PHP Server - Polling)
1. Deploy PHP server to web hosting
2. Each user enables "Server Sync" in settings
3. Enter same room name and passphrase
4. In OBS: Add "Browser" source
5. URL: `https://your-domain.com/transcription/display-polling.php?room=ROOM&fade=10`
### Multi-User Display (Node.js Server)
1. Deploy Node.js server (see [server/nodejs/README.md](server/nodejs/README.md))
2. Each user configures Server URL: `http://your-server:3000/api/send`
3. Enter same room name and passphrase
4. In OBS: Add "Browser" source
5. URL: `http://your-server:3000/display?room=ROOM&fade=10`
## Performance Optimization
**For Real-Time Transcription:**
- Use `tiny` or `base` model (faster)
- Enable GPU if available (5-10x faster)
- Increase chunk_duration for better accuracy (higher latency)
- Decrease chunk_duration for lower latency (less context)
- Enable VAD to skip silent audio
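The real-time tips above might translate to a config profile like the following. Key names are assumptions for illustration (only `transcription.model` and `chunk_duration` appear elsewhere in this doc); check [config/default_config.yaml](config/default_config.yaml) for the actual keys.

```yaml
# Hypothetical low-latency profile; key names are assumptions.
transcription:
  model: tiny            # smallest model, fastest inference
audio:
  chunk_duration: 2.0    # shorter chunks: lower latency, less context
  chunk_overlap: 0.5
vad:
  enabled: true          # skip silent segments entirely
```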
**For Build Size Reduction:**
- Don't bundle models (download on demand)
- Use CPU-only build if no GPU users
- Enable UPX compression (already in spec)
## Phase Status
- ✅ **Phase 1**: Standalone desktop application (complete)
- ✅ **Web Server**: Local OBS integration (complete)
- ✅ **Builds**: PyInstaller executables (complete)
- 🚧 **Phase 2**: Multi-user PHP server (functional, optional)
- ⏸️ **Phase 3+**: Advanced features (see [NEXT_STEPS.md](NEXT_STEPS.md))
## Related Documentation
- [README.md](README.md) - User-facing documentation
- [BUILD.md](BUILD.md) - Detailed build instructions
- [INSTALL.md](INSTALL.md) - Installation guide
- [NEXT_STEPS.md](NEXT_STEPS.md) - Future enhancements
- [server/php/README.md](server/php/README.md) - PHP server setup