# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
Local Transcription is a desktop application for real-time speech-to-text transcription designed for streamers. It uses Whisper models (via faster-whisper) to transcribe audio locally with optional multi-user server synchronization.
**Key Features:**
- Standalone desktop GUI (PySide6/Qt)
- Local transcription with CPU/GPU support
- Built-in web server for OBS browser source integration
- Optional Node.js-based multi-user server for syncing transcriptions across users
- Noise suppression and Voice Activity Detection (VAD)
- Cross-platform builds (Linux/Windows) with PyInstaller
## Project Structure
```
local-transcription/
├── client/ # Core transcription logic
│ ├── audio_capture.py # Audio input and buffering
│ ├── transcription_engine.py # Whisper model integration
│ ├── noise_suppression.py # VAD and noise reduction
│ ├── device_utils.py # CPU/GPU device management
│ ├── config.py # Configuration management
│ └── server_sync.py # Multi-user server sync client
├── gui/ # Desktop application UI
│ ├── main_window_qt.py # Main application window (PySide6)
│ ├── settings_dialog_qt.py # Settings dialog (PySide6)
│ └── transcription_display_qt.py # Display widget
├── server/ # Web display servers
│ ├── web_display.py # FastAPI server for OBS browser source (local)
│ └── nodejs/ # Optional multi-user Node.js server
│ ├── server.js # Multi-user sync server with WebSocket
│ ├── package.json # Node.js dependencies
│ └── README.md # Server deployment documentation
├── config/ # Example configuration files
│ └── default_config.yaml # Default settings template
├── main.py # GUI application entry point
├── main_cli.py # CLI version for testing
└── pyproject.toml # Dependencies and build config
```
## Development Commands
### Installation and Setup
```bash
# Install dependencies (creates .venv automatically)
uv sync
# Run the GUI application
uv run python main.py
# Run CLI version (headless, for testing)
uv run python main_cli.py
# List available audio devices
uv run python main_cli.py --list-devices
# Install with CUDA support (if needed)
uv pip install torch --index-url https://download.pytorch.org/whl/cu121
```
### Building Executables
```bash
# Linux (includes CUDA support - works on both GPU and CPU systems)
./build.sh
# Windows (includes CUDA support - works on both GPU and CPU systems)
build.bat
# Manual build with PyInstaller
uv sync # Install dependencies (includes CUDA PyTorch)
uv pip uninstall -q enum34 # Remove incompatible enum34 package
uv run pyinstaller local-transcription.spec
```
**Important:** All builds include CUDA support via `pyproject.toml` configuration. CUDA builds can be created on systems without NVIDIA GPUs. The PyTorch CUDA runtime is bundled, and the app automatically falls back to CPU if no GPU is available.
### Testing
```bash
# Run component tests
uv run python test_components.py
# Check CUDA availability
uv run python check_cuda.py
# Test web server manually
uv run python -m uvicorn server.web_display:app --reload
```
## Architecture
### Audio Processing Pipeline
1. **Audio Capture** ([client/audio_capture.py](client/audio_capture.py))
- Captures audio from microphone/system using sounddevice
- Handles automatic sample rate detection and resampling
- Uses chunking with overlap for better transcription quality
- Default: 3-second chunks with 0.5s overlap
2. **Noise Suppression** ([client/noise_suppression.py](client/noise_suppression.py))
- Applies noisereduce for background noise reduction
- Voice Activity Detection (VAD) using webrtcvad
- Skips silent segments to improve performance
3. **Transcription** ([client/transcription_engine.py](client/transcription_engine.py))
- Uses faster-whisper for efficient inference
- Supports CPU, CUDA, and Apple MPS (Mac)
- Models: tiny, base, small, medium, large
- Thread-safe model loading with locks
4. **Display** ([gui/main_window_qt.py](gui/main_window_qt.py))
- PySide6/Qt-based desktop GUI
- Real-time transcription display with scrolling
- Settings panel with live updates (no restart needed)
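The chunk-with-overlap buffering in step 1 can be sketched in a few lines. This is an illustrative stand-in, not the code in `client/audio_capture.py`: plain lists replace the sounddevice stream, and the sample rate is an assumption.

```python
SAMPLE_RATE = 16_000            # assumed; the app detects this automatically
CHUNK_SECONDS = 3.0             # default chunk duration
OVERLAP_SECONDS = 0.5           # default overlap

CHUNK_SAMPLES = int(SAMPLE_RATE * CHUNK_SECONDS)
OVERLAP_SAMPLES = int(SAMPLE_RATE * OVERLAP_SECONDS)

def drain_chunks(buffer, new_samples):
    """Append new samples; return any full chunks, keeping the overlap tail."""
    buffer.extend(new_samples)
    chunks = []
    while len(buffer) >= CHUNK_SAMPLES:
        chunks.append(buffer[:CHUNK_SAMPLES])
        # Keep the last OVERLAP_SAMPLES so consecutive chunks share context
        del buffer[:CHUNK_SAMPLES - OVERLAP_SAMPLES]
    return chunks

buf = []
emitted = drain_chunks(buf, [0.0] * (CHUNK_SAMPLES + 100))
```

Each emitted chunk starts with the tail of the previous one, which is what gives the transcriber context across chunk boundaries.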
### Web Server Architecture
**Local Web Server** ([server/web_display.py](server/web_display.py))
- Always runs when GUI starts (port 8080 by default)
- FastAPI with WebSocket for real-time updates
- Used for OBS browser source integration
- Single-user (displays only local transcriptions)
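The fan-out behind the real-time display can be sketched with stdlib asyncio alone. The real server uses FastAPI WebSockets; here plain queues stand in for the socket connections.

```python
import asyncio

clients = set()  # one queue per connected browser source

def connect():
    q = asyncio.Queue()
    clients.add(q)
    return q

async def broadcast(line):
    # Fan each new transcription line out to every connected client
    for q in clients:
        await q.put(line)

async def demo():
    a, b = connect(), connect()
    await broadcast("hello world")
    return await a.get(), await b.get()

received = asyncio.run(demo())
```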
**Multi-User Server** (Optional - for syncing across multiple users)
**Node.js WebSocket Server** ([server/nodejs/](server/nodejs/)) - **RECOMMENDED**
- Real-time WebSocket support (< 100ms latency)
- Handles 100+ concurrent users
- Easy deployment to VPS/cloud hosting (Railway, Heroku, DigitalOcean, or any VPS)
- Configurable display options via URL parameters:
- `timestamps=true/false` - Show/hide timestamps
- `maxlines=50` - Maximum visible lines (prevents scroll bars in OBS)
- `fontsize=16` - Font size in pixels
- `fontfamily=Arial` - Font family
- `fade=10` - Seconds before text fades (0 = never)
See [server/nodejs/README.md](server/nodejs/README.md) for deployment instructions
### Configuration System
- Config stored at `~/.local-transcription/config.yaml`
- Managed by [client/config.py](client/config.py)
- Settings apply immediately without restart (except model changes)
- YAML format with nested keys (e.g., `transcription.model`)
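The dotted-key access (`transcription.model`) amounts to walking the nested config dict. A minimal sketch with illustrative keys, not the actual contents of `client/config.py`:

```python
# Shape mirrors config/default_config.yaml (keys here are illustrative)
config = {
    "transcription": {"model": "base", "language": "en"},
}

def get(cfg, dotted_key, default=None):
    """Resolve a dotted key like 'transcription.model' in a nested dict."""
    node = cfg
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node

model = get(config, "transcription.model")
missing = get(config, "transcription.device", "auto")
```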
### Device Management
- [client/device_utils.py](client/device_utils.py) handles CPU/GPU detection
- Auto-detects CUDA, MPS (Mac), or falls back to CPU
- Compute types: float32 (best quality), float16 (GPU), int8 (fastest)
- Thread-safe device selection
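The CUDA → MPS → CPU fallback order can be expressed as a small helper. This is a sketch, not the code in `client/device_utils.py`, and it is guarded so it also runs where torch is not installed.

```python
def detect_device() -> str:
    """Pick the best available compute device, falling back to CPU."""
    try:
        import torch
    except ImportError:
        return "cpu"  # no torch at all: CPU-only path
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"  # Apple Silicon
    return "cpu"

device = detect_device()
```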
## Key Implementation Details
### PyInstaller Build Configuration
- [local-transcription.spec](local-transcription.spec) controls the build
- UPX compression enabled for smaller executables
- Hidden imports required for PySide6, faster-whisper, torch
- Console mode enabled by default (set `console=False` to hide)
### Threading Model
- Main thread: Qt GUI event loop
- Audio thread: Captures and processes audio chunks
- Web server thread: Runs FastAPI server
- Transcription: Runs in callback thread from audio capture
- All transcription results communicated via Qt signals
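The handoff from the audio callback thread to the GUI can be illustrated with a stdlib queue standing in for the signal/slot connection (Qt signals are the thread-safe mechanism the app actually uses; this sketch only shows why the callback never touches GUI state directly):

```python
import queue
import threading

results = queue.Queue()

def transcription_callback(text):
    # Runs on the audio/transcription thread; never touch GUI widgets here.
    # In the app this is where a Qt signal would be emitted instead.
    results.put(text)

worker = threading.Thread(target=transcription_callback, args=("hello",))
worker.start()
worker.join()

line = results.get(timeout=1)  # the GUI side consumes results on its own thread
```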
### Server Sync (Optional Multi-User Feature)
- [client/server_sync.py](client/server_sync.py) handles server communication
- Toggle in Settings: "Enable Server Sync"
- Sends transcriptions to the multi-user server via HTTP POST (the Node.js server's `/api/send` endpoint)
- Separate web display shows merged transcriptions from all users
- Falls back gracefully if server unavailable
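The graceful-fallback behavior can be sketched as a POST helper that swallows network errors so local transcription keeps working offline. The endpoint and payload fields below are illustrative, not the exact wire format used by `client/server_sync.py`.

```python
import json
from urllib import request, error

def send_line(server_url, room, text, timeout=2.0):
    """POST one transcription line; return False instead of raising on failure."""
    payload = json.dumps({"room": room, "text": text}).encode()
    req = request.Request(
        server_url,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    try:
        with request.urlopen(req, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (error.URLError, OSError):
        return False  # server unreachable: drop the line, keep transcribing

# Nothing listens on port 9 here, so this demonstrates the fallback path
ok = send_line("http://localhost:9/api/send", "demo", "hello")
```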
## Common Patterns
### Adding a New Setting
1. Add to [config/default_config.yaml](config/default_config.yaml)
2. Update [client/config.py](client/config.py) if validation needed
3. Add UI control in [gui/settings_dialog_qt.py](gui/settings_dialog_qt.py)
4. Apply setting in relevant component (no restart if possible)
5. Emit signal to update display if needed
### Modifying Transcription Display
- Local GUI: [gui/transcription_display_qt.py](gui/transcription_display_qt.py)
- Web display (OBS): [server/web_display.py](server/web_display.py) (HTML in `_get_html()`)
- Multi-user display: served by the Node.js server ([server/nodejs/server.js](server/nodejs/server.js)) at `/display`
### Adding a New Model Size
- Update [client/transcription_engine.py](client/transcription_engine.py)
- Add to model selector in [gui/settings_dialog_qt.py](gui/settings_dialog_qt.py)
- Update CLI argument choices in [main_cli.py](main_cli.py)
## Dependencies
**Core:**
- `faster-whisper` : Optimized Whisper inference
- `torch` : ML framework (CUDA-enabled via special index)
- `PySide6` : Qt6 bindings for GUI
- `sounddevice` : Cross-platform audio I/O
- `noisereduce` , `webrtcvad` : Audio preprocessing
**Web Server:**
- `fastapi` , `uvicorn` : Web server and ASGI
- `websockets` : Real-time communication
**Build:**
- `pyinstaller` : Create standalone executables
- `uv` : Fast package manager
**PyTorch CUDA Index:**
- Configured in [pyproject.toml](pyproject.toml) under `[[tool.uv.index]]`
- Uses PyTorch's custom wheel repository for CUDA builds
- Automatically installed with `uv sync` when using CUDA build scripts
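A `pyproject.toml` fragment in this shape wires `torch` to the PyTorch wheel index. The index name and CUDA version below are illustrative; check the repository's actual `pyproject.toml` for the real values.

```toml
[[tool.uv.index]]
name = "pytorch-cu121"                          # illustrative name
url = "https://download.pytorch.org/whl/cu121"
explicit = true                                  # only used when a package opts in

[tool.uv.sources]
torch = { index = "pytorch-cu121" }
```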
## Platform-Specific Notes
### Linux
- Uses PulseAudio/ALSA for audio
- Build scripts use bash (`.sh` files)
- Executable: `dist/LocalTranscription/LocalTranscription`
### Windows
- Uses Windows Audio/WASAPI
- Build scripts use batch (`.bat` files)
- Executable: `dist\LocalTranscription\LocalTranscription.exe`
- Requires Visual C++ Redistributable on target systems
### Cross-Building
- **Cannot cross-compile** - must build on target platform
- CI/CD should use platform-specific runners
## Troubleshooting
### Model Loading Issues
- Models download to `~/.cache/huggingface/`
- First run requires internet connection
- Check disk space (models: 75MB-3GB depending on size)
### Audio Device Issues
- Run `uv run python main_cli.py --list-devices`
- Check permissions (microphone access)
- Try different device indices in settings
### GPU Not Detected
- Run `uv run python check_cuda.py`
- Install CUDA drivers (not CUDA toolkit - bundled in build)
- Verify PyTorch sees GPU: `python -c "import torch; print(torch.cuda.is_available())"`
### Web Server Port Conflicts
- Default port: 8080
- Change in [gui/main_window_qt.py](gui/main_window_qt.py) or config
- Use `lsof -i :8080` (Linux) or `netstat -ano | findstr :8080` (Windows)
## OBS Integration
### Local Display (Single User)
1. Start Local Transcription app
2. In OBS: Add "Browser" source
3. URL: `http://localhost:8080`
4. Set dimensions (e.g., 1920x300)
### Multi-User Display (Node.js Server)
1. Deploy the Node.js server (see [server/nodejs/README.md](server/nodejs/README.md))
2. Each user configures Server URL: `http://your-server:3000/api/send`
3. Enter same room name and passphrase
4. In OBS: Add "Browser" source
5. URL: `http://your-server:3000/display?room=ROOM&fade=10&timestamps=true&maxlines=50&fontsize=16`
6. Customize URL parameters as needed:
- `timestamps=false` - Hide timestamps
- `maxlines=30` - Show max 30 lines (prevents scroll bars)
- `fontsize=18` - Larger font
- `fontfamily=Courier` - Different font
## Performance Optimization
**For Real-Time Transcription:**
- Use `tiny` or `base` model (faster)
- Enable GPU if available (5-10x faster)
- Increase chunk_duration for better accuracy (higher latency)
- Decrease chunk_duration for lower latency (less context)
- Enable VAD to skip silent audio
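These knobs map to config keys along these lines (key names below are illustrative; check `config/default_config.yaml` for the real ones):

```yaml
# Illustrative keys only -- see config/default_config.yaml for actual names
transcription:
  model: tiny          # fastest; trade accuracy for latency
audio:
  chunk_duration: 2.0  # shorter = lower latency, less context
noise_suppression:
  vad_enabled: true    # skip silent segments
```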
**For Build Size Reduction:**
- Don't bundle models (download on demand)
- Use CPU-only build if no GPU users
- Enable UPX compression (already in spec)
## Phase Status
- ✅ **Phase 1**: Standalone desktop application (complete)
- ✅ **Web Server**: Local OBS integration (complete)
- ✅ **Builds**: PyInstaller executables (complete)
- ✅ **Phase 2**: Multi-user Node.js server (complete, optional)
- ⏸️ **Phase 3+**: Advanced features (see [NEXT_STEPS.md](NEXT_STEPS.md))
## Related Documentation
- [README.md](README.md) - User-facing documentation
- [BUILD.md](BUILD.md) - Detailed build instructions
- [INSTALL.md](INSTALL.md) - Installation guide
- [NEXT_STEPS.md](NEXT_STEPS.md) - Future enhancements
- [server/nodejs/README.md](server/nodejs/README.md) - Node.js server setup and deployment