# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Local Transcription is a desktop application for real-time speech-to-text transcription designed for streamers. It uses Whisper models (via faster-whisper) to transcribe audio locally, with optional multi-user server synchronization.

**Key Features:**

- Standalone desktop GUI (PySide6/Qt)
- Local transcription with CPU/GPU support
- Built-in web server for OBS browser source integration
- Optional Node.js-based multi-user server for syncing transcriptions across users
- Noise suppression and Voice Activity Detection (VAD)
- Cross-platform builds (Linux/Windows) with PyInstaller

## Project Structure

```
local-transcription/
├── client/                          # Core transcription logic
│   ├── audio_capture.py             # Audio input and buffering
│   ├── transcription_engine.py      # Whisper model integration
│   ├── noise_suppression.py         # VAD and noise reduction
│   ├── device_utils.py              # CPU/GPU device management
│   ├── config.py                    # Configuration management
│   └── server_sync.py               # Multi-user server sync client
├── gui/                             # Desktop application UI
│   ├── main_window_qt.py            # Main application window (PySide6)
│   ├── settings_dialog_qt.py        # Settings dialog (PySide6)
│   └── transcription_display_qt.py  # Display widget
├── server/                          # Web display servers
│   ├── web_display.py               # FastAPI server for OBS browser source (local)
│   └── nodejs/                      # Optional multi-user Node.js server
│       ├── server.js                # Multi-user sync server with WebSocket
│       ├── package.json             # Node.js dependencies
│       └── README.md                # Server deployment documentation
├── config/                          # Example configuration files
│   └── default_config.yaml          # Default settings template
├── main.py                          # GUI application entry point
├── main_cli.py                      # CLI version for testing
└── pyproject.toml                   # Dependencies and build config
```

## Development Commands

### Installation and Setup

```bash
# Install dependencies (creates .venv automatically)
uv sync

# Run the GUI application
uv run python main.py

# Run CLI version (headless, for testing)
uv run python main_cli.py

# List available audio devices
uv run python main_cli.py --list-devices

# Install with CUDA support (if needed)
uv pip install torch --index-url https://download.pytorch.org/whl/cu121
```

### Building Executables

```bash
# Linux (CPU-only)
./build.sh

# Linux (with CUDA support - works on both GPU and CPU systems)
./build-cuda.sh

# Windows (CPU-only)
build.bat

# Windows (with CUDA support)
build-cuda.bat

# Manual build with PyInstaller
uv run pyinstaller local-transcription.spec
```

**Important:** CUDA builds can be created on systems without NVIDIA GPUs. The PyTorch CUDA runtime is bundled, and the app automatically falls back to CPU if no GPU is available.

### Testing

```bash
# Run component tests
uv run python test_components.py

# Check CUDA availability
uv run python check_cuda.py

# Test web server manually
uv run python -m uvicorn server.web_display:app --reload
```

## Architecture

### Audio Processing Pipeline

1. **Audio Capture** ([client/audio_capture.py](client/audio_capture.py))
   - Captures audio from the microphone or system using sounddevice
   - Handles automatic sample rate detection and resampling
   - Uses chunking with overlap for better transcription quality
   - Default: 3-second chunks with 0.5 s overlap

2. **Noise Suppression** ([client/noise_suppression.py](client/noise_suppression.py))
   - Applies noisereduce for background noise reduction
   - Voice Activity Detection (VAD) using webrtcvad
   - Skips silent segments to improve performance

3. **Transcription** ([client/transcription_engine.py](client/transcription_engine.py))
   - Uses faster-whisper for efficient inference
   - Supports CPU, CUDA, and Apple MPS (Mac)
   - Models: tiny, base, small, medium, large
   - Thread-safe model loading with locks

4. **Display** ([gui/main_window_qt.py](gui/main_window_qt.py))
   - PySide6/Qt-based desktop GUI
   - Real-time transcription display with scrolling
   - Settings panel with live updates (no restart needed)
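
The chunk-and-overlap strategy in the capture stage can be sketched as follows. This is a simplified illustration: `audio_capture.py` buffers a live stream rather than slicing a fixed array, and the function name here is hypothetical.

```python
def chunk_with_overlap(samples, sample_rate=16000, chunk_s=3.0, overlap_s=0.5):
    """Yield fixed-size windows that each repeat the tail of the previous
    one, so the transcriber keeps context across chunk boundaries."""
    chunk = int(chunk_s * sample_rate)
    step = chunk - int(overlap_s * sample_rate)  # advance by chunk minus overlap
    for start in range(0, max(len(samples) - chunk + 1, 1), step):
        yield samples[start:start + chunk]

# 10 s of audio at 16 kHz yields complete 3 s windows every 2.5 s
audio = [0.0] * (10 * 16000)
chunks = list(chunk_with_overlap(audio))  # 3 full windows of 48000 samples
```

With the defaults, each window shares its first 0.5 s with the tail of the previous window, which is what lets Whisper recover words that straddle a chunk boundary.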

### Web Server Architecture

**Local Web Server** ([server/web_display.py](server/web_display.py))

- Always runs when the GUI starts (port 8080 by default)
- FastAPI with WebSocket for real-time updates
- Used for OBS browser source integration
- Single-user (displays only local transcriptions)

**Multi-User Server** (optional, for syncing across multiple users)

**Node.js WebSocket Server** ([server/nodejs/](server/nodejs/)) - **RECOMMENDED**

- Real-time WebSocket support (< 100 ms latency)
- Handles 100+ concurrent users
- Easy deployment to VPS/cloud hosting (Railway, Heroku, DigitalOcean, or any VPS)
- Configurable display options via URL parameters:
  - `timestamps=true/false` - show/hide timestamps
  - `maxlines=50` - maximum visible lines (prevents scroll bars in OBS)
  - `fontsize=16` - font size in pixels
  - `fontfamily=Arial` - font family
  - `fade=10` - seconds before text fades (0 = never)

See [server/nodejs/README.md](server/nodejs/README.md) for deployment instructions.

### Configuration System

- Config stored at `~/.local-transcription/config.yaml`
- Managed by [client/config.py](client/config.py)
- Settings apply immediately without restart (except model changes)
- YAML format with nested keys (e.g., `transcription.model`)
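
Dotted keys of this form can be resolved against the parsed YAML with a small lookup like the following (a sketch; `client/config.py` may implement access differently):

```python
def get_nested(config, dotted_key, default=None):
    """Resolve a dotted key like 'transcription.model' against a nested dict."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default  # missing key at any level falls back to the default
        node = node[part]
    return node

config = {"transcription": {"model": "base", "language": "en"}}
get_nested(config, "transcription.model")  # → 'base'
get_nested(config, "server.url")           # → None
```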

### Device Management

- [client/device_utils.py](client/device_utils.py) handles CPU/GPU detection
- Auto-detects CUDA, MPS (Mac), or falls back to CPU
- Compute types: float32 (best quality), float16 (GPU), int8 (fastest)
- Thread-safe device selection
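
The fallback order reduces to a small decision. A framework-free sketch follows; the actual pairing of device and compute type in `device_utils.py` may differ:

```python
def pick_device(cuda_available, mps_available):
    """Documented fallback order: CUDA first, then MPS, then CPU."""
    if cuda_available:
        return "cuda", "float16"  # half precision is the usual GPU choice
    if mps_available:
        return "mps", "float32"
    return "cpu", "int8"          # int8 is fastest on CPU

pick_device(False, False)  # → ('cpu', 'int8')
```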

## Key Implementation Details

### PyInstaller Build Configuration

- [local-transcription.spec](local-transcription.spec) controls the build
- UPX compression enabled for smaller executables
- Hidden imports required for PySide6, faster-whisper, torch
- Console mode enabled by default (set `console=False` to hide)

### Threading Model

- Main thread: Qt GUI event loop
- Audio thread: captures and processes audio chunks
- Web server thread: runs the FastAPI server
- Transcription: runs in a callback thread from audio capture
- All transcription results are communicated via Qt signals

### Server Sync (Optional Multi-User Feature)

- [client/server_sync.py](client/server_sync.py) handles server communication
- Toggle in Settings: "Enable Server Sync"
- Sends transcriptions to the Node.js server via POST
- A separate web display shows merged transcriptions from all users
- Falls back gracefully if the server is unavailable
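
A sketch of how such a POST might be constructed; the field names here are illustrative, see `client/server_sync.py` for the actual payload:

```python
import json
from urllib import request

def build_sync_request(server_url, room, passphrase, text, timestamp):
    """Build (but don't send) the JSON POST a sync client could issue."""
    payload = json.dumps({"room": room, "passphrase": passphrase,
                          "text": text, "timestamp": timestamp}).encode()
    return request.Request(server_url, data=payload, method="POST",
                           headers={"Content-Type": "application/json"})

req = build_sync_request("http://your-server:3000/api/send",
                         "my-room", "secret", "hello world", 1700000000)
```

Graceful fallback then amounts to wrapping `urlopen(req, timeout=...)` in a try/except and continuing local transcription if the request fails.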

## Common Patterns

### Adding a New Setting

1. Add to [config/default_config.yaml](config/default_config.yaml)
2. Update [client/config.py](client/config.py) if validation is needed
3. Add a UI control in [gui/settings_dialog_qt.py](gui/settings_dialog_qt.py)
4. Apply the setting in the relevant component (without restart if possible)
5. Emit a signal to update the display if needed
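
Step 4's "apply without restart" boils down to change notification. The app uses Qt signals for this, but the idea can be sketched without Qt (class and method names here are hypothetical):

```python
class Settings:
    """Minimal observer sketch: components register callbacks and are
    notified immediately when a setting changes (no restart needed)."""
    def __init__(self):
        self._values = {}
        self._listeners = {}

    def on_change(self, key, callback):
        self._listeners.setdefault(key, []).append(callback)

    def set(self, key, value):
        self._values[key] = value
        for cb in self._listeners.get(key, []):
            cb(value)  # push the new value to every registered component

applied = []
s = Settings()
s.on_change("display.font_size", applied.append)
s.set("display.font_size", 18)
# applied == [18]
```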

### Modifying Transcription Display

- Local GUI: [gui/transcription_display_qt.py](gui/transcription_display_qt.py)
- Web display (OBS): [server/web_display.py](server/web_display.py) (HTML in `_get_html()`)
- Multi-user display: served by the Node.js server ([server/nodejs/server.js](server/nodejs/server.js))

### Adding a New Model Size

- Update [client/transcription_engine.py](client/transcription_engine.py)
- Add to the model selector in [gui/settings_dialog_qt.py](gui/settings_dialog_qt.py)
- Update CLI argument choices in [main_cli.py](main_cli.py)

## Dependencies

**Core:**

- `faster-whisper`: Optimized Whisper inference
- `torch`: ML framework (CUDA-enabled via special index)
- `PySide6`: Qt6 bindings for GUI
- `sounddevice`: Cross-platform audio I/O
- `noisereduce`, `webrtcvad`: Audio preprocessing

**Web Server:**

- `fastapi`, `uvicorn`: Web server and ASGI runtime
- `websockets`: Real-time communication

**Build:**

- `pyinstaller`: Create standalone executables
- `uv`: Fast package manager

**PyTorch CUDA Index:**

- Configured in [pyproject.toml](pyproject.toml) under `[[tool.uv.index]]`
- Uses PyTorch's custom wheel repository for CUDA builds
- Installed automatically by `uv sync` when using the CUDA build scripts

## Platform-Specific Notes

### Linux

- Uses PulseAudio/ALSA for audio
- Build scripts use bash (`.sh` files)
- Executable: `dist/LocalTranscription/LocalTranscription`

### Windows

- Uses Windows Audio/WASAPI
- Build scripts use batch (`.bat` files)
- Executable: `dist\LocalTranscription\LocalTranscription.exe`
- Requires the Visual C++ Redistributable on target systems

### Cross-Building

- **Cannot cross-compile** - must build on the target platform
- CI/CD should use platform-specific runners

## Troubleshooting

### Model Loading Issues

- Models download to `~/.cache/huggingface/`
- First run requires an internet connection
- Check disk space (models range from 75 MB to 3 GB depending on size)

### Audio Device Issues

- Run `uv run python main_cli.py --list-devices`
- Check permissions (microphone access)
- Try different device indices in settings

### GPU Not Detected

- Run `uv run python check_cuda.py`
- Install CUDA drivers (not the CUDA toolkit - that is bundled in the build)
- Verify PyTorch sees the GPU: `python -c "import torch; print(torch.cuda.is_available())"`

### Web Server Port Conflicts

- Default port: 8080
- Change in [gui/main_window_qt.py](gui/main_window_qt.py) or the config
- Use `lsof -i :8080` (Linux) or `netstat -ano | findstr :8080` (Windows) to find the conflicting process

## OBS Integration

### Local Display (Single User)

1. Start the Local Transcription app
2. In OBS: add a "Browser" source
3. URL: `http://localhost:8080`
4. Set dimensions (e.g., 1920x300)

### Multi-User Display (Node.js Server)

1. Deploy the Node.js server (see [server/nodejs/README.md](server/nodejs/README.md))
2. Each user configures the Server URL: `http://your-server:3000/api/send`
3. Enter the same room name and passphrase
4. In OBS: add a "Browser" source
5. URL: `http://your-server:3000/display?room=ROOM&fade=10&timestamps=true&maxlines=50&fontsize=16`
6. Customize URL parameters as needed:
   - `timestamps=false` - hide timestamps
   - `maxlines=30` - show at most 30 lines (prevents scroll bars)
   - `fontsize=18` - larger font
   - `fontfamily=Courier` - different font
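
The `maxlines` behaviour amounts to a bounded buffer on the display side. A sketch of the idea, not the server's actual implementation:

```python
from collections import deque

class TranscriptBuffer:
    """Keep only the newest max_lines entries so the OBS browser source
    never grows a scroll bar."""
    def __init__(self, max_lines=50):
        self.lines = deque(maxlen=max_lines)  # old lines fall off automatically

    def add(self, text):
        self.lines.append(text)

    def render(self):
        return "\n".join(self.lines)

buf = TranscriptBuffer(max_lines=3)
for i in range(5):
    buf.add(f"line {i}")
buf.render()  # → 'line 2\nline 3\nline 4'
```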

## Performance Optimization

**For Real-Time Transcription:**

- Use the `tiny` or `base` model (faster)
- Enable the GPU if available (5-10x faster)
- Increase `chunk_duration` for better accuracy (at the cost of latency)
- Decrease `chunk_duration` for lower latency (at the cost of context)
- Enable VAD to skip silent audio

**For Build Size Reduction:**

- Don't bundle models (download on demand)
- Use the CPU-only build if no users need a GPU
- Enable UPX compression (already set in the spec)

## Phase Status

- ✅ **Phase 1**: Standalone desktop application (complete)
- ✅ **Web Server**: Local OBS integration (complete)
- ✅ **Builds**: PyInstaller executables (complete)
- ✅ **Phase 2**: Multi-user Node.js server (complete, optional)
- ⏸️ **Phase 3+**: Advanced features (see [NEXT_STEPS.md](NEXT_STEPS.md))

## Related Documentation

- [README.md](README.md) - User-facing documentation
- [BUILD.md](BUILD.md) - Detailed build instructions
- [INSTALL.md](INSTALL.md) - Installation guide
- [NEXT_STEPS.md](NEXT_STEPS.md) - Future enhancements
- [server/nodejs/README.md](server/nodejs/README.md) - Node.js server setup and deployment