CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

Local Transcription is a desktop application for real-time speech-to-text transcription designed for streamers. It uses Whisper models (via faster-whisper) to transcribe audio locally with optional multi-user server synchronization.

Key Features:

  • Standalone desktop GUI (PySide6/Qt)
  • Local transcription with CPU/GPU support
  • Built-in web server for OBS browser source integration
  • Optional PHP-based multi-user server for syncing transcriptions across users
  • Noise suppression and Voice Activity Detection (VAD)
  • Cross-platform builds (Linux/Windows) with PyInstaller

Project Structure

local-transcription/
├── client/                   # Core transcription logic
│   ├── audio_capture.py      # Audio input and buffering
│   ├── transcription_engine.py # Whisper model integration
│   ├── noise_suppression.py  # VAD and noise reduction
│   ├── device_utils.py       # CPU/GPU device management
│   ├── config.py             # Configuration management
│   └── server_sync.py        # Multi-user server sync client
├── gui/                      # Desktop application UI
│   ├── main_window_qt.py     # Main application window (PySide6)
│   ├── settings_dialog_qt.py # Settings dialog (PySide6)
│   └── transcription_display_qt.py # Display widget
├── server/                   # Web display server
│   ├── web_display.py        # FastAPI server for OBS browser source
│   ├── nodejs/               # Optional Node.js WebSocket sync server
│   └── php/                  # Optional multi-user PHP server
│       ├── server.php        # Multi-user sync server
│       ├── display.php       # SSE web display (not recommended)
│       ├── display-polling.php # Polling web display (recommended)
│       └── README.md         # PHP server documentation
├── config/                   # Example configuration files
│   └── default_config.yaml   # Default settings template
├── main.py                   # GUI application entry point
├── main_cli.py               # CLI version for testing
└── pyproject.toml            # Dependencies and build config

Development Commands

Installation and Setup

# Install dependencies (creates .venv automatically)
uv sync

# Run the GUI application
uv run python main.py

# Run CLI version (headless, for testing)
uv run python main_cli.py

# List available audio devices
uv run python main_cli.py --list-devices

# Install with CUDA support (if needed)
uv pip install torch --index-url https://download.pytorch.org/whl/cu121

Building Executables

# Linux (CPU-only)
./build.sh

# Linux (with CUDA support - works on both GPU and CPU systems)
./build-cuda.sh

# Windows (CPU-only)
build.bat

# Windows (with CUDA support)
build-cuda.bat

# Manual build with PyInstaller
uv run pyinstaller local-transcription.spec

Important: CUDA builds can be created on systems without NVIDIA GPUs. The PyTorch CUDA runtime is bundled, and the app automatically falls back to CPU if no GPU is available.

Testing

# Run component tests
uv run python test_components.py

# Check CUDA availability
uv run python check_cuda.py

# Test web server manually
uv run python -m uvicorn server.web_display:app --reload

Architecture

Audio Processing Pipeline

  1. Audio Capture (client/audio_capture.py)

    • Captures audio from microphone/system using sounddevice
    • Handles automatic sample rate detection and resampling
    • Uses chunking with overlap for better transcription quality
    • Default: 3-second chunks with 0.5s overlap (see the sketch after this list)
  2. Noise Suppression (client/noise_suppression.py)

    • Applies noisereduce for background noise reduction
    • Voice Activity Detection (VAD) using webrtcvad
    • Skips silent segments to improve performance
  3. Transcription (client/transcription_engine.py)

    • Uses faster-whisper for efficient inference
    • Supports CPU, CUDA, and Apple MPS (Mac)
    • Models: tiny, base, small, medium, large
    • Thread-safe model loading with locks
  4. Display (gui/main_window_qt.py)

    • PySide6/Qt-based desktop GUI
    • Real-time transcription display with scrolling
    • Settings panel with live updates (no restart needed)
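
A minimal sketch of the chunk-with-overlap pattern from step 1. Names like on_chunk and the buffering details are illustrative; the real implementation lives in client/audio_capture.py:

import numpy as np
import sounddevice as sd

SAMPLE_RATE = 16000                      # Whisper expects 16 kHz mono
CHUNK_S, OVERLAP_S = 3.0, 0.5            # defaults described above
chunk_len = int(SAMPLE_RATE * CHUNK_S)
hop_len = int(SAMPLE_RATE * (CHUNK_S - OVERLAP_S))

buffer = np.zeros(0, dtype=np.float32)

def on_chunk(chunk: np.ndarray) -> None:
    """Stub: hand a 3 s window to the transcription engine."""
    print(f"chunk: {len(chunk) / SAMPLE_RATE:.1f}s")

def callback(indata, frames, time_info, status):
    global buffer
    buffer = np.concatenate([buffer, indata[:, 0]])
    # Emit full chunks, keeping OVERLAP_S of audio for the next window
    while len(buffer) >= chunk_len:
        on_chunk(buffer[:chunk_len].copy())
        buffer = buffer[hop_len:]

with sd.InputStream(samplerate=SAMPLE_RATE, channels=1,
                    dtype="float32", callback=callback):
    sd.sleep(10_000)  # capture for 10 seconds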

Web Server Architecture

Local Web Server (server/web_display.py)

  • Always runs when GUI starts (port 8080 by default)
  • FastAPI with WebSocket for real-time updates
  • Used for OBS browser source integration
  • Single-user (displays only local transcriptions)
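
A minimal sketch of that broadcast pattern (not the actual server/web_display.py code): each OBS browser source opens a WebSocket, and new transcription lines are pushed to all of them.

from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()
clients: set[WebSocket] = set()

@app.websocket("/ws")
async def ws_endpoint(ws: WebSocket):
    await ws.accept()
    clients.add(ws)
    try:
        while True:
            await ws.receive_text()   # keep the connection open
    except WebSocketDisconnect:
        clients.discard(ws)

async def broadcast(text: str) -> None:
    """Push a finished transcription line to every connected display."""
    for ws in list(clients):
        try:
            await ws.send_text(text)
        except Exception:
            clients.discard(ws)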

Multi-User Servers (Optional - for syncing across multiple users)

Three options available:

  1. PHP with Polling (server/php/display-polling.php) - RECOMMENDED for PHP

    • Works on ANY shared hosting (no buffering issues)
    • Uses HTTP polling instead of SSE (see the sketch after this list)
    • 1-2 second latency, very reliable
    • File-based storage, no database needed
  2. Node.js WebSocket Server (server/nodejs/) - BEST PERFORMANCE

    • Real-time WebSocket support (< 100ms latency)
    • Handles 100+ concurrent users
    • Requires VPS/cloud hosting (Railway, Heroku, DigitalOcean)
    • Much better than PHP for real-time applications
  3. PHP with SSE (server/php/display.php) - NOT RECOMMENDED

    • Has buffering issues on most shared hosting
    • PHP-FPM incompatibility
    • Use polling or Node.js instead

See server/COMPARISON.md and server/QUICK_FIX.md for details
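
For reference, the polling approach from option 1 reduces to a loop like this. The endpoint, parameters, and response shape below are illustrative, not the real server.php contract:

import time
import requests

URL = "https://your-domain.com/transcription/server.php"  # placeholder
last_id = 0

while True:
    resp = requests.get(URL, params={"room": "ROOM", "since": last_id},
                        timeout=5)
    for line in resp.json().get("lines", []):   # assumed response shape
        last_id = max(last_id, line["id"])
        print(line["text"])
    time.sleep(1)   # the source of the 1-2 second latency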

Configuration System

  • Config stored at ~/.local-transcription/config.yaml
  • Managed by client/config.py
  • Settings apply immediately without restart (except model changes)
  • YAML format with nested keys (e.g., transcription.model)
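
A sketch of dotted-key lookup against that file; the get() helper is hypothetical, and the real implementation is client/config.py:

from pathlib import Path
import yaml  # PyYAML

CONFIG_PATH = Path.home() / ".local-transcription" / "config.yaml"

def get(cfg: dict, dotted: str, default=None):
    """Resolve a nested key like 'transcription.model'."""
    node = cfg
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node

cfg = yaml.safe_load(CONFIG_PATH.read_text())
print(get(cfg, "transcription.model", "base"))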

Device Management

  • client/device_utils.py handles CPU/GPU detection
  • Auto-detects CUDA, MPS (Mac), or falls back to CPU
  • Compute types: float32 (best quality), float16 (GPU), int8 (fastest)
  • Thread-safe device selection
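
A sketch of that detection order; function names are illustrative, and the real logic lives in client/device_utils.py:

import torch

def pick_device() -> str:
    if torch.cuda.is_available():
        return "cuda"   # NVIDIA GPU
    if torch.backends.mps.is_available():
        return "mps"    # Apple Silicon
    return "cpu"        # universal fallback

def pick_compute_type(device: str) -> str:
    # float16 is the usual GPU choice; int8 is fastest on CPU;
    # float32 trades speed for the best quality on either
    return "float16" if device == "cuda" else "int8"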

Key Implementation Details

PyInstaller Build Configuration

  • local-transcription.spec controls build
  • UPX compression enabled for smaller executables
  • Hidden imports required for PySide6, faster-whisper, torch
  • Console mode enabled by default (set console=False to hide)

Threading Model

  • Main thread: Qt GUI event loop
  • Audio thread: Captures and processes audio chunks
  • Web server thread: Runs FastAPI server
  • Transcription: Runs in callback thread from audio capture
  • All transcription results communicated via Qt signals
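
A minimal sketch of that signal hand-off (class and function names are illustrative, not the actual gui/main_window_qt.py API):

from PySide6.QtCore import QObject, Signal

class TranscriptionBridge(QObject):
    # For cross-thread connections Qt queues the slot call onto the
    # receiver's event loop, so the GUI never touches audio-thread state
    text_ready = Signal(str)

bridge = TranscriptionBridge()
bridge.text_ready.connect(lambda text: print("display:", text))

def on_transcription_result(text: str) -> None:
    """Called from the audio-capture callback thread."""
    bridge.text_ready.emit(text)

on_transcription_result("hello world")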

Server Sync (Optional Multi-User Feature)

  • client/server_sync.py handles server communication
  • Toggle in Settings: "Enable Server Sync"
  • Sends transcriptions to the sync server (PHP or Node.js) via HTTP POST
  • Separate web display shows merged transcriptions from all users
  • Falls back gracefully if server unavailable
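
A sketch of that fallback behavior; the URL and payload fields are illustrative, and the real client is client/server_sync.py:

import requests

def sync_line(url: str, room: str, text: str) -> bool:
    """POST one transcription line; never crash the capture pipeline."""
    try:
        resp = requests.post(url, data={"room": room, "text": text},
                             timeout=3)
        return resp.ok
    except requests.RequestException:
        return False   # server unreachable: transcription continues locally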

Common Patterns

Adding a New Setting

  1. Add to config/default_config.yaml
  2. Update client/config.py if validation needed
  3. Add UI control in gui/settings_dialog_qt.py
  4. Apply setting in relevant component (no restart if possible)
  5. Emit signal to update display if needed
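
As a hypothetical walk-through of steps 3-5, suppose the new setting is display.font_size; the name, handler, and signal below are made up for illustration:

from PySide6.QtCore import QObject, Signal

class SettingsDialog(QObject):
    # Step 5: broadcast the change so the display can react immediately
    setting_changed = Signal(str, object)   # dotted key, new value

    def on_font_size_edited(self, value: int) -> None:
        # Steps 3-4: the UI control applies the new value live, no restart
        self.setting_changed.emit("display.font_size", value)

dialog = SettingsDialog()
dialog.setting_changed.connect(
    lambda key, value: print(f"apply {key} = {value}"))
dialog.on_font_size_edited(24)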

Modifying Transcription Display

  • Desktop display: gui/transcription_display_qt.py (Qt widget)
  • Web display: server/web_display.py (OBS browser source)
  • Update both when changing how transcripts are rendered

Adding a New Model Size

  • Model names (tiny, base, small, medium, large) are handled by client/transcription_engine.py
  • Add the new name to the model options in gui/settings_dialog_qt.py and config/default_config.yaml

Dependencies

Core:

  • faster-whisper: Optimized Whisper inference
  • torch: ML framework (CUDA-enabled via special index)
  • PySide6: Qt6 bindings for GUI
  • sounddevice: Cross-platform audio I/O
  • noisereduce, webrtcvad: Audio preprocessing

Web Server:

  • fastapi, uvicorn: Web server and ASGI
  • websockets: Real-time communication

Build:

  • pyinstaller: Create standalone executables
  • uv: Fast package manager

PyTorch CUDA Index:

  • Configured in pyproject.toml under [[tool.uv.index]]
  • Uses PyTorch's custom wheel repository for CUDA builds
  • Automatically installed with uv sync when using CUDA build scripts

Platform-Specific Notes

Linux

  • Uses PulseAudio/ALSA for audio
  • Build scripts use bash (.sh files)
  • Executable: dist/LocalTranscription/LocalTranscription

Windows

  • Uses Windows Audio/WASAPI
  • Build scripts use batch (.bat files)
  • Executable: dist\LocalTranscription\LocalTranscription.exe
  • Requires Visual C++ Redistributable on target systems

Cross-Building

  • Cross-compiling is not possible; build on the target platform
  • CI/CD should use platform-specific runners

Troubleshooting

Model Loading Issues

  • Models download to ~/.cache/huggingface/
  • First run requires internet connection
  • Check disk space (models: 75MB-3GB depending on size)

Audio Device Issues

  • Run uv run python main_cli.py --list-devices
  • Check permissions (microphone access)
  • Try different device indices in settings

GPU Not Detected

  • Run uv run python check_cuda.py
  • Install CUDA drivers (the CUDA toolkit is not needed; it is bundled in the build)
  • Verify PyTorch sees the GPU: uv run python -c "import torch; print(torch.cuda.is_available())"

Web Server Port Conflicts

  • Default port: 8080
  • Change in gui/main_window_qt.py or config
  • Use lsof -i :8080 (Linux) or netstat -ano | findstr :8080 (Windows)

OBS Integration

Local Display (Single User)

  1. Start Local Transcription app
  2. In OBS: Add "Browser" source
  3. URL: http://localhost:8080
  4. Set dimensions (e.g., 1920x300)

Multi-User Display (PHP Server - Polling)

  1. Deploy PHP server to web hosting
  2. Each user enables "Server Sync" in settings
  3. Enter same room name and passphrase
  4. In OBS: Add "Browser" source
  5. URL: https://your-domain.com/transcription/display-polling.php?room=ROOM&fade=10

Multi-User Display (Node.js Server)

  1. Deploy Node.js server (see server/nodejs/README.md)
  2. Each user configures Server URL: http://your-server:3000/api/send
  3. Enter same room name and passphrase
  4. In OBS: Add "Browser" source
  5. URL: http://your-server:3000/display?room=ROOM&fade=10

Performance Optimization

For Real-Time Transcription:

  • Use tiny or base model (faster)
  • Enable GPU if available (5-10x faster)
  • Increase chunk_duration for better accuracy (higher latency)
  • Decrease chunk_duration for lower latency (less context)
  • Enable VAD to skip silent audio
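
Two illustrative presets for those trade-offs. chunk_duration is the setting named above; the other key names are assumptions, so check config/default_config.yaml for the real schema:

low_latency = {
    "transcription": {"model": "tiny"},                     # fastest model
    "audio": {"chunk_duration": 1.5, "vad_enabled": True},  # short chunks, skip silence
}
high_accuracy = {
    "transcription": {"model": "small"},
    "audio": {"chunk_duration": 5.0, "vad_enabled": True},  # more context per chunk
}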

For Build Size Reduction:

  • Don't bundle models (download on demand)
  • Use CPU-only build if no GPU users
  • Enable UPX compression (already in spec)

Phase Status

  • ✅ Phase 1: Standalone desktop application (complete)
  • ✅ Web Server: Local OBS integration (complete)
  • ✅ Builds: PyInstaller executables (complete)
  • 🚧 Phase 2: Multi-user PHP server (functional, optional)
  • ⏸️ Phase 3+: Advanced features (see NEXT_STEPS.md)