CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

Local Transcription is a desktop application for real-time speech-to-text transcription designed for streamers. It uses Whisper models (via faster-whisper) to transcribe audio locally with optional multi-user server synchronization.

Key Features:

  • Standalone desktop GUI (PySide6/Qt)
  • Local transcription with CPU/GPU support
  • Built-in web server for OBS browser source integration
  • Optional PHP-based multi-user server for syncing transcriptions across users
  • Noise suppression and Voice Activity Detection (VAD)
  • Cross-platform builds (Linux/Windows) with PyInstaller

Project Structure

local-transcription/
├── client/                   # Core transcription logic
│   ├── audio_capture.py      # Audio input and buffering
│   ├── transcription_engine.py # Whisper model integration
│   ├── noise_suppression.py  # VAD and noise reduction
│   ├── device_utils.py       # CPU/GPU device management
│   ├── config.py             # Configuration management
│   └── server_sync.py        # Multi-user server sync client
├── gui/                      # Desktop application UI
│   ├── main_window_qt.py     # Main application window (PySide6)
│   ├── settings_dialog_qt.py # Settings dialog (PySide6)
│   └── transcription_display_qt.py # Display widget
├── server/                   # Web display server
│   ├── web_display.py        # FastAPI server for OBS browser source
│   ├── nodejs/               # Optional Node.js WebSocket sync server
│   └── php/                  # Optional multi-user PHP server
│       ├── server.php        # Multi-user sync server
│       ├── display.php       # SSE web display (not recommended)
│       ├── display-polling.php # Polling web display (recommended)
│       └── README.md         # PHP server documentation
├── config/                   # Example configuration files
│   └── default_config.yaml   # Default settings template
├── main.py                   # GUI application entry point
├── main_cli.py               # CLI version for testing
└── pyproject.toml            # Dependencies and build config

Development Commands

Installation and Setup

# Install dependencies (creates .venv automatically)
uv sync

# Run the GUI application
uv run python main.py

# Run CLI version (headless, for testing)
uv run python main_cli.py

# List available audio devices
uv run python main_cli.py --list-devices

# Install with CUDA support (if needed)
uv pip install torch --index-url https://download.pytorch.org/whl/cu121

Building Executables

# Linux (CPU-only)
./build.sh

# Linux (with CUDA support - works on both GPU and CPU systems)
./build-cuda.sh

# Windows (CPU-only)
build.bat

# Windows (with CUDA support)
build-cuda.bat

# Manual build with PyInstaller
uv run pyinstaller local-transcription.spec

Important: CUDA builds can be created on systems without NVIDIA GPUs. The PyTorch CUDA runtime is bundled, and the app automatically falls back to CPU if no GPU is available.

Testing

# Run component tests
uv run python test_components.py

# Check CUDA availability
uv run python check_cuda.py

# Test web server manually
uv run python -m uvicorn server.web_display:app --reload

Architecture

Audio Processing Pipeline

  1. Audio Capture (client/audio_capture.py)

    • Captures audio from microphone/system using sounddevice
    • Handles automatic sample rate detection and resampling
    • Uses chunking with overlap for better transcription quality
    • Default: 3-second chunks with 0.5s overlap (see the sketch after this list)
  2. Noise Suppression (client/noise_suppression.py)

    • Applies noisereduce for background noise reduction
    • Voice Activity Detection (VAD) using webrtcvad
    • Skips silent segments to improve performance
  3. Transcription (client/transcription_engine.py)

    • Uses faster-whisper for efficient inference
    • Supports CPU, CUDA, and Apple MPS (Mac)
    • Models: tiny, base, small, medium, large
    • Thread-safe model loading with locks
  4. Display (gui/main_window_qt.py)

    • PySide6/Qt-based desktop GUI
    • Real-time transcription display with scrolling
    • Settings panel with live updates (no restart needed)
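
A minimal sketch of the chunk-with-overlap pattern from step 1. Names like on_chunk and the buffering details are illustrative; the real implementation lives in client/audio_capture.py:

import numpy as np
import sounddevice as sd

SAMPLE_RATE = 16000                      # Whisper expects 16 kHz mono
CHUNK_S, OVERLAP_S = 3.0, 0.5            # defaults described above
chunk_len = int(SAMPLE_RATE * CHUNK_S)
hop_len = int(SAMPLE_RATE * (CHUNK_S - OVERLAP_S))

buffer = np.zeros(0, dtype=np.float32)

def on_chunk(chunk: np.ndarray) -> None:
    """Stub: hand a 3 s window to the transcription engine."""
    print(f"chunk: {len(chunk) / SAMPLE_RATE:.1f}s")

def callback(indata, frames, time_info, status):
    global buffer
    buffer = np.concatenate([buffer, indata[:, 0]])
    # Emit full chunks, keeping OVERLAP_S of audio for the next window
    while len(buffer) >= chunk_len:
        on_chunk(buffer[:chunk_len].copy())
        buffer = buffer[hop_len:]

with sd.InputStream(samplerate=SAMPLE_RATE, channels=1,
                    dtype="float32", callback=callback):
    sd.sleep(10_000)  # capture for 10 seconds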

Web Server Architecture

Local Web Server (server/web_display.py)

  • Always runs when GUI starts (port 8080 by default)
  • FastAPI with WebSocket for real-time updates
  • Used for OBS browser source integration
  • Single-user (displays only local transcriptions)
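
A minimal sketch of that broadcast pattern (not the actual server/web_display.py code): each OBS browser source opens a WebSocket, and new transcription lines are pushed to all of them.

from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()
clients: set[WebSocket] = set()

@app.websocket("/ws")
async def ws_endpoint(ws: WebSocket):
    await ws.accept()
    clients.add(ws)
    try:
        while True:
            await ws.receive_text()   # keep the connection open
    except WebSocketDisconnect:
        clients.discard(ws)

async def broadcast(text: str) -> None:
    """Push a finished transcription line to every connected display."""
    for ws in list(clients):
        try:
            await ws.send_text(text)
        except Exception:
            clients.discard(ws)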

Multi-User Servers (Optional - for syncing across multiple users)

Three options available:

  1. PHP with Polling (server/php/display-polling.php) - RECOMMENDED for PHP

    • Works on ANY shared hosting (no buffering issues)
    • Uses HTTP polling instead of SSE (see the sketch after this list)
    • 1-2 second latency, very reliable
    • File-based storage, no database needed
  2. Node.js WebSocket Server (server/nodejs/) - BEST PERFORMANCE

    • Real-time WebSocket support (< 100ms latency)
    • Handles 100+ concurrent users
    • Requires VPS/cloud hosting (Railway, Heroku, DigitalOcean)
    • Much better than PHP for real-time applications
  3. PHP with SSE (server/php/display.php) - NOT RECOMMENDED

    • Has buffering issues on most shared hosting
    • PHP-FPM incompatibility
    • Use polling or Node.js instead

See server/COMPARISON.md and server/QUICK_FIX.md for details
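
For reference, the polling approach from option 1 reduces to a loop like this. The endpoint, parameters, and response shape below are illustrative, not the real server.php contract:

import time
import requests

URL = "https://your-domain.com/transcription/server.php"  # placeholder
last_id = 0

while True:
    resp = requests.get(URL, params={"room": "ROOM", "since": last_id},
                        timeout=5)
    for line in resp.json().get("lines", []):   # assumed response shape
        last_id = max(last_id, line["id"])
        print(line["text"])
    time.sleep(1)   # the source of the 1-2 second latency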

Configuration System

  • Config stored at ~/.local-transcription/config.yaml
  • Managed by client/config.py
  • Settings apply immediately without restart (except model changes)
  • YAML format with nested keys (e.g., transcription.model)
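
A sketch of dotted-key lookup against that file; the get() helper is hypothetical, and the real implementation is client/config.py:

from pathlib import Path
import yaml  # PyYAML

CONFIG_PATH = Path.home() / ".local-transcription" / "config.yaml"

def get(cfg: dict, dotted: str, default=None):
    """Resolve a nested key like 'transcription.model'."""
    node = cfg
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node

cfg = yaml.safe_load(CONFIG_PATH.read_text())
print(get(cfg, "transcription.model", "base"))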

Device Management

  • client/device_utils.py handles CPU/GPU detection
  • Auto-detects CUDA, MPS (Mac), or falls back to CPU
  • Compute types: float32 (best quality), float16 (GPU), int8 (fastest)
  • Thread-safe device selection
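
A sketch of that detection order; function names are illustrative, and the real logic lives in client/device_utils.py:

import torch

def pick_device() -> str:
    if torch.cuda.is_available():
        return "cuda"   # NVIDIA GPU
    if torch.backends.mps.is_available():
        return "mps"    # Apple Silicon
    return "cpu"        # universal fallback

def pick_compute_type(device: str) -> str:
    # float16 is the usual GPU choice; int8 is fastest on CPU;
    # float32 trades speed for the best quality on either
    return "float16" if device == "cuda" else "int8"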

Key Implementation Details

PyInstaller Build Configuration

  • local-transcription.spec controls build
  • UPX compression enabled for smaller executables
  • Hidden imports required for PySide6, faster-whisper, torch
  • Console mode enabled by default (set console=False to hide)

Threading Model

  • Main thread: Qt GUI event loop
  • Audio thread: Captures and processes audio chunks
  • Web server thread: Runs FastAPI server
  • Transcription: Runs in callback thread from audio capture
  • All transcription results communicated via Qt signals
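
A minimal sketch of that signal hand-off (class and function names are illustrative, not the actual gui/main_window_qt.py API):

from PySide6.QtCore import QObject, Signal

class TranscriptionBridge(QObject):
    # For cross-thread connections Qt queues the slot call onto the
    # receiver's event loop, so the GUI never touches audio-thread state
    text_ready = Signal(str)

bridge = TranscriptionBridge()
bridge.text_ready.connect(lambda text: print("display:", text))

def on_transcription_result(text: str) -> None:
    """Called from the audio-capture callback thread."""
    bridge.text_ready.emit(text)

on_transcription_result("hello world")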

Server Sync (Optional Multi-User Feature)

  • client/server_sync.py handles server communication
  • Toggle in Settings: "Enable Server Sync"
  • Sends transcriptions to the sync server (PHP or Node.js) via HTTP POST
  • Separate web display shows merged transcriptions from all users
  • Falls back gracefully if server unavailable
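
A sketch of that fallback behavior; the URL and payload fields are illustrative, and the real client is client/server_sync.py:

import requests

def sync_line(url: str, room: str, text: str) -> bool:
    """POST one transcription line; never crash the capture pipeline."""
    try:
        resp = requests.post(url, data={"room": room, "text": text},
                             timeout=3)
        return resp.ok
    except requests.RequestException:
        return False   # server unreachable: transcription continues locally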

Common Patterns

Adding a New Setting

  1. Add to config/default_config.yaml
  2. Update client/config.py if validation needed
  3. Add UI control in gui/settings_dialog_qt.py
  4. Apply setting in relevant component (no restart if possible)
  5. Emit signal to update display if needed
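
As a hypothetical walk-through of steps 3-5, suppose the new setting is display.font_size; the name, handler, and signal below are made up for illustration:

from PySide6.QtCore import QObject, Signal

class SettingsDialog(QObject):
    # Step 5: broadcast the change so the display can react immediately
    setting_changed = Signal(str, object)   # dotted key, new value

    def on_font_size_edited(self, value: int) -> None:
        # Steps 3-4: the UI control applies the new value live, no restart
        self.setting_changed.emit("display.font_size", value)

dialog = SettingsDialog()
dialog.setting_changed.connect(
    lambda key, value: print(f"apply {key} = {value}"))
dialog.on_font_size_edited(24)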

Modifying Transcription Display

  • Desktop display: gui/transcription_display_qt.py (Qt widget)
  • Web display: server/web_display.py (OBS browser source)
  • Update both when changing how transcripts are rendered

Adding a New Model Size

  • Model names (tiny, base, small, medium, large) are handled by client/transcription_engine.py
  • Add the new name to the model options in gui/settings_dialog_qt.py and config/default_config.yaml

Dependencies

Core:

  • faster-whisper: Optimized Whisper inference
  • torch: ML framework (CUDA-enabled via special index)
  • PySide6: Qt6 bindings for GUI
  • sounddevice: Cross-platform audio I/O
  • noisereduce, webrtcvad: Audio preprocessing

Web Server:

  • fastapi, uvicorn: Web server and ASGI
  • websockets: Real-time communication

Build:

  • pyinstaller: Create standalone executables
  • uv: Fast package manager

PyTorch CUDA Index:

  • Configured in pyproject.toml under [[tool.uv.index]]
  • Uses PyTorch's custom wheel repository for CUDA builds
  • Automatically installed with uv sync when using CUDA build scripts

Platform-Specific Notes

Linux

  • Uses PulseAudio/ALSA for audio
  • Build scripts use bash (.sh files)
  • Executable: dist/LocalTranscription/LocalTranscription

Windows

  • Uses Windows Audio/WASAPI
  • Build scripts use batch (.bat files)
  • Executable: dist\LocalTranscription\LocalTranscription.exe
  • Requires Visual C++ Redistributable on target systems

Cross-Building

  • Cross-compiling is not possible; build on the target platform
  • CI/CD should use platform-specific runners

Troubleshooting

Model Loading Issues

  • Models download to ~/.cache/huggingface/
  • First run requires internet connection
  • Check disk space (models: 75MB-3GB depending on size)

Audio Device Issues

  • Run uv run python main_cli.py --list-devices
  • Check permissions (microphone access)
  • Try different device indices in settings

GPU Not Detected

  • Run uv run python check_cuda.py
  • Install CUDA drivers (the CUDA toolkit is not needed; it is bundled in the build)
  • Verify PyTorch sees the GPU: uv run python -c "import torch; print(torch.cuda.is_available())"

Web Server Port Conflicts

  • Default port: 8080
  • Change in gui/main_window_qt.py or config
  • Use lsof -i :8080 (Linux) or netstat -ano | findstr :8080 (Windows)

OBS Integration

Local Display (Single User)

  1. Start Local Transcription app
  2. In OBS: Add "Browser" source
  3. URL: http://localhost:8080
  4. Set dimensions (e.g., 1920x300)

Multi-User Display (PHP Server - Polling)

  1. Deploy PHP server to web hosting
  2. Each user enables "Server Sync" in settings
  3. Enter same room name and passphrase
  4. In OBS: Add "Browser" source
  5. URL: https://your-domain.com/transcription/display-polling.php?room=ROOM&fade=10

Multi-User Display (Node.js Server)

  1. Deploy Node.js server (see server/nodejs/README.md)
  2. Each user configures Server URL: http://your-server:3000/api/send
  3. Enter same room name and passphrase
  4. In OBS: Add "Browser" source
  5. URL: http://your-server:3000/display?room=ROOM&fade=10

Performance Optimization

For Real-Time Transcription:

  • Use tiny or base model (faster)
  • Enable GPU if available (5-10x faster)
  • Increase chunk_duration for better accuracy (higher latency)
  • Decrease chunk_duration for lower latency (less context)
  • Enable VAD to skip silent audio
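
Two illustrative presets for those trade-offs. chunk_duration is the setting named above; the other key names are assumptions, so check config/default_config.yaml for the real schema:

low_latency = {
    "transcription": {"model": "tiny"},                     # fastest model
    "audio": {"chunk_duration": 1.5, "vad_enabled": True},  # short chunks, skip silence
}
high_accuracy = {
    "transcription": {"model": "small"},
    "audio": {"chunk_duration": 5.0, "vad_enabled": True},  # more context per chunk
}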

For Build Size Reduction:

  • Don't bundle models (download on demand)
  • Use CPU-only build if no GPU users
  • Enable UPX compression (already in spec)

Phase Status

  • ✅ Phase 1: Standalone desktop application (complete)
  • ✅ Web Server: Local OBS integration (complete)
  • ✅ Builds: PyInstaller executables (complete)
  • 🚧 Phase 2: Multi-user PHP server (functional, optional)
  • ⏸️ Phase 3+: Advanced features (see NEXT_STEPS.md)