CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

Local Transcription is a desktop application that provides real-time speech-to-text for streamers. It uses Whisper models (via faster-whisper) to transcribe audio locally, with optional multi-user server synchronization.

Key Features:

  • Standalone desktop GUI (PySide6/Qt)
  • Local transcription with CPU/GPU support
  • Built-in web server for OBS browser source integration
  • Optional Node.js-based multi-user server for syncing transcriptions across users
  • Noise suppression and Voice Activity Detection (VAD)
  • Cross-platform builds (Linux/Windows) with PyInstaller

Project Structure

local-transcription/
├── client/                   # Core transcription logic
│   ├── audio_capture.py      # Audio input and buffering
│   ├── transcription_engine.py # Whisper model integration
│   ├── noise_suppression.py  # VAD and noise reduction
│   ├── device_utils.py       # CPU/GPU device management
│   ├── config.py             # Configuration management
│   └── server_sync.py        # Multi-user server sync client
├── gui/                      # Desktop application UI
│   ├── main_window_qt.py     # Main application window (PySide6)
│   ├── settings_dialog_qt.py # Settings dialog (PySide6)
│   └── transcription_display_qt.py # Display widget
├── server/                   # Web display servers
│   ├── web_display.py        # FastAPI server for OBS browser source (local)
│   └── nodejs/               # Optional multi-user Node.js server
│       ├── server.js         # Multi-user sync server with WebSocket
│       ├── package.json      # Node.js dependencies
│       └── README.md         # Server deployment documentation
├── config/                   # Example configuration files
│   └── default_config.yaml   # Default settings template
├── main.py                   # GUI application entry point
├── main_cli.py               # CLI version for testing
└── pyproject.toml            # Dependencies and build config

Development Commands

Installation and Setup

# Install dependencies (creates .venv automatically)
uv sync

# Run the GUI application
uv run python main.py

# Run CLI version (headless, for testing)
uv run python main_cli.py

# List available audio devices
uv run python main_cli.py --list-devices

# Install with CUDA support (if needed)
uv pip install torch --index-url https://download.pytorch.org/whl/cu121

Building Executables

# Linux (includes CUDA support - works on both GPU and CPU systems)
./build.sh

# Windows (includes CUDA support - works on both GPU and CPU systems)
build.bat

# Manual build with PyInstaller
uv sync                          # Install dependencies (includes CUDA PyTorch)
uv pip uninstall -q enum34       # Remove incompatible enum34 package
uv run pyinstaller local-transcription.spec

Important: All builds include CUDA support via pyproject.toml configuration. CUDA builds can be created on systems without NVIDIA GPUs. The PyTorch CUDA runtime is bundled, and the app automatically falls back to CPU if no GPU is available.

Testing

# Run component tests
uv run python test_components.py

# Check CUDA availability
uv run python check_cuda.py

# Test web server manually
uv run python -m uvicorn server.web_display:app --reload

Architecture

Audio Processing Pipeline

  1. Audio Capture (client/audio_capture.py)

    • Captures audio from microphone/system using sounddevice
    • Handles automatic sample rate detection and resampling
    • Uses chunking with overlap for better transcription quality
    • Default: 3-second chunks with 0.5s overlap (see the pipeline sketch after this list)
  2. Noise Suppression (client/noise_suppression.py)

    • Applies noisereduce for background noise reduction
    • Voice Activity Detection (VAD) using webrtcvad
    • Skips silent segments to improve performance
  3. Transcription (client/transcription_engine.py)

    • Uses faster-whisper for efficient inference
    • Supports CPU, CUDA, and Apple MPS (Mac)
    • Models: tiny, base, small, medium, large
    • Thread-safe model loading with locks
  4. Display (gui/main_window_qt.py)

    • PySide6/Qt-based desktop GUI
    • Real-time transcription display with scrolling
    • Settings panel with live updates (no restart needed)
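
The stages above compose into a capture → VAD → transcribe loop. The sketch below is illustrative only (it is not the app's actual code, and the names are hypothetical); it assumes the default 3s/0.5s chunking and uses a blocking capture loop rather than the real callback-based design:

import numpy as np
import sounddevice as sd
import webrtcvad
from faster_whisper import WhisperModel

SAMPLE_RATE = 16000      # Whisper expects 16 kHz mono
CHUNK_SECONDS = 3.0      # default chunk duration
OVERLAP_SECONDS = 0.5    # audio carried over between chunks

model = WhisperModel("base", device="auto", compute_type="int8")
vad = webrtcvad.Vad(2)   # aggressiveness 0-3

def has_speech(chunk):
    # webrtcvad expects 10/20/30 ms frames of 16-bit mono PCM
    pcm = (chunk * 32767).astype(np.int16).tobytes()
    frame = int(SAMPLE_RATE * 0.03) * 2  # 30 ms frames, 2 bytes per sample
    return any(vad.is_speech(pcm[i:i + frame], SAMPLE_RATE)
               for i in range(0, len(pcm) - frame, frame))

overlap = np.zeros(0, dtype=np.float32)
while True:
    audio = sd.rec(int(SAMPLE_RATE * CHUNK_SECONDS), samplerate=SAMPLE_RATE,
                   channels=1, dtype="float32")
    sd.wait()                                  # blocking capture of one chunk
    chunk = np.concatenate([overlap, audio[:, 0]])
    overlap = chunk[-int(SAMPLE_RATE * OVERLAP_SECONDS):]
    if not has_speech(chunk):
        continue                               # VAD: skip silent chunks
    segments, _ = model.transcribe(chunk)
    for seg in segments:
        print(seg.text)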

Web Server Architecture

Local Web Server (server/web_display.py)

  • Always runs when GUI starts (port 8080 by default)
  • FastAPI with WebSocket for real-time updates
  • Used for OBS browser source integration
  • Single-user (displays only local transcriptions)
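
The core of such a server is a WebSocket endpoint plus a broadcast helper. A minimal sketch (not the actual web_display.py; the route name is assumed):

from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()
clients = set()

@app.websocket("/ws")
async def ws_endpoint(ws: WebSocket):
    await ws.accept()
    clients.add(ws)
    try:
        while True:
            await ws.receive_text()   # keep the connection open
    except WebSocketDisconnect:
        clients.discard(ws)

async def broadcast(text: str):
    # Called for each new transcription result
    for ws in list(clients):
        try:
            await ws.send_json({"text": text})
        except Exception:
            clients.discard(ws)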

Multi-User Server (Optional - for syncing across multiple users)

Node.js WebSocket Server (server/nodejs/) - RECOMMENDED

  • Real-time WebSocket support (< 100ms latency)
  • Handles 100+ concurrent users
  • Easy deployment to VPS/cloud hosting (Railway, Heroku, DigitalOcean, or any VPS)
  • Configurable display options via URL parameters:
    • timestamps=true/false - Show/hide timestamps
    • maxlines=50 - Maximum visible lines (prevents scroll bars in OBS)
    • fontsize=16 - Font size in pixels
    • fontfamily=Arial - Font family
    • fade=10 - Seconds before text fades (0 = never)
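
For example, a display URL combining these parameters (room name illustrative):

http://your-server:3000/display?room=myroom&timestamps=false&maxlines=50&fontsize=16&fontfamily=Arial&fade=10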

See server/nodejs/README.md for deployment instructions

Configuration System

  • Config stored at ~/.local-transcription/config.yaml
  • Managed by client/config.py
  • Settings apply immediately without restart (except model changes)
  • YAML format with nested keys (e.g., transcription.model)
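
A hypothetical excerpt of that file (the authoritative keys live in config/default_config.yaml and may differ):

transcription:
  model: base        # tiny | base | small | medium | large
server:
  port: 8080         # local web display port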

Device Management

  • client/device_utils.py handles CPU/GPU detection
  • Auto-detects CUDA, MPS (Mac), or falls back to CPU
  • Compute types: float32 (best quality), float16 (GPU), int8 (fastest)
  • Thread-safe device selection
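
The detection order described above amounts to the following (a sketch, not the actual device_utils.py):

import torch

def detect_device() -> str:
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

def default_compute_type(device: str) -> str:
    # float16 on GPU; int8 is the fast CPU option, float32 the highest quality
    return "float16" if device == "cuda" else "int8"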

Key Implementation Details

PyInstaller Build Configuration

  • local-transcription.spec controls build
  • UPX compression enabled for smaller executables
  • Hidden imports required for PySide6, faster-whisper, torch
  • Console mode enabled by default (set console=False to hide)

Threading Model

  • Main thread: Qt GUI event loop
  • Audio thread: Captures and processes audio chunks
  • Web server thread: Runs FastAPI server
  • Transcription: Runs in the audio capture's callback thread
  • All transcription results communicated via Qt signals
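
Because results originate off the GUI thread, they must cross threads via Qt signals; for cross-thread connections Qt queues the slot call onto the GUI thread automatically. A minimal sketch (hypothetical names):

from PySide6.QtCore import QObject, Signal

class TranscriptionBridge(QObject):
    text_ready = Signal(str)   # emitted from the transcription thread

bridge = TranscriptionBridge()
# Connected in the GUI thread; the slot therefore runs on the GUI thread
bridge.text_ready.connect(lambda text: print("display:", text))

# Later, from the audio-capture callback thread:
bridge.text_ready.emit("transcribed text")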

Server Sync (Optional Multi-User Feature)

  • client/server_sync.py handles server communication
  • Toggle in Settings: "Enable Server Sync"
  • Sends transcriptions to Node.js server via HTTP POST
  • Real-time updates via WebSocket to display page
  • Per-speaker font support (Web-Safe, Google Fonts, Custom uploads)
  • Falls back gracefully if server unavailable
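
The client-side send is a plain HTTP POST with graceful failure handling. A sketch (the real payload fields in server_sync.py may differ):

import requests

def send_transcription(server_url, room, passphrase, text):
    # server_url is e.g. http://your-server:3000/api/send (see OBS Integration)
    payload = {"room": room, "passphrase": passphrase, "text": text}
    try:
        requests.post(server_url, json=payload, timeout=2)
    except requests.RequestException:
        pass   # server unavailable: drop silently, keep local display running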

Common Patterns

Adding a New Setting

  1. Add to config/default_config.yaml
  2. Update client/config.py if validation needed
  3. Add UI control in gui/settings_dialog_qt.py
  4. Apply the setting in the relevant component (live, without a restart where possible)
  5. Emit signal to update display if needed
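
For instance, a hypothetical display.fade_seconds setting would begin as a YAML entry (names illustrative; the real config API lives in client/config.py):

display:
  fade_seconds: 10   # 0 = never fade

It would then be read via the nested key display.fade_seconds in the component that applies it, following the transcription.model pattern above.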

Modifying Transcription Display

  • Desktop display widget: gui/transcription_display_qt.py
  • OBS browser-source display: server/web_display.py (local) or the Node.js display page (multi-user)

Adding a New Model Size

  • Model names (tiny, base, small, medium, large) are passed directly to faster-whisper
  • Add the new name to the model list in the settings UI (gui/settings_dialog_qt.py) and verify client/transcription_engine.py accepts it

Dependencies

Core:

  • faster-whisper: Optimized Whisper inference
  • torch: ML framework (CUDA-enabled via special index)
  • PySide6: Qt6 bindings for GUI
  • sounddevice: Cross-platform audio I/O
  • noisereduce, webrtcvad: Audio preprocessing

Web Server:

  • fastapi, uvicorn: Web server and ASGI
  • websockets: Real-time communication

Build:

  • pyinstaller: Create standalone executables
  • uv: Fast package manager

PyTorch CUDA Index:

  • Configured in pyproject.toml under [[tool.uv.index]]
  • Uses PyTorch's custom wheel repository for CUDA builds
  • Automatically installed with uv sync when using CUDA build scripts
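
A representative excerpt of that configuration (the project's actual pyproject.toml may differ):

[[tool.uv.index]]
name = "pytorch-cuda"
url = "https://download.pytorch.org/whl/cu121"
explicit = true

[tool.uv.sources]
torch = { index = "pytorch-cuda" }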

Platform-Specific Notes

Linux

  • Uses PulseAudio/ALSA for audio
  • Build scripts use bash (.sh files)
  • Executable: dist/LocalTranscription/LocalTranscription

Windows

  • Uses Windows Audio/WASAPI
  • Build scripts use batch (.bat files)
  • Executable: dist\LocalTranscription\LocalTranscription.exe
  • Requires Visual C++ Redistributable on target systems

Cross-Building

  • Cannot cross-compile - must build on target platform
  • CI/CD should use platform-specific runners

Troubleshooting

Model Loading Issues

  • Models download to ~/.cache/huggingface/
  • First run requires internet connection
  • Check disk space (models: 75MB-3GB depending on size)

Audio Device Issues

  • Run uv run python main_cli.py --list-devices
  • Check permissions (microphone access)
  • Try different device indices in settings

GPU Not Detected

  • Run uv run python check_cuda.py
  • Install NVIDIA GPU drivers (the CUDA toolkit is not needed; its runtime is bundled in the build)
  • Verify PyTorch sees GPU: python -c "import torch; print(torch.cuda.is_available())"

Web Server Port Conflicts

  • Default port: 8080
  • Change in gui/main_window_qt.py or config
  • Use lsof -i :8080 (Linux) or netstat -ano | findstr :8080 (Windows)

OBS Integration

Local Display (Single User)

  1. Start Local Transcription app
  2. In OBS: Add "Browser" source
  3. URL: http://localhost:8080
  4. Set dimensions (e.g., 1920x300)

Multi-User Display (Node.js Server)

  1. Deploy Node.js server (see server/nodejs/README.md)
  2. Each user configures Server URL: http://your-server:3000/api/send
  3. Enter same room name and passphrase
  4. In OBS: Add "Browser" source
  5. URL: http://your-server:3000/display?room=ROOM&fade=10&timestamps=true&maxlines=50&fontsize=16
  6. Customize URL parameters as needed:
    • timestamps=false - Hide timestamps
    • maxlines=30 - Show max 30 lines (prevents scroll bars)
    • fontsize=18 - Larger font
    • fontfamily=Courier - Different font

Performance Optimization

For Real-Time Transcription:

  • Use tiny or base model (faster)
  • Enable GPU if available (5-10x faster)
  • Increase chunk_duration for better accuracy (higher latency)
  • Decrease chunk_duration for lower latency (less context)
  • Enable VAD to skip silent audio
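
For example, a lower-latency setup might look like this in config.yaml (key nesting is illustrative; check config/default_config.yaml for the real layout):

transcription:
  model: tiny          # fastest model
audio:
  chunk_duration: 2.0  # seconds; lower = less latency but less context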

For Build Size Reduction:

  • Don't bundle models (download on demand)
  • Use CPU-only build if no GPU users
  • Enable UPX compression (already in spec)

Phase Status

  • Phase 1: Standalone desktop application (complete)
  • Web Server: Local OBS integration (complete)
  • Builds: PyInstaller executables (complete)
  • Phase 2: Multi-user Node.js server (complete, optional)
  • Phase 3+: Advanced features (planned; see NEXT_STEPS.md)