CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

Local Transcription is a desktop application for real-time speech-to-text transcription designed for streamers. It uses Whisper models (via faster-whisper) to transcribe audio locally with optional multi-user server synchronization.

Key Features:

  • Standalone desktop GUI (PySide6/Qt)
  • Local transcription with CPU/GPU support
  • Built-in web server for OBS browser source integration
  • Optional Node.js-based multi-user server for syncing transcriptions across users
  • Noise suppression and Voice Activity Detection (VAD)
  • Cross-platform builds (Linux/Windows) with PyInstaller

Project Structure

local-transcription/
├── client/                   # Core transcription logic
│   ├── audio_capture.py      # Audio input and buffering
│   ├── transcription_engine.py # Whisper model integration
│   ├── noise_suppression.py  # VAD and noise reduction
│   ├── device_utils.py       # CPU/GPU device management
│   ├── config.py             # Configuration management
│   └── server_sync.py        # Multi-user server sync client
├── gui/                      # Desktop application UI
│   ├── main_window_qt.py     # Main application window (PySide6)
│   ├── settings_dialog_qt.py # Settings dialog (PySide6)
│   └── transcription_display_qt.py # Display widget
├── server/                   # Web display servers
│   ├── web_display.py        # FastAPI server for OBS browser source (local)
│   └── nodejs/               # Optional multi-user Node.js server
│       ├── server.js         # Multi-user sync server with WebSocket
│       ├── package.json      # Node.js dependencies
│       └── README.md         # Server deployment documentation
├── config/                   # Example configuration files
│   └── default_config.yaml   # Default settings template
├── main.py                   # GUI application entry point
├── main_cli.py               # CLI version for testing
└── pyproject.toml            # Dependencies and build config

Development Commands

Installation and Setup

# Install dependencies (creates .venv automatically)
uv sync

# Run the GUI application
uv run python main.py

# Run CLI version (headless, for testing)
uv run python main_cli.py

# List available audio devices
uv run python main_cli.py --list-devices

# Manually install a specific CUDA wheel (normally unnecessary; uv sync already installs CUDA PyTorch via pyproject.toml)
uv pip install torch --index-url https://download.pytorch.org/whl/cu121

Building Executables

# Linux (includes CUDA support - works on both GPU and CPU systems)
./build.sh

# Windows (includes CUDA support - works on both GPU and CPU systems)
build.bat

# Manual build with PyInstaller
uv sync                          # Install dependencies (includes CUDA PyTorch)
uv pip uninstall -q enum34       # Remove incompatible enum34 package
uv run pyinstaller local-transcription.spec

Important: All builds include CUDA support via pyproject.toml configuration. CUDA builds can be created on systems without NVIDIA GPUs. The PyTorch CUDA runtime is bundled, and the app automatically falls back to CPU if no GPU is available.

Testing

# Run component tests
uv run python test_components.py

# Check CUDA availability
uv run python check_cuda.py

# Test web server manually
uv run python -m uvicorn server.web_display:app --reload

Architecture

Audio Processing Pipeline

  1. Audio Capture (client/audio_capture.py)

    • Captures audio from microphone/system using sounddevice
    • Handles automatic sample rate detection and resampling
    • Uses chunking with overlap for better transcription quality
    • Default: 3-second chunks with 0.5s overlap (see the sketch after this list)
  2. Noise Suppression (client/noise_suppression.py)

    • Applies noisereduce for background noise reduction
    • Voice Activity Detection (VAD) using webrtcvad
    • Skips silent segments to improve performance
  3. Transcription (client/transcription_engine.py)

    • Uses faster-whisper for efficient inference
    • Supports CPU, CUDA, and Apple MPS (Mac)
    • Models: tiny, base, small, medium, large
    • Thread-safe model loading with locks
  4. Display (gui/main_window_qt.py)

    • PySide6/Qt-based desktop GUI
    • Real-time transcription display with scrolling
    • Settings panel with live updates (no restart needed)
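
The chunking-with-overlap behavior from step 1 can be sketched in a few lines. This is a minimal, hypothetical illustration assuming sounddevice and numpy are installed; the constants and the handle_chunk callback are invented names, not the app's actual code:

# Hypothetical overlapping-chunk capture (illustrative, not the app's code)
import numpy as np
import sounddevice as sd

SAMPLE_RATE = 16000         # rate expected by Whisper models
CHUNK_SECONDS = 3.0         # documented default chunk length
OVERLAP_SECONDS = 0.5       # documented default overlap

chunk_len = int(SAMPLE_RATE * CHUNK_SECONDS)
overlap_len = int(SAMPLE_RATE * OVERLAP_SECONDS)
buffer = np.zeros(0, dtype=np.float32)

def handle_chunk(chunk):
    print(f"chunk of {len(chunk) / SAMPLE_RATE:.1f}s ready")

def on_audio(indata, frames, time_info, status):
    # Accumulate samples; once a full chunk is available, emit it and
    # keep the tail so consecutive chunks overlap by 0.5 s.
    global buffer
    buffer = np.concatenate([buffer, indata[:, 0]])
    if len(buffer) >= chunk_len:
        handle_chunk(buffer[:chunk_len])
        buffer = buffer[chunk_len - overlap_len:]

with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype="float32", callback=on_audio):
    sd.sleep(10_000)        # capture for ten seconds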

Web Server Architecture

Local Web Server (server/web_display.py)

  • Always runs when GUI starts (port 8080 by default)
  • FastAPI with WebSocket for real-time updates
  • Used for OBS browser source integration
  • Single-user (displays only local transcriptions)
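
A stripped-down sketch of this pattern, assuming fastapi is installed; the /ws path, client list, and broadcast helper are illustrative and not the actual contents of server/web_display.py:

# Hypothetical FastAPI WebSocket push (illustrative only)
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()
clients: list[WebSocket] = []

@app.websocket("/ws")
async def ws_endpoint(ws: WebSocket):
    await ws.accept()
    clients.append(ws)
    try:
        while True:
            await ws.receive_text()   # keep the socket open
    except WebSocketDisconnect:
        pass
    finally:
        clients.remove(ws)

async def broadcast(text):
    # Push a new transcription line to every connected browser source
    for ws in list(clients):
        await ws.send_text(text)

Run it the same way as the manual web server test shown under Testing above.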

Multi-User Server (Optional - for syncing across multiple users)

Node.js WebSocket Server (server/nodejs/) - RECOMMENDED

  • Real-time WebSocket support (< 100ms latency)
  • Handles 100+ concurrent users
  • Easy deployment to VPS/cloud hosting (Railway, Heroku, DigitalOcean, or any VPS)
  • Configurable display options via URL parameters:
    • timestamps=true/false - Show/hide timestamps
    • maxlines=50 - Maximum visible lines (prevents scroll bars in OBS)
    • fontsize=16 - Font size in pixels
    • fontfamily=Arial - Font family
    • fade=10 - Seconds before text fades (0 = never)

See server/nodejs/README.md for deployment instructions

Configuration System

  • Config stored at ~/.local-transcription/config.yaml
  • Managed by client/config.py
  • Settings apply immediately without restart (except model changes)
  • YAML format with nested keys (e.g., transcription.model)
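
A nested dotted key like transcription.model reduces to a short lookup. A hedged sketch assuming PyYAML; the helper name is invented, and client/config.py may be structured differently:

# Hypothetical nested-key lookup (illustrative only)
from pathlib import Path
import yaml

CONFIG_PATH = Path.home() / ".local-transcription" / "config.yaml"

def get_setting(dotted_key, default=None):
    # Resolve a key such as "transcription.model" one level at a time
    with open(CONFIG_PATH) as f:
        node = yaml.safe_load(f) or {}
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node

print(get_setting("transcription.model", "base"))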

Device Management

  • client/device_utils.py handles CPU/GPU detection
  • Auto-detects CUDA, MPS (Mac), or falls back to CPU
  • Compute types: float32 (best quality), float16 (GPU), int8 (fastest)
  • Thread-safe device selection
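
Put together, the detection and compute-type choices look roughly like this sketch, assuming torch and faster-whisper are installed; the model size, audio file name, and simplified device check (MPS omitted) are illustrative:

# Hypothetical device selection plus model load (illustrative only)
import torch
from faster_whisper import WhisperModel

device = "cuda" if torch.cuda.is_available() else "cpu"   # MPS check omitted
compute_type = "float16" if device == "cuda" else "int8"

model = WhisperModel("base", device=device, compute_type=compute_type)
segments, info = model.transcribe("sample.wav")           # any local audio file
for seg in segments:
    print(f"[{seg.start:.2f} -> {seg.end:.2f}] {seg.text}")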

Key Implementation Details

PyInstaller Build Configuration

  • local-transcription.spec controls build
  • UPX compression enabled for smaller executables
  • Hidden imports required for PySide6, faster-whisper, torch
  • Console mode enabled by default (set console=False to hide)

Threading Model

  • Main thread: Qt GUI event loop
  • Audio thread: Captures and processes audio chunks
  • Web server thread: Runs FastAPI server
  • Transcription: Runs in callback thread from audio capture
  • All transcription results communicated via Qt signals
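
The signal handoff can be shown with a tiny PySide6 sketch; the class and signal names here are hypothetical, not the app's own:

# Hypothetical cross-thread handoff via a Qt signal (illustrative only)
from PySide6.QtCore import QObject, Signal

class TranscriptionBridge(QObject):
    # Emitted from the transcription callback thread; with a running Qt
    # event loop, cross-thread connections are queued onto the GUI thread.
    text_ready = Signal(str)

bridge = TranscriptionBridge()
bridge.text_ready.connect(lambda text: print("display:", text))

# Called from the audio/transcription thread:
bridge.text_ready.emit("hello world")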

Server Sync (Optional Multi-User Feature)

  • client/server_sync.py handles server communication
  • Toggle in Settings: "Enable Server Sync"
  • Sends transcriptions to the multi-user server via HTTP POST (sketched below)
  • Separate web display shows merged transcriptions from all users
  • Falls back gracefully if server unavailable
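
A hedged sketch of that POST-with-fallback behavior, assuming the requests library; the payload fields are guesses, and only the /api/send endpoint shape appears elsewhere in this document:

# Hypothetical sync client (illustrative only)
import requests

def send_line(text):
    try:
        requests.post(
            "http://your-server:3000/api/send",   # endpoint form from the OBS section
            json={"room": "ROOM", "passphrase": "secret", "text": text},
            timeout=2,
        )
    except requests.RequestException:
        pass   # server unreachable: keep transcribing locally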

Common Patterns

Adding a New Setting

  1. Add to config/default_config.yaml
  2. Update client/config.py if validation needed
  3. Add UI control in gui/settings_dialog_qt.py
  4. Apply setting in relevant component (no restart if possible)
  5. Emit signal to update display if needed

Modifying Transcription Display

  • Desktop widget: gui/transcription_display_qt.py
  • Browser overlay served to OBS: server/web_display.py

Adding a New Model Size

  • Model choices (tiny, base, small, medium, large) are handled in client/transcription_engine.py
  • Expose the new option in gui/settings_dialog_qt.py

Dependencies

Core:

  • faster-whisper: Optimized Whisper inference
  • torch: ML framework (CUDA-enabled via special index)
  • PySide6: Qt6 bindings for GUI
  • sounddevice: Cross-platform audio I/O
  • noisereduce, webrtcvad: Audio preprocessing

Web Server:

  • fastapi, uvicorn: Web server and ASGI
  • websockets: Real-time communication

Build:

  • pyinstaller: Create standalone executables
  • uv: Fast package manager

PyTorch CUDA Index:

  • Configured in pyproject.toml under [[tool.uv.index]]
  • Uses PyTorch's custom wheel repository for CUDA builds
  • Automatically installed with uv sync (the build scripts run this step)
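
The shape of that configuration, sketched in uv's index syntax (the index name shown is invented; check the project's pyproject.toml for the real values):

[[tool.uv.index]]
name = "pytorch-cuda"
url = "https://download.pytorch.org/whl/cu121"
explicit = true

[tool.uv.sources]
torch = { index = "pytorch-cuda" }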

Platform-Specific Notes

Linux

  • Uses PulseAudio/ALSA for audio
  • Build scripts use bash (.sh files)
  • Executable: dist/LocalTranscription/LocalTranscription

Windows

  • Uses Windows Audio/WASAPI
  • Build scripts use batch (.bat files)
  • Executable: dist\LocalTranscription\LocalTranscription.exe
  • Requires Visual C++ Redistributable on target systems

Cross-Building

  • Cannot cross-compile - must build on target platform
  • CI/CD should use platform-specific runners

Troubleshooting

Model Loading Issues

  • Models download to ~/.cache/huggingface/
  • First run requires internet connection
  • Check disk space (models: 75MB-3GB depending on size)

Audio Device Issues

  • Run uv run python main_cli.py --list-devices
  • Check permissions (microphone access)
  • Try different device indices in settings

GPU Not Detected

  • Run uv run python check_cuda.py
  • Install NVIDIA GPU drivers (the CUDA toolkit is not needed; the runtime is bundled in the build)
  • Verify PyTorch sees GPU: uv run python -c "import torch; print(torch.cuda.is_available())"

Web Server Port Conflicts

  • Default port: 8080
  • Change in gui/main_window_qt.py or config
  • Use lsof -i :8080 (Linux) or netstat -ano | findstr :8080 (Windows)

OBS Integration

Local Display (Single User)

  1. Start Local Transcription app
  2. In OBS: Add "Browser" source
  3. URL: http://localhost:8080
  4. Set dimensions (e.g., 1920x300)

Multi-User Display (Node.js Server)

  1. Deploy Node.js server (see server/nodejs/README.md)
  2. Each user configures Server URL: http://your-server:3000/api/send
  3. Enter same room name and passphrase
  4. In OBS: Add "Browser" source
  5. URL: http://your-server:3000/display?room=ROOM&fade=10&timestamps=true&maxlines=50&fontsize=16
  6. Customize URL parameters as needed:
    • timestamps=false - Hide timestamps
    • maxlines=30 - Show max 30 lines (prevents scroll bars)
    • fontsize=18 - Larger font
    • fontfamily=Courier - Different font

Performance Optimization

For Real-Time Transcription:

  • Use tiny or base model (faster)
  • Enable GPU if available (5-10x faster)
  • Increase chunk_duration for better accuracy (higher latency)
  • Decrease chunk_duration for lower latency (less context)
  • Enable VAD to skip silent audio
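
The VAD gate amounts to only a few lines with webrtcvad; the frame length and aggressiveness below are illustrative choices:

# Hypothetical VAD gate (illustrative only)
import webrtcvad

SAMPLE_RATE = 16000
FRAME_MS = 30                                        # webrtcvad accepts 10/20/30 ms
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2     # 16-bit mono PCM

vad = webrtcvad.Vad(2)                               # 0 (least) .. 3 (most aggressive)

def has_speech(pcm):
    # True if any full 30 ms frame in the chunk contains speech
    frames = (pcm[i:i + FRAME_BYTES]
              for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES))
    return any(vad.is_speech(f, SAMPLE_RATE) for f in frames)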

For Build Size Reduction:

  • Don't bundle models (download on demand)
  • Use a CPU-only build if none of your users need GPU acceleration
  • Enable UPX compression (already in spec)

Phase Status

  • Phase 1: Standalone desktop application (complete)
  • Web Server: Local OBS integration (complete)
  • Builds: PyInstaller executables (complete)
  • Phase 2: Multi-user Node.js server (complete, optional)
  • Phase 3+: Advanced features (on hold; see NEXT_STEPS.md)