# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
Local Transcription is a desktop application for real-time speech-to-text transcription designed for streamers. It uses Whisper models (via faster-whisper) to transcribe audio locally with optional multi-user server synchronization.
**Key Features:**
- Standalone desktop GUI (PySide6/Qt)
- Local transcription with CPU/GPU support
- Built-in web server for OBS browser source integration
- Optional Node.js-based multi-user server for syncing transcriptions across users
- Noise suppression and Voice Activity Detection (VAD)
- Cross-platform builds (Linux/Windows) with PyInstaller
## Project Structure

```
local-transcription/
├── client/                          # Core transcription logic
│   ├── audio_capture.py             # Audio input and buffering
│   ├── transcription_engine.py      # Whisper model integration
│   ├── noise_suppression.py         # VAD and noise reduction
│   ├── device_utils.py              # CPU/GPU device management
│   ├── config.py                    # Configuration management
│   └── server_sync.py               # Multi-user server sync client
├── gui/                             # Desktop application UI
│   ├── main_window_qt.py            # Main application window (PySide6)
│   ├── settings_dialog_qt.py        # Settings dialog (PySide6)
│   └── transcription_display_qt.py  # Display widget
├── server/                          # Web display servers
│   ├── web_display.py               # FastAPI server for OBS browser source (local)
│   └── nodejs/                      # Optional multi-user Node.js server
│       ├── server.js                # Multi-user sync server with WebSocket
│       ├── package.json             # Node.js dependencies
│       └── README.md                # Server deployment documentation
├── config/                          # Example configuration files
│   └── default_config.yaml          # Default settings template
├── main.py                          # GUI application entry point
├── main_cli.py                      # CLI version for testing
└── pyproject.toml                   # Dependencies and build config
```
## Development Commands

### Installation and Setup

```bash
# Install dependencies (creates .venv automatically)
uv sync

# Run the GUI application
uv run python main.py

# Run CLI version (headless, for testing)
uv run python main_cli.py

# List available audio devices
uv run python main_cli.py --list-devices

# Install with CUDA support (if needed)
uv pip install torch --index-url https://download.pytorch.org/whl/cu121
```
### Building Executables

```bash
# Linux (includes CUDA support - works on both GPU and CPU systems)
./build.sh

# Windows (includes CUDA support - works on both GPU and CPU systems)
build.bat

# Manual build with PyInstaller
uv sync                      # Install dependencies (includes CUDA PyTorch)
uv pip uninstall -q enum34   # Remove incompatible enum34 package
uv run pyinstaller local-transcription.spec
```

**Important:** All builds include CUDA support via the `pyproject.toml` configuration. CUDA builds can be created on systems without NVIDIA GPUs: the PyTorch CUDA runtime is bundled, and the app automatically falls back to CPU if no GPU is available.
### Testing

```bash
# Run component tests
uv run python test_components.py

# Check CUDA availability
uv run python check_cuda.py

# Test web server manually
uv run python -m uvicorn server.web_display:app --reload
```
## Architecture

### Audio Processing Pipeline

1. **Audio Capture** (`client/audio_capture.py`)
   - Captures audio from the microphone/system using sounddevice
   - Handles automatic sample rate detection and resampling
   - Uses chunking with overlap for better transcription quality
   - Default: 3-second chunks with 0.5s overlap (see the sketch after this list)
2. **Noise Suppression** (`client/noise_suppression.py`)
   - Applies noisereduce for background noise reduction
   - Voice Activity Detection (VAD) using webrtcvad
   - Skips silent segments to improve performance
3. **Transcription** (`client/transcription_engine.py`)
   - Uses faster-whisper for efficient inference
   - Supports CPU, CUDA, and Apple MPS (Mac)
   - Models: tiny, base, small, medium, large
   - Thread-safe model loading with locks
4. **Display** (`gui/main_window_qt.py`)
   - PySide6/Qt-based desktop GUI
   - Real-time transcription display with scrolling
   - Settings panel with live updates (no restart needed)
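The overlap matters at chunk boundaries: a word split at the end of one chunk reappears intact at the start of the next. A minimal sketch of the idea, assuming 16 kHz mono float32 audio (illustrative helper, not the actual `audio_capture.py` code):

```python
import numpy as np

SAMPLE_RATE = 16000      # Hz; Whisper models expect 16 kHz input
CHUNK_SECONDS = 3.0      # default chunk duration
OVERLAP_SECONDS = 0.5    # default overlap between consecutive chunks

chunk_len = int(SAMPLE_RATE * CHUNK_SECONDS)
overlap_len = int(SAMPLE_RATE * OVERLAP_SECONDS)

def iter_chunks(buffer: np.ndarray):
    """Yield fixed-size chunks, each one starting overlap_len samples
    before the previous chunk ended."""
    step = chunk_len - overlap_len
    for start in range(0, max(len(buffer) - chunk_len + 1, 0), step):
        yield buffer[start:start + chunk_len]

audio = np.zeros(SAMPLE_RATE * 10, dtype=np.float32)  # 10 s of silence
for i, chunk in enumerate(iter_chunks(audio)):
    print(f"chunk {i} starts at {i * (CHUNK_SECONDS - OVERLAP_SECONDS):.1f}s")
```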
### Web Server Architecture

**Local Web Server** (`server/web_display.py`)
- Always runs when the GUI starts (port 8080 by default)
- FastAPI with WebSocket for real-time updates
- Used for OBS browser source integration
- Single-user (displays only local transcriptions)
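The local server follows the standard FastAPI WebSocket push pattern. A minimal sketch of that pattern, not the actual `web_display.py` implementation:

```python
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()
clients: set[WebSocket] = set()

@app.websocket("/ws")
async def ws_endpoint(ws: WebSocket):
    """Each OBS browser source holds one of these connections open."""
    await ws.accept()
    clients.add(ws)
    try:
        while True:
            await ws.receive_text()  # keep the connection alive
    except WebSocketDisconnect:
        clients.discard(ws)

async def broadcast(text: str):
    """Push a new transcription line to every connected client."""
    for ws in list(clients):
        try:
            await ws.send_json({"text": text})
        except Exception:
            clients.discard(ws)
```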
**Multi-User Server** (optional - for syncing transcriptions across multiple users)

**Node.js WebSocket Server** (`server/nodejs/`) - RECOMMENDED
- Real-time WebSocket support (< 100ms latency)
- Handles 100+ concurrent users
- Easy deployment to VPS/cloud hosting (Railway, Heroku, DigitalOcean, or any VPS)
- Configurable display options via URL parameters:
  - `timestamps=true/false` - Show/hide timestamps
  - `maxlines=50` - Maximum visible lines (prevents scroll bars in OBS)
  - `fontsize=16` - Font size in pixels
  - `fontfamily=Arial` - Font family
  - `fade=10` - Seconds before text fades (0 = never)
- See `server/nodejs/README.md` for deployment instructions
### Configuration System

- Config stored at `~/.local-transcription/config.yaml`
- Managed by `client/config.py`
- Settings apply immediately without restart (except model changes)
- YAML format with nested keys (e.g., `transcription.model`; see the lookup sketch below)
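A dotted key like `transcription.model` resolves by walking the nested YAML mapping. A sketch of that lookup (`get_setting` is a hypothetical helper; the real logic lives in `client/config.py`):

```python
import yaml

def get_setting(config: dict, dotted_key: str, default=None):
    """Resolve a dotted key like 'transcription.model' against a nested dict."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node

config = yaml.safe_load("""
transcription:
  model: base
  device: auto
""")
print(get_setting(config, "transcription.model"))  # -> base
```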
### Device Management

- `client/device_utils.py` handles CPU/GPU detection
- Auto-detects CUDA, MPS (Mac), or falls back to CPU
- Compute types: `float32` (best quality), `float16` (GPU), `int8` (fastest)
- Thread-safe device selection
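The fallback order described above amounts to a few torch checks. A sketch under the assumption that torch is the detection mechanism (the actual `device_utils.py` may differ):

```python
import torch

def pick_device() -> str:
    """Prefer CUDA, then Apple MPS, then CPU."""
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

# float16 only pays off on GPU; int8 is the fast CPU option.
device = pick_device()
compute_type = "float16" if device == "cuda" else "int8"
print(device, compute_type)
```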
## Key Implementation Details

### PyInstaller Build Configuration

- `local-transcription.spec` controls the build
- UPX compression enabled for smaller executables
- Hidden imports required for PySide6, faster-whisper, and torch
- Console mode enabled by default (set `console=False` to hide it)
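Hidden imports are declared in the spec's `Analysis` block. An illustrative fragment (module names are examples only; see `local-transcription.spec` for the real list). Spec files are plain Python executed by PyInstaller, so `Analysis` is available there without an import:

```python
# Fragment of a PyInstaller spec file (run via `pyinstaller foo.spec`).
# hiddenimports lists modules PyInstaller's static analysis misses,
# e.g. backends loaded dynamically at runtime.
a = Analysis(
    ["main.py"],
    hiddenimports=[
        "PySide6.QtNetwork",   # example entries only -
        "faster_whisper",      # check local-transcription.spec
        "torch",               # for the authoritative list
    ],
)
```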
### Threading Model
- Main thread: Qt GUI event loop
- Audio thread: Captures and processes audio chunks
- Web server thread: Runs FastAPI server
- Transcription: Runs in callback thread from audio capture
- All transcription results communicated via Qt signals
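Because Qt widgets must only be touched from the main thread, worker threads hand results over via signals and Qt queues the delivery onto the GUI thread. A minimal sketch of the pattern (`display.append_line` is a hypothetical slot):

```python
from PySide6.QtCore import QObject, Signal

class TranscriptionBridge(QObject):
    # Emitted from the audio/transcription callback thread; Qt delivers
    # it on the GUI thread via a queued connection.
    new_text = Signal(str)

bridge = TranscriptionBridge()
# GUI thread:    bridge.new_text.connect(display.append_line)
# Worker thread: bridge.new_text.emit("transcribed sentence")
```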
### Server Sync (Optional Multi-User Feature)

- `client/server_sync.py` handles server communication
- Toggle in Settings: "Enable Server Sync"
- Sends transcriptions to the PHP server via POST
- A separate web display shows merged transcriptions from all users
- Falls back gracefully if the server is unavailable
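The graceful fallback boils down to never letting a failed POST interrupt local transcription. A sketch of that behavior (the endpoint is taken from the OBS section below; the payload field names are assumptions, see `client/server_sync.py` for the real ones):

```python
import requests

def send_transcription(url: str, room: str, text: str) -> bool:
    """POST one transcription line; swallow network errors so a dead
    server never breaks local transcription."""
    try:
        resp = requests.post(url, json={"room": room, "text": text}, timeout=2)
        return resp.ok
    except requests.RequestException:
        return False  # graceful fallback: keep transcribing locally

send_transcription("http://your-server:3000/api/send", "my-room", "hello")
```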
## Common Patterns

### Adding a New Setting

1. Add the setting to `config/default_config.yaml`
2. Update `client/config.py` if validation is needed
3. Add a UI control in `gui/settings_dialog_qt.py`
4. Apply the setting in the relevant component (without a restart if possible)
5. Emit a signal to update the display if needed

### Modifying Transcription Display

- Local GUI: `gui/transcription_display_qt.py`
- Web display (OBS): `server/web_display.py` (HTML in `_get_html()`)
- Multi-user display: `server/php/display.php`

### Adding a New Model Size

1. Update `client/transcription_engine.py`
2. Add it to the model selector in `gui/settings_dialog_qt.py`
3. Update the CLI argument choices in `main_cli.py`
## Dependencies

**Core:**
- `faster-whisper`: Optimized Whisper inference
- `torch`: ML framework (CUDA-enabled via a custom package index)
- `PySide6`: Qt6 bindings for the GUI
- `sounddevice`: Cross-platform audio I/O
- `noisereduce`, `webrtcvad`: Audio preprocessing

**Web Server:**
- `fastapi`, `uvicorn`: Web server and ASGI runtime
- `websockets`: Real-time communication

**Build:**
- `pyinstaller`: Creates standalone executables
- `uv`: Fast package manager
**PyTorch CUDA Index:**
- Configured in `pyproject.toml` under `[[tool.uv.index]]`
- Uses PyTorch's custom wheel repository for CUDA builds
- Installed automatically by `uv sync`, which the build scripts run
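To confirm that `uv sync` pulled the CUDA wheel rather than the CPU one (version strings below are examples):

```python
import torch

print(torch.__version__)          # e.g. "2.4.0+cu121" for a CUDA wheel
print(torch.version.cuda)         # bundled CUDA runtime version; None on CPU wheels
print(torch.cuda.is_available())  # False is expected without an NVIDIA GPU
```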
## Platform-Specific Notes

### Linux

- Uses PulseAudio/ALSA for audio
- Build scripts use bash (`.sh` files)
- Executable: `dist/LocalTranscription/LocalTranscription`

### Windows

- Uses Windows Audio/WASAPI
- Build scripts use batch (`.bat` files)
- Executable: `dist\LocalTranscription\LocalTranscription.exe`
- Requires the Visual C++ Redistributable on target systems

### Cross-Building

- Cannot cross-compile - builds must run on the target platform
- CI/CD should use platform-specific runners
## Troubleshooting

### Model Loading Issues

- Models download to `~/.cache/huggingface/`
- First run requires an internet connection
- Check disk space (models range from 75 MB to 3 GB depending on size)
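Models can be pre-fetched into the cache for offline machines by instantiating the model once via the public faster-whisper API:

```python
from faster_whisper import WhisperModel

# Instantiating the model downloads it into the Hugging Face cache
# (~/.cache/huggingface/) if it is not already present, so this
# one-liner pre-fetches a model for later offline use.
model = WhisperModel("base", device="cpu", compute_type="int8")
```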
### Audio Device Issues

- Run `uv run python main_cli.py --list-devices`
- Check permissions (microphone access)
- Try different device indices in settings
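Roughly the same device information is available interactively through sounddevice, the library the app uses for capture:

```python
import sounddevice as sd

print(sd.query_devices())   # table of all input/output devices with indices
print(sd.default.device)    # current (input, output) default device indices
```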
### GPU Not Detected

- Run `uv run python check_cuda.py`
- Install NVIDIA GPU drivers (not the CUDA toolkit - the runtime is bundled in the build)
- Verify PyTorch sees the GPU: `python -c "import torch; print(torch.cuda.is_available())"`

### Web Server Port Conflicts

- Default port: 8080
- Change it in `gui/main_window_qt.py` or the config
- Use `lsof -i :8080` (Linux) or `netstat -ano | findstr :8080` (Windows) to find the conflicting process
## OBS Integration

### Local Display (Single User)

1. Start the Local Transcription app
2. In OBS, add a "Browser" source
3. URL: `http://localhost:8080`
4. Set dimensions (e.g., 1920x300)

### Multi-User Display (Node.js Server)

1. Deploy the Node.js server (see `server/nodejs/README.md`)
2. Each user configures the Server URL: `http://your-server:3000/api/send`
3. Enter the same room name and passphrase
4. In OBS, add a "Browser" source
5. URL: `http://your-server:3000/display?room=ROOM&fade=10&timestamps=true&maxlines=50&fontsize=16`
6. Customize URL parameters as needed (a sketch for building the URL follows this list):
   - `timestamps=false` - Hide timestamps
   - `maxlines=30` - Show at most 30 lines (prevents scroll bars)
   - `fontsize=18` - Larger font
   - `fontfamily=Courier` - Different font
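When hand-editing the query string gets error-prone, the URL can be assembled programmatically. A small sketch (`your-server` is a placeholder for the deployed host):

```python
from urllib.parse import urlencode

params = {
    "room": "my-room",
    "fade": 10,           # seconds before text fades (0 = never)
    "timestamps": "true",
    "maxlines": 50,
    "fontsize": 16,
    "fontfamily": "Arial",
}
print(f"http://your-server:3000/display?{urlencode(params)}")
```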
## Performance Optimization

**For real-time transcription:**
- Use the `tiny` or `base` model (faster)
- Enable the GPU if available (5-10x faster)
- Increase `chunk_duration` for better accuracy (at the cost of latency)
- Decrease `chunk_duration` for lower latency (less context)
- Enable VAD to skip silent audio

**For build size reduction:**
- Don't bundle models (they download on demand)
- Use a CPU-only build if no users need GPU support
- Enable UPX compression (already enabled in the spec)
## Phase Status
- ✅ Phase 1: Standalone desktop application (complete)
- ✅ Web Server: Local OBS integration (complete)
- ✅ Builds: PyInstaller executables (complete)
- ✅ Phase 2: Multi-user Node.js server (complete, optional)
- ⏸️ Phase 3+: Advanced features (see NEXT_STEPS.md)
## Related Documentation

- `README.md` - User-facing documentation
- `BUILD.md` - Detailed build instructions
- `INSTALL.md` - Installation guide
- `NEXT_STEPS.md` - Future enhancements
- `server/nodejs/README.md` - Node.js server setup and deployment