# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
Local Transcription is a desktop application for real-time speech-to-text transcription designed for streamers. It uses Whisper models (via faster-whisper) to transcribe audio locally with optional multi-user server synchronization.
**Key Features:**
- Standalone desktop GUI (PySide6/Qt)
- Local transcription with CPU/GPU support
- Built-in web server for OBS browser source integration
- Optional Node.js-based multi-user server for syncing transcriptions across users
- Noise suppression and Voice Activity Detection (VAD)
- Cross-platform builds (Linux/Windows) with PyInstaller
## Project Structure

```
local-transcription/
├── client/                          # Core transcription logic
│   ├── audio_capture.py             # Audio input and buffering
│   ├── transcription_engine.py      # Whisper model integration
│   ├── noise_suppression.py         # VAD and noise reduction
│   ├── device_utils.py              # CPU/GPU device management
│   ├── config.py                    # Configuration management
│   └── server_sync.py               # Multi-user server sync client
├── gui/                             # Desktop application UI
│   ├── main_window_qt.py            # Main application window (PySide6)
│   ├── settings_dialog_qt.py        # Settings dialog (PySide6)
│   └── transcription_display_qt.py  # Display widget
├── server/                          # Web display servers
│   ├── web_display.py               # FastAPI server for OBS browser source (local)
│   └── nodejs/                      # Optional multi-user Node.js server
│       ├── server.js                # Multi-user sync server with WebSocket
│       ├── package.json             # Node.js dependencies
│       └── README.md                # Server deployment documentation
├── config/                          # Example configuration files
│   └── default_config.yaml          # Default settings template
├── main.py                          # GUI application entry point
├── main_cli.py                      # CLI version for testing
└── pyproject.toml                   # Dependencies and build config
```
## Development Commands

### Installation and Setup

```bash
# Install dependencies (creates .venv automatically)
uv sync

# Run the GUI application
uv run python main.py

# Run CLI version (headless, for testing)
uv run python main_cli.py

# List available audio devices
uv run python main_cli.py --list-devices

# Install with CUDA support (if needed)
uv pip install torch --index-url https://download.pytorch.org/whl/cu121
```
### Building Executables

```bash
# Linux (CPU-only)
./build.sh

# Linux (with CUDA support - works on both GPU and CPU systems)
./build-cuda.sh

# Windows (CPU-only)
build.bat

# Windows (with CUDA support)
build-cuda.bat

# Manual build with PyInstaller
uv run pyinstaller local-transcription.spec
```

**Important:** CUDA builds can be created on systems without NVIDIA GPUs. The PyTorch CUDA runtime is bundled, and the app automatically falls back to CPU if no GPU is available.
### Testing

```bash
# Run component tests
uv run python test_components.py

# Check CUDA availability
uv run python check_cuda.py

# Test web server manually
uv run python -m uvicorn server.web_display:app --reload
```
## Architecture

### Audio Processing Pipeline

1. **Audio Capture** (`client/audio_capture.py`)
   - Captures audio from microphone/system using sounddevice
   - Handles automatic sample rate detection and resampling
   - Uses chunking with overlap for better transcription quality
   - Default: 3-second chunks with 0.5 s overlap
2. **Noise Suppression** (`client/noise_suppression.py`)
   - Applies noisereduce for background noise reduction
   - Voice Activity Detection (VAD) using webrtcvad
   - Skips silent segments to improve performance
3. **Transcription** (`client/transcription_engine.py`)
   - Uses faster-whisper for efficient inference
   - Supports CPU, CUDA, and Apple MPS (Mac)
   - Models: tiny, base, small, medium, large
   - Thread-safe model loading with locks
4. **Display** (`gui/main_window_qt.py`)
   - PySide6/Qt-based desktop GUI
   - Real-time transcription display with scrolling
   - Settings panel with live updates (no restart needed)
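The chunk-with-overlap scheme in step 1 can be sketched as follows. The function name and buffer handling are illustrative, not the actual `audio_capture.py` implementation:

```python
# Sketch of chunked capture with overlap: each emitted chunk repeats the
# last `overlap_s` seconds of the previous chunk for transcription context.
# Illustrative only - not the real audio_capture.py code.

def chunk_with_overlap(samples, sample_rate=16000, chunk_s=3.0, overlap_s=0.5):
    """Yield fixed-size chunks where consecutive chunks share overlap_s seconds."""
    chunk_len = int(chunk_s * sample_rate)
    step = int((chunk_s - overlap_s) * sample_rate)  # advance per chunk
    for start in range(0, max(len(samples) - chunk_len, 0) + 1, step):
        yield samples[start:start + chunk_len]

# Example: 10 s of "audio" as a flat list of sample indices
audio = list(range(10 * 16000))
chunks = list(chunk_with_overlap(audio))
# Consecutive chunks overlap by 0.5 s (8000 samples at 16 kHz)
assert chunks[0][-8000:] == chunks[1][:8000]
```

The overlap means words split across a chunk boundary are heard in full by at least one chunk, at the cost of transcribing each overlap region twice.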
### Web Server Architecture

**Local Web Server** (`server/web_display.py`)

- Always runs when the GUI starts (port 8080 by default)
- FastAPI with WebSocket for real-time updates
- Used for OBS browser source integration
- Single-user (displays only local transcriptions)
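The fan-out pattern behind the local server (one transcription event pushed to every connected display) can be sketched without FastAPI as a plain broadcast hub. This is an illustrative stand-in; the real `web_display.py` does this over FastAPI WebSockets:

```python
# Minimal broadcast hub sketching the WebSocket fan-out of the local
# display server: every subscriber receives each transcription line.
# Illustrative stand-in - not the actual web_display.py code.

class BroadcastHub:
    def __init__(self):
        self._subscribers = []  # one outbound queue per connected display

    def subscribe(self):
        queue = []
        self._subscribers.append(queue)
        return queue

    def publish(self, line: str):
        # Push the new transcription line to every connected display
        for queue in self._subscribers:
            queue.append(line)

hub = BroadcastHub()
obs_view = hub.subscribe()     # e.g. the OBS browser source
popup_view = hub.subscribe()   # e.g. a second browser tab
hub.publish("hello world")
assert obs_view == ["hello world"] and popup_view == ["hello world"]
```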
**Multi-User Server** (optional, for syncing across multiple users)

**Node.js WebSocket Server** (`server/nodejs/`) - RECOMMENDED

- Real-time WebSocket support (< 100 ms latency)
- Handles 100+ concurrent users
- Easy deployment to cloud hosting (Railway, Heroku, DigitalOcean, or any VPS)
- Configurable display options via URL parameters:
  - `timestamps=true/false`: Show/hide timestamps
  - `maxlines=50`: Maximum visible lines (prevents scroll bars in OBS)
  - `fontsize=16`: Font size in pixels
  - `fontfamily=Arial`: Font family
  - `fade=10`: Seconds before text fades (0 = never)

See `server/nodejs/README.md` for deployment instructions.
### Configuration System

- Config stored at `~/.local-transcription/config.yaml`
- Managed by `client/config.py`
- Settings apply immediately without restart (except model changes)
- YAML format with nested keys (e.g., `transcription.model`)
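Dotted keys like `transcription.model` can be resolved against the nested YAML data with a small helper. The helper name and behavior here are illustrative and not necessarily `config.py`'s actual API:

```python
# Resolve a dotted key such as "transcription.model" in a nested dict,
# as loaded from config.yaml. Illustrative helper, not config.py's API.

def get_nested(config: dict, dotted_key: str, default=None):
    """Walk the nested dict one key segment at a time."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node

config = {"transcription": {"model": "base", "language": "en"}}
assert get_nested(config, "transcription.model") == "base"
assert get_nested(config, "server.port", 8080) == 8080  # falls back to default
```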
### Device Management

- `client/device_utils.py` handles CPU/GPU detection
- Auto-detects CUDA, MPS (Mac), or falls back to CPU
- Compute types: `float32` (best quality), `float16` (GPU), `int8` (fastest)
- Thread-safe device selection
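The CUDA → MPS → CPU fallback order can be sketched as a pure function; `device_utils.py`'s actual logic and compute-type pairings may differ:

```python
# Sketch of the device fallback order described above. In the app the
# availability flags would come from torch, e.g.:
#   pick_device(torch.cuda.is_available(), torch.backends.mps.is_available())
# Function name and exact compute-type pairing are illustrative.

def pick_device(cuda_available: bool, mps_available: bool) -> tuple:
    """Return (device, compute_type) with graceful CPU fallback."""
    if cuda_available:
        return "cuda", "float16"   # half precision on NVIDIA GPUs
    if mps_available:
        return "mps", "float16"    # Apple Silicon
    return "cpu", "int8"           # fastest CPU option

assert pick_device(False, False) == ("cpu", "int8")
```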
## Key Implementation Details

### PyInstaller Build Configuration

- `local-transcription.spec` controls the build
- UPX compression enabled for smaller executables
- Hidden imports required for PySide6, faster-whisper, torch
- Console mode enabled by default (set `console=False` to hide)
### Threading Model

- Main thread: Qt GUI event loop
- Audio thread: captures and processes audio chunks
- Web server thread: runs the FastAPI server
- Transcription: runs in a callback thread from audio capture
- All transcription results are communicated via Qt signals
### Server Sync (Optional Multi-User Feature)

- `client/server_sync.py` handles server communication
- Toggle in Settings: "Enable Server Sync"
- Sends transcriptions to the Node.js server via HTTP POST (`/api/send`)
- Separate web display shows merged transcriptions from all users
- Falls back gracefully if the server is unavailable
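The "fail soft" behavior can be sketched with stdlib `urllib`: build the payload, POST it, and swallow network errors so local transcription keeps running when the server is unreachable. The endpoint path matches this document; the JSON field names are illustrative, not `server_sync.py`'s actual wire format:

```python
# Sketch of graceful-fallback sync: a failed POST returns False instead
# of raising, so the caller keeps transcribing locally.
# JSON field names are illustrative assumptions.
import json
import urllib.error
import urllib.request

def build_payload(room: str, passphrase: str, text: str) -> bytes:
    return json.dumps({"room": room, "passphrase": passphrase,
                       "text": text}).encode("utf-8")

def send_transcription(server_url: str, payload: bytes) -> bool:
    req = urllib.request.Request(
        server_url, data=payload,
        headers={"Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(req, timeout=2):
            return True
    except OSError:          # covers URLError, timeouts, refused connections
        return False         # graceful fallback: keep running locally

payload = build_payload("myroom", "secret", "hello")
assert json.loads(payload)["room"] == "myroom"
```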
## Common Patterns

### Adding a New Setting

1. Add to `config/default_config.yaml`
2. Update `client/config.py` if validation is needed
3. Add a UI control in `gui/settings_dialog_qt.py`
4. Apply the setting in the relevant component (no restart if possible)
5. Emit a signal to update the display if needed
### Modifying Transcription Display

- Local GUI: `gui/transcription_display_qt.py`
- Web display (OBS): `server/web_display.py` (HTML in `_get_html()`)
- Multi-user display: served by `server/nodejs/server.js` (`/display` endpoint)
### Adding a New Model Size

1. Update `client/transcription_engine.py`
2. Add to the model selector in `gui/settings_dialog_qt.py`
3. Update CLI argument choices in `main_cli.py`
## Dependencies

**Core:**

- `faster-whisper`: Optimized Whisper inference
- `torch`: ML framework (CUDA-enabled via special index)
- `PySide6`: Qt6 bindings for GUI
- `sounddevice`: Cross-platform audio I/O
- `noisereduce`, `webrtcvad`: Audio preprocessing

**Web Server:**

- `fastapi`, `uvicorn`: Web server and ASGI
- `websockets`: Real-time communication

**Build:**

- `pyinstaller`: Create standalone executables
- `uv`: Fast package manager
**PyTorch CUDA Index:**

- Configured in `pyproject.toml` under `[[tool.uv.index]]`
- Uses PyTorch's custom wheel repository for CUDA builds
- Automatically installed with `uv sync` when using CUDA build scripts
## Platform-Specific Notes

### Linux

- Uses PulseAudio/ALSA for audio
- Build scripts use bash (`.sh` files)
- Executable: `dist/LocalTranscription/LocalTranscription`

### Windows

- Uses Windows Audio/WASAPI
- Build scripts use batch (`.bat` files)
- Executable: `dist\LocalTranscription\LocalTranscription.exe`
- Requires Visual C++ Redistributable on target systems

### Cross-Building

- Cannot cross-compile; must build on the target platform
- CI/CD should use platform-specific runners
## Troubleshooting

### Model Loading Issues

- Models download to `~/.cache/huggingface/`
- First run requires an internet connection
- Check disk space (models: 75 MB-3 GB depending on size)

### Audio Device Issues

- Run `uv run python main_cli.py --list-devices`
- Check permissions (microphone access)
- Try different device indices in settings

### GPU Not Detected

- Run `uv run python check_cuda.py`
- Install CUDA drivers (not the CUDA toolkit, which is bundled in the build)
- Verify PyTorch sees the GPU: `python -c "import torch; print(torch.cuda.is_available())"`

### Web Server Port Conflicts

- Default port: 8080
- Change in `gui/main_window_qt.py` or config
- Find the conflicting process with `lsof -i :8080` (Linux) or `netstat -ano | findstr :8080` (Windows)
## OBS Integration

### Local Display (Single User)

1. Start the Local Transcription app
2. In OBS: Add a "Browser" source
3. URL: `http://localhost:8080`
4. Set dimensions (e.g., 1920x300)
### Multi-User Display (Node.js Server)

1. Deploy the Node.js server (see `server/nodejs/README.md`)
2. Each user configures the Server URL: `http://your-server:3000/api/send`
3. Enter the same room name and passphrase
4. In OBS: Add a "Browser" source
5. URL: `http://your-server:3000/display?room=ROOM&fade=10&timestamps=true&maxlines=50&fontsize=16`
6. Customize URL parameters as needed:
   - `timestamps=false`: Hide timestamps
   - `maxlines=30`: Show at most 30 lines (prevents scroll bars)
   - `fontsize=18`: Larger font
   - `fontfamily=Courier`: Different font
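A display URL with the documented parameters can be assembled with stdlib `urllib.parse`; the host, port, and room name below are placeholders:

```python
# Assemble an OBS display URL from the documented query parameters.
# Host/port/room are placeholders, matching the examples above.
from urllib.parse import urlencode

def display_url(base: str, room: str, **params) -> str:
    query = urlencode({"room": room, **params})
    return f"{base}/display?{query}"

url = display_url("http://your-server:3000", "myroom",
                  fade=10, timestamps="true", maxlines=50, fontsize=16)
assert url == ("http://your-server:3000/display?"
               "room=myroom&fade=10&timestamps=true&maxlines=50&fontsize=16")
```

`urlencode` also percent-escapes values, which matters for `fontfamily` names containing spaces.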
## Performance Optimization

**For Real-Time Transcription:**

- Use the `tiny` or `base` model (faster)
- Enable GPU if available (5-10x faster)
- Increase `chunk_duration` for better accuracy (higher latency)
- Decrease `chunk_duration` for lower latency (less context)
- Enable VAD to skip silent audio

**For Build Size Reduction:**

- Don't bundle models (download on demand)
- Use a CPU-only build if no users have GPUs
- Enable UPX compression (already in the spec)
## Phase Status
- ✅ Phase 1: Standalone desktop application (complete)
- ✅ Web Server: Local OBS integration (complete)
- ✅ Builds: PyInstaller executables (complete)
- ✅ Phase 2: Multi-user Node.js server (complete, optional)
- ⏸️ Phase 3+: Advanced features (see NEXT_STEPS.md)
## Related Documentation
- README.md - User-facing documentation
- BUILD.md - Detailed build instructions
- INSTALL.md - Installation guide
- NEXT_STEPS.md - Future enhancements
- server/nodejs/README.md - Node.js server setup and deployment