Local Transcription for Streamers

A local speech-to-text application for streamers that provides real-time transcription using Whisper or compatible models. Multiple users can run the application locally and sync their transcriptions to a centralized web stream that can be captured easily in OBS or other streaming software.

Features

  • Standalone Desktop Application: Use locally with built-in GUI display - no server required
  • Local Transcription: Run Whisper (or compatible models) locally on your machine
  • CPU/GPU Support: Choose between CPU or GPU processing based on your hardware
  • Real-time Processing: Live audio transcription with minimal latency
  • Noise Suppression: Built-in audio preprocessing to reduce background noise
  • User Configuration: Set your display name and preferences through the GUI
  • Optional Multi-user Sync: Connect to a server to sync transcriptions with other users
  • OBS Integration: Web-based output designed for easy browser source capture
  • Privacy-First: All processing happens locally; only transcription text is shared
  • Customizable: Configure model size, language, and streaming settings

Quick Start

Running from Source

# Install dependencies
uv sync

# Run the application
uv run python main.py

Building Standalone Executables

To create standalone executables for distribution:

Linux:

./build.sh

Windows:

build.bat

For detailed build instructions, see BUILD.md.

Architecture Overview

The application can run in two modes:

Standalone Mode (No Server Required):

  1. Desktop Application: Captures audio, performs speech-to-text, and displays transcriptions locally in a GUI window

Multi-user Sync Mode (Optional):

  1. Local Transcription Client: Captures audio, performs speech-to-text, and sends results to the web server
  2. Centralized Web Server: Aggregates transcriptions from multiple clients and serves a web stream
  3. Web Stream Interface: Browser-accessible page displaying synchronized transcriptions (for OBS capture)

Use Cases

  • Multi-language Streams: Multiple translators transcribing in different languages
  • Accessibility: Provide real-time captions for viewers
  • Collaborative Podcasts: Multiple hosts with separate transcriptions
  • Gaming Commentary: Track who said what in multiplayer sessions

Implementation Plan

Phase 1: Standalone Desktop Application

Objective: Build a fully functional standalone transcription app with GUI that works without any server

Components:

  1. Audio Capture Module

    • Capture system audio or microphone input
    • Support multiple audio sources (virtual audio cables, physical devices)
    • Real-time audio buffering with configurable chunk sizes
    • Noise Suppression: Preprocess audio to reduce background noise
    • Libraries: pyaudio, sounddevice, noisereduce, webrtcvad
  2. Noise Suppression Engine

    • Real-time noise reduction using RNNoise or noisereduce
    • Adjustable noise reduction strength
    • Optional VAD (Voice Activity Detection) to skip silent segments
    • Libraries: noisereduce, rnnoise-python, webrtcvad (see the sketch after this list)
  3. Transcription Engine

    • Integrate OpenAI Whisper (or alternatives: faster-whisper, whisper.cpp)
    • Support multiple model sizes (tiny, base, small, medium, large)
    • CPU and GPU inference options
    • Model management and automatic downloading
    • Libraries: openai-whisper, faster-whisper, torch
  4. Device Selection

    • Auto-detect available compute devices (CPU, CUDA, MPS for Mac)
    • Allow user to specify preferred device via GUI
    • Graceful fallback if GPU unavailable
    • Display device status and performance metrics
  5. Desktop GUI Application

    • Cross-platform GUI using PyQt6, Tkinter, or CustomTkinter
    • Main transcription display window (scrolling text area)
    • Settings panel for configuration
    • User name input field
    • Audio input device selector
    • Model size selector
    • CPU/GPU toggle
    • Start/Stop transcription button
    • Optional: System tray integration
    • Libraries: PyQt6, customtkinter, or tkinter
  6. Local Display

    • Real-time transcription display in GUI window
    • Scrolling text with timestamps
    • User name/label shown with transcriptions
    • Copy transcription to clipboard
    • Optional: Save transcription to file (TXT, SRT, VTT)
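A minimal sketch of the capture-and-preprocess path from components 1 and 2, assuming sounddevice, noisereduce, and webrtcvad are installed; the function names (capture_chunk, has_speech, preprocess) are illustrative, not a settled API:

# capture_sketch.py - illustrative only
import numpy as np
import sounddevice as sd
import noisereduce as nr
import webrtcvad

SAMPLE_RATE = 16000            # Whisper expects 16 kHz mono audio
FRAME_MS = 30                  # webrtcvad accepts 10, 20, or 30 ms frames
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000

vad = webrtcvad.Vad(2)         # aggressiveness 0 (least) to 3 (most)

def capture_chunk(seconds=2.0):
    # Blocking capture of one mono float32 chunk from the default input device
    chunk = sd.rec(int(seconds * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                   channels=1, dtype="float32")
    sd.wait()
    return chunk.flatten()

def has_speech(chunk):
    # webrtcvad wants 16-bit PCM bytes in exact frame-sized slices
    pcm = (chunk * 32767).astype(np.int16).tobytes()
    step = FRAME_SAMPLES * 2   # two bytes per 16-bit sample
    frames = [pcm[i:i + step] for i in range(0, len(pcm) - step + 1, step)]
    return any(vad.is_speech(f, SAMPLE_RATE) for f in frames)

def preprocess(chunk):
    # Reduce stationary background noise before handing audio to the model
    return nr.reduce_noise(y=chunk, sr=SAMPLE_RATE, prop_decrease=0.7)

if __name__ == "__main__":
    audio = capture_chunk()
    if has_speech(audio):
        cleaned = preprocess(audio)
        print("speech detected; chunk ready for transcription")
    else:
        print("silence; chunk skipped")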

Tasks:

  • Set up project structure and dependencies
  • Implement audio capture with device selection
  • Add noise suppression and VAD preprocessing
  • Integrate Whisper model loading and inference
  • Add CPU/GPU device detection and selection logic (sketched below)
  • Create real-time audio buffer processing pipeline
  • Design and implement GUI layout (main window)
  • Add settings panel with user name configuration
  • Implement local transcription display area
  • Add start/stop controls and status indicators
  • Test transcription accuracy and latency
  • Test noise suppression effectiveness
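The device-detection and model-loading tasks above could look like the following sketch, using torch for detection and faster-whisper for inference; pick_device is a hypothetical helper:

# device_sketch.py - illustrative only
import torch
from faster_whisper import WhisperModel

def pick_device(preferred="auto"):
    # Honor an explicit user choice, otherwise fall back gracefully
    if preferred != "auto":
        return preferred
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"   # CTranslate2 (faster-whisper) has no MPS backend

device = pick_device()
compute_type = "float16" if device == "cuda" else "int8"
model = WhisperModel("base", device=device, compute_type=compute_type)

# segments is a generator of timestamped results
segments, info = model.transcribe("chunk.wav", language="en", vad_filter=True)
for seg in segments:
    print(f"[{seg.start:6.2f} -> {seg.end:6.2f}] {seg.text}")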

Phase 2: Web Server and Sync System

Objective: Create a centralized server to aggregate and serve transcriptions

Components:

  1. Web Server

    • FastAPI or Flask-based REST API
    • WebSocket support for real-time updates
    • User/client registration and management
    • Libraries: fastapi, uvicorn, websockets
  2. Transcription Aggregator

    • Receive transcription chunks from multiple clients
    • Associate transcriptions with user IDs/names
    • Timestamp management and synchronization
    • Buffer management for smooth streaming
  3. Database/Storage (Optional)

    • Store transcription history (SQLite for simplicity)
    • Session management
    • Export functionality (SRT, VTT, TXT formats)

API Endpoints:

  • POST /api/register - Register a new client
  • POST /api/transcription - Submit transcription chunk
  • WS /api/stream - WebSocket for real-time transcription stream
  • GET /stream - Web page for OBS browser source

Tasks:

  • Set up FastAPI server with CORS support
  • Implement WebSocket handler for real-time streaming
  • Create client registration system
  • Build transcription aggregation logic
  • Add timestamp synchronization
  • Create data models for clients and transcriptions

Phase 3: Client-Server Communication (Optional Multi-user Mode)

Objective: Add optional server connectivity to enable multi-user transcription sync

Components:

  1. HTTP/WebSocket Client

    • Register client with server on startup
    • Send transcription chunks as they're generated
    • Handle connection drops and reconnection
    • Libraries: requests, websockets (see the sketch after this list)
  2. Configuration System

    • Config file for server URL, API keys, user settings
    • Model preferences (size, language)
    • Audio input settings
    • Format: YAML or JSON
  3. Status Monitoring

    • Connection status indicator
    • Transcription queue health
    • Error handling and logging
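Component 1's send-and-reconnect loop might look like this sketch, assuming the websockets package and an asyncio.Queue handoff from the transcription engine (both assumptions, not settled design):

# sync_client_sketch.py - illustrative only
import asyncio
import json
import websockets

async def sync_loop(url, outbox):
    # Reconnect forever with a fixed backoff
    while True:
        try:
            async with websockets.connect(url) as ws:
                while True:
                    chunk = await outbox.get()       # dicts from the transcriber
                    await ws.send(json.dumps(chunk))
        except (OSError, websockets.ConnectionClosed):
            await asyncio.sleep(3)                   # wait before retrying

# Usage: the transcription engine puts {"client_id": ..., "text": ...}
# on the queue, and sync_loop ships each item to an assumed WebSocket
# ingest endpoint such as ws://localhost:8000/api/stream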

Tasks:

  • Add "Enable Server Sync" toggle to GUI
  • Add server URL configuration field in settings
  • Implement WebSocket client for sending transcriptions
  • Add configuration file support (YAML/JSON)
  • Create connection management with auto-reconnect
  • Add local logging and error handling
  • Add server connection status indicator to GUI
  • Allow app to function normally if server is unavailable

Phase 4: Web Stream Interface (OBS Integration)

Objective: Create a web page that displays synchronized transcriptions for OBS

Components:

  1. Web Frontend

    • HTML/CSS/JavaScript page for displaying transcriptions
    • Responsive design with customizable styling
    • Auto-scroll with configurable retention window
    • Libraries: Vanilla JS or lightweight framework (Alpine.js, htmx)
  2. Styling Options

    • Customizable fonts, colors, sizes
    • Background transparency for OBS chroma key
    • User name/ID display options
    • Timestamp display (optional)
  3. Display Modes

    • Scrolling captions (like live TV captions)
    • Multi-user panel view (separate sections per user)
    • Overlay mode (minimal UI for transparency)

Tasks:

  • Create HTML template for transcription display
  • Implement WebSocket client in JavaScript
  • Add CSS styling with OBS-friendly transparency
  • Create customization controls (URL parameters or UI)
  • Test with OBS browser source
  • Add configurable retention/scroll behavior

Phase 5: Advanced Features

Objective: Enhance functionality and user experience

Features:

  1. Language Detection

    • Auto-detect spoken language
    • Multi-language support in single stream
    • Language selector in GUI
  2. Speaker Diarization (Optional)

    • Identify different speakers
    • Label transcriptions by speaker
    • Useful for multi-host streams
  3. Profanity Filtering

    • Optional word filtering/replacement
    • Customizable filter lists
    • Toggle in GUI settings
  4. Advanced Noise Profiles

    • Save and load custom noise profiles
    • Adaptive noise suppression
    • Different profiles for different environments
  5. Export Functionality

    • Save transcriptions in multiple formats (TXT, SRT, VTT, JSON)
    • Export button in GUI
    • Automatic session saving
  6. Hotkey Support

    • Global hotkeys to start/stop transcription (sketched after this list)
    • Mute/unmute hotkey
    • Quick save hotkey
  7. Docker Support

    • Containerized server deployment
    • Docker Compose for easy multi-component setup
    • Pre-built images for easy deployment
  8. Themes and Customization

    • Dark/light theme toggle
    • Customizable font sizes and colors for display
    • OBS-friendly transparent overlay mode
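Feature 6 could be prototyped with pynput (one possible library choice; it is not named in the stack below), with placeholder callbacks:

# hotkey_sketch.py - illustrative only; assumes the pynput package
from pynput import keyboard

def toggle_transcription():
    print("start/stop toggled")   # placeholder: wire to the engine

def toggle_mute():
    print("mute toggled")         # placeholder: wire to audio capture

hotkeys = keyboard.GlobalHotKeys({
    "<ctrl>+<alt>+t": toggle_transcription,
    "<ctrl>+<alt>+m": toggle_mute,
})
hotkeys.start()   # listens on a background thread
hotkeys.join()    # block only when running as a standalone script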

Tasks:

  • Add language detection and multi-language support
  • Implement speaker diarization
  • Create optional profanity filter
  • Add export functionality (SRT, VTT, plain text, JSON; SRT sketched below)
  • Implement global hotkey support
  • Create Docker containers for server component
  • Add theme customization options
  • Create advanced noise profile management
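The export task is small enough to hand-roll for SRT; a sketch assuming each segment carries start and end times in seconds plus its text:

# srt_sketch.py - illustrative SRT writer
def fmt_time(seconds):
    # SRT timestamps look like 00:01:02,345
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def to_srt(segments):
    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        blocks.append(f"{i}\n{fmt_time(start)} --> {fmt_time(end)}\n{text}\n")
    return "\n".join(blocks)

print(to_srt([(0.0, 2.5, "Hello stream"), (2.5, 4.0, "Testing captions")]))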

Technology Stack

Local Client:

  • Python 3.9+
  • GUI: PyQt6 / CustomTkinter / tkinter
  • Audio: PyAudio / sounddevice
  • Noise Suppression: noisereduce / rnnoise-python
  • VAD: webrtcvad
  • ML Framework: PyTorch (for Whisper)
  • Transcription: openai-whisper / faster-whisper
  • Networking: websockets, requests (optional for server sync)
  • Config: PyYAML / json

Server:

  • Backend: FastAPI / Flask
  • WebSocket: websockets / FastAPI WebSockets
  • Server: Uvicorn / Gunicorn
  • Database (optional): SQLite / PostgreSQL
  • CORS: FastAPI's built-in CORSMiddleware

Web Interface:

  • Frontend: HTML5, CSS3, JavaScript (ES6+)
  • Real-time: WebSocket API
  • Styling: CSS Grid/Flexbox for layout

Project Structure

local-transcription/
├── client/                      # Local transcription client
│   ├── __init__.py
│   ├── audio_capture.py         # Audio input handling
│   ├── transcription_engine.py  # Whisper integration
│   ├── network_client.py        # Server communication
│   ├── config.py                # Configuration management
│   └── main.py                  # Client entry point
├── server/                      # Centralized web server
│   ├── __init__.py
│   ├── api.py                   # FastAPI routes
│   ├── websocket_handler.py     # WebSocket management
│   ├── models.py                # Data models
│   ├── database.py              # Optional DB layer
│   └── main.py                  # Server entry point
├── web/                         # Web stream interface
│   ├── index.html               # OBS browser source page
│   ├── styles.css               # Customizable styling
│   └── app.js                   # WebSocket client & UI logic
├── config/
│   ├── client_config.example.yaml
│   └── server_config.example.yaml
├── tests/
│   ├── test_audio.py
│   ├── test_transcription.py
│   └── test_server.py
├── requirements.txt             # Python dependencies
├── README.md
└── main.py                      # Combined launcher (optional)

Installation (Planned)

Prerequisites:

  • Python 3.9 or higher
  • CUDA-capable GPU (optional, for GPU acceleration)
  • FFmpeg (required by Whisper)

Steps:

  1. Clone the repository

    git clone <repository-url>
    cd local-transcription
    
  2. Install dependencies

    pip install -r requirements.txt
    
  3. Download Whisper models

    # Models will be auto-downloaded on first run
    # Or manually download:
    python -c "import whisper; whisper.load_model('base')"
    
  4. Configure client

    cp config/client_config.example.yaml config/client_config.yaml
    # Edit config/client_config.yaml with your settings
    
  5. Run the server (one instance)

    python server/main.py
    
  6. Run the client (on each user's machine)

    python client/main.py
    
  7. Add to OBS

    • Add a Browser Source
    • URL: http://<server-ip>:8000/stream
    • Set width/height as needed
    • Check "Shutdown source when not visible" for performance

Configuration (Planned)

Client Configuration:

user:
  name: "Streamer1"          # Display name for transcriptions
  id: "unique-user-id"       # Optional unique identifier

audio:
  input_device: "default"    # or specific device index
  sample_rate: 16000
  chunk_duration: 2.0        # seconds

noise_suppression:
  enabled: true              # Enable/disable noise reduction
  strength: 0.7              # 0.0 to 1.0 - reduction strength
  method: "noisereduce"      # "noisereduce" or "rnnoise"

transcription:
  model: "base"              # tiny, base, small, medium, large
  device: "cuda"             # cpu, cuda, mps
  language: "en"             # or "auto" for detection
  task: "transcribe"         # or "translate"

processing:
  use_vad: true              # Voice Activity Detection
  min_confidence: 0.5        # Minimum transcription confidence

server_sync:
  enabled: false             # Enable multi-user server sync
  url: "ws://localhost:8000" # Server URL (when enabled)
  api_key: ""                # Optional API key

display:
  show_timestamps: true      # Show timestamps in local display
  max_lines: 100             # Maximum lines to keep in display
  font_size: 12              # GUI font size
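A sketch of how the planned config.py might load this file, assuming PyYAML; the path and fallback values are assumptions:

# config_sketch.py - illustrative only
import yaml

def load_config(path="config/client_config.yaml"):
    with open(path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)

cfg = load_config()
model_size = cfg["transcription"]["model"]          # e.g. "base"
device = cfg["transcription"].get("device", "cpu")  # fall back to CPU
print(f"loading Whisper '{model_size}' on {device}")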

Server Configuration:

server:
  host: "0.0.0.0"
  port: 8000
  api_key_required: false

stream:
  max_clients: 10
  buffer_size: 100         # messages to buffer
  retention_time: 300      # seconds

database:
  enabled: false
  path: "transcriptions.db"

Roadmap

  • Project planning and architecture design
  • Phase 1: Standalone desktop application with GUI
  • Phase 2: Web server and sync system (optional multi-user mode)
  • Phase 3: Client-server communication (optional)
  • Phase 4: Web stream interface for OBS (optional)
  • Phase 5: Advanced features (hotkeys, themes, Docker, etc.)

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.


License

[Choose appropriate license - MIT, Apache 2.0, etc.]


Acknowledgments

  • OpenAI Whisper for the excellent speech recognition model
  • The streaming community for inspiration and use cases