Local Transcription for Streamers

A local speech-to-text application for streamers, providing real-time transcription with Whisper or compatible models. Multiple users can run the application locally and sync their transcriptions to a centralized web stream that is easy to capture in OBS or other streaming software.

Features

  • Standalone Desktop Application: Use locally with built-in GUI display - no server required
  • Local Transcription: Run Whisper (or compatible models) locally on your machine
  • CPU/GPU Support: Choose between CPU or GPU processing based on your hardware
  • Real-time Processing: Live audio transcription with minimal latency
  • Noise Suppression: Built-in audio preprocessing to reduce background noise
  • User Configuration: Set your display name and preferences through the GUI
  • Optional Multi-user Sync: Connect to a server to sync transcriptions with other users
  • OBS Integration: Web-based output designed for easy browser source capture
  • Privacy-First: All processing happens locally; only transcription text is shared
  • Customizable: Configure model size, language, and streaming settings

Quick Start

Running from Source

# Install dependencies
uv sync

# Run the application
uv run python main.py

Building Standalone Executables

To create standalone executables for distribution:

Linux:

./build.sh

Windows:

build.bat

For detailed build instructions, see BUILD.md.

Architecture Overview

The application can run in two modes:

Standalone Mode (No Server Required):

  1. Desktop Application: Captures audio, performs speech-to-text, and displays transcriptions locally in a GUI window

Multi-user Sync Mode (Optional):

  1. Local Transcription Client: Captures audio, performs speech-to-text, and sends results to the web server
  2. Centralized Web Server: Aggregates transcriptions from multiple clients and serves a web stream
  3. Web Stream Interface: Browser-accessible page displaying synchronized transcriptions (for OBS capture)

Use Cases

  • Multi-language Streams: Multiple translators transcribing in different languages
  • Accessibility: Provide real-time captions for viewers
  • Collaborative Podcasts: Multiple hosts with separate transcriptions
  • Gaming Commentary: Track who said what in multiplayer sessions

Implementation Plan

Phase 1: Standalone Desktop Application

Objective: Build a fully functional standalone transcription app with GUI that works without any server

Components (a combined sketch follows this list):

  1. Audio Capture Module

    • Capture system audio or microphone input
    • Support multiple audio sources (virtual audio cables, physical devices)
    • Real-time audio buffering with configurable chunk sizes
    • Noise Suppression: Preprocess audio to reduce background noise
    • Libraries: pyaudio, sounddevice, noisereduce, webrtcvad
  2. Noise Suppression Engine

    • Real-time noise reduction using RNNoise or noisereduce
    • Adjustable noise reduction strength
    • Optional VAD (Voice Activity Detection) to skip silent segments
    • Libraries: noisereduce, rnnoise-python, webrtcvad
  3. Transcription Engine

    • Integrate OpenAI Whisper (or alternatives: faster-whisper, whisper.cpp)
    • Support multiple model sizes (tiny, base, small, medium, large)
    • CPU and GPU inference options
    • Model management and automatic downloading
    • Libraries: openai-whisper, faster-whisper, torch
  4. Device Selection

    • Auto-detect available compute devices (CPU, CUDA, MPS for Mac)
    • Allow user to specify preferred device via GUI
    • Graceful fallback if GPU unavailable
    • Display device status and performance metrics
  5. Desktop GUI Application

    • Cross-platform GUI using PyQt6, Tkinter, or CustomTkinter
    • Main transcription display window (scrolling text area)
    • Settings panel for configuration
    • User name input field
    • Audio input device selector
    • Model size selector
    • CPU/GPU toggle
    • Start/Stop transcription button
    • Optional: System tray integration
    • Libraries: PyQt6, customtkinter, or tkinter
  6. Local Display

    • Real-time transcription display in GUI window
    • Scrolling text with timestamps
    • User name/label shown with transcriptions
    • Copy transcription to clipboard
    • Optional: Save transcription to file (TXT, SRT, VTT)
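
To make these components concrete, here is a minimal sketch of the capture → noise reduction → VAD → transcription loop. It assumes openai-whisper, sounddevice, noisereduce, webrtcvad, and torch are installed and a default microphone is available; helper names like pick_device and has_speech are illustrative, not part of the codebase.

# Minimal capture -> denoise -> VAD -> Whisper loop (sketch, not the shipped pipeline)
import numpy as np
import sounddevice as sd
import noisereduce as nr
import torch
import webrtcvad
import whisper

SAMPLE_RATE = 16000   # Whisper expects 16 kHz mono audio
CHUNK_SECONDS = 2.0   # matches the chunk_duration setting shown later

def pick_device() -> str:
    """Auto-detect the best compute device, falling back to CPU."""
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"  # Apple Silicon; Whisper support varies by version
    return "cpu"

def record_chunk() -> np.ndarray:
    """Record one chunk of microphone audio as float32 mono."""
    audio = sd.rec(int(SAMPLE_RATE * CHUNK_SECONDS), samplerate=SAMPLE_RATE,
                   channels=1, dtype="float32")
    sd.wait()  # block until the chunk is fully recorded
    return audio.flatten()

vad = webrtcvad.Vad(2)  # aggressiveness 0 (least) to 3 (most)

def has_speech(audio: np.ndarray) -> bool:
    """Check 30 ms frames of 16-bit PCM for voice activity."""
    pcm = (audio * 32767).astype(np.int16).tobytes()
    frame = int(SAMPLE_RATE * 0.03) * 2  # 30 ms of 16-bit samples, in bytes
    return any(vad.is_speech(pcm[i:i + frame], SAMPLE_RATE)
               for i in range(0, len(pcm) - frame, frame))

device = pick_device()
model = whisper.load_model("base", device=device)

while True:
    chunk = record_chunk()
    if not has_speech(chunk):
        continue  # skip silent chunks entirely
    chunk = nr.reduce_noise(y=chunk, sr=SAMPLE_RATE,
                            prop_decrease=0.7).astype(np.float32)
    result = model.transcribe(chunk, language="en", fp16=(device == "cuda"))
    text = result["text"].strip()
    if text:
        print(text)

A production loop would capture and transcribe on separate threads so audio is not dropped while the model runs.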

Tasks:

  • Set up project structure and dependencies
  • Implement audio capture with device selection
  • Add noise suppression and VAD preprocessing
  • Integrate Whisper model loading and inference
  • Add CPU/GPU device detection and selection logic
  • Create real-time audio buffer processing pipeline
  • Design and implement GUI layout (main window)
  • Add settings panel with user name configuration
  • Implement local transcription display area
  • Add start/stop controls and status indicators
  • Test transcription accuracy and latency
  • Test noise suppression effectiveness

Phase 2: Web Server and Sync System

Objective: Create a centralized server to aggregate and serve transcriptions

Components:

  1. Web Server

    • FastAPI or Flask-based REST API
    • WebSocket support for real-time updates
    • User/client registration and management
    • Libraries: fastapi, uvicorn, websockets
  2. Transcription Aggregator

    • Receive transcription chunks from multiple clients
    • Associate transcriptions with user IDs/names
    • Timestamp management and synchronization
    • Buffer management for smooth streaming
  3. Database/Storage (Optional)

    • Store transcription history (SQLite for simplicity)
    • Session management
    • Export functionality (SRT, VTT, TXT formats)

API Endpoints (sketched below):

  • POST /api/register - Register a new client
  • POST /api/transcription - Submit transcription chunk
  • WS /api/stream - WebSocket for real-time transcription stream
  • GET /stream - Web page for OBS browser source
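
As a sketch of how the submit and stream endpoints could fit together in FastAPI, the snippet below fans each POSTed chunk out to every connected WebSocket viewer. The TranscriptionChunk model and the in-memory viewers list are illustrative assumptions, not the final aggregator.

# server/api.py (sketch): fan incoming chunks out to stream viewers
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from pydantic import BaseModel

app = FastAPI()
viewers: list[WebSocket] = []  # connected stream pages (e.g. OBS browser sources)

class TranscriptionChunk(BaseModel):
    user: str
    text: str
    timestamp: float

@app.post("/api/transcription")
async def submit_transcription(chunk: TranscriptionChunk):
    for ws in list(viewers):  # iterate a copy so dead sockets can be removed
        try:
            await ws.send_json(chunk.model_dump())  # .dict() on Pydantic v1
        except Exception:
            viewers.remove(ws)
    return {"status": "ok"}

@app.websocket("/api/stream")
async def stream(ws: WebSocket):
    await ws.accept()
    viewers.append(ws)
    try:
        while True:
            await ws.receive_text()  # keep the connection open
    except WebSocketDisconnect:
        viewers.remove(ws)

During development this sketch could be served with: uvicorn server.api:app --host 0.0.0.0 --port 8000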

Tasks:

  • Set up FastAPI server with CORS support
  • Implement WebSocket handler for real-time streaming
  • Create client registration system
  • Build transcription aggregation logic
  • Add timestamp synchronization
  • Create data models for clients and transcriptions

Phase 3: Client-Server Communication (Optional Multi-user Mode)

Objective: Add optional server connectivity to enable multi-user transcription sync

Components:

  1. HTTP/WebSocket Client

    • Register client with server on startup
    • Send transcription chunks as they're generated
    • Handle connection drops and reconnection (see the sketch after this list)
    • Libraries: requests, websockets
  2. Configuration System

    • Config file for server URL, API keys, user settings
    • Model preferences (size, language)
    • Audio input settings
    • Format: YAML or JSON
  3. Status Monitoring

    • Connection status indicator
    • Transcription queue health
    • Error handling and logging
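
A minimal sketch of the reconnect behavior using the websockets library: the client drains an asyncio queue of transcription dicts and backs off for a few seconds whenever the connection drops, so the app keeps transcribing locally throughout. The queue, URL, and backoff interval are assumptions.

# client/network_client.py (sketch): send chunks, reconnect on failure
import asyncio
import json
import websockets

async def send_loop(queue: asyncio.Queue,
                    url: str = "ws://localhost:8000/api/stream") -> None:
    """Forward transcription chunks to the server, retrying on disconnect."""
    while True:
        try:
            async with websockets.connect(url) as ws:
                while True:
                    chunk = await queue.get()  # dict from the transcription engine
                    await ws.send(json.dumps(chunk))
        except (OSError, websockets.exceptions.ConnectionClosed):
            await asyncio.sleep(3)  # back off, then reconnect; the GUI keeps running

A chunk in flight when the socket drops is lost in this sketch; a real client would re-queue it before reconnecting.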

Tasks:

  • Add "Enable Server Sync" toggle to GUI
  • Add server URL configuration field in settings
  • Implement WebSocket client for sending transcriptions
  • Add configuration file support (YAML/JSON)
  • Create connection management with auto-reconnect
  • Add local logging and error handling
  • Add server connection status indicator to GUI
  • Allow app to function normally if server is unavailable

Phase 4: Web Stream Interface (OBS Integration)

Objective: Create a web page that displays synchronized transcriptions for OBS

Components:

  1. Web Frontend

    • HTML/CSS/JavaScript page for displaying transcriptions
    • Responsive design with customizable styling
    • Auto-scroll with configurable retention window
    • Libraries: Vanilla JS or lightweight framework (Alpine.js, htmx)
  2. Styling Options

    • Customizable fonts, colors, sizes
    • Background transparency for OBS chroma key
    • User name/ID display options
    • Timestamp display (optional)
  3. Display Modes

    • Scrolling captions (like live TV captions)
    • Multi-user panel view (separate sections per user)
    • Overlay mode (minimal UI for transparency)

Tasks:

  • Create HTML template for transcription display
  • Implement WebSocket client in JavaScript
  • Add CSS styling with OBS-friendly transparency
  • Create customization controls (URL parameters or UI)
  • Test with OBS browser source
  • Add configurable retention/scroll behavior

Phase 5: Advanced Features

Objective: Enhance functionality and user experience

Features:

  1. Language Detection

    • Auto-detect spoken language
    • Multi-language support in single stream
    • Language selector in GUI
  2. Speaker Diarization (Optional)

    • Identify different speakers
    • Label transcriptions by speaker
    • Useful for multi-host streams
  3. Profanity Filtering

    • Optional word filtering/replacement
    • Customizable filter lists
    • Toggle in GUI settings
  4. Advanced Noise Profiles

    • Save and load custom noise profiles
    • Adaptive noise suppression
    • Different profiles for different environments
  5. Export Functionality (an SRT writer is sketched after this list)

    • Save transcriptions in multiple formats (TXT, SRT, VTT, JSON)
    • Export button in GUI
    • Automatic session saving
  6. Hotkey Support

    • Global hotkeys to start/stop transcription
    • Mute/unmute hotkey
    • Quick save hotkey
  7. Docker Support

    • Containerized server deployment
    • Docker Compose for easy multi-component setup
    • Pre-built images for easy deployment
  8. Themes and Customization

    • Dark/light theme toggle
    • Customizable font sizes and colors for display
    • OBS-friendly transparent overlay mode
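
As one concrete example of export, a minimal SRT writer; the (start, end, text) segment tuples are an assumed intermediate format rather than an existing interface.

# Sketch: write transcription segments as an .srt subtitle file
def to_srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total = int(seconds)
    ms = int((seconds - total) * 1000)
    return f"{total // 3600:02}:{(total % 3600) // 60:02}:{total % 60:02},{ms:03}"

def export_srt(segments, path="session.srt"):
    """Write an iterable of (start_sec, end_sec, text) tuples as SRT."""
    with open(path, "w", encoding="utf-8") as f:
        for i, (start, end, text) in enumerate(segments, start=1):
            f.write(f"{i}\n")
            f.write(f"{to_srt_timestamp(start)} --> {to_srt_timestamp(end)}\n")
            f.write(f"{text}\n\n")

export_srt([(0.0, 2.4, "Hello chat!"), (2.4, 5.1, "Welcome back to the stream.")])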

Tasks:

  • Add language detection and multi-language support
  • Implement speaker diarization
  • Create optional profanity filter
  • Add export functionality (SRT, VTT, plain text, JSON)
  • Implement global hotkey support
  • Create Docker containers for server component
  • Add theme customization options
  • Create advanced noise profile management

Technology Stack

Local Client:

  • Python 3.9+
  • GUI: PyQt6 / CustomTkinter / tkinter
  • Audio: PyAudio / sounddevice
  • Noise Suppression: noisereduce / rnnoise-python
  • VAD: webrtcvad
  • ML Framework: PyTorch (for Whisper)
  • Transcription: openai-whisper / faster-whisper
  • Networking: websockets, requests (optional for server sync)
  • Config: PyYAML / json

Server:

  • Backend: FastAPI / Flask
  • WebSocket: python-websockets / FastAPI WebSockets
  • Server: Uvicorn / Gunicorn
  • Database (optional): SQLite / PostgreSQL
  • CORS: FastAPI's built-in CORSMiddleware

Web Interface:

  • Frontend: HTML5, CSS3, JavaScript (ES6+)
  • Real-time: WebSocket API
  • Styling: CSS Grid/Flexbox for layout

Project Structure

local-transcription/
├── client/                      # Local transcription client
│   ├── __init__.py
│   ├── audio_capture.py         # Audio input handling
│   ├── transcription_engine.py  # Whisper integration
│   ├── network_client.py        # Server communication
│   ├── config.py                # Configuration management
│   └── main.py                  # Client entry point
├── server/                      # Centralized web server
│   ├── __init__.py
│   ├── api.py                   # FastAPI routes
│   ├── websocket_handler.py     # WebSocket management
│   ├── models.py                # Data models
│   ├── database.py              # Optional DB layer
│   └── main.py                  # Server entry point
├── web/                         # Web stream interface
│   ├── index.html               # OBS browser source page
│   ├── styles.css               # Customizable styling
│   └── app.js                   # WebSocket client & UI logic
├── config/
│   ├── client_config.example.yaml
│   └── server_config.example.yaml
├── tests/
│   ├── test_audio.py
│   ├── test_transcription.py
│   └── test_server.py
├── requirements.txt             # Python dependencies
├── README.md
└── main.py                      # Combined launcher (optional)

Installation (Planned)

Prerequisites:

  • Python 3.9 or higher
  • CUDA-capable GPU (optional, for GPU acceleration)
  • FFmpeg (required by Whisper)

Steps:

  1. Clone the repository

    git clone <repository-url>
    cd local-transcription
    
  2. Install dependencies

    pip install -r requirements.txt
    
  3. Download Whisper models

    # Models will be auto-downloaded on first run
    # Or manually download:
    python -c "import whisper; whisper.load_model('base')"
    
  4. Configure client

    cp config/client_config.example.yaml config/client_config.yaml
    # Edit config/client_config.yaml with your settings
    
  5. Run the server (one instance)

    python server/main.py
    
  6. Run the client (on each user's machine)

    python client/main.py
    
  7. Add to OBS

    • Add a Browser Source
    • URL: http://<server-ip>:8000/stream
    • Set width/height as needed
    • Check "Shutdown source when not visible" for performance

Configuration (Planned)

Client Configuration:

user:
  name: "Streamer1"          # Display name for transcriptions
  id: "unique-user-id"       # Optional unique identifier

audio:
  input_device: "default"    # or specific device index
  sample_rate: 16000
  chunk_duration: 2.0        # seconds

noise_suppression:
  enabled: true              # Enable/disable noise reduction
  strength: 0.7              # 0.0 to 1.0 - reduction strength
  method: "noisereduce"      # "noisereduce" or "rnnoise"

transcription:
  model: "base"              # tiny, base, small, medium, large
  device: "cuda"             # cpu, cuda, mps
  language: "en"             # or "auto" for detection
  task: "transcribe"         # or "translate"

processing:
  use_vad: true              # Voice Activity Detection
  min_confidence: 0.5        # Minimum transcription confidence

server_sync:
  enabled: false             # Enable multi-user server sync
  url: "ws://localhost:8000" # Server URL (when enabled)
  api_key: ""                # Optional API key

display:
  show_timestamps: true      # Show timestamps in local display
  max_lines: 100             # Maximum lines to keep in display
  font_size: 12              # GUI font size

Server Configuration:

server:
  host: "0.0.0.0"
  port: 8000
  api_key_required: false

stream:
  max_clients: 10
  buffer_size: 100         # messages to buffer
  retention_time: 300      # seconds

database:
  enabled: false
  path: "transcriptions.db"

Roadmap

  • Project planning and architecture design
  • Phase 1: Standalone desktop application with GUI
  • Phase 2: Web server and sync system (optional multi-user mode)
  • Phase 3: Client-server communication (optional)
  • Phase 4: Web stream interface for OBS (optional)
  • Phase 5: Advanced features (hotkeys, themes, Docker, etc.)

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.


License

[Choose appropriate license - MIT, Apache 2.0, etc.]


Acknowledgments

  • OpenAI Whisper for the excellent speech recognition model
  • The streaming community for inspiration and use cases