Local Transcription for Streamers

A local speech-to-text application for streamers, providing real-time transcription with Whisper or compatible models. Multiple users can run the application locally and sync their transcriptions to a centralized web stream that is easy to capture in OBS or other streaming software.

Features

  • Standalone Desktop Application: Use locally with built-in GUI display - no server required
  • Local Transcription: Run Whisper (or compatible models) locally on your machine
  • CPU/GPU Support: Choose between CPU or GPU processing based on your hardware
  • Real-time Processing: Live audio transcription with minimal latency
  • Noise Suppression: Built-in audio preprocessing to reduce background noise
  • User Configuration: Set your display name and preferences through the GUI
  • Optional Multi-user Sync: Connect to a server to sync transcriptions with other users
  • OBS Integration: Web-based output designed for easy browser source capture
  • Privacy-First: All processing happens locally; only transcription text is shared
  • Customizable: Configure model size, language, and streaming settings

Quick Start

Running from Source

# Install dependencies
uv sync

# Run the application
uv run python main.py

Building Standalone Executables

To create standalone executables for distribution:

Linux:

./build.sh

Windows:

build.bat

For detailed build instructions, see BUILD.md.

Architecture Overview

The application can run in two modes:

Standalone Mode (No Server Required):

  1. Desktop Application: Captures audio, performs speech-to-text, and displays transcriptions locally in a GUI window

Multi-user Sync Mode (Optional):

  1. Local Transcription Client: Captures audio, performs speech-to-text, and sends results to the web server
  2. Centralized Web Server: Aggregates transcriptions from multiple clients and serves a web stream
  3. Web Stream Interface: Browser-accessible page displaying synchronized transcriptions (for OBS capture)

Use Cases

  • Multi-language Streams: Multiple translators transcribing in different languages
  • Accessibility: Provide real-time captions for viewers
  • Collaborative Podcasts: Multiple hosts with separate transcriptions
  • Gaming Commentary: Track who said what in multiplayer sessions

Implementation Plan

Phase 1: Standalone Desktop Application

Objective: Build a fully functional standalone transcription app with GUI that works without any server

Components (a combined sketch follows this list):

  1. Audio Capture Module

    • Capture system audio or microphone input
    • Support multiple audio sources (virtual audio cables, physical devices)
    • Real-time audio buffering with configurable chunk sizes
    • Noise Suppression: Preprocess audio to reduce background noise
    • Libraries: pyaudio, sounddevice, noisereduce, webrtcvad
  2. Noise Suppression Engine

    • Real-time noise reduction using RNNoise or noisereduce
    • Adjustable noise reduction strength
    • Optional VAD (Voice Activity Detection) to skip silent segments
    • Libraries: noisereduce, rnnoise-python, webrtcvad
  3. Transcription Engine

    • Integrate OpenAI Whisper (or alternatives: faster-whisper, whisper.cpp)
    • Support multiple model sizes (tiny, base, small, medium, large)
    • CPU and GPU inference options
    • Model management and automatic downloading
    • Libraries: openai-whisper, faster-whisper, torch
  4. Device Selection

    • Auto-detect available compute devices (CPU, CUDA, MPS for Mac)
    • Allow user to specify preferred device via GUI
    • Graceful fallback if GPU unavailable
    • Display device status and performance metrics
  5. Desktop GUI Application

    • Cross-platform GUI using PyQt6, Tkinter, or CustomTkinter
    • Main transcription display window (scrolling text area)
    • Settings panel for configuration
    • User name input field
    • Audio input device selector
    • Model size selector
    • CPU/GPU toggle
    • Start/Stop transcription button
    • Optional: System tray integration
    • Libraries: PyQt6, customtkinter, or tkinter
  6. Local Display

    • Real-time transcription display in GUI window
    • Scrolling text with timestamps
    • User name/label shown with transcriptions
    • Copy transcription to clipboard
    • Optional: Save transcription to file (TXT, SRT, VTT)
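
To make these components concrete, here is a minimal sketch of the capture → noise reduction → VAD → transcription loop. It assumes openai-whisper, sounddevice, noisereduce, webrtcvad, and torch are installed and a default microphone is available; helper names like pick_device and has_speech are illustrative, not part of the codebase.

# Minimal capture -> denoise -> VAD -> Whisper loop (sketch, not the shipped pipeline)
import numpy as np
import sounddevice as sd
import noisereduce as nr
import torch
import webrtcvad
import whisper

SAMPLE_RATE = 16000   # Whisper expects 16 kHz mono audio
CHUNK_SECONDS = 2.0   # matches the chunk_duration setting shown later

def pick_device() -> str:
    """Auto-detect the best compute device, falling back to CPU."""
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"  # Apple Silicon; Whisper support varies by version
    return "cpu"

def record_chunk() -> np.ndarray:
    """Record one chunk of microphone audio as float32 mono."""
    audio = sd.rec(int(SAMPLE_RATE * CHUNK_SECONDS), samplerate=SAMPLE_RATE,
                   channels=1, dtype="float32")
    sd.wait()  # block until the chunk is fully recorded
    return audio.flatten()

vad = webrtcvad.Vad(2)  # aggressiveness 0 (least) to 3 (most)

def has_speech(audio: np.ndarray) -> bool:
    """Check 30 ms frames of 16-bit PCM for voice activity."""
    pcm = (audio * 32767).astype(np.int16).tobytes()
    frame = int(SAMPLE_RATE * 0.03) * 2  # 30 ms of 16-bit samples, in bytes
    return any(vad.is_speech(pcm[i:i + frame], SAMPLE_RATE)
               for i in range(0, len(pcm) - frame, frame))

device = pick_device()
model = whisper.load_model("base", device=device)

while True:
    chunk = record_chunk()
    if not has_speech(chunk):
        continue  # skip silent chunks entirely
    chunk = nr.reduce_noise(y=chunk, sr=SAMPLE_RATE,
                            prop_decrease=0.7).astype(np.float32)
    result = model.transcribe(chunk, language="en", fp16=(device == "cuda"))
    text = result["text"].strip()
    if text:
        print(text)

A production loop would capture and transcribe on separate threads so audio is not dropped while the model runs.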

Tasks:

  • Set up project structure and dependencies
  • Implement audio capture with device selection
  • Add noise suppression and VAD preprocessing
  • Integrate Whisper model loading and inference
  • Add CPU/GPU device detection and selection logic
  • Create real-time audio buffer processing pipeline
  • Design and implement GUI layout (main window)
  • Add settings panel with user name configuration
  • Implement local transcription display area
  • Add start/stop controls and status indicators
  • Test transcription accuracy and latency
  • Test noise suppression effectiveness

Phase 2: Web Server and Sync System

Objective: Create a centralized server to aggregate and serve transcriptions

Components:

  1. Web Server

    • FastAPI or Flask-based REST API
    • WebSocket support for real-time updates
    • User/client registration and management
    • Libraries: fastapi, uvicorn, websockets
  2. Transcription Aggregator

    • Receive transcription chunks from multiple clients
    • Associate transcriptions with user IDs/names
    • Timestamp management and synchronization
    • Buffer management for smooth streaming
  3. Database/Storage (Optional)

    • Store transcription history (SQLite for simplicity)
    • Session management
    • Export functionality (SRT, VTT, TXT formats)

API Endpoints (sketched below):

  • POST /api/register - Register a new client
  • POST /api/transcription - Submit transcription chunk
  • WS /api/stream - WebSocket for real-time transcription stream
  • GET /stream - Web page for OBS browser source
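
As a sketch of how the submit and stream endpoints could fit together in FastAPI, the snippet below fans each POSTed chunk out to every connected WebSocket viewer. The TranscriptionChunk model and the in-memory viewers list are illustrative assumptions, not the final aggregator.

# server/api.py (sketch): fan incoming chunks out to stream viewers
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from pydantic import BaseModel

app = FastAPI()
viewers: list[WebSocket] = []  # connected stream pages (e.g. OBS browser sources)

class TranscriptionChunk(BaseModel):
    user: str
    text: str
    timestamp: float

@app.post("/api/transcription")
async def submit_transcription(chunk: TranscriptionChunk):
    for ws in list(viewers):  # iterate a copy so dead sockets can be removed
        try:
            await ws.send_json(chunk.model_dump())  # .dict() on Pydantic v1
        except Exception:
            viewers.remove(ws)
    return {"status": "ok"}

@app.websocket("/api/stream")
async def stream(ws: WebSocket):
    await ws.accept()
    viewers.append(ws)
    try:
        while True:
            await ws.receive_text()  # keep the connection open
    except WebSocketDisconnect:
        viewers.remove(ws)

During development this sketch could be served with: uvicorn server.api:app --host 0.0.0.0 --port 8000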

Tasks:

  • Set up FastAPI server with CORS support
  • Implement WebSocket handler for real-time streaming
  • Create client registration system
  • Build transcription aggregation logic
  • Add timestamp synchronization
  • Create data models for clients and transcriptions

Phase 3: Client-Server Communication (Optional Multi-user Mode)

Objective: Add optional server connectivity to enable multi-user transcription sync

Components:

  1. HTTP/WebSocket Client

    • Register client with server on startup
    • Send transcription chunks as they're generated
    • Handle connection drops and reconnection (see the sketch after this list)
    • Libraries: requests, websockets
  2. Configuration System

    • Config file for server URL, API keys, user settings
    • Model preferences (size, language)
    • Audio input settings
    • Format: YAML or JSON
  3. Status Monitoring

    • Connection status indicator
    • Transcription queue health
    • Error handling and logging
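
A minimal sketch of the reconnect behavior using the websockets library: the client drains an asyncio queue of transcription dicts and backs off for a few seconds whenever the connection drops, so the app keeps transcribing locally throughout. The queue, URL, and backoff interval are assumptions.

# client/network_client.py (sketch): send chunks, reconnect on failure
import asyncio
import json
import websockets

async def send_loop(queue: asyncio.Queue,
                    url: str = "ws://localhost:8000/api/stream") -> None:
    """Forward transcription chunks to the server, retrying on disconnect."""
    while True:
        try:
            async with websockets.connect(url) as ws:
                while True:
                    chunk = await queue.get()  # dict from the transcription engine
                    await ws.send(json.dumps(chunk))
        except (OSError, websockets.exceptions.ConnectionClosed):
            await asyncio.sleep(3)  # back off, then reconnect; the GUI keeps running

A chunk in flight when the socket drops is lost in this sketch; a real client would re-queue it before reconnecting.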

Tasks:

  • Add "Enable Server Sync" toggle to GUI
  • Add server URL configuration field in settings
  • Implement WebSocket client for sending transcriptions
  • Add configuration file support (YAML/JSON)
  • Create connection management with auto-reconnect
  • Add local logging and error handling
  • Add server connection status indicator to GUI
  • Allow app to function normally if server is unavailable

Phase 4: Web Stream Interface (OBS Integration)

Objective: Create a web page that displays synchronized transcriptions for OBS

Components:

  1. Web Frontend

    • HTML/CSS/JavaScript page for displaying transcriptions
    • Responsive design with customizable styling
    • Auto-scroll with configurable retention window
    • Libraries: Vanilla JS or lightweight framework (Alpine.js, htmx)
  2. Styling Options

    • Customizable fonts, colors, sizes
    • Background transparency for OBS chroma key
    • User name/ID display options
    • Timestamp display (optional)
  3. Display Modes

    • Scrolling captions (like live TV captions)
    • Multi-user panel view (separate sections per user)
    • Overlay mode (minimal UI for transparency)

Tasks:

  • Create HTML template for transcription display
  • Implement WebSocket client in JavaScript
  • Add CSS styling with OBS-friendly transparency
  • Create customization controls (URL parameters or UI)
  • Test with OBS browser source
  • Add configurable retention/scroll behavior

Phase 5: Advanced Features

Objective: Enhance functionality and user experience

Features:

  1. Language Detection

    • Auto-detect spoken language
    • Multi-language support in single stream
    • Language selector in GUI
  2. Speaker Diarization (Optional)

    • Identify different speakers
    • Label transcriptions by speaker
    • Useful for multi-host streams
  3. Profanity Filtering

    • Optional word filtering/replacement
    • Customizable filter lists
    • Toggle in GUI settings
  4. Advanced Noise Profiles

    • Save and load custom noise profiles
    • Adaptive noise suppression
    • Different profiles for different environments
  5. Export Functionality (an SRT writer is sketched after this list)

    • Save transcriptions in multiple formats (TXT, SRT, VTT, JSON)
    • Export button in GUI
    • Automatic session saving
  6. Hotkey Support

    • Global hotkeys to start/stop transcription
    • Mute/unmute hotkey
    • Quick save hotkey
  7. Docker Support

    • Containerized server deployment
    • Docker Compose for easy multi-component setup
    • Pre-built images for easy deployment
  8. Themes and Customization

    • Dark/light theme toggle
    • Customizable font sizes and colors for display
    • OBS-friendly transparent overlay mode
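
As one concrete example of export, a minimal SRT writer; the (start, end, text) segment tuples are an assumed intermediate format rather than an existing interface.

# Sketch: write transcription segments as an .srt subtitle file
def to_srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total = int(seconds)
    ms = int((seconds - total) * 1000)
    return f"{total // 3600:02}:{(total % 3600) // 60:02}:{total % 60:02},{ms:03}"

def export_srt(segments, path="session.srt"):
    """Write an iterable of (start_sec, end_sec, text) tuples as SRT."""
    with open(path, "w", encoding="utf-8") as f:
        for i, (start, end, text) in enumerate(segments, start=1):
            f.write(f"{i}\n")
            f.write(f"{to_srt_timestamp(start)} --> {to_srt_timestamp(end)}\n")
            f.write(f"{text}\n\n")

export_srt([(0.0, 2.4, "Hello chat!"), (2.4, 5.1, "Welcome back to the stream.")])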

Tasks:

  • Add language detection and multi-language support
  • Implement speaker diarization
  • Create optional profanity filter
  • Add export functionality (SRT, VTT, plain text, JSON)
  • Implement global hotkey support
  • Create Docker containers for server component
  • Add theme customization options
  • Create advanced noise profile management

Technology Stack

Local Client:

  • Python 3.9+
  • GUI: PyQt6 / CustomTkinter / tkinter
  • Audio: PyAudio / sounddevice
  • Noise Suppression: noisereduce / rnnoise-python
  • VAD: webrtcvad
  • ML Framework: PyTorch (for Whisper)
  • Transcription: openai-whisper / faster-whisper
  • Networking: websockets, requests (optional for server sync)
  • Config: PyYAML / json

Server:

  • Backend: FastAPI / Flask
  • WebSocket: python-websockets / FastAPI WebSockets
  • Server: Uvicorn / Gunicorn
  • Database (optional): SQLite / PostgreSQL
  • CORS: FastAPI's built-in CORSMiddleware

Web Interface:

  • Frontend: HTML5, CSS3, JavaScript (ES6+)
  • Real-time: WebSocket API
  • Styling: CSS Grid/Flexbox for layout

Project Structure

local-transcription/
├── client/                      # Local transcription client
│   ├── __init__.py
│   ├── audio_capture.py         # Audio input handling
│   ├── transcription_engine.py  # Whisper integration
│   ├── network_client.py        # Server communication
│   ├── config.py                # Configuration management
│   └── main.py                  # Client entry point
├── server/                      # Centralized web server
│   ├── __init__.py
│   ├── api.py                   # FastAPI routes
│   ├── websocket_handler.py     # WebSocket management
│   ├── models.py                # Data models
│   ├── database.py              # Optional DB layer
│   └── main.py                  # Server entry point
├── web/                         # Web stream interface
│   ├── index.html               # OBS browser source page
│   ├── styles.css               # Customizable styling
│   └── app.js                   # WebSocket client & UI logic
├── config/
│   ├── client_config.example.yaml
│   └── server_config.example.yaml
├── tests/
│   ├── test_audio.py
│   ├── test_transcription.py
│   └── test_server.py
├── requirements.txt             # Python dependencies
├── README.md
└── main.py                      # Combined launcher (optional)

Installation (Planned)

Prerequisites:

  • Python 3.9 or higher
  • CUDA-capable GPU (optional, for GPU acceleration)
  • FFmpeg (required by Whisper)

Steps:

  1. Clone the repository

    git clone <repository-url>
    cd local-transcription
    
  2. Install dependencies

    pip install -r requirements.txt
    
  3. Download Whisper models

    # Models will be auto-downloaded on first run
    # Or manually download:
    python -c "import whisper; whisper.load_model('base')"
    
  4. Configure client

    cp config/client_config.example.yaml config/client_config.yaml
    # Edit config/client_config.yaml with your settings
    
  5. Run the server (one instance)

    python server/main.py
    
  6. Run the client (on each user's machine)

    python client/main.py
    
  7. Add to OBS

    • Add a Browser Source
    • URL: http://<server-ip>:8000/stream
    • Set width/height as needed
    • Check "Shutdown source when not visible" for performance

Configuration (Planned)

Client Configuration:

user:
  name: "Streamer1"          # Display name for transcriptions
  id: "unique-user-id"       # Optional unique identifier

audio:
  input_device: "default"    # or specific device index
  sample_rate: 16000
  chunk_duration: 2.0        # seconds

noise_suppression:
  enabled: true              # Enable/disable noise reduction
  strength: 0.7              # 0.0 to 1.0 - reduction strength
  method: "noisereduce"      # "noisereduce" or "rnnoise"

transcription:
  model: "base"              # tiny, base, small, medium, large
  device: "cuda"             # cpu, cuda, mps
  language: "en"             # or "auto" for detection
  task: "transcribe"         # or "translate"

processing:
  use_vad: true              # Voice Activity Detection
  min_confidence: 0.5        # Minimum transcription confidence

server_sync:
  enabled: false             # Enable multi-user server sync
  url: "ws://localhost:8000" # Server URL (when enabled)
  api_key: ""                # Optional API key

display:
  show_timestamps: true      # Show timestamps in local display
  max_lines: 100             # Maximum lines to keep in display
  font_size: 12              # GUI font size

Server Configuration:

server:
  host: "0.0.0.0"
  port: 8000
  api_key_required: false

stream:
  max_clients: 10
  buffer_size: 100         # messages to buffer
  retention_time: 300      # seconds

database:
  enabled: false
  path: "transcriptions.db"

Roadmap

  • Project planning and architecture design
  • Phase 1: Standalone desktop application with GUI
  • Phase 2: Web server and sync system (optional multi-user mode)
  • Phase 3: Client-server communication (optional)
  • Phase 4: Web stream interface for OBS (optional)
  • Phase 5: Advanced features (hotkeys, themes, Docker, etc.)

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.


License

[Choose appropriate license - MIT, Apache 2.0, etc.]


Acknowledgments

  • OpenAI Whisper for the excellent speech recognition model
  • The streaming community for inspiration and use cases