Local Transcription for Streamers

A local speech-to-text application for streamers that provides real-time transcription using Whisper or compatible models. Multiple users can run the application locally and sync their transcriptions to a centralized web stream that can be captured easily in OBS or other streaming software.

Features

  • Standalone Desktop Application: Use locally with built-in GUI display - no server required
  • Local Transcription: Run Whisper (or compatible models) locally on your machine
  • CPU/GPU Support: Choose between CPU or GPU processing based on your hardware
  • Real-time Processing: Live audio transcription with minimal latency
  • Noise Suppression: Built-in audio preprocessing to reduce background noise
  • User Configuration: Set your display name and preferences through the GUI
  • Optional Multi-user Sync: Connect to a server to sync transcriptions with other users
  • OBS Integration: Web-based output designed for easy browser source capture
  • Privacy-First: All processing happens locally; only transcription text is shared
  • Customizable: Configure model size, language, and streaming settings

Quick Start

Running from Source

# Install dependencies
uv sync

# Run the application
uv run python main.py

Building Standalone Executables

To create standalone executables for distribution:

Linux:

./build.sh

Windows:

build.bat

For detailed build instructions, see BUILD.md.

Architecture Overview

The application can run in two modes:

Standalone Mode (No Server Required):

  1. Desktop Application: Captures audio, performs speech-to-text, and displays transcriptions locally in a GUI window

Multi-user Sync Mode (Optional):

  1. Local Transcription Client: Captures audio, performs speech-to-text, and sends results to the web server
  2. Centralized Web Server: Aggregates transcriptions from multiple clients and serves a web stream
  3. Web Stream Interface: Browser-accessible page displaying synchronized transcriptions (for OBS capture)

Use Cases

  • Multi-language Streams: Multiple translators transcribing in different languages
  • Accessibility: Provide real-time captions for viewers
  • Collaborative Podcasts: Multiple hosts with separate transcriptions
  • Gaming Commentary: Track who said what in multiplayer sessions

Implementation Plan

Phase 1: Standalone Desktop Application

Objective: Build a fully functional standalone transcription app with GUI that works without any server

Components:

  1. Audio Capture Module

    • Capture system audio or microphone input
    • Support multiple audio sources (virtual audio cables, physical devices)
    • Real-time audio buffering with configurable chunk sizes
    • Noise Suppression: Preprocess audio to reduce background noise
    • Libraries: pyaudio, sounddevice, noisereduce, webrtcvad
  2. Noise Suppression Engine

    • Real-time noise reduction using RNNoise or noisereduce
    • Adjustable noise reduction strength
    • Optional VAD (Voice Activity Detection) to skip silent segments
    • Libraries: noisereduce, rnnoise-python, webrtcvad (see the sketch after this list)
  3. Transcription Engine

    • Integrate OpenAI Whisper (or alternatives: faster-whisper, whisper.cpp)
    • Support multiple model sizes (tiny, base, small, medium, large)
    • CPU and GPU inference options
    • Model management and automatic downloading
    • Libraries: openai-whisper, faster-whisper, torch
  4. Device Selection

    • Auto-detect available compute devices (CPU, CUDA, MPS for Mac)
    • Allow user to specify preferred device via GUI
    • Graceful fallback if GPU unavailable
    • Display device status and performance metrics
  5. Desktop GUI Application

    • Cross-platform GUI using PyQt6, Tkinter, or CustomTkinter
    • Main transcription display window (scrolling text area)
    • Settings panel for configuration
    • User name input field
    • Audio input device selector
    • Model size selector
    • CPU/GPU toggle
    • Start/Stop transcription button
    • Optional: System tray integration
    • Libraries: PyQt6, customtkinter, or tkinter
  6. Local Display

    • Real-time transcription display in GUI window
    • Scrolling text with timestamps
    • User name/label shown with transcriptions
    • Copy transcription to clipboard
    • Optional: Save transcription to file (TXT, SRT, VTT)
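A minimal sketch of the capture-and-preprocess path from components 1 and 2, assuming sounddevice, noisereduce, and webrtcvad are installed; the function names (capture_chunk, has_speech, preprocess) are illustrative, not a settled API:

# capture_sketch.py - illustrative only
import numpy as np
import sounddevice as sd
import noisereduce as nr
import webrtcvad

SAMPLE_RATE = 16000            # Whisper expects 16 kHz mono audio
FRAME_MS = 30                  # webrtcvad accepts 10, 20, or 30 ms frames
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000

vad = webrtcvad.Vad(2)         # aggressiveness 0 (least) to 3 (most)

def capture_chunk(seconds=2.0):
    # Blocking capture of one mono float32 chunk from the default input device
    chunk = sd.rec(int(seconds * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                   channels=1, dtype="float32")
    sd.wait()
    return chunk.flatten()

def has_speech(chunk):
    # webrtcvad wants 16-bit PCM bytes in exact frame-sized slices
    pcm = (chunk * 32767).astype(np.int16).tobytes()
    step = FRAME_SAMPLES * 2   # two bytes per 16-bit sample
    frames = [pcm[i:i + step] for i in range(0, len(pcm) - step + 1, step)]
    return any(vad.is_speech(f, SAMPLE_RATE) for f in frames)

def preprocess(chunk):
    # Reduce stationary background noise before handing audio to the model
    return nr.reduce_noise(y=chunk, sr=SAMPLE_RATE, prop_decrease=0.7)

if __name__ == "__main__":
    audio = capture_chunk()
    if has_speech(audio):
        cleaned = preprocess(audio)
        print("speech detected; chunk ready for transcription")
    else:
        print("silence; chunk skipped")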

Tasks:

  • Set up project structure and dependencies
  • Implement audio capture with device selection
  • Add noise suppression and VAD preprocessing
  • Integrate Whisper model loading and inference
  • Add CPU/GPU device detection and selection logic (sketched below)
  • Create real-time audio buffer processing pipeline
  • Design and implement GUI layout (main window)
  • Add settings panel with user name configuration
  • Implement local transcription display area
  • Add start/stop controls and status indicators
  • Test transcription accuracy and latency
  • Test noise suppression effectiveness
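The device-detection and model-loading tasks above could look like the following sketch, using torch for detection and faster-whisper for inference; pick_device is a hypothetical helper:

# device_sketch.py - illustrative only
import torch
from faster_whisper import WhisperModel

def pick_device(preferred="auto"):
    # Honor an explicit user choice, otherwise fall back gracefully
    if preferred != "auto":
        return preferred
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"   # CTranslate2 (faster-whisper) has no MPS backend

device = pick_device()
compute_type = "float16" if device == "cuda" else "int8"
model = WhisperModel("base", device=device, compute_type=compute_type)

# segments is a generator of timestamped results
segments, info = model.transcribe("chunk.wav", language="en", vad_filter=True)
for seg in segments:
    print(f"[{seg.start:6.2f} -> {seg.end:6.2f}] {seg.text}")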

Phase 2: Web Server and Sync System

Objective: Create a centralized server to aggregate and serve transcriptions

Components:

  1. Web Server

    • FastAPI or Flask-based REST API
    • WebSocket support for real-time updates
    • User/client registration and management
    • Libraries: fastapi, uvicorn, websockets
  2. Transcription Aggregator

    • Receive transcription chunks from multiple clients
    • Associate transcriptions with user IDs/names
    • Timestamp management and synchronization
    • Buffer management for smooth streaming
  3. Database/Storage (Optional)

    • Store transcription history (SQLite for simplicity)
    • Session management
    • Export functionality (SRT, VTT, TXT formats)

API Endpoints:

  • POST /api/register - Register a new client
  • POST /api/transcription - Submit transcription chunk
  • WS /api/stream - WebSocket for real-time transcription stream
  • GET /stream - Web page for OBS browser source

Tasks:

  • Set up FastAPI server with CORS support
  • Implement WebSocket handler for real-time streaming
  • Create client registration system
  • Build transcription aggregation logic
  • Add timestamp synchronization
  • Create data models for clients and transcriptions

Phase 3: Client-Server Communication (Optional Multi-user Mode)

Objective: Add optional server connectivity to enable multi-user transcription sync

Components:

  1. HTTP/WebSocket Client

    • Register client with server on startup
    • Send transcription chunks as they're generated
    • Handle connection drops and reconnection
    • Libraries: requests, websockets (see the sketch after this list)
  2. Configuration System

    • Config file for server URL, API keys, user settings
    • Model preferences (size, language)
    • Audio input settings
    • Format: YAML or JSON
  3. Status Monitoring

    • Connection status indicator
    • Transcription queue health
    • Error handling and logging
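Component 1's send-and-reconnect loop might look like this sketch, assuming the websockets package and an asyncio.Queue handoff from the transcription engine (both assumptions, not settled design):

# sync_client_sketch.py - illustrative only
import asyncio
import json
import websockets

async def sync_loop(url, outbox):
    # Reconnect forever with a fixed backoff
    while True:
        try:
            async with websockets.connect(url) as ws:
                while True:
                    chunk = await outbox.get()       # dicts from the transcriber
                    await ws.send(json.dumps(chunk))
        except (OSError, websockets.ConnectionClosed):
            await asyncio.sleep(3)                   # wait before retrying

# Usage: the transcription engine puts {"client_id": ..., "text": ...}
# on the queue, and sync_loop ships each item to an assumed WebSocket
# ingest endpoint such as ws://localhost:8000/api/stream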

Tasks:

  • Add "Enable Server Sync" toggle to GUI
  • Add server URL configuration field in settings
  • Implement WebSocket client for sending transcriptions
  • Add configuration file support (YAML/JSON)
  • Create connection management with auto-reconnect
  • Add local logging and error handling
  • Add server connection status indicator to GUI
  • Allow app to function normally if server is unavailable

Phase 4: Web Stream Interface (OBS Integration)

Objective: Create a web page that displays synchronized transcriptions for OBS

Components:

  1. Web Frontend

    • HTML/CSS/JavaScript page for displaying transcriptions
    • Responsive design with customizable styling
    • Auto-scroll with configurable retention window
    • Libraries: Vanilla JS or lightweight framework (Alpine.js, htmx)
  2. Styling Options

    • Customizable fonts, colors, sizes
    • Background transparency for OBS chroma key
    • User name/ID display options
    • Timestamp display (optional)
  3. Display Modes

    • Scrolling captions (like live TV captions)
    • Multi-user panel view (separate sections per user)
    • Overlay mode (minimal UI for transparency)

Tasks:

  • Create HTML template for transcription display
  • Implement WebSocket client in JavaScript
  • Add CSS styling with OBS-friendly transparency
  • Create customization controls (URL parameters or UI)
  • Test with OBS browser source
  • Add configurable retention/scroll behavior

Phase 5: Advanced Features

Objective: Enhance functionality and user experience

Features:

  1. Language Detection

    • Auto-detect spoken language
    • Multi-language support in single stream
    • Language selector in GUI
  2. Speaker Diarization (Optional)

    • Identify different speakers
    • Label transcriptions by speaker
    • Useful for multi-host streams
  3. Profanity Filtering

    • Optional word filtering/replacement
    • Customizable filter lists
    • Toggle in GUI settings
  4. Advanced Noise Profiles

    • Save and load custom noise profiles
    • Adaptive noise suppression
    • Different profiles for different environments
  5. Export Functionality

    • Save transcriptions in multiple formats (TXT, SRT, VTT, JSON)
    • Export button in GUI
    • Automatic session saving
  6. Hotkey Support

    • Global hotkeys to start/stop transcription (sketched after this list)
    • Mute/unmute hotkey
    • Quick save hotkey
  7. Docker Support

    • Containerized server deployment
    • Docker Compose for easy multi-component setup
    • Pre-built images for easy deployment
  8. Themes and Customization

    • Dark/light theme toggle
    • Customizable font sizes and colors for display
    • OBS-friendly transparent overlay mode
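Feature 6 could be prototyped with pynput (one possible library choice; it is not named in the stack below), with placeholder callbacks:

# hotkey_sketch.py - illustrative only; assumes the pynput package
from pynput import keyboard

def toggle_transcription():
    print("start/stop toggled")   # placeholder: wire to the engine

def toggle_mute():
    print("mute toggled")         # placeholder: wire to audio capture

hotkeys = keyboard.GlobalHotKeys({
    "<ctrl>+<alt>+t": toggle_transcription,
    "<ctrl>+<alt>+m": toggle_mute,
})
hotkeys.start()   # listens on a background thread
hotkeys.join()    # block only when running as a standalone script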

Tasks:

  • Add language detection and multi-language support
  • Implement speaker diarization
  • Create optional profanity filter
  • Add export functionality (SRT, VTT, plain text, JSON; SRT sketched below)
  • Implement global hotkey support
  • Create Docker containers for server component
  • Add theme customization options
  • Create advanced noise profile management
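The export task is small enough to hand-roll for SRT; a sketch assuming each segment carries start and end times in seconds plus its text:

# srt_sketch.py - illustrative SRT writer
def fmt_time(seconds):
    # SRT timestamps look like 00:01:02,345
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def to_srt(segments):
    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        blocks.append(f"{i}\n{fmt_time(start)} --> {fmt_time(end)}\n{text}\n")
    return "\n".join(blocks)

print(to_srt([(0.0, 2.5, "Hello stream"), (2.5, 4.0, "Testing captions")]))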

Technology Stack

Local Client:

  • Python 3.9+
  • GUI: PyQt6 / CustomTkinter / tkinter
  • Audio: PyAudio / sounddevice
  • Noise Suppression: noisereduce / rnnoise-python
  • VAD: webrtcvad
  • ML Framework: PyTorch (for Whisper)
  • Transcription: openai-whisper / faster-whisper
  • Networking: websockets, requests (optional for server sync)
  • Config: PyYAML / json

Server:

  • Backend: FastAPI / Flask
  • WebSocket: websockets / FastAPI WebSockets
  • Server: Uvicorn / Gunicorn
  • Database (optional): SQLite / PostgreSQL
  • CORS: FastAPI's built-in CORSMiddleware

Web Interface:

  • Frontend: HTML5, CSS3, JavaScript (ES6+)
  • Real-time: WebSocket API
  • Styling: CSS Grid/Flexbox for layout

Project Structure

local-transcription/
├── client/                      # Local transcription client
│   ├── __init__.py
│   ├── audio_capture.py         # Audio input handling
│   ├── transcription_engine.py  # Whisper integration
│   ├── network_client.py        # Server communication
│   ├── config.py                # Configuration management
│   └── main.py                  # Client entry point
├── server/                      # Centralized web server
│   ├── __init__.py
│   ├── api.py                   # FastAPI routes
│   ├── websocket_handler.py     # WebSocket management
│   ├── models.py                # Data models
│   ├── database.py              # Optional DB layer
│   └── main.py                  # Server entry point
├── web/                         # Web stream interface
│   ├── index.html               # OBS browser source page
│   ├── styles.css               # Customizable styling
│   └── app.js                   # WebSocket client & UI logic
├── config/
│   ├── client_config.example.yaml
│   └── server_config.example.yaml
├── tests/
│   ├── test_audio.py
│   ├── test_transcription.py
│   └── test_server.py
├── requirements.txt             # Python dependencies
├── README.md
└── main.py                      # Combined launcher (optional)

Installation (Planned)

Prerequisites:

  • Python 3.9 or higher
  • CUDA-capable GPU (optional, for GPU acceleration)
  • FFmpeg (required by Whisper)

Steps:

  1. Clone the repository

    git clone <repository-url>
    cd local-transcription
    
  2. Install dependencies

    pip install -r requirements.txt
    
  3. Download Whisper models

    # Models will be auto-downloaded on first run
    # Or manually download:
    python -c "import whisper; whisper.load_model('base')"
    
  4. Configure client

    cp config/client_config.example.yaml config/client_config.yaml
    # Edit config/client_config.yaml with your settings
    
  5. Run the server (one instance)

    python server/main.py
    
  6. Run the client (on each user's machine)

    python client/main.py
    
  7. Add to OBS

    • Add a Browser Source
    • URL: http://<server-ip>:8000/stream
    • Set width/height as needed
    • Check "Shutdown source when not visible" for performance

Configuration (Planned)

Client Configuration:

user:
  name: "Streamer1"          # Display name for transcriptions
  id: "unique-user-id"       # Optional unique identifier

audio:
  input_device: "default"    # or specific device index
  sample_rate: 16000
  chunk_duration: 2.0        # seconds

noise_suppression:
  enabled: true              # Enable/disable noise reduction
  strength: 0.7              # 0.0 to 1.0 - reduction strength
  method: "noisereduce"      # "noisereduce" or "rnnoise"

transcription:
  model: "base"              # tiny, base, small, medium, large
  device: "cuda"             # cpu, cuda, mps
  language: "en"             # or "auto" for detection
  task: "transcribe"         # or "translate"

processing:
  use_vad: true              # Voice Activity Detection
  min_confidence: 0.5        # Minimum transcription confidence

server_sync:
  enabled: false             # Enable multi-user server sync
  url: "ws://localhost:8000" # Server URL (when enabled)
  api_key: ""                # Optional API key

display:
  show_timestamps: true      # Show timestamps in local display
  max_lines: 100             # Maximum lines to keep in display
  font_size: 12              # GUI font size
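A sketch of how the planned config.py might load this file, assuming PyYAML; the path and fallback values are assumptions:

# config_sketch.py - illustrative only
import yaml

def load_config(path="config/client_config.yaml"):
    with open(path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)

cfg = load_config()
model_size = cfg["transcription"]["model"]          # e.g. "base"
device = cfg["transcription"].get("device", "cpu")  # fall back to CPU
print(f"loading Whisper '{model_size}' on {device}")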

Server Configuration:

server:
  host: "0.0.0.0"
  port: 8000
  api_key_required: false

stream:
  max_clients: 10
  buffer_size: 100         # messages to buffer
  retention_time: 300      # seconds

database:
  enabled: false
  path: "transcriptions.db"

Roadmap

  • Project planning and architecture design
  • Phase 1: Standalone desktop application with GUI
  • Phase 2: Web server and sync system (optional multi-user mode)
  • Phase 3: Client-server communication (optional)
  • Phase 4: Web stream interface for OBS (optional)
  • Phase 5: Advanced features (hotkeys, themes, Docker, etc.)

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.


License

[Choose appropriate license - MIT, Apache 2.0, etc.]


Acknowledgments

  • OpenAI Whisper for the excellent speech recognition model
  • The streaming community for inspiration and use cases