# Local Transcription for Streamers

A local speech-to-text application for streamers that provides real-time transcription using Whisper or similar models. Multiple users can run the application locally and sync their transcriptions to a centralized web stream that can be easily captured in OBS or other streaming software.

## Features

- **Standalone Desktop Application**: Use locally with built-in GUI display - no server required
- **Local Transcription**: Run Whisper (or compatible models) locally on your machine
- **CPU/GPU Support**: Choose between CPU or GPU processing based on your hardware
- **Real-time Processing**: Live audio transcription with minimal latency
- **Noise Suppression**: Built-in audio preprocessing to reduce background noise
- **User Configuration**: Set your display name and preferences through the GUI
- **Optional Multi-user Sync**: Connect to a server to sync transcriptions with other users
- **OBS Integration**: Web-based output designed for easy browser source capture
- **Privacy-First**: All processing happens locally; only transcription text is shared
- **Customizable**: Configure model size, language, and streaming settings

## Quick Start

### Running from Source

```bash
# Install dependencies
uv sync

# Run the application
uv run python main.py
```

### Building Standalone Executables

To create standalone executables for distribution:

**Linux:**
```bash
./build.sh
```

**Windows:**
```cmd
build.bat
```

For detailed build instructions, see [BUILD.md](BUILD.md).

## Architecture Overview

The application can run in two modes:

### Standalone Mode (No Server Required):
1. **Desktop Application**: Captures audio, performs speech-to-text, and displays transcriptions locally in a GUI window

### Multi-user Sync Mode (Optional):
1. **Local Transcription Client**: Captures audio, performs speech-to-text, and sends results to the web server
2. **Centralized Web Server**: Aggregates transcriptions from multiple clients and serves a web stream
3. **Web Stream Interface**: Browser-accessible page displaying synchronized transcriptions (for OBS capture)

## Use Cases

- **Multi-language Streams**: Multiple translators transcribing in different languages
- **Accessibility**: Provide real-time captions for viewers
- **Collaborative Podcasts**: Multiple hosts with separate transcriptions
- **Gaming Commentary**: Track who said what in multiplayer sessions

---

## Implementation Plan

### Phase 1: Standalone Desktop Application

**Objective**: Build a fully functional standalone transcription app with a GUI that works without any server

#### Components:
1. **Audio Capture Module** (see the capture/VAD sketch after this list)
   - Capture system audio or microphone input
   - Support multiple audio sources (virtual audio cables, physical devices)
   - Real-time audio buffering with configurable chunk sizes
   - **Noise Suppression**: Preprocess audio to reduce background noise
   - Libraries: `pyaudio`, `sounddevice`, `noisereduce`, `webrtcvad`

2. **Noise Suppression Engine**
   - Real-time noise reduction using RNNoise or noisereduce
   - Adjustable noise reduction strength
   - Optional VAD (Voice Activity Detection) to skip silent segments
   - Libraries: `noisereduce`, `rnnoise-python`, `webrtcvad`

3. **Transcription Engine**
   - Integrate OpenAI Whisper (or alternatives: faster-whisper, whisper.cpp)
   - Support multiple model sizes (tiny, base, small, medium, large)
   - CPU and GPU inference options
   - Model management and automatic downloading
   - Libraries: `openai-whisper`, `faster-whisper`, `torch`

4. **Device Selection**
   - Auto-detect available compute devices (CPU, CUDA, MPS for Mac)
   - Allow user to specify preferred device via GUI
   - Graceful fallback if GPU unavailable
   - Display device status and performance metrics

5. **Desktop GUI Application**
   - Cross-platform GUI using PyQt6, Tkinter, or CustomTkinter
   - Main transcription display window (scrolling text area)
   - Settings panel for configuration
   - User name input field
   - Audio input device selector
   - Model size selector
   - CPU/GPU toggle
   - Start/Stop transcription button
   - Optional: System tray integration
   - Libraries: `PyQt6`, `customtkinter`, or `tkinter`

6. **Local Display**
   - Real-time transcription display in GUI window
   - Scrolling text with timestamps
   - User name/label shown with transcriptions
   - Copy transcription to clipboard
   - Optional: Save transcription to file (TXT, SRT, VTT)

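To make the capture and VAD pieces concrete, here is a minimal sketch using `sounddevice` and `webrtcvad`, assuming 16 kHz mono capture; the frame size, queue wiring, and function names are illustrative, not the application's actual code:

```python
# Minimal capture + VAD sketch (illustrative names; assumes 16 kHz mono input).
import queue

import numpy as np
import sounddevice as sd
import webrtcvad

SAMPLE_RATE = 16000           # Whisper-family models expect 16 kHz audio
FRAME_MS = 30                 # webrtcvad accepts 10, 20, or 30 ms frames
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000

audio_frames = queue.Queue()  # blocks of float32 samples from the callback
vad = webrtcvad.Vad(2)        # aggressiveness 0 (least) .. 3 (most)

def callback(indata, frames, time_info, status):
    """Push a copy of each captured block onto the queue (mono channel 0)."""
    audio_frames.put(indata[:, 0].copy())

def speech_frames():
    """Yield only fixed-size frames that webrtcvad classifies as speech."""
    buffer = np.empty(0, dtype=np.float32)
    while True:
        buffer = np.concatenate([buffer, audio_frames.get()])
        while len(buffer) >= FRAME_SAMPLES:
            frame, buffer = buffer[:FRAME_SAMPLES], buffer[FRAME_SAMPLES:]
            pcm16 = (frame * 32767).astype(np.int16).tobytes()
            if vad.is_speech(pcm16, SAMPLE_RATE):
                yield frame

with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype="float32",
                    blocksize=FRAME_SAMPLES, callback=callback):
    for frame in speech_frames():
        pass  # hand speech frames to noise reduction / transcription

```
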
#### Tasks:
- [ ] Set up project structure and dependencies
- [ ] Implement audio capture with device selection
- [ ] Add noise suppression and VAD preprocessing
- [ ] Integrate Whisper model loading and inference (see the sketch after this list)
- [ ] Add CPU/GPU device detection and selection logic
- [ ] Create real-time audio buffer processing pipeline
- [ ] Design and implement GUI layout (main window)
- [ ] Add settings panel with user name configuration
- [ ] Implement local transcription display area
- [ ] Add start/stop controls and status indicators
- [ ] Test transcription accuracy and latency
- [ ] Test noise suppression effectiveness

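For the model-loading and device-selection tasks, a hedged sketch with `faster-whisper`; the fallback order and helper names are assumptions of this plan, and note that CTranslate2 (faster-whisper's backend) runs on CPU or CUDA, so MPS detection applies to the openai-whisper/PyTorch path instead:

```python
# Device pick + model load sketch with faster-whisper (names are illustrative).
import torch
from faster_whisper import WhisperModel

def pick_device() -> str:
    """Prefer CUDA when available; otherwise fall back to CPU."""
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
compute_type = "float16" if device == "cuda" else "int8"  # int8 keeps CPU usage modest
model = WhisperModel("base", device=device, compute_type=compute_type)

def transcribe_chunk(audio_f32):
    """Transcribe one float32 mono chunk sampled at 16 kHz; return the joined text."""
    segments, _info = model.transcribe(audio_f32)
    return " ".join(segment.text.strip() for segment in segments)
```
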
---

### Phase 2: Web Server and Sync System

**Objective**: Create a centralized server to aggregate and serve transcriptions

#### Components:
1. **Web Server**
   - FastAPI or Flask-based REST API
   - WebSocket support for real-time updates
   - User/client registration and management
   - Libraries: `fastapi`, `uvicorn`, `websockets`

2. **Transcription Aggregator** (see the data-model sketch after this list)
   - Receive transcription chunks from multiple clients
   - Associate transcriptions with user IDs/names
   - Timestamp management and synchronization
   - Buffer management for smooth streaming

3. **Database/Storage** (Optional)
   - Store transcription history (SQLite for simplicity)
   - Session management
   - Export functionality (SRT, VTT, TXT formats)

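One way to pin down the client/transcription data models is with Pydantic (already a FastAPI dependency); the field names below are placeholders, not a committed schema:

```python
# Data-model sketch; fields are illustrative only.
from datetime import datetime
from typing import Optional

from pydantic import BaseModel

class Client(BaseModel):
    client_id: str
    display_name: str
    language: Optional[str] = None          # optional per-client language hint

class TranscriptionChunk(BaseModel):
    client_id: str
    text: str
    start_time: float                       # seconds from session start
    end_time: float
    received_at: Optional[datetime] = None  # stamped by the server on receipt
```
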
#### API Endpoints:
- `POST /api/register` - Register a new client
- `POST /api/transcription` - Submit transcription chunk
- `WS /api/stream` - WebSocket for real-time transcription stream
- `GET /stream` - Web page for OBS browser source

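A minimal FastAPI sketch of these endpoints, with in-memory state and no authentication; the payload fields (`name`, `text`, `timestamp`) are placeholders rather than a committed schema:

```python
# Endpoint sketch only; route paths come from the list above.
import uuid

from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()
clients = {}        # client_id -> display name
viewers = []        # connected /api/stream sockets

@app.post("/api/register")
async def register(payload: dict):
    client_id = str(uuid.uuid4())
    clients[client_id] = payload.get("name", "anonymous")
    return {"client_id": client_id}

@app.post("/api/transcription")
async def transcription(payload: dict):
    message = {
        "user": clients.get(payload.get("client_id"), "unknown"),
        "text": payload.get("text", ""),
        "timestamp": payload.get("timestamp"),
    }
    for ws in list(viewers):
        try:
            await ws.send_json(message)   # fan out to every connected viewer
        except Exception:
            viewers.remove(ws)            # drop dead connections
    return {"status": "ok"}

@app.websocket("/api/stream")
async def stream(ws: WebSocket):
    await ws.accept()
    viewers.append(ws)
    try:
        while True:
            await ws.receive_text()       # keep the connection open
    except WebSocketDisconnect:
        viewers.remove(ws)
```
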
#### Tasks:
- [ ] Set up FastAPI server with CORS support
- [ ] Implement WebSocket handler for real-time streaming
- [ ] Create client registration system
- [ ] Build transcription aggregation logic
- [ ] Add timestamp synchronization
- [ ] Create data models for clients and transcriptions

---

### Phase 3: Client-Server Communication (Optional Multi-user Mode)

**Objective**: Add optional server connectivity to enable multi-user transcription sync

#### Components:
1. **HTTP/WebSocket Client** (see the sketch after this list)
   - Register client with server on startup
   - Send transcription chunks as they're generated
   - Handle connection drops and reconnection
   - Libraries: `requests`, `websockets`

2. **Configuration System**
   - Config file for server URL, API keys, user settings
   - Model preferences (size, language)
   - Audio input settings
   - Format: YAML or JSON

3. **Status Monitoring**
   - Connection status indicator
   - Transcription queue health
   - Error handling and logging

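A sync-client sketch with the `websockets` library showing the send loop and auto-reconnect; the server URL and message shape are assumptions, and whether chunks travel over this socket or the `POST /api/transcription` endpoint is an open design choice:

```python
# Sync-client sketch; URL and payload shape are placeholders.
import asyncio
import json

import websockets

SERVER_WS_URL = "ws://localhost:8000/api/stream"   # placeholder address

async def send_transcriptions(outgoing: asyncio.Queue):
    """Forward transcription chunks to the server, reconnecting on drops."""
    while True:
        try:
            async with websockets.connect(SERVER_WS_URL) as ws:
                while True:
                    chunk = await outgoing.get()
                    await ws.send(json.dumps(chunk))
        except (OSError, websockets.ConnectionClosed):
            # Server unreachable or connection dropped: wait and retry.
            # Local transcription keeps running regardless.
            await asyncio.sleep(2)
```
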
#### Tasks:
- [ ] Add "Enable Server Sync" toggle to GUI
- [ ] Add server URL configuration field in settings
- [ ] Implement WebSocket client for sending transcriptions
- [ ] Add configuration file support (YAML/JSON; loader sketch after this list)
- [ ] Create connection management with auto-reconnect
- [ ] Add local logging and error handling
- [ ] Add server connection status indicator to GUI
- [ ] Allow app to function normally if server is unavailable

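A config-loading sketch with PyYAML; the sections and keys shown are illustrative defaults, not a finalized schema:

```python
# Config loader sketch; keys are placeholders for the eventual settings schema.
from pathlib import Path

import yaml

DEFAULTS = {
    "user": {"display_name": "Streamer"},
    "model": {"size": "base", "device": "auto", "language": None},
    "audio": {"input_device": None, "sample_rate": 16000},
    "server": {"enabled": False, "url": "http://localhost:8000"},
}

def load_config(path: str = "config.yaml") -> dict:
    """Merge a user config file over the defaults; a missing file means defaults."""
    config = {section: dict(values) for section, values in DEFAULTS.items()}
    file = Path(path)
    if file.exists():
        user_cfg = yaml.safe_load(file.read_text()) or {}
        for section, values in user_cfg.items():
            config.setdefault(section, {}).update(values or {})
    return config
```
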
---

### Phase 4: Web Stream Interface (OBS Integration)

**Objective**: Create a web page that displays synchronized transcriptions for OBS

#### Components:
1. **Web Frontend**
   - HTML/CSS/JavaScript page for displaying transcriptions
   - Responsive design with customizable styling
   - Auto-scroll with configurable retention window
   - Libraries: Vanilla JS or lightweight framework (Alpine.js, htmx)

2. **Styling Options**
   - Customizable fonts, colors, sizes
   - Background transparency for OBS chroma key
   - User name/ID display options
   - Timestamp display (optional)

3. **Display Modes**
   - Scrolling captions (like live TV captions)
   - Multi-user panel view (separate sections per user)
   - Overlay mode (minimal UI for transparency)

#### Tasks:
- [ ] Create HTML template for transcription display
- [ ] Implement WebSocket client in JavaScript
- [ ] Add CSS styling with OBS-friendly transparency
- [ ] Create customization controls (URL parameters or UI)
- [ ] Test with OBS browser source
- [ ] Add configurable retention/scroll behavior

---

### Phase 5: Advanced Features

**Objective**: Enhance functionality and user experience

#### Features:
1. **Language Detection**
   - Auto-detect spoken language
   - Multi-language support in single stream
   - Language selector in GUI

2. **Speaker Diarization** (Optional)
   - Identify different speakers
   - Label transcriptions by speaker
   - Useful for multi-host streams

3. **Profanity Filtering**
   - Optional word filtering/replacement
   - Customizable filter lists
   - Toggle in GUI settings

4. **Advanced Noise Profiles**
   - Save and load custom noise profiles
   - Adaptive noise suppression
   - Different profiles for different environments

5. **Export Functionality** (see the SRT sketch after this list)
   - Save transcriptions in multiple formats (TXT, SRT, VTT, JSON)
   - Export button in GUI
   - Automatic session saving

6. **Hotkey Support**
   - Global hotkeys to start/stop transcription
   - Mute/unmute hotkey
   - Quick save hotkey

7. **Docker Support**
   - Containerized server deployment
   - Docker Compose for easy multi-component setup
   - Pre-built images for easy deployment

8. **Themes and Customization**
   - Dark/light theme toggle
   - Customizable font sizes and colors for display
   - OBS-friendly transparent overlay mode

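As an example of the export feature, a small SRT-export sketch; it assumes transcription history is kept as `(start_seconds, end_seconds, text)` tuples, which is an assumption of this plan rather than the app's actual storage format:

```python
# SRT export sketch; the segment tuple shape is an assumption.
def to_srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (SRT timestamp syntax)."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02}:{minutes:02}:{secs:02},{ms:03}"

def export_srt(segments, path: str) -> None:
    """Write (start_seconds, end_seconds, text) segments as an .srt file."""
    with open(path, "w", encoding="utf-8") as f:
        for index, (start, end, text) in enumerate(segments, start=1):
            f.write(f"{index}\n")
            f.write(f"{to_srt_timestamp(start)} --> {to_srt_timestamp(end)}\n")
            f.write(f"{text.strip()}\n\n")
```
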
#### Tasks:
- [ ] Add language detection and multi-language support
- [ ] Implement speaker diarization
- [ ] Create optional profanity filter
- [ ] Add export functionality (SRT, VTT, plain text, JSON)
- [ ] Implement global hotkey support
- [ ] Create Docker containers for server component
- [ ] Add theme customization options
- [ ] Create advanced noise profile management

---

## Technology Stack

### Local Client:
- **Python 3.9+**
- **GUI**: PyQt6 / CustomTkinter / tkinter
- **Audio**: PyAudio / sounddevice
- **Noise Suppression**: noisereduce / rnnoise-python
- **VAD**: webrtcvad
- **ML Framework**: PyTorch (for Whisper)
- **Transcription**: openai-whisper / faster-whisper
- **Networking**: websockets, requests (optional, for server sync)
- **Config**: PyYAML / json

### Server:
- **Backend**: FastAPI / Flask
- **WebSocket**: websockets / FastAPI WebSockets
- **Server**: Uvicorn / Gunicorn
- **Database** (optional): SQLite / PostgreSQL
- **CORS**: FastAPI's built-in CORSMiddleware

### Web Interface:
- **Frontend**: HTML5, CSS3, JavaScript (ES6+)
- **Real-time**: WebSocket API
- **Styling**: CSS Grid/Flexbox for layout

---

## Project Structure

```
local-transcription/