# Local Transcription for Streamers
A local speech-to-text application for streamers, providing real-time transcription with Whisper or compatible models. Multiple users can run the application locally and sync their transcriptions to a centralized web stream that is easy to capture in OBS or other streaming software.
## Features
- **Standalone Desktop Application**: Use locally with built-in GUI display - no server required
- **Local Transcription**: Run Whisper (or compatible models) locally on your machine
- **CPU/GPU Support**: Choose between CPU or GPU processing based on your hardware
- **Real-time Processing**: Live audio transcription with minimal latency
- **Noise Suppression**: Built-in audio preprocessing to reduce background noise
- **User Configuration**: Set your display name and preferences through the GUI
- **Optional Multi-user Sync**: Connect to a server to sync transcriptions with other users
- **OBS Integration**: Web-based output designed for easy browser source capture
- **Privacy-First**: All processing happens locally; only transcription text is shared
- **Customizable**: Configure model size, language, and streaming settings
## Quick Start
### Running from Source
```bash
# Install dependencies
uv sync
# Run the application
uv run python main.py
```
### Building Standalone Executables
To create standalone executables for distribution:
**Linux:**
```bash
./build.sh
```
**Windows:**
```cmd
build.bat
```
For detailed build instructions, see [BUILD.md](BUILD.md).
## Architecture Overview
The application can run in two modes:
### Standalone Mode (No Server Required):
1. **Desktop Application**: Captures audio, performs speech-to-text, and displays transcriptions locally in a GUI window
### Multi-user Sync Mode (Optional):
1. **Local Transcription Client**: Captures audio, performs speech-to-text, and sends results to the web server
2. **Centralized Web Server**: Aggregates transcriptions from multiple clients and serves a web stream
3. **Web Stream Interface**: Browser-accessible page displaying synchronized transcriptions (for OBS capture)
## Use Cases
- **Multi-language Streams**: Multiple translators transcribing in different languages
- **Accessibility**: Provide real-time captions for viewers
- **Collaborative Podcasts**: Multiple hosts with separate transcriptions
- **Gaming Commentary**: Track who said what in multiplayer sessions
---
## Implementation Plan
### Phase 1: Standalone Desktop Application
**Objective**: Build a fully functional standalone transcription app with GUI that works without any server
#### Components:
1. **Audio Capture Module**
- Capture system audio or microphone input
- Support multiple audio sources (virtual audio cables, physical devices)
- Real-time audio buffering with configurable chunk sizes
- **Noise Suppression**: Preprocess audio to reduce background noise
- Libraries: `pyaudio`, `sounddevice`, `noisereduce`, `webrtcvad` (see the pipeline sketch after this list)
2. **Noise Suppression Engine**
- Real-time noise reduction using RNNoise or noisereduce
- Adjustable noise reduction strength
- Optional VAD (Voice Activity Detection) to skip silent segments
- Libraries: `noisereduce`, `rnnoise-python`, `webrtcvad`
3. **Transcription Engine**
- Integrate OpenAI Whisper (or alternatives: faster-whisper, whisper.cpp)
- Support multiple model sizes (tiny, base, small, medium, large)
- CPU and GPU inference options
- Model management and automatic downloading
- Libraries: `openai-whisper`, `faster-whisper`, `torch`
4. **Device Selection**
- Auto-detect available compute devices (CPU, CUDA, MPS for Mac)
- Allow user to specify preferred device via GUI
- Graceful fallback if GPU unavailable
- Display device status and performance metrics
5. **Desktop GUI Application**
- Cross-platform GUI using PyQt6, Tkinter, or CustomTkinter
- Main transcription display window (scrolling text area)
- Settings panel for configuration
- User name input field
- Audio input device selector
- Model size selector
- CPU/GPU toggle
- Start/Stop transcription button
- Optional: System tray integration
- Libraries: `PyQt6`, `customtkinter`, or `tkinter`
6. **Local Display**
- Real-time transcription display in GUI window
- Scrolling text with timestamps
- User name/label shown with transcriptions
- Copy transcription to clipboard
- Optional: Save transcription to file (TXT, SRT, VTT)
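The sketch below ties components 1–4 together under stated assumptions: `sounddevice` for capture, `webrtcvad` to skip silent chunks, and `faster-whisper` with automatic device selection for inference. It is a minimal illustration, not the app's actual module layout; the constants and chunking strategy are placeholders.
```python
# Minimal sketch, not the app's actual modules: capture -> VAD -> Whisper.
import queue

import numpy as np
import sounddevice as sd
import webrtcvad
from faster_whisper import WhisperModel

SAMPLE_RATE = 16_000            # Whisper expects 16 kHz mono
FRAME_MS = 30                   # webrtcvad accepts 10/20/30 ms frames
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000
CHUNK_SECONDS = 2.0             # transcribe in ~2 s chunks

audio_q: "queue.Queue[np.ndarray]" = queue.Queue()

def on_audio(indata, frames, time_info, status):
    # Called from sounddevice's thread; just hand the samples off.
    audio_q.put(indata[:, 0].copy())

vad = webrtcvad.Vad(2)          # aggressiveness: 0 (least) .. 3 (most)
model = WhisperModel("base", device="auto", compute_type="int8")

def has_speech(chunk: np.ndarray) -> bool:
    """True if any 30 ms frame in the chunk looks like speech."""
    pcm16 = (chunk * 32767).astype(np.int16).tobytes()
    step = FRAME_SAMPLES * 2    # bytes per frame (16-bit samples)
    return any(
        vad.is_speech(pcm16[i:i + step], SAMPLE_RATE)
        for i in range(0, len(pcm16) - step + 1, step)
    )

with sd.InputStream(samplerate=SAMPLE_RATE, channels=1,
                    blocksize=FRAME_SAMPLES, callback=on_audio):
    buffered: list[np.ndarray] = []
    while True:                 # Ctrl+C to stop
        buffered.append(audio_q.get())
        if sum(len(b) for b in buffered) >= SAMPLE_RATE * CHUNK_SECONDS:
            chunk = np.concatenate(buffered)
            buffered.clear()
            if has_speech(chunk):           # VAD: skip silent chunks
                segments, _ = model.transcribe(chunk, language="en")
                for seg in segments:
                    print(f"[{seg.start:5.1f}s] {seg.text.strip()}")
```
A fuller pipeline would also run `noisereduce` over each chunk before the VAD pass and feed results to the GUI instead of printing them.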
#### Tasks:
- [ ] Set up project structure and dependencies
- [ ] Implement audio capture with device selection
- [ ] Add noise suppression and VAD preprocessing
- [ ] Integrate Whisper model loading and inference
- [ ] Add CPU/GPU device detection and selection logic
- [ ] Create real-time audio buffer processing pipeline
- [ ] Design and implement GUI layout (main window)
- [ ] Add settings panel with user name configuration
- [ ] Implement local transcription display area
- [ ] Add start/stop controls and status indicators
- [ ] Test transcription accuracy and latency
- [ ] Test noise suppression effectiveness
---
### Phase 2: Web Server and Sync System
**Objective**: Create a centralized server to aggregate and serve transcriptions
#### Components:
1. **Web Server**
- FastAPI or Flask-based REST API
- WebSocket support for real-time updates
- User/client registration and management
- Libraries: `fastapi`, `uvicorn`, `websockets`
2. **Transcription Aggregator**
- Receive transcription chunks from multiple clients
- Associate transcriptions with user IDs/names
- Timestamp management and synchronization
- Buffer management for smooth streaming
3. **Database/Storage** (Optional)
- Store transcription history (SQLite for simplicity)
- Session management
- Export functionality (SRT, VTT, TXT formats)
#### API Endpoints:
- `POST /api/register` - Register a new client
- `POST /api/transcription` - Submit transcription chunk
- `WS /api/stream` - WebSocket for real-time transcription stream
- `GET /stream` - Web page for OBS browser source
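A minimal FastAPI sketch of these endpoints, assuming an in-memory client registry and illustrative field names rather than the project's actual schema (the `GET /stream` HTML page is omitted):
```python
import asyncio

from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from pydantic import BaseModel

app = FastAPI()
clients: dict[str, str] = {}        # client_id -> display name
watchers: set[WebSocket] = set()    # connected /api/stream sockets

class Registration(BaseModel):
    name: str

class Chunk(BaseModel):
    client_id: str
    text: str
    timestamp: float

@app.post("/api/register")
async def register(reg: Registration):
    client_id = f"client-{len(clients) + 1}"
    clients[client_id] = reg.name
    return {"client_id": client_id}

@app.post("/api/transcription")
async def transcription(chunk: Chunk):
    # Fan the chunk out to every connected stream watcher.
    message = {"user": clients.get(chunk.client_id, "unknown"),
               "text": chunk.text, "timestamp": chunk.timestamp}
    for ws in list(watchers):
        try:
            await ws.send_json(message)
        except Exception:
            watchers.discard(ws)    # drop sockets that have gone away
    return {"ok": True}

@app.websocket("/api/stream")
async def stream(ws: WebSocket):
    await ws.accept()
    watchers.add(ws)
    try:
        while True:
            await ws.receive_text()  # keep the socket open
    except WebSocketDisconnect:
        watchers.discard(ws)
```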
#### Tasks:
- [ ] Set up FastAPI server with CORS support
- [ ] Implement WebSocket handler for real-time streaming
- [ ] Create client registration system
- [ ] Build transcription aggregation logic
- [ ] Add timestamp synchronization
- [ ] Create data models for clients and transcriptions
---
### Phase 3: Client-Server Communication (Optional Multi-user Mode)
**Objective**: Add optional server connectivity to enable multi-user transcription sync
#### Components:
1. **HTTP/WebSocket Client**
- Register client with server on startup
- Send transcription chunks as they're generated
- Handle connection drops and reconnection
- Libraries: `requests`, `websockets` (see the reconnect sketch after this list)
2. **Configuration System**
- Config file for server URL, API keys, user settings
- Model preferences (size, language)
- Audio input settings
- Format: YAML or JSON
3. **Status Monitoring**
- Connection status indicator
- Transcription queue health
- Error handling and logging
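A sketch of the auto-reconnect behavior using the `websockets` package; the endpoint path and payload shape are assumptions, and capped exponential backoff is one reasonable retry policy:
```python
import asyncio
import json
import time

import websockets

async def sync_loop(server_url: str, outbox: asyncio.Queue):
    """Send queued transcription chunks; back off and retry on failure."""
    backoff = 1
    while True:
        try:
            async with websockets.connect(f"{server_url}/api/stream") as ws:
                backoff = 1                  # reset after a good connect
                while True:
                    text = await outbox.get()
                    await ws.send(json.dumps(
                        {"text": text, "timestamp": time.time()}))
        except (OSError, websockets.ConnectionClosed):
            await asyncio.sleep(backoff)     # connection lost: wait, retry
            backoff = min(backoff * 2, 30)   # exponential backoff, capped
```
A caller would create the `outbox` queue on the app's asyncio loop, run `sync_loop("ws://localhost:8000", outbox)` as a background task, and have the transcription engine put finished lines onto the queue.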
#### Tasks:
- [ ] Add "Enable Server Sync" toggle to GUI
- [ ] Add server URL configuration field in settings
- [ ] Implement WebSocket client for sending transcriptions
- [ ] Add configuration file support (YAML/JSON)
- [ ] Create connection management with auto-reconnect
- [ ] Add local logging and error handling
- [ ] Add server connection status indicator to GUI
- [ ] Allow app to function normally if server is unavailable
---
### Phase 4: Web Stream Interface (OBS Integration)
**Objective**: Create a web page that displays synchronized transcriptions for OBS
#### Components:
1. **Web Frontend**
- HTML/CSS/JavaScript page for displaying transcriptions
- Responsive design with customizable styling
- Auto-scroll with configurable retention window
- Libraries: Vanilla JS or lightweight framework (Alpine.js, htmx)
2. **Styling Options**
- Customizable fonts, colors, sizes
- Background transparency for OBS chroma key
- User name/ID display options
- Timestamp display (optional)
3. **Display Modes**
- Scrolling captions (like live TV captions)
- Multi-user panel view (separate sections per user)
- Overlay mode (minimal UI for transparency)
#### Tasks:
- [ ] Create HTML template for transcription display
- [ ] Implement WebSocket client in JavaScript
- [ ] Add CSS styling with OBS-friendly transparency
- [ ] Create customization controls (URL parameters or UI)
- [ ] Test with OBS browser source
- [ ] Add configurable retention/scroll behavior
---
### Phase 5: Advanced Features
**Objective**: Enhance functionality and user experience
#### Features:
1. **Language Detection**
- Auto-detect spoken language
- Multi-language support in single stream
- Language selector in GUI
2. **Speaker Diarization** (Optional)
- Identify different speakers
- Label transcriptions by speaker
- Useful for multi-host streams
3. **Profanity Filtering**
- Optional word filtering/replacement
- Customizable filter lists
- Toggle in GUI settings
4. **Advanced Noise Profiles**
- Save and load custom noise profiles
- Adaptive noise suppression
- Different profiles for different environments
5. **Export Functionality**
- Save transcriptions in multiple formats (TXT, SRT, VTT, JSON); an SRT writer sketch follows this list
- Export button in GUI
- Automatic session saving
6. **Hotkey Support**
- Global hotkeys to start/stop transcription
- Mute/unmute hotkey
- Quick save hotkey
7. **Docker Support**
- Containerized server deployment
- Docker Compose for easy multi-component setup
- Pre-built images for easy deployment
8. **Themes and Customization**
- Dark/light theme toggle
- Customizable font sizes and colors for display
- OBS-friendly transparent overlay mode
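Export to SRT is mostly bookkeeping, since the format itself is standardized. A minimal writer, assuming segments arrive as `(start_sec, end_sec, text)` tuples (an illustrative shape, not the engine's actual output type):
```python
def fmt_ts(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def write_srt(segments, path: str) -> None:
    """segments: iterable of (start_sec, end_sec, text) tuples."""
    with open(path, "w", encoding="utf-8") as fh:
        for i, (start, end, text) in enumerate(segments, 1):
            fh.write(f"{i}\n{fmt_ts(start)} --> {fmt_ts(end)}\n{text}\n\n")

write_srt([(0.0, 2.5, "Hello stream!"), (2.5, 5.0, "Testing captions.")],
          "session.srt")
```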
#### Tasks:
- [ ] Add language detection and multi-language support
- [ ] Implement speaker diarization
- [ ] Create optional profanity filter
- [ ] Add export functionality (SRT, VTT, plain text, JSON)
- [ ] Implement global hotkey support
- [ ] Create Docker containers for server component
- [ ] Add theme customization options
- [ ] Create advanced noise profile management
---
## Technology Stack
### Local Client:
- **Python 3.9+**
- **GUI**: PyQt6 / CustomTkinter / tkinter
- **Audio**: PyAudio / sounddevice
- **Noise Suppression**: noisereduce / rnnoise-python
- **VAD**: webrtcvad
- **ML Framework**: PyTorch (for Whisper)
- **Transcription**: openai-whisper / faster-whisper
- **Networking**: websockets, requests (optional for server sync)
- **Config**: PyYAML / json
### Server:
- **Backend**: FastAPI / Flask
- **WebSocket**: python-websockets / FastAPI WebSockets
- **Server**: Uvicorn / Gunicorn
- **Database** (optional): SQLite / PostgreSQL
- **CORS**: FastAPI's built-in `CORSMiddleware`
### Web Interface:
- **Frontend**: HTML5, CSS3, JavaScript (ES6+)
- **Real-time**: WebSocket API
- **Styling**: CSS Grid/Flexbox for layout
---
## Project Structure
```
local-transcription/
├── client/                      # Local transcription client
│   ├── __init__.py
│   ├── audio_capture.py         # Audio input handling
│   ├── transcription_engine.py  # Whisper integration
│   ├── network_client.py        # Server communication
│   ├── config.py                # Configuration management
│   └── main.py                  # Client entry point
├── server/                      # Centralized web server
│   ├── __init__.py
│   ├── api.py                   # FastAPI routes
│   ├── websocket_handler.py     # WebSocket management
│   ├── models.py                # Data models
│   ├── database.py              # Optional DB layer
│   └── main.py                  # Server entry point
├── web/                         # Web stream interface
│   ├── index.html               # OBS browser source page
│   ├── styles.css               # Customizable styling
│   └── app.js                   # WebSocket client & UI logic
├── config/
│   ├── client_config.example.yaml
│   └── server_config.example.yaml
├── tests/
│   ├── test_audio.py
│   ├── test_transcription.py
│   └── test_server.py
├── requirements.txt             # Python dependencies
├── README.md
└── main.py                      # Combined launcher (optional)
```
---
## Installation (Planned)
### Prerequisites:
- Python 3.9 or higher
- CUDA-capable GPU (optional, for GPU acceleration)
- FFmpeg (required by Whisper)
### Steps:
1. **Clone the repository**
```bash
git clone <repository-url>
cd local-transcription
```
2. **Install dependencies**
```bash
pip install -r requirements.txt
```
3. **Download Whisper models**
```bash
# Models will be auto-downloaded on first run
# Or manually download:
python -c "import whisper; whisper.load_model('base')"
```
4. **Configure client**
```bash
cp config/client_config.example.yaml config/client_config.yaml
# Edit config/client_config.yaml with your settings
```
5. **Run the server** (one instance)
```bash
python server/main.py
```
6. **Run the client** (on each user's machine)
```bash
python client/main.py
```
7. **Add to OBS**
- Add a Browser Source
- URL: `http://<server-ip>:8000/stream`
- Set width/height as needed
- Check "Shutdown source when not visible" for performance
---
## Configuration (Planned)
### Client Configuration:
```yaml
user:
  name: "Streamer1"             # Display name for transcriptions
  id: "unique-user-id"          # Optional unique identifier

audio:
  input_device: "default"       # or specific device index
  sample_rate: 16000
  chunk_duration: 2.0           # seconds

noise_suppression:
  enabled: true                 # Enable/disable noise reduction
  strength: 0.7                 # 0.0 to 1.0 - reduction strength
  method: "noisereduce"         # "noisereduce" or "rnnoise"

transcription:
  model: "base"                 # tiny, base, small, medium, large
  device: "cuda"                # cpu, cuda, mps
  language: "en"                # or "auto" for detection
  task: "transcribe"            # or "translate"

processing:
  use_vad: true                 # Voice Activity Detection
  min_confidence: 0.5           # Minimum transcription confidence

server_sync:
  enabled: false                # Enable multi-user server sync
  url: "ws://localhost:8000"    # Server URL (when enabled)
  api_key: ""                   # Optional API key

display:
  show_timestamps: true         # Show timestamps in local display
  max_lines: 100                # Maximum lines to keep in display
  font_size: 12                 # GUI font size
```
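A sketch of loading this file with PyYAML (already in the stack), merging user values over defaults; the path and default values here are illustrative, not the app's real ones:
```python
import yaml

DEFAULTS = {
    "audio": {"sample_rate": 16000, "chunk_duration": 2.0},
    "transcription": {"model": "base", "device": "cpu", "language": "en"},
}

def load_config(path: str = "config/client_config.yaml") -> dict:
    with open(path, encoding="utf-8") as fh:
        user_cfg = yaml.safe_load(fh) or {}
    # Shallow-merge each user section over the defaults for that section.
    merged = {section: dict(values) for section, values in DEFAULTS.items()}
    for section, values in user_cfg.items():
        merged.setdefault(section, {}).update(values)
    return merged

cfg = load_config()
print(cfg["transcription"]["model"])   # "base" with the example file above
```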
### Server Configuration:
```yaml
server:
  host: "0.0.0.0"
  port: 8000
  api_key_required: false

stream:
  max_clients: 10
  buffer_size: 100              # messages to buffer
  retention_time: 300           # seconds

database:
  enabled: false
  path: "transcriptions.db"
```
---
## Roadmap
- [x] Project planning and architecture design
- [ ] Phase 1: Standalone desktop application with GUI
- [ ] Phase 2: Web server and sync system (optional multi-user mode)
- [ ] Phase 3: Client-server communication (optional)
- [ ] Phase 4: Web stream interface for OBS (optional)
- [ ] Phase 5: Advanced features (hotkeys, themes, Docker, etc.)
---
## Contributing
Contributions are welcome! Please feel free to submit issues or pull requests.
---
## License
[Choose appropriate license - MIT, Apache 2.0, etc.]
---
## Acknowledgments
- OpenAI Whisper for the excellent speech recognition model
- The streaming community for inspiration and use cases