# Local Transcription for Streamers

A local speech-to-text application designed for streamers that provides real-time transcription using Whisper or similar models. Multiple users can run the application locally and sync their transcriptions to a centralized web stream that can be easily captured in OBS or other streaming software.

## Features

- **Standalone Desktop Application**: Use locally with built-in GUI display - no server required
- **Local Transcription**: Run Whisper (or compatible models) locally on your machine
- **CPU/GPU Support**: Choose between CPU or GPU processing based on your hardware
- **Real-time Processing**: Live audio transcription with minimal latency
- **Noise Suppression**: Built-in audio preprocessing to reduce background noise
- **User Configuration**: Set your display name and preferences through the GUI
- **Optional Multi-user Sync**: Connect to a server to sync transcriptions with other users
- **OBS Integration**: Web-based output designed for easy browser source capture
- **Privacy-First**: All processing happens locally; only transcription text is shared
- **Customizable**: Configure model size, language, and streaming settings

## Quick Start

### Running from Source

```bash
# Install dependencies
uv sync

# Run the application
uv run python main.py
```

### Building Standalone Executables

To create standalone executables for distribution:

**Linux:**
```bash
./build.sh
```

**Windows:**
```cmd
build.bat
```

For detailed build instructions, see [BUILD.md](BUILD.md).

## Architecture Overview

The application can run in two modes:

### Standalone Mode (No Server Required):
1. **Desktop Application**: Captures audio, performs speech-to-text, and displays transcriptions locally in a GUI window

### Multi-user Sync Mode (Optional):
1. **Local Transcription Client**: Captures audio, performs speech-to-text, and sends results to the web server
2. **Centralized Web Server**: Aggregates transcriptions from multiple clients and serves a web stream
3. **Web Stream Interface**: Browser-accessible page displaying synchronized transcriptions (for OBS capture)

## Use Cases

- **Multi-language Streams**: Multiple translators transcribing in different languages
- **Accessibility**: Provide real-time captions for viewers
- **Collaborative Podcasts**: Multiple hosts with separate transcriptions
- **Gaming Commentary**: Track who said what in multiplayer sessions

---

## Implementation Plan

### Phase 1: Standalone Desktop Application

**Objective**: Build a fully functional standalone transcription app with GUI that works without any server

#### Components:
1. **Audio Capture Module**
   - Capture system audio or microphone input
   - Support multiple audio sources (virtual audio cables, physical devices)
   - Real-time audio buffering with configurable chunk sizes
   - **Noise Suppression**: Preprocess audio to reduce background noise
   - Libraries: `pyaudio`, `sounddevice`, `noisereduce`, `webrtcvad`
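
   The real-time buffering piece of this module can be sketched without any audio library. A hypothetical `ChunkBuffer` collects raw PCM bytes from the capture callback and hands fixed-size chunks to the transcription engine:

   ```python
   import queue

   class ChunkBuffer:
       """Accumulates raw PCM bytes and emits fixed-size chunks for the
       transcription engine. An audio-callback thread calls feed();
       a worker thread drains chunks via get()."""

       def __init__(self, sample_rate=16000, chunk_seconds=5.0, sample_width=2):
           # Configurable chunk size, expressed in seconds of audio.
           self.chunk_bytes = int(sample_rate * chunk_seconds) * sample_width
           self._pending = bytearray()
           self._chunks = queue.Queue()

       def feed(self, data: bytes) -> None:
           self._pending.extend(data)
           while len(self._pending) >= self.chunk_bytes:
               self._chunks.put(bytes(self._pending[:self.chunk_bytes]))
               del self._pending[:self.chunk_bytes]

       def get(self, timeout=None) -> bytes:
           return self._chunks.get(timeout=timeout)
   ```

   With `sounddevice`, a `RawInputStream` callback would simply call `buf.feed(bytes(indata))` from the audio thread; the exact stream parameters (16 kHz mono int16 above) are assumptions to match Whisper's expected input.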

2. **Noise Suppression Engine**
   - Real-time noise reduction using RNNoise or noisereduce
   - Adjustable noise reduction strength
   - Optional VAD (Voice Activity Detection) to skip silent segments
   - Libraries: `noisereduce`, `rnnoise-python`, `webrtcvad`
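
   As a placeholder until `webrtcvad` is wired in, the skip-silence step can be approximated with a simple RMS energy gate (the threshold below is an arbitrary assumption and would need tuning per microphone):

   ```python
   import struct

   def is_speech(frame: bytes, threshold: float = 500.0) -> bool:
       """Crude energy-based VAD over one frame of 16-bit mono PCM.
       Frames whose RMS falls below the threshold are treated as
       silence and skipped before transcription."""
       n = len(frame) // 2
       samples = struct.unpack(f"<{n}h", frame[: n * 2])
       rms = (sum(s * s for s in samples) / max(n, 1)) ** 0.5
       return rms >= threshold
   ```

   The real `webrtcvad.Vad` operates on 10/20/30 ms frames of 16-bit mono PCM and should replace this gate once integrated.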

3. **Transcription Engine**
   - Integrate OpenAI Whisper (or alternatives: faster-whisper, whisper.cpp)
   - Support multiple model sizes (tiny, base, small, medium, large)
   - CPU and GPU inference options
   - Model management and automatic downloading
   - Libraries: `openai-whisper`, `faster-whisper`, `torch`
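
   Model selection could follow a simple memory heuristic. The size table below is approximate and only illustrative; the commented `faster-whisper` call shows how the chosen model would then be loaded (it downloads automatically on first use):

   ```python
   # Rough, assumed memory footprints (GB) per Whisper model size.
   MODEL_MEMORY_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}

   def pick_model(available_gb: float) -> str:
       """Pick the largest model that fits the available memory
       (VRAM for GPU inference, RAM for CPU)."""
       for name in ("large", "medium", "small", "base", "tiny"):
           if MODEL_MEMORY_GB[name] <= available_gb:
               return name
       return "tiny"

   # With faster-whisper, loading and inference would look roughly like:
   #
   #     from faster_whisper import WhisperModel
   #     model = WhisperModel(pick_model(6.0), device="cuda", compute_type="float16")
   #     segments, info = model.transcribe("chunk.wav", vad_filter=True)
   ```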

4. **Device Selection**
   - Auto-detect available compute devices (CPU, CUDA, MPS for Mac)
   - Allow user to specify preferred device via GUI
   - Graceful fallback if GPU unavailable
   - Display device status and performance metrics
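
   Device detection with graceful fallback might look like this minimal sketch (assuming PyTorch as the backend; `pick_device` degrades to CPU whenever torch or the accelerator is missing):

   ```python
   def pick_device(preferred: str = "auto") -> str:
       """Resolve the compute device the GUI requested, falling back
       to CPU when torch or the requested accelerator is unavailable."""
       try:
           import torch
       except ImportError:
           return "cpu"
       if preferred in ("cuda", "auto") and torch.cuda.is_available():
           return "cuda"
       mps = getattr(torch.backends, "mps", None)
       if preferred in ("mps", "auto") and mps is not None and mps.is_available():
           return "mps"
       return "cpu"
   ```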

5. **Desktop GUI Application**
   - Cross-platform GUI using PyQt6, Tkinter, or CustomTkinter
   - Main transcription display window (scrolling text area)
   - Settings panel for configuration
   - User name input field
   - Audio input device selector
   - Model size selector
   - CPU/GPU toggle
   - Start/Stop transcription button
   - Optional: System tray integration
   - Libraries: `PyQt6`, `customtkinter`, or `tkinter`

6. **Local Display**
   - Real-time transcription display in GUI window
   - Scrolling text with timestamps
   - User name/label shown with transcriptions
   - Copy transcription to clipboard
   - Optional: Save transcription to file (TXT, SRT, VTT)
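
   Saving to SRT needs no dependencies. A sketch of the exporter, assuming transcription entries are `(start, end, user, text)` tuples with times in seconds:

   ```python
   def to_srt(entries) -> str:
       """Render (start_seconds, end_seconds, user, text) tuples as SRT."""
       def ts(seconds: float) -> str:
           # SRT timestamps use HH:MM:SS,mmm with a comma separator.
           ms = round(seconds * 1000)
           h, rem = divmod(ms, 3_600_000)
           m, rem = divmod(rem, 60_000)
           s, ms = divmod(rem, 1000)
           return f"{h:02}:{m:02}:{s:02},{ms:03}"

       blocks = []
       for i, (start, end, user, text) in enumerate(entries, 1):
           blocks.append(f"{i}\n{ts(start)} --> {ts(end)}\n{user}: {text}\n")
       return "\n".join(blocks)
   ```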

#### Tasks:
- [ ] Set up project structure and dependencies
- [ ] Implement audio capture with device selection
- [ ] Add noise suppression and VAD preprocessing
- [ ] Integrate Whisper model loading and inference
- [ ] Add CPU/GPU device detection and selection logic
- [ ] Create real-time audio buffer processing pipeline
- [ ] Design and implement GUI layout (main window)
- [ ] Add settings panel with user name configuration
- [ ] Implement local transcription display area
- [ ] Add start/stop controls and status indicators
- [ ] Test transcription accuracy and latency
- [ ] Test noise suppression effectiveness

---

### Phase 2: Web Server and Sync System

**Objective**: Create a centralized server to aggregate and serve transcriptions

#### Components:
1. **Web Server**
   - FastAPI or Flask-based REST API
   - WebSocket support for real-time updates
   - User/client registration and management
   - Libraries: `fastapi`, `uvicorn`, `websockets`

2. **Transcription Aggregator**
   - Receive transcription chunks from multiple clients
   - Associate transcriptions with user IDs/names
   - Timestamp management and synchronization
   - Buffer management for smooth streaming
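
   The aggregation logic itself is framework-independent. A minimal sketch, assuming each client reports its own timestamps and the stream page only needs the most recent lines:

   ```python
   import time
   from collections import deque

   class TranscriptionAggregator:
       """Merges transcription chunks from multiple clients into one
       time-ordered feed, keeping a bounded buffer for the web stream."""

       def __init__(self, max_lines: int = 50):
           self._lines = deque(maxlen=max_lines)  # oldest entries drop off

       def add(self, user: str, text: str, timestamp=None) -> None:
           ts = timestamp if timestamp is not None else time.time()
           self._lines.append((ts, user, text))

       def feed(self):
           """Buffered lines sorted by client timestamp, oldest first."""
           return [{"user": u, "text": t, "ts": ts}
                   for ts, u, t in sorted(self._lines)]
   ```

   A FastAPI `POST /api/transcription` handler would call `add()` and the WebSocket broadcaster would push `feed()` (or just the new line) to connected stream pages.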

3. **Database/Storage** (Optional)
   - Store transcription history (SQLite for simplicity)
   - Session management
   - Export functionality (SRT, VTT, TXT formats)

#### API Endpoints:
- `POST /api/register` - Register a new client
- `POST /api/transcription` - Submit transcription chunk
- `WS /api/stream` - WebSocket for real-time transcription stream
- `GET /stream` - Web page for OBS browser source

#### Tasks:
- [ ] Set up FastAPI server with CORS support
- [ ] Implement WebSocket handler for real-time streaming
- [ ] Create client registration system
- [ ] Build transcription aggregation logic
- [ ] Add timestamp synchronization
- [ ] Create data models for clients and transcriptions

---

### Phase 3: Client-Server Communication (Optional Multi-user Mode)

**Objective**: Add optional server connectivity to enable multi-user transcription sync

#### Components:
1. **HTTP/WebSocket Client**
   - Register client with server on startup
   - Send transcription chunks as they're generated
   - Handle connection drops and reconnection
   - Libraries: `requests`, `websockets`
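
   Reconnection is easiest to reason about as a delay schedule. A sketch of capped exponential backoff with jitter, which a retry loop around `websockets.connect` could consume (base, cap, and jitter values are assumptions to tune):

   ```python
   import itertools
   import random

   def backoff_delays(base: float = 1.0, cap: float = 30.0, jitter: float = 0.1):
       """Yield reconnect delays: 1s, 2s, 4s, ... capped at `cap`,
       plus a little jitter so many clients don't reconnect in lockstep."""
       for attempt in itertools.count():
           delay = min(base * (2 ** attempt), cap)
           yield delay + random.uniform(0, jitter * delay)
   ```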

2. **Configuration System**
   - Config file for server URL, API keys, user settings
   - Model preferences (size, language)
   - Audio input settings
   - Format: YAML or JSON
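
   A JSON variant of the loader might look like this; the field names below are assumptions, not a fixed schema:

   ```python
   import json
   from pathlib import Path

   DEFAULTS = {
       "user_name": "Streamer",
       "server_url": None,   # None = standalone mode, no sync
       "model_size": "base",
       "device": "auto",
       "language": "en",
   }

   def load_config(path="config.json") -> dict:
       """Merge an optional JSON config file over the defaults.
       Unknown keys are kept, so newer settings survive a round trip."""
       config = dict(DEFAULTS)
       p = Path(path)
       if p.exists():
           config.update(json.loads(p.read_text()))
       return config
   ```

   A YAML version is the same shape with `yaml.safe_load` from PyYAML in place of `json.loads`.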

3. **Status Monitoring**
   - Connection status indicator
   - Transcription queue health
   - Error handling and logging

#### Tasks:
- [ ] Add "Enable Server Sync" toggle to GUI
- [ ] Add server URL configuration field in settings
- [ ] Implement WebSocket client for sending transcriptions
- [ ] Add configuration file support (YAML/JSON)
- [ ] Create connection management with auto-reconnect
- [ ] Add local logging and error handling
- [ ] Add server connection status indicator to GUI
- [ ] Allow app to function normally if server is unavailable

---

### Phase 4: Web Stream Interface (OBS Integration)

**Objective**: Create a web page that displays synchronized transcriptions for OBS

#### Components:
1. **Web Frontend**
   - HTML/CSS/JavaScript page for displaying transcriptions
   - Responsive design with customizable styling
   - Auto-scroll with configurable retention window
   - Libraries: Vanilla JS or lightweight framework (Alpine.js, htmx)

2. **Styling Options**
   - Customizable fonts, colors, sizes
   - Background transparency for OBS chroma key
   - User name/ID display options
   - Timestamp display (optional)

3. **Display Modes**
   - Scrolling captions (like live TV captions)
   - Multi-user panel view (separate sections per user)
   - Overlay mode (minimal UI for transparency)

#### Tasks:
- [ ] Create HTML template for transcription display
- [ ] Implement WebSocket client in JavaScript
- [ ] Add CSS styling with OBS-friendly transparency
- [ ] Create customization controls (URL parameters or UI)
- [ ] Test with OBS browser source
- [ ] Add configurable retention/scroll behavior

---

### Phase 5: Advanced Features

**Objective**: Enhance functionality and user experience

#### Features:
1. **Language Detection**
   - Auto-detect spoken language
   - Multi-language support in single stream
   - Language selector in GUI

2. **Speaker Diarization** (Optional)
   - Identify different speakers
   - Label transcriptions by speaker
   - Useful for multi-host streams

3. **Profanity Filtering**
   - Optional word filtering/replacement
   - Customizable filter lists
   - Toggle in GUI settings
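
   The filter itself can be a small, dependency-free function. A sketch using whole-word, case-insensitive replacement, where the word list would come from the GUI's filter settings:

   ```python
   import re

   def censor(text: str, banned, mask: str = "***") -> str:
       """Replace banned words (case-insensitive, whole words only)
       with a mask before the text is displayed or synced."""
       banned = set(banned)
       if not banned:
           return text
       pattern = re.compile(
           r"\b(" + "|".join(re.escape(w) for w in sorted(banned)) + r")\b",
           re.IGNORECASE,
       )
       return pattern.sub(mask, text)
   ```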

4. **Advanced Noise Profiles**
   - Save and load custom noise profiles
   - Adaptive noise suppression
   - Different profiles for different environments

5. **Export Functionality**
   - Save transcriptions in multiple formats (TXT, SRT, VTT, JSON)
   - Export button in GUI
   - Automatic session saving

6. **Hotkey Support**
   - Global hotkeys to start/stop transcription
   - Mute/unmute hotkey
   - Quick save hotkey

7. **Docker Support**
   - Containerized server deployment
   - Docker Compose for easy multi-component setup
   - Pre-built images for easy deployment

8. **Themes and Customization**
   - Dark/light theme toggle
   - Customizable font sizes and colors for display
   - OBS-friendly transparent overlay mode

#### Tasks:
- [ ] Add language detection and multi-language support
- [ ] Implement speaker diarization
- [ ] Create optional profanity filter
- [ ] Add export functionality (SRT, VTT, plain text, JSON)
- [ ] Implement global hotkey support
- [ ] Create Docker containers for server component
- [ ] Add theme customization options
- [ ] Create advanced noise profile management

---

## Technology Stack

### Local Client:
- **Python 3.9+**
- **GUI**: PyQt6 / CustomTkinter / tkinter
- **Audio**: PyAudio / sounddevice
- **Noise Suppression**: noisereduce / rnnoise-python
- **VAD**: webrtcvad
- **ML Framework**: PyTorch (for Whisper)
- **Transcription**: openai-whisper / faster-whisper
- **Networking**: websockets, requests (optional for server sync)
- **Config**: PyYAML / json

### Server:
- **Backend**: FastAPI / Flask
- **WebSocket**: `websockets` / FastAPI WebSockets
- **Server**: Uvicorn / Gunicorn
- **Database** (optional): SQLite / PostgreSQL
- **CORS**: Starlette `CORSMiddleware` (bundled with FastAPI)

### Web Interface:
- **Frontend**: HTML5, CSS3, JavaScript (ES6+)
- **Real-time**: WebSocket API
- **Styling**: CSS Grid/Flexbox for layout

---

## Project Structure

```
local-transcription/