Update README to reflect current application state
Remove outdated implementation plan and task checklists. Document actual implemented features including RealtimeSTT, dual-layer VAD, custom fonts/colors, and auto-updates. Add practical usage instructions for standalone mode, OBS setup, and multi-user sync.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
# Local Transcription

A real-time speech-to-text desktop application for streamers. Run locally on your machine with GPU or CPU, display transcriptions via an OBS browser source, and optionally sync with other users through a multi-user server.

**Version 1.4.0**

## Features

- **Real-Time Transcription**: Live speech-to-text using Whisper models with minimal latency
- **Standalone Desktop App**: PySide6/Qt GUI that works without any server
- **CPU & GPU Support**: Automatic detection of CUDA (NVIDIA), MPS (Apple Silicon), or CPU fallback
- **Advanced Voice Detection**: Dual-layer VAD (WebRTC + Silero) for accurate speech detection
- **OBS Integration**: Built-in web server for browser source capture at `http://localhost:8080`
- **Multi-User Sync**: Optional Node.js server to sync transcriptions across multiple users
- **Custom Fonts**: Support for system fonts, web-safe fonts, Google Fonts, and custom font files
- **Customizable Colors**: User-configurable colors for name, text, and background
- **Noise Suppression**: Built-in audio preprocessing to reduce background noise
- **Privacy-First**: All processing happens locally; only transcription text is shared
- **Auto-Updates**: Automatic update checking with release notes display
- **Cross-Platform**: Builds available for Windows and Linux

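The automatic device detection described above can be sketched roughly as follows. This is an illustrative outline only (not the app's actual code), and it assumes PyTorch is the backend used for detection:

```python
def pick_device(preference: str = "auto") -> str:
    """Resolve 'auto' to the best available backend: CUDA, then MPS, then CPU."""
    if preference != "auto":
        return preference  # honor an explicit cuda/cpu choice
    try:
        import torch  # assumption: PyTorch ships with the Whisper stack
        if torch.cuda.is_available():
            return "cuda"  # NVIDIA GPU
        mps = getattr(torch.backends, "mps", None)
        if mps is not None and mps.is_available():
            return "mps"  # Apple Silicon
    except ImportError:
        pass  # no torch installed: fall through to CPU
    return "cpu"

print(pick_device("auto"))
```

The same fallback order applies when a configured GPU is unavailable: the app degrades to CPU rather than failing to start.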
## Quick Start

```bash
uv sync
uv run python main.py
```

### Using Pre-Built Executables

Download the latest release from the [releases page](https://repo.anhonesthost.net/streamer-tools/local-transcription/releases) and run the executable for your platform.

### Building from Source

**Linux:**
```bash
./build.sh
# Output: dist/LocalTranscription/LocalTranscription
```

**Windows:**
```cmd
build.bat
# Output: dist\LocalTranscription\LocalTranscription.exe
```

For detailed build instructions, see [BUILD.md](BUILD.md).

## Usage

### Standalone Mode

1. Launch the application
2. Select your microphone from the audio device dropdown
3. Choose a Whisper model (smaller = faster, larger = more accurate):
   - `tiny.en` / `tiny` - Fastest, good for quick captions
   - `base.en` / `base` - Balanced speed and accuracy
   - `small.en` / `small` - Better accuracy
   - `medium.en` / `medium` - High accuracy
   - `large-v3` - Best accuracy (requires more resources)
4. Click **Start** to begin transcription
5. Transcriptions appear in the main window and at `http://localhost:8080`

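As the model list above shows, every size up to `medium` has an English-only `.en` variant, while `large-v3` is multilingual only. A small illustrative helper (not part of the app) for mapping a size to a valid model name:

```python
# Whisper model sizes, ordered fastest to most accurate.
MODELS = ["tiny", "base", "small", "medium", "large-v3"]

def model_name(size: str, english_only: bool = False) -> str:
    """Return the Whisper model identifier, appending `.en` where a variant exists."""
    if size not in MODELS:
        raise ValueError(f"unknown model size: {size}")
    if english_only and size != "large-v3":  # large-v3 has no English-only variant
        return size + ".en"
    return size

print(model_name("base", english_only=True))  # base.en
```

English-only variants are typically a bit faster and more accurate for English speech, so they are a sensible default for English-language streams.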
### OBS Browser Source Setup

1. Start the Local Transcription app
2. In OBS, add a **Browser** source
3. Set URL to `http://localhost:8080`
4. Set dimensions (e.g., 1920x300)
5. Check "Shutdown source when not visible" for performance

### Multi-User Mode (Optional)

For syncing transcriptions across multiple users (e.g., multi-host streams or translation teams):

1. Deploy the Node.js server (see [server/nodejs/README.md](server/nodejs/README.md))
2. In the app settings, enable **Server Sync**
3. Enter the server URL (e.g., `http://your-server:3000/api/send`)
4. Set a room name and passphrase (shared with other users)
5. In OBS, use the server's display URL with your room name:

```
http://your-server:3000/display?room=YOURROOM&timestamps=true&maxlines=50
```

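The display URL's query string can also be assembled programmatically, e.g. when generating per-room OBS links for several hosts. A sketch using only the parameters shown above (`room`, `timestamps`, `maxlines`):

```python
from urllib.parse import urlencode

def display_url(server: str, room: str, timestamps: bool = True, maxlines: int = 50) -> str:
    """Build the multi-user display URL for an OBS browser source."""
    query = urlencode({
        "room": room,
        "timestamps": str(timestamps).lower(),  # the example URL uses lowercase true/false
        "maxlines": maxlines,
    })
    return f"{server}/display?{query}"

print(display_url("http://your-server:3000", "YOURROOM"))
# http://your-server:3000/display?room=YOURROOM&timestamps=true&maxlines=50
```

Using `urlencode` also keeps room names with spaces or special characters safe in the URL.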
## Configuration

Settings are stored at `~/.local-transcription/config.yaml` and can be modified through the GUI settings panel.

### Key Settings

| Setting | Description | Default |
|---------|-------------|---------|
| `transcription.model` | Whisper model to use | `base.en` |
| `transcription.device` | Processing device (auto/cuda/cpu) | `auto` |
| `transcription.enable_realtime_transcription` | Show preview while speaking | `false` |
| `transcription.silero_sensitivity` | VAD sensitivity (0-1, lower = more sensitive) | `0.4` |
| `transcription.post_speech_silence_duration` | Silence before finalizing (seconds) | `0.3` |
| `transcription.continuous_mode` | Fast speaker mode for quick talkers | `false` |
| `display.show_timestamps` | Show timestamps with transcriptions | `true` |
| `display.fade_after_seconds` | Fade out time (0 = never) | `10` |
| `display.font_source` | Font type (System Font/Web-Safe/Google Font/Custom File) | `System Font` |
| `web_server.port` | Local web server port | `8080` |

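Put into `config.yaml`, the settings above might look like the fragment below. The grouping into `transcription`, `display`, and `web_server` sections follows the dotted key names in the table, though the file's exact layout may differ from this sketch:

```yaml
transcription:
  model: base.en
  device: auto
  enable_realtime_transcription: false
  silero_sensitivity: 0.4
  post_speech_silence_duration: 0.3
  continuous_mode: false

display:
  show_timestamps: true
  fade_after_seconds: 10
  font_source: System Font

web_server:
  port: 8080
```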
See [config/default_config.yaml](config/default_config.yaml) for all available options.

## Project Structure

```
local-transcription/