Update README to reflect current application state
Remove outdated implementation plan and task checklists. Document actual implemented features including RealtimeSTT, dual-layer VAD, custom fonts/colors, and auto-updates. Add practical usage instructions for standalone mode, OBS setup, and multi-user sync.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
# Local Transcription

A real-time speech-to-text desktop application for streamers. Run locally on your machine with GPU or CPU, display transcriptions via OBS browser source, and optionally sync with other users through a multi-user server.

**Version 1.4.0**
## Features

- **Real-Time Transcription**: Live speech-to-text using Whisper models with minimal latency
- **Standalone Desktop App**: PySide6/Qt GUI that works without any server
- **CPU & GPU Support**: Automatic detection of CUDA (NVIDIA), MPS (Apple Silicon), or CPU fallback
- **Advanced Voice Detection**: Dual-layer VAD (WebRTC + Silero) for accurate speech detection
- **OBS Integration**: Built-in web server for browser source capture at `http://localhost:8080`
- **Multi-User Sync**: Optional Node.js server to sync transcriptions across multiple users
- **Custom Fonts**: Support for system fonts, web-safe fonts, Google Fonts, and custom font files
- **Customizable Colors**: User-configurable colors for name, text, and background
- **Noise Suppression**: Built-in audio preprocessing to reduce background noise
- **Auto-Updates**: Automatic update checking with release notes display
- **Cross-Platform**: Builds available for Windows and Linux
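The automatic CPU/GPU detection described above can be sketched roughly as follows. This is an illustrative helper, not the app's actual code; it assumes PyTorch is the inference backend and falls back to CPU when it is absent:

```python
def pick_device() -> str:
    """Pick the best available compute device: CUDA, then MPS, then CPU."""
    try:
        import torch
    except ImportError:
        # No PyTorch installed: only CPU inference is possible.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"  # NVIDIA GPU
    if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
        return "mps"   # Apple Silicon
    return "cpu"

print(pick_device())
```

The `auto` value for `transcription.device` corresponds to this kind of cascade; setting `cuda` or `cpu` explicitly skips the detection.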
## Quick Start
```bash
uv sync
uv run python main.py
```
### Using Pre-Built Executables

Download the latest release from the [releases page](https://repo.anhonesthost.net/streamer-tools/local-transcription/releases) and run the executable for your platform.
### Building from Source

**Linux:**

```bash
./build.sh
# Output: dist/LocalTranscription/LocalTranscription
```

**Windows:**

```cmd
build.bat
# Output: dist\LocalTranscription\LocalTranscription.exe
```

For detailed build instructions, see [BUILD.md](BUILD.md).
## Usage

### Standalone Mode
1. Launch the application
2. Select your microphone from the audio device dropdown
3. Choose a Whisper model (smaller = faster, larger = more accurate):
   - `tiny.en` / `tiny` - Fastest, good for quick captions
   - `base.en` / `base` - Balanced speed and accuracy
   - `small.en` / `small` - Better accuracy
   - `medium.en` / `medium` - High accuracy
   - `large-v3` - Best accuracy (requires more resources)
4. Click **Start** to begin transcription
5. Transcriptions appear in the main window and at `http://localhost:8080`
### OBS Browser Source Setup
1. Start the Local Transcription app
2. In OBS, add a **Browser** source
3. Set URL to `http://localhost:8080`
4. Set dimensions (e.g., 1920x300)
5. Check "Shutdown source when not visible" for performance
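Before adding the browser source, it can help to verify that the app's built-in web server is actually listening. A quick probe like this (an illustrative helper, not part of the app) checks the port:

```python
import socket

def web_server_is_up(host: str = "localhost", port: int = 8080,
                     timeout: float = 1.0) -> bool:
    """Return True if something is listening on host:port,
    e.g. the app's built-in web server for OBS capture."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    print("web server reachable:", web_server_is_up())
```

If this prints `False`, make sure the app is running and that `web_server.port` in the config matches the port you are probing.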
### Multi-User Mode (Optional)
For syncing transcriptions across multiple users (e.g., multi-host streams or translation teams):

1. Deploy the Node.js server (see [server/nodejs/README.md](server/nodejs/README.md))
2. In the app settings, enable **Server Sync**
3. Enter the server URL (e.g., `http://your-server:3000/api/send`)
4. Set a room name and passphrase (shared with other users)
5. In OBS, use the server's display URL with your room name:

   ```
   http://your-server:3000/display?room=YOURROOM&timestamps=true&maxlines=50
   ```
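The display URL in step 5 is just a query string over the room name and display options. If you script your OBS scene setup, it can be built programmatically; this sketch only assumes the parameter names visible in the URL above:

```python
from urllib.parse import urlencode

def display_url(base: str, room: str, timestamps: bool = True,
                maxlines: int = 50) -> str:
    """Build the multi-user server's display URL for an OBS browser source."""
    query = urlencode({
        "room": room,
        "timestamps": str(timestamps).lower(),  # server expects true/false
        "maxlines": maxlines,
    })
    return f"{base}/display?{query}"

print(display_url("http://your-server:3000", "YOURROOM"))
# → http://your-server:3000/display?room=YOURROOM&timestamps=true&maxlines=50
```

Using `urlencode` also takes care of escaping room names that contain spaces or special characters.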
## Configuration

Settings are stored at `~/.local-transcription/config.yaml` and can be modified through the GUI settings panel.
### Key Settings

| Setting | Description | Default |
|---------|-------------|---------|
| `transcription.model` | Whisper model to use | `base.en` |
| `transcription.device` | Processing device (auto/cuda/cpu) | `auto` |
| `transcription.enable_realtime_transcription` | Show preview while speaking | `false` |
| `transcription.silero_sensitivity` | VAD sensitivity (0-1, lower = more sensitive) | `0.4` |
| `transcription.post_speech_silence_duration` | Silence before finalizing (seconds) | `0.3` |
| `transcription.continuous_mode` | Fast speaker mode for quick talkers | `false` |
| `display.show_timestamps` | Show timestamps with transcriptions | `true` |
| `display.fade_after_seconds` | Fade out time (0 = never) | `10` |
| `display.font_source` | Font type (System Font/Web-Safe/Google Font/Custom File) | `System Font` |
| `web_server.port` | Local web server port | `8080` |

See [config/default_config.yaml](config/default_config.yaml) for all available options.
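As an example, a `~/.local-transcription/config.yaml` that overrides a few of the settings above might look like this. Values are illustrative, and the nesting is assumed from the dotted key names in the table:

```yaml
transcription:
  model: small.en              # better accuracy than the base.en default
  device: auto                 # auto-detect CUDA/MPS, fall back to CPU
  silero_sensitivity: 0.4
  post_speech_silence_duration: 0.3
display:
  show_timestamps: true
  fade_after_seconds: 10       # 0 keeps lines on screen forever
web_server:
  port: 8080
```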
## Project Structure

```
local-transcription/