Update README to reflect current application state

Remove outdated implementation plan and task checklists. Document
actual implemented features including RealtimeSTT, dual-layer VAD,
custom fonts/colors, and auto-updates. Add practical usage instructions
for standalone mode, OBS setup, and multi-user sync.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
commit bb8a8c251d
parent b7ab57f21f
2026-01-23 06:31:27 -08:00

README.md

# Local Transcription

A real-time speech-to-text desktop application for streamers. Run locally on your machine with GPU or CPU, display transcriptions via OBS browser source, and optionally sync with other users through a multi-user server.

**Version 1.4.0**

## Features

- **Real-Time Transcription**: Live speech-to-text using Whisper models with minimal latency
- **Standalone Desktop App**: PySide6/Qt GUI that works without any server
- **CPU & GPU Support**: Automatic detection of CUDA (NVIDIA), MPS (Apple Silicon), or CPU fallback
- **Advanced Voice Detection**: Dual-layer VAD (WebRTC + Silero) for accurate speech detection
- **OBS Integration**: Built-in web server for browser source capture at `http://localhost:8080`
- **Multi-User Sync**: Optional Node.js server to sync transcriptions across multiple users
- **Custom Fonts**: Support for system fonts, web-safe fonts, Google Fonts, and custom font files
- **Customizable Colors**: User-configurable colors for name, text, and background
- **Noise Suppression**: Built-in audio preprocessing to reduce background noise
- **Auto-Updates**: Automatic update checking with release notes display
- **Cross-Platform**: Builds available for Windows and Linux
## Quick Start

With `uv` installed:

```bash
uv sync
uv run python main.py
```
### Using Pre-Built Executables

Download the latest release from the [releases page](https://repo.anhonesthost.net/streamer-tools/local-transcription/releases) and run the executable for your platform.

### Building from Source

**Linux:**
```bash
./build.sh
# Output: dist/LocalTranscription/LocalTranscription
```

**Windows:**
```cmd
build.bat
# Output: dist\LocalTranscription\LocalTranscription.exe
```

For detailed build instructions, see [BUILD.md](BUILD.md).
## Usage

### Standalone Mode

1. Launch the application
2. Select your microphone from the audio device dropdown
3. Choose a Whisper model (smaller = faster, larger = more accurate):
   - `tiny.en` / `tiny` - Fastest, good for quick captions
   - `base.en` / `base` - Balanced speed and accuracy
   - `small.en` / `small` - Better accuracy
   - `medium.en` / `medium` - High accuracy
   - `large-v3` - Best accuracy (requires more resources)
4. Click **Start** to begin transcription
5. Transcriptions appear in the main window and at `http://localhost:8080`
### OBS Browser Source Setup

1. Start the Local Transcription app
2. In OBS, add a **Browser** source
3. Set URL to `http://localhost:8080`
4. Set dimensions (e.g., 1920x300)
5. Check "Shutdown source when not visible" for performance
### Multi-User Mode (Optional)

For syncing transcriptions across multiple users (e.g., multi-host streams or translation teams):

1. Deploy the Node.js server (see [server/nodejs/README.md](server/nodejs/README.md))
2. In the app settings, enable **Server Sync**
3. Enter the server URL (e.g., `http://your-server:3000/api/send`)
4. Set a room name and passphrase (shared with other users)
5. In OBS, use the server's display URL with your room name:

```
http://your-server:3000/display?room=YOURROOM&timestamps=true&maxlines=50
```
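
Before adding the browser source, it can be worth confirming the display URL is reachable from the streaming machine. A minimal check, assuming the server runs on the default port and `YOURROOM` is your room name:

```bash
# HEAD request against the display page; an HTTP 200 means the URL is reachable.
curl -I "http://your-server:3000/display?room=YOURROOM"
```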
## Configuration

Settings are stored at `~/.local-transcription/config.yaml` and can be modified through the GUI settings panel.

### Key Settings

| Setting | Description | Default |
|---------|-------------|---------|
| `transcription.model` | Whisper model to use | `base.en` |
| `transcription.device` | Processing device (auto/cuda/cpu) | `auto` |
| `transcription.enable_realtime_transcription` | Show preview while speaking | `false` |
| `transcription.silero_sensitivity` | VAD sensitivity (0-1, lower = more sensitive) | `0.4` |
| `transcription.post_speech_silence_duration` | Silence before finalizing (seconds) | `0.3` |
| `transcription.continuous_mode` | Fast speaker mode for quick talkers | `false` |
| `display.show_timestamps` | Show timestamps with transcriptions | `true` |
| `display.fade_after_seconds` | Fade out time (0 = never) | `10` |
| `display.font_source` | Font type (System Font/Web-Safe/Google Font/Custom File) | `System Font` |
| `web_server.port` | Local web server port | `8080` |

See [config/default_config.yaml](config/default_config.yaml) for all available options.
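
For orientation, here is a minimal sketch of how the dotted setting names above map onto the YAML file. The nesting is inferred from those names; treat [config/default_config.yaml](config/default_config.yaml) as the authoritative layout:

```yaml
# Hypothetical ~/.local-transcription/config.yaml fragment; keys mirror
# the table above, and grouping is inferred from the dotted names.
transcription:
  model: base.en                      # tiny.en ... large-v3
  device: auto                        # auto, cuda, or cpu
  enable_realtime_transcription: false
  silero_sensitivity: 0.4             # lower = more sensitive
  post_speech_silence_duration: 0.3   # seconds of silence before finalizing
  continuous_mode: false
display:
  show_timestamps: true
  fade_after_seconds: 10              # 0 = never fade
  font_source: System Font
web_server:
  port: 8080
```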
## Project Structure

```
local-transcription/
├── client/                               # Core transcription modules
│   ├── audio_capture.py                  # Audio input handling
│   ├── transcription_engine_realtime.py  # RealtimeSTT integration
│   ├── noise_suppression.py              # VAD and noise reduction
│   ├── device_utils.py                   # CPU/GPU detection
│   ├── config.py                         # Configuration management
│   ├── server_sync.py                    # Multi-user server client
│   └── update_checker.py                 # Auto-update functionality
├── gui/                                  # Desktop application UI
│   ├── main_window_qt.py                 # Main application window
│   ├── settings_dialog_qt.py             # Settings dialog
│   └── transcription_display_qt.py       # Display widget
├── server/                               # Web servers
│   ├── web_display.py                    # Local FastAPI server for OBS
│   └── nodejs/                           # Multi-user sync server
│       ├── server.js                     # Express + WebSocket server
│       └── README.md                     # Deployment instructions
├── config/
│   └── default_config.yaml               # Default settings template
├── main.py                               # GUI entry point
├── main_cli.py                           # CLI version (for testing)
├── build.sh                              # Linux build script
├── build.bat                             # Windows build script
└── local-transcription.spec              # PyInstaller configuration
```
## Technology Stack

### Desktop Application

- **Python 3.9+**
- **PySide6** - Qt6 GUI framework
- **RealtimeSTT** - Real-time speech-to-text with advanced VAD
- **faster-whisper** - Optimized Whisper model inference
- **PyTorch** - ML framework (CUDA-enabled)
- **sounddevice** - Cross-platform audio capture
- **webrtcvad + silero_vad** - Voice activity detection
- **noisereduce** - Noise suppression

### Web Servers

- **FastAPI + Uvicorn** - Local web display server
- **Node.js + Express + WebSocket** - Multi-user sync server

### Build Tools

- **PyInstaller** - Executable packaging
- **uv** - Fast Python package manager
## System Requirements

### Minimum

- Python 3.9+
- 4GB RAM
- Any modern CPU

### Recommended (for real-time performance)

- 8GB+ RAM
- NVIDIA GPU with CUDA support (for GPU acceleration)
- FFmpeg (installed automatically with dependencies)

### For Building

- **Linux**: gcc, Python dev headers
- **Windows**: Visual Studio Build Tools, Python dev headers

## Troubleshooting
### Model Loading Issues
- Models download automatically on first use to `~/.cache/huggingface/`
- First run requires internet connection
- Check disk space (models range from 75MB to 3GB)
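
Models can also be fetched ahead of time, for example before a stream on a slow or metered connection. The snippet below assumes the `faster-whisper` backend listed in the technology stack; `base.en` is just an example model name:

```bash
# Pre-download a Whisper model into the local cache (one-time, needs internet).
uv run python -c "from faster_whisper import WhisperModel; WhisperModel('base.en')"
```
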
### Audio Device Issues
```bash
# List available audio devices
uv run python main_cli.py --list-devices
```

- Ensure microphone permissions are granted
- Try different device indices in settings
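
If the CLI flag is unavailable, the same information can be read straight from the audio backend; this uses the `sounddevice` library the app already depends on:

```bash
# Print every input/output device the audio backend can see.
uv run python -c "import sounddevice as sd; print(sd.query_devices())"
```
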
### GPU Not Detected

```bash
# Check CUDA availability
uv run python -c "import torch; print(torch.cuda.is_available())"
```

- Install NVIDIA drivers (CUDA toolkit is bundled)
- The app automatically falls back to CPU if no GPU is available
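
To see which GPU PyTorch actually picked up, rather than just True/False, a slightly longer probe helps; this is plain PyTorch, not app-specific:

```bash
# Report the detected GPU, or note the CPU fallback.
uv run python -c "import torch; print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'no CUDA device - CPU fallback')"
```
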
### Web Server Port Conflicts

- Default port is 8080
- Change in settings or edit config file
- Check for conflicts: `lsof -i :8080` (Linux) or `netstat -ano | findstr :8080` (Windows)

## Use Cases

- **Live Streaming Captions**: Add real-time captions to your Twitch/YouTube streams
- **Multi-Language Translation**: Multiple translators transcribing in different languages
- **Accessibility**: Provide captions for hearing-impaired viewers
- **Podcast Recording**: Real-time transcription for multi-host shows
- **Gaming Commentary**: Track who said what in multiplayer sessions

## Contributing

Contributions are welcome! Please feel free to submit issues or pull requests at the [repository](https://repo.anhonesthost.net/streamer-tools/local-transcription).
## License

MIT License
## Acknowledgments

- [OpenAI Whisper](https://github.com/openai/whisper) for the speech recognition model
- [RealtimeSTT](https://github.com/KoljaB/RealtimeSTT) for real-time transcription capabilities
- [faster-whisper](https://github.com/guillaumekln/faster-whisper) for optimized inference