Update README to reflect current application state

Remove outdated implementation plan and task checklists. Document
actual implemented features including RealtimeSTT, dual-layer VAD,
custom fonts/colors, and auto-updates. Add practical usage instructions
for standalone mode, OBS setup, and multi-user sync.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
commit bb8a8c251d
parent b7ab57f21f
2026-01-23 06:31:27 -08:00

README.md

# Local Transcription

A real-time speech-to-text desktop application for streamers. Run locally on your machine with GPU or CPU, display transcriptions via OBS browser source, and optionally sync with other users through a multi-user server.

**Version 1.4.0**

## Features

- **Real-Time Transcription**: Live speech-to-text using Whisper models with minimal latency
- **Standalone Desktop App**: PySide6/Qt GUI that works without any server
- **CPU & GPU Support**: Automatic detection of CUDA (NVIDIA), MPS (Apple Silicon), or CPU fallback
- **Advanced Voice Detection**: Dual-layer VAD (WebRTC + Silero) for accurate speech detection
- **OBS Integration**: Built-in web server for browser source capture at `http://localhost:8080`
- **Multi-User Sync**: Optional Node.js server to sync transcriptions across multiple users
- **Custom Fonts**: Support for system fonts, web-safe fonts, Google Fonts, and custom font files
- **Customizable Colors**: User-configurable colors for name, text, and background
- **Noise Suppression**: Built-in audio preprocessing to reduce background noise
- **Auto-Updates**: Automatic update checking with release notes display
- **Cross-Platform**: Builds available for Windows and Linux
## Quick Start

With `uv` installed:

```bash
uv sync
uv run python main.py
```
### Using Pre-Built Executables

Download the latest release from the [releases page](https://repo.anhonesthost.net/streamer-tools/local-transcription/releases) and run the executable for your platform.

### Building from Source

**Linux:**
```bash
./build.sh
# Output: dist/LocalTranscription/LocalTranscription
```

**Windows:**
```cmd
build.bat
# Output: dist\LocalTranscription\LocalTranscription.exe
```

For detailed build instructions, see [BUILD.md](BUILD.md).
## Usage

### Standalone Mode

1. Launch the application
2. Select your microphone from the audio device dropdown
3. Choose a Whisper model (smaller = faster, larger = more accurate):
   - `tiny.en` / `tiny` - Fastest, good for quick captions
   - `base.en` / `base` - Balanced speed and accuracy
   - `small.en` / `small` - Better accuracy
   - `medium.en` / `medium` - High accuracy
   - `large-v3` - Best accuracy (requires more resources)
4. Click **Start** to begin transcription
5. Transcriptions appear in the main window and at `http://localhost:8080`
### OBS Browser Source Setup

1. Start the Local Transcription app
2. In OBS, add a **Browser** source
3. Set URL to `http://localhost:8080`
4. Set dimensions (e.g., 1920x300)
5. Check "Shutdown source when not visible" for performance
### Multi-User Mode (Optional)

For syncing transcriptions across multiple users (e.g., multi-host streams or translation teams):

1. Deploy the Node.js server (see [server/nodejs/README.md](server/nodejs/README.md))
2. In the app settings, enable **Server Sync**
3. Enter the server URL (e.g., `http://your-server:3000/api/send`)
4. Set a room name and passphrase (shared with other users)
5. In OBS, use the server's display URL with your room name:

```
http://your-server:3000/display?room=YOURROOM&timestamps=true&maxlines=50
```
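
Before adding the browser source, it can be worth confirming the display URL is reachable from the streaming machine. A minimal check, assuming the server runs on the default port and `YOURROOM` is your room name:

```bash
# HEAD request against the display page; an HTTP 200 means the URL is reachable.
curl -I "http://your-server:3000/display?room=YOURROOM"
```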
## Configuration

Settings are stored at `~/.local-transcription/config.yaml` and can be modified through the GUI settings panel.

### Key Settings

| Setting | Description | Default |
|---------|-------------|---------|
| `transcription.model` | Whisper model to use | `base.en` |
| `transcription.device` | Processing device (auto/cuda/cpu) | `auto` |
| `transcription.enable_realtime_transcription` | Show preview while speaking | `false` |
| `transcription.silero_sensitivity` | VAD sensitivity (0-1, lower = more sensitive) | `0.4` |
| `transcription.post_speech_silence_duration` | Silence before finalizing (seconds) | `0.3` |
| `transcription.continuous_mode` | Fast speaker mode for quick talkers | `false` |
| `display.show_timestamps` | Show timestamps with transcriptions | `true` |
| `display.fade_after_seconds` | Fade out time (0 = never) | `10` |
| `display.font_source` | Font type (System Font/Web-Safe/Google Font/Custom File) | `System Font` |
| `web_server.port` | Local web server port | `8080` |

See [config/default_config.yaml](config/default_config.yaml) for all available options.
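
For orientation, here is a minimal sketch of how the dotted setting names above map onto the YAML file. The nesting is inferred from those names; treat [config/default_config.yaml](config/default_config.yaml) as the authoritative layout:

```yaml
# Hypothetical ~/.local-transcription/config.yaml fragment; keys mirror
# the table above, and grouping is inferred from the dotted names.
transcription:
  model: base.en                      # tiny.en ... large-v3
  device: auto                        # auto, cuda, or cpu
  enable_realtime_transcription: false
  silero_sensitivity: 0.4             # lower = more sensitive
  post_speech_silence_duration: 0.3   # seconds of silence before finalizing
  continuous_mode: false
display:
  show_timestamps: true
  fade_after_seconds: 10              # 0 = never fade
  font_source: System Font
web_server:
  port: 8080
```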
## Project Structure

```
local-transcription/
├── client/                               # Core transcription modules
│   ├── audio_capture.py                  # Audio input handling
│   ├── transcription_engine_realtime.py  # RealtimeSTT integration
│   ├── noise_suppression.py              # VAD and noise reduction
│   ├── device_utils.py                   # CPU/GPU detection
│   ├── config.py                         # Configuration management
│   ├── server_sync.py                    # Multi-user server client
│   └── update_checker.py                 # Auto-update functionality
├── gui/                                  # Desktop application UI
│   ├── main_window_qt.py                 # Main application window
│   ├── settings_dialog_qt.py             # Settings dialog
│   └── transcription_display_qt.py       # Display widget
├── server/                               # Web servers
│   ├── web_display.py                    # Local FastAPI server for OBS
│   └── nodejs/                           # Multi-user sync server
│       ├── server.js                     # Express + WebSocket server
│       └── README.md                     # Deployment instructions
├── config/
│   └── default_config.yaml               # Default settings template
├── main.py                               # GUI entry point
├── main_cli.py                           # CLI version (for testing)
├── build.sh                              # Linux build script
├── build.bat                             # Windows build script
└── local-transcription.spec              # PyInstaller configuration
```
## Technology Stack

### Desktop Application

- **Python 3.9+**
- **PySide6** - Qt6 GUI framework
- **RealtimeSTT** - Real-time speech-to-text with advanced VAD
- **faster-whisper** - Optimized Whisper model inference
- **PyTorch** - ML framework (CUDA-enabled)
- **sounddevice** - Cross-platform audio capture
- **webrtcvad + silero_vad** - Voice activity detection
- **noisereduce** - Noise suppression

### Web Servers

- **FastAPI + Uvicorn** - Local web display server
- **Node.js + Express + WebSocket** - Multi-user sync server

### Build Tools

- **PyInstaller** - Executable packaging
- **uv** - Fast Python package manager
## System Requirements

### Minimum

- Python 3.9+
- 4GB RAM
- Any modern CPU

### Recommended (for real-time performance)

- 8GB+ RAM
- NVIDIA GPU with CUDA support (for GPU acceleration)
- FFmpeg (installed automatically with dependencies)

### For Building

- **Linux**: gcc, Python dev headers
- **Windows**: Visual Studio Build Tools, Python dev headers

## Troubleshooting
### Model Loading Issues
- Models download automatically on first use to `~/.cache/huggingface/`
- First run requires internet connection
- Check disk space (models range from 75MB to 3GB)
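
Models can also be fetched ahead of time, for example before a stream on a slow or metered connection. The snippet below assumes the `faster-whisper` backend listed in the technology stack; `base.en` is just an example model name:

```bash
# Pre-download a Whisper model into the local cache (one-time, needs internet).
uv run python -c "from faster_whisper import WhisperModel; WhisperModel('base.en')"
```
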
### Audio Device Issues
```bash
# List available audio devices
uv run python main_cli.py --list-devices
```

- Ensure microphone permissions are granted
- Try different device indices in settings
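
If the CLI flag is unavailable, the same information can be read straight from the audio backend; this uses the `sounddevice` library the app already depends on:

```bash
# Print every input/output device the audio backend can see.
uv run python -c "import sounddevice as sd; print(sd.query_devices())"
```
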
### GPU Not Detected

```bash
# Check CUDA availability
uv run python -c "import torch; print(torch.cuda.is_available())"
```

- Install NVIDIA drivers (CUDA toolkit is bundled)
- The app automatically falls back to CPU if no GPU is available
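
To see which GPU PyTorch actually picked up, rather than just True/False, a slightly longer probe helps; this is plain PyTorch, not app-specific:

```bash
# Report the detected GPU, or note the CPU fallback.
uv run python -c "import torch; print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'no CUDA device - CPU fallback')"
```
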
### Web Server Port Conflicts

- Default port is 8080
- Change in settings or edit config file
- Check for conflicts: `lsof -i :8080` (Linux) or `netstat -ano | findstr :8080` (Windows)

## Use Cases

- **Live Streaming Captions**: Add real-time captions to your Twitch/YouTube streams
- **Multi-Language Translation**: Multiple translators transcribing in different languages
- **Accessibility**: Provide captions for hearing-impaired viewers
- **Podcast Recording**: Real-time transcription for multi-host shows
- **Gaming Commentary**: Track who said what in multiplayer sessions

## Contributing

Contributions are welcome! Please feel free to submit issues or pull requests at the [repository](https://repo.anhonesthost.net/streamer-tools/local-transcription).
## License

MIT License
## Acknowledgments

- [OpenAI Whisper](https://github.com/openai/whisper) for the speech recognition model
- [RealtimeSTT](https://github.com/KoljaB/RealtimeSTT) for real-time transcription capabilities
- [faster-whisper](https://github.com/guillaumekln/faster-whisper) for optimized inference