Update README to reflect current application state
Remove outdated implementation plan and task checklists. Document actual implemented features including RealtimeSTT, dual-layer VAD, custom fonts/colors, and auto-updates. Add practical usage instructions for standalone mode, OBS setup, and multi-user sync.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
# Local Transcription

A real-time speech-to-text desktop application for streamers. Run locally on your machine with GPU or CPU, display transcriptions via an OBS browser source, and optionally sync with other users through a multi-user server.

**Version 1.4.0**

## Features

- **Real-Time Transcription**: Live speech-to-text using Whisper models with minimal latency
- **Standalone Desktop App**: PySide6/Qt GUI that works without any server
- **CPU & GPU Support**: Automatic detection of CUDA (NVIDIA), MPS (Apple Silicon), or CPU fallback
- **Advanced Voice Detection**: Dual-layer VAD (WebRTC + Silero) for accurate speech detection
- **OBS Integration**: Built-in web server for browser source capture at `http://localhost:8080`
- **Multi-User Sync**: Optional Node.js server to sync transcriptions across multiple users
- **Custom Fonts**: Support for system fonts, web-safe fonts, Google Fonts, and custom font files
- **Customizable Colors**: User-configurable colors for name, text, and background
- **Noise Suppression**: Built-in audio preprocessing to reduce background noise
- **Privacy-First**: All processing happens locally; only transcription text is shared
- **Auto-Updates**: Automatic update checking with release notes display
- **Cross-Platform**: Builds available for Windows and Linux

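The automatic device detection described above can be sketched roughly as follows. This is an illustrative outline only (not the app's actual code), and it assumes PyTorch is the backend used for detection:

```python
def pick_device(preference: str = "auto") -> str:
    """Resolve 'auto' to the best available backend: CUDA, then MPS, then CPU."""
    if preference != "auto":
        return preference  # honor an explicit cuda/cpu choice
    try:
        import torch  # assumption: PyTorch ships with the Whisper stack
        if torch.cuda.is_available():
            return "cuda"  # NVIDIA GPU
        mps = getattr(torch.backends, "mps", None)
        if mps is not None and mps.is_available():
            return "mps"  # Apple Silicon
    except ImportError:
        pass  # no torch installed: fall through to CPU
    return "cpu"

print(pick_device("auto"))
```

The same fallback order applies when a configured GPU is unavailable: the app degrades to CPU rather than failing to start.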
## Quick Start

```bash
uv sync
uv run python main.py
```

### Using Pre-Built Executables

Download the latest release from the [releases page](https://repo.anhonesthost.net/streamer-tools/local-transcription/releases) and run the executable for your platform.

### Building from Source

**Linux:**
```bash
./build.sh
# Output: dist/LocalTranscription/LocalTranscription
```

**Windows:**
```cmd
build.bat
# Output: dist\LocalTranscription\LocalTranscription.exe
```

For detailed build instructions, see [BUILD.md](BUILD.md).

## Usage

### Standalone Mode

1. Launch the application
2. Select your microphone from the audio device dropdown
3. Choose a Whisper model (smaller = faster, larger = more accurate):
   - `tiny.en` / `tiny` - Fastest, good for quick captions
   - `base.en` / `base` - Balanced speed and accuracy
   - `small.en` / `small` - Better accuracy
   - `medium.en` / `medium` - High accuracy
   - `large-v3` - Best accuracy (requires more resources)
4. Click **Start** to begin transcription
5. Transcriptions appear in the main window and at `http://localhost:8080`

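As the model list above shows, every size up to `medium` has an English-only `.en` variant, while `large-v3` is multilingual only. A small illustrative helper (not part of the app) for mapping a size to a valid model name:

```python
# Whisper model sizes, ordered fastest to most accurate.
MODELS = ["tiny", "base", "small", "medium", "large-v3"]

def model_name(size: str, english_only: bool = False) -> str:
    """Return the Whisper model identifier, appending `.en` where a variant exists."""
    if size not in MODELS:
        raise ValueError(f"unknown model size: {size}")
    if english_only and size != "large-v3":  # large-v3 has no English-only variant
        return size + ".en"
    return size

print(model_name("base", english_only=True))  # base.en
```

English-only variants are typically a bit faster and more accurate for English speech, so they are a sensible default for English-language streams.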
### OBS Browser Source Setup

1. Start the Local Transcription app
2. In OBS, add a **Browser** source
3. Set URL to `http://localhost:8080`
4. Set dimensions (e.g., 1920x300)
5. Check "Shutdown source when not visible" for performance

### Multi-User Mode (Optional)

For syncing transcriptions across multiple users (e.g., multi-host streams or translation teams):

1. Deploy the Node.js server (see [server/nodejs/README.md](server/nodejs/README.md))
2. In the app settings, enable **Server Sync**
3. Enter the server URL (e.g., `http://your-server:3000/api/send`)
4. Set a room name and passphrase (shared with other users)
5. In OBS, use the server's display URL with your room name:

```
http://your-server:3000/display?room=YOURROOM&timestamps=true&maxlines=50
```

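The display URL's query string can also be assembled programmatically, e.g. when generating per-room OBS links for several hosts. A sketch using only the parameters shown above (`room`, `timestamps`, `maxlines`):

```python
from urllib.parse import urlencode

def display_url(server: str, room: str, timestamps: bool = True, maxlines: int = 50) -> str:
    """Build the multi-user display URL for an OBS browser source."""
    query = urlencode({
        "room": room,
        "timestamps": str(timestamps).lower(),  # the example URL uses lowercase true/false
        "maxlines": maxlines,
    })
    return f"{server}/display?{query}"

print(display_url("http://your-server:3000", "YOURROOM"))
# http://your-server:3000/display?room=YOURROOM&timestamps=true&maxlines=50
```

Using `urlencode` also keeps room names with spaces or special characters safe in the URL.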
## Configuration

Settings are stored at `~/.local-transcription/config.yaml` and can be modified through the GUI settings panel.

### Key Settings

| Setting | Description | Default |
|---------|-------------|---------|
| `transcription.model` | Whisper model to use | `base.en` |
| `transcription.device` | Processing device (auto/cuda/cpu) | `auto` |
| `transcription.enable_realtime_transcription` | Show preview while speaking | `false` |
| `transcription.silero_sensitivity` | VAD sensitivity (0-1, lower = more sensitive) | `0.4` |
| `transcription.post_speech_silence_duration` | Silence before finalizing (seconds) | `0.3` |
| `transcription.continuous_mode` | Fast speaker mode for quick talkers | `false` |
| `display.show_timestamps` | Show timestamps with transcriptions | `true` |
| `display.fade_after_seconds` | Fade out time (0 = never) | `10` |
| `display.font_source` | Font type (System Font/Web-Safe/Google Font/Custom File) | `System Font` |
| `web_server.port` | Local web server port | `8080` |

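Put into `config.yaml`, the settings above might look like the fragment below. The grouping into `transcription`, `display`, and `web_server` sections follows the dotted key names in the table, though the file's exact layout may differ from this sketch:

```yaml
transcription:
  model: base.en
  device: auto
  enable_realtime_transcription: false
  silero_sensitivity: 0.4
  post_speech_silence_duration: 0.3
  continuous_mode: false

display:
  show_timestamps: true
  fade_after_seconds: 10
  font_source: System Font

web_server:
  port: 8080
```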
See [config/default_config.yaml](config/default_config.yaml) for all available options.

## Project Structure

```
local-transcription/