diff --git a/README.md b/README.md
index c045df5..c77516a 100644
--- a/README.md
+++ b/README.md
@@ -1,19 +1,22 @@
-# Local Transcription for Streamers
+# Local Transcription

-A local speech-to-text application designed for streamers that provides real-time transcription using Whisper or similar models. Multiple users can run the application locally and sync their transcriptions to a centralized web stream that can be easily captured in OBS or other streaming software.
+A real-time speech-to-text desktop application for streamers. Run it locally on GPU or CPU, display transcriptions via an OBS browser source, and optionally sync with other users through a multi-user server.
+
+**Version 1.4.0**

 ## Features

-- **Standalone Desktop Application**: Use locally with built-in GUI display - no server required
-- **Local Transcription**: Run Whisper (or compatible models) locally on your machine
-- **CPU/GPU Support**: Choose between CPU or GPU processing based on your hardware
-- **Real-time Processing**: Live audio transcription with minimal latency
+- **Real-Time Transcription**: Live speech-to-text using Whisper models with minimal latency
+- **Standalone Desktop App**: PySide6/Qt GUI that works without any server
+- **CPU & GPU Support**: Automatic detection of CUDA (NVIDIA), MPS (Apple Silicon), or CPU fallback
+- **Advanced Voice Detection**: Dual-layer VAD (WebRTC + Silero) for accurate speech detection
+- **OBS Integration**: Built-in web server for browser source capture at `http://localhost:8080`
+- **Multi-User Sync**: Optional Node.js server to sync transcriptions across multiple users
+- **Custom Fonts**: Support for system fonts, web-safe fonts, Google Fonts, and custom font files
+- **Customizable Colors**: User-configurable colors for name, text, and background
 - **Noise Suppression**: Built-in audio preprocessing to reduce background noise
-- **User Configuration**: Set your display name and preferences through the GUI
-- **Optional Multi-user Sync**: Connect to a server to sync transcriptions with other users
-- **OBS Integration**: Web-based output designed for easy browser source capture
-- **Privacy-First**: All processing happens locally; only transcription text is shared
-- **Customizable**: Configure model size, language, and streaming settings
+- **Auto-Updates**: Automatic update checking with release notes display
+- **Cross-Platform**: Builds available for Windows and Linux

 ## Quick Start
@@ -27,468 +30,195 @@
 uv sync
 uv run python main.py
 ```

-### Building Standalone Executables
+### Using Pre-Built Executables

-To create standalone executables for distribution:
+Download the latest release from the [releases page](https://repo.anhonesthost.net/streamer-tools/local-transcription/releases) and run the executable for your platform.
+
+### Building from Source

 **Linux:**
 ```bash
 ./build.sh
+# Output: dist/LocalTranscription/LocalTranscription
 ```

 **Windows:**
 ```cmd
 build.bat
+# Output: dist\LocalTranscription\LocalTranscription.exe
 ```

 For detailed build instructions, see [BUILD.md](BUILD.md).

-## Architecture Overview
+## Usage

-The application can run in two modes:
+### Standalone Mode

-### Standalone Mode (No Server Required):
-1. **Desktop Application**: Captures audio, performs speech-to-text, and displays transcriptions locally in a GUI window
+1. Launch the application
+2. Select your microphone from the audio device dropdown
+3. Choose a Whisper model (smaller = faster, larger = more accurate):
+   - `tiny.en` / `tiny` - Fastest, good for quick captions
+   - `base.en` / `base` - Balanced speed and accuracy
+   - `small.en` / `small` - Better accuracy
+   - `medium.en` / `medium` - High accuracy
+   - `large-v3` - Best accuracy (requires more resources)
+4. Click **Start** to begin transcription
+5. Transcriptions appear in the main window and at `http://localhost:8080`
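+
+Under the hood the app drives a RealtimeSTT recorder loop. A minimal sketch of that pattern (not the app's actual wiring; parameter values are illustrative and mirror the config keys documented under Configuration):
+
+```python
+from RealtimeSTT import AudioToTextRecorder
+
+if __name__ == "__main__":
+    recorder = AudioToTextRecorder(
+        model="base.en",                   # transcription.model
+        device="cuda",                     # transcription.device ("cpu" also works)
+        silero_sensitivity=0.4,            # transcription.silero_sensitivity
+        post_speech_silence_duration=0.3,  # transcription.post_speech_silence_duration
+    )
+    while True:
+        # Blocks until a finalized utterance is available, then returns its text.
+        print(recorder.text())
+```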
-### Multi-user Sync Mode (Optional):
-1. **Local Transcription Client**: Captures audio, performs speech-to-text, and sends results to the web server
-2. **Centralized Web Server**: Aggregates transcriptions from multiple clients and serves a web stream
-3. **Web Stream Interface**: Browser-accessible page displaying synchronized transcriptions (for OBS capture)

+### OBS Browser Source Setup

-## Use Cases
+1. Start the Local Transcription app
+2. In OBS, add a **Browser** source
+3. Set URL to `http://localhost:8080`
+4. Set dimensions (e.g., 1920x300)
+5. Check "Shutdown source when not visible" for performance

-- **Multi-language Streams**: Multiple translators transcribing in different languages
-- **Accessibility**: Provide real-time captions for viewers
-- **Collaborative Podcasts**: Multiple hosts with separate transcriptions
-- **Gaming Commentary**: Track who said what in multiplayer sessions
+### Multi-User Mode (Optional)

----
+For syncing transcriptions across multiple users (e.g., multi-host streams or translation teams):

-## Implementation Plan
+1. Deploy the Node.js server (see [server/nodejs/README.md](server/nodejs/README.md))
+2. In the app settings, enable **Server Sync**
+3. Enter the server URL (e.g., `http://your-server:3000/api/send`)
+4. Set a room name and passphrase (shared with other users)
+5. In OBS, use the server's display URL with your room name:
+   ```
+   http://your-server:3000/display?room=YOURROOM&timestamps=true&maxlines=50
+   ```
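+
+The exact request format is defined by the Node.js server (see its README); as a rough sketch, a client post to the sync endpoint might look like the following. The payload field names here are illustrative assumptions, not the documented API:
+
+```python
+import json
+import urllib.request
+
+# Hypothetical payload; check server/nodejs/README.md for the real schema.
+payload = {
+    "room": "YOURROOM",
+    "passphrase": "shared-secret",
+    "name": "Streamer1",
+    "text": "Hello chat!",
+}
+req = urllib.request.Request(
+    "http://your-server:3000/api/send",
+    data=json.dumps(payload).encode("utf-8"),
+    headers={"Content-Type": "application/json"},
+)
+with urllib.request.urlopen(req) as resp:
+    print(resp.status)  # 200 on success
+```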
-### Phase 1: Standalone Desktop Application
+## Configuration

-**Objective**: Build a fully functional standalone transcription app with GUI that works without any server
+Settings are stored at `~/.local-transcription/config.yaml` and can be modified through the GUI settings panel.

-#### Components:
-1. **Audio Capture Module**
-   - Capture system audio or microphone input
-   - Support multiple audio sources (virtual audio cables, physical devices)
-   - Real-time audio buffering with configurable chunk sizes
-   - **Noise Suppression**: Preprocess audio to reduce background noise
-   - Libraries: `pyaudio`, `sounddevice`, `noisereduce`, `webrtcvad`
+### Key Settings

-2. **Noise Suppression Engine**
-   - Real-time noise reduction using RNNoise or noisereduce
-   - Adjustable noise reduction strength
-   - Optional VAD (Voice Activity Detection) to skip silent segments
-   - Libraries: `noisereduce`, `rnnoise-python`, `webrtcvad`
+| Setting | Description | Default |
+|---------|-------------|---------|
+| `transcription.model` | Whisper model to use | `base.en` |
+| `transcription.device` | Processing device (auto/cuda/cpu) | `auto` |
+| `transcription.enable_realtime_transcription` | Show a live preview while speaking | `false` |
+| `transcription.silero_sensitivity` | VAD sensitivity (0-1, lower = more sensitive) | `0.4` |
+| `transcription.post_speech_silence_duration` | Silence (seconds) before an utterance is finalized | `0.3` |
+| `transcription.continuous_mode` | Continuous capture mode for fast speakers | `false` |
+| `display.show_timestamps` | Show timestamps with transcriptions | `true` |
+| `display.fade_after_seconds` | Fade-out time in seconds (0 = never) | `10` |
+| `display.font_source` | Font type (System Font/Web-Safe/Google Font/Custom File) | `System Font` |
+| `web_server.port` | Local web server port | `8080` |
-3. **Transcription Engine**
-   - Integrate OpenAI Whisper (or alternatives: faster-whisper, whisper.cpp)
-   - Support multiple model sizes (tiny, base, small, medium, large)
-   - CPU and GPU inference options
-   - Model management and automatic downloading
-   - Libraries: `openai-whisper`, `faster-whisper`, `torch`
-4. **Device Selection**
-   - Auto-detect available compute devices (CPU, CUDA, MPS for Mac)
-   - Allow user to specify preferred device via GUI
-   - Graceful fallback if GPU unavailable
-   - Display device status and performance metrics
-5. **Desktop GUI Application**
-   - Cross-platform GUI using PyQt6, Tkinter, or CustomTkinter
-   - Main transcription display window (scrolling text area)
-   - Settings panel for configuration
-   - User name input field
-   - Audio input device selector
-   - Model size selector
-   - CPU/GPU toggle
-   - Start/Stop transcription button
-   - Optional: System tray integration
-   - Libraries: `PyQt6`, `customtkinter`, or `tkinter`
-6. **Local Display**
-   - Real-time transcription display in GUI window
-   - Scrolling text with timestamps
-   - User name/label shown with transcriptions
-   - Copy transcription to clipboard
-   - Optional: Save transcription to file (TXT, SRT, VTT)
-#### Tasks:
-- [ ] Set up project structure and dependencies
-- [ ] Implement audio capture with device selection
-- [ ] Add noise suppression and VAD preprocessing
-- [ ] Integrate Whisper model loading and inference
-- [ ] Add CPU/GPU device detection and selection logic
-- [ ] Create real-time audio buffer processing pipeline
-- [ ] Design and implement GUI layout (main window)
-- [ ] Add settings panel with user name configuration
-- [ ] Implement local transcription display area
-- [ ] Add start/stop controls and status indicators
-- [ ] Test transcription accuracy and latency
-- [ ] Test noise suppression effectiveness
----
-### Phase 2: Web Server and Sync System
-**Objective**: Create a centralized server to aggregate and serve transcriptions
-#### Components:
-1. **Web Server**
-   - FastAPI or Flask-based REST API
-   - WebSocket support for real-time updates
-   - User/client registration and management
-   - Libraries: `fastapi`, `uvicorn`, `websockets`
-2. **Transcription Aggregator**
-   - Receive transcription chunks from multiple clients
-   - Associate transcriptions with user IDs/names
-   - Timestamp management and synchronization
-   - Buffer management for smooth streaming
-3. **Database/Storage** (Optional)
-   - Store transcription history (SQLite for simplicity)
-   - Session management
-   - Export functionality (SRT, VTT, TXT formats)
-#### API Endpoints:
-- `POST /api/register` - Register a new client
-- `POST /api/transcription` - Submit transcription chunk
-- `WS /api/stream` - WebSocket for real-time transcription stream
-- `GET /stream` - Web page for OBS browser source
-#### Tasks:
-- [ ] Set up FastAPI server with CORS support
-- [ ] Implement WebSocket handler for real-time streaming
-- [ ] Create client registration system
-- [ ] Build transcription aggregation logic
-- [ ] Add timestamp synchronization
-- [ ] Create data models for clients and transcriptions
----
-### Phase 3: Client-Server Communication (Optional Multi-user Mode)
-**Objective**: Add optional server connectivity to enable multi-user transcription sync
-#### Components:
-1. **HTTP/WebSocket Client**
-   - Register client with server on startup
-   - Send transcription chunks as they're generated
-   - Handle connection drops and reconnection
-   - Libraries: `requests`, `websockets`
-2. **Configuration System**
-   - Config file for server URL, API keys, user settings
-   - Model preferences (size, language)
-   - Audio input settings
-   - Format: YAML or JSON
-3. **Status Monitoring**
-   - Connection status indicator
-   - Transcription queue health
-   - Error handling and logging
-#### Tasks:
-- [ ] Add "Enable Server Sync" toggle to GUI
-- [ ] Add server URL configuration field in settings
-- [ ] Implement WebSocket client for sending transcriptions
-- [ ] Add configuration file support (YAML/JSON)
-- [ ] Create connection management with auto-reconnect
-- [ ] Add local logging and error handling
-- [ ] Add server connection status indicator to GUI
-- [ ] Allow app to function normally if server is unavailable
----
-### Phase 4: Web Stream Interface (OBS Integration)
-**Objective**: Create a web page that displays synchronized transcriptions for OBS
-#### Components:
-1. **Web Frontend**
-   - HTML/CSS/JavaScript page for displaying transcriptions
-   - Responsive design with customizable styling
-   - Auto-scroll with configurable retention window
-   - Libraries: Vanilla JS or lightweight framework (Alpine.js, htmx)
-2. **Styling Options**
-   - Customizable fonts, colors, sizes
-   - Background transparency for OBS chroma key
-   - User name/ID display options
-   - Timestamp display (optional)
-3. **Display Modes**
-   - Scrolling captions (like live TV captions)
-   - Multi-user panel view (separate sections per user)
-   - Overlay mode (minimal UI for transparency)
-#### Tasks:
-- [ ] Create HTML template for transcription display
-- [ ] Implement WebSocket client in JavaScript
-- [ ] Add CSS styling with OBS-friendly transparency
-- [ ] Create customization controls (URL parameters or UI)
-- [ ] Test with OBS browser source
-- [ ] Add configurable retention/scroll behavior
----
-### Phase 5: Advanced Features
-**Objective**: Enhance functionality and user experience
-#### Features:
-1. **Language Detection**
-   - Auto-detect spoken language
-   - Multi-language support in single stream
-   - Language selector in GUI
-2. **Speaker Diarization** (Optional)
-   - Identify different speakers
-   - Label transcriptions by speaker
-   - Useful for multi-host streams
-3. **Profanity Filtering**
-   - Optional word filtering/replacement
-   - Customizable filter lists
-   - Toggle in GUI settings
-4. **Advanced Noise Profiles**
-   - Save and load custom noise profiles
-   - Adaptive noise suppression
-   - Different profiles for different environments
-5. **Export Functionality**
-   - Save transcriptions in multiple formats (TXT, SRT, VTT, JSON)
-   - Export button in GUI
-   - Automatic session saving
-6. **Hotkey Support**
-   - Global hotkeys to start/stop transcription
-   - Mute/unmute hotkey
-   - Quick save hotkey
-7. **Docker Support**
-   - Containerized server deployment
-   - Docker Compose for easy multi-component setup
-   - Pre-built images for easy deployment
-8. **Themes and Customization**
-   - Dark/light theme toggle
-   - Customizable font sizes and colors for display
-   - OBS-friendly transparent overlay mode
-#### Tasks:
-- [ ] Add language detection and multi-language support
-- [ ] Implement speaker diarization
-- [ ] Create optional profanity filter
-- [ ] Add export functionality (SRT, VTT, plain text, JSON)
-- [ ] Implement global hotkey support
-- [ ] Create Docker containers for server component
-- [ ] Add theme customization options
-- [ ] Create advanced noise profile management
----
-## Technology Stack
-### Local Client:
-- **Python 3.9+**
-- **GUI**: PyQt6 / CustomTkinter / tkinter
-- **Audio**: PyAudio / sounddevice
-- **Noise Suppression**: noisereduce / rnnoise-python
-- **VAD**: webrtcvad
-- **ML Framework**: PyTorch (for Whisper)
-- **Transcription**: openai-whisper / faster-whisper
-- **Networking**: websockets, requests (optional for server sync)
-- **Config**: PyYAML / json
-### Server:
-- **Backend**: FastAPI / Flask
-- **WebSocket**: python-websockets / FastAPI WebSockets
-- **Server**: Uvicorn / Gunicorn
-- **Database** (optional): SQLite / PostgreSQL
-- **CORS**: fastapi-cors
-### Web Interface:
-- **Frontend**: HTML5, CSS3, JavaScript (ES6+)
-- **Real-time**: WebSocket API
-- **Styling**: CSS Grid/Flexbox for layout
----
+See [config/default_config.yaml](config/default_config.yaml) for all available options.
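+
+The file is plain YAML, so scripted tweaks are straightforward. A minimal sketch (key names come from the table above; the nested layout is assumed from the dotted setting names):
+
+```python
+from pathlib import Path
+
+import yaml  # pip install pyyaml
+
+config_path = Path.home() / ".local-transcription" / "config.yaml"
+config = yaml.safe_load(config_path.read_text())
+
+# Switch to a larger model and keep timestamps on.
+config["transcription"]["model"] = "small.en"
+config["display"]["show_timestamps"] = True
+
+# Note: a plain safe_dump rewrite drops any comments in the file.
+config_path.write_text(yaml.safe_dump(config))
+```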
 ## Project Structure

 ```
 local-transcription/
-client/                       # Local transcription client
-  __init__.py
-  audio_capture.py            # Audio input handling
-  transcription_engine.py     # Whisper integration
-  network_client.py           # Server communication
-  config.py                   # Configuration management
-  main.py                     # Client entry point
-server/                       # Centralized web server
-  __init__.py
-  api.py                      # FastAPI routes
-  websocket_handler.py        # WebSocket management
-  models.py                   # Data models
-  database.py                 # Optional DB layer
-  main.py                     # Server entry point
-web/                          # Web stream interface
-  index.html                  # OBS browser source page
-  styles.css                  # Customizable styling
-  app.js                      # WebSocket client & UI logic
-config/
-  client_config.example.yaml
-  server_config.example.yaml
-tests/
-  test_audio.py
-  test_transcription.py
-  test_server.py
-requirements.txt              # Python dependencies
-README.md
-main.py                       # Combined launcher (optional)
+├── client/                              # Core transcription modules
+│   ├── audio_capture.py                 # Audio input handling
+│   ├── transcription_engine_realtime.py # RealtimeSTT integration
+│   ├── noise_suppression.py             # VAD and noise reduction
+│   ├── device_utils.py                  # CPU/GPU detection
+│   ├── config.py                        # Configuration management
+│   ├── server_sync.py                   # Multi-user server client
+│   └── update_checker.py                # Auto-update functionality
+├── gui/                                 # Desktop application UI
+│   ├── main_window_qt.py                # Main application window
+│   ├── settings_dialog_qt.py            # Settings dialog
+│   └── transcription_display_qt.py      # Display widget
+├── server/                              # Web servers
+│   ├── web_display.py                   # Local FastAPI server for OBS
+│   └── nodejs/                          # Multi-user sync server
+│       ├── server.js                    # Express + WebSocket server
+│       └── README.md                    # Deployment instructions
+├── config/
+│   └── default_config.yaml              # Default settings template
+├── main.py                              # GUI entry point
+├── main_cli.py                          # CLI version (for testing)
+├── build.sh                             # Linux build script
+├── build.bat                            # Windows build script
+└── local-transcription.spec             # PyInstaller configuration
 ```

----
+## Technology Stack

-## Installation (Planned)
+### Desktop Application

 - **Python 3.9+**
+- **PySide6** - Qt6 GUI framework
+- **RealtimeSTT** - Real-time speech-to-text with advanced VAD
+- **faster-whisper** - Optimized Whisper model inference
+- **PyTorch** - ML framework (CUDA-enabled)
+- **sounddevice** - Cross-platform audio capture
+- **webrtcvad + silero_vad** - Voice activity detection
+- **noisereduce** - Noise suppression

-### Prerequisites:
-- Python 3.9 or higher
-- CUDA-capable GPU (optional, for GPU acceleration)
-- FFmpeg (required by Whisper)
+### Web Servers
+- **FastAPI + Uvicorn** - Local web display server
+- **Node.js + Express + WebSocket** - Multi-user sync server

-### Steps:
+### Build Tools
+- **PyInstaller** - Executable packaging
+- **uv** - Fast Python package manager

-1. **Clone the repository**
-   ```bash
-   git clone <repository-url>
-   cd local-transcription
-   ```
+## System Requirements

-2. **Install dependencies**
-   ```bash
-   pip install -r requirements.txt
-   ```
+### Minimum
+- Python 3.9+
+- 4GB RAM
+- Any modern CPU

-3. **Download Whisper models**
-   ```bash
-   # Models will be auto-downloaded on first run
-   # Or manually download:
-   python -c "import whisper; whisper.load_model('base')"
-   ```
+### Recommended (for real-time performance)
+- 8GB+ RAM
+- NVIDIA GPU with CUDA support (for GPU acceleration)
+- FFmpeg (installed automatically with dependencies)

-4. **Configure client**
-   ```bash
-   cp config/client_config.example.yaml config/client_config.yaml
-   # Edit config/client_config.yaml with your settings
-   ```
+### For Building
+- **Linux**: gcc, Python dev headers
+- **Windows**: Visual Studio Build Tools, Python dev headers

-5. **Run the server** (one instance)
-   ```bash
-   python server/main.py
-   ```
+## Troubleshooting

-6. **Run the client** (on each user's machine)
-   ```bash
-   python client/main.py
-   ```
+### Model Loading Issues
+- Models download automatically on first use to `~/.cache/huggingface/`
+- First run requires an internet connection
+- Check disk space (models range from 75MB to 3GB)
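+
+To warm the cache before going live, you can trigger the same download faster-whisper performs on first use. A minimal sketch (model name and device are illustrative):
+
+```python
+from faster_whisper import WhisperModel
+
+# Downloads the model to ~/.cache/huggingface/ on the first call,
+# then loads it from the local cache on subsequent runs.
+model = WhisperModel("base.en", device="cpu", compute_type="int8")
+print("Model cached and ready.")
+```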
-7. **Add to OBS**
-   - Add a Browser Source
-   - URL: `http://<server-ip>:8000/stream`
-   - Set width/height as needed
-   - Check "Shutdown source when not visible" for performance
----

-## Configuration (Planned)
+### Audio Device Issues
+```bash
+# List available audio devices
+uv run python main_cli.py --list-devices
+```

-### Client Configuration:
-```yaml
-user:
-  name: "Streamer1"           # Display name for transcriptions
-  id: "unique-user-id"        # Optional unique identifier
-
-audio:
-  input_device: "default"     # or specific device index
-  sample_rate: 16000
-  chunk_duration: 2.0         # seconds
-
-noise_suppression:
-  enabled: true               # Enable/disable noise reduction
-  strength: 0.7               # 0.0 to 1.0 - reduction strength
-  method: "noisereduce"       # "noisereduce" or "rnnoise"
-
-transcription:
-  model: "base"               # tiny, base, small, medium, large
-  device: "cuda"              # cpu, cuda, mps
-  language: "en"              # or "auto" for detection
-  task: "transcribe"          # or "translate"
-
-processing:
-  use_vad: true               # Voice Activity Detection
-  min_confidence: 0.5         # Minimum transcription confidence
-
-server_sync:
-  enabled: false              # Enable multi-user server sync
-  url: "ws://localhost:8000"  # Server URL (when enabled)
-  api_key: ""                 # Optional API key
-
-display:
-  show_timestamps: true       # Show timestamps in local display
-  max_lines: 100              # Maximum lines to keep in display
-  font_size: 12               # GUI font size
-```
+- Ensure microphone permissions are granted
+- Try different device indices in settings

-### Server Configuration:
-```yaml
-server:
-  host: "0.0.0.0"
-  port: 8000
-  api_key_required: false
-
-stream:
-  max_clients: 10
-  buffer_size: 100            # messages to buffer
-  retention_time: 300         # seconds
-
-database:
-  enabled: false
-  path: "transcriptions.db"
-```
+### GPU Not Detected
+```bash
+# Check CUDA availability
+uv run python -c "import torch; print(torch.cuda.is_available())"
+```
----
+- Install NVIDIA drivers (the CUDA toolkit is bundled)
+- The app automatically falls back to CPU if no GPU is available

-## Roadmap
+### Web Server Port Conflicts
+- Default port is 8080
+- Change it in settings or edit the config file
+- Check for conflicts: `lsof -i :8080` (Linux) or `netstat -ano | findstr :8080` (Windows)
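+
+A quick scripted version of the same check (illustrative only; any free-port probe works):
+
+```python
+import socket
+
+def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
+    """Return True if something is already listening on host:port."""
+    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
+        return s.connect_ex((host, port)) == 0
+
+print(port_in_use(8080))
+```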
-- [x] Project planning and architecture design
-- [ ] Phase 1: Standalone desktop application with GUI
-- [ ] Phase 2: Web server and sync system (optional multi-user mode)
-- [ ] Phase 3: Client-server communication (optional)
-- [ ] Phase 4: Web stream interface for OBS (optional)
-- [ ] Phase 5: Advanced features (hotkeys, themes, Docker, etc.)
----
+## Use Cases

+- **Live Streaming Captions**: Add real-time captions to your Twitch/YouTube streams
+- **Multi-Language Translation**: Multiple translators transcribing in different languages
+- **Accessibility**: Provide captions for hearing-impaired viewers
+- **Podcast Recording**: Real-time transcription for multi-host shows
+- **Gaming Commentary**: Track who said what in multiplayer sessions

 ## Contributing

-Contributions are welcome! Please feel free to submit issues or pull requests.
----
+Contributions are welcome! Please feel free to submit issues or pull requests at the [repository](https://repo.anhonesthost.net/streamer-tools/local-transcription).

 ## License

-[Choose appropriate license - MIT, Apache 2.0, etc.]
----
+MIT License

 ## Acknowledgments

-- OpenAI Whisper for the excellent speech recognition model
-- The streaming community for inspiration and use cases
+- [OpenAI Whisper](https://github.com/openai/whisper) for the speech recognition model
+- [RealtimeSTT](https://github.com/KoljaB/RealtimeSTT) for real-time transcription capabilities
+- [faster-whisper](https://github.com/guillaumekln/faster-whisper) for optimized inference