# Local Transcription

A real-time speech-to-text desktop application for streamers. Run locally on your machine with GPU or CPU, display transcriptions via OBS browser source, and optionally sync with other users through a multi-user server.

**Version 1.4.0**
## Features

- **Real-Time Transcription**: Live speech-to-text using Whisper models with minimal latency
- **Standalone Desktop App**: PySide6/Qt GUI that works without any server
- **CPU & GPU Support**: Automatic detection of CUDA (NVIDIA), MPS (Apple Silicon), or CPU fallback
- **Advanced Voice Detection**: Dual-layer VAD (WebRTC + Silero) for accurate speech detection
- **OBS Integration**: Built-in web server for browser source capture at `http://localhost:8080`
- **Multi-User Sync**: Optional Node.js server to sync transcriptions across multiple users
- **Custom Fonts**: Support for system fonts, web-safe fonts, Google Fonts, and custom font files
- **Customizable Colors**: User-configurable colors for name, text, and background
- **Noise Suppression**: Built-in audio preprocessing to reduce background noise
- **Auto-Updates**: Automatic update checking with release notes display
- **Cross-Platform**: Builds available for Windows and Linux
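The CUDA → MPS → CPU fallback order described above can be sketched as a small selection function. This is an illustrative sketch, not the app's actual `device_utils.py` code; the `pick_device` helper and its boolean arguments are hypothetical.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Pick the best available device in the documented fallback order:
    CUDA (NVIDIA) first, then MPS (Apple Silicon), then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"

# In practice the availability flags would come from PyTorch, e.g.:
#   import torch
#   device = pick_device(torch.cuda.is_available(),
#                        torch.backends.mps.is_available())
print(pick_device(False, True))  # → mps
```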
## Quick Start

### Running from Source

```bash
# Install dependencies
uv sync

# Run the application
uv run python main.py
```
### Using Pre-Built Executables

Download the latest release from the [releases page](https://repo.anhonesthost.net/streamer-tools/local-transcription/releases) and run the executable for your platform.

### Building from Source

**Linux:**

```bash
./build.sh
# Output: dist/LocalTranscription/LocalTranscription
```

**Windows:**

```cmd
build.bat
REM Output: dist\LocalTranscription\LocalTranscription.exe
```

For detailed build instructions, see [BUILD.md](BUILD.md).
## Usage

### Standalone Mode

1. Launch the application
2. Select your microphone from the audio device dropdown
3. Choose a Whisper model (smaller = faster, larger = more accurate):
   - `tiny.en` / `tiny` - Fastest, good for quick captions
   - `base.en` / `base` - Balanced speed and accuracy
   - `small.en` / `small` - Better accuracy
   - `medium.en` / `medium` - High accuracy
   - `large-v3` - Best accuracy (requires more resources)
4. Click **Start** to begin transcription
5. Transcriptions appear in the main window and at `http://localhost:8080`
### OBS Browser Source Setup

1. Start the Local Transcription app
2. In OBS, add a **Browser** source
3. Set the URL to `http://localhost:8080`
4. Set dimensions (e.g., 1920x300)
5. Check "Shutdown source when not visible" for performance
### Multi-User Mode (Optional)

For syncing transcriptions across multiple users (e.g., multi-host streams or translation teams):

1. Deploy the Node.js server (see [server/nodejs/README.md](server/nodejs/README.md))
2. In the app settings, enable **Server Sync**
3. Enter the server URL (e.g., `http://your-server:3000/api/send`)
4. Set a room name and passphrase (shared with other users)
5. In OBS, use the server's display URL with your room name:

   ```
   http://your-server:3000/display?room=YOURROOM&timestamps=true&maxlines=50
   ```
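If you script your OBS setup, the display URL above can be assembled with the standard library. A minimal sketch; the query parameter names (`room`, `timestamps`, `maxlines`) are taken from the example URL, so check the server README for the authoritative list.

```python
from urllib.parse import urlencode

def display_url(base: str, room: str,
                timestamps: bool = True, maxlines: int = 50) -> str:
    """Build the multi-user server's OBS display URL for a room."""
    query = urlencode({
        "room": room,
        "timestamps": str(timestamps).lower(),  # server expects true/false
        "maxlines": maxlines,
    })
    return f"{base}/display?{query}"

print(display_url("http://your-server:3000", "YOURROOM"))
# → http://your-server:3000/display?room=YOURROOM&timestamps=true&maxlines=50
```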
## Configuration

Settings are stored at `~/.local-transcription/config.yaml` and can be modified through the GUI settings panel.

### Key Settings

| Setting | Description | Default |
|---------|-------------|---------|
| `transcription.model` | Whisper model to use | `base.en` |
| `transcription.device` | Processing device (auto/cuda/cpu) | `auto` |
| `transcription.enable_realtime_transcription` | Show preview while speaking | `false` |
| `transcription.silero_sensitivity` | VAD sensitivity (0-1, lower = more sensitive) | `0.4` |
| `transcription.post_speech_silence_duration` | Silence before finalizing (seconds) | `0.3` |
| `transcription.continuous_mode` | Fast speaker mode for quick talkers | `false` |
| `display.show_timestamps` | Show timestamps with transcriptions | `true` |
| `display.fade_after_seconds` | Fade out time (0 = never) | `10` |
| `display.font_source` | Font type (System Font/Web-Safe/Google Font/Custom File) | `System Font` |
| `web_server.port` | Local web server port | `8080` |

See [config/default_config.yaml](config/default_config.yaml) for all available options.
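Put together, a `config.yaml` holding the defaults from the table might look like the excerpt below. This assumes the dotted keys map to nested YAML sections in the conventional way; consult `config/default_config.yaml` for the actual layout.

```yaml
# ~/.local-transcription/config.yaml (illustrative excerpt)
transcription:
  model: base.en
  device: auto                        # auto / cuda / cpu
  enable_realtime_transcription: false
  silero_sensitivity: 0.4
  post_speech_silence_duration: 0.3   # seconds
  continuous_mode: false

display:
  show_timestamps: true
  fade_after_seconds: 10              # 0 = never fade

web_server:
  port: 8080
```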
## Project Structure

```
local-transcription/
├── client/                              # Core transcription modules
│   ├── audio_capture.py                 # Audio input handling
│   ├── transcription_engine_realtime.py # RealtimeSTT integration
│   ├── noise_suppression.py             # VAD and noise reduction
│   ├── device_utils.py                  # CPU/GPU detection
│   ├── config.py                        # Configuration management
│   ├── server_sync.py                   # Multi-user server client
│   └── update_checker.py                # Auto-update functionality
├── gui/                                 # Desktop application UI
│   ├── main_window_qt.py                # Main application window
│   ├── settings_dialog_qt.py            # Settings dialog
│   └── transcription_display_qt.py      # Display widget
├── server/                              # Web servers
│   ├── web_display.py                   # Local FastAPI server for OBS
│   └── nodejs/                          # Multi-user sync server
│       ├── server.js                    # Express + WebSocket server
│       └── README.md                    # Deployment instructions
├── config/
│   └── default_config.yaml              # Default settings template
├── main.py                              # GUI entry point
├── main_cli.py                          # CLI version (for testing)
├── build.sh                             # Linux build script
├── build.bat                            # Windows build script
└── local-transcription.spec             # PyInstaller configuration
```
## Technology Stack

### Desktop Application

- **Python 3.9+**
- **PySide6** - Qt6 GUI framework
- **RealtimeSTT** - Real-time speech-to-text with advanced VAD
- **faster-whisper** - Optimized Whisper model inference
- **PyTorch** - ML framework (CUDA-enabled)
- **sounddevice** - Cross-platform audio capture
- **webrtcvad + silero_vad** - Voice activity detection
- **noisereduce** - Noise suppression

### Web Servers

- **FastAPI + Uvicorn** - Local web display server
- **Node.js + Express + WebSocket** - Multi-user sync server

### Build Tools

- **PyInstaller** - Executable packaging
- **uv** - Fast Python package manager
## System Requirements

### Minimum

- Python 3.9+
- 4GB RAM
- Any modern CPU

### Recommended (for real-time performance)

- 8GB+ RAM
- NVIDIA GPU with CUDA support (for GPU acceleration)
- FFmpeg (installed automatically with dependencies)

### For Building

- **Linux**: gcc, Python dev headers
- **Windows**: Visual Studio Build Tools, Python dev headers
## Troubleshooting

### Model Loading Issues

- Models download automatically on first use to `~/.cache/huggingface/`
- First run requires an internet connection
- Check disk space (models range from 75MB to 3GB)

### Audio Device Issues

```bash
# List available audio devices
uv run python main_cli.py --list-devices
```

- Ensure microphone permissions are granted
- Try different device indices in settings

### GPU Not Detected

```bash
# Check CUDA availability
uv run python -c "import torch; print(torch.cuda.is_available())"
```

- Install NVIDIA drivers (the CUDA toolkit is bundled)
- The app automatically falls back to CPU if no GPU is available

### Web Server Port Conflicts

- Default port is 8080
- Change it in settings or edit the config file
- Check for conflicts: `lsof -i :8080` (Linux) or `netstat -ano | findstr :8080` (Windows)
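A quick cross-platform alternative to `lsof`/`netstat` is to test whether the port can still be bound, using only the Python standard library. This is a sketch for diagnosis; the app itself may handle conflicts differently.

```python
import socket

def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if we can bind the port, i.e. nothing else is listening."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            sock.bind((host, port))
            return True
        except OSError:
            return False

if not port_is_free(8080):
    print("Port 8080 is in use; pick another port in settings.")
```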
## Use Cases

- **Live Streaming Captions**: Add real-time captions to your Twitch/YouTube streams
- **Multi-Language Translation**: Multiple translators transcribing in different languages
- **Accessibility**: Provide captions for hearing-impaired viewers
- **Podcast Recording**: Real-time transcription for multi-host shows
- **Gaming Commentary**: Track who said what in multiplayer sessions

## Contributing

Contributions are welcome! Please feel free to submit issues or pull requests at the [repository](https://repo.anhonesthost.net/streamer-tools/local-transcription).

## License

MIT License

## Acknowledgments

- [OpenAI Whisper](https://github.com/openai/whisper) for the speech recognition model
- [RealtimeSTT](https://github.com/KoljaB/RealtimeSTT) for real-time transcription capabilities
- [faster-whisper](https://github.com/guillaumekln/faster-whisper) for optimized inference