# RealtimeSTT Installation Guide

## Phase 1 Migration Complete! ✅

The application has been fully migrated from the legacy time-based chunking system to **RealtimeSTT** with advanced VAD-based speech detection.

## What Changed

### Eliminated Components

- ❌ `client/audio_capture.py` - No longer needed (RealtimeSTT handles audio)
- ❌ `client/noise_suppression.py` - No longer needed (VAD handles silence detection)
- ❌ `client/transcription_engine.py` - Replaced with `transcription_engine_realtime.py`

### New Components

- ✅ `client/transcription_engine_realtime.py` - RealtimeSTT wrapper
- ✅ Enhanced settings dialog with VAD controls
- ✅ Dual-model support (realtime preview + final transcription)

## Benefits

### Word Loss Elimination

- **Pre-recording buffer** (200ms) captures word starts
- **Post-speech silence detection** (300ms) prevents word cutoffs
- **Dual-layer VAD** (WebRTC + Silero) accurately detects speech boundaries
- **No arbitrary chunking** - transcribes natural speech segments

### Performance Improvements

- **ONNX-accelerated VAD** (2-3x faster, 30% less CPU)
- **Configurable beam size** for quality/speed tradeoff
- **Optional realtime preview** with a faster model

### New Settings

- Silero VAD sensitivity (0.0-1.0)
- WebRTC VAD sensitivity (0-3)
- Post-speech silence duration
- Pre-recording buffer duration
- Minimum recording length
- Beam size (quality)
- Realtime preview toggle

## System Requirements

**Important:** FFmpeg is NOT required! RealtimeSTT uses sounddevice/PortAudio for audio capture.

### For Development (Building from Source)

#### Linux (Ubuntu/Debian)

```bash
# Install PortAudio development headers (required for PyAudio)
sudo apt-get install portaudio19-dev python3-dev build-essential
```

#### Linux (Fedora/RHEL)

```bash
sudo dnf install portaudio-devel python3-devel gcc
```

#### macOS

```bash
brew install portaudio
```

#### Windows

PortAudio is bundled with PyAudio wheels - no additional installation needed.

### For End Users (Built Executables)

**Nothing required!** Built executables are fully standalone and bundle all dependencies, including PortAudio, PyTorch, ONNX Runtime, and Whisper models.

## Installation

```bash
# Install RealtimeSTT and all other dependencies
uv sync

# Or with pip
pip install -r requirements.txt
```

## Configuration

All RealtimeSTT settings are in `~/.local-transcription/config.yaml`:

```yaml
transcription:
  # Model settings
  model: "base.en"          # tiny, base, small, medium, large-v3
  device: "auto"            # auto, cuda, cpu
  compute_type: "default"   # default, int8, float16, float32

  # Realtime preview (optional)
  enable_realtime_transcription: false
  realtime_model: "tiny.en"

  # VAD sensitivity
  silero_sensitivity: 0.4   # Lower = more sensitive
  silero_use_onnx: true     # 2-3x faster VAD
  webrtc_sensitivity: 3     # 0-3, lower = more sensitive

  # Timing
  post_speech_silence_duration: 0.3
  pre_recording_buffer_duration: 0.2
  min_length_of_recording: 0.5

  # Quality
  beam_size: 5              # 1-10, higher = better quality
```

## GUI Settings

The settings dialog now includes:

1. **Transcription Settings**
   - Model selector (all Whisper models + .en variants)
   - Compute device and type
   - Beam size for quality control

2. **Realtime Preview** (Optional)
   - Toggle preview transcription
   - Select faster preview model

3. **VAD Settings**
   - Silero sensitivity slider (0.0-1.0)
   - WebRTC sensitivity (0-3)
   - ONNX acceleration toggle

4. **Advanced Timing**
   - Post-speech silence duration
   - Minimum recording length
   - Pre-recording buffer duration
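Under the hood, these settings map onto the constructor of RealtimeSTT's `AudioToTextRecorder`, which `client/transcription_engine_realtime.py` wraps. The snippet below is a minimal sketch of that wiring using the default values from the configuration above; the parameter names follow RealtimeSTT's documented API, but verify them against the version pinned in this project before relying on them.

```python
from RealtimeSTT import AudioToTextRecorder

# Minimal sketch: the config values shown above, passed straight to RealtimeSTT.
# Parameter names follow RealtimeSTT's documented constructor; check them
# against the installed version.
recorder = AudioToTextRecorder(
    model="base.en",                      # final transcription model
    compute_type="default",
    enable_realtime_transcription=False,  # set True to use the preview model
    realtime_model_type="tiny.en",        # faster model for the live preview
    silero_sensitivity=0.4,
    silero_use_onnx=True,                 # ONNX-accelerated Silero VAD
    webrtc_sensitivity=3,
    post_speech_silence_duration=0.3,     # seconds of silence that end a segment
    pre_recording_buffer_duration=0.2,    # audio kept from before speech starts
    min_length_of_recording=0.5,          # minimum recording length (seconds)
    beam_size=5,
)

# text() blocks until the VAD detects a complete speech segment,
# then returns its transcription.
while True:
    print(recorder.text())
```

In the application, the equivalent loop lives inside the RealtimeSTT wrapper; the sketch is only meant to show how the YAML keys correspond to recorder arguments.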
## Testing

```bash
# Run CLI version for testing
uv run python main_cli.py

# Run GUI version
uv run python main.py

# Verify RealtimeSTT is importable
uv run python -c "from RealtimeSTT import AudioToTextRecorder; print('RealtimeSTT ready!')"
```

## Troubleshooting

### PyAudio build fails

**Error:** `portaudio.h: No such file or directory`

**Solution:**

```bash
# Linux
sudo apt-get install portaudio19-dev

# macOS
brew install portaudio

# Windows - should work automatically
```

### CUDA not detected

RealtimeSTT uses PyTorch's CUDA detection. Check with:

```bash
uv run python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}')"
```

### Models not downloading

RealtimeSTT downloads models to:

- Linux/Mac: `~/.cache/huggingface/`
- Windows: `%USERPROFILE%\.cache\huggingface\`

Check disk space and internet connection.

### Microphone not working

List audio devices:

```bash
uv run python main_cli.py --list-devices
```

Then set the device index in settings.

## Performance Tuning

### For lowest latency

- Model: `tiny.en` or `base.en`
- Enable realtime preview
- Post-speech silence: `0.2s`
- Beam size: `1-2`

### For best accuracy

- Model: `small.en` or `medium.en`
- Disable realtime preview
- Post-speech silence: `0.4s`
- Beam size: `5-10`

### For lowest CPU usage

- Enable ONNX: `true`
- Silero sensitivity: `0.4-0.6` (less aggressive)
- Use GPU if available

## Build for Distribution

```bash
# CPU-only build
./build.sh        # Linux
build.bat         # Windows

# CUDA build (works on both GPU and CPU systems)
./build-cuda.sh   # Linux
build-cuda.bat    # Windows
```

Built executables will be in `dist/LocalTranscription/`.

## Next Steps (Phase 2)

A future migration to **WhisperLiveKit** will add:

- Speaker diarization
- Multi-language translation
- WebSocket-based architecture
- The latest SimulStreaming algorithm

See `2025-live-transcription-research.md` for details.

## Migration Notes

If you have an existing configuration file, it will be automatically migrated on first run. Old settings such as `audio.chunk_duration` will be ignored in favor of VAD-based detection.

Your transcription quality should immediately improve with:

- ✅ No more cut-off words at chunk boundaries
- ✅ Natural speech segment detection
- ✅ Better handling of pauses and silence
- ✅ Faster response time with VAD
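For reference, the automatic migration amounts to dropping the superseded keys when the config is loaded. The sketch below is purely illustrative and is not the application's actual migration code; only `audio.chunk_duration` and the config path come from this guide, everything else is a hypothetical example.

```python
from pathlib import Path

import yaml

# Illustrative sketch only - not the application's migration code.
# Legacy time-based chunking keys that VAD-based detection supersedes.
LEGACY_AUDIO_KEYS = {"chunk_duration"}

config_path = Path.home() / ".local-transcription" / "config.yaml"
config = yaml.safe_load(config_path.read_text()) or {}

audio_cfg = config.get("audio", {})
removed = [key for key in LEGACY_AUDIO_KEYS if audio_cfg.pop(key, None) is not None]

if removed:
    config_path.write_text(yaml.safe_dump(config, sort_keys=False))
    print(f"Removed legacy settings: {', '.join(removed)}")
```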