# RealtimeSTT Installation Guide

## Phase 1 Migration Complete! ✅

The application has been fully migrated from the legacy time-based chunking system to **RealtimeSTT** with advanced VAD-based speech detection.

## What Changed

### Eliminated Components

- ❌ `client/audio_capture.py` - No longer needed (RealtimeSTT handles audio)
- ❌ `client/noise_suppression.py` - No longer needed (VAD handles silence detection)
- ❌ `client/transcription_engine.py` - Replaced with `transcription_engine_realtime.py`

### New Components

- ✅ `client/transcription_engine_realtime.py` - RealtimeSTT wrapper
- ✅ Enhanced settings dialog with VAD controls
- ✅ Dual-model support (realtime preview + final transcription) - see the sketch below
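
The new engine is built around RealtimeSTT's `AudioToTextRecorder`, which handles audio capture, VAD, and transcription in a single object. Below is a minimal sketch of that pattern (illustrative only, not the actual contents of `transcription_engine_realtime.py`; parameter names follow RealtimeSTT's documented constructor arguments):

```python
from RealtimeSTT import AudioToTextRecorder

def on_preview(text: str) -> None:
    # Partial text while speech is still in progress (from the faster preview model).
    print(f"[preview] {text}")

def on_final(text: str) -> None:
    # Final transcription of a completed speech segment (from the main model).
    print(f"[final]   {text}")

if __name__ == "__main__":
    recorder = AudioToTextRecorder(
        model="base.en",                            # final transcription model
        enable_realtime_transcription=True,         # optional live preview
        realtime_model_type="tiny.en",              # faster model used for previews
        on_realtime_transcription_update=on_preview,
    )
    while True:
        # Blocks until VAD detects a complete utterance, then invokes the callback.
        recorder.text(on_final)
```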

## Benefits

### Word Loss Elimination

- **Pre-recording buffer** (200ms) captures word starts
- **Post-speech silence detection** (300ms) prevents word cutoffs
- **Dual-layer VAD** (WebRTC + Silero) accurately detects speech boundaries
- **No arbitrary chunking** - transcribes natural speech segments

### Performance Improvements

- **ONNX-accelerated VAD** (2-3x faster, 30% less CPU)
- **Configurable beam size** for quality/speed tradeoff
- **Optional realtime preview** with faster model

### New Settings

- Silero VAD sensitivity (0.0-1.0)
- WebRTC VAD sensitivity (0-3)
- Post-speech silence duration
- Pre-recording buffer duration
- Minimum recording length
- Beam size (quality)
- Realtime preview toggle

## System Requirements

**Important:** FFmpeg is NOT required! RealtimeSTT captures audio through PortAudio (via PyAudio).

### For Development (Building from Source)

#### Linux (Ubuntu/Debian)
```bash
# Install PortAudio development headers (required for PyAudio)
sudo apt-get install portaudio19-dev python3-dev build-essential
```

#### Linux (Fedora/RHEL)
```bash
sudo dnf install portaudio-devel python3-devel gcc
```

#### macOS
```bash
brew install portaudio
```

#### Windows
PortAudio is bundled with PyAudio wheels - no additional installation needed.
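
Once the headers are installed, you can confirm that PyAudio can reach PortAudio with a quick check (a sketch; run it from the project environment after the Installation step below):

```python
import pyaudio

pa = pyaudio.PyAudio()
try:
    # If PortAudio is missing or broken, PyAudio fails to import or initialize instead.
    print(f"PortAudio OK - {pa.get_device_count()} audio device(s) visible")
finally:
    pa.terminate()
```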

### For End Users (Built Executables)

**Nothing required!** Built executables are fully standalone and bundle all dependencies, including PortAudio, PyTorch, ONNX Runtime, and Whisper models.

## Installation

```bash
# Install project dependencies (including RealtimeSTT)
uv sync

# Or with pip
pip install -r requirements.txt
```

## Configuration

All RealtimeSTT settings are in `~/.local-transcription/config.yaml`:

```yaml
transcription:
  # Model settings
  model: "base.en"          # tiny, base, small, medium, large-v3
  device: "auto"            # auto, cuda, cpu
  compute_type: "default"   # default, int8, float16, float32

  # Realtime preview (optional)
  enable_realtime_transcription: false
  realtime_model: "tiny.en"

  # VAD sensitivity
  silero_sensitivity: 0.4   # 0.0-1.0, higher = more sensitive
  silero_use_onnx: true     # 2-3x faster VAD
  webrtc_sensitivity: 3     # 0-3, lower = more sensitive

  # Timing
  post_speech_silence_duration: 0.3
  pre_recording_buffer_duration: 0.2
  min_length_of_recording: 0.5

  # Quality
  beam_size: 5              # 1-10, higher = better quality
```
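
These keys map almost one-to-one onto `AudioToTextRecorder` constructor arguments. A rough sketch of how the engine can apply them (illustrative only - the real wiring lives in `transcription_engine_realtime.py`; `realtime_model` is assumed to map to RealtimeSTT's `realtime_model_type` argument, and `device: "auto"` is assumed to be resolved to `cuda`/`cpu` by the app before being passed on):

```python
from pathlib import Path

import yaml
from RealtimeSTT import AudioToTextRecorder

# Load the transcription section of the user's config file.
config_path = Path.home() / ".local-transcription" / "config.yaml"
cfg = yaml.safe_load(config_path.read_text())["transcription"]

recorder = AudioToTextRecorder(
    model=cfg["model"],
    compute_type=cfg["compute_type"],
    enable_realtime_transcription=cfg["enable_realtime_transcription"],
    realtime_model_type=cfg["realtime_model"],
    silero_sensitivity=cfg["silero_sensitivity"],
    silero_use_onnx=cfg["silero_use_onnx"],
    webrtc_sensitivity=cfg["webrtc_sensitivity"],
    post_speech_silence_duration=cfg["post_speech_silence_duration"],
    pre_recording_buffer_duration=cfg["pre_recording_buffer_duration"],
    min_length_of_recording=cfg["min_length_of_recording"],
    beam_size=cfg["beam_size"],
)
```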

## GUI Settings

The settings dialog now includes:

1. **Transcription Settings**
   - Model selector (all Whisper models + .en variants)
   - Compute device and type
   - Beam size for quality control

2. **Realtime Preview** (Optional)
   - Toggle preview transcription
   - Select faster preview model

3. **VAD Settings**
   - Silero sensitivity slider (0.0-1.0)
   - WebRTC sensitivity (0-3)
   - ONNX acceleration toggle

4. **Advanced Timing**
   - Post-speech silence duration
   - Minimum recording length
   - Pre-recording buffer duration

## Testing

```bash
# Run CLI version for testing
uv run python main_cli.py

# Run GUI version
uv run python main.py

# Verify that RealtimeSTT is installed correctly
uv run python -c "from RealtimeSTT import AudioToTextRecorder; print('RealtimeSTT ready!')"
```
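
For a quick end-to-end smoke test outside the app, a few lines of Python are enough (a sketch; it uses the default microphone and downloads the model on first run):

```python
from RealtimeSTT import AudioToTextRecorder

if __name__ == "__main__":
    recorder = AudioToTextRecorder(model="tiny.en")
    print("Speak a sentence...")
    # text() without a callback blocks until one utterance is detected, then returns it.
    print("Transcribed:", recorder.text())
    recorder.shutdown()
```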

## Troubleshooting

### PyAudio build fails
**Error:** `portaudio.h: No such file or directory`

**Solution:**
```bash
# Linux
sudo apt-get install portaudio19-dev

# macOS
brew install portaudio

# Windows - should work automatically
```

### CUDA not detected
RealtimeSTT uses PyTorch's CUDA detection. Check with:
```bash
uv run python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}')"
```
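
For a bit more detail, a short diagnostic prints the device count and name. Note that a CPU-only PyTorch build always reports `False` here, even on a machine with a working GPU:

```python
import torch

if torch.cuda.is_available():
    print(f"CUDA devices: {torch.cuda.device_count()}")
    print(f"Using: {torch.cuda.get_device_name(0)}")
else:
    print("CUDA not available - transcription will run on the CPU")
```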

### Models not downloading
RealtimeSTT downloads models to:
- Linux/Mac: `~/.cache/huggingface/`
- Windows: `%USERPROFILE%\.cache\huggingface\`

Check disk space and internet connection.
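
If automatic downloads keep failing, the model can also be fetched ahead of time: RealtimeSTT transcribes through faster-whisper, so instantiating the model once is enough to populate the cache (a sketch; assumes the `faster-whisper` package is available as a dependency):

```python
from faster_whisper import WhisperModel

# Downloads "base.en" into the Hugging Face cache if it is not already there.
WhisperModel("base.en", device="cpu", compute_type="int8")
print("Model cached")
```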

### Microphone not working
List audio devices:
```bash
uv run python main_cli.py --list-devices
```

Then set the device index in settings.
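
The device indices can also be read straight from PortAudio, which is useful when debugging outside the app (a sketch using PyAudio, already a project dependency):

```python
import pyaudio

pa = pyaudio.PyAudio()
for i in range(pa.get_device_count()):
    info = pa.get_device_info_by_index(i)
    if info.get("maxInputChannels", 0) > 0:  # input-capable devices only
        print(f"{i}: {info['name']}")
pa.terminate()
```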

## Performance Tuning

### For lowest latency:
- Model: `tiny.en` or `base.en`
- Enable realtime preview
- Post-speech silence: `0.2s`
- Beam size: `1-2` (see the sketch below)
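
As a concrete example, the low-latency preset above expressed as recorder arguments (illustrative only; the same values can be set in `config.yaml` or the settings dialog):

```python
from RealtimeSTT import AudioToTextRecorder

# Low-latency preset from the list above.
recorder = AudioToTextRecorder(
    model="base.en",
    enable_realtime_transcription=True,
    realtime_model_type="tiny.en",
    post_speech_silence_duration=0.2,
    beam_size=1,
)
```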

### For best accuracy:
- Model: `small.en` or `medium.en`
- Disable realtime preview
- Post-speech silence: `0.4s`
- Beam size: `5-10`

### For best performance:
- Enable ONNX: `true`
- Silero sensitivity: `0.4-0.6` (less aggressive)
- Use GPU if available

## Build for Distribution

```bash
# CPU-only build
./build.sh   # Linux
build.bat    # Windows

# CUDA build (works on both GPU and CPU systems)
./build-cuda.sh   # Linux
build-cuda.bat    # Windows
```

Built executables will be in `dist/LocalTranscription/`.

## Next Steps (Phase 2)

Future migration to **WhisperLiveKit** will add:
- Speaker diarization
- Multi-language translation
- WebSocket-based architecture
- Latest SimulStreaming algorithm

See `2025-live-transcription-research.md` for details.

## Migration Notes

If you have an existing configuration file, it will be automatically migrated on first run. Old settings like `audio.chunk_duration` will be ignored in favor of VAD-based detection.

Your transcription quality should immediately improve with:
- ✅ No more cut-off words at chunk boundaries
- ✅ Natural speech segment detection
- ✅ Better handling of pauses and silence
- ✅ Faster response time with VAD