# RealtimeSTT Installation Guide

## Phase 1 Migration Complete! ✅

The application has been fully migrated from the legacy time-based chunking system to **RealtimeSTT** with advanced VAD-based speech detection.

## What Changed
### Eliminated Components

- ❌ `client/audio_capture.py` - no longer needed (RealtimeSTT handles audio capture)
- ❌ `client/noise_suppression.py` - no longer needed (VAD handles silence detection)
- ❌ `client/transcription_engine.py` - replaced by `transcription_engine_realtime.py`

### New Components

- ✅ `client/transcription_engine_realtime.py` - RealtimeSTT wrapper
- ✅ Enhanced settings dialog with VAD controls
- ✅ Dual-model support (realtime preview + final transcription)
## Benefits

### Word Loss Elimination

- **Pre-recording buffer** (200 ms) captures the start of each word
- **Post-speech silence detection** (300 ms) prevents word cutoffs
- **Dual-layer VAD** (WebRTC + Silero) accurately detects speech boundaries
- **No arbitrary chunking** - transcribes natural speech segments
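The pre-recording buffer is essentially a ring buffer: recent audio is always retained, so when the VAD fires, the samples just *before* the trigger can be prepended to the recording. A minimal stdlib sketch of the idea (class and method names are illustrative, not RealtimeSTT's internals):

```python
from collections import deque

SAMPLE_RATE = 16000      # samples per second
BUFFER_SECONDS = 0.2     # 200 ms pre-recording buffer

class PreRecordingBuffer:
    """Keeps the most recent audio so word onsets before the VAD trigger are not lost."""

    def __init__(self, sample_rate=SAMPLE_RATE, seconds=BUFFER_SECONDS):
        # A deque with a fixed maxlen silently discards the oldest samples.
        self._buffer = deque(maxlen=int(sample_rate * seconds))

    def feed(self, samples):
        # Continuously called with incoming audio, even while "not recording".
        self._buffer.extend(samples)

    def flush(self):
        # Called when the VAD detects speech: prepend this to the recording.
        samples = list(self._buffer)
        self._buffer.clear()
        return samples

buf = PreRecordingBuffer()
buf.feed(range(5000))    # 5000 dummy samples; only the newest 3200 are kept
head = buf.flush()
print(len(head))         # 3200 samples = 0.2 s at 16 kHz
```

This is why word starts survive: the first 200 ms of any utterance is already sitting in the buffer before detection completes.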
### Performance Improvements

- **ONNX-accelerated VAD** (2-3x faster, ~30% less CPU)
- **Configurable beam size** for the quality/speed tradeoff
- **Optional realtime preview** with a faster model
### New Settings

- Silero VAD sensitivity (0.0-1.0)
- WebRTC VAD sensitivity (0-3)
- Post-speech silence duration
- Pre-recording buffer duration
- Minimum recording length
- Beam size (quality)
- Realtime preview toggle
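The valid ranges above can be enforced before handing values to the engine. A small sketch of such a check (the function name is ours, not part of the app's API):

```python
def validate_vad_settings(silero_sensitivity, webrtc_sensitivity,
                          post_speech_silence, pre_recording_buffer):
    """Range-check the VAD settings listed above; returns a list of problems."""
    problems = []
    if not 0.0 <= silero_sensitivity <= 1.0:
        problems.append("silero_sensitivity must be in 0.0-1.0")
    if webrtc_sensitivity not in (0, 1, 2, 3):
        problems.append("webrtc_sensitivity must be an integer 0-3")
    if post_speech_silence <= 0:
        problems.append("post_speech_silence_duration must be positive")
    if pre_recording_buffer < 0:
        problems.append("pre_recording_buffer_duration cannot be negative")
    return problems

# The defaults from this guide pass cleanly:
print(validate_vad_settings(0.4, 3, 0.3, 0.2))   # []
print(validate_vad_settings(1.5, 5, 0.3, 0.2))   # two problems reported
```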
## System Requirements

**Important:** FFmpeg is NOT required! RealtimeSTT uses sounddevice/PortAudio for audio capture.

### For Development (Building from Source)

#### Linux (Ubuntu/Debian)

```bash
# Install PortAudio development headers (required to build PyAudio)
sudo apt-get install portaudio19-dev python3-dev build-essential
```

#### Linux (Fedora/RHEL)

```bash
sudo dnf install portaudio-devel python3-devel gcc
```

#### macOS

```bash
brew install portaudio
```

#### Windows

PortAudio is bundled with the PyAudio wheels - no additional installation is needed.

### For End Users (Built Executables)

**Nothing required!** Built executables are fully standalone and bundle all dependencies, including PortAudio, PyTorch, ONNX Runtime, and Whisper models.
## Installation

```bash
# Install dependencies (installs RealtimeSTT and everything it needs)
uv sync

# Or with pip
pip install -r requirements.txt
```
## Configuration

All RealtimeSTT settings live in `~/.local-transcription/config.yaml`:

```yaml
transcription:
  # Model settings
  model: "base.en"            # tiny, base, small, medium, large-v3
  device: "auto"              # auto, cuda, cpu
  compute_type: "default"     # default, int8, float16, float32

  # Realtime preview (optional)
  enable_realtime_transcription: false
  realtime_model: "tiny.en"

  # VAD sensitivity
  silero_sensitivity: 0.4     # lower = more sensitive
  silero_use_onnx: true       # 2-3x faster VAD
  webrtc_sensitivity: 3       # 0-3, lower = more sensitive

  # Timing
  post_speech_silence_duration: 0.3
  pre_recording_buffer_duration: 0.2
  min_length_of_recording: 0.5

  # Quality
  beam_size: 5                # 1-10, higher = better quality
```
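Most of the `transcription` section maps one-to-one onto recorder constructor arguments. A hedged sketch of how the app might translate the config dict into keyword arguments, while silently dropping deprecated keys; the target kwarg names here (notably `realtime_model_type`) are assumptions for illustration, not a documented RealtimeSTT API:

```python
# Hypothetical mapping from this app's config keys to recorder kwargs.
CONFIG_TO_KWARG = {
    "model": "model",
    "device": "device",
    "compute_type": "compute_type",
    "enable_realtime_transcription": "enable_realtime_transcription",
    "realtime_model": "realtime_model_type",   # assumed rename
    "silero_sensitivity": "silero_sensitivity",
    "silero_use_onnx": "silero_use_onnx",
    "webrtc_sensitivity": "webrtc_sensitivity",
    "post_speech_silence_duration": "post_speech_silence_duration",
    "pre_recording_buffer_duration": "pre_recording_buffer_duration",
    "min_length_of_recording": "min_length_of_recording",
    "beam_size": "beam_size",
}

def recorder_kwargs(transcription_cfg):
    """Translate the YAML 'transcription' section into recorder keyword
    arguments, skipping unknown (e.g. deprecated) keys."""
    return {CONFIG_TO_KWARG[k]: v
            for k, v in transcription_cfg.items()
            if k in CONFIG_TO_KWARG}

cfg = {"model": "base.en", "silero_sensitivity": 0.4,
       "realtime_model": "tiny.en", "chunk_duration": 5}  # deprecated key
kwargs = recorder_kwargs(cfg)
print(kwargs["realtime_model_type"])  # tiny.en
print("chunk_duration" in kwargs)     # False - deprecated keys are dropped
```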
## GUI Settings

The settings dialog now includes:

1. **Transcription Settings**
   - Model selector (all Whisper models + `.en` variants)
   - Compute device and type
   - Beam size for quality control

2. **Realtime Preview** (optional)
   - Toggle preview transcription
   - Select a faster preview model

3. **VAD Settings**
   - Silero sensitivity slider (0.0-1.0)
   - WebRTC sensitivity (0-3)
   - ONNX acceleration toggle

4. **Advanced Timing**
   - Post-speech silence duration
   - Minimum recording length
   - Pre-recording buffer duration
## Testing

```bash
# Run the CLI version for testing
uv run python main_cli.py

# Run the GUI version
uv run python main.py

# Verify that RealtimeSTT imports correctly
uv run python -c "from RealtimeSTT import AudioToTextRecorder; print('RealtimeSTT ready!')"
```
## Troubleshooting

### PyAudio build fails

**Error:** `portaudio.h: No such file or directory`

**Solution:**

```bash
# Linux (Debian/Ubuntu)
sudo apt-get install portaudio19-dev

# macOS
brew install portaudio
```

On Windows no build step is needed - PortAudio ships inside the PyAudio wheels.

### CUDA not detected

RealtimeSTT relies on PyTorch's CUDA detection. Check with:

```bash
uv run python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}')"
```

### Models not downloading

RealtimeSTT downloads models to:

- Linux/macOS: `~/.cache/huggingface/`
- Windows: `%USERPROFILE%\.cache\huggingface\`

Check disk space and your internet connection.

### Microphone not working

List audio devices:

```bash
uv run python main_cli.py --list-devices
```

Then set the device index in settings.
## Performance Tuning

### For lowest latency

- Model: `tiny.en` or `base.en`
- Enable realtime preview
- Post-speech silence: `0.2s`
- Beam size: `1-2`

### For best accuracy

- Model: `small.en` or `medium.en`
- Disable realtime preview
- Post-speech silence: `0.4s`
- Beam size: `5-10`

### For lowest resource usage

- Enable ONNX: `true`
- Silero sensitivity: `0.4-0.6` (less aggressive)
- Use a GPU if available
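These profiles are just bundles of the config keys documented above, so they can be expressed as presets overlaid on an existing config. A sketch (the preset names and the `apply_preset` helper are illustrative, not a built-in feature):

```python
# Illustrative presets mirroring the tuning profiles above.
PRESETS = {
    "low_latency": {
        "model": "base.en",
        "enable_realtime_transcription": True,
        "post_speech_silence_duration": 0.2,
        "beam_size": 1,
    },
    "high_accuracy": {
        "model": "small.en",
        "enable_realtime_transcription": False,
        "post_speech_silence_duration": 0.4,
        "beam_size": 5,
    },
}

def apply_preset(config, name):
    """Overlay a named preset onto an existing transcription config dict."""
    merged = dict(config)          # copy so the original stays untouched
    merged.update(PRESETS[name])
    return merged

base = {"model": "base.en", "beam_size": 5, "silero_use_onnx": True}
fast = apply_preset(base, "low_latency")
print(fast["beam_size"])        # 1
print(fast["silero_use_onnx"])  # True - keys outside the preset survive
```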
## Build for Distribution

```bash
# CPU-only build
./build.sh    # Linux
build.bat     # Windows

# CUDA build (works on both GPU and CPU systems)
./build-cuda.sh    # Linux
build-cuda.bat     # Windows
```

Built executables are placed in `dist/LocalTranscription/`.
## Next Steps (Phase 2)

A future migration to **WhisperLiveKit** will add:

- Speaker diarization
- Multi-language translation
- WebSocket-based architecture
- The latest SimulStreaming algorithm

See `2025-live-transcription-research.md` for details.
## Migration Notes

If you have an existing configuration file, it will be migrated automatically on first run. Deprecated settings such as `audio.chunk_duration` are ignored in favor of VAD-based detection.

Your transcription quality should immediately improve with:

- ✅ No more cut-off words at chunk boundaries
- ✅ Natural speech segment detection
- ✅ Better handling of pauses and silence
- ✅ Faster response time with VAD
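The automatic migration amounts to dropping deprecated keys while preserving everything else. A minimal stdlib sketch of that step (only `audio.chunk_duration` is known to be deprecated; the helper name and any other keys shown are illustrative):

```python
# (section, key) pairs removed by migration: VAD replaces time-based chunking.
DEPRECATED = {("audio", "chunk_duration")}

def migrate_config(old):
    """Return a copy of a nested config dict with deprecated keys removed."""
    new = {}
    for section, values in old.items():
        kept = {k: v for k, v in values.items()
                if (section, k) not in DEPRECATED}
        if kept:
            new[section] = kept
    return new

old = {"audio": {"chunk_duration": 5, "device_index": 0},
       "transcription": {"model": "base.en"}}
migrated = migrate_config(old)
print("chunk_duration" in migrated["audio"])  # False
print(migrated["audio"]["device_index"])      # 0 - other settings survive
```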