Files
local-transcription/INSTALL.md
jknapp 5f3c058be6 Migrate to RealtimeSTT for advanced VAD-based transcription
Major refactor to eliminate word loss issues using RealtimeSTT with
dual-layer VAD (WebRTC + Silero) instead of time-based chunking.

## Core Changes

### New Transcription Engine
- Add client/transcription_engine_realtime.py with RealtimeSTT wrapper
- Implements initialize() and start_recording() separation for proper lifecycle
- Dual-layer VAD with pre/post buffers prevents word cutoffs
- Optional realtime preview with faster model + final transcription

### Removed Legacy Components
- Remove client/audio_capture.py (RealtimeSTT handles audio)
- Remove client/noise_suppression.py (VAD handles silence detection)
- Remove client/transcription_engine.py (replaced by realtime version)
- Remove chunk_duration setting (no longer using time-based chunking)

### Dependencies
- Add RealtimeSTT>=0.3.0 to pyproject.toml
- Remove noisereduce, webrtcvad, faster-whisper (now dependencies of RealtimeSTT)
- Update PyInstaller spec with ONNX Runtime, halo, colorama

### GUI Improvements
- Refactor main_window_qt.py to use RealtimeSTT with proper start/stop
- Fix recording state management (initialize on startup, record on button click)
- Expand settings dialog (700x1200) with improved spacing (10-15px between groups)
- Add comprehensive tooltips to all settings explaining functionality
- Remove chunk duration field from settings

### Configuration
- Update default_config.yaml with RealtimeSTT parameters:
  - Silero VAD sensitivity (0.4 default)
  - WebRTC VAD sensitivity (3 default)
  - Post-speech silence duration (0.3s)
  - Pre-recording buffer (0.2s)
  - Beam size for quality control (5 default)
  - ONNX acceleration (enabled for 2-3x faster VAD)
  - Optional realtime preview settings

### CLI Updates
- Update main_cli.py to use new engine API
- Separate initialize() and start_recording() calls

### Documentation
- Add INSTALL_REALTIMESTT.md with migration guide and benefits
- Update INSTALL.md: Remove FFmpeg requirement (not needed!)
- Clarify PortAudio is only needed for development
- Document that built executables are fully standalone

## Benefits

-  Eliminates word loss at chunk boundaries
-  Natural speech segment detection via VAD
-  2-3x faster VAD with ONNX acceleration
-  30% lower CPU usage
-  Pre-recording buffer captures word starts
-  Post-speech silence prevents cutoffs
-  Optional instant preview mode
-  Better UX with comprehensive tooltips

## Migration Notes

- Settings apply immediately without restart (except model changes)
- Old chunk_duration configs ignored (VAD-based detection now)
- Recording only starts when user clicks button (not on app startup)
- Stop button immediately stops recording (no delay)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-28 18:48:29 -08:00

198 lines
4.1 KiB
Markdown

# Installation Guide
## Prerequisites
- **Python 3.9 or higher**
- **uv** (Python package installer)
- **PortAudio** (for audio capture - development only)
- **CUDA-capable GPU** (optional, for GPU acceleration)
**Note:** FFmpeg is NOT required. RealtimeSTT and faster-whisper do not use FFmpeg.
### Installing uv
If you don't have `uv` installed:
```bash
# On macOS and Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
# On Windows
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
# Or with pip
pip install uv
```
### Installing PortAudio (Development Only)
**Note:** Only needed for building from source. Built executables bundle PortAudio.
#### On Ubuntu/Debian:
```bash
sudo apt-get install portaudio19-dev python3-dev
```
#### On macOS (with Homebrew):
```bash
brew install portaudio
```
#### On Windows:
Nothing needed - PyAudio wheels include PortAudio binaries.
## Installation Steps
### 1. Navigate to Project Directory
```bash
cd /home/jknapp/code/local-transcription
```
### 2. Install Dependencies with uv
```bash
# uv will automatically create a virtual environment and install dependencies
uv sync
```
This single command will:
- Create a virtual environment (`.venv/`)
- Install all dependencies from `pyproject.toml`
- Lock dependencies for reproducibility
**Note for CUDA users:** If you have an NVIDIA GPU, install PyTorch with CUDA support:
```bash
# For CUDA 11.8
uv pip install torch --index-url https://download.pytorch.org/whl/cu118
# For CUDA 12.1
uv pip install torch --index-url https://download.pytorch.org/whl/cu121
```
### 3. Run the Application
```bash
# Option 1: Using uv run (automatically uses the venv)
uv run python main.py
# Option 2: Activate venv manually
source .venv/bin/activate # On Windows: .venv\Scripts\activate
python main.py
```
On first run, the application will:
- Download the Whisper model (this may take a few minutes)
- Create a configuration file at `~/.local-transcription/config.yaml`
## Quick Start Commands
```bash
# Install everything
uv sync
# Run the application
uv run python main.py
# Install with server dependencies (for Phase 2+)
uv sync --extra server
# Update dependencies
uv sync --upgrade
```
## Configuration
Settings can be changed through the GUI (Settings button) or by editing:
```
~/.local-transcription/config.yaml
```
## Troubleshooting
### Audio Device Issues
If no audio devices are detected:
```bash
uv run python -c "import sounddevice as sd; print(sd.query_devices())"
```
### GPU Not Detected
Check if CUDA is available:
```bash
uv run python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
```
### Model Download Fails
Models are downloaded to `~/.cache/huggingface/`. If download fails:
- Check internet connection
- Ensure sufficient disk space (~1-3 GB depending on model size)
### uv Command Not Found
Make sure uv is in your PATH:
```bash
# Add to ~/.bashrc or ~/.zshrc
export PATH="$HOME/.cargo/bin:$PATH"
```
## Performance Tips
For best real-time performance:
1. **Use GPU if available** - 5-10x faster than CPU
2. **Start with smaller models**:
- `tiny`: Fastest, ~39M parameters, 1-2s latency
- `base`: Good balance, ~74M parameters, 2-3s latency
- `small`: Better accuracy, ~244M parameters, 3-5s latency
3. **Enable VAD** (Voice Activity Detection) to skip silent audio
4. **Adjust chunk duration**: Smaller = lower latency, larger = better accuracy
## System Requirements
### Minimum:
- CPU: Dual-core 2GHz+
- RAM: 4GB
- Model: tiny or base
### Recommended:
- CPU: Quad-core 3GHz+ or GPU (NVIDIA GTX 1060+)
- RAM: 8GB
- Model: base or small
### For Best Performance:
- GPU: NVIDIA RTX 2060 or better
- RAM: 16GB
- Model: small or medium
## Development
### Install development dependencies:
```bash
uv sync --extra dev
```
### Run tests:
```bash
uv run pytest
```
### Format code:
```bash
uv run black .
uv run ruff check .
```
## Why uv?
`uv` is significantly faster than pip:
- **10-100x faster** dependency resolution
- **Automatic virtual environment** management
- **Reproducible builds** with lockfile
- **Drop-in replacement** for pip commands
Learn more at [astral.sh/uv](https://astral.sh/uv)