# Transcription Latency Guide

## Understanding the Delay

The delay you see between speaking and the transcription appearing is **NOT from server sync** - it's from the **audio processing pipeline**.

### Where the Time Goes

```
You speak: "Hello everyone"
    ↓
┌─────────────────────────────────────────────┐
│ 1. Audio Buffer (chunk_duration)            │
│    Default: 3.0 seconds                     │ ← MAIN SOURCE OF DELAY!
│    Waiting for enough audio...              │
└─────────────────────────────────────────────┘
    ↓ (3.0 seconds later)
┌─────────────────────────────────────────────┐
│ 2. Transcription Processing                 │
│    Whisper model inference                  │
│    Time: 0.5-1.5 seconds                    │ ← Depends on model size & device
│    (base model on GPU: ~500ms)              │
│    (base model on CPU: ~1500ms)             │
└─────────────────────────────────────────────┘
    ↓ (0.5-1.5 seconds later)
┌─────────────────────────────────────────────┐
│ 3. Display & Server Sync                    │
│    - Display locally: instant               │
│    - Queue for sync: instant                │
│    - HTTP request: 50-200ms                 │ ← Network time
└─────────────────────────────────────────────┘
    ↓
Total Delay: 3.5-4.5 seconds (mostly buffer time!)
```

## The Chunk Duration Trade-off

### Current Setting: 3.0 seconds

**Location:** Settings → Audio → Chunk Duration (or `~/.local-transcription/config.yaml`)

```yaml
audio:
  chunk_duration: 3.0  # Current setting
  overlap_duration: 0.5
```

**Pros:**
- ✅ Good accuracy (Whisper has full sentence context)
- ✅ Lower CPU usage (fewer API calls)
- ✅ Better for long sentences

**Cons:**
- ❌ High latency (~4 seconds)
- ❌ Feels "laggy" for real-time use

---

## Recommended Settings by Use Case

### For Live Streaming (Lower Latency Priority)

```yaml
audio:
  chunk_duration: 1.5  # ← Change this
  overlap_duration: 0.3
```

**Result:**
- Latency: ~2-2.5 seconds (much better!)
- Accuracy: Still good for most speech
- CPU: Moderate increase

### For Podcasting (Accuracy Priority)

```yaml
audio:
  chunk_duration: 4.0
  overlap_duration: 0.5
```

**Result:**
- Latency: ~5 seconds (high)
- Accuracy: Best (full sentences)
- CPU: Lowest

### For Real-Time Captions (Lowest Latency)

```yaml
audio:
  chunk_duration: 1.0  # Aggressive!
  overlap_duration: 0.2
```

**Result:**
- Latency: ~1.5 seconds (best possible)
- Accuracy: Lower (may cut mid-word)
- CPU: Higher (more frequent processing)

**Warning:** Chunks < 1 second may cut words and reduce accuracy significantly.

### For Gaming/Commentary (Balanced)

```yaml
audio:
  chunk_duration: 2.0
  overlap_duration: 0.3
```

**Result:**
- Latency: ~2.5-3 seconds (good balance)
- Accuracy: Good
- CPU: Moderate

---

## How to Change Settings

### Method 1: Settings Dialog (Recommended)

1. Open Local Transcription app
2. Click **Settings**
3. Find "Audio" section
4. Adjust "Chunk Duration" slider
5. Click **Save**
6. Restart transcription

### Method 2: Edit Config File

1. Stop the app
2. Edit: `~/.local-transcription/config.yaml`
3. Change:
   ```yaml
   audio:
     chunk_duration: 1.5  # Your desired value
   ```
4. Save file
5. Restart app

---

## Testing Different Settings

**Quick test procedure:**
1. Set chunk_duration to different values
2. Start transcription
3. Speak a sentence
4. Note the time until it appears
5. Check accuracy

**Example results:**

| Chunk Duration | Latency | Accuracy | CPU Usage | Best For |
|----------------|---------|-----------|-------------|---------------------|
| 1.0s | ~1.5s | Fair | High | Real-time captions |
| 1.5s | ~2.0s | Good | Medium-High | Live streaming |
| 2.0s | ~2.5s | Good | Medium | Gaming commentary |
| 3.0s | ~4.0s | Very Good | Low | Default (balanced) |
| 4.0s | ~5.0s | Excellent | Very Low | Podcasts |
| 5.0s | ~6.0s | Best | Lowest | Post-production |

---

## Model Size Impact

The model size also affects processing time:

| Model | Parameters | GPU Time | CPU Time | Accuracy |
|--------|------------|----------|-----------|-----------|
| tiny | 39M | ~200ms | ~800ms | Fair |
| base | 74M | ~400ms | ~1500ms | Good |
| small | 244M | ~800ms | ~3000ms | Very Good |
| medium | 769M | ~1500ms | ~6000ms | Excellent |
| large | 1550M | ~3000ms | ~12000ms | Best |

**For low latency:**
- Use `base` or `tiny` model
- Use GPU if available
- Reduce chunk_duration

**Example fast setup:**
```yaml
transcription:
  model: base   # or tiny
  device: cuda  # if you have GPU

audio:
  chunk_duration: 1.5
```

**Result:** ~2 second total latency!

---

## Advanced: Streaming Transcription

For the absolute lowest latency (experimental):

```yaml
audio:
  chunk_duration: 0.8   # Very aggressive!
  overlap_duration: 0.4 # High overlap to prevent cutoffs

processing:
  use_vad: true         # Skip silent chunks
  min_confidence: 0.3   # Lower threshold (more permissive)
```

**Trade-offs:**
- ✅ Latency: ~1 second
- ❌ May cut words frequently
- ❌ More processing overhead
- ❌ Some gibberish in output

---

## Why Not Make It Instant?

**Q:** Why can't chunk_duration be 0.1 seconds for instant transcription?

**A:** Several reasons:
1. **Whisper needs context** - It performs better with full sentences
2. **Word boundaries** - Too short and you cut words mid-syllable
3. **Processing overhead** - Each chunk has startup cost
4. **Model design** - Whisper expects 0.5-30 second chunks

**Physical limit:** ~1 second is the practical minimum for decent accuracy.

---

## Server Sync Is NOT the Bottleneck

With the recent fixes, server sync adds only **~50-200ms** of delay:

```
Local display:   [3.5s] "Hello everyone"
    ↓
Queue:           [3.5s] Instant
    ↓
HTTP request:    [3.6s] 100ms network
    ↓
Server display:  [3.6s] "Hello everyone"

Server sync delay: Only 100ms!
```

**The real delay is audio buffering (chunk_duration).**

---

## Recommended Settings for Your Use Case

Based on "4 seconds feels too slow":

### Try This First

```yaml
audio:
  chunk_duration: 2.0  # Half the current 4-second delay
  overlap_duration: 0.3
```

**Expected result:** ~2.5 second total latency (much better!)

### If Still Too Slow

```yaml
audio:
  chunk_duration: 1.5  # More aggressive
  overlap_duration: 0.3

transcription:
  model: base  # Use smaller/faster model if not already
```

**Expected result:** ~2 second total latency

### If You Want FAST (Accept Lower Accuracy)

```yaml
audio:
  chunk_duration: 1.0
  overlap_duration: 0.2

transcription:
  model: tiny   # Fastest model
  device: cuda  # Use GPU
```

**Expected result:** ~1.2 second total latency

---

## Monitoring Latency

With the debug logging we just added, you'll see:

```
[GUI] Sending to server sync: 'Hello everyone...'
[GUI] Queued for sync in: 0.2ms
[Server Sync] Queue delay: 15ms
[Server Sync] HTTP request: 89ms, Status: 200
```

**If you see:**
- Queue delay > 100ms → Server sync is slow (rare)
- HTTP request > 500ms → Network/server issue
- Nothing printed for 3+ seconds → Waiting for chunk to fill

---

## Summary

**Your 4-second delay breakdown:**
- 🐢 3.0s - Audio buffering (chunk_duration) ← **MAIN CULPRIT**
- ⚡ 0.5-1.0s - Transcription processing (model inference)
- ⚡ 0.1s - Server sync (network)

**To reduce to ~2 seconds:**
1. Open Settings
2. Change chunk_duration to **2.0**
3. Restart transcription
4. Enjoy 2x faster captions!

**To reduce to ~1.5 seconds:**
1. Change chunk_duration to **1.5**
2. Use `base` or `tiny` model
3. Use GPU if available
4. Accept slightly lower accuracy
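
---

## Appendix: Back-of-the-Envelope Sketches

The latency breakdown above reduces to a one-line model: total delay ≈ chunk_duration + model inference + network time. A minimal Python sketch of that arithmetic (the function name and example timings are illustrative, not part of the app):

```python
def estimated_latency(chunk_duration, inference_time, network_time=0.1):
    """Rough worst-case delay for a word spoken just after a chunk starts:
    it waits for the chunk to fill, then for inference, then for sync."""
    return chunk_duration + inference_time + network_time

# Default config: 3.0 s chunks, base model on GPU (~0.5 s inference)
print(round(estimated_latency(3.0, 0.5), 1))  # 3.6

# Suggested low-latency config: 1.5 s chunks, same model
print(round(estimated_latency(1.5, 0.5), 1))  # 2.1
```

This is why halving chunk_duration roughly halves the perceived lag: buffering dominates the other two terms.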
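The overlap_duration setting that appears in every config above exists so that consecutive chunks share audio at their boundary: a word cut off at the end of one chunk is heard whole at the start of the next. A sketch of that windowing (a hypothetical helper, not the app's actual recorder code):

```python
def chunk_windows(total_seconds, chunk_duration=3.0, overlap_duration=0.5):
    """Return the (start, end) windows, in seconds, that a chunked recorder
    would hand to Whisper. Each window starts chunk_duration minus
    overlap_duration after the previous one, so neighbors overlap."""
    step = chunk_duration - overlap_duration
    windows = []
    start = 0.0
    while start < total_seconds:
        windows.append((start, min(start + chunk_duration, total_seconds)))
        start += step
    return windows

# 10 s of audio with the default settings: a new chunk begins every 2.5 s
for start, end in chunk_windows(10.0):
    print(f"{start:.1f}s - {end:.1f}s")
```

Note the trade-off this makes visible: shrinking chunk_duration while keeping overlap_duration fixed shrinks the step, so more chunks are produced per second of speech, which is where the extra CPU usage in the tables comes from.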
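The thresholds in the "Monitoring Latency" section can also be checked mechanically. A small sketch that scans debug log lines in the format shown above and flags the two slow-stage conditions (the function, and the assumption that the log format is exactly as printed, are mine; the 100 ms and 500 ms thresholds come from the guide):

```python
import re

# Hypothetical log excerpt in the format from "Monitoring Latency".
LOG = """\
[GUI] Queued for sync in: 0.2ms
[Server Sync] Queue delay: 15ms
[Server Sync] HTTP request: 89ms, Status: 200
"""

def flag_slow_stages(log, queue_limit_ms=100, http_limit_ms=500):
    """Return a warning string for each log line whose stage exceeds
    the thresholds recommended in the guide."""
    warnings = []
    for line in log.splitlines():
        m = re.search(r"Queue delay: (\d+(?:\.\d+)?)ms", line)
        if m and float(m.group(1)) > queue_limit_ms:
            warnings.append(f"slow server sync: queue delay {m.group(1)}ms")
        m = re.search(r"HTTP request: (\d+(?:\.\d+)?)ms", line)
        if m and float(m.group(1)) > http_limit_ms:
            warnings.append(f"network/server issue: HTTP {m.group(1)}ms")
    return warnings

print(flag_slow_stages(LOG))  # [] -> every stage is within budget
```

Remember the third condition from that section is about absence, not slowness: if nothing is printed for several seconds, the pipeline is simply waiting for the audio chunk to fill.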