# Transcription Latency Guide

## Understanding the Delay

The delay between speaking and seeing the transcription is **not caused by server sync**; it comes from the **audio processing pipeline**.

### Where the Time Goes

```
You speak: "Hello everyone"
        ↓
┌─────────────────────────────────────────────┐
│ 1. Audio Buffer (chunk_duration)            │
│    Default: 3.0 seconds                     │ ← MAIN SOURCE OF DELAY!
│    Waiting for enough audio...              │
└─────────────────────────────────────────────┘
        ↓ (3.0 seconds later)
┌─────────────────────────────────────────────┐
│ 2. Transcription Processing                 │
│    Whisper model inference                  │
│    Time: 0.5-1.5 seconds                    │ ← Depends on model size & device
│    (base model on GPU: ~500ms)              │
│    (base model on CPU: ~1500ms)             │
└─────────────────────────────────────────────┘
        ↓ (0.5-1.5 seconds later)
┌─────────────────────────────────────────────┐
│ 3. Display & Server Sync                    │
│    - Display locally: instant               │
│    - Queue for sync: instant                │
│    - HTTP request: 50-200ms                 │ ← Network time
└─────────────────────────────────────────────┘
        ↓
Total delay: 3.5-4.5 seconds (mostly buffer time!)
```
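
The breakdown above is simple addition, which makes a quick what-if estimator easy to write (a minimal sketch; the default stage timings are the example figures from the diagram, not measurements):

```python
def estimate_latency(chunk_duration: float,
                     inference_time: float = 0.5,
                     network_time: float = 0.1) -> float:
    """Rough end-to-end delay: the buffer must fill before
    inference runs, then the result is displayed and synced."""
    return chunk_duration + inference_time + network_time

# Default 3.0 s buffer with GPU inference (~0.5 s) and ~100 ms network:
total = estimate_latency(3.0)    # ~3.6 s
faster = estimate_latency(1.5)   # ~2.1 s with a smaller buffer
```

Halving `chunk_duration` removes a full 1.5 s from the total, which is why the buffer dominates every other stage.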

## The Chunk Duration Trade-off

### Current Setting: 3.0 seconds

**Location:** Settings → Audio → Chunk Duration (or `~/.local-transcription/config.yaml`)

```yaml
audio:
  chunk_duration: 3.0   # Current setting
  overlap_duration: 0.5
```

**Pros:**

- ✅ Good accuracy (Whisper has full-sentence context)
- ✅ Lower CPU usage (fewer model invocations)
- ✅ Better for long sentences

**Cons:**

- ❌ High latency (~4 seconds)
- ❌ Feels "laggy" for real-time use

---

## Recommended Settings by Use Case

### For Live Streaming (Lower Latency Priority)

```yaml
audio:
  chunk_duration: 1.5   # ← Change this
  overlap_duration: 0.3
```

**Result:**

- Latency: ~2-2.5 seconds (much better!)
- Accuracy: still good for most speech
- CPU: moderate increase

### For Podcasting (Accuracy Priority)

```yaml
audio:
  chunk_duration: 4.0
  overlap_duration: 0.5
```

**Result:**

- Latency: ~5 seconds (high)
- Accuracy: best (full sentences)
- CPU: lowest

### For Real-Time Captions (Lowest Latency)

```yaml
audio:
  chunk_duration: 1.0   # Aggressive!
  overlap_duration: 0.2
```

**Result:**

- Latency: ~1.5 seconds (best possible)
- Accuracy: lower (may cut mid-word)
- CPU: higher (more frequent processing)

**Warning:** Chunks shorter than 1 second may cut words and reduce accuracy significantly.

### For Gaming/Commentary (Balanced)

```yaml
audio:
  chunk_duration: 2.0
  overlap_duration: 0.3
```

**Result:**

- Latency: ~2.5-3 seconds (good balance)
- Accuracy: good
- CPU: moderate

---

## How to Change Settings

### Method 1: Settings Dialog (Recommended)

1. Open the Local Transcription app
2. Click **Settings**
3. Find the "Audio" section
4. Adjust the "Chunk Duration" slider
5. Click **Save**
6. Restart transcription

### Method 2: Edit Config File

1. Stop the app
2. Edit `~/.local-transcription/config.yaml`
3. Change:

   ```yaml
   audio:
     chunk_duration: 1.5   # Your desired value
   ```

4. Save the file
5. Restart the app
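
If you prefer to script Method 2, a line-based edit keeps the rest of the file untouched (a minimal sketch; the config path and key name follow the example above, and a real setup might prefer a YAML library such as PyYAML):

```python
import re
from pathlib import Path

def set_chunk_duration(config_path: Path, value: float) -> str:
    """Rewrite the chunk_duration line in config.yaml, preserving indentation."""
    text = config_path.read_text()
    new_text, count = re.subn(
        r"(?m)^(\s*chunk_duration:\s*)[\d.]+",   # match the key and keep its prefix
        lambda m: f"{m.group(1)}{value}",
        text,
    )
    if count != 1:
        raise ValueError("expected exactly one chunk_duration entry")
    config_path.write_text(new_text)
    return new_text

# Usage (assuming the default config location):
# set_chunk_duration(Path.home() / ".local-transcription" / "config.yaml", 1.5)
```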

---

## Testing Different Settings

**Quick test procedure:**

1. Set `chunk_duration` to different values
2. Start transcription
3. Speak a sentence
4. Note the time until it appears
5. Check accuracy
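
Step 4 ("note the time until it appears") can be made precise with a stopwatch in code (a minimal sketch; `fake_transcribe` is a stand-in for whatever function your pipeline calls per chunk, not the app's real API):

```python
import time

def timed_transcribe(transcribe, audio_chunk):
    """Wrap a per-chunk transcription call and report wall-clock time."""
    start = time.monotonic()
    text = transcribe(audio_chunk)
    elapsed = time.monotonic() - start
    print(f"chunk transcribed in {elapsed * 1000:.0f} ms: {text!r}")
    return text, elapsed

# Stand-in for a real model call, so the harness runs anywhere:
def fake_transcribe(chunk):
    time.sleep(0.05)   # pretend inference takes ~50 ms
    return "hello everyone"

text, elapsed = timed_transcribe(fake_transcribe, b"\x00" * 16000)
```

Remember that this measures only stage 2 of the pipeline; the buffer wait (`chunk_duration`) happens before the chunk ever reaches this call.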

**Example results:**

| Chunk Duration | Latency | Accuracy | CPU Usage | Best For |
|----------------|---------|----------|-----------|----------|
| 1.0s | ~1.5s | Fair | High | Real-time captions |
| 1.5s | ~2.0s | Good | Medium-High | Live streaming |
| 2.0s | ~2.5s | Good | Medium | Gaming commentary |
| 3.0s | ~4.0s | Very Good | Low | Default (balanced) |
| 4.0s | ~5.0s | Excellent | Very Low | Podcasts |
| 5.0s | ~6.0s | Best | Lowest | Post-production |

---

## Model Size Impact

Model size also affects processing time:

| Model | Parameters | GPU Time | CPU Time | Accuracy |
|--------|------------|----------|----------|----------|
| tiny | 39M | ~200ms | ~800ms | Fair |
| base | 74M | ~400ms | ~1500ms | Good |
| small | 244M | ~800ms | ~3000ms | Very Good |
| medium | 769M | ~1500ms | ~6000ms | Excellent |
| large | 1550M | ~3000ms | ~12000ms | Best |

**For low latency:**

- Use the `base` or `tiny` model
- Use a GPU if available
- Reduce `chunk_duration`

**Example fast setup:**

```yaml
transcription:
  model: base    # or tiny
  device: cuda   # if you have a GPU

audio:
  chunk_duration: 1.5
```

**Result:** ~2 second total latency!

---

## Advanced: Streaming Transcription

For the absolute lowest latency (experimental):

```yaml
audio:
  chunk_duration: 0.8     # Very aggressive!
  overlap_duration: 0.4   # High overlap to prevent cutoffs

processing:
  use_vad: true           # Skip silent chunks
  min_confidence: 0.3     # Lower threshold (more permissive)
```

**Trade-offs:**

- ✅ Latency: ~1 second
- ❌ May cut words frequently
- ❌ More processing overhead
- ❌ Some gibberish in output

---

## Why Not Make It Instant?

**Q:** Why can't `chunk_duration` be 0.1 seconds for instant transcription?

**A:** Several reasons:

1. **Whisper needs context** - it performs better with full sentences
2. **Word boundaries** - too short and you cut words mid-syllable
3. **Processing overhead** - each chunk carries a fixed startup cost
4. **Model design** - Whisper expects chunks of roughly 0.5-30 seconds

**Practical limit:** ~1 second is the minimum for decent accuracy.
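
The overhead point in reason 3 is easy to quantify (a minimal sketch; the ~50 ms per-chunk fixed cost is an assumed illustrative figure, not a measured one):

```python
def overhead_fraction(chunk_duration: float, fixed_cost: float = 0.05) -> float:
    """Fraction of total processing time spent on per-chunk fixed cost."""
    return fixed_cost / (chunk_duration + fixed_cost)

# At 3.0 s chunks, a 50 ms fixed cost is under 2% of the work;
# at 0.1 s chunks it eats about a third of all processing time:
for d in (3.0, 1.0, 0.1):
    print(f"{d:>4}s chunks -> {overhead_fraction(d):.0%} overhead")
```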

---

## Server Sync Is NOT the Bottleneck

With the recent fixes, server sync adds only **~50-200ms** of delay:

```
Local display:  [3.5s] "Hello everyone"
        ↓
Queue:          [3.5s] Instant
        ↓
HTTP request:   [3.6s] 100ms network
        ↓
Server display: [3.6s] "Hello everyone"

Server sync delay: Only 100ms!
```

**The real delay is audio buffering (`chunk_duration`).**

---

## Recommended Settings for Your Use Case

Based on "4 seconds feels too slow":

### Try This First

```yaml
audio:
  chunk_duration: 2.0   # Down from the 3.0 s default
  overlap_duration: 0.3
```

**Expected result:** ~2.5 second total latency (much better!)

### If Still Too Slow

```yaml
audio:
  chunk_duration: 1.5   # More aggressive
  overlap_duration: 0.3

transcription:
  model: base   # Use a smaller/faster model if not already
```

**Expected result:** ~2 second total latency

### If You Want FAST (Accept Lower Accuracy)

```yaml
audio:
  chunk_duration: 1.0
  overlap_duration: 0.2

transcription:
  model: tiny    # Fastest model
  device: cuda   # Use GPU
```

**Expected result:** ~1.2 second total latency

---

## Monitoring Latency

With the debug logging we just added, you'll see lines like:

```
[GUI] Sending to server sync: 'Hello everyone...'
[GUI] Queued for sync in: 0.2ms
[Server Sync] Queue delay: 15ms
[Server Sync] HTTP request: 89ms, Status: 200
```

**If you see:**

- Queue delay > 100ms → server sync is slow (rare)
- HTTP request > 500ms → network/server issue
- Nothing printed for 3+ seconds → waiting for the chunk to fill
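
The checks above can be automated by scanning the log output (a minimal sketch against the log format shown; the thresholds match the bullets):

```python
import re

def check_sync_logs(lines):
    """Flag slow queue or HTTP timings in [Server Sync] log lines."""
    warnings = []
    for line in lines:
        m = re.search(r"\[Server Sync\] (Queue delay|HTTP request): (\d+)ms", line)
        if not m:
            continue
        kind, ms = m.group(1), int(m.group(2))
        if kind == "Queue delay" and ms > 100:
            warnings.append(f"slow queue: {ms}ms")
        elif kind == "HTTP request" and ms > 500:
            warnings.append(f"slow network/server: {ms}ms")
    return warnings

logs = [
    "[GUI] Queued for sync in: 0.2ms",
    "[Server Sync] Queue delay: 15ms",
    "[Server Sync] HTTP request: 890ms, Status: 200",
]
print(check_sync_logs(logs))   # ['slow network/server: 890ms']
```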

---

## Summary

**Your 4-second delay breakdown:**

- 🐢 3.0s - Audio buffering (`chunk_duration`) ← **MAIN CULPRIT**
- ⚡ 0.5-1.0s - Transcription processing (model inference)
- ⚡ 0.1s - Server sync (network)

**To reduce to ~2 seconds:**

1. Open Settings
2. Change chunk_duration to **2.0**
3. Restart transcription
4. Enjoy 2x faster captions!

**To reduce to ~1.5 seconds:**

1. Change chunk_duration to **1.5**
2. Use the `base` or `tiny` model
3. Use a GPU if available
4. Accept slightly lower accuracy