# Transcription Latency Guide

## Understanding the Delay

The delay between speaking and seeing the transcription is **not caused by server sync**; it comes from the **audio processing pipeline**.

### Where the Time Goes

```
You speak: "Hello everyone"
        ↓
┌─────────────────────────────────────────────┐
│ 1. Audio Buffer (chunk_duration)            │
│    Default: 3.0 seconds                     │ ← MAIN SOURCE OF DELAY!
│    Waiting for enough audio...              │
└─────────────────────────────────────────────┘
        ↓ (3.0 seconds later)
┌─────────────────────────────────────────────┐
│ 2. Transcription Processing                 │
│    Whisper model inference                  │
│    Time: 0.5-1.5 seconds                    │ ← Depends on model size & device
│    (base model on GPU: ~500ms)              │
│    (base model on CPU: ~1500ms)             │
└─────────────────────────────────────────────┘
        ↓ (0.5-1.5 seconds later)
┌─────────────────────────────────────────────┐
│ 3. Display & Server Sync                    │
│    - Display locally: instant               │
│    - Queue for sync: instant                │
│    - HTTP request: 50-200ms                 │ ← Network time
└─────────────────────────────────────────────┘
        ↓
Total delay: 3.5-4.5 seconds (mostly buffer time!)
```
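
The breakdown above is simple addition, which makes a quick what-if estimator easy to write (a minimal sketch; the default stage timings are the example figures from the diagram, not measurements):

```python
def estimate_latency(chunk_duration: float,
                     inference_time: float = 0.5,
                     network_time: float = 0.1) -> float:
    """Rough end-to-end delay: the buffer must fill before
    inference runs, then the result is displayed and synced."""
    return chunk_duration + inference_time + network_time

# Default 3.0 s buffer with GPU inference (~0.5 s) and ~100 ms network:
total = estimate_latency(3.0)    # ~3.6 s
faster = estimate_latency(1.5)   # ~2.1 s with a smaller buffer
```

Halving `chunk_duration` removes a full 1.5 s from the total, which is why the buffer dominates every other stage.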

## The Chunk Duration Trade-off

### Current Setting: 3.0 seconds

**Location:** Settings → Audio → Chunk Duration (or `~/.local-transcription/config.yaml`)

```yaml
audio:
  chunk_duration: 3.0   # Current setting
  overlap_duration: 0.5
```

**Pros:**

- ✅ Good accuracy (Whisper has full-sentence context)
- ✅ Lower CPU usage (fewer model invocations)
- ✅ Better for long sentences

**Cons:**

- ❌ High latency (~4 seconds)
- ❌ Feels "laggy" for real-time use

---

## Recommended Settings by Use Case

### For Live Streaming (Lower Latency Priority)

```yaml
audio:
  chunk_duration: 1.5   # ← Change this
  overlap_duration: 0.3
```

**Result:**

- Latency: ~2-2.5 seconds (much better!)
- Accuracy: still good for most speech
- CPU: moderate increase

### For Podcasting (Accuracy Priority)

```yaml
audio:
  chunk_duration: 4.0
  overlap_duration: 0.5
```

**Result:**

- Latency: ~5 seconds (high)
- Accuracy: best (full sentences)
- CPU: lowest

### For Real-Time Captions (Lowest Latency)

```yaml
audio:
  chunk_duration: 1.0   # Aggressive!
  overlap_duration: 0.2
```

**Result:**

- Latency: ~1.5 seconds (best possible)
- Accuracy: lower (may cut mid-word)
- CPU: higher (more frequent processing)

**Warning:** Chunks shorter than 1 second may cut words and reduce accuracy significantly.

### For Gaming/Commentary (Balanced)

```yaml
audio:
  chunk_duration: 2.0
  overlap_duration: 0.3
```

**Result:**

- Latency: ~2.5-3 seconds (good balance)
- Accuracy: good
- CPU: moderate

---

## How to Change Settings

### Method 1: Settings Dialog (Recommended)

1. Open the Local Transcription app
2. Click **Settings**
3. Find the "Audio" section
4. Adjust the "Chunk Duration" slider
5. Click **Save**
6. Restart transcription

### Method 2: Edit Config File

1. Stop the app
2. Edit `~/.local-transcription/config.yaml`
3. Change:

   ```yaml
   audio:
     chunk_duration: 1.5   # Your desired value
   ```

4. Save the file
5. Restart the app
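
If you prefer to script Method 2, a line-based edit keeps the rest of the file untouched (a minimal sketch; the config path and key name follow the example above, and a real setup might prefer a YAML library such as PyYAML):

```python
import re
from pathlib import Path

def set_chunk_duration(config_path: Path, value: float) -> str:
    """Rewrite the chunk_duration line in config.yaml, preserving indentation."""
    text = config_path.read_text()
    new_text, count = re.subn(
        r"(?m)^(\s*chunk_duration:\s*)[\d.]+",   # match the key and keep its prefix
        lambda m: f"{m.group(1)}{value}",
        text,
    )
    if count != 1:
        raise ValueError("expected exactly one chunk_duration entry")
    config_path.write_text(new_text)
    return new_text

# Usage (assuming the default config location):
# set_chunk_duration(Path.home() / ".local-transcription" / "config.yaml", 1.5)
```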

---

## Testing Different Settings

**Quick test procedure:**

1. Set `chunk_duration` to different values
2. Start transcription
3. Speak a sentence
4. Note the time until it appears
5. Check accuracy
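
Step 4 ("note the time until it appears") can be made precise with a stopwatch in code (a minimal sketch; `fake_transcribe` is a stand-in for whatever function your pipeline calls per chunk, not the app's real API):

```python
import time

def timed_transcribe(transcribe, audio_chunk):
    """Wrap a per-chunk transcription call and report wall-clock time."""
    start = time.monotonic()
    text = transcribe(audio_chunk)
    elapsed = time.monotonic() - start
    print(f"chunk transcribed in {elapsed * 1000:.0f} ms: {text!r}")
    return text, elapsed

# Stand-in for a real model call, so the harness runs anywhere:
def fake_transcribe(chunk):
    time.sleep(0.05)   # pretend inference takes ~50 ms
    return "hello everyone"

text, elapsed = timed_transcribe(fake_transcribe, b"\x00" * 16000)
```

Remember that this measures only stage 2 of the pipeline; the buffer wait (`chunk_duration`) happens before the chunk ever reaches this call.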

**Example results:**

| Chunk Duration | Latency | Accuracy | CPU Usage | Best For |
|----------------|---------|----------|-----------|----------|
| 1.0s | ~1.5s | Fair | High | Real-time captions |
| 1.5s | ~2.0s | Good | Medium-High | Live streaming |
| 2.0s | ~2.5s | Good | Medium | Gaming commentary |
| 3.0s | ~4.0s | Very Good | Low | Default (balanced) |
| 4.0s | ~5.0s | Excellent | Very Low | Podcasts |
| 5.0s | ~6.0s | Best | Lowest | Post-production |

---

## Model Size Impact

Model size also affects processing time:

| Model | Parameters | GPU Time | CPU Time | Accuracy |
|--------|------------|----------|----------|----------|
| tiny | 39M | ~200ms | ~800ms | Fair |
| base | 74M | ~400ms | ~1500ms | Good |
| small | 244M | ~800ms | ~3000ms | Very Good |
| medium | 769M | ~1500ms | ~6000ms | Excellent |
| large | 1550M | ~3000ms | ~12000ms | Best |

**For low latency:**

- Use the `base` or `tiny` model
- Use a GPU if available
- Reduce `chunk_duration`

**Example fast setup:**

```yaml
transcription:
  model: base    # or tiny
  device: cuda   # if you have a GPU

audio:
  chunk_duration: 1.5
```

**Result:** ~2 second total latency!

---

## Advanced: Streaming Transcription

For the absolute lowest latency (experimental):

```yaml
audio:
  chunk_duration: 0.8     # Very aggressive!
  overlap_duration: 0.4   # High overlap to prevent cutoffs

processing:
  use_vad: true           # Skip silent chunks
  min_confidence: 0.3     # Lower threshold (more permissive)
```

**Trade-offs:**

- ✅ Latency: ~1 second
- ❌ May cut words frequently
- ❌ More processing overhead
- ❌ Some gibberish in output

---

## Why Not Make It Instant?

**Q:** Why can't `chunk_duration` be 0.1 seconds for instant transcription?

**A:** Several reasons:

1. **Whisper needs context** - it performs better with full sentences
2. **Word boundaries** - too short and you cut words mid-syllable
3. **Processing overhead** - each chunk carries a fixed startup cost
4. **Model design** - Whisper expects chunks of roughly 0.5-30 seconds

**Practical limit:** ~1 second is the minimum for decent accuracy.
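
The overhead point in reason 3 is easy to quantify (a minimal sketch; the ~50 ms per-chunk fixed cost is an assumed illustrative figure, not a measured one):

```python
def overhead_fraction(chunk_duration: float, fixed_cost: float = 0.05) -> float:
    """Fraction of total processing time spent on per-chunk fixed cost."""
    return fixed_cost / (chunk_duration + fixed_cost)

# At 3.0 s chunks, a 50 ms fixed cost is under 2% of the work;
# at 0.1 s chunks it eats about a third of all processing time:
for d in (3.0, 1.0, 0.1):
    print(f"{d:>4}s chunks -> {overhead_fraction(d):.0%} overhead")
```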

---

## Server Sync Is NOT the Bottleneck

With the recent fixes, server sync adds only **~50-200ms** of delay:

```
Local display:  [3.5s] "Hello everyone"
        ↓
Queue:          [3.5s] Instant
        ↓
HTTP request:   [3.6s] 100ms network
        ↓
Server display: [3.6s] "Hello everyone"

Server sync delay: Only 100ms!
```

**The real delay is audio buffering (`chunk_duration`).**

---

## Recommended Settings for Your Use Case

Based on "4 seconds feels too slow":

### Try This First

```yaml
audio:
  chunk_duration: 2.0   # Down from the 3.0 s default
  overlap_duration: 0.3
```

**Expected result:** ~2.5 second total latency (much better!)

### If Still Too Slow

```yaml
audio:
  chunk_duration: 1.5   # More aggressive
  overlap_duration: 0.3

transcription:
  model: base   # Use a smaller/faster model if not already
```

**Expected result:** ~2 second total latency

### If You Want FAST (Accept Lower Accuracy)

```yaml
audio:
  chunk_duration: 1.0
  overlap_duration: 0.2

transcription:
  model: tiny    # Fastest model
  device: cuda   # Use GPU
```

**Expected result:** ~1.2 second total latency

---

## Monitoring Latency

With the debug logging we just added, you'll see lines like:

```
[GUI] Sending to server sync: 'Hello everyone...'
[GUI] Queued for sync in: 0.2ms
[Server Sync] Queue delay: 15ms
[Server Sync] HTTP request: 89ms, Status: 200
```

**If you see:**

- Queue delay > 100ms → server sync is slow (rare)
- HTTP request > 500ms → network/server issue
- Nothing printed for 3+ seconds → waiting for the chunk to fill
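
The checks above can be automated by scanning the log output (a minimal sketch against the log format shown; the thresholds match the bullets):

```python
import re

def check_sync_logs(lines):
    """Flag slow queue or HTTP timings in [Server Sync] log lines."""
    warnings = []
    for line in lines:
        m = re.search(r"\[Server Sync\] (Queue delay|HTTP request): (\d+)ms", line)
        if not m:
            continue
        kind, ms = m.group(1), int(m.group(2))
        if kind == "Queue delay" and ms > 100:
            warnings.append(f"slow queue: {ms}ms")
        elif kind == "HTTP request" and ms > 500:
            warnings.append(f"slow network/server: {ms}ms")
    return warnings

logs = [
    "[GUI] Queued for sync in: 0.2ms",
    "[Server Sync] Queue delay: 15ms",
    "[Server Sync] HTTP request: 890ms, Status: 200",
]
print(check_sync_logs(logs))   # ['slow network/server: 890ms']
```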

---

## Summary

**Your 4-second delay breakdown:**

- 🐢 3.0s - Audio buffering (`chunk_duration`) ← **MAIN CULPRIT**
- ⚡ 0.5-1.0s - Transcription processing (model inference)
- ⚡ 0.1s - Server sync (network)

**To reduce to ~2 seconds:**

1. Open Settings
2. Change chunk_duration to **2.0**
3. Restart transcription
4. Enjoy 2x faster captions!

**To reduce to ~1.5 seconds:**

1. Change chunk_duration to **1.5**
2. Use the `base` or `tiny` model
3. Use a GPU if available
4. Accept slightly lower accuracy