# Transcription Latency Guide

## Understanding the Delay

The delay you see between speaking and the transcription appearing is NOT from server sync - it's from the audio processing pipeline.

## Where the Time Goes

```text
You speak: "Hello everyone"
    ↓
┌─────────────────────────────────────────────┐
│ 1. Audio Buffer (chunk_duration)            │
│    Default: 3.0 seconds                     │ ← MAIN SOURCE OF DELAY!
│    Waiting for enough audio...              │
└─────────────────────────────────────────────┘
    ↓ (3.0 seconds later)
┌─────────────────────────────────────────────┐
│ 2. Transcription Processing                 │
│    Whisper model inference                  │
│    Time: 0.5-1.5 seconds                    │ ← Depends on model size & device
│    (base model on GPU: ~500ms)              │
│    (base model on CPU: ~1500ms)             │
└─────────────────────────────────────────────┘
    ↓ (0.5-1.5 seconds later)
┌─────────────────────────────────────────────┐
│ 3. Display & Server Sync                    │
│    - Display locally: instant               │
│    - Queue for sync: instant                │
│    - HTTP request: 50-200ms                 │ ← Network time
└─────────────────────────────────────────────┘
    ↓
Total Delay: 3.5-4.5 seconds (mostly buffer time!)
```
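To make that budget concrete, here is a minimal sketch (illustrative only; the component times are this guide's estimates from the diagram above, not measurements):

```python
# Back-of-envelope latency budget for the pipeline above.
# Component times are this guide's estimates, not measurements.

def total_latency(chunk_duration: float,
                  inference_time: float,
                  network_time: float = 0.1) -> float:
    """Delay in seconds from end of speech to text on screen."""
    return chunk_duration + inference_time + network_time

print(total_latency(3.0, 0.5))  # 3.6 -> close to the ~4s default experience
print(total_latency(1.5, 0.5))  # 2.1 -> the "live streaming" preset below
```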

## The Chunk Duration Trade-off

### Current Setting: 3.0 seconds

Location: Settings → Audio → Chunk Duration (or `~/.local-transcription/config.yaml`)

```yaml
audio:
  chunk_duration: 3.0  # Current setting
  overlap_duration: 0.5
```

Pros:

- Good accuracy (Whisper has full sentence context)
- Lower CPU usage (fewer API calls)
- Better for long sentences

Cons:

- High latency (~4 seconds)
- Feels "laggy" for real-time use

### For Live Streaming (Lower Latency Priority)

```yaml
audio:
  chunk_duration: 1.5  # ← Change this
  overlap_duration: 0.3
```

Result:

- Latency: ~2-2.5 seconds (much better!)
- Accuracy: Still good for most speech
- CPU: Moderate increase

### For Podcasting (Accuracy Priority)

```yaml
audio:
  chunk_duration: 4.0
  overlap_duration: 0.5
```

Result:

- Latency: ~5 seconds (high)
- Accuracy: Best (full sentences)
- CPU: Lowest

### For Real-Time Captions (Lowest Latency)

```yaml
audio:
  chunk_duration: 1.0  # Aggressive!
  overlap_duration: 0.2
```

Result:

- Latency: ~1.5 seconds (best possible)
- Accuracy: Lower (may cut mid-word)
- CPU: Higher (more frequent processing)

**Warning:** Chunks under 1 second may cut words and reduce accuracy significantly.

### For Gaming/Commentary (Balanced)

```yaml
audio:
  chunk_duration: 2.0
  overlap_duration: 0.3
```

Result:

- Latency: ~2.5-3 seconds (good balance)
- Accuracy: Good
- CPU: Moderate

## How to Change Settings

### Method 1: GUI Settings

  1. Open Local Transcription app
  2. Click Settings
  3. Find "Audio" section
  4. Adjust "Chunk Duration" slider
  5. Click Save
  6. Restart transcription

### Method 2: Edit the Config File

1. Stop the app
2. Edit `~/.local-transcription/config.yaml`
3. Change:

   ```yaml
   audio:
     chunk_duration: 1.5  # Your desired value
   ```

4. Save the file
5. Restart the app
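If you prefer to script the change, here is a minimal sketch using PyYAML (an assumption; the app may manage this file itself, so back it up first):

```python
# Sketch: bump chunk_duration in the config file named above.
# Assumes PyYAML is installed; back up the file before running.
from pathlib import Path
import yaml

config_path = Path.home() / ".local-transcription" / "config.yaml"

config = yaml.safe_load(config_path.read_text()) or {}
config.setdefault("audio", {})["chunk_duration"] = 1.5  # your desired value

config_path.write_text(yaml.safe_dump(config, default_flow_style=False))
print(f"Updated {config_path}")
```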

## Testing Different Settings

Quick test procedure:

  1. Set chunk_duration to different values
  2. Start transcription
  3. Speak a sentence
  4. Note the time until it appears (see the stopwatch helper below)
  5. Check accuracy
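
For step 4, a two-keypress stopwatch is enough (a throwaway helper, not part of the app):

```python
# Tiny stopwatch: Enter when you stop speaking, Enter when text appears.
import time

input("Press Enter the moment you finish speaking...")
start = time.perf_counter()
input("Press Enter the moment the caption appears...")
print(f"Perceived latency: {time.perf_counter() - start:.2f}s")
```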

Example results:

| Chunk Duration | Latency | Accuracy  | CPU Usage   | Best For           |
|----------------|---------|-----------|-------------|--------------------|
| 1.0s           | ~1.5s   | Fair      | High        | Real-time captions |
| 1.5s           | ~2.0s   | Good      | Medium-High | Live streaming     |
| 2.0s           | ~2.5s   | Good      | Medium      | Gaming commentary  |
| 3.0s           | ~4.0s   | Very Good | Low         | Default (balanced) |
| 4.0s           | ~5.0s   | Excellent | Very Low    | Podcasts           |
| 5.0s           | ~6.0s   | Best      | Lowest      | Post-production    |

## Model Size Impact

The model size also affects processing time:

| Model  | Parameters | GPU Time | CPU Time | Accuracy  |
|--------|------------|----------|----------|-----------|
| tiny   | 39M        | ~200ms   | ~800ms   | Fair      |
| base   | 74M        | ~400ms   | ~1500ms  | Good      |
| small  | 244M       | ~800ms   | ~3000ms  | Very Good |
| medium | 769M       | ~1500ms  | ~6000ms  | Excellent |
| large  | 1550M      | ~3000ms  | ~12000ms | Best      |
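
To measure inference time on your own hardware, here is a minimal sketch using the `openai-whisper` package (an assumption; substitute whatever engine the app uses, and `sample.wav` is a placeholder file name):

```python
# Rough inference benchmark for one model size.
# Assumes the openai-whisper package and a short local audio file.
import time
import whisper

model = whisper.load_model("base")       # try "tiny", "small", ...

start = time.perf_counter()
result = model.transcribe("sample.wav")  # placeholder file name
print(f"Inference: {(time.perf_counter() - start) * 1000:.0f}ms")
print(f"Text: {result['text']!r}")

# Tip: run the transcribe call twice; the first call includes warm-up.
```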

For low latency:

- Use the base or tiny model
- Use GPU if available
- Reduce `chunk_duration`

Example fast setup:

```yaml
transcription:
  model: base  # or tiny
  device: cuda  # if you have a GPU

audio:
  chunk_duration: 1.5
```

Result: ~2 second total latency!


## Advanced: Streaming Transcription

For the absolute lowest latency (experimental):

```yaml
audio:
  chunk_duration: 0.8  # Very aggressive!
  overlap_duration: 0.4  # High overlap to prevent cutoffs

processing:
  use_vad: true  # Skip silent chunks
  min_confidence: 0.3  # Lower threshold (more permissive)
```

Trade-offs:

- Latency: ~1 second
- May cut words frequently
- More processing overhead
- Some gibberish in output
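
The `use_vad` option skips chunks that contain no speech. The gating idea looks roughly like this (a sketch built on the `webrtcvad` package; the app's actual VAD implementation may differ):

```python
# Sketch of VAD gating: only hand a chunk to Whisper if it has speech.
# Assumes webrtcvad and 16 kHz, 16-bit mono PCM audio.
import webrtcvad

SAMPLE_RATE = 16000
FRAME_BYTES = SAMPLE_RATE * 30 // 1000 * 2  # 30ms of 16-bit samples
vad = webrtcvad.Vad(2)                      # aggressiveness 0 (lax) to 3

def has_speech(chunk: bytes, threshold: float = 0.2) -> bool:
    """True if at least `threshold` of the 30ms frames are voiced."""
    frames = [chunk[i:i + FRAME_BYTES]
              for i in range(0, len(chunk) - FRAME_BYTES + 1, FRAME_BYTES)]
    voiced = sum(vad.is_speech(f, SAMPLE_RATE) for f in frames)
    return bool(frames) and voiced / len(frames) >= threshold
```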

## Why Not Make It Instant?

**Q:** Why can't `chunk_duration` be 0.1 seconds for instant transcription?

**A:** Several reasons:

1. **Whisper needs context** - it performs better with full sentences
2. **Word boundaries** - too short and you cut words mid-syllable
3. **Processing overhead** - each chunk has a fixed startup cost
4. **Model design** - Whisper expects 0.5-30 second chunks

Practical limit: ~1 second is the minimum chunk duration for decent accuracy.
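
Point 3 is easy to quantify: if each chunk costs a fixed amount of compute, short chunks push the pipeline past real time (times taken from the model table above):

```python
# Real-time factor: compute time per chunk / audio captured per chunk.
# Above 1.0, transcription falls behind and latency grows without bound.

def realtime_factor(chunk_duration: float, inference_time: float) -> float:
    return inference_time / chunk_duration

print(realtime_factor(3.0, 0.4))  # 0.13 -> base model on GPU, lots of headroom
print(realtime_factor(0.1, 0.4))  # 4.0  -> 4s of compute per 1s of audio
```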


## Server Sync Is NOT the Bottleneck

With the recent fixes, server sync adds only ~50-200ms of delay:

```text
Local display:    [3.5s] "Hello everyone"
                    ↓
Queue:            [3.5s] Instant
                    ↓
HTTP request:     [3.6s] 100ms network
                    ↓
Server display:   [3.6s] "Hello everyone"
```

Server sync delay: only ~100ms!

The real delay is audio buffering (chunk_duration).
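
For reference, the sync path is essentially a queued HTTP POST. Here is a sketch of timing one request the way the debug logs do (the endpoint URL and payload shape are placeholders, not the app's real API):

```python
# Sketch: time one sync POST; URL and payload are illustrative.
import time
import requests

session = requests.Session()  # reuse the connection across requests

start = time.perf_counter()
resp = session.post(
    "http://127.0.0.1:8080/api/transcript",  # 127.0.0.1 can avoid slow
    json={"text": "Hello everyone"},         # localhost DNS on some setups
    timeout=5,
)
ms = (time.perf_counter() - start) * 1000
print(f"[Server Sync] HTTP request: {ms:.0f}ms, Status: {resp.status_code}")
```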


## Recommended Settings

Based on "4 seconds feels too slow":

### Try This First

```yaml
audio:
  chunk_duration: 2.0  # Down from the 3.0s default
  overlap_duration: 0.3
```

Expected result: ~2.5 second total latency (much better!)

### If Still Too Slow

```yaml
audio:
  chunk_duration: 1.5  # More aggressive
  overlap_duration: 0.3

transcription:
  model: base  # Use a smaller/faster model if not already
```

Expected result: ~2 second total latency

### If You Want FAST (Accept Lower Accuracy)

```yaml
audio:
  chunk_duration: 1.0
  overlap_duration: 0.2

transcription:
  model: tiny  # Fastest model
  device: cuda  # Use GPU
```

Expected result: ~1.2 second total latency


## Monitoring Latency

With debug logging enabled, you'll see:

```text
[GUI] Sending to server sync: 'Hello everyone...'
[GUI] Queued for sync in: 0.2ms
[Server Sync] Queue delay: 15ms
[Server Sync] HTTP request: 89ms, Status: 200
```

If you see:

- Queue delay > 100ms → server sync is slow (rare)
- HTTP request > 500ms → network/server issue
- Nothing printed for 3+ seconds → waiting for the chunk to fill
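
If you capture these lines to a file, a few lines of Python can summarize them (the pattern matches the sample log format above):

```python
# Summarize HTTP request times from captured debug logs (Python 3.8+).
# Usage: python summarize_logs.py < app.log
import re
import sys

times = [int(m.group(1))
         for line in sys.stdin
         if (m := re.search(r"HTTP request: (\d+)ms", line))]

if times:
    print(f"requests: {len(times)}, "
          f"avg: {sum(times) / len(times):.0f}ms, max: {max(times)}ms")
```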

## Summary

Your 4-second delay breakdown:

  - 🐢 3.0s - Audio buffering (`chunk_duration`) ← MAIN CULPRIT
  - 0.5-1.0s - Transcription processing (model inference)
  - 0.1s - Server sync (network)

To reduce to ~2 seconds:

  1. Open Settings
  2. Change chunk_duration to 2.0
  3. Restart transcription
  4. Enjoy 2x faster captions!

To reduce to ~1.5 seconds:

  1. Change chunk_duration to 1.5
  2. Use base or tiny model
  3. Use GPU if available
  4. Accept slightly lower accuracy