# Transcription Latency Guide

## Understanding the Delay

The delay you see between speaking and the transcription appearing is NOT from server sync - it's from the audio processing pipeline.

## Where the Time Goes
```
You speak: "Hello everyone"
        ↓
┌───────────────────────────────────┐
│ 1. Audio Buffer (chunk_duration)  │
│    Default: 3.0 seconds           │  ← MAIN SOURCE OF DELAY!
│    Waiting for enough audio...    │
└───────────────────────────────────┘
        ↓  (3.0 seconds later)
┌───────────────────────────────────┐
│ 2. Transcription Processing       │
│    Whisper model inference        │
│    Time: 0.5-1.5 seconds          │  ← Depends on model size & device
│    (base model on GPU: ~500ms)    │
│    (base model on CPU: ~1500ms)   │
└───────────────────────────────────┘
        ↓  (0.5-1.5 seconds later)
┌───────────────────────────────────┐
│ 3. Display & Server Sync          │
│    - Display locally: instant     │
│    - Queue for sync: instant      │
│    - HTTP request: 50-200ms       │  ← Network time
└───────────────────────────────────┘
        ↓
Total Delay: 3.5-4.5 seconds (mostly buffer time!)
```
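As a rough sanity check, the breakdown can be expressed in a few lines of Python (a sketch only; the inference and network figures are the approximate numbers from the diagram above, not measurements from the app):

```python
def estimate_latency(chunk_duration: float,
                     inference_s: float = 0.5,   # ~base model on GPU (rough figure)
                     network_s: float = 0.1) -> float:
    """Approximate delay from finishing a phrase to seeing it synced."""
    # The audio buffer must fill before transcription can even start,
    # so chunk_duration is a hard floor on the total delay.
    return chunk_duration + inference_s + network_s

print(estimate_latency(3.0))                     # ~3.6s (default buffer, GPU)
print(estimate_latency(3.0, inference_s=1.5))    # ~4.6s (default buffer, CPU)
print(estimate_latency(1.5))                     # ~2.1s (shorter buffer, GPU)
```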
## The Chunk Duration Trade-off

### Current Setting: 3.0 seconds

Location: Settings → Audio → Chunk Duration (or `~/.local-transcription/config.yaml`)

```yaml
audio:
  chunk_duration: 3.0      # Current setting
  overlap_duration: 0.5
```
Pros:
- ✅ Good accuracy (Whisper has full sentence context)
- ✅ Lower CPU usage (fewer API calls)
- ✅ Better for long sentences
Cons:
- ❌ High latency (~4 seconds)
- ❌ Feels "laggy" for real-time use
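The CPU side of this trade-off is easy to quantify: shorter chunks mean more Whisper calls per minute. A minimal sketch of that arithmetic, assuming a new chunk is emitted every `chunk_duration - overlap_duration` seconds (the app's actual scheduling may differ):

```python
def chunk_stats(chunk_duration: float, overlap_duration: float):
    """Seconds of new audio per chunk and transcription calls per minute."""
    stride = chunk_duration - overlap_duration   # new audio covered by each chunk
    return stride, 60.0 / stride                 # calls per minute of speech

for chunk, overlap in [(3.0, 0.5), (2.0, 0.3), (1.0, 0.2)]:
    stride, per_min = chunk_stats(chunk, overlap)
    print(f"chunk={chunk}s overlap={overlap}s -> "
          f"{stride:.1f}s new audio/chunk, ~{per_min:.0f} calls/min")
```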
## Recommended Settings by Use Case

### For Live Streaming (Lower Latency Priority)

```yaml
audio:
  chunk_duration: 1.5      # ← Change this
  overlap_duration: 0.3
```
Result:
- Latency: ~2-2.5 seconds (much better!)
- Accuracy: Still good for most speech
- CPU: Moderate increase
### For Podcasting (Accuracy Priority)

```yaml
audio:
  chunk_duration: 4.0
  overlap_duration: 0.5
```
Result:
- Latency: ~5 seconds (high)
- Accuracy: Best (full sentences)
- CPU: Lowest
### For Real-Time Captions (Lowest Latency)

```yaml
audio:
  chunk_duration: 1.0      # Aggressive!
  overlap_duration: 0.2
```
Result:
- Latency: ~1.5 seconds (best possible)
- Accuracy: Lower (may cut mid-word)
- CPU: Higher (more frequent processing)
Warning: Chunks < 1 second may cut words and reduce accuracy significantly.
### For Gaming/Commentary (Balanced)

```yaml
audio:
  chunk_duration: 2.0
  overlap_duration: 0.3
```
Result:
- Latency: ~2.5-3 seconds (good balance)
- Accuracy: Good
- CPU: Moderate
## How to Change Settings

### Method 1: Settings Dialog (Recommended)
- Open Local Transcription app
- Click Settings
- Find "Audio" section
- Adjust "Chunk Duration" slider
- Click Save
- Restart transcription
### Method 2: Edit Config File

- Stop the app
- Edit `~/.local-transcription/config.yaml`
- Change:

  ```yaml
  audio:
    chunk_duration: 1.5    # Your desired value
  ```

- Save file
- Restart app
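If you would rather script the change, the config is plain YAML, so a few lines of Python work too (a sketch assuming PyYAML is installed and the `audio.chunk_duration` key shown above; note that dumping the file this way drops any comments it contained):

```python
from pathlib import Path
import yaml  # pip install pyyaml

config_path = Path.home() / ".local-transcription" / "config.yaml"

config = yaml.safe_load(config_path.read_text()) or {}
config.setdefault("audio", {})["chunk_duration"] = 1.5   # your desired value

config_path.write_text(yaml.safe_dump(config, sort_keys=False))
print("Updated", config_path)
```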
## Testing Different Settings
Quick test procedure:
- Set chunk_duration to different values
- Start transcription
- Speak a sentence
- Note the time until it appears
- Check accuracy
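To make the "note the time" step less guesswork, a tiny manual stopwatch is enough (press Enter the moment you start speaking, press Enter again when the caption appears):

```python
import time

input("Press Enter the moment you start speaking... ")
start = time.perf_counter()
input("Press Enter when the transcription appears... ")
print(f"Speak-to-caption latency: {time.perf_counter() - start:.1f}s")
```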
Example results:
| Chunk Duration | Latency | Accuracy | CPU Usage | Best For |
|---|---|---|---|---|
| 1.0s | ~1.5s | Fair | High | Real-time captions |
| 1.5s | ~2.0s | Good | Medium-High | Live streaming |
| 2.0s | ~2.5s | Good | Medium | Gaming commentary |
| 3.0s | ~4.0s | Very Good | Low | Default (balanced) |
| 4.0s | ~5.0s | Excellent | Very Low | Podcasts |
| 5.0s | ~6.0s | Best | Lowest | Post-production |
## Model Size Impact
The model size also affects processing time:
| Model | Parameters | GPU Time | CPU Time | Accuracy |
|---|---|---|---|---|
| tiny | 39M | ~200ms | ~800ms | Fair |
| base | 74M | ~400ms | ~1500ms | Good |
| small | 244M | ~800ms | ~3000ms | Very Good |
| medium | 769M | ~1500ms | ~6000ms | Excellent |
| large | 1550M | ~3000ms | ~12000ms | Best |
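To check what your own hardware does, you can time a single inference pass directly (a sketch assuming the `openai-whisper` package and a short local clip named `sample.wav`; the app itself may use a different Whisper backend):

```python
import time
import whisper  # pip install openai-whisper

model = whisper.load_model("base")        # try "tiny", "small", ...

start = time.perf_counter()
result = model.transcribe("sample.wav")   # a short clip, e.g. ~3 seconds of speech
elapsed = time.perf_counter() - start

print(f"Inference time: {elapsed * 1000:.0f}ms")
print("Text:", result["text"])
```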
For low latency:

- Use `base` or `tiny` model
- Use GPU if available
- Reduce `chunk_duration`
Example fast setup:

```yaml
transcription:
  model: base      # or tiny
  device: cuda     # if you have GPU

audio:
  chunk_duration: 1.5
```
Result: ~2 second total latency!
## Advanced: Streaming Transcription

For the absolute lowest latency (experimental):

```yaml
audio:
  chunk_duration: 0.8      # Very aggressive!
  overlap_duration: 0.4    # High overlap to prevent cutoffs

processing:
  use_vad: true            # Skip silent chunks (see sketch below)
  min_confidence: 0.3      # Lower threshold (more permissive)
```
Trade-offs:
- ✅ Latency: ~1 second
- ❌ May cut words frequently
- ❌ More processing overhead
- ❌ Some gibberish in output
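The `use_vad` option above is what skips silent chunks. The project's actual VAD isn't documented here; the sketch below shows the basic idea with a simple RMS energy threshold, assuming audio chunks arrive as NumPy float arrays:

```python
import numpy as np

def is_speech(chunk: np.ndarray, threshold: float = 0.01) -> bool:
    """Crude VAD: treat a chunk as speech if its RMS energy exceeds a threshold."""
    rms = float(np.sqrt(np.mean(chunk.astype(np.float64) ** 2)))
    return rms > threshold

# In a pipeline, silent chunks would simply not be sent to Whisper:
# if not is_speech(audio_chunk):
#     continue
```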
## Why Not Make It Instant?
Q: Why can't chunk_duration be 0.1 seconds for instant transcription?
A: Several reasons:
- Whisper needs context - It performs better with full sentences
- Word boundaries - Too short and you cut words mid-syllable
- Processing overhead - Each chunk has startup cost
- Model design - Whisper expects 0.5-30 second chunks
Physical limit: ~1 second is the practical minimum for decent accuracy.
## Server Sync Is NOT the Bottleneck

With the recent fixes, server sync adds only ~50-200ms of delay:

```
Local display:   [3.5s] "Hello everyone"
        ↓
Queue:           [3.5s] Instant
        ↓
HTTP request:    [3.6s] 100ms network
        ↓
Server display:  [3.6s] "Hello everyone"
```

Server sync delay: Only 100ms!

The real delay is audio buffering (`chunk_duration`).
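If you want to verify the network portion on your own setup, timing one sync request is enough (a sketch using `requests`; the endpoint path and payload are placeholders, not the app's actual API):

```python
import time
import requests

url = "http://127.0.0.1:3000/api/transcript"          # placeholder endpoint
payload = {"room": "demo", "text": "Hello everyone"}  # placeholder payload

start = time.perf_counter()
resp = requests.post(url, json=payload, timeout=5)
print(f"HTTP request: {(time.perf_counter() - start) * 1000:.0f}ms, "
      f"Status: {resp.status_code}")
```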
## Recommended Settings for Your Use Case

Based on "4 seconds feels too slow":

### Try This First

```yaml
audio:
  chunk_duration: 2.0      # Roughly halves the current ~4-second delay
  overlap_duration: 0.3
```
Expected result: ~2.5 second total latency (much better!)
### If Still Too Slow

```yaml
audio:
  chunk_duration: 1.5      # More aggressive
  overlap_duration: 0.3

transcription:
  model: base              # Use smaller/faster model if not already
```
Expected result: ~2 second total latency
### If You Want FAST (Accept Lower Accuracy)

```yaml
audio:
  chunk_duration: 1.0
  overlap_duration: 0.2

transcription:
  model: tiny              # Fastest model
  device: cuda             # Use GPU
```
Expected result: ~1.2 second total latency
## Monitoring Latency

With the debug logging we just added, you'll see:

```
[GUI] Sending to server sync: 'Hello everyone...'
[GUI] Queued for sync in: 0.2ms
[Server Sync] Queue delay: 15ms
[Server Sync] HTTP request: 89ms, Status: 200
```
If you see:
- Queue delay > 100ms → Server sync is slow (rare)
- HTTP request > 500ms → Network/server issue
- Nothing printed for 3+ seconds → Waiting for chunk to fill
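If you redirect that output to a file, a few lines of parsing will summarize it (a sketch assuming the log lines follow exactly the format shown above and were captured to a hypothetical `sync_debug.log`):

```python
import re
import statistics

times = []
with open("sync_debug.log") as f:        # hypothetical capture of the debug output
    for line in f:
        m = re.search(r"HTTP request: (\d+)ms", line)
        if m:
            times.append(int(m.group(1)))

if times:
    print(f"{len(times)} requests, avg {statistics.mean(times):.0f}ms, max {max(times)}ms")
```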
## Summary
Your 4-second delay breakdown:
- 🐢 3.0s - Audio buffering (chunk_duration) ← MAIN CULPRIT
- ⚡ 0.5-1.0s - Transcription processing (model inference)
- ⚡ 0.1s - Server sync (network)
To reduce to ~2 seconds:
- Open Settings
- Change chunk_duration to 2.0
- Restart transcription
- Enjoy 2x faster captions!
To reduce to ~1.5 seconds:
- Change chunk_duration to 1.5
- Use `base` or `tiny` model
- Use GPU if available
- Accept slightly lower accuracy