# Transcription Latency Guide

## Understanding the Delay

The delay you see between speaking and the transcription appearing is NOT from server sync - it's from the audio processing pipeline.

## Where the Time Goes

```text
You speak: "Hello everyone"
    ↓
┌─────────────────────────────────────────────┐
│ 1. Audio Buffer (chunk_duration)            │
│    Default: 3.0 seconds                     │ ← MAIN SOURCE OF DELAY!
│    Waiting for enough audio...              │
└─────────────────────────────────────────────┘
    ↓ (3.0 seconds later)
┌─────────────────────────────────────────────┐
│ 2. Transcription Processing                 │
│    Whisper model inference                  │
│    Time: 0.5-1.5 seconds                    │ ← Depends on model size & device
│    (base model on GPU: ~500ms)              │
│    (base model on CPU: ~1500ms)             │
└─────────────────────────────────────────────┘
    ↓ (0.5-1.5 seconds later)
┌─────────────────────────────────────────────┐
│ 3. Display & Server Sync                    │
│    - Display locally: instant               │
│    - Queue for sync: instant                │
│    - HTTP request: 50-200ms                 │ ← Network time
└─────────────────────────────────────────────┘
    ↓
Total Delay: 3.5-4.5 seconds (mostly buffer time!)
```
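To make that budget concrete, here is a minimal sketch (illustrative only; the component times are this guide's estimates from the diagram above, not measurements):

```python
# Back-of-envelope latency budget for the pipeline above.
# Component times are this guide's estimates, not measurements.

def total_latency(chunk_duration: float,
                  inference_time: float,
                  network_time: float = 0.1) -> float:
    """Delay in seconds from end of speech to text on screen."""
    return chunk_duration + inference_time + network_time

print(total_latency(3.0, 0.5))  # 3.6 -> close to the ~4s default experience
print(total_latency(1.5, 0.5))  # 2.1 -> the "live streaming" preset below
```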

## The Chunk Duration Trade-off

### Current Setting: 3.0 seconds

Location: Settings → Audio → Chunk Duration (or `~/.local-transcription/config.yaml`)

```yaml
audio:
  chunk_duration: 3.0  # Current setting
  overlap_duration: 0.5
```

Pros:

- Good accuracy (Whisper has full sentence context)
- Lower CPU usage (fewer API calls)
- Better for long sentences

Cons:

- High latency (~4 seconds)
- Feels "laggy" for real-time use

### For Live Streaming (Lower Latency Priority)

```yaml
audio:
  chunk_duration: 1.5  # ← Change this
  overlap_duration: 0.3
```

Result:

- Latency: ~2-2.5 seconds (much better!)
- Accuracy: Still good for most speech
- CPU: Moderate increase

### For Podcasting (Accuracy Priority)

```yaml
audio:
  chunk_duration: 4.0
  overlap_duration: 0.5
```

Result:

- Latency: ~5 seconds (high)
- Accuracy: Best (full sentences)
- CPU: Lowest

### For Real-Time Captions (Lowest Latency)

```yaml
audio:
  chunk_duration: 1.0  # Aggressive!
  overlap_duration: 0.2
```

Result:

- Latency: ~1.5 seconds (best possible)
- Accuracy: Lower (may cut mid-word)
- CPU: Higher (more frequent processing)

**Warning:** Chunks under 1 second may cut words and reduce accuracy significantly.

### For Gaming/Commentary (Balanced)

```yaml
audio:
  chunk_duration: 2.0
  overlap_duration: 0.3
```

Result:

- Latency: ~2.5-3 seconds (good balance)
- Accuracy: Good
- CPU: Moderate

## How to Change Settings

### Method 1: GUI Settings

  1. Open Local Transcription app
  2. Click Settings
  3. Find "Audio" section
  4. Adjust "Chunk Duration" slider
  5. Click Save
  6. Restart transcription

### Method 2: Edit the Config File

1. Stop the app
2. Edit `~/.local-transcription/config.yaml`
3. Change:

   ```yaml
   audio:
     chunk_duration: 1.5  # Your desired value
   ```

4. Save the file
5. Restart the app
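If you prefer to script the change, here is a minimal sketch using PyYAML (an assumption; the app may manage this file itself, so back it up first):

```python
# Sketch: bump chunk_duration in the config file named above.
# Assumes PyYAML is installed; back up the file before running.
from pathlib import Path
import yaml

config_path = Path.home() / ".local-transcription" / "config.yaml"

config = yaml.safe_load(config_path.read_text()) or {}
config.setdefault("audio", {})["chunk_duration"] = 1.5  # your desired value

config_path.write_text(yaml.safe_dump(config, default_flow_style=False))
print(f"Updated {config_path}")
```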

## Testing Different Settings

Quick test procedure:

  1. Set chunk_duration to different values
  2. Start transcription
  3. Speak a sentence
  4. Note the time until it appears (see the stopwatch helper below)
  5. Check accuracy
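
For step 4, a two-keypress stopwatch is enough (a throwaway helper, not part of the app):

```python
# Tiny stopwatch: Enter when you stop speaking, Enter when text appears.
import time

input("Press Enter the moment you finish speaking...")
start = time.perf_counter()
input("Press Enter the moment the caption appears...")
print(f"Perceived latency: {time.perf_counter() - start:.2f}s")
```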

Example results:

| Chunk Duration | Latency | Accuracy  | CPU Usage   | Best For           |
|----------------|---------|-----------|-------------|--------------------|
| 1.0s           | ~1.5s   | Fair      | High        | Real-time captions |
| 1.5s           | ~2.0s   | Good      | Medium-High | Live streaming     |
| 2.0s           | ~2.5s   | Good      | Medium      | Gaming commentary  |
| 3.0s           | ~4.0s   | Very Good | Low         | Default (balanced) |
| 4.0s           | ~5.0s   | Excellent | Very Low    | Podcasts           |
| 5.0s           | ~6.0s   | Best      | Lowest      | Post-production    |

## Model Size Impact

The model size also affects processing time:

| Model  | Parameters | GPU Time | CPU Time | Accuracy  |
|--------|------------|----------|----------|-----------|
| tiny   | 39M        | ~200ms   | ~800ms   | Fair      |
| base   | 74M        | ~400ms   | ~1500ms  | Good      |
| small  | 244M       | ~800ms   | ~3000ms  | Very Good |
| medium | 769M       | ~1500ms  | ~6000ms  | Excellent |
| large  | 1550M      | ~3000ms  | ~12000ms | Best      |
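
To measure inference time on your own hardware, here is a minimal sketch using the `openai-whisper` package (an assumption; substitute whatever engine the app uses, and `sample.wav` is a placeholder file name):

```python
# Rough inference benchmark for one model size.
# Assumes the openai-whisper package and a short local audio file.
import time
import whisper

model = whisper.load_model("base")       # try "tiny", "small", ...

start = time.perf_counter()
result = model.transcribe("sample.wav")  # placeholder file name
print(f"Inference: {(time.perf_counter() - start) * 1000:.0f}ms")
print(f"Text: {result['text']!r}")

# Tip: run the transcribe call twice; the first call includes warm-up.
```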

For low latency:

- Use the base or tiny model
- Use GPU if available
- Reduce `chunk_duration`

Example fast setup:

```yaml
transcription:
  model: base  # or tiny
  device: cuda  # if you have a GPU

audio:
  chunk_duration: 1.5
```

Result: ~2 second total latency!


## Advanced: Streaming Transcription

For the absolute lowest latency (experimental):

```yaml
audio:
  chunk_duration: 0.8  # Very aggressive!
  overlap_duration: 0.4  # High overlap to prevent cutoffs

processing:
  use_vad: true  # Skip silent chunks
  min_confidence: 0.3  # Lower threshold (more permissive)
```

Trade-offs:

- Latency: ~1 second
- May cut words frequently
- More processing overhead
- Some gibberish in output
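
The `use_vad` option skips chunks that contain no speech. The gating idea looks roughly like this (a sketch built on the `webrtcvad` package; the app's actual VAD implementation may differ):

```python
# Sketch of VAD gating: only hand a chunk to Whisper if it has speech.
# Assumes webrtcvad and 16 kHz, 16-bit mono PCM audio.
import webrtcvad

SAMPLE_RATE = 16000
FRAME_BYTES = SAMPLE_RATE * 30 // 1000 * 2  # 30ms of 16-bit samples
vad = webrtcvad.Vad(2)                      # aggressiveness 0 (lax) to 3

def has_speech(chunk: bytes, threshold: float = 0.2) -> bool:
    """True if at least `threshold` of the 30ms frames are voiced."""
    frames = [chunk[i:i + FRAME_BYTES]
              for i in range(0, len(chunk) - FRAME_BYTES + 1, FRAME_BYTES)]
    voiced = sum(vad.is_speech(f, SAMPLE_RATE) for f in frames)
    return bool(frames) and voiced / len(frames) >= threshold
```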

## Why Not Make It Instant?

**Q:** Why can't `chunk_duration` be 0.1 seconds for instant transcription?

**A:** Several reasons:

1. **Whisper needs context** - it performs better with full sentences
2. **Word boundaries** - too short and you cut words mid-syllable
3. **Processing overhead** - each chunk has a fixed startup cost
4. **Model design** - Whisper expects 0.5-30 second chunks

Practical limit: ~1 second is the minimum chunk duration for decent accuracy.
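
Point 3 is easy to quantify: if each chunk costs a fixed amount of compute, short chunks push the pipeline past real time (times taken from the model table above):

```python
# Real-time factor: compute time per chunk / audio captured per chunk.
# Above 1.0, transcription falls behind and latency grows without bound.

def realtime_factor(chunk_duration: float, inference_time: float) -> float:
    return inference_time / chunk_duration

print(realtime_factor(3.0, 0.4))  # 0.13 -> base model on GPU, lots of headroom
print(realtime_factor(0.1, 0.4))  # 4.0  -> 4s of compute per 1s of audio
```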


## Server Sync Is NOT the Bottleneck

With the recent fixes, server sync adds only ~50-200ms of delay:

```text
Local display:    [3.5s] "Hello everyone"
                    ↓
Queue:            [3.5s] Instant
                    ↓
HTTP request:     [3.6s] 100ms network
                    ↓
Server display:   [3.6s] "Hello everyone"
```

Server sync delay: only ~100ms!

The real delay is audio buffering (chunk_duration).
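
For reference, the sync path is essentially a queued HTTP POST. Here is a sketch of timing one request the way the debug logs do (the endpoint URL and payload shape are placeholders, not the app's real API):

```python
# Sketch: time one sync POST; URL and payload are illustrative.
import time
import requests

session = requests.Session()  # reuse the connection across requests

start = time.perf_counter()
resp = session.post(
    "http://127.0.0.1:8080/api/transcript",  # 127.0.0.1 can avoid slow
    json={"text": "Hello everyone"},         # localhost DNS on some setups
    timeout=5,
)
ms = (time.perf_counter() - start) * 1000
print(f"[Server Sync] HTTP request: {ms:.0f}ms, Status: {resp.status_code}")
```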


## Recommended Settings

Based on "4 seconds feels too slow":

### Try This First

```yaml
audio:
  chunk_duration: 2.0  # Down from the 3.0s default
  overlap_duration: 0.3
```

Expected result: ~2.5 second total latency (much better!)

### If Still Too Slow

```yaml
audio:
  chunk_duration: 1.5  # More aggressive
  overlap_duration: 0.3

transcription:
  model: base  # Use a smaller/faster model if not already
```

Expected result: ~2 second total latency

### If You Want FAST (Accept Lower Accuracy)

```yaml
audio:
  chunk_duration: 1.0
  overlap_duration: 0.2

transcription:
  model: tiny  # Fastest model
  device: cuda  # Use GPU
```

Expected result: ~1.2 second total latency


## Monitoring Latency

With debug logging enabled, you'll see:

```text
[GUI] Sending to server sync: 'Hello everyone...'
[GUI] Queued for sync in: 0.2ms
[Server Sync] Queue delay: 15ms
[Server Sync] HTTP request: 89ms, Status: 200
```

If you see:

- Queue delay > 100ms → server sync is slow (rare)
- HTTP request > 500ms → network/server issue
- Nothing printed for 3+ seconds → waiting for the chunk to fill
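
If you capture these lines to a file, a few lines of Python can summarize them (the pattern matches the sample log format above):

```python
# Summarize HTTP request times from captured debug logs (Python 3.8+).
# Usage: python summarize_logs.py < app.log
import re
import sys

times = [int(m.group(1))
         for line in sys.stdin
         if (m := re.search(r"HTTP request: (\d+)ms", line))]

if times:
    print(f"requests: {len(times)}, "
          f"avg: {sum(times) / len(times):.0f}ms, max: {max(times)}ms")
```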

## Summary

Your 4-second delay breakdown:

  - 🐢 3.0s - Audio buffering (`chunk_duration`) ← MAIN CULPRIT
  - 0.5-1.0s - Transcription processing (model inference)
  - 0.1s - Server sync (network)

To reduce to ~2 seconds:

  1. Open Settings
  2. Change chunk_duration to 2.0
  3. Restart transcription
  4. Enjoy 2x faster captions!

To reduce to ~1.5 seconds:

  1. Change chunk_duration to 1.5
  2. Use base or tiny model
  3. Use GPU if available
  4. Accept slightly lower accuracy