# Real-Time Whisper Streaming: Solving Chunk Boundary Word Loss

The chunk boundary word loss problem in streaming Whisper transcription is best solved by replacing time-based chunking with **VAD-based segmentation** combined with the **LocalAgreement algorithm**. The most effective 2025 solutions are **WhisperLiveKit** for a turnkey approach, **RealtimeSTT** for simple integration, or implementing **faster-whisper with Silero VAD** for maximum control. Each approach eliminates word loss by processing complete speech utterances and confirming transcriptions only when consecutive outputs agree.

## The core problem and why your current approach fails

Time-based chunking (e.g., every 3 seconds) creates artificial boundaries that frequently cut words mid-utterance. Whisper was trained on **30-second segments** and performs poorly when given truncated audio at arbitrary points. The result is word loss at chunk boundaries, hallucinations on silence-padded segments, and inconsistent transcription quality.

The solution combines two techniques: **VAD-based segmentation** to detect natural speech boundaries instead of arbitrary time cuts, and the **LocalAgreement algorithm** to confirm only stable transcriptions that appear consistently across multiple processing passes.

## whisper-streaming and the LocalAgreement algorithm

The **ufal/whisper_streaming** library (3.4k stars, MIT license) pioneered the LocalAgreement-n approach for streaming Whisper. However, it is now **being superseded by SimulStreaming** in 2025; the authors recommend transitioning to the newer project for optimal performance.

**How LocalAgreement-2 works:**

1. Maintain a rolling audio buffer (up to ~30 seconds)
2. Process the entire buffer through Whisper, getting transcription T1
3. Add a new audio chunk, process again, getting T2
4. Find the longest common prefix between T1 and T2
5. Emit only the matching prefix as "confirmed" output
6. Display the unmatched portion as "tentative" (it may still change)
7. Trim the buffer at sentence boundaries to prevent unbounded memory growth

This approach solves word loss because text is only emitted when **two consecutive Whisper passes agree**, ensuring stability. The expected latency is approximately **2× the chunk size** (e.g., about 2 seconds of latency for 1-second chunks). A standalone sketch of the prefix-matching step appears at the end of this section.

```python
from whisper_online import FasterWhisperASR, OnlineASRProcessor

# Initialize with the faster-whisper backend
asr = FasterWhisperASR("en", "large-v2")
asr.use_vad()  # Enable Silero VAD
online = OnlineASRProcessor(asr)

# Main processing loop
while audio_has_not_ended:          # placeholder for your capture loop
    chunk = get_audio_chunk()       # 16 kHz mono float32
    online.insert_audio_chunk(chunk)
    output = online.process_iter()
    if output and output[0] is not None:  # nothing confirmed yet yields empty timestamps
        beg, end, text = output
        print(f"[{beg:.1f}s-{end:.1f}s] {text}")

# Finalize remaining audio
final = online.finish()
```

**Key parameters for low-latency captioning:**

- `--min-chunk-size 0.5`: process every 500ms (lower = more responsive)
- `--buffer_trimming segment`: trim at Whisper segment boundaries (default)
- `--vac`: enable the Voice Activity Controller for paused speech
- `--backend faster-whisper`: use the GPU-accelerated backend

**Installation:**

```bash
pip install librosa soundfile
pip install faster-whisper      # GPU: requires CUDA 11.7+ and cuDNN 8.5+
pip install torch torchaudio    # For Silero VAD
```
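The confirmation rule in steps 4-6 boils down to a word-level longest-common-prefix comparison between two consecutive hypotheses. A minimal standalone sketch of just that step (an illustration, not the library's internal code):

```python
def agreed_prefix(prev_words: list[str], curr_words: list[str]) -> list[str]:
    """Longest common prefix of two hypotheses, compared case-insensitively."""
    confirmed = []
    for prev, curr in zip(prev_words, curr_words):
        if prev.lower() != curr.lower():
            break
        confirmed.append(curr)
    return confirmed

# Two consecutive passes over a growing buffer:
t1 = "the quick brown fox jumped".split()
t2 = "the quick brown fox jumps over the".split()
print(agreed_prefix(t1, t2))  # ['the', 'quick', 'brown', 'fox'] -> emitted as confirmed
```

The from-scratch implementation later in this document applies the same comparison to word objects that also carry timestamps.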
## RealtimeSTT offers the simplest integration

**RealtimeSTT** (KoljaB/RealtimeSTT, **8.9k stars**) provides the most straightforward integration path. It uses a dual-layer VAD system (WebRTC for fast detection plus Silero for accurate verification) and handles chunk boundaries through pre-recording buffers rather than algorithmic agreement.

**How it prevents word loss:**

- **Pre-recording buffer** (default 0.2s): captures audio from before the VAD triggers, preventing missed word starts
- **Post-speech silence detection** (default 0.2s): waits for silence before ending an utterance, preventing truncated endings
- **Dual-model architecture**: uses a tiny model for the real-time preview and a larger model for the final transcription

```python
from RealtimeSTT import AudioToTextRecorder

def on_realtime_update(text):
    print(f"\r[LIVE] {text}", end="", flush=True)

def on_final_text(text):
    print(f"\n[FINAL] {text}")

if __name__ == '__main__':
    recorder = AudioToTextRecorder(
        # Model configuration
        model="small.en",                    # Final transcription model
        language="en",                       # Skip language detection
        device="cuda",
        compute_type="float16",

        # Real-time preview
        enable_realtime_transcription=True,
        realtime_model_type="tiny.en",       # Fast model for live updates
        realtime_processing_pause=0.1,       # Update every 100ms
        use_main_model_for_realtime=False,

        # VAD tuning for low latency
        silero_sensitivity=0.4,              # Lower = fewer false positives
        silero_use_onnx=True,                # Faster VAD inference
        webrtc_sensitivity=3,                # Most aggressive
        post_speech_silence_duration=0.3,    # End sentence after 300ms of silence
        pre_recording_buffer_duration=0.2,   # Capture 200ms before VAD triggers

        # Performance optimization
        beam_size=2,                         # Speed/accuracy balance
        beam_size_realtime=1,                # Fastest for preview
        early_transcription_on_silence=200,  # Start transcribing 200ms into silence

        # Callbacks
        on_realtime_transcription_update=on_realtime_update,
    )

    while True:
        recorder.text(on_final_text)
```

**Installation:**

```bash
pip install RealtimeSTT

# GPU support (highly recommended)
pip install torch==2.5.1+cu118 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu118

# Linux prerequisites
sudo apt-get install python3-dev portaudio19-dev
```

**Important caveat:** RealtimeSTT is now **community-maintained**; the original author no longer actively develops new features. It remains functional and widely used, but for maximum future-proofing, consider WhisperLiveKit.

## faster-whisper with Silero VAD gives maximum control

For a custom implementation with full control, **faster-whisper** (SYSTRAN, 19k stars) with **Silero VAD** integration provides the best foundation. This approach replaces time-based chunking with speech-boundary segmentation.

**faster-whisper VAD parameters for real-time use:**

| Parameter | Default | Real-Time Recommended | Purpose |
|-----------|---------|-----------------------|---------|
| `threshold` | 0.5 | 0.5 | Speech probability threshold |
| `min_speech_duration_ms` | 250 | 250 | Minimum speech chunk length |
| `min_silence_duration_ms` | **2000** | **500** | Silence duration that splits segments |
| `speech_pad_ms` | **400** | **100** | Padding added around speech segments |
| `max_speech_duration_s` | inf | 30.0 | Limit on segment length |

The defaults are conservative, tuned for batch processing. For real-time captioning, **reduce `min_silence_duration_ms` to 500ms** and **`speech_pad_ms` to 100ms** for faster response.
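If you rely on faster-whisper's built-in VAD filter rather than running segmentation yourself, these tuned values can be passed straight to `transcribe()`. A minimal sketch; the audio path is a placeholder:

```python
from faster_whisper import WhisperModel

model = WhisperModel("small.en", device="cuda", compute_type="float16")

# vad_filter drops non-speech before decoding; vad_parameters overrides the defaults
segments, info = model.transcribe(
    "utterance.wav",                  # placeholder: any 16 kHz mono file or float32 array
    language="en",
    vad_filter=True,
    vad_parameters=dict(
        threshold=0.5,
        min_speech_duration_ms=250,
        min_silence_duration_ms=500,  # down from the 2000ms default
        speech_pad_ms=100,            # down from the 400ms default
    ),
)
for segment in segments:
    print(f"[{segment.start:.1f}s-{segment.end:.1f}s] {segment.text}")
```

The full example below goes one step further: it runs Silero VAD itself on the live microphone stream and only hands complete utterances to Whisper, so the built-in filter stays off.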
```python """ Complete real-time transcription with faster-whisper and Silero VAD """ import torch import numpy as np import sounddevice as sd from faster_whisper import WhisperModel import queue import threading SAMPLE_RATE = 16000 CHUNK_MS = 100 CHUNK_SIZE = int(SAMPLE_RATE * CHUNK_MS / 1000) MIN_SPEECH_SAMPLES = int(SAMPLE_RATE * 0.5) # 500ms minimum SILENCE_CHUNKS_TO_END = 7 # 700ms of silence ends speech class RealtimeTranscriber: def __init__(self, model_size="small", device="cuda"): # Load Whisper self.whisper = WhisperModel( model_size, device=device, compute_type="float16" if device == "cuda" else "int8" ) # Load Silero VAD self.vad_model, _ = torch.hub.load( 'snakers4/silero-vad', 'silero_vad', force_reload=False ) # State self.audio_queue = queue.Queue() self.speech_buffer = [] self.pre_roll_buffer = [] # Captures audio before speech starts self.is_speaking = False self.silence_count = 0 self.running = False def audio_callback(self, indata, frames, time, status): self.audio_queue.put(indata.copy()) def process_audio(self): while self.running: try: audio_chunk = self.audio_queue.get(timeout=0.1) audio_chunk = audio_chunk.flatten().astype(np.float32) # Pre-roll buffer (keeps last ~200ms before speech) self.pre_roll_buffer.append(audio_chunk) if len(self.pre_roll_buffer) > 2: self.pre_roll_buffer.pop(0) # VAD check tensor = torch.FloatTensor(audio_chunk) speech_prob = self.vad_model(tensor, SAMPLE_RATE).item() if speech_prob > 0.5: if not self.is_speaking: # Speech started - include pre-roll buffer self.is_speaking = True for pre_chunk in self.pre_roll_buffer: self.speech_buffer.extend(pre_chunk) else: self.speech_buffer.extend(audio_chunk) self.silence_count = 0 elif self.is_speaking: self.speech_buffer.extend(audio_chunk) self.silence_count += 1 if self.silence_count >= SILENCE_CHUNKS_TO_END: self.transcribe_and_reset() except queue.Empty: continue def transcribe_and_reset(self): if len(self.speech_buffer) < MIN_SPEECH_SAMPLES: self.reset_state() return audio_array = np.array(self.speech_buffer, dtype=np.float32) segments, _ = self.whisper.transcribe( audio_array, beam_size=2, language="en", vad_filter=False, # Already VAD-processed condition_on_previous_text=False ) text = " ".join(seg.text.strip() for seg in segments) if text: print(f"\n🎤 {text}") self.reset_state() def reset_state(self): self.speech_buffer = [] self.is_speaking = False self.silence_count = 0 def start(self): self.running = True threading.Thread(target=self.process_audio, daemon=True).start() print("🎙️ Listening... (Ctrl+C to stop)") with sd.InputStream( samplerate=SAMPLE_RATE, channels=1, dtype=np.float32, blocksize=CHUNK_SIZE, callback=self.audio_callback ): try: while True: sd.sleep(100) except KeyboardInterrupt: self.running = False print("\n⏹️ Stopped") if __name__ == "__main__": transcriber = RealtimeTranscriber(model_size="small", device="cuda") transcriber.start() ``` ## WhisperLiveKit is the most complete 2025 solution **WhisperLiveKit** (QuentinFuxa/WhisperLiveKit, **9.3k stars**) represents the most complete streaming solution in 2025. It integrates both LocalAgreement and the newer SimulStreaming (AlignAtt) policies, supports speaker diarization, and provides a full WebSocket server with web UI. 
**Key advantages:**

- Supports **both** streaming policies (LocalAgreement and AlignAtt)
- **Speaker diarization** via Streaming Sortformer (2025 SOTA)
- **200-language translation** via NLLB
- Auto-selects the optimal backend (MLX on macOS, faster-whisper on Linux/Windows)
- Docker-ready deployment

```bash
pip install whisperlivekit

# Basic usage
wlk --model small --language en

# With diarization and low latency
wlk --model medium --language en --diarization

# Open http://localhost:8000 for the web UI
```

**Python API integration:**

```python
from whisperlivekit import AudioProcessor, TranscriptionEngine

engine = TranscriptionEngine(
    model="small",
    lan="en",
    diarization=False  # Enable for speaker identification
)
processor = AudioProcessor(transcription_engine=engine)
```

## Implementing the LocalAgreement algorithm from scratch

For maximum control, here's a complete implementation of LocalAgreement-2 with faster-whisper:

```python
"""
LocalAgreement-2 streaming transcription implementation
"""
import numpy as np
from faster_whisper import WhisperModel

class LocalAgreementTranscriber:
    def __init__(self, model_size="small", device="cuda"):
        self.model = WhisperModel(
            model_size,
            device=device,
            compute_type="float16" if device == "cuda" else "int8"
        )
        self.sample_rate = 16000
        self.min_chunk_size = 1.0   # seconds
        self.buffer_max = 30.0      # seconds

        # State
        self.audio_buffer = np.array([], dtype=np.float32)
        self.confirmed_words = []
        self.previous_output = None
        self.prompt_words = []      # Last 200 words kept for context

    def add_audio(self, audio: np.ndarray):
        """Add a new audio chunk to the buffer."""
        self.audio_buffer = np.concatenate([self.audio_buffer, audio])

    def process(self) -> tuple[str, str]:
        """Process the buffer, return (confirmed_text, tentative_text)."""
        buffer_duration = len(self.audio_buffer) / self.sample_rate
        if buffer_duration < self.min_chunk_size:
            return "", ""

        # Build the context prompt from confirmed words
        prompt = ' '.join(self.prompt_words[-200:]) if self.prompt_words else None

        # Transcribe the entire buffer
        segments, _ = self.model.transcribe(
            self.audio_buffer,
            initial_prompt=prompt,
            word_timestamps=True,
            beam_size=2,
            language="en"
        )

        # Extract words with timestamps
        current_words = []
        for segment in segments:
            if segment.words:
                for word in segment.words:
                    current_words.append({
                        'text': word.word.strip(),
                        'start': word.start,
                        'end': word.end
                    })

        # First pass - no comparison possible yet
        if self.previous_output is None:
            self.previous_output = current_words
            tentative = ' '.join(w['text'] for w in current_words)
            return "", tentative

        # LocalAgreement-2: find the longest common prefix
        confirmed = []
        for prev, curr in zip(self.previous_output, current_words):
            if prev['text'].lower() == curr['text'].lower():
                confirmed.append(curr)
            else:
                break

        # Update state
        confirmed_text = ' '.join(w['text'] for w in confirmed)
        tentative_text = ' '.join(w['text'] for w in current_words[len(confirmed):])

        if confirmed:
            self.confirmed_words.extend([w['text'] for w in confirmed])
            self.prompt_words.extend([w['text'] for w in confirmed])

        # Trim the buffer if it has grown too long
        if buffer_duration > self.buffer_max:
            self._trim_buffer_at_sentence()

        self.previous_output = current_words
        return confirmed_text, tentative_text

    def _trim_buffer_at_sentence(self):
        """Trim the buffer at the last sentence boundary."""
        # Find the last confirmed word ending with punctuation
        for word in reversed(self.confirmed_words):
            if word.endswith(('.', '?', '!')):
                # Keep the buffer from this point forward
                # (In practice this needs timestamp tracking - simplified here)
                trim_samples = int(15 * self.sample_rate)  # Keep the last 15s
                if len(self.audio_buffer) > trim_samples:
                    self.audio_buffer = self.audio_buffer[-trim_samples:]
                break

    def finish(self) -> str:
        """Finalize any remaining audio."""
        if len(self.audio_buffer) > 0:
            segments, _ = self.model.transcribe(self.audio_buffer)
            return ' '.join(seg.text.strip() for seg in segments)
        return ""
```
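A short driver loop shows how the class above might be fed from a microphone. The 0.5-second capture interval and the blocking sounddevice read are illustrative assumptions, not part of the algorithm itself:

```python
import sounddevice as sd

SAMPLE_RATE = 16000
CHUNK_SECONDS = 0.5  # how much audio to add between process() calls

transcriber = LocalAgreementTranscriber(model_size="small", device="cuda")

try:
    with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype="float32") as stream:
        while True:
            # Read ~0.5s of audio and append it to the rolling buffer
            chunk, _ = stream.read(int(SAMPLE_RATE * CHUNK_SECONDS))
            transcriber.add_audio(chunk.flatten())

            confirmed, tentative = transcriber.process()
            if confirmed:
                print(f"[CONFIRMED] {confirmed}")
            if tentative:
                print(f"[tentative] {tentative}", end="\r")
except KeyboardInterrupt:
    print(f"\n[FINAL] {transcriber.finish()}")
```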
## Performance tuning and parameter recommendations

**Model selection by use case:**

| Use Case | Model | GPU VRAM | Latency | Notes |
|----------|-------|----------|---------|-------|
| Ultra-low latency | `tiny.en` | ~1GB | Fastest | For real-time preview only |
| Streaming captioning | `small.en` | ~2GB | ~2-3s | **Best balance for streamers** |
| High accuracy | `medium.en` | ~5GB | ~4-5s | Near-real-time |
| Maximum quality | `distil-large-v3` | ~6GB | ~5s | Distilled, faster than large |

**Optimal configuration for streamer captioning:**

```python
# Recommended settings for real-time captioning
config = {
    # Model
    "model": "small.en",                  # or "base.en" for lower latency
    "device": "cuda",
    "compute_type": "float16",

    # Transcription
    "beam_size": 2,                       # 1 for speed, 5 for accuracy
    "language": "en",                     # Always specify to skip detection
    "condition_on_previous_text": False,  # Reduces latency

    # VAD (if using faster-whisper's built-in filter)
    "vad_filter": True,
    "vad_parameters": {
        "threshold": 0.5,
        "min_speech_duration_ms": 250,
        "min_silence_duration_ms": 500,   # Down from the 2000ms default
        "speech_pad_ms": 100,             # Down from the 400ms default
    },

    # Streaming
    "min_chunk_size": 0.5,                # Seconds between processing passes
    "buffer_max": 30.0,                   # Seconds before trimming
}
```

**Latency breakdown with LocalAgreement-2:**

- Chunk collection: 0.5-1.0s (configurable)
- Whisper inference: 0.2-0.5s (depends on model and GPU)
- Agreement confirmation: requires 2 passes = 2× the chunk time
- **Total end-to-end: ~2-4 seconds** for confirmed text

## Step-by-step integration for Claude Code

To upgrade the existing Python desktop application from time-based chunking to VAD-based streaming:

**Option 1: Quickest integration with RealtimeSTT**

```bash
pip install RealtimeSTT
pip install torch==2.5.1+cu118 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu118
```

Replace the time-based chunking code with the `AudioToTextRecorder` configuration shown in the RealtimeSTT section above. This handles all VAD, buffering, and deduplication automatically.

**Option 2: Maximum control with faster-whisper + Silero VAD**

1. Install dependencies:

   ```bash
   pip install faster-whisper sounddevice numpy
   pip install torch torchaudio  # For Silero VAD
   ```

2. Implement the `RealtimeTranscriber` class from the faster-whisper section above
3. Key changes from time-based chunking:
   - Replace fixed-interval processing with VAD-triggered segmentation
   - Add a pre-roll buffer to capture word starts
   - Use silence detection instead of timers for utterance boundaries
   - Process complete utterances, not arbitrary chunks

**Option 3: Production-ready with WhisperLiveKit**

For the most robust solution with a WebSocket architecture:

```bash
pip install whisperlivekit
wlk --model small --language en --port 8000
```

Connect your desktop application as a WebSocket client to `ws://localhost:8000`.
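A minimal client sketch of that connection, using the `websockets` package. The `/asr` endpoint path and the raw audio payload are assumptions here; check the WhisperLiveKit documentation for the exact endpoint and audio encoding it expects (its bundled web UI streams browser-encoded audio chunks):

```python
import asyncio
import json
import websockets  # pip install websockets

async def stream_captions(audio_chunks):
    """Send audio to a local WhisperLiveKit server and print its transcription messages."""
    # Assumed endpoint; adjust to match your WhisperLiveKit version.
    async with websockets.connect("ws://localhost:8000/asr") as ws:

        async def send_audio():
            for chunk in audio_chunks:   # chunk: bytes from your capture loop (placeholder)
                await ws.send(chunk)

        async def print_captions():
            async for message in ws:     # JSON transcription updates from the server
                print(json.loads(message))

        await asyncio.gather(send_audio(), print_captions())

# asyncio.run(stream_captions(my_audio_chunks))  # my_audio_chunks: your app's audio source
```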
## Conclusion

The chunk boundary word loss problem is definitively solved by combining **VAD-based segmentation** with the **LocalAgreement confirmation algorithm**. For a streamer captioning application, **RealtimeSTT** offers the fastest integration path with its dual-layer VAD and pre-recording buffers. For maximum performance and future-proofing, **WhisperLiveKit** provides a complete solution with the latest SimulStreaming research. The custom **faster-whisper + Silero VAD** approach gives full control when specific optimizations are needed.

The key insight is that Whisper performs best when given complete speech utterances at natural boundaries: let VAD find those boundaries rather than imposing arbitrary time cuts. With a proper implementation, real-time captioning latency of **2-4 seconds** is achievable with **no word loss** at chunk boundaries.