# Real-Time Whisper Streaming: Solving Chunk Boundary Word Loss

The chunk boundary word loss problem in streaming Whisper transcription is best solved by replacing time-based chunking with **VAD-based segmentation** combined with the **LocalAgreement algorithm**. The most effective 2025 solutions are **WhisperLiveKit** for a turnkey approach, **RealtimeSTT** for simple integration, or implementing **faster-whisper with Silero VAD** for maximum control. Each approach eliminates word loss by processing complete speech utterances and confirming transcriptions only when consecutive outputs agree.

## The core problem and why your current approach fails

Time-based chunking (e.g., every 3 seconds) creates artificial boundaries that frequently cut words mid-utterance. Whisper was trained on **30-second segments** and performs poorly when given truncated audio at arbitrary points. The result is word loss at chunk boundaries, hallucinations on silence-padded segments, and inconsistent transcription quality.

The solution combines two techniques: **VAD-based segmentation** to detect natural speech boundaries instead of arbitrary time cuts, and the **LocalAgreement algorithm** to confirm only stable transcriptions that appear consistently across multiple processing passes.

## whisper-streaming and the LocalAgreement algorithm

The **ufal/whisper_streaming** library (3.4k stars, MIT license) pioneered the LocalAgreement-n approach for streaming Whisper. However, it is now **being superseded by SimulStreaming** in 2025; the authors recommend transitioning to the newer project for optimal performance.

**How LocalAgreement-2 works:**

1. Maintain a rolling audio buffer (up to ~30 seconds)
2. Process the entire buffer through Whisper, getting transcription T1
3. Add a new audio chunk, process again, getting T2
4. Find the longest common prefix between T1 and T2
5. Emit only the matching prefix as "confirmed" output
6. Display the unmatched portion as "tentative" (it may still change)
7. Trim the buffer at sentence boundaries to prevent unbounded memory growth

This approach solves word loss because text is only emitted when **two consecutive Whisper passes agree**, ensuring stability. The expected latency is approximately **2× the chunk size** (e.g., about 2 seconds of latency for 1-second chunks). A standalone sketch of the prefix-matching step appears at the end of this section.

```python
from whisper_online import FasterWhisperASR, OnlineASRProcessor

# Initialize with the faster-whisper backend
asr = FasterWhisperASR("en", "large-v2")
asr.use_vad()  # Enable Silero VAD
online = OnlineASRProcessor(asr)

# Main processing loop
while audio_has_not_ended:          # placeholder for your capture loop
    chunk = get_audio_chunk()       # 16 kHz mono float32
    online.insert_audio_chunk(chunk)
    output = online.process_iter()
    if output and output[0] is not None:  # nothing confirmed yet yields empty timestamps
        beg, end, text = output
        print(f"[{beg:.1f}s-{end:.1f}s] {text}")

# Finalize remaining audio
final = online.finish()
```

**Key parameters for low-latency captioning:**

- `--min-chunk-size 0.5`: process every 500ms (lower = more responsive)
- `--buffer_trimming segment`: trim at Whisper segment boundaries (default)
- `--vac`: enable the Voice Activity Controller for paused speech
- `--backend faster-whisper`: use the GPU-accelerated backend

**Installation:**

```bash
pip install librosa soundfile
pip install faster-whisper      # GPU: requires CUDA 11.7+ and cuDNN 8.5+
pip install torch torchaudio    # For Silero VAD
```
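The confirmation rule in steps 4-6 boils down to a word-level longest-common-prefix comparison between two consecutive hypotheses. A minimal standalone sketch of just that step (an illustration, not the library's internal code):

```python
def agreed_prefix(prev_words: list[str], curr_words: list[str]) -> list[str]:
    """Longest common prefix of two hypotheses, compared case-insensitively."""
    confirmed = []
    for prev, curr in zip(prev_words, curr_words):
        if prev.lower() != curr.lower():
            break
        confirmed.append(curr)
    return confirmed

# Two consecutive passes over a growing buffer:
t1 = "the quick brown fox jumped".split()
t2 = "the quick brown fox jumps over the".split()
print(agreed_prefix(t1, t2))  # ['the', 'quick', 'brown', 'fox'] -> emitted as confirmed
```

The from-scratch implementation later in this document applies the same comparison to word objects that also carry timestamps.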
## RealtimeSTT offers the simplest integration

**RealtimeSTT** (KoljaB/RealtimeSTT, **8.9k stars**) provides the most straightforward integration path. It uses a dual-layer VAD system (WebRTC for fast detection plus Silero for accurate verification) and handles chunk boundaries through pre-recording buffers rather than algorithmic agreement.

**How it prevents word loss:**

- **Pre-recording buffer** (default 0.2s): captures audio from before the VAD triggers, preventing missed word starts
- **Post-speech silence detection** (default 0.2s): waits for silence before ending an utterance, preventing truncated endings
- **Dual-model architecture**: uses a tiny model for the real-time preview and a larger model for the final transcription

```python
from RealtimeSTT import AudioToTextRecorder

def on_realtime_update(text):
    print(f"\r[LIVE] {text}", end="", flush=True)

def on_final_text(text):
    print(f"\n[FINAL] {text}")

if __name__ == '__main__':
    recorder = AudioToTextRecorder(
        # Model configuration
        model="small.en",                    # Final transcription model
        language="en",                       # Skip language detection
        device="cuda",
        compute_type="float16",

        # Real-time preview
        enable_realtime_transcription=True,
        realtime_model_type="tiny.en",       # Fast model for live updates
        realtime_processing_pause=0.1,       # Update every 100ms
        use_main_model_for_realtime=False,

        # VAD tuning for low latency
        silero_sensitivity=0.4,              # Lower = fewer false positives
        silero_use_onnx=True,                # Faster VAD inference
        webrtc_sensitivity=3,                # Most aggressive
        post_speech_silence_duration=0.3,    # End sentence after 300ms of silence
        pre_recording_buffer_duration=0.2,   # Capture 200ms before VAD triggers

        # Performance optimization
        beam_size=2,                         # Speed/accuracy balance
        beam_size_realtime=1,                # Fastest for preview
        early_transcription_on_silence=200,  # Start transcribing 200ms into silence

        # Callbacks
        on_realtime_transcription_update=on_realtime_update,
    )

    while True:
        recorder.text(on_final_text)
```

**Installation:**

```bash
pip install RealtimeSTT

# GPU support (highly recommended)
pip install torch==2.5.1+cu118 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu118

# Linux prerequisites
sudo apt-get install python3-dev portaudio19-dev
```

**Important caveat:** RealtimeSTT is now **community-maintained**; the original author no longer actively develops new features. It remains functional and widely used, but for maximum future-proofing, consider WhisperLiveKit.

## faster-whisper with Silero VAD gives maximum control

For a custom implementation with full control, **faster-whisper** (SYSTRAN, 19k stars) with **Silero VAD** integration provides the best foundation. This approach replaces time-based chunking with speech-boundary segmentation.

**faster-whisper VAD parameters for real-time use:**

| Parameter | Default | Real-Time Recommended | Purpose |
|-----------|---------|-----------------------|---------|
| `threshold` | 0.5 | 0.5 | Speech probability threshold |
| `min_speech_duration_ms` | 250 | 250 | Minimum speech chunk length |
| `min_silence_duration_ms` | **2000** | **500** | Silence duration that splits segments |
| `speech_pad_ms` | **400** | **100** | Padding added around speech segments |
| `max_speech_duration_s` | inf | 30.0 | Limit on segment length |

The defaults are conservative, tuned for batch processing. For real-time captioning, **reduce `min_silence_duration_ms` to 500ms** and **`speech_pad_ms` to 100ms** for faster response.
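If you rely on faster-whisper's built-in VAD filter rather than running segmentation yourself, these tuned values can be passed straight to `transcribe()`. A minimal sketch; the audio path is a placeholder:

```python
from faster_whisper import WhisperModel

model = WhisperModel("small.en", device="cuda", compute_type="float16")

# vad_filter drops non-speech before decoding; vad_parameters overrides the defaults
segments, info = model.transcribe(
    "utterance.wav",                  # placeholder: any 16 kHz mono file or float32 array
    language="en",
    vad_filter=True,
    vad_parameters=dict(
        threshold=0.5,
        min_speech_duration_ms=250,
        min_silence_duration_ms=500,  # down from the 2000ms default
        speech_pad_ms=100,            # down from the 400ms default
    ),
)
for segment in segments:
    print(f"[{segment.start:.1f}s-{segment.end:.1f}s] {segment.text}")
```

The full example below goes one step further: it runs Silero VAD itself on the live microphone stream and only hands complete utterances to Whisper, so the built-in filter stays off.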
```python """ Complete real-time transcription with faster-whisper and Silero VAD """ import torch import numpy as np import sounddevice as sd from faster_whisper import WhisperModel import queue import threading SAMPLE_RATE = 16000 CHUNK_MS = 100 CHUNK_SIZE = int(SAMPLE_RATE * CHUNK_MS / 1000) MIN_SPEECH_SAMPLES = int(SAMPLE_RATE * 0.5) # 500ms minimum SILENCE_CHUNKS_TO_END = 7 # 700ms of silence ends speech class RealtimeTranscriber: def __init__(self, model_size="small", device="cuda"): # Load Whisper self.whisper = WhisperModel( model_size, device=device, compute_type="float16" if device == "cuda" else "int8" ) # Load Silero VAD self.vad_model, _ = torch.hub.load( 'snakers4/silero-vad', 'silero_vad', force_reload=False ) # State self.audio_queue = queue.Queue() self.speech_buffer = [] self.pre_roll_buffer = [] # Captures audio before speech starts self.is_speaking = False self.silence_count = 0 self.running = False def audio_callback(self, indata, frames, time, status): self.audio_queue.put(indata.copy()) def process_audio(self): while self.running: try: audio_chunk = self.audio_queue.get(timeout=0.1) audio_chunk = audio_chunk.flatten().astype(np.float32) # Pre-roll buffer (keeps last ~200ms before speech) self.pre_roll_buffer.append(audio_chunk) if len(self.pre_roll_buffer) > 2: self.pre_roll_buffer.pop(0) # VAD check tensor = torch.FloatTensor(audio_chunk) speech_prob = self.vad_model(tensor, SAMPLE_RATE).item() if speech_prob > 0.5: if not self.is_speaking: # Speech started - include pre-roll buffer self.is_speaking = True for pre_chunk in self.pre_roll_buffer: self.speech_buffer.extend(pre_chunk) else: self.speech_buffer.extend(audio_chunk) self.silence_count = 0 elif self.is_speaking: self.speech_buffer.extend(audio_chunk) self.silence_count += 1 if self.silence_count >= SILENCE_CHUNKS_TO_END: self.transcribe_and_reset() except queue.Empty: continue def transcribe_and_reset(self): if len(self.speech_buffer) < MIN_SPEECH_SAMPLES: self.reset_state() return audio_array = np.array(self.speech_buffer, dtype=np.float32) segments, _ = self.whisper.transcribe( audio_array, beam_size=2, language="en", vad_filter=False, # Already VAD-processed condition_on_previous_text=False ) text = " ".join(seg.text.strip() for seg in segments) if text: print(f"\n🎤 {text}") self.reset_state() def reset_state(self): self.speech_buffer = [] self.is_speaking = False self.silence_count = 0 def start(self): self.running = True threading.Thread(target=self.process_audio, daemon=True).start() print("🎙️ Listening... (Ctrl+C to stop)") with sd.InputStream( samplerate=SAMPLE_RATE, channels=1, dtype=np.float32, blocksize=CHUNK_SIZE, callback=self.audio_callback ): try: while True: sd.sleep(100) except KeyboardInterrupt: self.running = False print("\n⏹️ Stopped") if __name__ == "__main__": transcriber = RealtimeTranscriber(model_size="small", device="cuda") transcriber.start() ``` ## WhisperLiveKit is the most complete 2025 solution **WhisperLiveKit** (QuentinFuxa/WhisperLiveKit, **9.3k stars**) represents the most complete streaming solution in 2025. It integrates both LocalAgreement and the newer SimulStreaming (AlignAtt) policies, supports speaker diarization, and provides a full WebSocket server with web UI. 
**Key advantages:**

- Supports **both** streaming policies (LocalAgreement and AlignAtt)
- **Speaker diarization** via Streaming Sortformer (2025 SOTA)
- **200-language translation** via NLLB
- Auto-selects the optimal backend (MLX on macOS, faster-whisper on Linux/Windows)
- Docker-ready deployment

```bash
pip install whisperlivekit

# Basic usage
wlk --model small --language en

# With diarization and low latency
wlk --model medium --language en --diarization

# Open http://localhost:8000 for the web UI
```

**Python API integration:**

```python
from whisperlivekit import AudioProcessor, TranscriptionEngine

engine = TranscriptionEngine(
    model="small",
    lan="en",
    diarization=False  # Enable for speaker identification
)
processor = AudioProcessor(transcription_engine=engine)
```

## Implementing the LocalAgreement algorithm from scratch

For maximum control, here's a complete implementation of LocalAgreement-2 with faster-whisper:

```python
"""
LocalAgreement-2 streaming transcription implementation
"""
import numpy as np
from faster_whisper import WhisperModel

class LocalAgreementTranscriber:
    def __init__(self, model_size="small", device="cuda"):
        self.model = WhisperModel(
            model_size,
            device=device,
            compute_type="float16" if device == "cuda" else "int8"
        )
        self.sample_rate = 16000
        self.min_chunk_size = 1.0   # seconds
        self.buffer_max = 30.0      # seconds

        # State
        self.audio_buffer = np.array([], dtype=np.float32)
        self.confirmed_words = []
        self.previous_output = None
        self.prompt_words = []      # Last 200 words kept for context

    def add_audio(self, audio: np.ndarray):
        """Add a new audio chunk to the buffer."""
        self.audio_buffer = np.concatenate([self.audio_buffer, audio])

    def process(self) -> tuple[str, str]:
        """Process the buffer, return (confirmed_text, tentative_text)."""
        buffer_duration = len(self.audio_buffer) / self.sample_rate
        if buffer_duration < self.min_chunk_size:
            return "", ""

        # Build the context prompt from confirmed words
        prompt = ' '.join(self.prompt_words[-200:]) if self.prompt_words else None

        # Transcribe the entire buffer
        segments, _ = self.model.transcribe(
            self.audio_buffer,
            initial_prompt=prompt,
            word_timestamps=True,
            beam_size=2,
            language="en"
        )

        # Extract words with timestamps
        current_words = []
        for segment in segments:
            if segment.words:
                for word in segment.words:
                    current_words.append({
                        'text': word.word.strip(),
                        'start': word.start,
                        'end': word.end
                    })

        # First pass - no comparison possible yet
        if self.previous_output is None:
            self.previous_output = current_words
            tentative = ' '.join(w['text'] for w in current_words)
            return "", tentative

        # LocalAgreement-2: find the longest common prefix
        confirmed = []
        for prev, curr in zip(self.previous_output, current_words):
            if prev['text'].lower() == curr['text'].lower():
                confirmed.append(curr)
            else:
                break

        # Update state
        confirmed_text = ' '.join(w['text'] for w in confirmed)
        tentative_text = ' '.join(w['text'] for w in current_words[len(confirmed):])

        if confirmed:
            self.confirmed_words.extend([w['text'] for w in confirmed])
            self.prompt_words.extend([w['text'] for w in confirmed])

        # Trim the buffer if it has grown too long
        if buffer_duration > self.buffer_max:
            self._trim_buffer_at_sentence()

        self.previous_output = current_words
        return confirmed_text, tentative_text

    def _trim_buffer_at_sentence(self):
        """Trim the buffer at the last sentence boundary."""
        # Find the last confirmed word ending with punctuation
        for word in reversed(self.confirmed_words):
            if word.endswith(('.', '?', '!')):
                # Keep the buffer from this point forward
                # (In practice this needs timestamp tracking - simplified here)
                trim_samples = int(15 * self.sample_rate)  # Keep the last 15s
                if len(self.audio_buffer) > trim_samples:
                    self.audio_buffer = self.audio_buffer[-trim_samples:]
                break

    def finish(self) -> str:
        """Finalize any remaining audio."""
        if len(self.audio_buffer) > 0:
            segments, _ = self.model.transcribe(self.audio_buffer)
            return ' '.join(seg.text.strip() for seg in segments)
        return ""
```
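A short driver loop shows how the class above might be fed from a microphone. The 0.5-second capture interval and the blocking sounddevice read are illustrative assumptions, not part of the algorithm itself:

```python
import sounddevice as sd

SAMPLE_RATE = 16000
CHUNK_SECONDS = 0.5  # how much audio to add between process() calls

transcriber = LocalAgreementTranscriber(model_size="small", device="cuda")

try:
    with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype="float32") as stream:
        while True:
            # Read ~0.5s of audio and append it to the rolling buffer
            chunk, _ = stream.read(int(SAMPLE_RATE * CHUNK_SECONDS))
            transcriber.add_audio(chunk.flatten())

            confirmed, tentative = transcriber.process()
            if confirmed:
                print(f"[CONFIRMED] {confirmed}")
            if tentative:
                print(f"[tentative] {tentative}", end="\r")
except KeyboardInterrupt:
    print(f"\n[FINAL] {transcriber.finish()}")
```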
## Performance tuning and parameter recommendations

**Model selection by use case:**

| Use Case | Model | GPU VRAM | Latency | Notes |
|----------|-------|----------|---------|-------|
| Ultra-low latency | `tiny.en` | ~1GB | Fastest | For real-time preview only |
| Streaming captioning | `small.en` | ~2GB | ~2-3s | **Best balance for streamers** |
| High accuracy | `medium.en` | ~5GB | ~4-5s | Near-real-time |
| Maximum quality | `distil-large-v3` | ~6GB | ~5s | Distilled, faster than large |

**Optimal configuration for streamer captioning:**

```python
# Recommended settings for real-time captioning
config = {
    # Model
    "model": "small.en",                  # or "base.en" for lower latency
    "device": "cuda",
    "compute_type": "float16",

    # Transcription
    "beam_size": 2,                       # 1 for speed, 5 for accuracy
    "language": "en",                     # Always specify to skip detection
    "condition_on_previous_text": False,  # Reduces latency

    # VAD (if using faster-whisper's built-in filter)
    "vad_filter": True,
    "vad_parameters": {
        "threshold": 0.5,
        "min_speech_duration_ms": 250,
        "min_silence_duration_ms": 500,   # Down from the 2000ms default
        "speech_pad_ms": 100,             # Down from the 400ms default
    },

    # Streaming
    "min_chunk_size": 0.5,                # Seconds between processing passes
    "buffer_max": 30.0,                   # Seconds before trimming
}
```

**Latency breakdown with LocalAgreement-2:**

- Chunk collection: 0.5-1.0s (configurable)
- Whisper inference: 0.2-0.5s (depends on model and GPU)
- Agreement confirmation: requires 2 passes = 2× the chunk time
- **Total end-to-end: ~2-4 seconds** for confirmed text

## Step-by-step integration for Claude Code

To upgrade the existing Python desktop application from time-based chunking to VAD-based streaming:

**Option 1: Quickest integration with RealtimeSTT**

```bash
pip install RealtimeSTT
pip install torch==2.5.1+cu118 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu118
```

Replace the time-based chunking code with the `AudioToTextRecorder` configuration shown in the RealtimeSTT section above. This handles all VAD, buffering, and deduplication automatically.

**Option 2: Maximum control with faster-whisper + Silero VAD**

1. Install dependencies:

   ```bash
   pip install faster-whisper sounddevice numpy
   pip install torch torchaudio  # For Silero VAD
   ```

2. Implement the `RealtimeTranscriber` class from the faster-whisper section above
3. Key changes from time-based chunking:
   - Replace fixed-interval processing with VAD-triggered segmentation
   - Add a pre-roll buffer to capture word starts
   - Use silence detection instead of timers for utterance boundaries
   - Process complete utterances, not arbitrary chunks

**Option 3: Production-ready with WhisperLiveKit**

For the most robust solution with a WebSocket architecture:

```bash
pip install whisperlivekit
wlk --model small --language en --port 8000
```

Connect your desktop application as a WebSocket client to `ws://localhost:8000`.
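A minimal client sketch of that connection, using the `websockets` package. The `/asr` endpoint path and the raw audio payload are assumptions here; check the WhisperLiveKit documentation for the exact endpoint and audio encoding it expects (its bundled web UI streams browser-encoded audio chunks):

```python
import asyncio
import json
import websockets  # pip install websockets

async def stream_captions(audio_chunks):
    """Send audio to a local WhisperLiveKit server and print its transcription messages."""
    # Assumed endpoint; adjust to match your WhisperLiveKit version.
    async with websockets.connect("ws://localhost:8000/asr") as ws:

        async def send_audio():
            for chunk in audio_chunks:   # chunk: bytes from your capture loop (placeholder)
                await ws.send(chunk)

        async def print_captions():
            async for message in ws:     # JSON transcription updates from the server
                print(json.loads(message))

        await asyncio.gather(send_audio(), print_captions())

# asyncio.run(stream_captions(my_audio_chunks))  # my_audio_chunks: your app's audio source
```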
## Conclusion

The chunk boundary word loss problem is definitively solved by combining **VAD-based segmentation** with the **LocalAgreement confirmation algorithm**. For a streamer captioning application, **RealtimeSTT** offers the fastest integration path with its dual-layer VAD and pre-recording buffers. For maximum performance and future-proofing, **WhisperLiveKit** provides a complete solution with the latest SimulStreaming research. The custom **faster-whisper + Silero VAD** approach gives full control when specific optimizations are needed.

The key insight is that Whisper performs best when given complete speech utterances at natural boundaries: let VAD find those boundaries rather than imposing arbitrary time cuts. With a proper implementation, real-time captioning latency of **2-4 seconds** is achievable with **no word loss** at chunk boundaries.