From d3c2954c5ea539cdd19861cbc64ebf9f56ffb3e3 Mon Sep 17 00:00:00 2001 From: Josh Knapp Date: Thu, 26 Feb 2026 16:44:58 -0800 Subject: [PATCH] Add STT and diarization research report Co-Authored-By: Claude Opus 4.6 --- RESEARCH_REPORT.md | 380 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 380 insertions(+) create mode 100644 RESEARCH_REPORT.md diff --git a/RESEARCH_REPORT.md b/RESEARCH_REPORT.md new file mode 100644 index 0000000..1885d0f --- /dev/null +++ b/RESEARCH_REPORT.md @@ -0,0 +1,380 @@ +# Voice to Notes: Speech-to-Text and Speaker Diarization Research Report + +**Date:** 2026-02-26 + +--- + +## Table of Contents + +1. [Speech-to-Text Engines](#1-speech-to-text-engines) +2. [Speaker Diarization](#2-speaker-diarization) +3. [Combined Pipelines](#3-combined-pipelines) +4. [Final Recommendations](#4-final-recommendations) + +--- + +## 1. Speech-to-Text Engines + +### 1.1 OpenAI Whisper / whisper.cpp + +**Overview:** Whisper is OpenAI's general-purpose speech recognition model trained on 680,000 hours of multilingual data. whisper.cpp is a pure C/C++ port by Georgi Gerganov (ggml project) that removes the Python/PyTorch dependency entirely. + +| Criterion | Assessment | +|---|---| +| **Accuracy** | State-of-the-art. Large-v3 achieves ~2.7% WER on clean audio, ~7.9% on mixed real-world audio. Large-v3-turbo achieves comparable accuracy (~7.75% WER) at much faster speed by reducing decoder layers from 32 to 4. | +| **Speed** | whisper.cpp with quantization (Q5_K_M) runs efficiently on CPU. GPU acceleration available via CUDA (NVIDIA), Vulkan (cross-vendor), Metal (Apple Silicon), and OpenVINO (Intel). Real-time or faster on modern hardware with medium/small models. | +| **Language Support** | 99 languages. | +| **Ease of Integration** | whisper.cpp: C/C++ library with C API, bindings available for many languages. No Python runtime needed. Straightforward to embed in a desktop app. | +| **License** | MIT (both Whisper and whisper.cpp). 
| +| **GPU Acceleration** | CUDA, Vulkan, Metal, OpenVINO, CoreML. Broad hardware coverage. | +| **Word-Level Timestamps** | Supported, but derived from forced alignment on decoded text rather than internal attention weights. Can drift 300-800ms on complex utterances. Acceptable for many use cases but not forensic-grade. | + +**Verdict:** Best option for a native desktop app that needs to minimize dependencies. The C/C++ nature of whisper.cpp makes it ideal for embedding in an Electron, Qt, or Tauri application. + +--- + +### 1.2 faster-whisper + +**Overview:** A Python reimplementation of Whisper using CTranslate2, a high-performance C++ inference engine for Transformer models. Up to 4x faster than stock Whisper with the same accuracy, and lower memory usage. + +| Criterion | Assessment | +|---|---| +| **Accuracy** | Identical to Whisper (same models, full fidelity). | +| **Speed** | Up to 4x faster than stock Whisper. ~20x realtime with GPU. 8-bit quantization available on both CPU and GPU. | +| **Language Support** | 99 languages (same Whisper models). | +| **Ease of Integration** | Python library. Requires Python runtime. Excellent for Python-based or Python-embedded apps. Rich API with access to Whisper's tokenizer, alignment algorithms, and confidence scoring. | +| **License** | MIT. | +| **GPU Acceleration** | NVIDIA CUDA, AMD ROCm (via CTranslate2). CPU backends: Intel MKL, oneDNN, OpenBLAS, Ruy. | +| **Word-Level Timestamps** | **Best-in-class among Whisper variants.** Native alignment from the model's internals plus optional wav2vec2 alignment for even better precision. | + +**Verdict:** Best choice if your app can embed a Python runtime (or run a Python sidecar process). Provides the most precise word-level timestamps of any Whisper variant, which is critical for synchronized playback. The trade-off is the Python dependency. + +--- + +### 1.3 Vosk + +**Overview:** A lightweight, Kaldi-based offline speech recognition toolkit. 
Optimized for efficiency and small footprint. + +| Criterion | Assessment | +|---|---| +| **Accuracy** | Good but noticeably below Whisper-class models. Baseline WER can be ~20%+ depending on audio conditions, improvable to ~12% with domain-specific language model adaptation. | +| **Speed** | Very fast, even on low-end hardware. Supports real-time streaming natively. | +| **Language Support** | 20+ languages with pre-trained models. | +| **Ease of Integration** | Excellent. APIs for Python, Java, C#, JavaScript, Node.js, and more. Models are ~50MB. | +| **License** | Apache 2.0. | +| **GPU Acceleration** | Not required (runs efficiently on CPU). No GPU acceleration. | +| **Word-Level Timestamps** | Yes, provides word-level timestamps with start/end times and confidence in JSON output. | + +**Verdict:** Best for extremely resource-constrained scenarios or as a lightweight fallback. Not recommended as the primary engine for a quality-focused transcription app due to lower accuracy compared to Whisper-based solutions. + +--- + +### 1.4 Coqui STT + +**Overview:** Fork of Mozilla DeepSpeech. The Coqui company shut down in early 2024. The code remains available as open source, but the project is no longer maintained and the Model Zoo is offline. + +| Criterion | Assessment | +|---|---| +| **Accuracy** | Below Whisper. Was competitive in the DeepSpeech era but has fallen behind. | +| **Speed** | Moderate. | +| **Language Support** | Limited compared to Whisper. | +| **Ease of Integration** | Python and native bindings available, but stale dependencies. | +| **License** | MPL 2.0. | +| **GPU Acceleration** | TensorFlow-based GPU support. | +| **Word-Level Timestamps** | Supported via metadata output. | + +**Verdict:** **Not recommended.** The project is discontinued. No active maintenance, no security patches, no model improvements. Use Whisper-based alternatives instead. 
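For reference, consuming Vosk's word-level output (section 1.3) needs only the standard library, since the recognizer's final result is plain JSON. A minimal sketch — the sample payload below is fabricated, but follows the documented shape produced when word output is enabled via `SetWords(True)`:

```python
import json

# Hypothetical final-result JSON in the shape Vosk emits with word-level
# output enabled; the words, timings, and confidences are made up.
raw = """
{
  "result": [
    {"word": "hello", "start": 0.12, "end": 0.45, "conf": 0.98},
    {"word": "world", "start": 0.51, "end": 0.90, "conf": 0.87}
  ],
  "text": "hello world"
}
"""

def words_with_timestamps(result_json):
    """Return (word, start, end, confidence) tuples from a Vosk-style result."""
    data = json.loads(result_json)
    return [(w["word"], w["start"], w["end"], w["conf"])
            for w in data.get("result", [])]

if __name__ == "__main__":
    for word, start, end, conf in words_with_timestamps(raw):
        print(f"{start:6.2f}-{end:6.2f}  {word}  (conf={conf:.2f})")
```

The same tuple shape works as a common interchange format if Vosk is used as a live-preview engine alongside a Whisper-based final pass.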
+ +--- + +### 1.5 Other Notable Options + +#### Whisper Large-v3-turbo +OpenAI's latest Whisper variant (October 2024). Reduces decoder layers from 32 to 4 while maintaining accuracy close to large-v3. Achieves 216x realtime speed. Available in both whisper.cpp and faster-whisper. + +#### NVIDIA NeMo ASR +Production-grade ASR with Conformer-CTC and Conformer-Transducer models. Best accuracy in some benchmarks but heavy dependency on NVIDIA ecosystem. Apache 2.0 license. Overkill for a desktop app unless targeting NVIDIA GPU users specifically. + +#### Wav2Vec2 (Meta) +Strong accuracy when fine-tuned for specific domains. Good for real-time streaming. Often used as an alignment model rather than primary STT. MIT license. + +--- + +### STT Summary Comparison + +| Feature | whisper.cpp | faster-whisper | Vosk | Coqui STT | +|---|---|---|---|---| +| **Accuracy** | Excellent | Excellent | Good | Fair | +| **Speed** | Fast | Very Fast | Very Fast | Moderate | +| **Languages** | 99 | 99 | 20+ | Limited | +| **Word Timestamps** | Yes (some drift) | Yes (precise) | Yes | Yes | +| **GPU Support** | CUDA/Vulkan/Metal | CUDA/ROCm | CPU only | TensorFlow | +| **License** | MIT | MIT | Apache 2.0 | MPL 2.0 | +| **Dependencies** | None (C/C++) | Python + CTranslate2 | Minimal | Python + TF | +| **Actively Maintained** | Yes | Yes | Yes | **No** | +| **Desktop-Friendly** | Excellent | Good | Excellent | Poor | + +--- + +## 2. Speaker Diarization + +### 2.1 pyannote.audio + +**Overview:** The leading open-source speaker diarization toolkit. Recently released version 4.0 with the "community-1" model, which significantly outperforms the previous 3.1 across all metrics. + +| Criterion | Assessment | +|---|---| +| **Accuracy** | Best-in-class open source. DER (Diarization Error Rate) ~11-19% on standard benchmarks. Community-1 model is a major leap over 3.1. | +| **Pre-recorded Audio** | Full support. Designed for both offline and streaming use. 
| +| **Ease of Integration** | Python library with PyTorch backend. Simple pipeline API: `pipeline("audio.wav")` returns speaker segments. Can run fully offline once models are downloaded. | +| **Combinable with STT** | Yes. WhisperX and whisper-diarization both use pyannote as their diarization backend. Well-established integration patterns. | +| **License** | Code: MIT. Models: speaker-diarization-3.1 is MIT; community-1 is CC-BY-4.0. Both allow commercial use. | +| **GPU Support** | Yes, PyTorch CUDA. Can also run on CPU (slower but functional). | + +**Verdict:** Clear first choice for diarization. Most accurate, best maintained, largest community, and proven integration with Whisper-based STT. The community-1 model under CC-BY-4.0 is permissive enough for commercial desktop apps. + +--- + +### 2.2 NVIDIA NeMo Speaker Diarization + +**Overview:** Part of NVIDIA's NeMo framework. Offers two approaches: end-to-end Sortformer Diarizer and cascaded pipeline (MarbleNet VAD + TitaNet embeddings + Multi-Scale Diarization Decoder). + +| Criterion | Assessment | +|---|---| +| **Accuracy** | Competitive with or slightly better than pyannote in some benchmarks. Sortformer is state-of-the-art. | +| **Pre-recorded Audio** | Full support. Also has streaming Sortformer for real-time. | +| **Ease of Integration** | Heavy. NeMo is a large framework with many dependencies. Requires NVIDIA GPU for practical use. Complex configuration via YAML files. | +| **Combinable with STT** | Yes. NeMo includes its own ASR models. Can combine diarization with NeMo ASR in a single pipeline. | +| **License** | Apache 2.0. | +| **GPU Support** | NVIDIA GPU required for practical performance. | + +**Verdict:** Best accuracy in some scenarios, but the heavy NVIDIA dependency and complex setup make it poorly suited for a consumer desktop app that must work across hardware. Good option if you can offer it as an optional backend for users with NVIDIA GPUs. 
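Whichever diarizer is chosen, its output reduces to (speaker, start, end) turns, and merging those turns with STT word timestamps — the midpoint-assignment approach detailed later in section 3.4 — fits in a few standard-library lines. A sketch; all timings and labels below are illustrative:

```python
# Hypothetical word timestamps (word, start_sec, end_sec) from the STT pass
# and speaker turns from diarization; every value here is invented.
words = [("okay", 0.10, 0.30), ("thanks", 0.35, 0.70),
         ("no", 1.10, 1.25), ("problem", 1.30, 1.80)]
turns = [("SPEAKER_00", 0.00, 0.95), ("SPEAKER_01", 0.95, 2.00)]

def assign_speakers(words, turns, fallback="UNKNOWN"):
    """Label each word with the speaker turn containing its midpoint."""
    labeled = []
    for word, start, end in words:
        midpoint = (start + end) / 2
        # A word whose midpoint falls in a diarization gap (silence, or a
        # boundary mismatch) keeps the fallback label; a production system
        # would resolve such edge cases by majority time-overlap instead.
        speaker = next((spk for spk, t0, t1 in turns if t0 <= midpoint < t1),
                       fallback)
        labeled.append((speaker, word, start, end))
    return labeled
```

This is linear in words × turns; for long recordings, sorting the turns and walking both lists with two pointers keeps the merge linear overall.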
+ +--- + +### 2.3 SpeechBrain + +**Overview:** An open-source, all-in-one conversational AI toolkit built on PyTorch. Covers ASR, speaker identification, diarization, speech enhancement, and more. + +| Criterion | Assessment | +|---|---| +| **Accuracy** | Good, though generally slightly behind pyannote on diarization-specific benchmarks. | +| **Pre-recorded Audio** | Full support. | +| **Ease of Integration** | Moderate. PyTorch-based. Well-documented but the "kitchen sink" approach means you pull in a large framework even if you only need diarization. | +| **Combinable with STT** | Yes. Has its own ASR components. Can build end-to-end pipelines within the framework. | +| **License** | Apache 2.0. | +| **GPU Support** | PyTorch CUDA. | + +**Verdict:** Good option if you want a single framework for everything (ASR + diarization + enhancement). However, for diarization specifically, pyannote is more focused and generally more accurate. SpeechBrain is better suited for teams that want deep customization of the diarization pipeline. + +--- + +### 2.4 Resemblyzer + +**Overview:** A Python library by Resemble AI for extracting speaker embeddings using a GE2E (Generalized End-to-End) model. Primarily a speaker verification/comparison tool, not a full diarization system. + +| Criterion | Assessment | +|---|---| +| **Accuracy** | Moderate. The underlying model is older and less accurate than pyannote or NeMo embeddings. | +| **Pre-recorded Audio** | Yes, but you must build your own clustering/segmentation logic on top. | +| **Ease of Integration** | Simple API for embedding extraction. But no built-in diarization pipeline; you need to implement VAD, segmentation, and clustering yourself. | +| **Combinable with STT** | Manually, with significant custom code. | +| **License** | Apache 2.0. | +| **GPU Support** | PyTorch (optional). | +| **Maintenance Status** | **Inactive.** No new releases or meaningful updates in over 12 months. 
| + +**Verdict:** **Not recommended** for new projects. It is essentially unmaintained and provides only embeddings, not a complete diarization solution. pyannote provides better embeddings and a complete pipeline. + +--- + +### Diarization Summary Comparison + +| Feature | pyannote.audio | NeMo | SpeechBrain | Resemblyzer | +|---|---|---|---|---| +| **Accuracy (DER)** | ~11-19% | ~10-18% | ~13-20% | N/A (not a full system) | +| **Complete Pipeline** | Yes | Yes | Yes | No (embeddings only) | +| **Ease of Setup** | Easy | Complex | Moderate | Easy (but incomplete) | +| **License** | MIT / CC-BY-4.0 | Apache 2.0 | Apache 2.0 | Apache 2.0 | +| **GPU Required** | No (recommended) | Practically yes | No (recommended) | No | +| **Actively Maintained** | Yes (v4.0, Feb 2026) | Yes | Yes | No | +| **Desktop-Friendly** | Good | Poor | Moderate | N/A | + +--- + +## 3. Combined Pipelines (STT + Diarization) + +### 3.1 WhisperX + +**Overview:** The most mature combined pipeline. Integrates faster-whisper (STT) + wav2vec2 (alignment) + pyannote.audio (diarization) into a single workflow. + +**How it works:** +1. **Transcription:** faster-whisper transcribes audio into coarse utterance-level segments with batched inference (~70x realtime with large-v2). +2. **Forced Alignment:** wav2vec2 refines timestamps to precise word-level accuracy. +3. **Diarization:** pyannote.audio segments the audio by speaker. +4. **Alignment:** Word-level timestamps from step 2 are aligned with speaker segments from step 3, assigning each word to a speaker. + +**Strengths:** +- Best word-level timestamp accuracy of any open-source solution. +- Speaker labels mapped to individual words. +- Handles long audio files through intelligent chunking. +- Active development, large community. + +**Weaknesses:** +- Python-only. Requires Python runtime with PyTorch, faster-whisper, and pyannote dependencies. +- Significant memory usage (multiple models loaded simultaneously). 
+- Pyannote model download requires accepting license on Hugging Face (one-time). + +**License:** BSD-4-Clause (WhisperX itself); dependencies are MIT/Apache. + +--- + +### 3.2 whisper-diarization (by MahmoudAshraf97) + +**Overview:** An alternative combined pipeline using Whisper + pyannote for diarization. Simpler than WhisperX but with fewer features. + +**Strengths:** +- Straightforward Python script approach. +- Uses pyannote for diarization. +- Easier to understand and modify. + +**Weaknesses:** +- Less optimized than WhisperX. +- Fewer alignment options. + +--- + +### 3.3 NVIDIA NeMo End-to-End + +**Overview:** NeMo can run ASR and diarization in a single framework. The Sortformer model handles diarization end-to-end, and NeMo ASR handles transcription. + +**Strengths:** +- Single framework, no glue code between separate libraries. +- State-of-the-art accuracy. +- Streaming support with Streaming Sortformer. + +**Weaknesses:** +- Requires NVIDIA GPU. +- Heavy framework, not consumer-desktop friendly. +- Complex configuration. + +--- + +### 3.4 Aligning Diarization with Transcription Timestamps + +The fundamental challenge: STT produces words with timestamps, while diarization produces speaker segments with timestamps. These must be merged. + +**Best Approach (used by WhisperX):** + +``` +1. Run STT -> get words with [start_time, end_time] per word +2. Run diarization -> get speaker segments [speaker_id, start_time, end_time] +3. For each word, find which speaker segment it falls into: + - Use the word's midpoint timestamp + - Assign the word to whichever speaker segment contains that midpoint + - Handle edge cases (words spanning segment boundaries) with majority overlap +``` + +**Alignment quality depends on:** +- **Word timestamp precision:** faster-whisper with wav2vec2 alignment provides the best precision. whisper.cpp timestamps can drift 300-800ms, which can cause mis-attribution at speaker boundaries. 
+- **Diarization segment precision:** pyannote.audio community-1 provides the tightest speaker boundaries. +- **Overlap handling:** In conversations where speakers overlap, both timestamps and diarization become less reliable. pyannote.audio 4.0 has specific overlapped speech detection. + +--- + +## 4. Final Recommendations + +### Primary Recommendation: Two-Tier Architecture + +Given the "Voice to Notes" requirements (local-first, consumer hardware, word-level timestamps for synchronized playback, speaker identification), I recommend a **two-tier architecture**: + +#### Tier 1: Core Transcription Engine (C/C++) + +**Use whisper.cpp** as the primary STT engine. + +- No Python dependency for the core app. +- Runs on all hardware (CPU, NVIDIA GPU, AMD GPU via Vulkan, Intel via OpenVINO). +- MIT license with no restrictions. +- Embed directly into your desktop app (Tauri, Qt, Electron with native addon). +- Use the `large-v3-turbo` model as the default (best speed/accuracy trade-off for consumer hardware). +- Offer `medium` and `small` models for lower-end hardware. +- Word-level timestamps are adequate for playback synchronization (300-800ms drift is acceptable when the UI highlights the current phrase rather than individual words). + +#### Tier 2: Enhanced Pipeline (Python Sidecar) + +**Use faster-whisper + pyannote.audio** via a Python sidecar process for users who want speaker diarization and precise word-level alignment. + +- Ship a bundled Python environment (e.g., via PyInstaller or conda-pack). +- Run the WhisperX-style pipeline: faster-whisper -> wav2vec2 alignment -> pyannote diarization. +- Communicate with the main app via IPC (stdin/stdout JSON, local socket, or gRPC). +- This gives the best word-level timestamps and speaker identification. +- Optional: only install/download when user enables "Speaker Identification" feature. 
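The stdin/stdout JSON IPC suggested for the sidecar can be sketched with the standard library alone. The message shape (`id`, `cmd`, `path`) is invented for illustration, and the inline worker is a stub standing in for the real faster-whisper + pyannote process:

```python
import json
import subprocess
import sys

# Stub worker standing in for the bundled sidecar: answers any
# {"cmd": "transcribe"} request with a canned result. The real worker
# would run faster-whisper and pyannote; the protocol is an assumption.
WORKER = r"""
import json, sys
for line in sys.stdin:
    req = json.loads(line)
    if req.get("cmd") == "transcribe":
        resp = {"id": req["id"],
                "words": [{"w": "hello", "t0": 0.1, "t1": 0.4,
                           "speaker": "SPEAKER_00"}]}
        print(json.dumps(resp), flush=True)
"""

def transcribe_via_sidecar(audio_path):
    """Send one newline-delimited JSON request and read one JSON response."""
    proc = subprocess.Popen([sys.executable, "-c", WORKER],
                            stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                            text=True)
    request = {"id": 1, "cmd": "transcribe", "path": audio_path}
    out, _ = proc.communicate(json.dumps(request) + "\n")
    return json.loads(out)

if __name__ == "__main__":
    print(transcribe_via_sidecar("meeting.wav"))
```

Newline-delimited JSON keeps the protocol trivially debuggable from a terminal; switching the transport to a local socket or gRPC later only changes `transcribe_via_sidecar`, not the message shape.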
+ +#### Model Selection Guide + +| User's Hardware | STT Model | Diarization | +|---|---|---| +| No GPU, 8GB RAM | whisper.cpp `small` (Q5_K_M) | pyannote on CPU (slower but works) | +| No GPU, 16GB RAM | whisper.cpp `medium` (Q5_K_M) | pyannote on CPU | +| NVIDIA GPU, 8GB+ VRAM | faster-whisper `large-v3-turbo` (int8) | pyannote on GPU | +| NVIDIA GPU, 4GB VRAM | faster-whisper `medium` (int8) | pyannote on GPU | +| Any hardware, speed priority | whisper.cpp `small` or `base` | Skip diarization | + +#### Optional Cloud Fallback + +For users who prefer cloud processing, integrate an optional cloud STT API (OpenAI Whisper API, AssemblyAI, or Deepgram) as a premium feature. This requires minimal code since the output format (words + timestamps + speakers) is the same regardless of backend. + +### Why Not Other Options? + +| Option | Reason to Skip | +|---|---| +| **Vosk** | Accuracy gap too large vs. Whisper. Only consider as a real-time streaming preview (show rough text while recording, then refine with Whisper afterward). | +| **Coqui STT** | Discontinued. No future. | +| **Resemblyzer** | Unmaintained, incomplete (no pipeline). | +| **NeMo (full)** | Too heavy for consumer desktop. NVIDIA-only for practical use. | +| **SpeechBrain** | Less accurate diarization than pyannote. Larger framework for less benefit. 
| + +### Recommended Technology Stack Summary + +``` +Desktop App Shell: Tauri (Rust) or Electron + | + +----------------+----------------+ + | | + Core STT Engine Enhanced Pipeline + (whisper.cpp, C/C++) (Python sidecar) + | | + - Transcription - faster-whisper (STT) + - Basic word timestamps - wav2vec2 (alignment) + - No speaker ID - pyannote.audio (diarization) + - Precise word timestamps + - Speaker identification +``` + +### Key Files and Repositories + +- **whisper.cpp:** https://github.com/ggml-org/whisper.cpp +- **faster-whisper:** https://github.com/SYSTRAN/faster-whisper +- **pyannote.audio:** https://github.com/pyannote/pyannote-audio +- **WhisperX:** https://github.com/m-bain/whisperX +- **whisper-diarization:** https://github.com/MahmoudAshraf97/whisper-diarization + +--- + +## Sources + +- [OpenAI Whisper vs Vosk Comparison (Jamy AI)](https://www.jamy.ai/blog/openai-whisper-vs-other-open-source-transcription-models/) +- [Top Open Source Transcription Software 2025 (Amical)](https://amical.ai/blog/open-source-transcription-software) +- [Choosing Between Whisper Variants (Modal)](https://modal.com/blog/choosing-whisper-variants) +- [whisper.cpp vs faster-whisper Practical Guide](https://www.alibaba.com/product-insights/a-practical-guide-to-choosing-between-whisper-cpp-and-faster-whisper-for-offline-transcription.html) +- [Top 8 Open Source STT Options (AssemblyAI)](https://www.assemblyai.com/blog/top-open-source-stt-options-for-voice-applications) +- [Best Speaker Diarization Models Compared 2026 (Brass Transcripts)](https://brasstranscripts.com/blog/speaker-diarization-models-comparison) +- [Pyannote vs NeMo Comparison (La Javaness)](https://lajavaness.medium.com/comparing-state-of-the-art-speaker-diarization-frameworks-pyannote-vs-nemo-31a191c6300) +- [Top Speaker Diarization Libraries 2026 (AssemblyAI)](https://www.assemblyai.com/blog/top-speaker-diarization-libraries-and-apis) +- [pyannote.audio Community-1 
Announcement](https://www.pyannote.ai/blog/community-1) +- [pyannote/speaker-diarization-3.1 (Hugging Face)](https://huggingface.co/pyannote/speaker-diarization-3.1) +- [Whisper Large-v3-turbo (Hugging Face)](https://huggingface.co/openai/whisper-large-v3-turbo) +- [Best Open Source STT 2026 with Benchmarks (Northflank)](https://northflank.com/blog/best-open-source-speech-to-text-stt-model-in-2026-benchmarks) +- [NVIDIA NeMo Speaker Diarization Docs](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/speaker_diarization/intro.html) +- [faster-whisper (GitHub)](https://github.com/SYSTRAN/faster-whisper) +- [WhisperX (GitHub)](https://github.com/m-bain/whisperX) +- [Vosk Accuracy Guide](https://alphacephei.com/vosk/accuracy)