
Voice to Notes: Speech-to-Text and Speaker Diarization Research Report

Date: 2026-02-26


Table of Contents

  1. Speech-to-Text Engines
  2. Speaker Diarization
  3. Combined Pipelines
  4. Final Recommendations

1. Speech-to-Text Engines

1.1 OpenAI Whisper / whisper.cpp

Overview: Whisper is OpenAI's general-purpose speech recognition model trained on 680,000 hours of multilingual data. whisper.cpp is a pure C/C++ port by Georgi Gerganov (ggml project) that removes the Python/PyTorch dependency entirely.

| Criterion | Assessment |
| --- | --- |
| Accuracy | State-of-the-art. Large-v3 achieves ~2.7% WER on clean audio, ~7.9% on mixed real-world audio. Large-v3-turbo achieves comparable accuracy (~7.75% WER) at much faster speed by reducing decoder layers from 32 to 4. |
| Speed | whisper.cpp with quantization (Q5_K_M) runs efficiently on CPU. GPU acceleration available via CUDA (NVIDIA), Vulkan (cross-vendor), Metal (Apple Silicon), and OpenVINO (Intel). Real-time or faster on modern hardware with medium/small models. |
| Language Support | 99 languages. |
| Ease of Integration | whisper.cpp: C/C++ library with C API, bindings available for many languages. No Python runtime needed. Straightforward to embed in a desktop app. |
| License | MIT (both Whisper and whisper.cpp). |
| GPU Acceleration | CUDA, Vulkan, Metal, OpenVINO, CoreML. Broad hardware coverage. |
| Word-Level Timestamps | Supported, but derived from forced alignment on decoded text rather than internal attention weights. Can drift 300-800ms on complex utterances. Acceptable for many use cases but not forensic-grade. |

Verdict: Best option for a native desktop app that needs to minimize dependencies. The C/C++ nature of whisper.cpp makes it ideal for embedding in an Electron, Qt, or Tauri application.
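As a sketch of how embedding might look from a host app, the snippet below shells out to the whisper.cpp CLI and parses its JSON output. This is an assumption-laden sketch: the binary name (`whisper-cli`, formerly `main`), the `-oj`/`-of` flags, and the JSON schema (`transcription` entries carrying millisecond `offsets`) vary between whisper.cpp releases, so verify them against your build's `--help`.

```python
import json
import subprocess


def transcribe_with_whisper_cpp(audio_path: str, model_path: str,
                                binary: str = "whisper-cli") -> list[dict]:
    """Run the whisper.cpp CLI and parse its JSON output.

    Assumes -oj (write JSON) and -of (output basename); both the flags
    and the binary name differ across whisper.cpp versions.
    """
    out_base = audio_path + ".whisper"
    subprocess.run([binary, "-m", model_path, "-f", audio_path,
                    "-oj", "-of", out_base], check=True)
    with open(out_base + ".json", encoding="utf-8") as f:
        return parse_whisper_cpp_json(json.load(f))


def parse_whisper_cpp_json(doc: dict) -> list[dict]:
    """Flatten whisper.cpp's JSON into simple segment dicts.

    The schema assumed here: a "transcription" array whose entries hold
    "offsets" (milliseconds) and "text". Adjust if your build differs.
    """
    return [
        {
            "start": seg["offsets"]["from"] / 1000.0,  # ms -> seconds
            "end": seg["offsets"]["to"] / 1000.0,
            "text": seg["text"].strip(),
        }
        for seg in doc.get("transcription", [])
    ]
```

A sidecar-free integration would instead link the C API directly; the CLI route shown here is merely the lowest-effort way to prototype.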


1.2 faster-whisper

Overview: A Python reimplementation of Whisper using CTranslate2, a high-performance C++ inference engine for Transformer models. Up to 4x faster than stock Whisper with the same accuracy, and lower memory usage.

| Criterion | Assessment |
| --- | --- |
| Accuracy | Identical to Whisper (same models, full fidelity). |
| Speed | Up to 4x faster than stock Whisper. ~20x realtime with GPU. 8-bit quantization available on both CPU and GPU. |
| Language Support | 99 languages (same models as Whisper). |
| Ease of Integration | Python library. Requires Python runtime. Excellent for Python-based or Python-embedded apps. Rich API with access to Whisper's tokenizer, alignment algorithms, and confidence scoring. |
| License | MIT. |
| GPU Acceleration | NVIDIA CUDA, AMD ROCm (via CTranslate2). CPU backends: Intel MKL, oneDNN, OpenBLAS, Ruy. |
| Word-Level Timestamps | Best-in-class among Whisper variants. Native alignment from the model's internals plus optional wav2vec2 alignment for even better precision. |

Verdict: Best choice if your app can embed a Python runtime (or run a Python sidecar process). Provides the most precise word-level timestamps of any Whisper variant, which is critical for synchronized playback. The trade-off is the Python dependency.
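For illustration, a minimal faster-whisper invocation with word timestamps enabled might look like the following. The `flatten_words` helper and its output dict shape are our own convention, not part of the library; the lazy import keeps the dependency optional.

```python
def transcribe_words(audio_path: str,
                     model_size: str = "large-v3-turbo") -> list[dict]:
    """Transcribe with faster-whisper and return per-word timestamps.

    Imported lazily so the rest of the module works without the
    faster-whisper dependency installed.
    """
    from faster_whisper import WhisperModel
    model = WhisperModel(model_size, device="auto", compute_type="int8")
    segments, _info = model.transcribe(audio_path, word_timestamps=True)
    return flatten_words(segments)


def flatten_words(segments) -> list[dict]:
    """Flatten faster-whisper segments into a single word list.

    Each word object exposes .word, .start, .end, .probability; the
    dict layout below is our own app-facing convention.
    """
    words = []
    for seg in segments:
        for w in seg.words or []:
            words.append({"word": w.word.strip(), "start": w.start,
                          "end": w.end, "confidence": w.probability})
    return words
```

The flat word list is what downstream steps (speaker assignment, playback highlighting) consume, which is why the sketch normalizes to it immediately.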


1.3 Vosk

Overview: A lightweight, Kaldi-based offline speech recognition toolkit. Optimized for efficiency and small footprint.

| Criterion | Assessment |
| --- | --- |
| Accuracy | Good but noticeably below Whisper-class models. Baseline WER can be ~20%+ depending on audio conditions, improvable to ~12% with domain-specific language model adaptation. |
| Speed | Very fast, even on low-end hardware. Supports real-time streaming natively. |
| Language Support | 20+ languages with pre-trained models. |
| Ease of Integration | Excellent. APIs for Python, Java, C#, JavaScript, Node.js, and more. Models are ~50MB. |
| License | Apache 2.0. |
| GPU Acceleration | Not required (runs efficiently on CPU). No GPU acceleration. |
| Word-Level Timestamps | Yes, provides word-level timestamps with start/end times and confidence in JSON output. |
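A minimal sketch of consuming Vosk's word-level JSON output. The streaming loop follows Vosk's standard recognizer API; the `extract_words` helper and its output dict shape are our own convention.

```python
import json
import wave


def vosk_words(audio_path: str, model_path: str) -> list[dict]:
    """Stream a 16-bit mono WAV through Vosk and collect word timings."""
    from vosk import Model, KaldiRecognizer  # lazy optional dependency
    wf = wave.open(audio_path, "rb")
    rec = KaldiRecognizer(Model(model_path), wf.getframerate())
    rec.SetWords(True)  # enable per-word timestamps in the JSON results
    words = []
    while True:
        chunk = wf.readframes(4000)
        if not chunk:
            break
        if rec.AcceptWaveform(chunk):  # True when an utterance finalizes
            words += extract_words(json.loads(rec.Result()))
    words += extract_words(json.loads(rec.FinalResult()))
    return words


def extract_words(result: dict) -> list[dict]:
    """Pull the word entries out of one Vosk result object.

    Vosk's final results carry a "result" list of
    {"word", "start", "end", "conf"} entries alongside "text".
    """
    return [{"word": w["word"], "start": w["start"], "end": w["end"],
             "confidence": w["conf"]} for w in result.get("result", [])]
```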

Verdict: Best for extremely resource-constrained scenarios or as a lightweight fallback. Not recommended as the primary engine for a quality-focused transcription app due to lower accuracy compared to Whisper-based solutions.


1.4 Coqui STT

Overview: Fork of Mozilla DeepSpeech. The Coqui company shut down in early 2024. The code remains available as open source, but the project is no longer maintained and the Model Zoo is offline.

| Criterion | Assessment |
| --- | --- |
| Accuracy | Below Whisper. Was competitive in the DeepSpeech era but has fallen behind. |
| Speed | Moderate. |
| Language Support | Limited compared to Whisper. |
| Ease of Integration | Python and native bindings available, but stale dependencies. |
| License | MPL 2.0. |
| GPU Acceleration | TensorFlow-based GPU support. |
| Word-Level Timestamps | Supported via metadata output. |

Verdict: Not recommended. The project is discontinued. No active maintenance, no security patches, no model improvements. Use Whisper-based alternatives instead.


1.5 Other Notable Options

Whisper Large-v3-turbo

OpenAI's latest Whisper variant (October 2024). Reduces decoder layers from 32 to 4 while maintaining accuracy close to large-v3. Achieves 216x realtime speed. Available in both whisper.cpp and faster-whisper.

NVIDIA NeMo ASR

Production-grade ASR with Conformer-CTC and Conformer-Transducer models. Best accuracy in some benchmarks but heavy dependency on NVIDIA ecosystem. Apache 2.0 license. Overkill for a desktop app unless targeting NVIDIA GPU users specifically.

Wav2Vec2 (Meta)

Strong accuracy when fine-tuned for specific domains. Good for real-time streaming. Often used as an alignment model rather than primary STT. MIT license.


STT Summary Comparison

| Feature | whisper.cpp | faster-whisper | Vosk | Coqui STT |
| --- | --- | --- | --- | --- |
| Accuracy | Excellent | Excellent | Good | Fair |
| Speed | Fast | Very Fast | Very Fast | Moderate |
| Languages | 99 | 99 | 20+ | Limited |
| Word Timestamps | Yes (some drift) | Yes (precise) | Yes | Yes |
| GPU Support | CUDA/Vulkan/Metal | CUDA/ROCm | CPU only | TensorFlow |
| License | MIT | MIT | Apache 2.0 | MPL 2.0 |
| Dependencies | None (C/C++) | Python + CTranslate2 | Minimal | Python + TF |
| Actively Maintained | Yes | Yes | Yes | No |
| Desktop-Friendly | Excellent | Good | Excellent | Poor |

2. Speaker Diarization

2.1 pyannote.audio

Overview: The leading open-source speaker diarization toolkit. Recently released version 4.0 with the "community-1" model, which substantially outperforms the previous 3.1 pipeline on standard benchmarks.

| Criterion | Assessment |
| --- | --- |
| Accuracy | Best-in-class open source. DER (Diarization Error Rate) ~11-19% on standard benchmarks. Community-1 model is a major leap over 3.1. |
| Pre-recorded Audio | Full support. Designed for both offline and streaming use. |
| Ease of Integration | Python library with PyTorch backend. Simple pipeline API: `pipeline("audio.wav")` returns speaker segments. Can run fully offline once models are downloaded. |
| Combinable with STT | Yes. WhisperX and whisper-diarization both use pyannote as their diarization backend. Well-established integration patterns. |
| License | Code: MIT. Models: speaker-diarization-3.1 is MIT; community-1 is CC-BY-4.0. Both allow commercial use. |
| GPU Support | Yes, PyTorch CUDA. Can also run on CPU (slower but functional). |

Verdict: Clear first choice for diarization. Most accurate, best maintained, largest community, and proven integration with Whisper-based STT. The community-1 model under CC-BY-4.0 is permissive enough for commercial desktop apps.
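The pipeline API can be used roughly as follows. The checkpoint id below targets the 3.1 pipeline (the 4.0 "community-1" pipeline has a different model id on Hugging Face), and the `tracks_to_segments` helper is our own convention, not part of pyannote.

```python
def diarize(audio_path: str, hf_token: str) -> list[dict]:
    """Run pyannote speaker diarization and return plain speaker segments.

    Requires one-time acceptance of the gated model terms on Hugging
    Face; newer pyannote versions may take `token=` instead of
    `use_auth_token=`.
    """
    from pyannote.audio import Pipeline  # lazy heavy dependency
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1", use_auth_token=hf_token)
    annotation = pipeline(audio_path)
    return tracks_to_segments(
        (turn, speaker)
        for turn, _, speaker in annotation.itertracks(yield_label=True))


def tracks_to_segments(tracks) -> list[dict]:
    """Convert (segment, label) pairs into plain dicts for the app layer.

    Each segment object exposes .start and .end in seconds.
    """
    return [{"speaker": label, "start": turn.start, "end": turn.end}
            for turn, label in tracks]
```

Normalizing to plain dicts at the boundary keeps PyTorch types out of the IPC layer between the sidecar and the desktop shell.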


2.2 NVIDIA NeMo Speaker Diarization

Overview: Part of NVIDIA's NeMo framework. Offers two approaches: end-to-end Sortformer Diarizer and cascaded pipeline (MarbleNet VAD + TitaNet embeddings + Multi-Scale Diarization Decoder).

| Criterion | Assessment |
| --- | --- |
| Accuracy | Competitive with or slightly better than pyannote in some benchmarks. Sortformer is state-of-the-art. |
| Pre-recorded Audio | Full support. Also has streaming Sortformer for real-time. |
| Ease of Integration | Heavy. NeMo is a large framework with many dependencies. Requires NVIDIA GPU for practical use. Complex configuration via YAML files. |
| Combinable with STT | Yes. NeMo includes its own ASR models. Can combine diarization with NeMo ASR in a single pipeline. |
| License | Apache 2.0. |
| GPU Support | NVIDIA GPU required for practical performance. |

Verdict: Best accuracy in some scenarios, but the heavy NVIDIA dependency and complex setup make it poorly suited for a consumer desktop app that must work across hardware. Good option if you can offer it as an optional backend for users with NVIDIA GPUs.


2.3 SpeechBrain

Overview: An open-source, all-in-one conversational AI toolkit built on PyTorch. Covers ASR, speaker identification, diarization, speech enhancement, and more.

| Criterion | Assessment |
| --- | --- |
| Accuracy | Good, though generally slightly behind pyannote on diarization-specific benchmarks. |
| Pre-recorded Audio | Full support. |
| Ease of Integration | Moderate. PyTorch-based. Well-documented, but the "kitchen sink" approach means you pull in a large framework even if you only need diarization. |
| Combinable with STT | Yes. Has its own ASR components. Can build end-to-end pipelines within the framework. |
| License | Apache 2.0. |
| GPU Support | PyTorch CUDA. |

Verdict: Good option if you want a single framework for everything (ASR + diarization + enhancement). However, for diarization specifically, pyannote is more focused and generally more accurate. SpeechBrain is better suited for teams that want deep customization of the diarization pipeline.


2.4 Resemblyzer

Overview: A Python library by Resemble AI for extracting speaker embeddings using a GE2E (Generalized End-to-End) model. Primarily a speaker verification/comparison tool, not a full diarization system.

| Criterion | Assessment |
| --- | --- |
| Accuracy | Moderate. The underlying model is older and less accurate than pyannote or NeMo embeddings. |
| Pre-recorded Audio | Yes, but you must build your own clustering/segmentation logic on top. |
| Ease of Integration | Simple API for embedding extraction, but no built-in diarization pipeline; you need to implement VAD, segmentation, and clustering yourself. |
| Combinable with STT | Manually, with significant custom code. |
| License | Apache 2.0. |
| GPU Support | PyTorch (optional). |
| Maintenance Status | Inactive. No new releases or meaningful updates in over 12 months. |

Verdict: Not recommended for new projects. It is essentially unmaintained and provides only embeddings, not a complete diarization solution. pyannote provides better embeddings and a complete pipeline.


Diarization Summary Comparison

| Feature | pyannote.audio | NeMo | SpeechBrain | Resemblyzer |
| --- | --- | --- | --- | --- |
| Accuracy (DER) | ~11-19% | ~10-18% | ~13-20% | N/A (not a full system) |
| Complete Pipeline | Yes | Yes | Yes | No (embeddings only) |
| Ease of Setup | Easy | Complex | Moderate | Easy (but incomplete) |
| License | MIT / CC-BY-4.0 | Apache 2.0 | Apache 2.0 | Apache 2.0 |
| GPU Required | No (recommended) | Practically yes | No (recommended) | No |
| Actively Maintained | Yes (v4.0, Feb 2026) | Yes | Yes | No |
| Desktop-Friendly | Good | Poor | Moderate | N/A |

3. Combined Pipelines (STT + Diarization)

3.1 WhisperX

Overview: The most mature combined pipeline. Integrates faster-whisper (STT) + wav2vec2 (alignment) + pyannote.audio (diarization) into a single workflow.

How it works:

  1. Transcription: faster-whisper transcribes audio into coarse utterance-level segments with batched inference (~70x realtime with large-v2).
  2. Forced Alignment: wav2vec2 refines timestamps to precise word-level accuracy.
  3. Diarization: pyannote.audio segments the audio by speaker.
  4. Alignment: Word-level timestamps from step 2 are aligned with speaker segments from step 3, assigning each word to a speaker.
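The four steps above can be sketched against WhisperX's published API. Treat this as a hedged sketch: WhisperX's API has shifted across releases (e.g. where `DiarizationPipeline` lives), and the `speakers_in` helper at the end is our own addition, not part of WhisperX.

```python
def run_whisperx(audio_file: str, hf_token: str, device: str = "cuda") -> dict:
    """Run the four WhisperX stages, following its README-era API."""
    import whisperx  # lazy heavy dependency
    audio = whisperx.load_audio(audio_file)
    # 1. Transcription (batched faster-whisper backend)
    model = whisperx.load_model("large-v2", device, compute_type="float16")
    result = model.transcribe(audio, batch_size=16)
    # 2. Forced alignment with a wav2vec2 model for the detected language
    align_model, metadata = whisperx.load_align_model(
        language_code=result["language"], device=device)
    result = whisperx.align(result["segments"], align_model, metadata,
                            audio, device)
    # 3. Diarization via pyannote (gated model, needs an HF token)
    diarize_model = whisperx.DiarizationPipeline(
        use_auth_token=hf_token, device=device)
    diarize_segments = diarize_model(audio)
    # 4. Map each word to a speaker segment
    return whisperx.assign_word_speakers(diarize_segments, result)


def speakers_in(result: dict) -> set:
    """Distinct speaker labels in a WhisperX-style result dict.

    Words that could not be attributed simply lack a "speaker" key.
    """
    return {w["speaker"] for seg in result.get("segments", [])
            for w in seg.get("words", []) if "speaker" in w}
```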

Strengths:

  • Best word-level timestamp accuracy of any open-source solution.
  • Speaker labels mapped to individual words.
  • Handles long audio files through intelligent chunking.
  • Active development, large community.

Weaknesses:

  • Python-only. Requires Python runtime with PyTorch, faster-whisper, and pyannote dependencies.
  • Significant memory usage (multiple models loaded simultaneously).
  • Pyannote model download requires accepting license on Hugging Face (one-time).

License: BSD-4-Clause (WhisperX itself); dependencies are MIT/Apache.


3.2 whisper-diarization (by MahmoudAshraf97)

Overview: An alternative combined pipeline using Whisper + pyannote for diarization. Simpler than WhisperX but with fewer features.

Strengths:

  • Straightforward Python script approach.
  • Uses pyannote for diarization.
  • Easier to understand and modify.

Weaknesses:

  • Less optimized than WhisperX.
  • Fewer alignment options.

3.3 NVIDIA NeMo End-to-End

Overview: NeMo can run ASR and diarization in a single framework. The Sortformer model handles diarization end-to-end, and NeMo ASR handles transcription.

Strengths:

  • Single framework, no glue code between separate libraries.
  • State-of-the-art accuracy.
  • Streaming support with Streaming Sortformer.

Weaknesses:

  • Requires NVIDIA GPU.
  • Heavy framework, not consumer-desktop friendly.
  • Complex configuration.

3.4 Aligning Diarization with Transcription Timestamps

The fundamental challenge: STT produces words with timestamps, while diarization produces speaker segments with timestamps. These must be merged.

Best Approach (used by WhisperX):

1. Run STT -> get words with [start_time, end_time] per word
2. Run diarization -> get speaker segments [speaker_id, start_time, end_time]
3. For each word, find which speaker segment it falls into:
   - Use the word's midpoint timestamp
   - Assign the word to whichever speaker segment contains that midpoint
   - Handle edge cases (words spanning segment boundaries) with majority overlap
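A minimal, self-contained implementation of this midpoint-plus-overlap assignment; the field names (`word`, `start`, `end`, `speaker`) are illustrative conventions, not a fixed schema.

```python
def _overlap(w: dict, s: dict) -> float:
    """Length (seconds) of the time overlap between a word and a segment."""
    return max(0.0, min(w["end"], s["end"]) - max(w["start"], s["start"]))


def assign_speakers(words: list, segments: list) -> list:
    """Assign each STT word to a diarization speaker segment.

    words:    [{"word", "start", "end"}, ...]      from STT
    segments: [{"speaker", "start", "end"}, ...]   from diarization
    """
    out = []
    for w in words:
        mid = (w["start"] + w["end"]) / 2
        # Midpoint rule: the segment containing the word's midpoint wins.
        speaker = next((s["speaker"] for s in segments
                        if s["start"] <= mid < s["end"]), None)
        if speaker is None and segments:
            # Edge case (midpoint in a gap or on a boundary): fall back
            # to the segment with the largest overlap, if any overlaps.
            best = max(segments, key=lambda s: _overlap(w, s))
            if _overlap(w, best) > 0:
                speaker = best["speaker"]
        out.append({**w, "speaker": speaker})
    return out
```

Words with no overlapping segment at all keep `speaker = None`, which the UI can render as "unknown speaker" rather than guessing.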

Alignment quality depends on:

  • Word timestamp precision: faster-whisper with wav2vec2 alignment provides the best precision. whisper.cpp timestamps can drift 300-800ms, which can cause mis-attribution at speaker boundaries.
  • Diarization segment precision: pyannote.audio community-1 provides the tightest speaker boundaries.
  • Overlap handling: In conversations where speakers overlap, both timestamps and diarization become less reliable. pyannote.audio 4.0 has specific overlapped speech detection.

4. Final Recommendations

Primary Recommendation: Two-Tier Architecture

Given the "Voice to Notes" requirements (local-first, consumer hardware, word-level timestamps for synchronized playback, speaker identification), I recommend a two-tier architecture:

Tier 1: Core Transcription Engine (C/C++)

Use whisper.cpp as the primary STT engine.

  • No Python dependency for the core app.
  • Runs on all hardware (CPU, NVIDIA GPU, AMD GPU via Vulkan, Intel via OpenVINO).
  • MIT license with no restrictions.
  • Embed directly into your desktop app (Tauri, Qt, Electron with native addon).
  • Use the large-v3-turbo model as the default (best speed/accuracy trade-off for consumer hardware).
  • Offer medium and small models for lower-end hardware.
  • Word-level timestamps are adequate for playback synchronization (300-800ms drift is acceptable when the UI highlights the current phrase rather than individual words).

Tier 2: Enhanced Pipeline (Python Sidecar)

Use faster-whisper + pyannote.audio via a Python sidecar process for users who want speaker diarization and precise word-level alignment.

  • Ship a bundled Python environment (e.g., via PyInstaller or conda-pack).
  • Run the WhisperX-style pipeline: faster-whisper -> wav2vec2 alignment -> pyannote diarization.
  • Communicate with the main app via IPC (stdin/stdout JSON, local socket, or gRPC).
  • This gives the best word-level timestamps and speaker identification.
  • Optional: only install/download when user enables "Speaker Identification" feature.
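One way to sketch the IPC layer is newline-delimited JSON over stdin/stdout. The message shapes below (`cmd`, `id`, `ok`) are illustrative, not a fixed protocol; real handlers would dispatch to faster-whisper and pyannote.

```python
import json
import sys


def handle_request(req: dict) -> dict:
    """Dispatch one request from the host app.

    Only "ping" is wired up here; a real sidecar would add handlers
    such as "transcribe" and "diarize" (hypothetical names).
    """
    if req.get("cmd") == "ping":
        return {"id": req.get("id"), "ok": True, "result": "pong"}
    return {"id": req.get("id"), "ok": False, "error": "unknown command"}


def serve(stdin=sys.stdin, stdout=sys.stdout) -> None:
    """Newline-delimited JSON loop: one request per line, one reply per line.

    Flushing after every reply matters: the host app blocks on the
    sidecar's stdout, and buffered replies would deadlock it.
    """
    for line in stdin:
        line = line.strip()
        if not line:
            continue
        reply = handle_request(json.loads(line))
        stdout.write(json.dumps(reply) + "\n")
        stdout.flush()
```

The same `serve` loop works unchanged over a local socket; stdin/stdout is simply the easiest transport for a child process spawned by Tauri or Electron.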

Model Selection Guide

| User's Hardware | STT Model | Diarization |
| --- | --- | --- |
| No GPU, 8GB RAM | whisper.cpp small (Q5_K_M) | pyannote on CPU (slower but works) |
| No GPU, 16GB RAM | whisper.cpp medium (Q5_K_M) | pyannote on CPU |
| NVIDIA GPU, 8GB+ VRAM | faster-whisper large-v3-turbo (int8) | pyannote on GPU |
| NVIDIA GPU, 4GB VRAM | faster-whisper medium (int8) | pyannote on GPU |
| Any hardware, speed priority | whisper.cpp small or base | Skip diarization |
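The guide can be encoded as a simple default-configuration helper. The thresholds mirror the table above and are starting-point assumptions to tune against real benchmarks, not hard rules.

```python
def pick_models(has_nvidia_gpu: bool, vram_gb: float, ram_gb: float,
                speed_priority: bool = False) -> dict:
    """Map the hardware tiers from the table to a default configuration."""
    if speed_priority:
        # Fastest path: small CPU model, no diarization pass.
        return {"engine": "whisper.cpp", "model": "small",
                "diarization": None}
    if has_nvidia_gpu:
        # GPU users get the faster-whisper sidecar; model size by VRAM.
        model = "large-v3-turbo" if vram_gb >= 8 else "medium"
        return {"engine": "faster-whisper", "model": f"{model} (int8)",
                "diarization": "pyannote (GPU)"}
    # CPU-only: quantized whisper.cpp model sized by system RAM.
    model = "medium" if ram_gb >= 16 else "small"
    return {"engine": "whisper.cpp", "model": f"{model} (Q5_K_M)",
            "diarization": "pyannote (CPU)"}
```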

Optional Cloud Fallback

For users who prefer cloud processing, integrate an optional cloud STT API (OpenAI Whisper API, AssemblyAI, or Deepgram) as a premium feature. This requires minimal code since the output format (words + timestamps + speakers) is the same regardless of backend.

Why Not Other Options?

| Option | Reason to Skip |
| --- | --- |
| Vosk | Accuracy gap too large vs. Whisper. Only consider as a real-time streaming preview (show rough text while recording, then refine with Whisper afterward). |
| Coqui STT | Discontinued. No future. |
| Resemblyzer | Unmaintained, incomplete (no pipeline). |
| NeMo (full) | Too heavy for consumer desktop. NVIDIA-only for practical use. |
| SpeechBrain | Less accurate diarization than pyannote. Larger framework for less benefit. |
Desktop App Shell: Tauri (Rust) or Electron
                         |
        +----------------+----------------+
        |                                 |
  Core STT Engine                  Enhanced Pipeline
  (whisper.cpp, C/C++)            (Python sidecar)
        |                                 |
  - Transcription              - faster-whisper (STT)
  - Basic word timestamps      - wav2vec2 (alignment)
  - No speaker ID              - pyannote.audio (diarization)
                               - Precise word timestamps
                               - Speaker identification

Key Files and Repositories


Sources