
Voice to Notes: Speech-to-Text and Speaker Diarization Research Report

Date: 2026-02-26


Table of Contents

  1. Speech-to-Text Engines
  2. Speaker Diarization
  3. Combined Pipelines
  4. Final Recommendations

1. Speech-to-Text Engines

1.1 OpenAI Whisper / whisper.cpp

Overview: Whisper is OpenAI's general-purpose speech recognition model trained on 680,000 hours of multilingual data. whisper.cpp is a pure C/C++ port by Georgi Gerganov (ggml project) that removes the Python/PyTorch dependency entirely.

| Criterion | Assessment |
| --- | --- |
| Accuracy | State-of-the-art. Large-v3 achieves ~2.7% WER on clean audio, ~7.9% on mixed real-world audio. Large-v3-turbo achieves comparable accuracy (~7.75% WER) at much faster speed by reducing decoder layers from 32 to 4. |
| Speed | whisper.cpp with quantization (Q5_K_M) runs efficiently on CPU. GPU acceleration available via CUDA (NVIDIA), Vulkan (cross-vendor), Metal (Apple Silicon), and OpenVINO (Intel). Real-time or faster on modern hardware with medium/small models. |
| Language Support | 99 languages. |
| Ease of Integration | whisper.cpp: C/C++ library with C API, bindings available for many languages. No Python runtime needed. Straightforward to embed in a desktop app. |
| License | MIT (both Whisper and whisper.cpp). |
| GPU Acceleration | CUDA, Vulkan, Metal, OpenVINO, CoreML. Broad hardware coverage. |
| Word-Level Timestamps | Supported, but derived from forced alignment on decoded text rather than internal attention weights. Can drift 300-800ms on complex utterances. Acceptable for many use cases but not forensic-grade. |

Verdict: Best option for a native desktop app that needs to minimize dependencies. The C/C++ nature of whisper.cpp makes it ideal for embedding in an Electron, Qt, or Tauri application.
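As a sketch of how embedding might look from a host app, the snippet below shells out to the whisper.cpp CLI and parses its JSON output. This is an assumption-laden sketch: the binary name (`whisper-cli`, formerly `main`), the `-oj`/`-of` flags, and the JSON schema (`transcription` entries carrying millisecond `offsets`) vary between whisper.cpp releases, so verify them against your build's `--help`.

```python
import json
import subprocess


def transcribe_with_whisper_cpp(audio_path: str, model_path: str,
                                binary: str = "whisper-cli") -> list[dict]:
    """Run the whisper.cpp CLI and parse its JSON output.

    Assumes -oj (write JSON) and -of (output basename); both the flags
    and the binary name differ across whisper.cpp versions.
    """
    out_base = audio_path + ".whisper"
    subprocess.run([binary, "-m", model_path, "-f", audio_path,
                    "-oj", "-of", out_base], check=True)
    with open(out_base + ".json", encoding="utf-8") as f:
        return parse_whisper_cpp_json(json.load(f))


def parse_whisper_cpp_json(doc: dict) -> list[dict]:
    """Flatten whisper.cpp's JSON into simple segment dicts.

    The schema assumed here: a "transcription" array whose entries hold
    "offsets" (milliseconds) and "text". Adjust if your build differs.
    """
    return [
        {
            "start": seg["offsets"]["from"] / 1000.0,  # ms -> seconds
            "end": seg["offsets"]["to"] / 1000.0,
            "text": seg["text"].strip(),
        }
        for seg in doc.get("transcription", [])
    ]
```

A sidecar-free integration would instead link the C API directly; the CLI route shown here is merely the lowest-effort way to prototype.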


1.2 faster-whisper

Overview: A Python reimplementation of Whisper using CTranslate2, a high-performance C++ inference engine for Transformer models. Up to 4x faster than stock Whisper with the same accuracy, and lower memory usage.

| Criterion | Assessment |
| --- | --- |
| Accuracy | Identical to Whisper (same models, full fidelity). |
| Speed | Up to 4x faster than stock Whisper. ~20x realtime with GPU. 8-bit quantization available on both CPU and GPU. |
| Language Support | 99 languages (same models as Whisper). |
| Ease of Integration | Python library. Requires Python runtime. Excellent for Python-based or Python-embedded apps. Rich API with access to Whisper's tokenizer, alignment algorithms, and confidence scoring. |
| License | MIT. |
| GPU Acceleration | NVIDIA CUDA, AMD ROCm (via CTranslate2). CPU backends: Intel MKL, oneDNN, OpenBLAS, Ruy. |
| Word-Level Timestamps | Best-in-class among Whisper variants. Native alignment from the model's internals plus optional wav2vec2 alignment for even better precision. |

Verdict: Best choice if your app can embed a Python runtime (or run a Python sidecar process). Provides the most precise word-level timestamps of any Whisper variant, which is critical for synchronized playback. The trade-off is the Python dependency.
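For illustration, a minimal faster-whisper invocation with word timestamps enabled might look like the following. The `flatten_words` helper and its output dict shape are our own convention, not part of the library; the lazy import keeps the dependency optional.

```python
def transcribe_words(audio_path: str,
                     model_size: str = "large-v3-turbo") -> list[dict]:
    """Transcribe with faster-whisper and return per-word timestamps.

    Imported lazily so the rest of the module works without the
    faster-whisper dependency installed.
    """
    from faster_whisper import WhisperModel
    model = WhisperModel(model_size, device="auto", compute_type="int8")
    segments, _info = model.transcribe(audio_path, word_timestamps=True)
    return flatten_words(segments)


def flatten_words(segments) -> list[dict]:
    """Flatten faster-whisper segments into a single word list.

    Each word object exposes .word, .start, .end, .probability; the
    dict layout below is our own app-facing convention.
    """
    words = []
    for seg in segments:
        for w in seg.words or []:
            words.append({"word": w.word.strip(), "start": w.start,
                          "end": w.end, "confidence": w.probability})
    return words
```

The flat word list is what downstream steps (speaker assignment, playback highlighting) consume, which is why the sketch normalizes to it immediately.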


1.3 Vosk

Overview: A lightweight, Kaldi-based offline speech recognition toolkit. Optimized for efficiency and small footprint.

| Criterion | Assessment |
| --- | --- |
| Accuracy | Good but noticeably below Whisper-class models. Baseline WER can be ~20%+ depending on audio conditions, improvable to ~12% with domain-specific language model adaptation. |
| Speed | Very fast, even on low-end hardware. Supports real-time streaming natively. |
| Language Support | 20+ languages with pre-trained models. |
| Ease of Integration | Excellent. APIs for Python, Java, C#, JavaScript, Node.js, and more. Models are ~50MB. |
| License | Apache 2.0. |
| GPU Acceleration | Not required (runs efficiently on CPU). No GPU acceleration. |
| Word-Level Timestamps | Yes, provides word-level timestamps with start/end times and confidence in JSON output. |
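A minimal sketch of consuming Vosk's word-level JSON output. The streaming loop follows Vosk's standard recognizer API; the `extract_words` helper and its output dict shape are our own convention.

```python
import json
import wave


def vosk_words(audio_path: str, model_path: str) -> list[dict]:
    """Stream a 16-bit mono WAV through Vosk and collect word timings."""
    from vosk import Model, KaldiRecognizer  # lazy optional dependency
    wf = wave.open(audio_path, "rb")
    rec = KaldiRecognizer(Model(model_path), wf.getframerate())
    rec.SetWords(True)  # enable per-word timestamps in the JSON results
    words = []
    while True:
        chunk = wf.readframes(4000)
        if not chunk:
            break
        if rec.AcceptWaveform(chunk):  # True when an utterance finalizes
            words += extract_words(json.loads(rec.Result()))
    words += extract_words(json.loads(rec.FinalResult()))
    return words


def extract_words(result: dict) -> list[dict]:
    """Pull the word entries out of one Vosk result object.

    Vosk's final results carry a "result" list of
    {"word", "start", "end", "conf"} entries alongside "text".
    """
    return [{"word": w["word"], "start": w["start"], "end": w["end"],
             "confidence": w["conf"]} for w in result.get("result", [])]
```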

Verdict: Best for extremely resource-constrained scenarios or as a lightweight fallback. Not recommended as the primary engine for a quality-focused transcription app due to lower accuracy compared to Whisper-based solutions.


1.4 Coqui STT

Overview: Fork of Mozilla DeepSpeech. The Coqui company shut down in early 2024. The code remains available as open source, but the project is no longer maintained and the Model Zoo is offline.

| Criterion | Assessment |
| --- | --- |
| Accuracy | Below Whisper. Was competitive in the DeepSpeech era but has fallen behind. |
| Speed | Moderate. |
| Language Support | Limited compared to Whisper. |
| Ease of Integration | Python and native bindings available, but stale dependencies. |
| License | MPL 2.0. |
| GPU Acceleration | TensorFlow-based GPU support. |
| Word-Level Timestamps | Supported via metadata output. |

Verdict: Not recommended. The project is discontinued. No active maintenance, no security patches, no model improvements. Use Whisper-based alternatives instead.


1.5 Other Notable Options

Whisper Large-v3-turbo

OpenAI's latest Whisper variant (October 2024). Reduces decoder layers from 32 to 4 while maintaining accuracy close to large-v3. Achieves 216x realtime speed. Available in both whisper.cpp and faster-whisper.

NVIDIA NeMo ASR

Production-grade ASR with Conformer-CTC and Conformer-Transducer models. Best accuracy in some benchmarks but heavy dependency on NVIDIA ecosystem. Apache 2.0 license. Overkill for a desktop app unless targeting NVIDIA GPU users specifically.

Wav2Vec2 (Meta)

Strong accuracy when fine-tuned for specific domains. Good for real-time streaming. Often used as an alignment model rather than primary STT. MIT license.


STT Summary Comparison

| Feature | whisper.cpp | faster-whisper | Vosk | Coqui STT |
| --- | --- | --- | --- | --- |
| Accuracy | Excellent | Excellent | Good | Fair |
| Speed | Fast | Very Fast | Very Fast | Moderate |
| Languages | 99 | 99 | 20+ | Limited |
| Word Timestamps | Yes (some drift) | Yes (precise) | Yes | Yes |
| GPU Support | CUDA/Vulkan/Metal | CUDA/ROCm | CPU only | TensorFlow |
| License | MIT | MIT | Apache 2.0 | MPL 2.0 |
| Dependencies | None (C/C++) | Python + CTranslate2 | Minimal | Python + TF |
| Actively Maintained | Yes | Yes | Yes | No |
| Desktop-Friendly | Excellent | Good | Excellent | Poor |

2. Speaker Diarization

2.1 pyannote.audio

Overview: The leading open-source speaker diarization toolkit. Recently released version 4.0 with the "community-1" model, which substantially outperforms the previous 3.1 pipeline on standard benchmarks.

| Criterion | Assessment |
| --- | --- |
| Accuracy | Best-in-class open source. DER (Diarization Error Rate) ~11-19% on standard benchmarks. Community-1 model is a major leap over 3.1. |
| Pre-recorded Audio | Full support. Designed for both offline and streaming use. |
| Ease of Integration | Python library with PyTorch backend. Simple pipeline API: `pipeline("audio.wav")` returns speaker segments. Can run fully offline once models are downloaded. |
| Combinable with STT | Yes. WhisperX and whisper-diarization both use pyannote as their diarization backend. Well-established integration patterns. |
| License | Code: MIT. Models: speaker-diarization-3.1 is MIT; community-1 is CC-BY-4.0. Both allow commercial use. |
| GPU Support | Yes, PyTorch CUDA. Can also run on CPU (slower but functional). |

Verdict: Clear first choice for diarization. Most accurate, best maintained, largest community, and proven integration with Whisper-based STT. The community-1 model under CC-BY-4.0 is permissive enough for commercial desktop apps.
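The pipeline API can be used roughly as follows. The checkpoint id below targets the 3.1 pipeline (the 4.0 "community-1" pipeline has a different model id on Hugging Face), and the `tracks_to_segments` helper is our own convention, not part of pyannote.

```python
def diarize(audio_path: str, hf_token: str) -> list[dict]:
    """Run pyannote speaker diarization and return plain speaker segments.

    Requires one-time acceptance of the gated model terms on Hugging
    Face; newer pyannote versions may take `token=` instead of
    `use_auth_token=`.
    """
    from pyannote.audio import Pipeline  # lazy heavy dependency
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1", use_auth_token=hf_token)
    annotation = pipeline(audio_path)
    return tracks_to_segments(
        (turn, speaker)
        for turn, _, speaker in annotation.itertracks(yield_label=True))


def tracks_to_segments(tracks) -> list[dict]:
    """Convert (segment, label) pairs into plain dicts for the app layer.

    Each segment object exposes .start and .end in seconds.
    """
    return [{"speaker": label, "start": turn.start, "end": turn.end}
            for turn, label in tracks]
```

Normalizing to plain dicts at the boundary keeps PyTorch types out of the IPC layer between the sidecar and the desktop shell.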


2.2 NVIDIA NeMo Speaker Diarization

Overview: Part of NVIDIA's NeMo framework. Offers two approaches: end-to-end Sortformer Diarizer and cascaded pipeline (MarbleNet VAD + TitaNet embeddings + Multi-Scale Diarization Decoder).

| Criterion | Assessment |
| --- | --- |
| Accuracy | Competitive with or slightly better than pyannote in some benchmarks. Sortformer is state-of-the-art. |
| Pre-recorded Audio | Full support. Also has streaming Sortformer for real-time. |
| Ease of Integration | Heavy. NeMo is a large framework with many dependencies. Requires NVIDIA GPU for practical use. Complex configuration via YAML files. |
| Combinable with STT | Yes. NeMo includes its own ASR models. Can combine diarization with NeMo ASR in a single pipeline. |
| License | Apache 2.0. |
| GPU Support | NVIDIA GPU required for practical performance. |

Verdict: Best accuracy in some scenarios, but the heavy NVIDIA dependency and complex setup make it poorly suited for a consumer desktop app that must work across hardware. Good option if you can offer it as an optional backend for users with NVIDIA GPUs.


2.3 SpeechBrain

Overview: An open-source, all-in-one conversational AI toolkit built on PyTorch. Covers ASR, speaker identification, diarization, speech enhancement, and more.

| Criterion | Assessment |
| --- | --- |
| Accuracy | Good, though generally slightly behind pyannote on diarization-specific benchmarks. |
| Pre-recorded Audio | Full support. |
| Ease of Integration | Moderate. PyTorch-based. Well-documented, but the "kitchen sink" approach means you pull in a large framework even if you only need diarization. |
| Combinable with STT | Yes. Has its own ASR components. Can build end-to-end pipelines within the framework. |
| License | Apache 2.0. |
| GPU Support | PyTorch CUDA. |

Verdict: Good option if you want a single framework for everything (ASR + diarization + enhancement). However, for diarization specifically, pyannote is more focused and generally more accurate. SpeechBrain is better suited for teams that want deep customization of the diarization pipeline.


2.4 Resemblyzer

Overview: A Python library by Resemble AI for extracting speaker embeddings using a GE2E (Generalized End-to-End) model. Primarily a speaker verification/comparison tool, not a full diarization system.

| Criterion | Assessment |
| --- | --- |
| Accuracy | Moderate. The underlying model is older and less accurate than pyannote or NeMo embeddings. |
| Pre-recorded Audio | Yes, but you must build your own clustering/segmentation logic on top. |
| Ease of Integration | Simple API for embedding extraction, but no built-in diarization pipeline; you need to implement VAD, segmentation, and clustering yourself. |
| Combinable with STT | Manually, with significant custom code. |
| License | Apache 2.0. |
| GPU Support | PyTorch (optional). |
| Maintenance Status | Inactive. No new releases or meaningful updates in over 12 months. |

Verdict: Not recommended for new projects. It is essentially unmaintained and provides only embeddings, not a complete diarization solution. pyannote provides better embeddings and a complete pipeline.


Diarization Summary Comparison

| Feature | pyannote.audio | NeMo | SpeechBrain | Resemblyzer |
| --- | --- | --- | --- | --- |
| Accuracy (DER) | ~11-19% | ~10-18% | ~13-20% | N/A (not a full system) |
| Complete Pipeline | Yes | Yes | Yes | No (embeddings only) |
| Ease of Setup | Easy | Complex | Moderate | Easy (but incomplete) |
| License | MIT / CC-BY-4.0 | Apache 2.0 | Apache 2.0 | Apache 2.0 |
| GPU Required | No (recommended) | Practically yes | No (recommended) | No |
| Actively Maintained | Yes (v4.0, Feb 2026) | Yes | Yes | No |
| Desktop-Friendly | Good | Poor | Moderate | N/A |

3. Combined Pipelines (STT + Diarization)

3.1 WhisperX

Overview: The most mature combined pipeline. Integrates faster-whisper (STT) + wav2vec2 (alignment) + pyannote.audio (diarization) into a single workflow.

How it works:

  1. Transcription: faster-whisper transcribes audio into coarse utterance-level segments with batched inference (~70x realtime with large-v2).
  2. Forced Alignment: wav2vec2 refines timestamps to precise word-level accuracy.
  3. Diarization: pyannote.audio segments the audio by speaker.
  4. Alignment: Word-level timestamps from step 2 are aligned with speaker segments from step 3, assigning each word to a speaker.
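The four steps above can be sketched against WhisperX's published API. Treat this as a hedged sketch: WhisperX's API has shifted across releases (e.g. where `DiarizationPipeline` lives), and the `speakers_in` helper at the end is our own addition, not part of WhisperX.

```python
def run_whisperx(audio_file: str, hf_token: str, device: str = "cuda") -> dict:
    """Run the four WhisperX stages, following its README-era API."""
    import whisperx  # lazy heavy dependency
    audio = whisperx.load_audio(audio_file)
    # 1. Transcription (batched faster-whisper backend)
    model = whisperx.load_model("large-v2", device, compute_type="float16")
    result = model.transcribe(audio, batch_size=16)
    # 2. Forced alignment with a wav2vec2 model for the detected language
    align_model, metadata = whisperx.load_align_model(
        language_code=result["language"], device=device)
    result = whisperx.align(result["segments"], align_model, metadata,
                            audio, device)
    # 3. Diarization via pyannote (gated model, needs an HF token)
    diarize_model = whisperx.DiarizationPipeline(
        use_auth_token=hf_token, device=device)
    diarize_segments = diarize_model(audio)
    # 4. Map each word to a speaker segment
    return whisperx.assign_word_speakers(diarize_segments, result)


def speakers_in(result: dict) -> set:
    """Distinct speaker labels in a WhisperX-style result dict.

    Words that could not be attributed simply lack a "speaker" key.
    """
    return {w["speaker"] for seg in result.get("segments", [])
            for w in seg.get("words", []) if "speaker" in w}
```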

Strengths:

  • Best word-level timestamp accuracy of any open-source solution.
  • Speaker labels mapped to individual words.
  • Handles long audio files through intelligent chunking.
  • Active development, large community.

Weaknesses:

  • Python-only. Requires Python runtime with PyTorch, faster-whisper, and pyannote dependencies.
  • Significant memory usage (multiple models loaded simultaneously).
  • Pyannote model download requires accepting license on Hugging Face (one-time).

License: BSD-4-Clause (WhisperX itself); dependencies are MIT/Apache.


3.2 whisper-diarization (by MahmoudAshraf97)

Overview: An alternative combined pipeline using Whisper + pyannote for diarization. Simpler than WhisperX but with fewer features.

Strengths:

  • Straightforward Python script approach.
  • Uses pyannote for diarization.
  • Easier to understand and modify.

Weaknesses:

  • Less optimized than WhisperX.
  • Fewer alignment options.

3.3 NVIDIA NeMo End-to-End

Overview: NeMo can run ASR and diarization in a single framework. The Sortformer model handles diarization end-to-end, and NeMo ASR handles transcription.

Strengths:

  • Single framework, no glue code between separate libraries.
  • State-of-the-art accuracy.
  • Streaming support with Streaming Sortformer.

Weaknesses:

  • Requires NVIDIA GPU.
  • Heavy framework, not consumer-desktop friendly.
  • Complex configuration.

3.4 Aligning Diarization with Transcription Timestamps

The fundamental challenge: STT produces words with timestamps, while diarization produces speaker segments with timestamps. These must be merged.

Best Approach (used by WhisperX):

1. Run STT -> get words with [start_time, end_time] per word
2. Run diarization -> get speaker segments [speaker_id, start_time, end_time]
3. For each word, find which speaker segment it falls into:
   - Use the word's midpoint timestamp
   - Assign the word to whichever speaker segment contains that midpoint
   - Handle edge cases (words spanning segment boundaries) with majority overlap
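A minimal, self-contained implementation of this midpoint-plus-overlap assignment; the field names (`word`, `start`, `end`, `speaker`) are illustrative conventions, not a fixed schema.

```python
def _overlap(w: dict, s: dict) -> float:
    """Length (seconds) of the time overlap between a word and a segment."""
    return max(0.0, min(w["end"], s["end"]) - max(w["start"], s["start"]))


def assign_speakers(words: list, segments: list) -> list:
    """Assign each STT word to a diarization speaker segment.

    words:    [{"word", "start", "end"}, ...]      from STT
    segments: [{"speaker", "start", "end"}, ...]   from diarization
    """
    out = []
    for w in words:
        mid = (w["start"] + w["end"]) / 2
        # Midpoint rule: the segment containing the word's midpoint wins.
        speaker = next((s["speaker"] for s in segments
                        if s["start"] <= mid < s["end"]), None)
        if speaker is None and segments:
            # Edge case (midpoint in a gap or on a boundary): fall back
            # to the segment with the largest overlap, if any overlaps.
            best = max(segments, key=lambda s: _overlap(w, s))
            if _overlap(w, best) > 0:
                speaker = best["speaker"]
        out.append({**w, "speaker": speaker})
    return out
```

Words with no overlapping segment at all keep `speaker = None`, which the UI can render as "unknown speaker" rather than guessing.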

Alignment quality depends on:

  • Word timestamp precision: faster-whisper with wav2vec2 alignment provides the best precision. whisper.cpp timestamps can drift 300-800ms, which can cause mis-attribution at speaker boundaries.
  • Diarization segment precision: pyannote.audio community-1 provides the tightest speaker boundaries.
  • Overlap handling: In conversations where speakers overlap, both timestamps and diarization become less reliable. pyannote.audio 4.0 has specific overlapped speech detection.

4. Final Recommendations

Primary Recommendation: Two-Tier Architecture

Given the "Voice to Notes" requirements (local-first, consumer hardware, word-level timestamps for synchronized playback, speaker identification), I recommend a two-tier architecture:

Tier 1: Core Transcription Engine (C/C++)

Use whisper.cpp as the primary STT engine.

  • No Python dependency for the core app.
  • Runs on all hardware (CPU, NVIDIA GPU, AMD GPU via Vulkan, Intel via OpenVINO).
  • MIT license with no restrictions.
  • Embed directly into your desktop app (Tauri, Qt, Electron with native addon).
  • Use the large-v3-turbo model as the default (best speed/accuracy trade-off for consumer hardware).
  • Offer medium and small models for lower-end hardware.
  • Word-level timestamps are adequate for playback synchronization (300-800ms drift is acceptable when the UI highlights the current phrase rather than individual words).

Tier 2: Enhanced Pipeline (Python Sidecar)

Use faster-whisper + pyannote.audio via a Python sidecar process for users who want speaker diarization and precise word-level alignment.

  • Ship a bundled Python environment (e.g., via PyInstaller or conda-pack).
  • Run the WhisperX-style pipeline: faster-whisper -> wav2vec2 alignment -> pyannote diarization.
  • Communicate with the main app via IPC (stdin/stdout JSON, local socket, or gRPC).
  • This gives the best word-level timestamps and speaker identification.
  • Optional: only install/download when user enables "Speaker Identification" feature.
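One way to sketch the IPC layer is newline-delimited JSON over stdin/stdout. The message shapes below (`cmd`, `id`, `ok`) are illustrative, not a fixed protocol; real handlers would dispatch to faster-whisper and pyannote.

```python
import json
import sys


def handle_request(req: dict) -> dict:
    """Dispatch one request from the host app.

    Only "ping" is wired up here; a real sidecar would add handlers
    such as "transcribe" and "diarize" (hypothetical names).
    """
    if req.get("cmd") == "ping":
        return {"id": req.get("id"), "ok": True, "result": "pong"}
    return {"id": req.get("id"), "ok": False, "error": "unknown command"}


def serve(stdin=sys.stdin, stdout=sys.stdout) -> None:
    """Newline-delimited JSON loop: one request per line, one reply per line.

    Flushing after every reply matters: the host app blocks on the
    sidecar's stdout, and buffered replies would deadlock it.
    """
    for line in stdin:
        line = line.strip()
        if not line:
            continue
        reply = handle_request(json.loads(line))
        stdout.write(json.dumps(reply) + "\n")
        stdout.flush()
```

The same `serve` loop works unchanged over a local socket; stdin/stdout is simply the easiest transport for a child process spawned by Tauri or Electron.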

Model Selection Guide

| User's Hardware | STT Model | Diarization |
| --- | --- | --- |
| No GPU, 8GB RAM | whisper.cpp small (Q5_K_M) | pyannote on CPU (slower but works) |
| No GPU, 16GB RAM | whisper.cpp medium (Q5_K_M) | pyannote on CPU |
| NVIDIA GPU, 8GB+ VRAM | faster-whisper large-v3-turbo (int8) | pyannote on GPU |
| NVIDIA GPU, 4GB VRAM | faster-whisper medium (int8) | pyannote on GPU |
| Any hardware, speed priority | whisper.cpp small or base | Skip diarization |
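The guide can be encoded as a simple default-configuration helper. The thresholds mirror the table above and are starting-point assumptions to tune against real benchmarks, not hard rules.

```python
def pick_models(has_nvidia_gpu: bool, vram_gb: float, ram_gb: float,
                speed_priority: bool = False) -> dict:
    """Map the hardware tiers from the table to a default configuration."""
    if speed_priority:
        # Fastest path: small CPU model, no diarization pass.
        return {"engine": "whisper.cpp", "model": "small",
                "diarization": None}
    if has_nvidia_gpu:
        # GPU users get the faster-whisper sidecar; model size by VRAM.
        model = "large-v3-turbo" if vram_gb >= 8 else "medium"
        return {"engine": "faster-whisper", "model": f"{model} (int8)",
                "diarization": "pyannote (GPU)"}
    # CPU-only: quantized whisper.cpp model sized by system RAM.
    model = "medium" if ram_gb >= 16 else "small"
    return {"engine": "whisper.cpp", "model": f"{model} (Q5_K_M)",
            "diarization": "pyannote (CPU)"}
```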

Optional Cloud Fallback

For users who prefer cloud processing, integrate an optional cloud STT API (OpenAI Whisper API, AssemblyAI, or Deepgram) as a premium feature. This requires minimal code since the output format (words + timestamps + speakers) is the same regardless of backend.

Why Not Other Options?

| Option | Reason to Skip |
| --- | --- |
| Vosk | Accuracy gap too large vs. Whisper. Only consider as a real-time streaming preview (show rough text while recording, then refine with Whisper afterward). |
| Coqui STT | Discontinued. No future. |
| Resemblyzer | Unmaintained, incomplete (no pipeline). |
| NeMo (full) | Too heavy for consumer desktop. NVIDIA-only for practical use. |
| SpeechBrain | Less accurate diarization than pyannote. Larger framework for less benefit. |
Desktop App Shell: Tauri (Rust) or Electron
                         |
        +----------------+----------------+
        |                                 |
  Core STT Engine                  Enhanced Pipeline
  (whisper.cpp, C/C++)            (Python sidecar)
        |                                 |
  - Transcription              - faster-whisper (STT)
  - Basic word timestamps      - wav2vec2 (alignment)
  - No speaker ID              - pyannote.audio (diarization)
                               - Precise word timestamps
                               - Speaker identification

Key Files and Repositories


Sources