# Voice to Notes — Architecture Document

## 1. Overview
Voice to Notes is a desktop application that transcribes audio/video recordings with speaker identification. It runs entirely on the user's computer. Cloud AI providers are optional and only used when explicitly configured by the user.
```
┌────────────────────────────────────────────────────────────────┐
│                       Tauri Application                        │
│                                                                │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │                  Frontend (Svelte + TS)                  │  │
│  │                                                          │  │
│  │ ┌─────────────┐  ┌──────────────┐  ┌───────────────────┐ │  │
│  │ │  Waveform   │  │  Transcript  │  │      AI Chat      │ │  │
│  │ │   Player    │  │    Editor    │  │       Panel       │ │  │
│  │ │ (wavesurfer)│  │   (TipTap)   │  │                   │ │  │
│  │ └─────────────┘  └──────────────┘  └───────────────────┘ │  │
│  │ ┌─────────────┐  ┌──────────────┐  ┌───────────────────┐ │  │
│  │ │   Speaker   │  │    Export    │  │      Project      │ │  │
│  │ │   Manager   │  │    Panel     │  │      Manager      │ │  │
│  │ └─────────────┘  └──────────────┘  └───────────────────┘ │  │
│  └────────────────────────────┬─────────────────────────────┘  │
│                               │ tauri::invoke()                │
│  ┌────────────────────────────┴─────────────────────────────┐  │
│  │                Rust Backend (thin layer)                 │  │
│  │                                                          │  │
│  │ ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐ │  │
│  │ │   Process    │  │   File I/O   │  │      SQLite      │ │  │
│  │ │   Manager    │  │   & Media    │  │  (via rusqlite)  │ │  │
│  │ └──────┬───────┘  └──────────────┘  └──────────────────┘ │  │
│  └────────┼─────────────────────────────────────────────────┘  │
└───────────┼────────────────────────────────────────────────────┘
            │ JSON-line IPC (stdin/stdout)
            │
┌───────────┴────────────────────────────────────────────────────┐
│                     Python Sidecar Process                     │
│                                                                │
│ ┌───────────────┐  ┌──────────────┐  ┌───────────────────────┐ │
│ │  Transcribe   │  │   Diarize    │  │      AI Provider      │ │
│ │   Service     │  │   Service    │  │        Service        │ │
│ │               │  │              │  │                       │ │
│ │ faster-whisper│  │   pyannote   │  │  LiteLLM adapter      │ │
│ │  + wav2vec2   │  │  .audio 4.0  │  │  OpenAI adapter       │ │
│ │               │  │              │  │  Anthropic adapter    │ │
│ │  CPU: auto    │  │  CPU: auto   │  │  llama-server adapter │ │
│ │  GPU: CUDA    │  │  GPU: CUDA   │  │                       │ │
│ └───────────────┘  └──────────────┘  └───────────────────────┘ │
└────────────────────────────────────────────────────────────────┘
```
## 2. Technology Stack
| Layer | Technology | Purpose |
|---|---|---|
| Desktop Shell | Tauri v2 | Window management, OS integration, native packaging |
| Frontend | Svelte + TypeScript | UI components, state management |
| Audio Waveform | wavesurfer.js | Waveform visualization, click-to-seek playback |
| Transcript Editor | TipTap (ProseMirror) | Rich text editing with speaker-colored labels |
| Backend | Rust (thin) | Process management, file I/O, SQLite access, IPC relay |
| Database | SQLite (via rusqlite) | Project data, transcripts, word timestamps, speaker info |
| ML Runtime | Python sidecar | Speech-to-text, diarization, AI provider integration |
| STT Engine | faster-whisper | Transcription with word-level timestamps |
| Timestamp Refinement | wav2vec2 | Precise word-level alignment |
| Speaker Diarization | pyannote.audio 4.0 | Speaker segment detection |
| AI Providers | LiteLLM / direct SDKs | Summarization, Q&A, notes |
| Caption Export | pysubs2 | SRT, WebVTT, ASS subtitle generation |
## 3. CPU / GPU Strategy

All ML components must work on CPU. GPU acceleration is used when available but never required.

### Detection and Selection

```
App Launch
 │
 ├─ Detect hardware (Python: torch.cuda.is_available(), etc.)
 │
 ├─ NVIDIA GPU detected (CUDA)
 │    ├─ VRAM >= 8GB → large-v3-turbo (int8), pyannote on GPU
 │    ├─ VRAM >= 4GB → medium model (int8), pyannote on GPU
 │    └─ VRAM < 4GB  → fall back to CPU
 │
 ├─ No GPU / unsupported GPU
 │    ├─ RAM >= 16GB → medium model on CPU, pyannote on CPU
 │    ├─ RAM >= 8GB  → small model on CPU, pyannote on CPU
 │    └─ RAM < 8GB   → base model on CPU, pyannote on CPU (warn: slow)
 │
 └─ User can override in Settings
```
### Model Recommendations by Hardware
| Hardware | STT Model | Diarization | Expected Speed |
|---|---|---|---|
| NVIDIA GPU, 8GB+ VRAM | large-v3-turbo (int8) | pyannote GPU | ~20x realtime |
| NVIDIA GPU, 4GB VRAM | medium (int8) | pyannote GPU | ~10x realtime |
| CPU only, 16GB RAM | medium (int8_cpu) | pyannote CPU | ~2-4x realtime |
| CPU only, 8GB RAM | small (int8_cpu) | pyannote CPU | ~3-5x realtime |
| CPU only, minimal | base | pyannote CPU | ~5-8x realtime |
Users can always override model selection in settings. The app displays estimated processing time before starting.
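The decision tree above reduces to a small pure function. This sketch (with hypothetical names, loosely what a `hardware/models.py` helper might contain) encodes the table:

```python
def recommend_model(has_cuda: bool, vram_mb: int, ram_mb: int) -> tuple[str, str]:
    """Map detected hardware to (STT model, device), mirroring the table above."""
    if has_cuda and vram_mb >= 8192:
        return "large-v3-turbo", "cuda"
    if has_cuda and vram_mb >= 4096:
        return "medium", "cuda"
    # No usable GPU (or too little VRAM): pick a CPU model by available RAM
    if ram_mb >= 16384:
        return "medium", "cpu"
    if ram_mb >= 8192:
        return "small", "cpu"
    return "base", "cpu"
```

Keeping this as a pure function makes the recommendation trivially testable and easy to override from the Settings UI.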
### CTranslate2 CPU Backends
faster-whisper uses CTranslate2, which supports multiple CPU acceleration backends:
- Intel MKL — Best performance on Intel CPUs
- oneDNN — Good cross-platform alternative
- OpenBLAS — Fallback for any CPU
- Ruy — Lightweight option for ARM
The Python sidecar auto-detects and uses the best available backend.
## 4. Component Architecture

### 4.1 Frontend (Svelte + TypeScript)
```
src/
  lib/
    components/
      WaveformPlayer.svelte    # wavesurfer.js wrapper, playback controls
      TranscriptEditor.svelte  # TipTap editor with speaker labels
      SpeakerManager.svelte    # Assign names/colors to speakers
      ExportPanel.svelte       # Export format selection and options
      AIChatPanel.svelte       # Chat interface for AI Q&A
      ProjectList.svelte       # Project browser/manager
      SettingsPanel.svelte     # Model selection, AI config, preferences
      ProgressOverlay.svelte   # Transcription progress with cancel
    stores/
      project.ts               # Current project state
      transcript.ts            # Segments, words, speakers
      playback.ts              # Audio position, playing state
      ai.ts                    # AI provider config and chat history
    services/
      tauri-bridge.ts          # Typed wrappers around tauri::invoke
      audio-sync.ts            # Sync playback position ↔ transcript highlight
      export.ts                # Trigger export via backend
    types/
      transcript.ts            # Segment, Word, Speaker interfaces
      project.ts               # Project, MediaFile interfaces
  routes/
    +page.svelte               # Main workspace
    +layout.svelte             # App shell with sidebar
```
Key UI interactions:
- Click a word in the transcript → audio seeks to that word's `start_ms`
- Audio plays → transcript auto-scrolls and highlights the current word/segment
- Click a speaker label → open rename dialog; changes propagate to all segments
- Drag to select text → option to re-assign the speaker for the selection
### 4.2 Rust Backend
The Rust layer is intentionally thin. It handles:
- Process Management — Spawn, monitor, and kill the Python sidecar and llama-server
- IPC Relay — Forward messages between frontend and Python process
- File Operations — Read/write project files, manage media
- SQLite — All database operations via rusqlite
- System Info — Detect GPU, RAM, CPU for hardware recommendations
- llama-server Lifecycle — Start/stop bundled llama-server, manage port allocation
```
src-tauri/
  src/
    main.rs                # Tauri app entry point
    commands/
      project.rs           # CRUD for projects
      transcribe.rs        # Start/stop/monitor transcription
      export.rs            # Trigger caption/text export
      ai.rs                # AI provider commands
      settings.rs          # App settings and preferences
      system.rs            # Hardware detection
      llama_server.rs      # llama-server process lifecycle
    db/
      mod.rs               # SQLite connection pool
      schema.rs            # Table definitions and migrations
      queries.rs           # Prepared queries
    sidecar/
      mod.rs               # Python process lifecycle
      ipc.rs               # JSON-line protocol handler
      messages.rs          # IPC message types (serde)
    state.rs               # App state (db handle, sidecar handle)
```
### 4.3 Python Sidecar
The Python process runs independently and communicates via JSON-line protocol over stdin/stdout.
```
python/
  voice_to_notes/
    __init__.py
    main.py                    # Entry point, IPC message loop
    ipc/
      __init__.py
      protocol.py              # JSON-line read/write, message types
      handlers.py              # Route messages to services
    services/
      transcribe.py            # faster-whisper + wav2vec2 pipeline
      diarize.py               # pyannote.audio diarization
      pipeline.py              # Combined transcribe + diarize workflow
      ai_provider.py           # AI provider abstraction
      export.py                # pysubs2 caption export, text export
    providers/
      __init__.py
      base.py                  # Abstract AI provider interface
      litellm_provider.py      # LiteLLM (multi-provider gateway)
      openai_provider.py       # Direct OpenAI SDK
      anthropic_provider.py    # Direct Anthropic SDK
      local_provider.py        # Bundled llama-server (OpenAI-compatible API)
    hardware/
      __init__.py
      detect.py                # GPU/CPU detection, VRAM estimation
      models.py                # Model selection logic
    utils/
      audio.py                 # Audio format conversion (ffmpeg wrapper)
      progress.py              # Progress reporting via IPC
  tests/
    test_transcribe.py
    test_diarize.py
    test_pipeline.py
    test_providers.py
    test_export.py
  pyproject.toml               # Dependencies and build config
```
## 5. IPC Protocol
The Rust backend and Python sidecar communicate via newline-delimited JSON on stdin/stdout. Each message has a type, an optional request ID (for correlating responses), and a payload.
### Message Format

```json
{"id": "req-001", "type": "transcribe.start", "payload": {"file": "/path/to/audio.wav", "model": "large-v3-turbo", "device": "cuda", "language": "auto"}}
```
### Message Types

**Requests (Rust → Python):**

| Type | Payload | Description |
|---|---|---|
| `transcribe.start` | `{file, model, device, language}` | Start transcription |
| `transcribe.cancel` | `{id}` | Cancel running transcription |
| `diarize.start` | `{file, num_speakers?}` | Start speaker diarization |
| `pipeline.start` | `{file, model, device, language, num_speakers?}` | Full transcribe + diarize |
| `ai.chat` | `{provider, model, messages, transcript_context}` | Send AI chat message |
| `ai.summarize` | `{provider, model, transcript, style}` | Generate summary/notes |
| `export.captions` | `{segments, format, options}` | Export caption file |
| `export.text` | `{segments, speakers, format, options}` | Export text document |
| `hardware.detect` | `{}` | Detect available hardware |
**Responses (Python → Rust):**

| Type | Payload | Description |
|---|---|---|
| `progress` | `{id, percent, stage, message}` | Progress update |
| `transcribe.result` | `{segments: [{text, start_ms, end_ms, words: [...]}]}` | Transcription complete |
| `diarize.result` | `{speakers: [{id, segments: [{start_ms, end_ms}]}]}` | Diarization complete |
| `pipeline.result` | `{segments, speakers, words}` | Full pipeline result |
| `ai.response` | `{content, tokens_used, provider}` | AI response |
| `ai.stream` | `{id, delta, done}` | Streaming AI token |
| `export.done` | `{path}` | Export file written |
| `error` | `{id, code, message}` | Error response |
| `hardware.info` | `{gpu, vram_mb, ram_mb, cpu_cores, recommended_model}` | Hardware info |
### Progress Reporting

Long-running operations (transcription, diarization) send periodic progress messages:

```json
{"id": "req-001", "type": "progress", "payload": {"percent": 45, "stage": "transcribing", "message": "Processing segment 23/51..."}}
```

Stages: `loading_model` → `preprocessing` → `transcribing` → `aligning` → `diarizing` → `merging` → `done`
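A minimal sketch of the sidecar side of this protocol: read newline-delimited JSON requests, dispatch by `type`, and write one JSON reply per line. Only a stubbed `hardware.detect` handler is shown, and its payload values are invented placeholders:

```python
import json
import sys

def handle(msg: dict) -> dict:
    """Dispatch one request to a handler; real code would route to services."""
    if msg["type"] == "hardware.detect":
        payload = {"gpu": None, "vram_mb": 0, "ram_mb": 8192,
                   "cpu_cores": 4, "recommended_model": "small"}
        return {"id": msg.get("id"), "type": "hardware.info", "payload": payload}
    return {"id": msg.get("id"), "type": "error",
            "payload": {"code": "unknown_type", "message": msg["type"]}}

def serve(stdin=sys.stdin, stdout=sys.stdout) -> None:
    """JSON-line loop: one request per line in, one reply per line out."""
    for line in stdin:
        line = line.strip()
        if not line:
            continue
        reply = handle(json.loads(line))
        stdout.write(json.dumps(reply) + "\n")
        stdout.flush()  # the Rust side reads line-by-line, so flush each reply
```

Taking the streams as parameters keeps the loop testable without spawning a real process.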
## 6. Database Schema

SQLite database stored per-project at `{project_dir}/project.db`.
```sql
-- Projects metadata
CREATE TABLE projects (
  id TEXT PRIMARY KEY,
  name TEXT NOT NULL,
  created_at TEXT NOT NULL,
  updated_at TEXT NOT NULL,
  settings TEXT,                -- JSON: project-specific overrides
  status TEXT DEFAULT 'active'
);

-- Source media files
CREATE TABLE media_files (
  id TEXT PRIMARY KEY,
  project_id TEXT NOT NULL REFERENCES projects(id),
  file_path TEXT NOT NULL,      -- relative to project dir
  file_hash TEXT,               -- SHA-256 for integrity
  duration_ms INTEGER,
  sample_rate INTEGER,
  channels INTEGER,
  format TEXT,
  file_size INTEGER,
  created_at TEXT NOT NULL
);

-- Speakers identified in audio
CREATE TABLE speakers (
  id TEXT PRIMARY KEY,
  project_id TEXT NOT NULL REFERENCES projects(id),
  label TEXT NOT NULL,          -- auto-assigned: "Speaker 1"
  display_name TEXT,            -- user-assigned: "Sarah Chen"
  color TEXT,                   -- hex color for UI
  metadata TEXT                 -- JSON: voice embedding ref, notes
);

-- Transcript segments (one per speaker turn)
CREATE TABLE segments (
  id TEXT PRIMARY KEY,
  project_id TEXT NOT NULL REFERENCES projects(id),
  media_file_id TEXT NOT NULL REFERENCES media_files(id),
  speaker_id TEXT REFERENCES speakers(id),
  start_ms INTEGER NOT NULL,
  end_ms INTEGER NOT NULL,
  text TEXT NOT NULL,
  original_text TEXT,           -- pre-edit text preserved
  confidence REAL,
  is_edited INTEGER DEFAULT 0,
  edited_at TEXT,
  segment_index INTEGER NOT NULL
);

-- Word-level timestamps (for click-to-seek and captions)
CREATE TABLE words (
  id TEXT PRIMARY KEY,
  segment_id TEXT NOT NULL REFERENCES segments(id),
  word TEXT NOT NULL,
  start_ms INTEGER NOT NULL,
  end_ms INTEGER NOT NULL,
  confidence REAL,
  word_index INTEGER NOT NULL
);

-- AI-generated outputs
CREATE TABLE ai_outputs (
  id TEXT PRIMARY KEY,
  project_id TEXT NOT NULL REFERENCES projects(id),
  output_type TEXT NOT NULL,    -- summary, action_items, notes, qa
  prompt TEXT,
  content TEXT NOT NULL,
  provider TEXT,
  created_at TEXT NOT NULL,
  metadata TEXT                 -- JSON: tokens, latency
);

-- User annotations and bookmarks
CREATE TABLE annotations (
  id TEXT PRIMARY KEY,
  project_id TEXT NOT NULL REFERENCES projects(id),
  start_ms INTEGER NOT NULL,
  end_ms INTEGER,
  text TEXT NOT NULL,
  type TEXT DEFAULT 'bookmark'
);

-- Performance indexes
CREATE INDEX idx_segments_project ON segments(project_id, segment_index);
CREATE INDEX idx_segments_time ON segments(media_file_id, start_ms);
CREATE INDEX idx_words_segment ON words(segment_id, word_index);
CREATE INDEX idx_words_time ON words(start_ms, end_ms);
CREATE INDEX idx_ai_outputs_project ON ai_outputs(project_id, output_type);
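The `idx_words_time` index exists to answer "which word is under the playhead?" quickly. A runnable sketch against a trimmed-down version of the schema (most columns omitted for brevity; the data is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE words (
  id TEXT PRIMARY KEY,
  segment_id TEXT NOT NULL,
  word TEXT NOT NULL,
  start_ms INTEGER NOT NULL,
  end_ms INTEGER NOT NULL,
  word_index INTEGER NOT NULL
);
CREATE INDEX idx_words_time ON words(start_ms, end_ms);
""")
conn.executemany("INSERT INTO words VALUES (?, 's1', ?, ?, ?, ?)", [
    ("w1", "hello", 0, 600, 0),
    ("w2", "world", 700, 1500, 1),
])

def word_under_playhead(conn, position_ms: int):
    """Find the word whose [start_ms, end_ms) interval covers a position."""
    return conn.execute(
        "SELECT word, start_ms FROM words "
        "WHERE start_ms <= ? AND end_ms > ? "
        "ORDER BY start_ms LIMIT 1",
        (position_ms, position_ms),
    ).fetchone()
```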
## 7. AI Provider Architecture

### Provider Interface
```python
from abc import ABC, abstractmethod
from collections.abc import AsyncIterator

class AIProvider(ABC):
    @abstractmethod
    async def chat(self, messages: list[dict], config: dict) -> str: ...

    @abstractmethod
    async def stream(self, messages: list[dict], config: dict) -> AsyncIterator[str]: ...
```
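A toy implementation against this interface, purely for illustration: `EchoProvider` is an invented stand-in (a real adapter would call LiteLLM, the OpenAI SDK, or the bundled llama-server endpoint). The interface is repeated so the sketch runs standalone:

```python
import asyncio
from abc import ABC, abstractmethod
from collections.abc import AsyncIterator

class AIProvider(ABC):  # interface from above, repeated for self-containment
    @abstractmethod
    async def chat(self, messages: list[dict], config: dict) -> str: ...

    @abstractmethod
    async def stream(self, messages: list[dict], config: dict) -> AsyncIterator[str]: ...

class EchoProvider(AIProvider):
    """Replies with the last user message, token by token."""

    async def chat(self, messages, config):
        # chat() is just stream() fully drained -- a pattern real adapters can share
        return "".join([t async for t in self.stream(messages, config)])

    async def stream(self, messages, config):
        for token in messages[-1]["content"].split():
            yield token + " "
```

Implementing `chat` in terms of `stream` keeps the two methods consistent and leaves streaming as the single code path to test.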
### Supported Providers

| Provider | Package / Binary | Use Case |
|---|---|---|
| llama-server (bundled) | llama.cpp binary | Default local AI — bundled with the app, no install needed; OpenAI-compatible API on localhost |
| LiteLLM | `litellm` | Gateway to 100+ providers via a unified API |
| OpenAI | `openai` | Direct OpenAI API (GPT-4o, etc.) |
| Anthropic | `anthropic` | Direct Anthropic API (Claude) |
### Local AI via llama-server (llama.cpp)
The app bundles llama-server from the llama.cpp project (MIT license). This is the default AI provider — it runs entirely on the user's machine with no internet connection or separate install required.
**How it works:**
- The Rust backend spawns `llama-server` as a managed subprocess on app launch (or on first AI use)
- llama-server exposes an OpenAI-compatible REST API on `localhost:{dynamic_port}`
- The Python sidecar talks to it through the same OpenAI SDK interface as the cloud providers
- On app exit, the Rust backend cleanly shuts down the llama-server process
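The spawn/shutdown pattern can be sketched as follows. This is illustrative Python (the real implementation is Rust); `free_port` and `ManagedServer` are invented names, and the llama-server command line in the comment is an assumption about how the binary would be invoked:

```python
import socket
import subprocess

def free_port() -> int:
    """Ask the OS for an unused TCP port to hand to llama-server."""
    with socket.socket() as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]

class ManagedServer:
    """Managed-subprocess lifecycle, shown with an arbitrary argv.
    The real backend would launch something like:
    ['llama-server', '-m', model_path, '--port', str(port)]"""

    def __init__(self, argv: list[str]):
        self.proc = subprocess.Popen(argv)

    def shutdown(self, timeout: float = 5.0) -> int:
        self.proc.terminate()       # polite termination first
        try:
            return self.proc.wait(timeout)
        except subprocess.TimeoutExpired:
            self.proc.kill()        # hard kill if the server hangs
            return self.proc.wait()
```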
**Model management:**
- Models are stored in `~/.voicetonotes/models/` (GGUF format)
- First-run setup downloads a recommended small model (e.g., Phi-3-mini, Llama-3-8B Q4)
- Users can download additional models or point to their own GGUF files
- Model selection lives in the Settings UI, with size/quality tradeoffs shown
**Hardware utilization:**
- CPU: Works on any machine, uses all available cores
- NVIDIA GPU: CUDA acceleration when available
- The same CPU/GPU auto-detection used for Whisper applies here
### Context Window Strategy
| Transcript Length | Strategy |
|---|---|
| < 100K tokens | Send full transcript directly |
| 100K - 200K tokens | Use Claude (200K context) or chunk for smaller models |
| > 200K tokens | Map-reduce: summarize chunks, then combine |
| Q&A mode | Semantic search over chunks, send top-K relevant to model |
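The map-reduce path needs a chunker that packs whole segments under a token budget. A sketch using a crude chars/4 token estimate (a real implementation would use the target provider's tokenizer):

```python
def chunk_transcript(segments: list[str], max_tokens: int,
                     est=lambda s: max(1, len(s) // 4)) -> list[list[str]]:
    """Greedily pack whole segments into chunks under max_tokens.
    `est` is a rough ~4-chars-per-token estimate, deliberately simple."""
    chunks: list[list[str]] = []
    current: list[str] = []
    used = 0
    for seg in segments:
        cost = est(seg)
        if current and used + cost > max_tokens:
            chunks.append(current)      # flush the full chunk
            current, used = [], 0
        current.append(seg)             # segments are never split mid-way
        used += cost
    if current:
        chunks.append(current)
    return chunks
```

Keeping segments intact means each chunk stays coherent speaker turns, which matters for summarization quality.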
### Configuration

Users configure AI providers in Settings. API keys for cloud providers are stored in the OS keychain (libsecret on Linux, Credential Manager on Windows). The bundled llama-server requires no keys or internet access.
```json
{
  "ai": {
    "default_provider": "local",
    "providers": {
      "local": { "model": "phi-3-mini-Q4_K_M.gguf", "gpu_layers": "auto" },
      "openai": { "model": "gpt-4o" },
      "anthropic": { "model": "claude-sonnet-4-20250514" },
      "litellm": { "model": "gpt-4o" }
    }
  }
}
```
## 8. Export Formats

### Caption Formats

| Format | Speaker Support | Library |
|---|---|---|
| SRT | `[Speaker]:` prefix convention | pysubs2 |
| WebVTT | Native `<v Speaker>` voice tags | pysubs2 |
| ASS/SSA | Named styles per speaker with colors | pysubs2 |
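The app generates captions with pysubs2; purely to illustrate the `[Speaker]:` convention and SRT timing format, here is a hand-rolled sketch of the mapping from segments to SRT:

```python
def ms_to_srt(ms: int) -> str:
    """SRT timestamps use hh:mm:ss,mmm with a comma before milliseconds."""
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, milli = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{milli:03}"

def to_srt(segments: list[dict]) -> str:
    """One numbered cue per segment, speaker prefixed per the table above."""
    out = []
    for i, seg in enumerate(segments, 1):
        out.append(str(i))
        out.append(f"{ms_to_srt(seg['start_ms'])} --> {ms_to_srt(seg['end_ms'])}")
        out.append(f"[{seg['speaker']}]: {seg['text']}")
        out.append("")  # blank line terminates each cue
    return "\n".join(out)
```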
### Text Formats
| Format | Implementation |
|---|---|
| Plain text (.txt) | Custom formatter |
| Markdown (.md) | Custom formatter (bold speaker names) |
| DOCX | python-docx |
### Text Output Example

```
[00:00:03] Sarah Chen:
Hello everyone, welcome to the meeting. I wanted to start by
discussing the Q3 results before we move on to planning.

[00:00:15] Michael Torres:
Thanks Sarah. The numbers look strong this quarter.
```
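A formatter producing this layout might look like the following sketch, where speaker names resolve through a `{speaker_id: display_name}` map built from the speakers table:

```python
def ms_to_stamp(ms: int) -> str:
    """Render a [hh:mm:ss] timestamp as used in the plain-text export."""
    s = ms // 1000
    return f"[{s // 3600:02}:{s % 3600 // 60:02}:{s % 60:02}]"

def to_text(segments: list[dict], speakers: dict[str, str]) -> str:
    """One block per segment: stamped speaker line, then the segment text."""
    blocks = []
    for seg in segments:
        name = speakers.get(seg["speaker_id"], seg["speaker_id"])
        blocks.append(f"{ms_to_stamp(seg['start_ms'])} {name}:\n{seg['text']}")
    return "\n\n".join(blocks)
```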
## 9. Project File Structure
```
~/VoiceToNotes/
  config.json              # Global app settings
  projects/
    {project-uuid}/
      project.db           # SQLite database
      media/
        recording.m4a      # Original media file
      exports/
        transcript.srt
        transcript.vtt
        notes.md
```
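Creating this skeleton for a new project can be sketched as follows (`create_project` is a hypothetical helper; the `config.json` contents are a placeholder, and `project.db` would be created by the DB layer, which is not shown):

```python
import json
import uuid
from pathlib import Path

def create_project(root: Path) -> Path:
    """Create projects/{uuid}/ with media/ and exports/ under `root`."""
    pdir = root / "projects" / str(uuid.uuid4())
    for sub in ("media", "exports"):
        (pdir / sub).mkdir(parents=True)
    cfg = root / "config.json"
    if not cfg.exists():
        cfg.write_text(json.dumps({"version": 1}))  # placeholder global settings
    return pdir
```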
## 10. Implementation Phases

### Phase 1 — Foundation
Set up Tauri + Svelte project scaffold, Python sidecar with IPC protocol, SQLite schema, and basic project management UI.
Deliverables:
- Tauri app launches with a working Svelte frontend
- Python sidecar starts, communicates via JSON-line IPC
- SQLite database created per-project
- Create/open/list projects in the UI
### Phase 2 — Core Transcription
Implement the transcription pipeline with audio playback and synchronized transcript display.
Deliverables:
- Import audio/video files (ffmpeg conversion to WAV)
- Run faster-whisper transcription with progress reporting
- Display transcript with word-level timestamps
- wavesurfer.js audio player with click-to-seek from transcript
- Auto-scroll transcript during playback
- Edit transcript text (corrections persist to DB)
### Phase 3 — Speaker Diarization
Add speaker identification and management.
Deliverables:
- pyannote.audio diarization integrated into pipeline
- Speaker segments merged with word timestamps
- Speaker labels displayed in transcript with colors
- Rename speakers (persists across all segments)
- Re-assign speaker for selected text segments
- Hardware detection and model auto-selection (CPU/GPU)
### Phase 4 — Export
Implement all export formats.
Deliverables:
- SRT, WebVTT, ASS caption export with speaker labels
- Plain text and Markdown export with speaker names
- Export options panel in UI
### Phase 5 — AI Integration
Add AI provider support for Q&A and summarization.
Deliverables:
- Provider configuration UI with API key management
- Bundled llama-server for local AI (default, no internet required)
- Model download manager for local GGUF models
- OpenAI and Anthropic direct SDK support
- LiteLLM gateway support
- Chat panel for asking questions about the transcript
- Summary/notes generation with multiple styles
- Context window management for long transcripts
### Phase 6 — Polish and Packaging
Production readiness.
Deliverables:
- Linux packaging (.deb, .AppImage)
- Windows packaging (.msi, .exe installer)
- Bundled Python environment (no user Python install required)
- Model download manager (first-run setup)
- Settings panel (model selection, hardware config, AI providers)
- Error handling, logging, crash recovery
## 11. Agent Work Breakdown
For parallel development, the codebase splits into these independent workstreams:
| Agent | Scope | Dependencies |
|---|---|---|
| Agent 1: Tauri + Frontend Shell | Tauri project setup, Svelte scaffold, routing, project manager UI, settings UI | None |
| Agent 2: Python Sidecar + IPC | Python project setup, IPC protocol, message loop, handler routing | None |
| Agent 3: Database Layer | SQLite schema, Rust query layer, migration system | None |
| Agent 4: Transcription Pipeline | faster-whisper integration, wav2vec2 alignment, hardware detection, model management | Agent 2 (IPC) |
| Agent 5: Diarization Pipeline | pyannote.audio integration, speaker-word alignment, combined pipeline | Agent 4 (transcription) |
| Agent 6: Audio Player + Transcript UI | wavesurfer.js integration, TipTap transcript editor, playback-transcript sync | Agent 1 (shell), Agent 3 (DB) |
| Agent 7: Export System | pysubs2 caption export, text formatters, export UI | Agent 2 (IPC), Agent 3 (DB) |
| Agent 8: AI Provider System | Provider abstraction, bundled llama-server, LiteLLM/OpenAI/Anthropic adapters, chat UI | Agent 2 (IPC), Agent 1 (shell) |
Agents 1, 2, and 3 can start immediately in parallel. Agents 4-8 follow once their dependencies are in place.