# Voice to Notes — Architecture Document

## 1. Overview

Voice to Notes is a desktop application that transcribes audio/video recordings with speaker identification. It runs entirely on the user's computer. Cloud AI providers are optional and only used when explicitly configured by the user.

```
┌─────────────────────────────────────────────────────────────────┐
│                        Tauri Application                        │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                  Frontend (Svelte + TS)                   │  │
│  │                                                           │  │
│  │  ┌─────────────┐  ┌──────────────┐  ┌───────────────────┐ │  │
│  │  │  Waveform   │  │  Transcript  │  │     AI Chat       │ │  │
│  │  │   Player    │  │    Editor    │  │      Panel        │ │  │
│  │  │ (wavesurfer)│  │   (TipTap)   │  │                   │ │  │
│  │  └─────────────┘  └──────────────┘  └───────────────────┘ │  │
│  │  ┌─────────────┐  ┌──────────────┐  ┌───────────────────┐ │  │
│  │  │   Speaker   │  │    Export    │  │     Project       │ │  │
│  │  │   Manager   │  │    Panel     │  │     Manager       │ │  │
│  │  └─────────────┘  └──────────────┘  └───────────────────┘ │  │
│  └──────────────────────────┬────────────────────────────────┘  │
│                             │ tauri::invoke()                   │
│  ┌──────────────────────────┴────────────────────────────────┐  │
│  │                 Rust Backend (thin layer)                 │  │
│  │                                                           │  │
│  │  ┌──────────────┐  ┌──────────────┐  ┌───────────────────┐│  │
│  │  │   Process    │  │   File I/O   │  │      SQLite       ││  │
│  │  │   Manager    │  │   & Media    │  │  (via rusqlite)   ││  │
│  │  └──────┬───────┘  └──────────────┘  └───────────────────┘│  │
│  └─────────┼─────────────────────────────────────────────────┘  │
└────────────┼────────────────────────────────────────────────────┘
             │ JSON-line IPC (stdin/stdout)
┌────────────┴────────────────────────────────────────────────────┐
│                      Python Sidecar Process                     │
│                                                                 │
│  ┌──────────────┐  ┌──────────────┐  ┌───────────────────────┐  │
│  │  Transcribe  │  │   Diarize    │  │      AI Provider      │  │
│  │   Service    │  │   Service    │  │        Service        │  │
│  │              │  │              │  │                       │  │
│  │faster-whisper│  │   pyannote   │  │ ┌───────────────────┐ │  │
│  │ + wav2vec2   │  │  .audio 4.0  │  │ │ LiteLLM adapter   │ │  │
│  │              │  │              │  │ │ OpenAI adapter    │ │  │
│  │  CPU: auto   │  │  CPU: auto   │  │ │ Anthropic adapter │ │  │
│  │  GPU: CUDA   │  │  GPU: CUDA   │  │ │ Ollama adapter    │ │  │
│  └──────────────┘  └──────────────┘  │ └───────────────────┘ │  │
│                                      └───────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
```

---

## 2. Technology Stack

| Layer | Technology | Purpose |
|-------|-----------|---------|
| **Desktop Shell** | Tauri v2 | Window management, OS integration, native packaging |
| **Frontend** | Svelte + TypeScript | UI components, state management |
| **Audio Waveform** | wavesurfer.js | Waveform visualization, click-to-seek playback |
| **Transcript Editor** | TipTap (ProseMirror) | Rich text editing with speaker-colored labels |
| **Backend** | Rust (thin) | Process management, file I/O, SQLite access, IPC relay |
| **Database** | SQLite (via rusqlite) | Project data, transcripts, word timestamps, speaker info |
| **ML Runtime** | Python sidecar | Speech-to-text, diarization, AI provider integration |
| **STT Engine** | faster-whisper | Transcription with word-level timestamps |
| **Timestamp Refinement** | wav2vec2 | Precise word-level alignment |
| **Speaker Diarization** | pyannote.audio 4.0 | Speaker segment detection |
| **AI Providers** | LiteLLM / direct SDKs | Summarization, Q&A, notes |
| **Caption Export** | pysubs2 | SRT, WebVTT, ASS subtitle generation |

---

## 3. CPU / GPU Strategy

All ML components must work on CPU. GPU acceleration is used when available but never required.

### Detection and Selection

```
App Launch
│
├─ Detect hardware (Python: torch.cuda.is_available(), etc.)
```
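Before branching, the app needs the raw numbers from the probe above. This is a minimal illustrative sketch, not the shipped `hardware/detect.py`: `torch` is treated as optional, RAM is read via POSIX `sysconf` (Windows would need `ctypes` or `psutil` instead), and the threshold values mirror the decision tree.

```python
import os

def detect_hardware() -> dict:
    """Probe GPU/RAM/CPU so the app can recommend a model tier.
    Field names follow the `hardware.info` IPC payload; the
    recommendation thresholds mirror the decision tree above."""
    try:
        # POSIX-only RAM probe; a real implementation would branch per-OS.
        ram_mb = os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE") // (1024 * 1024)
    except (AttributeError, ValueError, OSError):
        ram_mb = 0
    info = {"gpu": None, "vram_mb": 0, "ram_mb": ram_mb,
            "cpu_cores": os.cpu_count() or 1}
    try:
        import torch  # only present when the ML stack is installed
        if torch.cuda.is_available():
            props = torch.cuda.get_device_properties(0)
            info["gpu"] = props.name
            info["vram_mb"] = props.total_memory // (1024 * 1024)
    except ImportError:
        pass  # no torch -> CPU-only recommendation
    if info["vram_mb"] >= 8192:
        info["recommended_model"] = "large-v3-turbo"
    elif info["vram_mb"] >= 4096:
        info["recommended_model"] = "medium"
    elif ram_mb >= 16384:          # GPU absent or under 4GB: fall back to CPU tiers
        info["recommended_model"] = "medium"
    elif ram_mb >= 8192:
        info["recommended_model"] = "small"
    else:
        info["recommended_model"] = "base"
    return info
```

The branches below then map these numbers onto a model tier, which the user can always override.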
```
│
├─ NVIDIA GPU detected (CUDA)
│  ├─ VRAM >= 8GB  → large-v3-turbo (int8), pyannote on GPU
│  ├─ VRAM >= 4GB  → medium model (int8), pyannote on GPU
│  └─ VRAM <  4GB  → fall back to CPU
│
├─ No GPU / unsupported GPU
│  ├─ RAM >= 16GB  → medium model on CPU, pyannote on CPU
│  ├─ RAM >= 8GB   → small model on CPU, pyannote on CPU
│  └─ RAM <  8GB   → base model on CPU, pyannote on CPU (warn: slow)
│
└─ User can override in Settings
```

### Model Recommendations by Hardware

| Hardware | STT Model | Diarization | Expected Speed |
|----------|-----------|-------------|----------------|
| NVIDIA GPU, 8GB+ VRAM | large-v3-turbo (int8) | pyannote GPU | ~20x realtime |
| NVIDIA GPU, 4GB VRAM | medium (int8) | pyannote GPU | ~10x realtime |
| CPU only, 16GB RAM | medium (int8_cpu) | pyannote CPU | ~2-4x realtime |
| CPU only, 8GB RAM | small (int8_cpu) | pyannote CPU | ~3-5x realtime |
| CPU only, minimal | base | pyannote CPU | ~5-8x realtime |

Users can always override model selection in settings. The app displays estimated processing time before starting.

### CTranslate2 CPU Backends

faster-whisper uses CTranslate2, which supports multiple CPU acceleration backends:

- **Intel MKL** — Best performance on Intel CPUs
- **oneDNN** — Good cross-platform alternative
- **OpenBLAS** — Fallback for any CPU
- **Ruy** — Lightweight option for ARM

The Python sidecar auto-detects and uses the best available backend.

---

## 4. Component Architecture

### 4.1 Frontend (Svelte + TypeScript)

```
src/
  lib/
    components/
      WaveformPlayer.svelte    # wavesurfer.js wrapper, playback controls
      TranscriptEditor.svelte  # TipTap editor with speaker labels
      SpeakerManager.svelte    # Assign names/colors to speakers
      ExportPanel.svelte       # Export format selection and options
      AIChatPanel.svelte       # Chat interface for AI Q&A
      ProjectList.svelte       # Project browser/manager
      SettingsPanel.svelte     # Model selection, AI config, preferences
      ProgressOverlay.svelte   # Transcription progress with cancel
    stores/
      project.ts               # Current project state
      transcript.ts            # Segments, words, speakers
      playback.ts              # Audio position, playing state
      ai.ts                    # AI provider config and chat history
    services/
      tauri-bridge.ts          # Typed wrappers around tauri::invoke
      audio-sync.ts            # Sync playback position ↔ transcript highlight
      export.ts                # Trigger export via backend
    types/
      transcript.ts            # Segment, Word, Speaker interfaces
      project.ts               # Project, MediaFile interfaces
  routes/
    +page.svelte               # Main workspace
    +layout.svelte             # App shell with sidebar
```

**Key UI interactions:**

- Click a word in the transcript → audio seeks to that word's `start_ms`
- Audio plays → transcript auto-scrolls and highlights current word/segment
- Click speaker label → open rename dialog, changes propagate to all segments
- Drag to select text → option to re-assign speaker for selection

### 4.2 Rust Backend

The Rust layer is intentionally thin. It handles:

1. **Process Management** — Spawn, monitor, and kill the Python sidecar and llama-server
2. **IPC Relay** — Forward messages between frontend and Python process
3. **File Operations** — Read/write project files, manage media
4. **SQLite** — All database operations via rusqlite
5. **System Info** — Detect GPU, RAM, CPU for hardware recommendations
6. **llama-server Lifecycle** — Start/stop bundled llama-server, manage port allocation

```
src-tauri/
  src/
    main.rs              # Tauri app entry point
    commands/
      project.rs         # CRUD for projects
      transcribe.rs      # Start/stop/monitor transcription
      export.rs          # Trigger caption/text export
      ai.rs              # AI provider commands
      settings.rs        # App settings and preferences
      system.rs          # Hardware detection
      llama_server.rs    # llama-server process lifecycle
    db/
      mod.rs             # SQLite connection pool
      schema.rs          # Table definitions and migrations
      queries.rs         # Prepared queries
    sidecar/
      mod.rs             # Python process lifecycle
      ipc.rs             # JSON-line protocol handler
      messages.rs        # IPC message types (serde)
    state.rs             # App state (db handle, sidecar handle)
```

### 4.3 Python Sidecar

The Python process runs independently and communicates via a JSON-line protocol over stdin/stdout.

```
python/
  voice_to_notes/
    __init__.py
    main.py                  # Entry point, IPC message loop
    ipc/
      __init__.py
      protocol.py            # JSON-line read/write, message types
      handlers.py            # Route messages to services
    services/
      transcribe.py          # faster-whisper + wav2vec2 pipeline
      diarize.py             # pyannote.audio diarization
      pipeline.py            # Combined transcribe + diarize workflow
      ai_provider.py         # AI provider abstraction
      export.py              # pysubs2 caption export, text export
    providers/
      __init__.py
      base.py                # Abstract AI provider interface
      litellm_provider.py    # LiteLLM (multi-provider gateway)
      openai_provider.py     # Direct OpenAI SDK
      anthropic_provider.py  # Direct Anthropic SDK
      local_provider.py      # Bundled llama-server (OpenAI-compatible API)
    hardware/
      __init__.py
      detect.py              # GPU/CPU detection, VRAM estimation
      models.py              # Model selection logic
    utils/
      audio.py               # Audio format conversion (ffmpeg wrapper)
      progress.py            # Progress reporting via IPC
  tests/
    test_transcribe.py
    test_diarize.py
    test_pipeline.py
    test_providers.py
    test_export.py
  pyproject.toml             # Dependencies and build config
```

---

## 5. IPC Protocol

The Rust backend and Python sidecar communicate via newline-delimited JSON on stdin/stdout.
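A minimal sketch of the sidecar's side of this loop — illustrative only, the real `protocol.py` may differ. `dispatch` and `serve` are hypothetical names; message shapes follow the request/response tables in this section.

```python
import json
import sys

def dispatch(raw_line: str, handlers: dict) -> str:
    """Decode one JSON-line request, route it by type, encode the reply."""
    msg = json.loads(raw_line)
    handler = handlers.get(msg["type"])
    if handler is None:
        reply = {"id": msg.get("id"), "type": "error",
                 "payload": {"code": "unknown_type", "message": msg["type"]}}
    else:
        reply = handler(msg.get("id"), msg.get("payload", {}))
    return json.dumps(reply)

def serve(handlers: dict) -> None:
    """Blocking loop: one request per stdin line, one reply per stdout line."""
    for line in sys.stdin:
        if line.strip():
            sys.stdout.write(dispatch(line, handlers) + "\n")
            sys.stdout.flush()  # flush so the Rust side sees replies immediately
```

Keeping `dispatch` separate from the stdin loop makes the routing logic unit-testable without spawning a process.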
Each message has a type, an optional request ID (for correlating responses), and a payload.

### Message Format

```json
{"id": "req-001", "type": "transcribe.start", "payload": {"file": "/path/to/audio.wav", "model": "large-v3-turbo", "device": "cuda", "language": "auto"}}
```

### Message Types

**Requests (Rust → Python):**

| Type | Payload | Description |
|------|---------|-------------|
| `transcribe.start` | `{file, model, device, language}` | Start transcription |
| `transcribe.cancel` | `{id}` | Cancel running transcription |
| `diarize.start` | `{file, num_speakers?}` | Start speaker diarization |
| `pipeline.start` | `{file, model, device, language, num_speakers?}` | Full transcribe + diarize |
| `ai.chat` | `{provider, model, messages, transcript_context}` | Send AI chat message |
| `ai.summarize` | `{provider, model, transcript, style}` | Generate summary/notes |
| `export.captions` | `{segments, format, options}` | Export caption file |
| `export.text` | `{segments, speakers, format, options}` | Export text document |
| `hardware.detect` | `{}` | Detect available hardware |

**Responses (Python → Rust):**

| Type | Payload | Description |
|------|---------|-------------|
| `progress` | `{id, percent, stage, message}` | Progress update |
| `transcribe.result` | `{segments: [{text, start_ms, end_ms, words: [...]}]}` | Transcription complete |
| `diarize.result` | `{speakers: [{id, segments: [{start_ms, end_ms}]}]}` | Diarization complete |
| `pipeline.result` | `{segments, speakers, words}` | Full pipeline result |
| `ai.response` | `{content, tokens_used, provider}` | AI response |
| `ai.stream` | `{id, delta, done}` | Streaming AI token |
| `export.done` | `{path}` | Export file written |
| `error` | `{id, code, message}` | Error response |
| `hardware.info` | `{gpu, vram_mb, ram_mb, cpu_cores, recommended_model}` | Hardware info |

### Progress Reporting

Long-running operations (transcription, diarization) send periodic progress messages:

```json
{"id": "req-001", "type": "progress", "payload": {"percent": 45, "stage": "transcribing", "message": "Processing segment 23/51..."}}
```

Stages: `loading_model` → `preprocessing` → `transcribing` → `aligning` → `diarizing` → `merging` → `done`

---

## 6. Database Schema

SQLite database stored per-project at `{project_dir}/project.db`.

```sql
-- Projects metadata
CREATE TABLE projects (
  id TEXT PRIMARY KEY,
  name TEXT NOT NULL,
  created_at TEXT NOT NULL,
  updated_at TEXT NOT NULL,
  settings TEXT,                -- JSON: project-specific overrides
  status TEXT DEFAULT 'active'
);

-- Source media files
CREATE TABLE media_files (
  id TEXT PRIMARY KEY,
  project_id TEXT NOT NULL REFERENCES projects(id),
  file_path TEXT NOT NULL,      -- relative to project dir
  file_hash TEXT,               -- SHA-256 for integrity
  duration_ms INTEGER,
  sample_rate INTEGER,
  channels INTEGER,
  format TEXT,
  file_size INTEGER,
  created_at TEXT NOT NULL
);

-- Speakers identified in audio
CREATE TABLE speakers (
  id TEXT PRIMARY KEY,
  project_id TEXT NOT NULL REFERENCES projects(id),
  label TEXT NOT NULL,          -- auto-assigned: "Speaker 1"
  display_name TEXT,            -- user-assigned: "Sarah Chen"
  color TEXT,                   -- hex color for UI
  metadata TEXT                 -- JSON: voice embedding ref, notes
);

-- Transcript segments (one per speaker turn)
CREATE TABLE segments (
  id TEXT PRIMARY KEY,
  project_id TEXT NOT NULL REFERENCES projects(id),
  media_file_id TEXT NOT NULL REFERENCES media_files(id),
  speaker_id TEXT REFERENCES speakers(id),
  start_ms INTEGER NOT NULL,
  end_ms INTEGER NOT NULL,
  text TEXT NOT NULL,
  original_text TEXT,           -- pre-edit text preserved
  confidence REAL,
  is_edited INTEGER DEFAULT 0,
  edited_at TEXT,
  segment_index INTEGER NOT NULL
);

-- Word-level timestamps (for click-to-seek and captions)
CREATE TABLE words (
  id TEXT PRIMARY KEY,
  segment_id TEXT NOT NULL REFERENCES segments(id),
  word TEXT NOT NULL,
  start_ms INTEGER NOT NULL,
  end_ms INTEGER NOT NULL,
  confidence REAL,
  word_index INTEGER NOT NULL
);

-- AI-generated outputs
CREATE TABLE ai_outputs (
  id TEXT PRIMARY KEY,
  project_id TEXT NOT NULL REFERENCES projects(id),
  output_type TEXT NOT NULL,    -- summary, action_items, notes, qa
  prompt TEXT,
  content TEXT NOT NULL,
  provider TEXT,
  created_at TEXT NOT NULL,
  metadata TEXT                 -- JSON: tokens, latency
);

-- User annotations and bookmarks
CREATE TABLE annotations (
  id TEXT PRIMARY KEY,
  project_id TEXT NOT NULL REFERENCES projects(id),
  start_ms INTEGER NOT NULL,
  end_ms INTEGER,
  text TEXT NOT NULL,
  type TEXT DEFAULT 'bookmark'
);

-- Performance indexes
CREATE INDEX idx_segments_project ON segments(project_id, segment_index);
CREATE INDEX idx_segments_time ON segments(media_file_id, start_ms);
CREATE INDEX idx_words_segment ON words(segment_id, word_index);
CREATE INDEX idx_words_time ON words(start_ms, end_ms);
CREATE INDEX idx_ai_outputs_project ON ai_outputs(project_id, output_type);
```

---

## 7. AI Provider Architecture

### Provider Interface

```python
from abc import ABC, abstractmethod
from collections.abc import AsyncIterator


class AIProvider(ABC):
    @abstractmethod
    async def chat(self, messages: list[dict], config: dict) -> str: ...

    @abstractmethod
    async def stream(self, messages: list[dict], config: dict) -> AsyncIterator[str]: ...
```

### Supported Providers

| Provider | Package / Binary | Use Case |
|----------|-----------------|----------|
| **llama-server** (bundled) | llama.cpp binary | Default local AI — bundled with app, no install needed. OpenAI-compatible API on localhost. |
| **LiteLLM** | `litellm` | Gateway to 100+ providers via unified API |
| **OpenAI** | `openai` | Direct OpenAI API (GPT-4o, etc.) |
| **Anthropic** | `anthropic` | Direct Anthropic API (Claude) |

#### Local AI via llama-server (llama.cpp)

The app bundles `llama-server` from the llama.cpp project (MIT license). This is the default AI provider — it runs entirely on the user's machine with no internet connection or separate install required.

**How it works:** on app launch (or on first AI use), the Rust backend spawns `llama-server` as a managed subprocess.
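To make the abstraction concrete, here is a hedged sketch of a local provider speaking the OpenAI-compatible chat-completions protocol to the bundled server. It is synchronous for brevity (the real interface is async), uses only the standard library, and the class name, port, and `temperature` default are illustrative.

```python
import json
import urllib.request

class LocalProvider:
    """Provider backed by the bundled llama-server (OpenAI-compatible API).
    The endpoint path and payload shape assume the standard chat-completions
    protocol; the actual port comes from the Rust process manager."""

    def __init__(self, base_url: str, model: str):
        self.base_url = base_url.rstrip("/")
        self.model = model

    def build_payload(self, messages: list[dict], config: dict) -> dict:
        return {
            "model": self.model,
            "messages": messages,
            "temperature": config.get("temperature", 0.7),  # illustrative default
            "stream": False,
        }

    def chat(self, messages: list[dict], config: dict) -> str:
        req = urllib.request.Request(
            f"{self.base_url}/v1/chat/completions",
            data=json.dumps(self.build_payload(messages, config)).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            body = json.load(resp)
        return body["choices"][0]["message"]["content"]
```

Because the wire format is OpenAI-compatible, the same request-building code would work against any of the cloud providers by swapping the base URL and adding an API key header.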
The server then exposes an OpenAI-compatible REST API on `localhost:{dynamic_port}`. The Python sidecar talks to it using the same OpenAI SDK interface as the cloud providers, and on app exit the Rust backend cleanly shuts the llama-server process down.

**Model management:**

- Models stored in `~/.voicetonotes/models/` (GGUF format)
- First-run setup downloads a recommended small model (e.g., Phi-3-mini, Llama-3-8B Q4)
- Users can download additional models or point to their own GGUF files
- Model selection in Settings UI with size/quality tradeoffs shown

**Hardware utilization:**

- CPU: Works on any machine, uses all available cores
- NVIDIA GPU: CUDA acceleration when available
- The same CPU/GPU auto-detection used for Whisper applies here

### Context Window Strategy

| Transcript Length | Strategy |
|-------------------|----------|
| < 100K tokens | Send full transcript directly |
| 100K - 200K tokens | Use Claude (200K context) or chunk for smaller models |
| > 200K tokens | Map-reduce: summarize chunks, then combine |
| Q&A mode | Semantic search over chunks, send top-K relevant to model |

### Configuration

Users configure AI providers in Settings. API keys for cloud providers are stored in the OS keychain (libsecret on Linux, Windows Credential Manager). The bundled llama-server requires no keys or internet.

```json
{
  "ai": {
    "default_provider": "local",
    "providers": {
      "local": { "model": "phi-3-mini-Q4_K_M.gguf", "gpu_layers": "auto" },
      "openai": { "model": "gpt-4o" },
      "anthropic": { "model": "claude-sonnet-4-20250514" },
      "litellm": { "model": "gpt-4o" }
    }
  }
}
```

---

## 8. Export Formats

### Caption Formats

| Format | Speaker Support | Library |
|--------|----------------|---------|
| **SRT** | `[Speaker]:` prefix convention | pysubs2 |
| **WebVTT** | Native `<v>` voice tags | pysubs2 |
| **ASS/SSA** | Named styles per speaker with colors | pysubs2 |

### Text Formats

| Format | Implementation |
|--------|---------------|
| **Plain text (.txt)** | Custom formatter |
| **Markdown (.md)** | Custom formatter (bold speaker names) |
| **DOCX** | python-docx |

### Text Output Example

```
[00:00:03] Sarah Chen: Hello everyone, welcome to the meeting. I wanted to
start by discussing the Q3 results before we move on to planning.

[00:00:15] Michael Torres: Thanks Sarah. The numbers look strong this quarter.
```

---

## 9. Project File Structure

```
~/VoiceToNotes/
  config.json                  # Global app settings
  projects/
    {project-uuid}/
      project.db               # SQLite database
      media/
        recording.m4a          # Original media file
      exports/
        transcript.srt
        transcript.vtt
        notes.md
```

---

## 10. Implementation Phases

### Phase 1 — Foundation

Set up Tauri + Svelte project scaffold, Python sidecar with IPC protocol, SQLite schema, and basic project management UI.

**Deliverables:**

- Tauri app launches with a working Svelte frontend
- Python sidecar starts, communicates via JSON-line IPC
- SQLite database created per-project
- Create/open/list projects in the UI

### Phase 2 — Core Transcription

Implement the transcription pipeline with audio playback and synchronized transcript display.

**Deliverables:**

- Import audio/video files (ffmpeg conversion to WAV)
- Run faster-whisper transcription with progress reporting
- Display transcript with word-level timestamps
- wavesurfer.js audio player with click-to-seek from transcript
- Auto-scroll transcript during playback
- Edit transcript text (corrections persist to DB)

### Phase 3 — Speaker Diarization

Add speaker identification and management.
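The central step of this phase — merging diarization turns with word timestamps — can be sketched with a midpoint-overlap rule. This is a common heuristic, not necessarily what the shipped pipeline does; the dict shapes follow the IPC payloads (`start_ms`/`end_ms`), and the `speaker` key on turns is assumed.

```python
def assign_speakers(words: list[dict], turns: list[dict]) -> list[dict]:
    """Attach a speaker to each word: a word belongs to the diarization
    turn that contains its temporal midpoint (None when no turn matches)."""
    tagged = []
    for w in words:
        mid = (w["start_ms"] + w["end_ms"]) / 2
        speaker = None
        for t in turns:
            if t["start_ms"] <= mid < t["end_ms"]:
                speaker = t["speaker"]
                break
        tagged.append({**w, "speaker": speaker})
    return tagged
```

Consecutive words with the same speaker can then be grouped into the speaker-turn segments stored in the `segments` table.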
**Deliverables:**

- pyannote.audio diarization integrated into pipeline
- Speaker segments merged with word timestamps
- Speaker labels displayed in transcript with colors
- Rename speakers (persists across all segments)
- Re-assign speaker for selected text segments
- Hardware detection and model auto-selection (CPU/GPU)

### Phase 4 — Export

Implement all export formats.

**Deliverables:**

- SRT, WebVTT, ASS caption export with speaker labels
- Plain text and Markdown export with speaker names
- Export options panel in UI

### Phase 5 — AI Integration

Add AI provider support for Q&A and summarization.

**Deliverables:**

- Provider configuration UI with API key management
- Bundled llama-server for local AI (default, no internet required)
- Model download manager for local GGUF models
- OpenAI and Anthropic direct SDK support
- LiteLLM gateway support
- Chat panel for asking questions about the transcript
- Summary/notes generation with multiple styles
- Context window management for long transcripts

### Phase 6 — Polish and Packaging

Production readiness.

**Deliverables:**

- Linux packaging (.deb, .AppImage)
- Windows packaging (.msi, .exe installer)
- Bundled Python environment (no user Python install required)
- Model download manager (first-run setup)
- Settings panel (model selection, hardware config, AI providers)
- Error handling, logging, crash recovery

---

## 11. Agent Work Breakdown

For parallel development, the codebase splits into these independent workstreams:

| Agent | Scope | Dependencies |
|-------|-------|-------------|
| **Agent 1: Tauri + Frontend Shell** | Tauri project setup, Svelte scaffold, routing, project manager UI, settings UI | None |
| **Agent 2: Python Sidecar + IPC** | Python project setup, IPC protocol, message loop, handler routing | None |
| **Agent 3: Database Layer** | SQLite schema, Rust query layer, migration system | None |
| **Agent 4: Transcription Pipeline** | faster-whisper integration, wav2vec2 alignment, hardware detection, model management | Agent 2 (IPC) |
| **Agent 5: Diarization Pipeline** | pyannote.audio integration, speaker-word alignment, combined pipeline | Agent 4 (transcription) |
| **Agent 6: Audio Player + Transcript UI** | wavesurfer.js integration, TipTap transcript editor, playback-transcript sync | Agent 1 (shell), Agent 3 (DB) |
| **Agent 7: Export System** | pysubs2 caption export, text formatters, export UI | Agent 2 (IPC), Agent 3 (DB) |
| **Agent 8: AI Provider System** | Provider abstraction, bundled llama-server, LiteLLM/OpenAI/Anthropic adapters, chat UI | Agent 2 (IPC), Agent 1 (shell) |

Agents 1, 2, and 3 can start immediately in parallel. Agents 4-8 follow once their dependencies are in place.