voice-to-notes/docs/ARCHITECTURE.md
Josh Knapp 0edb06a913 Add architecture document and project guidelines
Detailed architecture covering Tauri + Svelte frontend, Rust backend,
Python sidecar for ML (faster-whisper, pyannote.audio), IPC protocol,
SQLite schema, AI provider system, export formats, and phased
implementation plan with agent work breakdown.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 08:37:45 -08:00


Voice to Notes — Architecture Document

1. Overview

Voice to Notes is a desktop application that transcribes audio/video recordings with speaker identification. It runs entirely on the user's computer. Cloud AI providers are optional and only used when explicitly configured by the user.

┌─────────────────────────────────────────────────────────────────┐
│                        Tauri Application                        │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                   Frontend (Svelte + TS)                  │  │
│  │                                                           │  │
│  │  ┌─────────────┐ ┌──────────────┐ ┌───────────────────┐  │  │
│  │  │  Waveform    │ │  Transcript  │ │  AI Chat          │  │  │
│  │  │  Player      │ │  Editor      │ │  Panel            │  │  │
│  │  │ (wavesurfer) │ │  (TipTap)    │ │                   │  │  │
│  │  └─────────────┘ └──────────────┘ └───────────────────┘  │  │
│  │  ┌─────────────┐ ┌──────────────┐ ┌───────────────────┐  │  │
│  │  │  Speaker     │ │  Export      │ │  Project          │  │  │
│  │  │  Manager     │ │  Panel       │ │  Manager          │  │  │
│  │  └─────────────┘ └──────────────┘ └───────────────────┘  │  │
│  └──────────────────────────┬────────────────────────────────┘  │
│                             │ tauri::invoke()                   │
│  ┌──────────────────────────┴────────────────────────────────┐  │
│  │                  Rust Backend (thin layer)                 │  │
│  │                                                           │  │
│  │  ┌──────────────┐ ┌──────────────┐ ┌───────────────────┐  │  │
│  │  │  Process      │ │  File I/O    │ │  SQLite           │  │  │
│  │  │  Manager      │ │  & Media     │ │  (via rusqlite)   │  │  │
│  │  └──────┬───────┘ └──────────────┘ └───────────────────┘  │  │
│  └─────────┼─────────────────────────────────────────────────┘  │
└────────────┼────────────────────────────────────────────────────┘
             │ JSON-line IPC (stdin/stdout)
             │
┌────────────┴────────────────────────────────────────────────────┐
│                     Python Sidecar Process                       │
│                                                                  │
│  ┌──────────────┐  ┌──────────────┐  ┌────────────────────────┐ │
│  │  Transcribe   │  │  Diarize     │  │  AI Provider           │ │
│  │  Service      │  │  Service     │  │  Service               │ │
│  │              │  │              │  │                        │ │
│  │ faster-whisper│  │ pyannote     │  │  ┌──────────────────┐ │ │
│  │ + wav2vec2   │  │ .audio 4.0   │  │  │ LiteLLM adapter  │ │ │
│  │              │  │              │  │  │ OpenAI adapter    │ │ │
│  │ CPU: auto    │  │ CPU: auto    │  │  │ Anthropic adapter │ │ │
│  │ GPU: CUDA    │  │ GPU: CUDA    │  │  │ Ollama adapter    │ │ │
│  └──────────────┘  └──────────────┘  │  └──────────────────┘ │ │
│                                       └────────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘

2. Technology Stack

| Layer | Technology | Purpose |
|---|---|---|
| Desktop Shell | Tauri v2 | Window management, OS integration, native packaging |
| Frontend | Svelte + TypeScript | UI components, state management |
| Audio Waveform | wavesurfer.js | Waveform visualization, click-to-seek playback |
| Transcript Editor | TipTap (ProseMirror) | Rich text editing with speaker-colored labels |
| Backend | Rust (thin) | Process management, file I/O, SQLite access, IPC relay |
| Database | SQLite (via rusqlite) | Project data, transcripts, word timestamps, speaker info |
| ML Runtime | Python sidecar | Speech-to-text, diarization, AI provider integration |
| STT Engine | faster-whisper | Transcription with word-level timestamps |
| Timestamp Refinement | wav2vec2 | Precise word-level alignment |
| Speaker Diarization | pyannote.audio 4.0 | Speaker segment detection |
| AI Providers | LiteLLM / direct SDKs | Summarization, Q&A, notes |
| Caption Export | pysubs2 | SRT, WebVTT, ASS subtitle generation |

3. CPU / GPU Strategy

All ML components must work on CPU. GPU acceleration is used when available but never required.

Detection and Selection

App Launch
    │
    ├─ Detect hardware (Python: torch.cuda.is_available(), etc.)
    │
    ├─ NVIDIA GPU detected (CUDA)
    │   ├─ VRAM >= 8GB → large-v3-turbo (int8), pyannote on GPU
    │   ├─ VRAM >= 4GB → medium model (int8), pyannote on GPU
    │   └─ VRAM < 4GB  → fall back to CPU
    │
    ├─ No GPU / unsupported GPU
    │   ├─ RAM >= 16GB → medium model on CPU, pyannote on CPU
    │   ├─ RAM >= 8GB  → small model on CPU, pyannote on CPU
    │   └─ RAM < 8GB   → base model on CPU, pyannote on CPU (warn: slow)
    │
    └─ User can override in Settings
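
The decision tree above can be captured as a pure function. The hardware probing itself (torch.cuda.is_available(), VRAM and RAM queries) is deliberately left out so the selection logic stays testable; the function and field names here are illustrative sketches, not the actual detect.py API.

```python
from dataclasses import dataclass

@dataclass
class HardwareInfo:
    has_cuda: bool
    vram_mb: int  # 0 when no GPU is present
    ram_mb: int

def select_models(hw: HardwareInfo) -> dict:
    """Mirror of the selection tree: GPU tiers first, then CPU sized by RAM."""
    if hw.has_cuda and hw.vram_mb >= 8192:
        return {"stt": "large-v3-turbo", "compute": "int8", "device": "cuda"}
    if hw.has_cuda and hw.vram_mb >= 4096:
        return {"stt": "medium", "compute": "int8", "device": "cuda"}
    # Under 4 GB VRAM (or no GPU at all): fall back to CPU
    if hw.ram_mb >= 16384:
        return {"stt": "medium", "compute": "int8", "device": "cpu"}
    if hw.ram_mb >= 8192:
        return {"stt": "small", "compute": "int8", "device": "cpu"}
    return {"stt": "base", "compute": "int8", "device": "cpu", "warn_slow": True}
```

Keeping the probe and the decision separate also makes the Settings override trivial: a user choice simply replaces the returned dict.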

Model Recommendations by Hardware

| Hardware | STT Model | Diarization | Expected Speed |
|---|---|---|---|
| NVIDIA GPU, 8GB+ VRAM | large-v3-turbo (int8) | pyannote GPU | ~20x realtime |
| NVIDIA GPU, 4GB VRAM | medium (int8) | pyannote GPU | ~10x realtime |
| CPU only, 16GB RAM | medium (int8_cpu) | pyannote CPU | ~2-4x realtime |
| CPU only, 8GB RAM | small (int8_cpu) | pyannote CPU | ~3-5x realtime |
| CPU only, minimal | base | pyannote CPU | ~5-8x realtime |

Users can always override model selection in settings. The app displays estimated processing time before starting.

CTranslate2 CPU Backends

faster-whisper uses CTranslate2, which supports multiple CPU acceleration backends:

  • Intel MKL — Best performance on Intel CPUs
  • oneDNN — Good cross-platform alternative
  • OpenBLAS — Fallback for any CPU
  • Ruy — Lightweight option for ARM

The Python sidecar auto-detects and uses the best available backend.


4. Component Architecture

4.1 Frontend (Svelte + TypeScript)

src/
  lib/
    components/
      WaveformPlayer.svelte     # wavesurfer.js wrapper, playback controls
      TranscriptEditor.svelte   # TipTap editor with speaker labels
      SpeakerManager.svelte     # Assign names/colors to speakers
      ExportPanel.svelte        # Export format selection and options
      AIChatPanel.svelte        # Chat interface for AI Q&A
      ProjectList.svelte        # Project browser/manager
      SettingsPanel.svelte      # Model selection, AI config, preferences
      ProgressOverlay.svelte    # Transcription progress with cancel
    stores/
      project.ts                # Current project state
      transcript.ts             # Segments, words, speakers
      playback.ts               # Audio position, playing state
      ai.ts                     # AI provider config and chat history
    services/
      tauri-bridge.ts           # Typed wrappers around tauri::invoke
      audio-sync.ts             # Sync playback position ↔ transcript highlight
      export.ts                 # Trigger export via backend
    types/
      transcript.ts             # Segment, Word, Speaker interfaces
      project.ts                # Project, MediaFile interfaces
  routes/
    +page.svelte                # Main workspace
    +layout.svelte              # App shell with sidebar

Key UI interactions:

  • Click a word in the transcript → audio seeks to that word's start_ms
  • Audio plays → transcript auto-scrolls and highlights current word/segment
  • Click speaker label → open rename dialog, changes propagate to all segments
  • Drag to select text → option to re-assign speaker for selection
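
The playback-to-highlight direction of that sync is essentially a binary search over word start times (the real logic lives in audio-sync.ts; it is sketched here in Python for brevity):

```python
from bisect import bisect_right

def current_word_index(word_starts_ms: list[int], position_ms: int) -> int:
    """Index of the word whose start time is at or before the playhead.

    word_starts_ms must be sorted ascending (it is, by word_index).
    Returns -1 before the first word has started.
    """
    return bisect_right(word_starts_ms, position_ms) - 1
```

Because the lookup is O(log n), it can run on every animation frame without touching the database.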

4.2 Rust Backend

The Rust layer is intentionally thin. It handles:

  1. Process Management — Spawn, monitor, and kill the Python sidecar
  2. IPC Relay — Forward messages between frontend and Python process
  3. File Operations — Read/write project files, manage media
  4. SQLite — All database operations via rusqlite
  5. System Info — Detect GPU, RAM, CPU for hardware recommendations

src-tauri/
  src/
    main.rs                     # Tauri app entry point
    commands/
      project.rs                # CRUD for projects
      transcribe.rs             # Start/stop/monitor transcription
      export.rs                 # Trigger caption/text export
      ai.rs                     # AI provider commands
      settings.rs               # App settings and preferences
      system.rs                 # Hardware detection
    db/
      mod.rs                    # SQLite connection pool
      schema.rs                 # Table definitions and migrations
      queries.rs                # Prepared queries
    sidecar/
      mod.rs                    # Python process lifecycle
      ipc.rs                    # JSON-line protocol handler
      messages.rs               # IPC message types (serde)
    state.rs                    # App state (db handle, sidecar handle)

4.3 Python Sidecar

The Python process runs independently and communicates via JSON-line protocol over stdin/stdout.

python/
  voice_to_notes/
    __init__.py
    main.py                     # Entry point, IPC message loop
    ipc/
      __init__.py
      protocol.py               # JSON-line read/write, message types
      handlers.py               # Route messages to services
    services/
      transcribe.py             # faster-whisper + wav2vec2 pipeline
      diarize.py                # pyannote.audio diarization
      pipeline.py               # Combined transcribe + diarize workflow
      ai_provider.py            # AI provider abstraction
      export.py                 # pysubs2 caption export, text export
    providers/
      __init__.py
      base.py                   # Abstract AI provider interface
      litellm_provider.py       # LiteLLM (multi-provider gateway)
      openai_provider.py        # Direct OpenAI SDK
      anthropic_provider.py     # Direct Anthropic SDK
      ollama_provider.py        # Local Ollama models
    hardware/
      __init__.py
      detect.py                 # GPU/CPU detection, VRAM estimation
      models.py                 # Model selection logic
    utils/
      audio.py                  # Audio format conversion (ffmpeg wrapper)
      progress.py               # Progress reporting via IPC
  tests/
    test_transcribe.py
    test_diarize.py
    test_pipeline.py
    test_providers.py
    test_export.py
  pyproject.toml                # Dependencies and build config
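
The merge step in pipeline.py has to reconcile two independent outputs: word timestamps from transcription and speaker turns from diarization. A common approach, assumed here rather than quoted from the implementation, assigns each word to the speaker segment with the greatest time overlap:

```python
def overlap_ms(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Length of the intersection of two [start, end) intervals, in ms."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(words: list[dict], speaker_segments: list[dict]) -> list[dict]:
    """words: [{'word', 'start_ms', 'end_ms'}];
    speaker_segments: [{'speaker', 'start_ms', 'end_ms'}].
    Returns words annotated with the best-overlapping speaker (None if no overlap)."""
    out = []
    for w in words:
        best, best_ov = None, 0
        for seg in speaker_segments:
            ov = overlap_ms(w["start_ms"], w["end_ms"],
                            seg["start_ms"], seg["end_ms"])
            if ov > best_ov:
                best, best_ov = seg["speaker"], ov
        out.append({**w, "speaker": best})
    return out
```

Consecutive words with the same speaker are then grouped into the speaker-turn segments stored in the database.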

5. IPC Protocol

The Rust backend and Python sidecar communicate via newline-delimited JSON on stdin/stdout. Each message has a type, an optional request ID (for correlating responses), and a payload.

Message Format

{"id": "req-001", "type": "transcribe.start", "payload": {"file": "/path/to/audio.wav", "model": "large-v3-turbo", "device": "cuda", "language": "auto"}}

Message Types

Requests (Rust → Python):

| Type | Payload | Description |
|---|---|---|
| `transcribe.start` | `{file, model, device, language}` | Start transcription |
| `transcribe.cancel` | `{id}` | Cancel running transcription |
| `diarize.start` | `{file, num_speakers?}` | Start speaker diarization |
| `pipeline.start` | `{file, model, device, language, num_speakers?}` | Full transcribe + diarize |
| `ai.chat` | `{provider, model, messages, transcript_context}` | Send AI chat message |
| `ai.summarize` | `{provider, model, transcript, style}` | Generate summary/notes |
| `export.captions` | `{segments, format, options}` | Export caption file |
| `export.text` | `{segments, speakers, format, options}` | Export text document |
| `hardware.detect` | `{}` | Detect available hardware |

Responses (Python → Rust):

| Type | Payload | Description |
|---|---|---|
| `progress` | `{id, percent, stage, message}` | Progress update |
| `transcribe.result` | `{segments: [{text, start_ms, end_ms, words: [...]}]}` | Transcription complete |
| `diarize.result` | `{speakers: [{id, segments: [{start_ms, end_ms}]}]}` | Diarization complete |
| `pipeline.result` | `{segments, speakers, words}` | Full pipeline result |
| `ai.response` | `{content, tokens_used, provider}` | AI response |
| `ai.stream` | `{id, delta, done}` | Streaming AI token |
| `export.done` | `{path}` | Export file written |
| `error` | `{id, code, message}` | Error response |
| `hardware.info` | `{gpu, vram_mb, ram_mb, cpu_cores, recommended_model}` | Hardware info |

Progress Reporting

Long-running operations (transcription, diarization) send periodic progress messages:

{"id": "req-001", "type": "progress", "payload": {"percent": 45, "stage": "transcribing", "message": "Processing segment 23/51..."}}

Stages: loading_model → preprocessing → transcribing → aligning → diarizing → merging → done
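
One detail worth pinning down is how per-stage progress maps onto the single percent field. A sketch, with purely illustrative stage weights (the shipped values would be tuned per pipeline):

```python
# Illustrative stage weights summing to 100 — not the shipped values.
STAGE_WEIGHTS = [
    ("loading_model", 5),
    ("preprocessing", 5),
    ("transcribing", 45),
    ("aligning", 10),
    ("diarizing", 30),
    ("merging", 5),
]

def overall_percent(stage: str, stage_fraction: float) -> int:
    """Combine fully completed stages with fractional progress in the current one."""
    done = 0
    for name, weight in STAGE_WEIGHTS:
        if name == stage:
            return int(done + weight * min(max(stage_fraction, 0.0), 1.0))
        done += weight
    return 100  # "done" (or an unknown stage)
```

Weighting stages this way keeps the progress bar monotonic even though the stages run at very different speeds.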


6. Database Schema

SQLite database stored per-project at {project_dir}/project.db.

-- Projects metadata
CREATE TABLE projects (
    id          TEXT PRIMARY KEY,
    name        TEXT NOT NULL,
    created_at  TEXT NOT NULL,
    updated_at  TEXT NOT NULL,
    settings    TEXT,               -- JSON: project-specific overrides
    status      TEXT DEFAULT 'active'
);

-- Source media files
CREATE TABLE media_files (
    id          TEXT PRIMARY KEY,
    project_id  TEXT NOT NULL REFERENCES projects(id),
    file_path   TEXT NOT NULL,      -- relative to project dir
    file_hash   TEXT,               -- SHA-256 for integrity
    duration_ms INTEGER,
    sample_rate INTEGER,
    channels    INTEGER,
    format      TEXT,
    file_size   INTEGER,
    created_at  TEXT NOT NULL
);

-- Speakers identified in audio
CREATE TABLE speakers (
    id           TEXT PRIMARY KEY,
    project_id   TEXT NOT NULL REFERENCES projects(id),
    label        TEXT NOT NULL,     -- auto-assigned: "Speaker 1"
    display_name TEXT,              -- user-assigned: "Sarah Chen"
    color        TEXT,              -- hex color for UI
    metadata     TEXT               -- JSON: voice embedding ref, notes
);

-- Transcript segments (one per speaker turn)
CREATE TABLE segments (
    id            TEXT PRIMARY KEY,
    project_id    TEXT NOT NULL REFERENCES projects(id),
    media_file_id TEXT NOT NULL REFERENCES media_files(id),
    speaker_id    TEXT REFERENCES speakers(id),
    start_ms      INTEGER NOT NULL,
    end_ms        INTEGER NOT NULL,
    text          TEXT NOT NULL,
    original_text TEXT,             -- pre-edit text preserved
    confidence    REAL,
    is_edited     INTEGER DEFAULT 0,
    edited_at     TEXT,
    segment_index INTEGER NOT NULL
);

-- Word-level timestamps (for click-to-seek and captions)
CREATE TABLE words (
    id         TEXT PRIMARY KEY,
    segment_id TEXT NOT NULL REFERENCES segments(id),
    word       TEXT NOT NULL,
    start_ms   INTEGER NOT NULL,
    end_ms     INTEGER NOT NULL,
    confidence REAL,
    word_index INTEGER NOT NULL
);

-- AI-generated outputs
CREATE TABLE ai_outputs (
    id          TEXT PRIMARY KEY,
    project_id  TEXT NOT NULL REFERENCES projects(id),
    output_type TEXT NOT NULL,      -- summary, action_items, notes, qa
    prompt      TEXT,
    content     TEXT NOT NULL,
    provider    TEXT,
    created_at  TEXT NOT NULL,
    metadata    TEXT                -- JSON: tokens, latency
);

-- User annotations and bookmarks
CREATE TABLE annotations (
    id         TEXT PRIMARY KEY,
    project_id TEXT NOT NULL REFERENCES projects(id),
    start_ms   INTEGER NOT NULL,
    end_ms     INTEGER,
    text       TEXT NOT NULL,
    type       TEXT DEFAULT 'bookmark'
);

-- Performance indexes
CREATE INDEX idx_segments_project ON segments(project_id, segment_index);
CREATE INDEX idx_segments_time ON segments(media_file_id, start_ms);
CREATE INDEX idx_words_segment ON words(segment_id, word_index);
CREATE INDEX idx_words_time ON words(start_ms, end_ms);
CREATE INDEX idx_ai_outputs_project ON ai_outputs(project_id, output_type);
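
The idx_words_time index exists so that playhead-to-word lookup stays cheap. A sketch of that query against the schema above (foreign keys trimmed so the example is self-contained):

```python
import sqlite3

SCHEMA = """
CREATE TABLE words (
    id TEXT PRIMARY KEY, segment_id TEXT NOT NULL,
    word TEXT NOT NULL, start_ms INTEGER NOT NULL,
    end_ms INTEGER NOT NULL, confidence REAL, word_index INTEGER NOT NULL
);
CREATE INDEX idx_words_time ON words(start_ms, end_ms);
"""

def word_at(conn: sqlite3.Connection, position_ms: int):
    """The word under the playhead, or None in a gap between words."""
    return conn.execute(
        "SELECT word, start_ms, end_ms FROM words "
        "WHERE start_ms <= ? AND end_ms > ? "
        "ORDER BY start_ms DESC LIMIT 1",
        (position_ms, position_ms),
    ).fetchone()
```

In the app this query runs through rusqlite rather than Python, but the shape is the same.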

7. AI Provider Architecture

Provider Interface

from abc import ABC, abstractmethod
from collections.abc import AsyncIterator

class AIProvider(ABC):
    @abstractmethod
    async def chat(self, messages: list[dict], config: dict) -> str: ...

    @abstractmethod
    async def stream(self, messages: list[dict], config: dict) -> AsyncIterator[str]: ...

Supported Providers

| Provider | Package | Use Case |
|---|---|---|
| LiteLLM | litellm | Gateway to 100+ providers via unified API |
| OpenAI | openai | Direct OpenAI API (GPT-4o, etc.) |
| Anthropic | anthropic | Direct Anthropic API (Claude) |
| Ollama | HTTP to localhost:11434 | Local models (Llama, Mistral, Phi, etc.) |
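
As a concrete instance, a sketch of the Ollama adapter against its /api/chat endpoint (non-streaming; error handling and the streaming variant omitted). The request body follows Ollama's documented chat format; everything else here is illustrative rather than the actual ollama_provider.py code:

```python
import asyncio
import json
import urllib.request

def build_chat_request(base_url: str, model: str,
                       messages: list[dict]) -> urllib.request.Request:
    """Non-streaming chat request in Ollama's /api/chat shape."""
    body = json.dumps({"model": model, "messages": messages, "stream": False})
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=body.encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

async def ollama_chat(messages: list[dict], config: dict) -> str:
    """Run the blocking HTTP call off the event loop thread."""
    req = build_chat_request(config.get("base_url", "http://localhost:11434"),
                             config["model"], messages)

    def _send() -> str:
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["message"]["content"]

    return await asyncio.to_thread(_send)
```

Separating request construction from I/O keeps the adapter testable without a running Ollama instance.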

Context Window Strategy

| Transcript Length | Strategy |
|---|---|
| < 100K tokens | Send full transcript directly |
| 100K - 200K tokens | Use Claude (200K context) or chunk for smaller models |
| > 200K tokens | Map-reduce: summarize chunks, then combine |
| Q&A mode | Semantic search over chunks, send top-K relevant to model |
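
The routing above reduces to a threshold check on estimated token count. A sketch, where the roughly-4-characters-per-token heuristic and the strategy names are assumptions, not measured values from the implementation:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return len(text) // 4

def pick_strategy(transcript: str, model_context_tokens: int) -> str:
    """Choose how to feed the transcript to the configured model."""
    tokens = estimate_tokens(transcript)
    if tokens < 100_000 and tokens < model_context_tokens:
        return "full"
    if tokens <= 200_000 and model_context_tokens >= 200_000:
        return "full_large_context"
    return "map_reduce"
```

Q&A mode bypasses this entirely: it retrieves top-K chunks by semantic similarity, so the prompt size is bounded regardless of transcript length.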

Configuration

Users configure AI providers in Settings. API keys are stored in the OS credential store (libsecret on Linux, Credential Manager on Windows); local models via Ollama require no keys.

{
  "ai": {
    "default_provider": "ollama",
    "providers": {
      "ollama": { "base_url": "http://localhost:11434", "model": "llama3:8b" },
      "openai": { "model": "gpt-4o" },
      "anthropic": { "model": "claude-sonnet-4-20250514" },
      "litellm": { "model": "gpt-4o" }
    }
  }
}

8. Export Formats

Caption Formats

| Format | Speaker Support | Library |
|---|---|---|
| SRT | `[Speaker]:` prefix convention | pysubs2 |
| WebVTT | Native `<v Speaker>` voice tags | pysubs2 |
| ASS/SSA | Named styles per speaker with colors | pysubs2 |
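
pysubs2 does the heavy lifting in the actual exporter, but the SRT speaker-prefix convention itself is simple enough to show directly. A self-contained sketch of one cue (timestamps in SRT's HH:MM:SS,mmm shape):

```python
def srt_timestamp(ms: int) -> str:
    """Millisecond offset → SRT timestamp, e.g. 3000 → '00:00:03,000'."""
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, millis = divmod(rem, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{millis:03d}"

def srt_cue(index: int, start_ms: int, end_ms: int,
            speaker: str, text: str) -> str:
    """One numbered SRT cue with the [Speaker]: prefix convention."""
    return (f"{index}\n"
            f"{srt_timestamp(start_ms)} --> {srt_timestamp(end_ms)}\n"
            f"[{speaker}]: {text}\n")
```

The prefix convention is needed because SRT, unlike WebVTT and ASS, has no native speaker field.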

Text Formats

| Format | Implementation |
|---|---|
| Plain text (.txt) | Custom formatter |
| Markdown (.md) | Custom formatter (bold speaker names) |
| DOCX | python-docx |

Text Output Example

[00:00:03] Sarah Chen:
Hello everyone, welcome to the meeting. I wanted to start by
discussing the Q3 results before we move on to planning.

[00:00:15] Michael Torres:
Thanks Sarah. The numbers look strong this quarter.
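
A sketch of the custom formatter behind that output, with speaker-name resolution and word wrapping simplified (the wrap width is an illustrative choice):

```python
import textwrap

def format_timestamp(ms: int) -> str:
    """Millisecond offset → '[HH:MM:SS]' as used in the text export."""
    s = ms // 1000
    return f"[{s // 3600:02d}:{(s % 3600) // 60:02d}:{s % 60:02d}]"

def format_transcript(segments: list[dict], width: int = 64) -> str:
    """segments: [{'start_ms', 'speaker', 'text'}], ordered by segment_index."""
    blocks = []
    for seg in segments:
        header = f"{format_timestamp(seg['start_ms'])} {seg['speaker']}:"
        body = textwrap.fill(seg["text"], width=width)
        blocks.append(f"{header}\n{body}")
    return "\n\n".join(blocks)
```

The Markdown variant differs only in the header line (bold speaker name instead of the plain one).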

9. Project File Structure

~/VoiceToNotes/
  config.json                   # Global app settings
  projects/
    {project-uuid}/
      project.db                # SQLite database
      media/
        recording.m4a           # Original media file
      exports/
        transcript.srt
        transcript.vtt
        notes.md

10. Implementation Phases

Phase 1 — Foundation

Set up Tauri + Svelte project scaffold, Python sidecar with IPC protocol, SQLite schema, and basic project management UI.

Deliverables:

  • Tauri app launches with a working Svelte frontend
  • Python sidecar starts, communicates via JSON-line IPC
  • SQLite database created per-project
  • Create/open/list projects in the UI

Phase 2 — Core Transcription

Implement the transcription pipeline with audio playback and synchronized transcript display.

Deliverables:

  • Import audio/video files (ffmpeg conversion to WAV)
  • Run faster-whisper transcription with progress reporting
  • Display transcript with word-level timestamps
  • wavesurfer.js audio player with click-to-seek from transcript
  • Auto-scroll transcript during playback
  • Edit transcript text (corrections persist to DB)

Phase 3 — Speaker Diarization

Add speaker identification and management.

Deliverables:

  • pyannote.audio diarization integrated into pipeline
  • Speaker segments merged with word timestamps
  • Speaker labels displayed in transcript with colors
  • Rename speakers (persists across all segments)
  • Re-assign speaker for selected text segments
  • Hardware detection and model auto-selection (CPU/GPU)

Phase 4 — Export

Implement all export formats.

Deliverables:

  • SRT, WebVTT, ASS caption export with speaker labels
  • Plain text and Markdown export with speaker names
  • Export options panel in UI

Phase 5 — AI Integration

Add AI provider support for Q&A and summarization.

Deliverables:

  • Provider configuration UI with API key management
  • Ollama local model support
  • OpenAI and Anthropic direct SDK support
  • LiteLLM gateway support
  • Chat panel for asking questions about the transcript
  • Summary/notes generation with multiple styles
  • Context window management for long transcripts

Phase 6 — Polish and Packaging

Production readiness.

Deliverables:

  • Linux packaging (.deb, .AppImage)
  • Windows packaging (.msi, .exe installer)
  • Bundled Python environment (no user Python install required)
  • Model download manager (first-run setup)
  • Settings panel (model selection, hardware config, AI providers)
  • Error handling, logging, crash recovery

11. Agent Work Breakdown

For parallel development, the codebase splits into these independent workstreams:

| Agent | Scope | Dependencies |
|---|---|---|
| Agent 1: Tauri + Frontend Shell | Tauri project setup, Svelte scaffold, routing, project manager UI, settings UI | None |
| Agent 2: Python Sidecar + IPC | Python project setup, IPC protocol, message loop, handler routing | None |
| Agent 3: Database Layer | SQLite schema, Rust query layer, migration system | None |
| Agent 4: Transcription Pipeline | faster-whisper integration, wav2vec2 alignment, hardware detection, model management | Agent 2 (IPC) |
| Agent 5: Diarization Pipeline | pyannote.audio integration, speaker-word alignment, combined pipeline | Agent 4 (transcription) |
| Agent 6: Audio Player + Transcript UI | wavesurfer.js integration, TipTap transcript editor, playback-transcript sync | Agent 1 (shell), Agent 3 (DB) |
| Agent 7: Export System | pysubs2 caption export, text formatters, export UI | Agent 2 (IPC), Agent 3 (DB) |
| Agent 8: AI Provider System | Provider abstraction, LiteLLM/OpenAI/Anthropic/Ollama adapters, chat UI | Agent 2 (IPC), Agent 1 (shell) |

Agents 1, 2, and 3 can start immediately in parallel. Agents 4-8 follow once their dependencies are in place.