voice-to-notes/docs/ARCHITECTURE.md
Josh Knapp 0edb06a913 Add architecture document and project guidelines
Detailed architecture covering Tauri + Svelte frontend, Rust backend,
Python sidecar for ML (faster-whisper, pyannote.audio), IPC protocol,
SQLite schema, AI provider system, export formats, and phased
implementation plan with agent work breakdown.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 08:37:45 -08:00


Voice to Notes — Architecture Document

1. Overview

Voice to Notes is a desktop application that transcribes audio/video recordings with speaker identification. It runs entirely on the user's computer. Cloud AI providers are optional and only used when explicitly configured by the user.

┌─────────────────────────────────────────────────────────────────┐
│                        Tauri Application                        │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                   Frontend (Svelte + TS)                  │  │
│  │                                                           │  │
│  │  ┌─────────────┐ ┌──────────────┐ ┌───────────────────┐  │  │
│  │  │  Waveform    │ │  Transcript  │ │  AI Chat          │  │  │
│  │  │  Player      │ │  Editor      │ │  Panel            │  │  │
│  │  │ (wavesurfer) │ │  (TipTap)    │ │                   │  │  │
│  │  └─────────────┘ └──────────────┘ └───────────────────┘  │  │
│  │  ┌─────────────┐ ┌──────────────┐ ┌───────────────────┐  │  │
│  │  │  Speaker     │ │  Export      │ │  Project          │  │  │
│  │  │  Manager     │ │  Panel       │ │  Manager          │  │  │
│  │  └─────────────┘ └──────────────┘ └───────────────────┘  │  │
│  └──────────────────────────┬────────────────────────────────┘  │
│                             │ tauri::invoke()                   │
│  ┌──────────────────────────┴────────────────────────────────┐  │
│  │                  Rust Backend (thin layer)                 │  │
│  │                                                           │  │
│  │  ┌──────────────┐ ┌──────────────┐ ┌───────────────────┐  │  │
│  │  │  Process      │ │  File I/O    │ │  SQLite           │  │  │
│  │  │  Manager      │ │  & Media     │ │  (via rusqlite)   │  │  │
│  │  └──────┬───────┘ └──────────────┘ └───────────────────┘  │  │
│  └─────────┼─────────────────────────────────────────────────┘  │
└────────────┼────────────────────────────────────────────────────┘
             │ JSON-line IPC (stdin/stdout)
             │
┌────────────┴────────────────────────────────────────────────────┐
│                     Python Sidecar Process                       │
│                                                                  │
│  ┌──────────────┐  ┌──────────────┐  ┌────────────────────────┐ │
│  │  Transcribe   │  │  Diarize     │  │  AI Provider           │ │
│  │  Service      │  │  Service     │  │  Service               │ │
│  │              │  │              │  │                        │ │
│  │ faster-whisper│  │ pyannote     │  │  ┌──────────────────┐ │ │
│  │ + wav2vec2   │  │ .audio 4.0   │  │  │ LiteLLM adapter  │ │ │
│  │              │  │              │  │  │ OpenAI adapter    │ │ │
│  │ CPU: auto    │  │ CPU: auto    │  │  │ Anthropic adapter │ │ │
│  │ GPU: CUDA    │  │ GPU: CUDA    │  │  │ Ollama adapter    │ │ │
│  └──────────────┘  └──────────────┘  │  └──────────────────┘ │ │
│                                       └────────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘

2. Technology Stack

| Layer | Technology | Purpose |
|---|---|---|
| Desktop Shell | Tauri v2 | Window management, OS integration, native packaging |
| Frontend | Svelte + TypeScript | UI components, state management |
| Audio Waveform | wavesurfer.js | Waveform visualization, click-to-seek playback |
| Transcript Editor | TipTap (ProseMirror) | Rich text editing with speaker-colored labels |
| Backend | Rust (thin) | Process management, file I/O, SQLite access, IPC relay |
| Database | SQLite (via rusqlite) | Project data, transcripts, word timestamps, speaker info |
| ML Runtime | Python sidecar | Speech-to-text, diarization, AI provider integration |
| STT Engine | faster-whisper | Transcription with word-level timestamps |
| Timestamp Refinement | wav2vec2 | Precise word-level alignment |
| Speaker Diarization | pyannote.audio 4.0 | Speaker segment detection |
| AI Providers | LiteLLM / direct SDKs | Summarization, Q&A, notes |
| Caption Export | pysubs2 | SRT, WebVTT, ASS subtitle generation |

3. CPU / GPU Strategy

All ML components must work on CPU. GPU acceleration is used when available but never required.

Detection and Selection

App Launch
    │
    ├─ Detect hardware (Python: torch.cuda.is_available(), etc.)
    │
    ├─ NVIDIA GPU detected (CUDA)
    │   ├─ VRAM >= 8GB → large-v3-turbo (int8), pyannote on GPU
    │   ├─ VRAM >= 4GB → medium model (int8), pyannote on GPU
    │   └─ VRAM < 4GB  → fall back to CPU
    │
    ├─ No GPU / unsupported GPU
    │   ├─ RAM >= 16GB → medium model on CPU, pyannote on CPU
    │   ├─ RAM >= 8GB  → small model on CPU, pyannote on CPU
    │   └─ RAM < 8GB   → base model on CPU, pyannote on CPU (warn: slow)
    │
    └─ User can override in Settings
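
The decision tree above can be captured as a pure function. The hardware probing itself (torch.cuda.is_available(), VRAM and RAM queries) is deliberately left out so the selection logic stays testable; the function and field names here are illustrative sketches, not the actual detect.py API.

```python
from dataclasses import dataclass

@dataclass
class HardwareInfo:
    has_cuda: bool
    vram_mb: int  # 0 when no GPU is present
    ram_mb: int

def select_models(hw: HardwareInfo) -> dict:
    """Mirror of the selection tree: GPU tiers first, then CPU sized by RAM."""
    if hw.has_cuda and hw.vram_mb >= 8192:
        return {"stt": "large-v3-turbo", "compute": "int8", "device": "cuda"}
    if hw.has_cuda and hw.vram_mb >= 4096:
        return {"stt": "medium", "compute": "int8", "device": "cuda"}
    # Under 4 GB VRAM (or no GPU at all): fall back to CPU
    if hw.ram_mb >= 16384:
        return {"stt": "medium", "compute": "int8", "device": "cpu"}
    if hw.ram_mb >= 8192:
        return {"stt": "small", "compute": "int8", "device": "cpu"}
    return {"stt": "base", "compute": "int8", "device": "cpu", "warn_slow": True}
```

Keeping the probe and the decision separate also makes the Settings override trivial: a user choice simply replaces the returned dict.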

Model Recommendations by Hardware

| Hardware | STT Model | Diarization | Expected Speed |
|---|---|---|---|
| NVIDIA GPU, 8GB+ VRAM | large-v3-turbo (int8) | pyannote GPU | ~20x realtime |
| NVIDIA GPU, 4GB VRAM | medium (int8) | pyannote GPU | ~10x realtime |
| CPU only, 16GB RAM | medium (int8_cpu) | pyannote CPU | ~2-4x realtime |
| CPU only, 8GB RAM | small (int8_cpu) | pyannote CPU | ~3-5x realtime |
| CPU only, minimal | base | pyannote CPU | ~5-8x realtime |

Users can always override model selection in settings. The app displays estimated processing time before starting.

CTranslate2 CPU Backends

faster-whisper uses CTranslate2, which supports multiple CPU acceleration backends:

  • Intel MKL — Best performance on Intel CPUs
  • oneDNN — Good cross-platform alternative
  • OpenBLAS — Fallback for any CPU
  • Ruy — Lightweight option for ARM

The Python sidecar auto-detects and uses the best available backend.


4. Component Architecture

4.1 Frontend (Svelte + TypeScript)

src/
  lib/
    components/
      WaveformPlayer.svelte     # wavesurfer.js wrapper, playback controls
      TranscriptEditor.svelte   # TipTap editor with speaker labels
      SpeakerManager.svelte     # Assign names/colors to speakers
      ExportPanel.svelte        # Export format selection and options
      AIChatPanel.svelte        # Chat interface for AI Q&A
      ProjectList.svelte        # Project browser/manager
      SettingsPanel.svelte      # Model selection, AI config, preferences
      ProgressOverlay.svelte    # Transcription progress with cancel
    stores/
      project.ts                # Current project state
      transcript.ts             # Segments, words, speakers
      playback.ts               # Audio position, playing state
      ai.ts                     # AI provider config and chat history
    services/
      tauri-bridge.ts           # Typed wrappers around tauri::invoke
      audio-sync.ts             # Sync playback position ↔ transcript highlight
      export.ts                 # Trigger export via backend
    types/
      transcript.ts             # Segment, Word, Speaker interfaces
      project.ts                # Project, MediaFile interfaces
  routes/
    +page.svelte                # Main workspace
    +layout.svelte              # App shell with sidebar

Key UI interactions:

  • Click a word in the transcript → audio seeks to that word's start_ms
  • Audio plays → transcript auto-scrolls and highlights current word/segment
  • Click speaker label → open rename dialog, changes propagate to all segments
  • Drag to select text → option to re-assign speaker for selection
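
The playback-to-highlight direction of that sync is essentially a binary search over word start times (the real logic lives in audio-sync.ts; it is sketched here in Python for brevity):

```python
from bisect import bisect_right

def current_word_index(word_starts_ms: list[int], position_ms: int) -> int:
    """Index of the word whose start time is at or before the playhead.

    word_starts_ms must be sorted ascending (it is, by word_index).
    Returns -1 before the first word has started.
    """
    return bisect_right(word_starts_ms, position_ms) - 1
```

Because the lookup is O(log n), it can run on every animation frame without touching the database.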

4.2 Rust Backend

The Rust layer is intentionally thin. It handles:

  1. Process Management — Spawn, monitor, and kill the Python sidecar
  2. IPC Relay — Forward messages between frontend and Python process
  3. File Operations — Read/write project files, manage media
  4. SQLite — All database operations via rusqlite
  5. System Info — Detect GPU, RAM, CPU for hardware recommendations

src-tauri/
  src/
    main.rs                     # Tauri app entry point
    commands/
      project.rs                # CRUD for projects
      transcribe.rs             # Start/stop/monitor transcription
      export.rs                 # Trigger caption/text export
      ai.rs                     # AI provider commands
      settings.rs               # App settings and preferences
      system.rs                 # Hardware detection
    db/
      mod.rs                    # SQLite connection pool
      schema.rs                 # Table definitions and migrations
      queries.rs                # Prepared queries
    sidecar/
      mod.rs                    # Python process lifecycle
      ipc.rs                    # JSON-line protocol handler
      messages.rs               # IPC message types (serde)
    state.rs                    # App state (db handle, sidecar handle)

4.3 Python Sidecar

The Python process runs independently and communicates via JSON-line protocol over stdin/stdout.

python/
  voice_to_notes/
    __init__.py
    main.py                     # Entry point, IPC message loop
    ipc/
      __init__.py
      protocol.py               # JSON-line read/write, message types
      handlers.py               # Route messages to services
    services/
      transcribe.py             # faster-whisper + wav2vec2 pipeline
      diarize.py                # pyannote.audio diarization
      pipeline.py               # Combined transcribe + diarize workflow
      ai_provider.py            # AI provider abstraction
      export.py                 # pysubs2 caption export, text export
    providers/
      __init__.py
      base.py                   # Abstract AI provider interface
      litellm_provider.py       # LiteLLM (multi-provider gateway)
      openai_provider.py        # Direct OpenAI SDK
      anthropic_provider.py     # Direct Anthropic SDK
      ollama_provider.py        # Local Ollama models
    hardware/
      __init__.py
      detect.py                 # GPU/CPU detection, VRAM estimation
      models.py                 # Model selection logic
    utils/
      audio.py                  # Audio format conversion (ffmpeg wrapper)
      progress.py               # Progress reporting via IPC
  tests/
    test_transcribe.py
    test_diarize.py
    test_pipeline.py
    test_providers.py
    test_export.py
  pyproject.toml                # Dependencies and build config
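
The merge step in pipeline.py has to reconcile two independent outputs: word timestamps from transcription and speaker turns from diarization. A common approach, assumed here rather than quoted from the implementation, assigns each word to the speaker segment with the greatest time overlap:

```python
def overlap_ms(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Length of the intersection of two [start, end) intervals, in ms."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(words: list[dict], speaker_segments: list[dict]) -> list[dict]:
    """words: [{'word', 'start_ms', 'end_ms'}];
    speaker_segments: [{'speaker', 'start_ms', 'end_ms'}].
    Returns words annotated with the best-overlapping speaker (None if no overlap)."""
    out = []
    for w in words:
        best, best_ov = None, 0
        for seg in speaker_segments:
            ov = overlap_ms(w["start_ms"], w["end_ms"],
                            seg["start_ms"], seg["end_ms"])
            if ov > best_ov:
                best, best_ov = seg["speaker"], ov
        out.append({**w, "speaker": best})
    return out
```

Consecutive words with the same speaker are then grouped into the speaker-turn segments stored in the database.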

5. IPC Protocol

The Rust backend and Python sidecar communicate via newline-delimited JSON on stdin/stdout. Each message has a type, an optional request ID (for correlating responses), and a payload.

Message Format

{"id": "req-001", "type": "transcribe.start", "payload": {"file": "/path/to/audio.wav", "model": "large-v3-turbo", "device": "cuda", "language": "auto"}}

Message Types

Requests (Rust → Python):

| Type | Payload | Description |
|---|---|---|
| `transcribe.start` | `{file, model, device, language}` | Start transcription |
| `transcribe.cancel` | `{id}` | Cancel running transcription |
| `diarize.start` | `{file, num_speakers?}` | Start speaker diarization |
| `pipeline.start` | `{file, model, device, language, num_speakers?}` | Full transcribe + diarize |
| `ai.chat` | `{provider, model, messages, transcript_context}` | Send AI chat message |
| `ai.summarize` | `{provider, model, transcript, style}` | Generate summary/notes |
| `export.captions` | `{segments, format, options}` | Export caption file |
| `export.text` | `{segments, speakers, format, options}` | Export text document |
| `hardware.detect` | `{}` | Detect available hardware |

Responses (Python → Rust):

| Type | Payload | Description |
|---|---|---|
| `progress` | `{id, percent, stage, message}` | Progress update |
| `transcribe.result` | `{segments: [{text, start_ms, end_ms, words: [...]}]}` | Transcription complete |
| `diarize.result` | `{speakers: [{id, segments: [{start_ms, end_ms}]}]}` | Diarization complete |
| `pipeline.result` | `{segments, speakers, words}` | Full pipeline result |
| `ai.response` | `{content, tokens_used, provider}` | AI response |
| `ai.stream` | `{id, delta, done}` | Streaming AI token |
| `export.done` | `{path}` | Export file written |
| `error` | `{id, code, message}` | Error response |
| `hardware.info` | `{gpu, vram_mb, ram_mb, cpu_cores, recommended_model}` | Hardware info |

Progress Reporting

Long-running operations (transcription, diarization) send periodic progress messages:

{"id": "req-001", "type": "progress", "payload": {"percent": 45, "stage": "transcribing", "message": "Processing segment 23/51..."}}

Stages: loading_model → preprocessing → transcribing → aligning → diarizing → merging → done
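
One detail worth pinning down is how per-stage progress maps onto the single percent field. A sketch, with purely illustrative stage weights (the shipped values would be tuned per pipeline):

```python
# Illustrative stage weights summing to 100 — not the shipped values.
STAGE_WEIGHTS = [
    ("loading_model", 5),
    ("preprocessing", 5),
    ("transcribing", 45),
    ("aligning", 10),
    ("diarizing", 30),
    ("merging", 5),
]

def overall_percent(stage: str, stage_fraction: float) -> int:
    """Combine fully completed stages with fractional progress in the current one."""
    done = 0
    for name, weight in STAGE_WEIGHTS:
        if name == stage:
            return int(done + weight * min(max(stage_fraction, 0.0), 1.0))
        done += weight
    return 100  # "done" (or an unknown stage)
```

Weighting stages this way keeps the progress bar monotonic even though the stages run at very different speeds.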


6. Database Schema

SQLite database stored per-project at {project_dir}/project.db.

-- Projects metadata
CREATE TABLE projects (
    id          TEXT PRIMARY KEY,
    name        TEXT NOT NULL,
    created_at  TEXT NOT NULL,
    updated_at  TEXT NOT NULL,
    settings    TEXT,               -- JSON: project-specific overrides
    status      TEXT DEFAULT 'active'
);

-- Source media files
CREATE TABLE media_files (
    id          TEXT PRIMARY KEY,
    project_id  TEXT NOT NULL REFERENCES projects(id),
    file_path   TEXT NOT NULL,      -- relative to project dir
    file_hash   TEXT,               -- SHA-256 for integrity
    duration_ms INTEGER,
    sample_rate INTEGER,
    channels    INTEGER,
    format      TEXT,
    file_size   INTEGER,
    created_at  TEXT NOT NULL
);

-- Speakers identified in audio
CREATE TABLE speakers (
    id           TEXT PRIMARY KEY,
    project_id   TEXT NOT NULL REFERENCES projects(id),
    label        TEXT NOT NULL,     -- auto-assigned: "Speaker 1"
    display_name TEXT,              -- user-assigned: "Sarah Chen"
    color        TEXT,              -- hex color for UI
    metadata     TEXT               -- JSON: voice embedding ref, notes
);

-- Transcript segments (one per speaker turn)
CREATE TABLE segments (
    id            TEXT PRIMARY KEY,
    project_id    TEXT NOT NULL REFERENCES projects(id),
    media_file_id TEXT NOT NULL REFERENCES media_files(id),
    speaker_id    TEXT REFERENCES speakers(id),
    start_ms      INTEGER NOT NULL,
    end_ms        INTEGER NOT NULL,
    text          TEXT NOT NULL,
    original_text TEXT,             -- pre-edit text preserved
    confidence    REAL,
    is_edited     INTEGER DEFAULT 0,
    edited_at     TEXT,
    segment_index INTEGER NOT NULL
);

-- Word-level timestamps (for click-to-seek and captions)
CREATE TABLE words (
    id         TEXT PRIMARY KEY,
    segment_id TEXT NOT NULL REFERENCES segments(id),
    word       TEXT NOT NULL,
    start_ms   INTEGER NOT NULL,
    end_ms     INTEGER NOT NULL,
    confidence REAL,
    word_index INTEGER NOT NULL
);

-- AI-generated outputs
CREATE TABLE ai_outputs (
    id          TEXT PRIMARY KEY,
    project_id  TEXT NOT NULL REFERENCES projects(id),
    output_type TEXT NOT NULL,      -- summary, action_items, notes, qa
    prompt      TEXT,
    content     TEXT NOT NULL,
    provider    TEXT,
    created_at  TEXT NOT NULL,
    metadata    TEXT                -- JSON: tokens, latency
);

-- User annotations and bookmarks
CREATE TABLE annotations (
    id         TEXT PRIMARY KEY,
    project_id TEXT NOT NULL REFERENCES projects(id),
    start_ms   INTEGER NOT NULL,
    end_ms     INTEGER,
    text       TEXT NOT NULL,
    type       TEXT DEFAULT 'bookmark'
);

-- Performance indexes
CREATE INDEX idx_segments_project ON segments(project_id, segment_index);
CREATE INDEX idx_segments_time ON segments(media_file_id, start_ms);
CREATE INDEX idx_words_segment ON words(segment_id, word_index);
CREATE INDEX idx_words_time ON words(start_ms, end_ms);
CREATE INDEX idx_ai_outputs_project ON ai_outputs(project_id, output_type);
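
The idx_words_time index exists so that playhead-to-word lookup stays cheap. A sketch of that query against the schema above (foreign keys trimmed so the example is self-contained):

```python
import sqlite3

SCHEMA = """
CREATE TABLE words (
    id TEXT PRIMARY KEY, segment_id TEXT NOT NULL,
    word TEXT NOT NULL, start_ms INTEGER NOT NULL,
    end_ms INTEGER NOT NULL, confidence REAL, word_index INTEGER NOT NULL
);
CREATE INDEX idx_words_time ON words(start_ms, end_ms);
"""

def word_at(conn: sqlite3.Connection, position_ms: int):
    """The word under the playhead, or None in a gap between words."""
    return conn.execute(
        "SELECT word, start_ms, end_ms FROM words "
        "WHERE start_ms <= ? AND end_ms > ? "
        "ORDER BY start_ms DESC LIMIT 1",
        (position_ms, position_ms),
    ).fetchone()
```

In the app this query runs through rusqlite rather than Python, but the shape is the same.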

7. AI Provider Architecture

Provider Interface

from abc import ABC, abstractmethod
from collections.abc import AsyncIterator

class AIProvider(ABC):
    @abstractmethod
    async def chat(self, messages: list[dict], config: dict) -> str: ...

    @abstractmethod
    async def stream(self, messages: list[dict], config: dict) -> AsyncIterator[str]: ...

Supported Providers

| Provider | Package | Use Case |
|---|---|---|
| LiteLLM | litellm | Gateway to 100+ providers via unified API |
| OpenAI | openai | Direct OpenAI API (GPT-4o, etc.) |
| Anthropic | anthropic | Direct Anthropic API (Claude) |
| Ollama | HTTP to localhost:11434 | Local models (Llama, Mistral, Phi, etc.) |
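
As a concrete instance, a sketch of the Ollama adapter against its /api/chat endpoint (non-streaming; error handling and the streaming variant omitted). The request body follows Ollama's documented chat format; everything else here is illustrative rather than the actual ollama_provider.py code:

```python
import asyncio
import json
import urllib.request

def build_chat_request(base_url: str, model: str,
                       messages: list[dict]) -> urllib.request.Request:
    """Non-streaming chat request in Ollama's /api/chat shape."""
    body = json.dumps({"model": model, "messages": messages, "stream": False})
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=body.encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

async def ollama_chat(messages: list[dict], config: dict) -> str:
    """Run the blocking HTTP call off the event loop thread."""
    req = build_chat_request(config.get("base_url", "http://localhost:11434"),
                             config["model"], messages)

    def _send() -> str:
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["message"]["content"]

    return await asyncio.to_thread(_send)
```

Separating request construction from I/O keeps the adapter testable without a running Ollama instance.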

Context Window Strategy

| Transcript Length | Strategy |
|---|---|
| < 100K tokens | Send full transcript directly |
| 100K - 200K tokens | Use Claude (200K context) or chunk for smaller models |
| > 200K tokens | Map-reduce: summarize chunks, then combine |
| Q&A mode | Semantic search over chunks, send top-K relevant to model |
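
The routing above reduces to a threshold check on estimated token count. A sketch, where the roughly-4-characters-per-token heuristic and the strategy names are assumptions, not measured values from the implementation:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return len(text) // 4

def pick_strategy(transcript: str, model_context_tokens: int) -> str:
    """Choose how to feed the transcript to the configured model."""
    tokens = estimate_tokens(transcript)
    if tokens < 100_000 and tokens < model_context_tokens:
        return "full"
    if tokens <= 200_000 and model_context_tokens >= 200_000:
        return "full_large_context"
    return "map_reduce"
```

Q&A mode bypasses this entirely: it retrieves top-K chunks by semantic similarity, so the prompt size is bounded regardless of transcript length.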

Configuration

Users configure AI providers in Settings. API keys are stored in the OS credential store (libsecret on Linux, Credential Manager on Windows); local models via Ollama require no keys.

{
  "ai": {
    "default_provider": "ollama",
    "providers": {
      "ollama": { "base_url": "http://localhost:11434", "model": "llama3:8b" },
      "openai": { "model": "gpt-4o" },
      "anthropic": { "model": "claude-sonnet-4-20250514" },
      "litellm": { "model": "gpt-4o" }
    }
  }
}

8. Export Formats

Caption Formats

| Format | Speaker Support | Library |
|---|---|---|
| SRT | `[Speaker]:` prefix convention | pysubs2 |
| WebVTT | Native `<v Speaker>` voice tags | pysubs2 |
| ASS/SSA | Named styles per speaker with colors | pysubs2 |
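
pysubs2 does the heavy lifting in the actual exporter, but the SRT speaker-prefix convention itself is simple enough to show directly. A self-contained sketch of one cue (timestamps in SRT's HH:MM:SS,mmm shape):

```python
def srt_timestamp(ms: int) -> str:
    """Millisecond offset → SRT timestamp, e.g. 3000 → '00:00:03,000'."""
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, millis = divmod(rem, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{millis:03d}"

def srt_cue(index: int, start_ms: int, end_ms: int,
            speaker: str, text: str) -> str:
    """One numbered SRT cue with the [Speaker]: prefix convention."""
    return (f"{index}\n"
            f"{srt_timestamp(start_ms)} --> {srt_timestamp(end_ms)}\n"
            f"[{speaker}]: {text}\n")
```

The prefix convention is needed because SRT, unlike WebVTT and ASS, has no native speaker field.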

Text Formats

| Format | Implementation |
|---|---|
| Plain text (.txt) | Custom formatter |
| Markdown (.md) | Custom formatter (bold speaker names) |
| DOCX | python-docx |

Text Output Example

[00:00:03] Sarah Chen:
Hello everyone, welcome to the meeting. I wanted to start by
discussing the Q3 results before we move on to planning.

[00:00:15] Michael Torres:
Thanks Sarah. The numbers look strong this quarter.
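
A sketch of the custom formatter behind that output, with speaker-name resolution and word wrapping simplified (the wrap width is an illustrative choice):

```python
import textwrap

def format_timestamp(ms: int) -> str:
    """Millisecond offset → '[HH:MM:SS]' as used in the text export."""
    s = ms // 1000
    return f"[{s // 3600:02d}:{(s % 3600) // 60:02d}:{s % 60:02d}]"

def format_transcript(segments: list[dict], width: int = 64) -> str:
    """segments: [{'start_ms', 'speaker', 'text'}], ordered by segment_index."""
    blocks = []
    for seg in segments:
        header = f"{format_timestamp(seg['start_ms'])} {seg['speaker']}:"
        body = textwrap.fill(seg["text"], width=width)
        blocks.append(f"{header}\n{body}")
    return "\n\n".join(blocks)
```

The Markdown variant differs only in the header line (bold speaker name instead of the plain one).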

9. Project File Structure

~/VoiceToNotes/
  config.json                   # Global app settings
  projects/
    {project-uuid}/
      project.db                # SQLite database
      media/
        recording.m4a           # Original media file
      exports/
        transcript.srt
        transcript.vtt
        notes.md

10. Implementation Phases

Phase 1 — Foundation

Set up Tauri + Svelte project scaffold, Python sidecar with IPC protocol, SQLite schema, and basic project management UI.

Deliverables:

  • Tauri app launches with a working Svelte frontend
  • Python sidecar starts, communicates via JSON-line IPC
  • SQLite database created per-project
  • Create/open/list projects in the UI

Phase 2 — Core Transcription

Implement the transcription pipeline with audio playback and synchronized transcript display.

Deliverables:

  • Import audio/video files (ffmpeg conversion to WAV)
  • Run faster-whisper transcription with progress reporting
  • Display transcript with word-level timestamps
  • wavesurfer.js audio player with click-to-seek from transcript
  • Auto-scroll transcript during playback
  • Edit transcript text (corrections persist to DB)

Phase 3 — Speaker Diarization

Add speaker identification and management.

Deliverables:

  • pyannote.audio diarization integrated into pipeline
  • Speaker segments merged with word timestamps
  • Speaker labels displayed in transcript with colors
  • Rename speakers (persists across all segments)
  • Re-assign speaker for selected text segments
  • Hardware detection and model auto-selection (CPU/GPU)

Phase 4 — Export

Implement all export formats.

Deliverables:

  • SRT, WebVTT, ASS caption export with speaker labels
  • Plain text and Markdown export with speaker names
  • Export options panel in UI

Phase 5 — AI Integration

Add AI provider support for Q&A and summarization.

Deliverables:

  • Provider configuration UI with API key management
  • Ollama local model support
  • OpenAI and Anthropic direct SDK support
  • LiteLLM gateway support
  • Chat panel for asking questions about the transcript
  • Summary/notes generation with multiple styles
  • Context window management for long transcripts

Phase 6 — Polish and Packaging

Production readiness.

Deliverables:

  • Linux packaging (.deb, .AppImage)
  • Windows packaging (.msi, .exe installer)
  • Bundled Python environment (no user Python install required)
  • Model download manager (first-run setup)
  • Settings panel (model selection, hardware config, AI providers)
  • Error handling, logging, crash recovery

11. Agent Work Breakdown

For parallel development, the codebase splits into these independent workstreams:

| Agent | Scope | Dependencies |
|---|---|---|
| Agent 1: Tauri + Frontend Shell | Tauri project setup, Svelte scaffold, routing, project manager UI, settings UI | None |
| Agent 2: Python Sidecar + IPC | Python project setup, IPC protocol, message loop, handler routing | None |
| Agent 3: Database Layer | SQLite schema, Rust query layer, migration system | None |
| Agent 4: Transcription Pipeline | faster-whisper integration, wav2vec2 alignment, hardware detection, model management | Agent 2 (IPC) |
| Agent 5: Diarization Pipeline | pyannote.audio integration, speaker-word alignment, combined pipeline | Agent 4 (transcription) |
| Agent 6: Audio Player + Transcript UI | wavesurfer.js integration, TipTap transcript editor, playback-transcript sync | Agent 1 (shell), Agent 3 (DB) |
| Agent 7: Export System | pysubs2 caption export, text formatters, export UI | Agent 2 (IPC), Agent 3 (DB) |
| Agent 8: AI Provider System | Provider abstraction, LiteLLM/OpenAI/Anthropic/Ollama adapters, chat UI | Agent 2 (IPC), Agent 1 (shell) |

Agents 1, 2, and 3 can start immediately in parallel. Agents 4-8 follow once their dependencies are in place.