Add architecture document and project guidelines
Detailed architecture covering Tauri + Svelte frontend, Rust backend, Python sidecar for ML (faster-whisper, pyannote.audio), IPC protocol, SQLite schema, AI provider system, export formats, and phased implementation plan with agent work breakdown. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Voice to Notes — Architecture Document

## 1. Overview

Voice to Notes is a desktop application that transcribes audio/video recordings with speaker identification. It runs entirely on the user's computer. Cloud AI providers are optional and only used when explicitly configured by the user.
```
┌────────────────────────────────────────────────────────────┐
│                     Tauri Application                      │
│                                                            │
│ ┌────────────────────────────────────────────────────────┐ │
│ │                 Frontend (Svelte + TS)                 │ │
│ │                                                        │ │
│ │ ┌─────────────┐ ┌──────────────┐ ┌───────────────────┐ │ │
│ │ │  Waveform   │ │  Transcript  │ │      AI Chat      │ │ │
│ │ │   Player    │ │    Editor    │ │       Panel       │ │ │
│ │ │ (wavesurfer)│ │   (TipTap)   │ │                   │ │ │
│ │ └─────────────┘ └──────────────┘ └───────────────────┘ │ │
│ │ ┌─────────────┐ ┌──────────────┐ ┌───────────────────┐ │ │
│ │ │   Speaker   │ │    Export    │ │      Project      │ │ │
│ │ │   Manager   │ │    Panel     │ │      Manager      │ │ │
│ │ └─────────────┘ └──────────────┘ └───────────────────┘ │ │
│ └───────────────────────────┬────────────────────────────┘ │
│                             │ tauri::invoke()              │
│ ┌───────────────────────────┴────────────────────────────┐ │
│ │               Rust Backend (thin layer)                │ │
│ │                                                        │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ │ │
│ │ │   Process    │ │   File I/O   │ │      SQLite      │ │ │
│ │ │   Manager    │ │   & Media    │ │  (via rusqlite)  │ │ │
│ │ └──────┬───────┘ └──────────────┘ └──────────────────┘ │ │
│ └────────┼───────────────────────────────────────────────┘ │
└──────────┼─────────────────────────────────────────────────┘
           │ JSON-line IPC (stdin/stdout)
           │
┌──────────┴─────────────────────────────────────────────────┐
│                   Python Sidecar Process                   │
│                                                            │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │
│ │  Transcribe  │ │   Diarize    │ │     AI Provider      │ │
│ │   Service    │ │   Service    │ │       Service        │ │
│ │              │ │              │ │                      │ │
│ │faster-whisper│ │   pyannote   │ │ ┌──────────────────┐ │ │
│ │  + wav2vec2  │ │  .audio 4.0  │ │ │ LiteLLM adapter  │ │ │
│ │              │ │              │ │ │ OpenAI adapter   │ │ │
│ │  CPU: auto   │ │  CPU: auto   │ │ │ Anthropic adapter│ │ │
│ │  GPU: CUDA   │ │  GPU: CUDA   │ │ │ Ollama adapter   │ │ │
│ └──────────────┘ └──────────────┘ │ └──────────────────┘ │ │
│                                   └──────────────────────┘ │
└────────────────────────────────────────────────────────────┘
```

---

## 2. Technology Stack

| Layer | Technology | Purpose |
|-------|-----------|---------|
| **Desktop Shell** | Tauri v2 | Window management, OS integration, native packaging |
| **Frontend** | Svelte + TypeScript | UI components, state management |
| **Audio Waveform** | wavesurfer.js | Waveform visualization, click-to-seek playback |
| **Transcript Editor** | TipTap (ProseMirror) | Rich text editing with speaker-colored labels |
| **Backend** | Rust (thin) | Process management, file I/O, SQLite access, IPC relay |
| **Database** | SQLite (via rusqlite) | Project data, transcripts, word timestamps, speaker info |
| **ML Runtime** | Python sidecar | Speech-to-text, diarization, AI provider integration |
| **STT Engine** | faster-whisper | Transcription with word-level timestamps |
| **Timestamp Refinement** | wav2vec2 | Precise word-level alignment |
| **Speaker Diarization** | pyannote.audio 4.0 | Speaker segment detection |
| **AI Providers** | LiteLLM / direct SDKs | Summarization, Q&A, notes |
| **Caption Export** | pysubs2 | SRT, WebVTT, ASS subtitle generation |

---

## 3. CPU / GPU Strategy

All ML components must work on CPU. GPU acceleration is used when available but never required.

### Detection and Selection

```
App Launch
│
├─ Detect hardware (Python: torch.cuda.is_available(), etc.)
│
├─ NVIDIA GPU detected (CUDA)
│   ├─ VRAM >= 8GB → large-v3-turbo (int8), pyannote on GPU
│   ├─ VRAM >= 4GB → medium model (int8), pyannote on GPU
│   └─ VRAM < 4GB  → fall back to CPU
│
├─ No GPU / unsupported GPU
│   ├─ RAM >= 16GB → medium model on CPU, pyannote on CPU
│   ├─ RAM >= 8GB  → small model on CPU, pyannote on CPU
│   └─ RAM < 8GB   → base model on CPU, pyannote on CPU (warn: slow)
│
└─ User can override in Settings
```
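
Expressed as code, this selection tree is a small pure function. A sketch (the function name and return shape are illustrative, not part of the codebase; thresholds follow the tree above):

```python
def recommend_config(cuda: bool, vram_gb: float, ram_gb: float) -> dict:
    """Map detected hardware to an STT model + device, mirroring the decision tree."""
    if cuda and vram_gb >= 8:
        return {"model": "large-v3-turbo", "compute_type": "int8", "device": "cuda"}
    if cuda and vram_gb >= 4:
        return {"model": "medium", "compute_type": "int8", "device": "cuda"}
    # No usable GPU (or too little VRAM): pick a CPU model by available RAM.
    if ram_gb >= 16:
        model = "medium"
    elif ram_gb >= 8:
        model = "small"
    else:
        model = "base"  # warn the user: this will be slow
    return {"model": model, "compute_type": "int8", "device": "cpu"}
```

The user's Settings override simply replaces whatever this function returns.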

### Model Recommendations by Hardware

| Hardware | STT Model | Diarization | Expected Speed |
|----------|-----------|-------------|----------------|
| NVIDIA GPU, 8GB+ VRAM | large-v3-turbo (int8) | pyannote GPU | ~20x realtime |
| NVIDIA GPU, 4GB VRAM | medium (int8) | pyannote GPU | ~10x realtime |
| CPU only, 16GB RAM | medium (int8) | pyannote CPU | ~2-4x realtime |
| CPU only, 8GB RAM | small (int8) | pyannote CPU | ~3-5x realtime |
| CPU only, < 8GB RAM | base | pyannote CPU | ~5-8x realtime |

Users can always override model selection in settings. The app displays estimated processing time before starting.

### CTranslate2 CPU Backends

faster-whisper uses CTranslate2, which supports multiple CPU acceleration backends:

- **Intel MKL** — Best performance on Intel CPUs
- **oneDNN** — Good cross-platform alternative
- **OpenBLAS** — Fallback for any CPU
- **Ruy** — Lightweight option for ARM

The Python sidecar auto-detects and uses the best available backend.

---

## 4. Component Architecture

### 4.1 Frontend (Svelte + TypeScript)

```
src/
  lib/
    components/
      WaveformPlayer.svelte     # wavesurfer.js wrapper, playback controls
      TranscriptEditor.svelte   # TipTap editor with speaker labels
      SpeakerManager.svelte     # Assign names/colors to speakers
      ExportPanel.svelte        # Export format selection and options
      AIChatPanel.svelte        # Chat interface for AI Q&A
      ProjectList.svelte        # Project browser/manager
      SettingsPanel.svelte      # Model selection, AI config, preferences
      ProgressOverlay.svelte    # Transcription progress with cancel
    stores/
      project.ts                # Current project state
      transcript.ts             # Segments, words, speakers
      playback.ts               # Audio position, playing state
      ai.ts                     # AI provider config and chat history
    services/
      tauri-bridge.ts           # Typed wrappers around tauri::invoke
      audio-sync.ts             # Sync playback position ↔ transcript highlight
      export.ts                 # Trigger export via backend
    types/
      transcript.ts             # Segment, Word, Speaker interfaces
      project.ts                # Project, MediaFile interfaces
  routes/
    +page.svelte                # Main workspace
    +layout.svelte              # App shell with sidebar
```

**Key UI interactions:**
- Click a word in the transcript → audio seeks to that word's `start_ms`
- Audio plays → transcript auto-scrolls and highlights current word/segment
- Click speaker label → open rename dialog, changes propagate to all segments
- Drag to select text → option to re-assign speaker for selection

### 4.2 Rust Backend

The Rust layer is intentionally thin. It handles:

1. **Process Management** — Spawn, monitor, and kill the Python sidecar
2. **IPC Relay** — Forward messages between frontend and Python process
3. **File Operations** — Read/write project files, manage media
4. **SQLite** — All database operations via rusqlite
5. **System Info** — Detect GPU, RAM, CPU for hardware recommendations

```
src-tauri/
  src/
    main.rs          # Tauri app entry point
    commands/
      project.rs     # CRUD for projects
      transcribe.rs  # Start/stop/monitor transcription
      export.rs      # Trigger caption/text export
      ai.rs          # AI provider commands
      settings.rs    # App settings and preferences
      system.rs      # Hardware detection
    db/
      mod.rs         # SQLite connection pool
      schema.rs      # Table definitions and migrations
      queries.rs     # Prepared queries
    sidecar/
      mod.rs         # Python process lifecycle
      ipc.rs         # JSON-line protocol handler
      messages.rs    # IPC message types (serde)
    state.rs         # App state (db handle, sidecar handle)
```

### 4.3 Python Sidecar

The Python process runs independently and communicates via a JSON-line protocol over stdin/stdout.

```
python/
  voice_to_notes/
    __init__.py
    main.py            # Entry point, IPC message loop
    ipc/
      __init__.py
      protocol.py      # JSON-line read/write, message types
      handlers.py      # Route messages to services
    services/
      transcribe.py    # faster-whisper + wav2vec2 pipeline
      diarize.py       # pyannote.audio diarization
      pipeline.py      # Combined transcribe + diarize workflow
      ai_provider.py   # AI provider abstraction
      export.py        # pysubs2 caption export, text export
    providers/
      __init__.py
      base.py                # Abstract AI provider interface
      litellm_provider.py    # LiteLLM (multi-provider gateway)
      openai_provider.py     # Direct OpenAI SDK
      anthropic_provider.py  # Direct Anthropic SDK
      ollama_provider.py     # Local Ollama models
    hardware/
      __init__.py
      detect.py        # GPU/CPU detection, VRAM estimation
      models.py        # Model selection logic
    utils/
      audio.py         # Audio format conversion (ffmpeg wrapper)
      progress.py      # Progress reporting via IPC
  tests/
    test_transcribe.py
    test_diarize.py
    test_pipeline.py
    test_providers.py
    test_export.py
  pyproject.toml       # Dependencies and build config
```

---

## 5. IPC Protocol

The Rust backend and Python sidecar communicate via newline-delimited JSON on stdin/stdout. Each message has a type, an optional request ID (for correlating responses), and a payload.

### Message Format

```json
{"id": "req-001", "type": "transcribe.start", "payload": {"file": "/path/to/audio.wav", "model": "large-v3-turbo", "device": "cuda", "language": "auto"}}
```

### Message Types

**Requests (Rust → Python):**

| Type | Payload | Description |
|------|---------|-------------|
| `transcribe.start` | `{file, model, device, language}` | Start transcription |
| `transcribe.cancel` | `{id}` | Cancel running transcription |
| `diarize.start` | `{file, num_speakers?}` | Start speaker diarization |
| `pipeline.start` | `{file, model, device, language, num_speakers?}` | Full transcribe + diarize |
| `ai.chat` | `{provider, model, messages, transcript_context}` | Send AI chat message |
| `ai.summarize` | `{provider, model, transcript, style}` | Generate summary/notes |
| `export.captions` | `{segments, format, options}` | Export caption file |
| `export.text` | `{segments, speakers, format, options}` | Export text document |
| `hardware.detect` | `{}` | Detect available hardware |

**Responses (Python → Rust):**

| Type | Payload | Description |
|------|---------|-------------|
| `progress` | `{id, percent, stage, message}` | Progress update |
| `transcribe.result` | `{segments: [{text, start_ms, end_ms, words: [...]}]}` | Transcription complete |
| `diarize.result` | `{speakers: [{id, segments: [{start_ms, end_ms}]}]}` | Diarization complete |
| `pipeline.result` | `{segments, speakers, words}` | Full pipeline result |
| `ai.response` | `{content, tokens_used, provider}` | AI response |
| `ai.stream` | `{id, delta, done}` | Streaming AI token |
| `export.done` | `{path}` | Export file written |
| `error` | `{id, code, message}` | Error response |
| `hardware.info` | `{gpu, vram_mb, ram_mb, cpu_cores, recommended_model}` | Hardware info |

### Progress Reporting

Long-running operations (transcription, diarization) send periodic progress messages:

```json
{"id": "req-001", "type": "progress", "payload": {"percent": 45, "stage": "transcribing", "message": "Processing segment 23/51..."}}
```

Stages: `loading_model` → `preprocessing` → `transcribing` → `aligning` → `diarizing` → `merging` → `done`
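
The sidecar's half of this protocol is a simple read-dispatch-write loop. A minimal sketch, assuming handlers return a `(response_type, payload)` tuple (the handler signature is illustrative; the real routing lives in `ipc/handlers.py`):

```python
import json
import sys

def send(msg: dict, out=sys.stdout) -> None:
    """Write one newline-delimited JSON message and flush so Rust sees it immediately."""
    out.write(json.dumps(msg) + "\n")
    out.flush()

def serve(handlers: dict, inp=sys.stdin, out=sys.stdout) -> None:
    """Read requests line by line and route them by message type."""
    for line in inp:
        line = line.strip()
        if not line:
            continue
        req = json.loads(line)
        handler = handlers.get(req["type"])
        if handler is None:
            send({"id": req.get("id"), "type": "error",
                  "payload": {"code": "unknown_type", "message": req["type"]}}, out)
            continue
        resp_type, payload = handler(req.get("payload", {}))
        # Echo the request id back so Rust can correlate the response.
        send({"id": req.get("id"), "type": resp_type, "payload": payload}, out)
```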

---

## 6. Database Schema

SQLite database stored per-project at `{project_dir}/project.db`.

```sql
-- Projects metadata
CREATE TABLE projects (
  id TEXT PRIMARY KEY,
  name TEXT NOT NULL,
  created_at TEXT NOT NULL,
  updated_at TEXT NOT NULL,
  settings TEXT,                -- JSON: project-specific overrides
  status TEXT DEFAULT 'active'
);

-- Source media files
CREATE TABLE media_files (
  id TEXT PRIMARY KEY,
  project_id TEXT NOT NULL REFERENCES projects(id),
  file_path TEXT NOT NULL,      -- relative to project dir
  file_hash TEXT,               -- SHA-256 for integrity
  duration_ms INTEGER,
  sample_rate INTEGER,
  channels INTEGER,
  format TEXT,
  file_size INTEGER,
  created_at TEXT NOT NULL
);

-- Speakers identified in audio
CREATE TABLE speakers (
  id TEXT PRIMARY KEY,
  project_id TEXT NOT NULL REFERENCES projects(id),
  label TEXT NOT NULL,          -- auto-assigned: "Speaker 1"
  display_name TEXT,            -- user-assigned: "Sarah Chen"
  color TEXT,                   -- hex color for UI
  metadata TEXT                 -- JSON: voice embedding ref, notes
);

-- Transcript segments (one per speaker turn)
CREATE TABLE segments (
  id TEXT PRIMARY KEY,
  project_id TEXT NOT NULL REFERENCES projects(id),
  media_file_id TEXT NOT NULL REFERENCES media_files(id),
  speaker_id TEXT REFERENCES speakers(id),
  start_ms INTEGER NOT NULL,
  end_ms INTEGER NOT NULL,
  text TEXT NOT NULL,
  original_text TEXT,           -- pre-edit text preserved
  confidence REAL,
  is_edited INTEGER DEFAULT 0,
  edited_at TEXT,
  segment_index INTEGER NOT NULL
);

-- Word-level timestamps (for click-to-seek and captions)
CREATE TABLE words (
  id TEXT PRIMARY KEY,
  segment_id TEXT NOT NULL REFERENCES segments(id),
  word TEXT NOT NULL,
  start_ms INTEGER NOT NULL,
  end_ms INTEGER NOT NULL,
  confidence REAL,
  word_index INTEGER NOT NULL
);

-- AI-generated outputs
CREATE TABLE ai_outputs (
  id TEXT PRIMARY KEY,
  project_id TEXT NOT NULL REFERENCES projects(id),
  output_type TEXT NOT NULL,    -- summary, action_items, notes, qa
  prompt TEXT,
  content TEXT NOT NULL,
  provider TEXT,
  created_at TEXT NOT NULL,
  metadata TEXT                 -- JSON: tokens, latency
);

-- User annotations and bookmarks
CREATE TABLE annotations (
  id TEXT PRIMARY KEY,
  project_id TEXT NOT NULL REFERENCES projects(id),
  start_ms INTEGER NOT NULL,
  end_ms INTEGER,
  text TEXT NOT NULL,
  type TEXT DEFAULT 'bookmark'
);

-- Performance indexes
CREATE INDEX idx_segments_project ON segments(project_id, segment_index);
CREATE INDEX idx_segments_time ON segments(media_file_id, start_ms);
CREATE INDEX idx_words_segment ON words(segment_id, word_index);
CREATE INDEX idx_words_time ON words(start_ms, end_ms);
CREATE INDEX idx_ai_outputs_project ON ai_outputs(project_id, output_type);
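
The `idx_words_time` index exists to answer "which word is under the playhead?" queries quickly. A self-contained sketch against a trimmed subset of the schema (in-memory here; the real app goes through rusqlite):

```python
import sqlite3

# In-memory database with just the words table from the schema above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE words (
  id TEXT PRIMARY KEY,
  segment_id TEXT NOT NULL,
  word TEXT NOT NULL,
  start_ms INTEGER NOT NULL,
  end_ms INTEGER NOT NULL,
  word_index INTEGER NOT NULL
);
CREATE INDEX idx_words_time ON words(start_ms, end_ms);
""")
conn.executemany("INSERT INTO words VALUES (?, ?, ?, ?, ?, ?)", [
    ("w1", "s1", "Hello", 0, 400, 0),
    ("w2", "s1", "everyone", 450, 900, 1),
])

def word_under_playhead(conn: sqlite3.Connection, pos_ms: int):
    """Return (word, start_ms) for the word spanning pos_ms, or None in a gap."""
    return conn.execute(
        "SELECT word, start_ms FROM words "
        "WHERE start_ms <= ? AND end_ms > ? "
        "ORDER BY start_ms DESC LIMIT 1",
        (pos_ms, pos_ms),
    ).fetchone()
```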

---

## 7. AI Provider Architecture

### Provider Interface

```python
from abc import ABC, abstractmethod
from typing import AsyncIterator

class AIProvider(ABC):
    @abstractmethod
    async def chat(self, messages: list[dict], config: dict) -> str: ...

    @abstractmethod
    async def stream(self, messages: list[dict], config: dict) -> AsyncIterator[str]: ...
```

### Supported Providers

| Provider | Package | Use Case |
|----------|---------|----------|
| **LiteLLM** | `litellm` | Gateway to 100+ providers via unified API |
| **OpenAI** | `openai` | Direct OpenAI API (GPT-4o, etc.) |
| **Anthropic** | `anthropic` | Direct Anthropic API (Claude) |
| **Ollama** | HTTP to localhost:11434 | Local models (Llama, Mistral, Phi, etc.) |
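
The Ollama adapter needs no SDK at all. A synchronous sketch (the interface above is async; this strips that away to show the essentials) using Ollama's `/api/chat` endpoint, with request/response shapes following Ollama's REST API:

```python
import json
import urllib.request

class OllamaProvider:
    """Minimal non-streaming chat against a local Ollama server (no API key)."""

    def __init__(self, base_url: str = "http://localhost:11434", model: str = "llama3:8b"):
        self.base_url = base_url
        self.model = model

    def build_request(self, messages: list[dict]) -> urllib.request.Request:
        # Ollama's /api/chat takes OpenAI-style {role, content} messages.
        body = json.dumps({"model": self.model, "messages": messages,
                           "stream": False}).encode()
        return urllib.request.Request(
            self.base_url + "/api/chat", data=body,
            headers={"Content-Type": "application/json"})

    def chat(self, messages: list[dict]) -> str:
        with urllib.request.urlopen(self.build_request(messages)) as resp:
            return json.loads(resp.read())["message"]["content"]
```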

### Context Window Strategy

| Transcript Length | Strategy |
|-------------------|----------|
| < 100K tokens | Send full transcript directly |
| 100K - 200K tokens | Use Claude (200K context) or chunk for smaller models |
| > 200K tokens | Map-reduce: summarize chunks, then combine |
| Q&A mode | Semantic search over chunks, send top-K relevant to model |
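
The map-reduce and Q&A rows both depend on splitting the transcript on segment boundaries. A sketch of greedy chunking, using the rough 4-characters-per-token heuristic as a stand-in for a real tokenizer (the heuristic and function name are assumptions, not existing code):

```python
def chunk_transcript(segments: list[dict], max_tokens: int = 8000) -> list[list[dict]]:
    """Group segments into chunks that each fit a model's context budget.

    Splits only on segment (speaker-turn) boundaries so no turn is cut mid-sentence."""
    budget = max_tokens * 4  # ~4 chars per token; swap in a tokenizer for accuracy
    chunks: list[list[dict]] = []
    current: list[dict] = []
    used = 0
    for seg in segments:
        cost = len(seg["text"])
        if current and used + cost > budget:
            chunks.append(current)
            current, used = [], 0
        current.append(seg)
        used += cost
    if current:
        chunks.append(current)
    return chunks
```

Each chunk is then summarized independently and the partial summaries combined in a final pass.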

### Configuration

Users configure AI providers in Settings. API keys are stored in the OS keychain (libsecret on Linux, Credential Manager on Windows). Local models (Ollama) require no keys.

```json
{
  "ai": {
    "default_provider": "ollama",
    "providers": {
      "ollama": { "base_url": "http://localhost:11434", "model": "llama3:8b" },
      "openai": { "model": "gpt-4o" },
      "anthropic": { "model": "claude-sonnet-4-20250514" },
      "litellm": { "model": "gpt-4o" }
    }
  }
}
```

---

## 8. Export Formats

### Caption Formats

| Format | Speaker Support | Library |
|--------|----------------|---------|
| **SRT** | `[Speaker]:` prefix convention | pysubs2 |
| **WebVTT** | Native `<v Speaker>` voice tags | pysubs2 |
| **ASS/SSA** | Named styles per speaker with colors | pysubs2 |

### Text Formats

| Format | Implementation |
|--------|---------------|
| **Plain text (.txt)** | Custom formatter |
| **Markdown (.md)** | Custom formatter (bold speaker names) |
| **DOCX** | python-docx |

### Text Output Example

```
[00:00:03] Sarah Chen:
Hello everyone, welcome to the meeting. I wanted to start by
discussing the Q3 results before we move on to planning.

[00:00:15] Michael Torres:
Thanks Sarah. The numbers look strong this quarter.
```
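
A sketch of the custom formatter that produces this style, one speaker turn at a time (names and wrap width are illustrative choices, not fixed by the design):

```python
import textwrap

def fmt_timestamp(ms: int) -> str:
    """Milliseconds → the [HH:MM:SS] prefix used in the example above."""
    s = ms // 1000
    return f"[{s // 3600:02}:{s % 3600 // 60:02}:{s % 60:02}]"

def format_turn(seg: dict, name: str, width: int = 64) -> str:
    """One speaker turn: timestamped header line, then wrapped body text."""
    body = textwrap.fill(seg["text"], width=width)
    return f"{fmt_timestamp(seg['start_ms'])} {name}:\n{body}\n"
```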

---

## 9. Project File Structure

```
~/VoiceToNotes/
  config.json               # Global app settings
  projects/
    {project-uuid}/
      project.db            # SQLite database
      media/
        recording.m4a       # Original media file
      exports/
        transcript.srt
        transcript.vtt
        notes.md
```

---

## 10. Implementation Phases

### Phase 1 — Foundation
Set up the Tauri + Svelte project scaffold, the Python sidecar with its IPC protocol, the SQLite schema, and basic project management UI.

**Deliverables:**
- Tauri app launches with a working Svelte frontend
- Python sidecar starts and communicates via JSON-line IPC
- SQLite database created per project
- Create/open/list projects in the UI

### Phase 2 — Core Transcription
Implement the transcription pipeline with audio playback and synchronized transcript display.

**Deliverables:**
- Import audio/video files (ffmpeg conversion to WAV)
- Run faster-whisper transcription with progress reporting
- Display transcript with word-level timestamps
- wavesurfer.js audio player with click-to-seek from transcript
- Auto-scroll transcript during playback
- Edit transcript text (corrections persist to DB)

### Phase 3 — Speaker Diarization
Add speaker identification and management.

**Deliverables:**
- pyannote.audio diarization integrated into the pipeline
- Speaker segments merged with word timestamps
- Speaker labels displayed in transcript with colors
- Rename speakers (persists across all segments)
- Re-assign speaker for selected text segments
- Hardware detection and model auto-selection (CPU/GPU)

### Phase 4 — Export
Implement all export formats.

**Deliverables:**
- SRT, WebVTT, ASS caption export with speaker labels
- Plain text and Markdown export with speaker names
- Export options panel in UI

### Phase 5 — AI Integration
Add AI provider support for Q&A and summarization.

**Deliverables:**
- Provider configuration UI with API key management
- Ollama local model support
- OpenAI and Anthropic direct SDK support
- LiteLLM gateway support
- Chat panel for asking questions about the transcript
- Summary/notes generation with multiple styles
- Context window management for long transcripts

### Phase 6 — Polish and Packaging
Production readiness.

**Deliverables:**
- Linux packaging (.deb, .AppImage)
- Windows packaging (.msi, .exe installer)
- Bundled Python environment (no user Python install required)
- Model download manager (first-run setup)
- Settings panel (model selection, hardware config, AI providers)
- Error handling, logging, crash recovery

---

## 11. Agent Work Breakdown

For parallel development, the codebase splits into these independent workstreams:

| Agent | Scope | Dependencies |
|-------|-------|-------------|
| **Agent 1: Tauri + Frontend Shell** | Tauri project setup, Svelte scaffold, routing, project manager UI, settings UI | None |
| **Agent 2: Python Sidecar + IPC** | Python project setup, IPC protocol, message loop, handler routing | None |
| **Agent 3: Database Layer** | SQLite schema, Rust query layer, migration system | None |
| **Agent 4: Transcription Pipeline** | faster-whisper integration, wav2vec2 alignment, hardware detection, model management | Agent 2 (IPC) |
| **Agent 5: Diarization Pipeline** | pyannote.audio integration, speaker-word alignment, combined pipeline | Agent 4 (transcription) |
| **Agent 6: Audio Player + Transcript UI** | wavesurfer.js integration, TipTap transcript editor, playback-transcript sync | Agent 1 (shell), Agent 3 (DB) |
| **Agent 7: Export System** | pysubs2 caption export, text formatters, export UI | Agent 2 (IPC), Agent 3 (DB) |
| **Agent 8: AI Provider System** | Provider abstraction, LiteLLM/OpenAI/Anthropic/Ollama adapters, chat UI | Agent 2 (IPC), Agent 1 (shell) |

Agents 1, 2, and 3 can start immediately in parallel. Agents 4-8 follow once their dependencies are in place.