diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 0000000..b248506 --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,43 @@ +# Voice to Notes — Project Guidelines + +## Project Overview +Desktop app for transcribing audio/video with speaker identification. Runs locally on the user's computer. See `docs/ARCHITECTURE.md` for the full architecture. + +## Tech Stack +- **Desktop shell:** Tauri v2 (Rust backend + Svelte/TypeScript frontend) + +- **ML pipeline:** Python sidecar process (faster-whisper, pyannote.audio, wav2vec2) +- **Database:** SQLite (via rusqlite in Rust) +- **AI providers:** LiteLLM, OpenAI, Anthropic, Ollama (local) +- **Caption export:** pysubs2 (Python) +- **Audio UI:** wavesurfer.js +- **Transcript editor:** TipTap (ProseMirror) + +## Key Architecture Decisions +- Python sidecar communicates with Rust via JSON-line IPC (stdin/stdout) +- All ML models must work on CPU. GPU (CUDA) acceleration is optional. +- AI cloud providers are optional. Local models (Ollama) are a first-class option. +- SQLite database is per-project, stored alongside media files. +- Word-level timestamps are required for click-to-seek playback sync. 
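To make the IPC and timestamp conventions concrete, here is a minimal, hypothetical framing sketch — the payload values are illustrative, and the real message vocabulary lives in `docs/ARCHITECTURE.md`:

```python
import json

# Hypothetical request illustrating the JSON-line framing:
# one JSON object per line, with "id", "type", and "payload" fields,
# and every timestamp expressed as an integer millisecond offset.
request = {
    "id": "req-001",
    "type": "transcribe.start",
    "payload": {"file": "/path/to/audio.wav", "model": "small", "device": "cpu"},
}
frame = json.dumps(request) + "\n"  # the trailing newline terminates the message

# The receiving side parses one line back into a message dict.
decoded = json.loads(frame)
```

Keeping the framing this simple is what lets the Rust layer relay messages between the frontend and the sidecar without understanding their payloads.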
+ +## Directory Structure +``` +src/ # Svelte frontend source +src-tauri/ # Rust backend source +python/ # Python sidecar source + voice_to_notes/ # Python package + tests/ # Python tests +docs/ # Architecture and design documents +``` + +## Conventions +- Rust: follow standard Rust conventions, use `cargo fmt` and `cargo clippy` +- Python: Python 3.11+, use type hints, follow PEP 8, use `ruff` for linting +- TypeScript: strict mode, prefer Svelte stores for state management +- IPC messages: JSON-line format, each message has `id`, `type`, `payload` fields +- Database: UUIDs as primary keys (TEXT type in SQLite) +- All timestamps in milliseconds (integer) relative to media file start + +## Platform Targets +- Linux (primary development target) +- Windows (must work, tested before release) +- macOS (future, not yet targeted) diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md new file mode 100644 index 0000000..dc49a05 --- /dev/null +++ b/docs/ARCHITECTURE.md @@ -0,0 +1,568 @@ +# Voice to Notes — Architecture Document + +## 1. Overview + +Voice to Notes is a desktop application that transcribes audio/video recordings with speaker identification. It runs entirely on the user's computer. Cloud AI providers are optional and only used when explicitly configured by the user. 
+ +``` +┌─────────────────────────────────────────────────────────────────┐ +│ Tauri Application │ +│ │ +│ ┌───────────────────────────────────────────────────────────┐ │ +│ │ Frontend (Svelte + TS) │ │ +│ │ │ │ +│ │ ┌─────────────┐ ┌──────────────┐ ┌───────────────────┐ │ │ +│ │ │ Waveform │ │ Transcript │ │ AI Chat │ │ │ +│ │ │ Player │ │ Editor │ │ Panel │ │ │ +│ │ │ (wavesurfer) │ │ (TipTap) │ │ │ │ │ +│ │ └─────────────┘ └──────────────┘ └───────────────────┘ │ │ +│ │ ┌─────────────┐ ┌──────────────┐ ┌───────────────────┐ │ │ +│ │ │ Speaker │ │ Export │ │ Project │ │ │ +│ │ │ Manager │ │ Panel │ │ Manager │ │ │ +│ │ └─────────────┘ └──────────────┘ └───────────────────┘ │ │ +│ └──────────────────────────┬────────────────────────────────┘ │ +│ │ tauri::invoke() │ +│ ┌──────────────────────────┴────────────────────────────────┐ │ +│ │ Rust Backend (thin layer) │ │ +│ │ │ │ +│ │ ┌──────────────┐ ┌──────────────┐ ┌───────────────────┐ │ │ +│ │ │ Process │ │ File I/O │ │ SQLite │ │ │ +│ │ │ Manager │ │ & Media │ │ (via rusqlite) │ │ │ +│ │ └──────┬───────┘ └──────────────┘ └───────────────────┘ │ │ +│ └─────────┼─────────────────────────────────────────────────┘ │ +└────────────┼────────────────────────────────────────────────────┘ + │ JSON-line IPC (stdin/stdout) + │ +┌────────────┴────────────────────────────────────────────────────┐ +│ Python Sidecar Process │ +│ │ +│ ┌──────────────┐ ┌──────────────┐ ┌────────────────────────┐ │ +│ │ Transcribe │ │ Diarize │ │ AI Provider │ │ +│ │ Service │ │ Service │ │ Service │ │ +│ │ │ │ │ │ │ │ +│ │ faster-whisper│ │ pyannote │ │ ┌──────────────────┐ │ │ +│ │ + wav2vec2 │ │ .audio 4.0 │ │ │ LiteLLM adapter │ │ │ +│ │ │ │ │ │ │ OpenAI adapter │ │ │ +│ │ CPU: auto │ │ CPU: auto │ │ │ Anthropic adapter │ │ │ +│ │ GPU: CUDA │ │ GPU: CUDA │ │ │ Ollama adapter │ │ │ +│ └──────────────┘ └──────────────┘ │ └──────────────────┘ │ │ +│ └────────────────────────┘ │ +└──────────────────────────────────────────────────────────────────┘ 
+``` + +--- + +## 2. Technology Stack + +| Layer | Technology | Purpose | +|-------|-----------|---------| +| **Desktop Shell** | Tauri v2 | Window management, OS integration, native packaging | +| **Frontend** | Svelte + TypeScript | UI components, state management | +| **Audio Waveform** | wavesurfer.js | Waveform visualization, click-to-seek playback | +| **Transcript Editor** | TipTap (ProseMirror) | Rich text editing with speaker-colored labels | +| **Backend** | Rust (thin) | Process management, file I/O, SQLite access, IPC relay | +| **Database** | SQLite (via rusqlite) | Project data, transcripts, word timestamps, speaker info | +| **ML Runtime** | Python sidecar | Speech-to-text, diarization, AI provider integration | +| **STT Engine** | faster-whisper | Transcription with word-level timestamps | +| **Timestamp Refinement** | wav2vec2 | Precise word-level alignment | +| **Speaker Diarization** | pyannote.audio 4.0 | Speaker segment detection | +| **AI Providers** | LiteLLM / direct SDKs | Summarization, Q&A, notes | +| **Caption Export** | pysubs2 | SRT, WebVTT, ASS subtitle generation | + +--- + +## 3. CPU / GPU Strategy + +All ML components must work on CPU. GPU acceleration is used when available but never required. + +### Detection and Selection + +``` +App Launch + │ + ├─ Detect hardware (Python: torch.cuda.is_available(), etc.) 
+ │ + ├─ NVIDIA GPU detected (CUDA) + │ ├─ VRAM >= 8GB → large-v3-turbo (int8), pyannote on GPU + │ ├─ VRAM >= 4GB → medium model (int8), pyannote on GPU + │ └─ VRAM < 4GB → fall back to CPU + │ + ├─ No GPU / unsupported GPU + │ ├─ RAM >= 16GB → medium model on CPU, pyannote on CPU + │ ├─ RAM >= 8GB → small model on CPU, pyannote on CPU + │ └─ RAM < 8GB → base model on CPU, pyannote on CPU (warn: slow) + │ + └─ User can override in Settings +``` + +### Model Recommendations by Hardware + +| Hardware | STT Model | Diarization | Expected Speed | +|----------|-----------|-------------|----------------| +| NVIDIA GPU, 8GB+ VRAM | large-v3-turbo (int8) | pyannote GPU | ~20x realtime | +| NVIDIA GPU, 4GB VRAM | medium (int8) | pyannote GPU | ~10x realtime | +| CPU only, 16GB RAM | medium (int8_cpu) | pyannote CPU | ~2-4x realtime | +| CPU only, 8GB RAM | small (int8_cpu) | pyannote CPU | ~3-5x realtime | +| CPU only, minimal | base | pyannote CPU | ~5-8x realtime | + +Users can always override model selection in settings. The app displays estimated processing time before starting. + +### CTranslate2 CPU Backends + +faster-whisper uses CTranslate2, which supports multiple CPU acceleration backends: +- **Intel MKL** — Best performance on Intel CPUs +- **oneDNN** — Good cross-platform alternative +- **OpenBLAS** — Fallback for any CPU +- **Ruy** — Lightweight option for ARM + +The Python sidecar auto-detects and uses the best available backend. + +--- + +## 4. 
Component Architecture + +### 4.1 Frontend (Svelte + TypeScript) + +``` +src/ + lib/ + components/ + WaveformPlayer.svelte # wavesurfer.js wrapper, playback controls + TranscriptEditor.svelte # TipTap editor with speaker labels + SpeakerManager.svelte # Assign names/colors to speakers + ExportPanel.svelte # Export format selection and options + AIChatPanel.svelte # Chat interface for AI Q&A + ProjectList.svelte # Project browser/manager + SettingsPanel.svelte # Model selection, AI config, preferences + ProgressOverlay.svelte # Transcription progress with cancel + stores/ + project.ts # Current project state + transcript.ts # Segments, words, speakers + playback.ts # Audio position, playing state + ai.ts # AI provider config and chat history + services/ + tauri-bridge.ts # Typed wrappers around tauri::invoke + audio-sync.ts # Sync playback position ↔ transcript highlight + export.ts # Trigger export via backend + types/ + transcript.ts # Segment, Word, Speaker interfaces + project.ts # Project, MediaFile interfaces + routes/ + +page.svelte # Main workspace + +layout.svelte # App shell with sidebar +``` + +**Key UI interactions:** +- Click a word in the transcript → audio seeks to that word's `start_ms` +- Audio plays → transcript auto-scrolls and highlights current word/segment +- Click speaker label → open rename dialog, changes propagate to all segments +- Drag to select text → option to re-assign speaker for selection + +### 4.2 Rust Backend + +The Rust layer is intentionally thin. It handles: + +1. **Process Management** — Spawn, monitor, and kill the Python sidecar +2. **IPC Relay** — Forward messages between frontend and Python process +3. **File Operations** — Read/write project files, manage media +4. **SQLite** — All database operations via rusqlite +5. 
**System Info** — Detect GPU, RAM, CPU for hardware recommendations + +``` +src-tauri/ + src/ + main.rs # Tauri app entry point + commands/ + project.rs # CRUD for projects + transcribe.rs # Start/stop/monitor transcription + export.rs # Trigger caption/text export + ai.rs # AI provider commands + settings.rs # App settings and preferences + system.rs # Hardware detection + db/ + mod.rs # SQLite connection pool + schema.rs # Table definitions and migrations + queries.rs # Prepared queries + sidecar/ + mod.rs # Python process lifecycle + ipc.rs # JSON-line protocol handler + messages.rs # IPC message types (serde) + state.rs # App state (db handle, sidecar handle) +``` + +### 4.3 Python Sidecar + +The Python process runs independently and communicates via JSON-line protocol over stdin/stdout. + +``` +python/ + voice_to_notes/ + __init__.py + main.py # Entry point, IPC message loop + ipc/ + __init__.py + protocol.py # JSON-line read/write, message types + handlers.py # Route messages to services + services/ + transcribe.py # faster-whisper + wav2vec2 pipeline + diarize.py # pyannote.audio diarization + pipeline.py # Combined transcribe + diarize workflow + ai_provider.py # AI provider abstraction + export.py # pysubs2 caption export, text export + providers/ + __init__.py + base.py # Abstract AI provider interface + litellm_provider.py # LiteLLM (multi-provider gateway) + openai_provider.py # Direct OpenAI SDK + anthropic_provider.py # Direct Anthropic SDK + ollama_provider.py # Local Ollama models + hardware/ + __init__.py + detect.py # GPU/CPU detection, VRAM estimation + models.py # Model selection logic + utils/ + audio.py # Audio format conversion (ffmpeg wrapper) + progress.py # Progress reporting via IPC + tests/ + test_transcribe.py + test_diarize.py + test_pipeline.py + test_providers.py + test_export.py + pyproject.toml # Dependencies and build config +``` + +--- + +## 5. 
IPC Protocol + +The Rust backend and Python sidecar communicate via newline-delimited JSON on stdin/stdout. Each message has a type, an optional request ID (for correlating responses), and a payload. + +### Message Format + +```json +{"id": "req-001", "type": "transcribe.start", "payload": {"file": "/path/to/audio.wav", "model": "large-v3-turbo", "device": "cuda", "language": "auto"}} +``` + +### Message Types + +**Requests (Rust → Python):** + +| Type | Payload | Description | +|------|---------|-------------| +| `transcribe.start` | `{file, model, device, language}` | Start transcription | +| `transcribe.cancel` | `{id}` | Cancel running transcription | +| `diarize.start` | `{file, num_speakers?}` | Start speaker diarization | +| `pipeline.start` | `{file, model, device, language, num_speakers?}` | Full transcribe + diarize | +| `ai.chat` | `{provider, model, messages, transcript_context}` | Send AI chat message | +| `ai.summarize` | `{provider, model, transcript, style}` | Generate summary/notes | +| `export.captions` | `{segments, format, options}` | Export caption file | +| `export.text` | `{segments, speakers, format, options}` | Export text document | +| `hardware.detect` | `{}` | Detect available hardware | + +**Responses (Python → Rust):** + +| Type | Payload | Description | +|------|---------|-------------| +| `progress` | `{id, percent, stage, message}` | Progress update | +| `transcribe.result` | `{segments: [{text, start_ms, end_ms, words: [...]}]}` | Transcription complete | +| `diarize.result` | `{speakers: [{id, segments: [{start_ms, end_ms}]}]}` | Diarization complete | +| `pipeline.result` | `{segments, speakers, words}` | Full pipeline result | +| `ai.response` | `{content, tokens_used, provider}` | AI response | +| `ai.stream` | `{id, delta, done}` | Streaming AI token | +| `export.done` | `{path}` | Export file written | +| `error` | `{id, code, message}` | Error response | +| `hardware.info` | `{gpu, vram_mb, ram_mb, cpu_cores, 
recommended_model}` | Hardware info | + +### Progress Reporting + +Long-running operations (transcription, diarization) send periodic progress messages: + +```json +{"id": "req-001", "type": "progress", "payload": {"percent": 45, "stage": "transcribing", "message": "Processing segment 23/51..."}} +``` + +Stages: `loading_model` → `preprocessing` → `transcribing` → `aligning` → `diarizing` → `merging` → `done` + +--- + +## 6. Database Schema + +SQLite database stored per-project at `{project_dir}/project.db`. + +```sql +-- Projects metadata +CREATE TABLE projects ( + id TEXT PRIMARY KEY, + name TEXT NOT NULL, + created_at TEXT NOT NULL, + updated_at TEXT NOT NULL, + settings TEXT, -- JSON: project-specific overrides + status TEXT DEFAULT 'active' +); + +-- Source media files +CREATE TABLE media_files ( + id TEXT PRIMARY KEY, + project_id TEXT NOT NULL REFERENCES projects(id), + file_path TEXT NOT NULL, -- relative to project dir + file_hash TEXT, -- SHA-256 for integrity + duration_ms INTEGER, + sample_rate INTEGER, + channels INTEGER, + format TEXT, + file_size INTEGER, + created_at TEXT NOT NULL +); + +-- Speakers identified in audio +CREATE TABLE speakers ( + id TEXT PRIMARY KEY, + project_id TEXT NOT NULL REFERENCES projects(id), + label TEXT NOT NULL, -- auto-assigned: "Speaker 1" + display_name TEXT, -- user-assigned: "Sarah Chen" + color TEXT, -- hex color for UI + metadata TEXT -- JSON: voice embedding ref, notes +); + +-- Transcript segments (one per speaker turn) +CREATE TABLE segments ( + id TEXT PRIMARY KEY, + project_id TEXT NOT NULL REFERENCES projects(id), + media_file_id TEXT NOT NULL REFERENCES media_files(id), + speaker_id TEXT REFERENCES speakers(id), + start_ms INTEGER NOT NULL, + end_ms INTEGER NOT NULL, + text TEXT NOT NULL, + original_text TEXT, -- pre-edit text preserved + confidence REAL, + is_edited INTEGER DEFAULT 0, + edited_at TEXT, + segment_index INTEGER NOT NULL +); + +-- Word-level timestamps (for click-to-seek and captions) +CREATE 
TABLE words ( + id TEXT PRIMARY KEY, + segment_id TEXT NOT NULL REFERENCES segments(id), + word TEXT NOT NULL, + start_ms INTEGER NOT NULL, + end_ms INTEGER NOT NULL, + confidence REAL, + word_index INTEGER NOT NULL +); + +-- AI-generated outputs +CREATE TABLE ai_outputs ( + id TEXT PRIMARY KEY, + project_id TEXT NOT NULL REFERENCES projects(id), + output_type TEXT NOT NULL, -- summary, action_items, notes, qa + prompt TEXT, + content TEXT NOT NULL, + provider TEXT, + created_at TEXT NOT NULL, + metadata TEXT -- JSON: tokens, latency +); + +-- User annotations and bookmarks +CREATE TABLE annotations ( + id TEXT PRIMARY KEY, + project_id TEXT NOT NULL REFERENCES projects(id), + start_ms INTEGER NOT NULL, + end_ms INTEGER, + text TEXT NOT NULL, + type TEXT DEFAULT 'bookmark' +); + +-- Performance indexes +CREATE INDEX idx_segments_project ON segments(project_id, segment_index); +CREATE INDEX idx_segments_time ON segments(media_file_id, start_ms); +CREATE INDEX idx_words_segment ON words(segment_id, word_index); +CREATE INDEX idx_words_time ON words(start_ms, end_ms); +CREATE INDEX idx_ai_outputs_project ON ai_outputs(project_id, output_type); +``` + +--- + +## 7. AI Provider Architecture + +### Provider Interface + +```python +from abc import ABC, abstractmethod +from collections.abc import AsyncIterator + +class AIProvider(ABC): + @abstractmethod + async def chat(self, messages: list[dict], config: dict) -> str: ... + + @abstractmethod + async def stream(self, messages: list[dict], config: dict) -> AsyncIterator[str]: ... +``` + +### Supported Providers + +| Provider | Package | Use Case | +|----------|---------|----------| +| **LiteLLM** | `litellm` | Gateway to 100+ providers via unified API | +| **OpenAI** | `openai` | Direct OpenAI API (GPT-4o, etc.) | +| **Anthropic** | `anthropic` | Direct Anthropic API (Claude) | +| **Ollama** | HTTP to localhost:11434 | Local models (Llama, Mistral, Phi, etc.) 
| + +### Context Window Strategy + +| Transcript Length | Strategy | +|-------------------|----------| +| < 100K tokens | Send full transcript directly | +| 100K - 200K tokens | Use Claude (200K context) or chunk for smaller models | +| > 200K tokens | Map-reduce: summarize chunks, then combine | +| Q&A mode | Semantic search over chunks, send top-K relevant to model | + +### Configuration + +Users configure AI providers in Settings. API keys are stored in the OS keychain (libsecret on Linux, Windows Credential Manager). Local models (Ollama) require no keys. + +```json +{ + "ai": { + "default_provider": "ollama", + "providers": { + "ollama": { "base_url": "http://localhost:11434", "model": "llama3:8b" }, + "openai": { "model": "gpt-4o" }, + "anthropic": { "model": "claude-sonnet-4-20250514" }, + "litellm": { "model": "gpt-4o" } + } + } +} +``` + +--- + +## 8. Export Formats + +### Caption Formats + +| Format | Speaker Support | Library | +|--------|----------------|---------| +| **SRT** | `[Speaker]:` prefix convention | pysubs2 | +| **WebVTT** | Native `<v Speaker>` voice tags | pysubs2 | +| **ASS/SSA** | Named styles per speaker with colors | pysubs2 | + +### Text Formats + +| Format | Implementation | +|--------|---------------| +| **Plain text (.txt)** | Custom formatter | +| **Markdown (.md)** | Custom formatter (bold speaker names) | +| **DOCX** | python-docx | + +### Text Output Example + +``` +[00:00:03] Sarah Chen: +Hello everyone, welcome to the meeting. I wanted to start by +discussing the Q3 results before we move on to planning. + +[00:00:15] Michael Torres: +Thanks Sarah. The numbers look strong this quarter. +``` + +--- + +## 9. Project File Structure + +``` +~/VoiceToNotes/ + config.json # Global app settings + projects/ + {project-uuid}/ + project.db # SQLite database + media/ + recording.m4a # Original media file + exports/ + transcript.srt + transcript.vtt + notes.md +``` + +--- + +## 10. 
Implementation Phases + +### Phase 1 — Foundation +Set up Tauri + Svelte project scaffold, Python sidecar with IPC protocol, SQLite schema, and basic project management UI. + +**Deliverables:** +- Tauri app launches with a working Svelte frontend +- Python sidecar starts, communicates via JSON-line IPC +- SQLite database created per-project +- Create/open/list projects in the UI + +### Phase 2 — Core Transcription +Implement the transcription pipeline with audio playback and synchronized transcript display. + +**Deliverables:** +- Import audio/video files (ffmpeg conversion to WAV) +- Run faster-whisper transcription with progress reporting +- Display transcript with word-level timestamps +- wavesurfer.js audio player with click-to-seek from transcript +- Auto-scroll transcript during playback +- Edit transcript text (corrections persist to DB) + +### Phase 3 — Speaker Diarization +Add speaker identification and management. + +**Deliverables:** +- pyannote.audio diarization integrated into pipeline +- Speaker segments merged with word timestamps +- Speaker labels displayed in transcript with colors +- Rename speakers (persists across all segments) +- Re-assign speaker for selected text segments +- Hardware detection and model auto-selection (CPU/GPU) + +### Phase 4 — Export +Implement all export formats. + +**Deliverables:** +- SRT, WebVTT, ASS caption export with speaker labels +- Plain text and Markdown export with speaker names +- Export options panel in UI + +### Phase 5 — AI Integration +Add AI provider support for Q&A and summarization. + +**Deliverables:** +- Provider configuration UI with API key management +- Ollama local model support +- OpenAI and Anthropic direct SDK support +- LiteLLM gateway support +- Chat panel for asking questions about the transcript +- Summary/notes generation with multiple styles +- Context window management for long transcripts + +### Phase 6 — Polish and Packaging +Production readiness. 
+ +**Deliverables:** +- Linux packaging (.deb, .AppImage) +- Windows packaging (.msi, .exe installer) +- Bundled Python environment (no user Python install required) +- Model download manager (first-run setup) +- Settings panel (model selection, hardware config, AI providers) +- Error handling, logging, crash recovery + +--- + +## 11. Agent Work Breakdown + +For parallel development, the codebase splits into these independent workstreams: + +| Agent | Scope | Dependencies | +|-------|-------|-------------| +| **Agent 1: Tauri + Frontend Shell** | Tauri project setup, Svelte scaffold, routing, project manager UI, settings UI | None | +| **Agent 2: Python Sidecar + IPC** | Python project setup, IPC protocol, message loop, handler routing | None | +| **Agent 3: Database Layer** | SQLite schema, Rust query layer, migration system | None | +| **Agent 4: Transcription Pipeline** | faster-whisper integration, wav2vec2 alignment, hardware detection, model management | Agent 2 (IPC) | +| **Agent 5: Diarization Pipeline** | pyannote.audio integration, speaker-word alignment, combined pipeline | Agent 4 (transcription) | +| **Agent 6: Audio Player + Transcript UI** | wavesurfer.js integration, TipTap transcript editor, playback-transcript sync | Agent 1 (shell), Agent 3 (DB) | +| **Agent 7: Export System** | pysubs2 caption export, text formatters, export UI | Agent 2 (IPC), Agent 3 (DB) | +| **Agent 8: AI Provider System** | Provider abstraction, LiteLLM/OpenAI/Anthropic/Ollama adapters, chat UI | Agent 2 (IPC), Agent 1 (shell) | + +Agents 1, 2, and 3 can start immediately in parallel. Agents 4-8 follow once their dependencies are in place.
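As a concrete example of one cross-agent seam above, Agent 5's speaker-word alignment (the `merging` pipeline stage) reduces to giving each word the speaker whose diarization segment overlaps it most. A minimal sketch, assuming the word and speaker shapes from the IPC result payloads (`assign_speakers` is an illustrative name, not an existing function):

```python
def assign_speakers(words: list[dict], speakers: list[dict]) -> list[dict]:
    """Tag each word with the speaker whose segment overlaps it most.

    words:    [{"word": str, "start_ms": int, "end_ms": int}, ...]
    speakers: [{"id": str, "segments": [{"start_ms": int, "end_ms": int}]}, ...]
    Returns copies of the words with a "speaker_id" key (None if no overlap).
    """
    tagged = []
    for w in words:
        best_id, best_overlap = None, 0
        for sp in speakers:
            for seg in sp["segments"]:
                # Overlap in ms between the word span and the speaker segment;
                # negative means the two spans are disjoint.
                overlap = min(w["end_ms"], seg["end_ms"]) - max(w["start_ms"], seg["start_ms"])
                if overlap > best_overlap:
                    best_id, best_overlap = sp["id"], overlap
        tagged.append({**w, "speaker_id": best_id})
    return tagged
```

Words that fall in diarization gaps come back with `speaker_id = None`; whether the UI leaves those unlabeled or snaps them to the nearest neighbor is a policy decision this sketch deliberately leaves open.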