Add architecture document and project guidelines
Detailed architecture covering Tauri + Svelte frontend, Rust backend, Python sidecar for ML (faster-whisper, pyannote.audio), IPC protocol, SQLite schema, AI provider system, export formats, and phased implementation plan with agent work breakdown. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Voice to Notes — Architecture Document

## 1. Overview

Voice to Notes is a desktop application that transcribes audio/video recordings with speaker identification. It runs entirely on the user's computer. Cloud AI providers are optional and only used when explicitly configured by the user.
```
┌────────────────────────────────────────────────────────────┐
│                     Tauri Application                      │
│                                                            │
│ ┌────────────────────────────────────────────────────────┐ │
│ │                 Frontend (Svelte + TS)                 │ │
│ │                                                        │ │
│ │ ┌─────────────┐ ┌──────────────┐ ┌───────────────────┐ │ │
│ │ │  Waveform   │ │  Transcript  │ │      AI Chat      │ │ │
│ │ │   Player    │ │    Editor    │ │       Panel       │ │ │
│ │ │ (wavesurfer)│ │   (TipTap)   │ │                   │ │ │
│ │ └─────────────┘ └──────────────┘ └───────────────────┘ │ │
│ │ ┌─────────────┐ ┌──────────────┐ ┌───────────────────┐ │ │
│ │ │   Speaker   │ │    Export    │ │      Project      │ │ │
│ │ │   Manager   │ │    Panel     │ │      Manager      │ │ │
│ │ └─────────────┘ └──────────────┘ └───────────────────┘ │ │
│ └───────────────────────────┬────────────────────────────┘ │
│                             │ tauri::invoke()              │
│ ┌───────────────────────────┴────────────────────────────┐ │
│ │               Rust Backend (thin layer)                │ │
│ │                                                        │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ │ │
│ │ │   Process    │ │   File I/O   │ │      SQLite      │ │ │
│ │ │   Manager    │ │   & Media    │ │  (via rusqlite)  │ │ │
│ │ └──────┬───────┘ └──────────────┘ └──────────────────┘ │ │
│ └────────┼───────────────────────────────────────────────┘ │
└──────────┼─────────────────────────────────────────────────┘
           │ JSON-line IPC (stdin/stdout)
           │
┌──────────┴─────────────────────────────────────────────────┐
│                   Python Sidecar Process                   │
│                                                            │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │
│ │  Transcribe  │ │   Diarize    │ │     AI Provider      │ │
│ │   Service    │ │   Service    │ │       Service        │ │
│ │              │ │              │ │                      │ │
│ │faster-whisper│ │   pyannote   │ │ ┌──────────────────┐ │ │
│ │  + wav2vec2  │ │  .audio 4.0  │ │ │ LiteLLM adapter  │ │ │
│ │              │ │              │ │ │ OpenAI adapter   │ │ │
│ │  CPU: auto   │ │  CPU: auto   │ │ │ Anthropic adapter│ │ │
│ │  GPU: CUDA   │ │  GPU: CUDA   │ │ │ Ollama adapter   │ │ │
│ └──────────────┘ └──────────────┘ │ └──────────────────┘ │ │
│                                   └──────────────────────┘ │
└────────────────────────────────────────────────────────────┘
```

---

## 2. Technology Stack

| Layer | Technology | Purpose |
|-------|-----------|---------|
| **Desktop Shell** | Tauri v2 | Window management, OS integration, native packaging |
| **Frontend** | Svelte + TypeScript | UI components, state management |
| **Audio Waveform** | wavesurfer.js | Waveform visualization, click-to-seek playback |
| **Transcript Editor** | TipTap (ProseMirror) | Rich text editing with speaker-colored labels |
| **Backend** | Rust (thin) | Process management, file I/O, SQLite access, IPC relay |
| **Database** | SQLite (via rusqlite) | Project data, transcripts, word timestamps, speaker info |
| **ML Runtime** | Python sidecar | Speech-to-text, diarization, AI provider integration |
| **STT Engine** | faster-whisper | Transcription with word-level timestamps |
| **Timestamp Refinement** | wav2vec2 | Precise word-level alignment |
| **Speaker Diarization** | pyannote.audio 4.0 | Speaker segment detection |
| **AI Providers** | LiteLLM / direct SDKs | Summarization, Q&A, notes |
| **Caption Export** | pysubs2 | SRT, WebVTT, ASS subtitle generation |

---

## 3. CPU / GPU Strategy

All ML components must work on CPU. GPU acceleration is used when available but never required.

### Detection and Selection

```
App Launch
│
├─ Detect hardware (Python: torch.cuda.is_available(), etc.)
│
├─ NVIDIA GPU detected (CUDA)
│   ├─ VRAM >= 8GB → large-v3-turbo (int8), pyannote on GPU
│   ├─ VRAM >= 4GB → medium model (int8), pyannote on GPU
│   └─ VRAM < 4GB  → fall back to CPU
│
├─ No GPU / unsupported GPU
│   ├─ RAM >= 16GB → medium model on CPU, pyannote on CPU
│   ├─ RAM >= 8GB  → small model on CPU, pyannote on CPU
│   └─ RAM < 8GB   → base model on CPU, pyannote on CPU (warn: slow)
│
└─ User can override in Settings
```
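
Expressed as code, this selection tree is a small pure function. A sketch (the function name and return shape are illustrative, not part of the codebase; thresholds follow the tree above):

```python
def recommend_config(cuda: bool, vram_gb: float, ram_gb: float) -> dict:
    """Map detected hardware to an STT model + device, mirroring the decision tree."""
    if cuda and vram_gb >= 8:
        return {"model": "large-v3-turbo", "compute_type": "int8", "device": "cuda"}
    if cuda and vram_gb >= 4:
        return {"model": "medium", "compute_type": "int8", "device": "cuda"}
    # No usable GPU (or too little VRAM): pick a CPU model by available RAM.
    if ram_gb >= 16:
        model = "medium"
    elif ram_gb >= 8:
        model = "small"
    else:
        model = "base"  # warn the user: this will be slow
    return {"model": model, "compute_type": "int8", "device": "cpu"}
```

The user's Settings override simply replaces whatever this function returns.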

### Model Recommendations by Hardware

| Hardware | STT Model | Diarization | Expected Speed |
|----------|-----------|-------------|----------------|
| NVIDIA GPU, 8GB+ VRAM | large-v3-turbo (int8) | pyannote GPU | ~20x realtime |
| NVIDIA GPU, 4GB VRAM | medium (int8) | pyannote GPU | ~10x realtime |
| CPU only, 16GB RAM | medium (int8) | pyannote CPU | ~2-4x realtime |
| CPU only, 8GB RAM | small (int8) | pyannote CPU | ~3-5x realtime |
| CPU only, < 8GB RAM | base | pyannote CPU | ~5-8x realtime |

Users can always override model selection in settings. The app displays estimated processing time before starting.

### CTranslate2 CPU Backends

faster-whisper uses CTranslate2, which supports multiple CPU acceleration backends:

- **Intel MKL** — Best performance on Intel CPUs
- **oneDNN** — Good cross-platform alternative
- **OpenBLAS** — Fallback for any CPU
- **Ruy** — Lightweight option for ARM

The Python sidecar auto-detects and uses the best available backend.

---

## 4. Component Architecture

### 4.1 Frontend (Svelte + TypeScript)

```
src/
  lib/
    components/
      WaveformPlayer.svelte     # wavesurfer.js wrapper, playback controls
      TranscriptEditor.svelte   # TipTap editor with speaker labels
      SpeakerManager.svelte     # Assign names/colors to speakers
      ExportPanel.svelte        # Export format selection and options
      AIChatPanel.svelte        # Chat interface for AI Q&A
      ProjectList.svelte        # Project browser/manager
      SettingsPanel.svelte      # Model selection, AI config, preferences
      ProgressOverlay.svelte    # Transcription progress with cancel
    stores/
      project.ts                # Current project state
      transcript.ts             # Segments, words, speakers
      playback.ts               # Audio position, playing state
      ai.ts                     # AI provider config and chat history
    services/
      tauri-bridge.ts           # Typed wrappers around tauri::invoke
      audio-sync.ts             # Sync playback position ↔ transcript highlight
      export.ts                 # Trigger export via backend
    types/
      transcript.ts             # Segment, Word, Speaker interfaces
      project.ts                # Project, MediaFile interfaces
  routes/
    +page.svelte                # Main workspace
    +layout.svelte              # App shell with sidebar
```

**Key UI interactions:**
- Click a word in the transcript → audio seeks to that word's `start_ms`
- Audio plays → transcript auto-scrolls and highlights current word/segment
- Click speaker label → open rename dialog, changes propagate to all segments
- Drag to select text → option to re-assign speaker for selection

### 4.2 Rust Backend

The Rust layer is intentionally thin. It handles:

1. **Process Management** — Spawn, monitor, and kill the Python sidecar
2. **IPC Relay** — Forward messages between frontend and Python process
3. **File Operations** — Read/write project files, manage media
4. **SQLite** — All database operations via rusqlite
5. **System Info** — Detect GPU, RAM, CPU for hardware recommendations

```
src-tauri/
  src/
    main.rs          # Tauri app entry point
    commands/
      project.rs     # CRUD for projects
      transcribe.rs  # Start/stop/monitor transcription
      export.rs      # Trigger caption/text export
      ai.rs          # AI provider commands
      settings.rs    # App settings and preferences
      system.rs      # Hardware detection
    db/
      mod.rs         # SQLite connection pool
      schema.rs      # Table definitions and migrations
      queries.rs     # Prepared queries
    sidecar/
      mod.rs         # Python process lifecycle
      ipc.rs         # JSON-line protocol handler
      messages.rs    # IPC message types (serde)
    state.rs         # App state (db handle, sidecar handle)
```

### 4.3 Python Sidecar

The Python process runs independently and communicates via a JSON-line protocol over stdin/stdout.

```
python/
  voice_to_notes/
    __init__.py
    main.py            # Entry point, IPC message loop
    ipc/
      __init__.py
      protocol.py      # JSON-line read/write, message types
      handlers.py      # Route messages to services
    services/
      transcribe.py    # faster-whisper + wav2vec2 pipeline
      diarize.py       # pyannote.audio diarization
      pipeline.py      # Combined transcribe + diarize workflow
      ai_provider.py   # AI provider abstraction
      export.py        # pysubs2 caption export, text export
    providers/
      __init__.py
      base.py                # Abstract AI provider interface
      litellm_provider.py    # LiteLLM (multi-provider gateway)
      openai_provider.py     # Direct OpenAI SDK
      anthropic_provider.py  # Direct Anthropic SDK
      ollama_provider.py     # Local Ollama models
    hardware/
      __init__.py
      detect.py        # GPU/CPU detection, VRAM estimation
      models.py        # Model selection logic
    utils/
      audio.py         # Audio format conversion (ffmpeg wrapper)
      progress.py      # Progress reporting via IPC
  tests/
    test_transcribe.py
    test_diarize.py
    test_pipeline.py
    test_providers.py
    test_export.py
  pyproject.toml       # Dependencies and build config
```

---

## 5. IPC Protocol

The Rust backend and Python sidecar communicate via newline-delimited JSON on stdin/stdout. Each message has a type, an optional request ID (for correlating responses), and a payload.

### Message Format

```json
{"id": "req-001", "type": "transcribe.start", "payload": {"file": "/path/to/audio.wav", "model": "large-v3-turbo", "device": "cuda", "language": "auto"}}
```

### Message Types

**Requests (Rust → Python):**

| Type | Payload | Description |
|------|---------|-------------|
| `transcribe.start` | `{file, model, device, language}` | Start transcription |
| `transcribe.cancel` | `{id}` | Cancel running transcription |
| `diarize.start` | `{file, num_speakers?}` | Start speaker diarization |
| `pipeline.start` | `{file, model, device, language, num_speakers?}` | Full transcribe + diarize |
| `ai.chat` | `{provider, model, messages, transcript_context}` | Send AI chat message |
| `ai.summarize` | `{provider, model, transcript, style}` | Generate summary/notes |
| `export.captions` | `{segments, format, options}` | Export caption file |
| `export.text` | `{segments, speakers, format, options}` | Export text document |
| `hardware.detect` | `{}` | Detect available hardware |

**Responses (Python → Rust):**

| Type | Payload | Description |
|------|---------|-------------|
| `progress` | `{id, percent, stage, message}` | Progress update |
| `transcribe.result` | `{segments: [{text, start_ms, end_ms, words: [...]}]}` | Transcription complete |
| `diarize.result` | `{speakers: [{id, segments: [{start_ms, end_ms}]}]}` | Diarization complete |
| `pipeline.result` | `{segments, speakers, words}` | Full pipeline result |
| `ai.response` | `{content, tokens_used, provider}` | AI response |
| `ai.stream` | `{id, delta, done}` | Streaming AI token |
| `export.done` | `{path}` | Export file written |
| `error` | `{id, code, message}` | Error response |
| `hardware.info` | `{gpu, vram_mb, ram_mb, cpu_cores, recommended_model}` | Hardware info |

### Progress Reporting

Long-running operations (transcription, diarization) send periodic progress messages:

```json
{"id": "req-001", "type": "progress", "payload": {"percent": 45, "stage": "transcribing", "message": "Processing segment 23/51..."}}
```

Stages: `loading_model` → `preprocessing` → `transcribing` → `aligning` → `diarizing` → `merging` → `done`
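
The sidecar's half of this protocol is a simple read-dispatch-write loop. A minimal sketch, assuming handlers return a `(response_type, payload)` tuple (the handler signature is illustrative; the real routing lives in `ipc/handlers.py`):

```python
import json
import sys

def send(msg: dict, out=sys.stdout) -> None:
    """Write one newline-delimited JSON message and flush so Rust sees it immediately."""
    out.write(json.dumps(msg) + "\n")
    out.flush()

def serve(handlers: dict, inp=sys.stdin, out=sys.stdout) -> None:
    """Read requests line by line and route them by message type."""
    for line in inp:
        line = line.strip()
        if not line:
            continue
        req = json.loads(line)
        handler = handlers.get(req["type"])
        if handler is None:
            send({"id": req.get("id"), "type": "error",
                  "payload": {"code": "unknown_type", "message": req["type"]}}, out)
            continue
        resp_type, payload = handler(req.get("payload", {}))
        # Echo the request id back so Rust can correlate the response.
        send({"id": req.get("id"), "type": resp_type, "payload": payload}, out)
```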

---

## 6. Database Schema

SQLite database stored per-project at `{project_dir}/project.db`.

```sql
-- Projects metadata
CREATE TABLE projects (
  id TEXT PRIMARY KEY,
  name TEXT NOT NULL,
  created_at TEXT NOT NULL,
  updated_at TEXT NOT NULL,
  settings TEXT,                -- JSON: project-specific overrides
  status TEXT DEFAULT 'active'
);

-- Source media files
CREATE TABLE media_files (
  id TEXT PRIMARY KEY,
  project_id TEXT NOT NULL REFERENCES projects(id),
  file_path TEXT NOT NULL,      -- relative to project dir
  file_hash TEXT,               -- SHA-256 for integrity
  duration_ms INTEGER,
  sample_rate INTEGER,
  channels INTEGER,
  format TEXT,
  file_size INTEGER,
  created_at TEXT NOT NULL
);

-- Speakers identified in audio
CREATE TABLE speakers (
  id TEXT PRIMARY KEY,
  project_id TEXT NOT NULL REFERENCES projects(id),
  label TEXT NOT NULL,          -- auto-assigned: "Speaker 1"
  display_name TEXT,            -- user-assigned: "Sarah Chen"
  color TEXT,                   -- hex color for UI
  metadata TEXT                 -- JSON: voice embedding ref, notes
);

-- Transcript segments (one per speaker turn)
CREATE TABLE segments (
  id TEXT PRIMARY KEY,
  project_id TEXT NOT NULL REFERENCES projects(id),
  media_file_id TEXT NOT NULL REFERENCES media_files(id),
  speaker_id TEXT REFERENCES speakers(id),
  start_ms INTEGER NOT NULL,
  end_ms INTEGER NOT NULL,
  text TEXT NOT NULL,
  original_text TEXT,           -- pre-edit text preserved
  confidence REAL,
  is_edited INTEGER DEFAULT 0,
  edited_at TEXT,
  segment_index INTEGER NOT NULL
);

-- Word-level timestamps (for click-to-seek and captions)
CREATE TABLE words (
  id TEXT PRIMARY KEY,
  segment_id TEXT NOT NULL REFERENCES segments(id),
  word TEXT NOT NULL,
  start_ms INTEGER NOT NULL,
  end_ms INTEGER NOT NULL,
  confidence REAL,
  word_index INTEGER NOT NULL
);

-- AI-generated outputs
CREATE TABLE ai_outputs (
  id TEXT PRIMARY KEY,
  project_id TEXT NOT NULL REFERENCES projects(id),
  output_type TEXT NOT NULL,    -- summary, action_items, notes, qa
  prompt TEXT,
  content TEXT NOT NULL,
  provider TEXT,
  created_at TEXT NOT NULL,
  metadata TEXT                 -- JSON: tokens, latency
);

-- User annotations and bookmarks
CREATE TABLE annotations (
  id TEXT PRIMARY KEY,
  project_id TEXT NOT NULL REFERENCES projects(id),
  start_ms INTEGER NOT NULL,
  end_ms INTEGER,
  text TEXT NOT NULL,
  type TEXT DEFAULT 'bookmark'
);

-- Performance indexes
CREATE INDEX idx_segments_project ON segments(project_id, segment_index);
CREATE INDEX idx_segments_time ON segments(media_file_id, start_ms);
CREATE INDEX idx_words_segment ON words(segment_id, word_index);
CREATE INDEX idx_words_time ON words(start_ms, end_ms);
CREATE INDEX idx_ai_outputs_project ON ai_outputs(project_id, output_type);
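
The `idx_words_time` index exists to answer "which word is under the playhead?" queries quickly. A self-contained sketch against a trimmed subset of the schema (in-memory here; the real app goes through rusqlite):

```python
import sqlite3

# In-memory database with just the words table from the schema above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE words (
  id TEXT PRIMARY KEY,
  segment_id TEXT NOT NULL,
  word TEXT NOT NULL,
  start_ms INTEGER NOT NULL,
  end_ms INTEGER NOT NULL,
  word_index INTEGER NOT NULL
);
CREATE INDEX idx_words_time ON words(start_ms, end_ms);
""")
conn.executemany("INSERT INTO words VALUES (?, ?, ?, ?, ?, ?)", [
    ("w1", "s1", "Hello", 0, 400, 0),
    ("w2", "s1", "everyone", 450, 900, 1),
])

def word_under_playhead(conn: sqlite3.Connection, pos_ms: int):
    """Return (word, start_ms) for the word spanning pos_ms, or None in a gap."""
    return conn.execute(
        "SELECT word, start_ms FROM words "
        "WHERE start_ms <= ? AND end_ms > ? "
        "ORDER BY start_ms DESC LIMIT 1",
        (pos_ms, pos_ms),
    ).fetchone()
```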

---

## 7. AI Provider Architecture

### Provider Interface

```python
from abc import ABC, abstractmethod
from typing import AsyncIterator

class AIProvider(ABC):
    @abstractmethod
    async def chat(self, messages: list[dict], config: dict) -> str: ...

    @abstractmethod
    async def stream(self, messages: list[dict], config: dict) -> AsyncIterator[str]: ...
```

### Supported Providers

| Provider | Package | Use Case |
|----------|---------|----------|
| **LiteLLM** | `litellm` | Gateway to 100+ providers via unified API |
| **OpenAI** | `openai` | Direct OpenAI API (GPT-4o, etc.) |
| **Anthropic** | `anthropic` | Direct Anthropic API (Claude) |
| **Ollama** | HTTP to localhost:11434 | Local models (Llama, Mistral, Phi, etc.) |
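
The Ollama adapter needs no SDK at all. A synchronous sketch (the interface above is async; this strips that away to show the essentials) using Ollama's `/api/chat` endpoint, with request/response shapes following Ollama's REST API:

```python
import json
import urllib.request

class OllamaProvider:
    """Minimal non-streaming chat against a local Ollama server (no API key)."""

    def __init__(self, base_url: str = "http://localhost:11434", model: str = "llama3:8b"):
        self.base_url = base_url
        self.model = model

    def build_request(self, messages: list[dict]) -> urllib.request.Request:
        # Ollama's /api/chat takes OpenAI-style {role, content} messages.
        body = json.dumps({"model": self.model, "messages": messages,
                           "stream": False}).encode()
        return urllib.request.Request(
            self.base_url + "/api/chat", data=body,
            headers={"Content-Type": "application/json"})

    def chat(self, messages: list[dict]) -> str:
        with urllib.request.urlopen(self.build_request(messages)) as resp:
            return json.loads(resp.read())["message"]["content"]
```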

### Context Window Strategy

| Transcript Length | Strategy |
|-------------------|----------|
| < 100K tokens | Send full transcript directly |
| 100K - 200K tokens | Use Claude (200K context) or chunk for smaller models |
| > 200K tokens | Map-reduce: summarize chunks, then combine |
| Q&A mode | Semantic search over chunks, send top-K relevant to model |
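
The map-reduce and Q&A rows both depend on splitting the transcript on segment boundaries. A sketch of greedy chunking, using the rough 4-characters-per-token heuristic as a stand-in for a real tokenizer (the heuristic and function name are assumptions, not existing code):

```python
def chunk_transcript(segments: list[dict], max_tokens: int = 8000) -> list[list[dict]]:
    """Group segments into chunks that each fit a model's context budget.

    Splits only on segment (speaker-turn) boundaries so no turn is cut mid-sentence."""
    budget = max_tokens * 4  # ~4 chars per token; swap in a tokenizer for accuracy
    chunks: list[list[dict]] = []
    current: list[dict] = []
    used = 0
    for seg in segments:
        cost = len(seg["text"])
        if current and used + cost > budget:
            chunks.append(current)
            current, used = [], 0
        current.append(seg)
        used += cost
    if current:
        chunks.append(current)
    return chunks
```

Each chunk is then summarized independently and the partial summaries combined in a final pass.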

### Configuration

Users configure AI providers in Settings. API keys are stored in the OS keychain (libsecret on Linux, Credential Manager on Windows). Local models (Ollama) require no keys.

```json
{
  "ai": {
    "default_provider": "ollama",
    "providers": {
      "ollama": { "base_url": "http://localhost:11434", "model": "llama3:8b" },
      "openai": { "model": "gpt-4o" },
      "anthropic": { "model": "claude-sonnet-4-20250514" },
      "litellm": { "model": "gpt-4o" }
    }
  }
}
```

---

## 8. Export Formats

### Caption Formats

| Format | Speaker Support | Library |
|--------|----------------|---------|
| **SRT** | `[Speaker]:` prefix convention | pysubs2 |
| **WebVTT** | Native `<v Speaker>` voice tags | pysubs2 |
| **ASS/SSA** | Named styles per speaker with colors | pysubs2 |

### Text Formats

| Format | Implementation |
|--------|---------------|
| **Plain text (.txt)** | Custom formatter |
| **Markdown (.md)** | Custom formatter (bold speaker names) |
| **DOCX** | python-docx |

### Text Output Example

```
[00:00:03] Sarah Chen:
Hello everyone, welcome to the meeting. I wanted to start by
discussing the Q3 results before we move on to planning.

[00:00:15] Michael Torres:
Thanks Sarah. The numbers look strong this quarter.
```
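
A sketch of the custom formatter that produces this style, one speaker turn at a time (names and wrap width are illustrative choices, not fixed by the design):

```python
import textwrap

def fmt_timestamp(ms: int) -> str:
    """Milliseconds → the [HH:MM:SS] prefix used in the example above."""
    s = ms // 1000
    return f"[{s // 3600:02}:{s % 3600 // 60:02}:{s % 60:02}]"

def format_turn(seg: dict, name: str, width: int = 64) -> str:
    """One speaker turn: timestamped header line, then wrapped body text."""
    body = textwrap.fill(seg["text"], width=width)
    return f"{fmt_timestamp(seg['start_ms'])} {name}:\n{body}\n"
```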

---

## 9. Project File Structure

```
~/VoiceToNotes/
  config.json               # Global app settings
  projects/
    {project-uuid}/
      project.db            # SQLite database
      media/
        recording.m4a       # Original media file
      exports/
        transcript.srt
        transcript.vtt
        notes.md
```

---

## 10. Implementation Phases

### Phase 1 — Foundation
Set up the Tauri + Svelte project scaffold, the Python sidecar with its IPC protocol, the SQLite schema, and basic project management UI.

**Deliverables:**
- Tauri app launches with a working Svelte frontend
- Python sidecar starts and communicates via JSON-line IPC
- SQLite database created per project
- Create/open/list projects in the UI

### Phase 2 — Core Transcription
Implement the transcription pipeline with audio playback and synchronized transcript display.

**Deliverables:**
- Import audio/video files (ffmpeg conversion to WAV)
- Run faster-whisper transcription with progress reporting
- Display transcript with word-level timestamps
- wavesurfer.js audio player with click-to-seek from transcript
- Auto-scroll transcript during playback
- Edit transcript text (corrections persist to DB)

### Phase 3 — Speaker Diarization
Add speaker identification and management.

**Deliverables:**
- pyannote.audio diarization integrated into the pipeline
- Speaker segments merged with word timestamps
- Speaker labels displayed in transcript with colors
- Rename speakers (persists across all segments)
- Re-assign speaker for selected text segments
- Hardware detection and model auto-selection (CPU/GPU)

### Phase 4 — Export
Implement all export formats.

**Deliverables:**
- SRT, WebVTT, ASS caption export with speaker labels
- Plain text and Markdown export with speaker names
- Export options panel in UI

### Phase 5 — AI Integration
Add AI provider support for Q&A and summarization.

**Deliverables:**
- Provider configuration UI with API key management
- Ollama local model support
- OpenAI and Anthropic direct SDK support
- LiteLLM gateway support
- Chat panel for asking questions about the transcript
- Summary/notes generation with multiple styles
- Context window management for long transcripts

### Phase 6 — Polish and Packaging
Production readiness.

**Deliverables:**
- Linux packaging (.deb, .AppImage)
- Windows packaging (.msi, .exe installer)
- Bundled Python environment (no user Python install required)
- Model download manager (first-run setup)
- Settings panel (model selection, hardware config, AI providers)
- Error handling, logging, crash recovery

---

## 11. Agent Work Breakdown

For parallel development, the codebase splits into these independent workstreams:

| Agent | Scope | Dependencies |
|-------|-------|-------------|
| **Agent 1: Tauri + Frontend Shell** | Tauri project setup, Svelte scaffold, routing, project manager UI, settings UI | None |
| **Agent 2: Python Sidecar + IPC** | Python project setup, IPC protocol, message loop, handler routing | None |
| **Agent 3: Database Layer** | SQLite schema, Rust query layer, migration system | None |
| **Agent 4: Transcription Pipeline** | faster-whisper integration, wav2vec2 alignment, hardware detection, model management | Agent 2 (IPC) |
| **Agent 5: Diarization Pipeline** | pyannote.audio integration, speaker-word alignment, combined pipeline | Agent 4 (transcription) |
| **Agent 6: Audio Player + Transcript UI** | wavesurfer.js integration, TipTap transcript editor, playback-transcript sync | Agent 1 (shell), Agent 3 (DB) |
| **Agent 7: Export System** | pysubs2 caption export, text formatters, export UI | Agent 2 (IPC), Agent 3 (DB) |
| **Agent 8: AI Provider System** | Provider abstraction, LiteLLM/OpenAI/Anthropic/Ollama adapters, chat UI | Agent 2 (IPC), Agent 1 (shell) |

Agents 1, 2, and 3 can start immediately in parallel. Agents 4-8 follow once their dependencies are in place.