Switch local AI from Ollama to bundled llama-server, add MIT license

- Replace Ollama dependency with bundled llama-server (llama.cpp)
  so users need no separate install for local AI inference
- Rust backend manages llama-server lifecycle (spawn, port, shutdown)
- Add MIT license for open source release
- Update architecture doc, CLAUDE.md, and README accordingly

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 09:00:47 -08:00
parent 0edb06a913
commit c450ef3c0c
4 changed files with 61 additions and 13 deletions


@@ -162,11 +162,12 @@ src/
The Rust layer is intentionally thin. It handles:
1. **Process Management** — Spawn, monitor, and kill the Python sidecar
1. **Process Management** — Spawn, monitor, and kill the Python sidecar and llama-server
2. **IPC Relay** — Forward messages between frontend and Python process
3. **File Operations** — Read/write project files, manage media
4. **SQLite** — All database operations via rusqlite
5. **System Info** — Detect GPU, RAM, CPU for hardware recommendations
6. **llama-server Lifecycle** — Start/stop bundled llama-server, manage port allocation
```
src-tauri/
@@ -179,6 +180,7 @@ src-tauri/
ai.rs # AI provider commands
settings.rs # App settings and preferences
system.rs # Hardware detection
llama_server.rs # llama-server process lifecycle
db/
mod.rs # SQLite connection pool
schema.rs # Table definitions and migrations
@@ -215,7 +217,7 @@ python/
litellm_provider.py # LiteLLM (multi-provider gateway)
openai_provider.py # Direct OpenAI SDK
anthropic_provider.py # Direct Anthropic SDK
-ollama_provider.py # Local Ollama models
+local_provider.py # Bundled llama-server (OpenAI-compatible API)
hardware/
__init__.py
detect.py # GPU/CPU detection, VRAM estimation
@@ -399,12 +401,33 @@ class AIProvider(ABC):
### Supported Providers
-| Provider | Package | Use Case |
-|----------|---------|----------|
+| Provider | Package / Binary | Use Case |
+|----------|-----------------|----------|
+| **llama-server** (bundled) | llama.cpp binary | Default local AI — bundled with app, no install needed. OpenAI-compatible API on localhost. |
| **LiteLLM** | `litellm` | Gateway to 100+ providers via unified API |
| **OpenAI** | `openai` | Direct OpenAI API (GPT-4o, etc.) |
| **Anthropic** | `anthropic` | Direct Anthropic API (Claude) |
-| **Ollama** | HTTP to localhost:11434 | Local models (Llama, Mistral, Phi, etc.) |
#### Local AI via llama-server (llama.cpp)
The app bundles `llama-server` from the llama.cpp project (MIT license). This is the default AI provider — it runs entirely on the user's machine with no internet connection or separate install required.
**How it works:**
1. Rust backend spawns `llama-server` as a managed subprocess on app launch (or on first AI use)
2. llama-server exposes an OpenAI-compatible REST API on `localhost:{dynamic_port}`
3. Python sidecar talks to it using the same OpenAI SDK interface as cloud providers
4. On app exit, Rust backend cleanly shuts down the llama-server process
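The lifecycle above could be sketched roughly like this. The `LlamaServer` type and its methods are illustrative assumptions, not the actual contents of `llama_server.rs`, though `--model` and `--port` are real llama-server flags:

```rust
use std::net::TcpListener;
use std::process::{Child, Command};

/// Illustrative manager for the bundled llama-server subprocess.
pub struct LlamaServer {
    child: Option<Child>,
    pub port: u16,
}

impl LlamaServer {
    /// Ask the OS for a free port by binding to port 0, then releasing it.
    pub fn allocate_port() -> std::io::Result<u16> {
        let listener = TcpListener::bind("127.0.0.1:0")?;
        Ok(listener.local_addr()?.port())
    }

    /// Spawn llama-server on a freshly allocated port.
    pub fn spawn(binary: &str, model_path: &str) -> std::io::Result<Self> {
        let port = Self::allocate_port()?;
        let child = Command::new(binary)
            .arg("--model")
            .arg(model_path)
            .arg("--port")
            .arg(port.to_string())
            .spawn()?;
        Ok(Self { child: Some(child), port })
    }
}

impl Drop for LlamaServer {
    /// Mirrors the clean shutdown on app exit: kill and reap the child.
    fn drop(&mut self) {
        if let Some(mut child) = self.child.take() {
            let _ = child.kill();
            let _ = child.wait();
        }
    }
}
```

Binding to port 0 delegates port selection to the OS, which avoids hard-coding a port that another app (or a second instance) might already hold.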
**Model management:**
- Models stored in `~/.voicetonotes/models/` (GGUF format)
- First-run setup downloads a recommended small model (e.g., Phi-3-mini, Llama-3-8B Q4)
- Users can download additional models or point to their own GGUF files
- Model selection in Settings UI with size/quality tradeoffs shown
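One check a download manager for user-supplied files might perform: GGUF files begin with the 4-byte ASCII magic `GGUF`. The helper name below is hypothetical:

```rust
/// GGUF files start with the 4-byte magic "GGUF".
/// Hypothetical validation step before accepting a downloaded
/// or user-supplied model file.
fn looks_like_gguf(header: &[u8]) -> bool {
    header.len() >= 4 && &header[..4] == b"GGUF"
}
```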
**Hardware utilization:**
- CPU: Works on any machine, uses all available cores
- NVIDIA GPU: CUDA acceleration when available
- The same CPU/GPU auto-detection used for Whisper applies here
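The `"gpu_layers": "auto"` setting in the config implies some heuristic mapping detected VRAM to a layer-offload count. A hypothetical version, purely illustrative of the idea (the real heuristic is not specified here):

```rust
/// Hypothetical "auto" heuristic: offload as many layers as fit in free
/// VRAM, estimating per-layer cost from the model file size. Returns 0
/// (pure CPU) when no GPU is detected.
fn auto_gpu_layers(free_vram_mb: Option<u64>, model_size_mb: u64, n_layers: u32) -> u32 {
    match free_vram_mb {
        None => 0, // no GPU detected: run fully on CPU
        Some(vram) => {
            let per_layer_mb = (model_size_mb / u64::from(n_layers)).max(1);
            let fit = (vram / per_layer_mb) as u32;
            fit.min(n_layers)
        }
    }
}
```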
### Context Window Strategy
@@ -417,14 +440,14 @@ class AIProvider(ABC):
### Configuration
-Users configure AI providers in Settings. API keys stored in OS keychain (libsecret on Linux, Windows Credential Manager). Local models (Ollama) require no keys.
+Users configure AI providers in Settings. API keys for cloud providers stored in OS keychain (libsecret on Linux, Windows Credential Manager). The bundled llama-server requires no keys or internet.
```json
{
"ai": {
-"default_provider": "ollama",
+"default_provider": "local",
"providers": {
-"ollama": { "base_url": "http://localhost:11434", "model": "llama3:8b" },
+"local": { "model": "phi-3-mini-Q4_K_M.gguf", "gpu_layers": "auto" },
"openai": { "model": "gpt-4o" },
"anthropic": { "model": "claude-sonnet-4-20250514" },
"litellm": { "model": "gpt-4o" }
@@ -530,7 +553,8 @@ Add AI provider support for Q&A and summarization.
**Deliverables:**
- Provider configuration UI with API key management
-- Ollama local model support
+- Bundled llama-server for local AI (default, no internet required)
+- Model download manager for local GGUF models
- OpenAI and Anthropic direct SDK support
- LiteLLM gateway support
- Chat panel for asking questions about the transcript
@@ -563,6 +587,6 @@ For parallel development, the codebase splits into these independent workstreams
| **Agent 5: Diarization Pipeline** | pyannote.audio integration, speaker-word alignment, combined pipeline | Agent 4 (transcription) |
| **Agent 6: Audio Player + Transcript UI** | wavesurfer.js integration, TipTap transcript editor, playback-transcript sync | Agent 1 (shell), Agent 3 (DB) |
| **Agent 7: Export System** | pysubs2 caption export, text formatters, export UI | Agent 2 (IPC), Agent 3 (DB) |
-| **Agent 8: AI Provider System** | Provider abstraction, LiteLLM/OpenAI/Anthropic/Ollama adapters, chat UI | Agent 2 (IPC), Agent 1 (shell) |
+| **Agent 8: AI Provider System** | Provider abstraction, bundled llama-server, LiteLLM/OpenAI/Anthropic adapters, chat UI | Agent 2 (IPC), Agent 1 (shell) |
Agents 1, 2, and 3 can start immediately in parallel. Agents 4-8 follow once their dependencies are in place.