Switch local AI from Ollama to bundled llama-server, add MIT license

- Replace Ollama dependency with bundled llama-server (llama.cpp)
  so users need no separate install for local AI inference
- Rust backend manages llama-server lifecycle (spawn, port, shutdown)
- Add MIT license for open source release
- Update architecture doc, CLAUDE.md, and README accordingly

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 09:00:47 -08:00
parent 0edb06a913
commit c450ef3c0c
4 changed files with 61 additions and 13 deletions


@@ -162,11 +162,12 @@ src/
The Rust layer is intentionally thin. It handles:
1. **Process Management** — Spawn, monitor, and kill the Python sidecar
1. **Process Management** — Spawn, monitor, and kill the Python sidecar and llama-server
2. **IPC Relay** — Forward messages between frontend and Python process
3. **File Operations** — Read/write project files, manage media
4. **SQLite** — All database operations via rusqlite
5. **System Info** — Detect GPU, RAM, CPU for hardware recommendations
6. **llama-server Lifecycle** — Start/stop bundled llama-server, manage port allocation
```
src-tauri/
@@ -179,6 +180,7 @@ src-tauri/
ai.rs # AI provider commands
settings.rs # App settings and preferences
system.rs # Hardware detection
llama_server.rs # llama-server process lifecycle
db/
mod.rs # SQLite connection pool
schema.rs # Table definitions and migrations
@@ -215,7 +217,7 @@ python/
litellm_provider.py # LiteLLM (multi-provider gateway)
openai_provider.py # Direct OpenAI SDK
anthropic_provider.py # Direct Anthropic SDK
-ollama_provider.py # Local Ollama models
+local_provider.py # Bundled llama-server (OpenAI-compatible API)
hardware/
__init__.py
detect.py # GPU/CPU detection, VRAM estimation
@@ -399,12 +401,33 @@ class AIProvider(ABC):
### Supported Providers
-| Provider | Package | Use Case |
-|----------|---------|----------|
+| Provider | Package / Binary | Use Case |
+|----------|-----------------|----------|
+| **llama-server** (bundled) | llama.cpp binary | Default local AI — bundled with app, no install needed. OpenAI-compatible API on localhost. |
| **LiteLLM** | `litellm` | Gateway to 100+ providers via unified API |
| **OpenAI** | `openai` | Direct OpenAI API (GPT-4o, etc.) |
| **Anthropic** | `anthropic` | Direct Anthropic API (Claude) |
-| **Ollama** | HTTP to localhost:11434 | Local models (Llama, Mistral, Phi, etc.) |
#### Local AI via llama-server (llama.cpp)
The app bundles `llama-server` from the llama.cpp project (MIT license). This is the default AI provider — it runs entirely on the user's machine with no internet connection or separate install required.
**How it works:**
1. Rust backend spawns `llama-server` as a managed subprocess on app launch (or on first AI use)
2. llama-server exposes an OpenAI-compatible REST API on `localhost:{dynamic_port}`
3. Python sidecar talks to it using the same OpenAI SDK interface as cloud providers
4. On app exit, Rust backend cleanly shuts down the llama-server process
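The lifecycle above could be sketched roughly like this. The `LlamaServer` type and its methods are illustrative assumptions, not the actual contents of `llama_server.rs`, though `--model` and `--port` are real llama-server flags:

```rust
use std::net::TcpListener;
use std::process::{Child, Command};

/// Illustrative manager for the bundled llama-server subprocess.
pub struct LlamaServer {
    child: Option<Child>,
    pub port: u16,
}

impl LlamaServer {
    /// Ask the OS for a free port by binding to port 0, then releasing it.
    pub fn allocate_port() -> std::io::Result<u16> {
        let listener = TcpListener::bind("127.0.0.1:0")?;
        Ok(listener.local_addr()?.port())
    }

    /// Spawn llama-server on a freshly allocated port.
    pub fn spawn(binary: &str, model_path: &str) -> std::io::Result<Self> {
        let port = Self::allocate_port()?;
        let child = Command::new(binary)
            .arg("--model")
            .arg(model_path)
            .arg("--port")
            .arg(port.to_string())
            .spawn()?;
        Ok(Self { child: Some(child), port })
    }
}

impl Drop for LlamaServer {
    /// Mirrors the clean shutdown on app exit: kill and reap the child.
    fn drop(&mut self) {
        if let Some(mut child) = self.child.take() {
            let _ = child.kill();
            let _ = child.wait();
        }
    }
}
```

Binding to port 0 delegates port selection to the OS, which avoids hard-coding a port that another app (or a second instance) might already hold.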
**Model management:**
- Models stored in `~/.voicetonotes/models/` (GGUF format)
- First-run setup downloads a recommended small model (e.g., Phi-3-mini, Llama-3-8B Q4)
- Users can download additional models or point to their own GGUF files
- Model selection in Settings UI with size/quality tradeoffs shown
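One check a download manager for user-supplied files might perform: GGUF files begin with the 4-byte ASCII magic `GGUF`. The helper name below is hypothetical:

```rust
/// GGUF files start with the 4-byte magic "GGUF".
/// Hypothetical validation step before accepting a downloaded
/// or user-supplied model file.
fn looks_like_gguf(header: &[u8]) -> bool {
    header.len() >= 4 && &header[..4] == b"GGUF"
}
```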
**Hardware utilization:**
- CPU: Works on any machine, uses all available cores
- NVIDIA GPU: CUDA acceleration when available
- The same CPU/GPU auto-detection used for Whisper applies here
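The `"gpu_layers": "auto"` setting in the config implies some heuristic mapping detected VRAM to a layer-offload count. A hypothetical version, purely illustrative of the idea (the real heuristic is not specified here):

```rust
/// Hypothetical "auto" heuristic: offload as many layers as fit in free
/// VRAM, estimating per-layer cost from the model file size. Returns 0
/// (pure CPU) when no GPU is detected.
fn auto_gpu_layers(free_vram_mb: Option<u64>, model_size_mb: u64, n_layers: u32) -> u32 {
    match free_vram_mb {
        None => 0, // no GPU detected: run fully on CPU
        Some(vram) => {
            let per_layer_mb = (model_size_mb / u64::from(n_layers)).max(1);
            let fit = (vram / per_layer_mb) as u32;
            fit.min(n_layers)
        }
    }
}
```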
### Context Window Strategy
@@ -417,14 +440,14 @@ class AIProvider(ABC):
### Configuration
-Users configure AI providers in Settings. API keys stored in OS keychain (libsecret on Linux, Windows Credential Manager). Local models (Ollama) require no keys.
+Users configure AI providers in Settings. API keys for cloud providers stored in OS keychain (libsecret on Linux, Windows Credential Manager). The bundled llama-server requires no keys or internet.
```json
{
"ai": {
-"default_provider": "ollama",
+"default_provider": "local",
"providers": {
-"ollama": { "base_url": "http://localhost:11434", "model": "llama3:8b" },
+"local": { "model": "phi-3-mini-Q4_K_M.gguf", "gpu_layers": "auto" },
"openai": { "model": "gpt-4o" },
"anthropic": { "model": "claude-sonnet-4-20250514" },
"litellm": { "model": "gpt-4o" }
@@ -530,7 +553,8 @@ Add AI provider support for Q&A and summarization.
**Deliverables:**
- Provider configuration UI with API key management
-- Ollama local model support
+- Bundled llama-server for local AI (default, no internet required)
+- Model download manager for local GGUF models
- OpenAI and Anthropic direct SDK support
- LiteLLM gateway support
- Chat panel for asking questions about the transcript
@@ -563,6 +587,6 @@ For parallel development, the codebase splits into these independent workstreams
| **Agent 5: Diarization Pipeline** | pyannote.audio integration, speaker-word alignment, combined pipeline | Agent 4 (transcription) |
| **Agent 6: Audio Player + Transcript UI** | wavesurfer.js integration, TipTap transcript editor, playback-transcript sync | Agent 1 (shell), Agent 3 (DB) |
| **Agent 7: Export System** | pysubs2 caption export, text formatters, export UI | Agent 2 (IPC), Agent 3 (DB) |
-| **Agent 8: AI Provider System** | Provider abstraction, LiteLLM/OpenAI/Anthropic/Ollama adapters, chat UI | Agent 2 (IPC), Agent 1 (shell) |
+| **Agent 8: AI Provider System** | Provider abstraction, bundled llama-server, LiteLLM/OpenAI/Anthropic adapters, chat UI | Agent 2 (IPC), Agent 1 (shell) |
Agents 1, 2, and 3 can start immediately in parallel. Agents 4-8 follow once their dependencies are in place.