From c450ef3c0cd9ea60859fa8153dc6b1d2e942d397 Mon Sep 17 00:00:00 2001 From: Josh Knapp Date: Thu, 26 Feb 2026 09:00:47 -0800 Subject: [PATCH] Switch local AI from Ollama to bundled llama-server, add MIT license - Replace Ollama dependency with bundled llama-server (llama.cpp) so users need no separate install for local AI inference - Rust backend manages llama-server lifecycle (spawn, port, shutdown) - Add MIT license for open source release - Update architecture doc, CLAUDE.md, and README accordingly Co-Authored-By: Claude Opus 4.6 --- CLAUDE.md | 7 +++++-- LICENSE | 21 +++++++++++++++++++++ README.md | 2 +- docs/ARCHITECTURE.md | 44 ++++++++++++++++++++++++++++++++++---------- 4 files changed, 61 insertions(+), 13 deletions(-) create mode 100644 LICENSE diff --git a/CLAUDE.md b/CLAUDE.md index b248506..6e55905 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -7,7 +7,8 @@ Desktop app for transcribing audio/video with speaker identification. Runs local - **Desktop shell:** Tauri v2 (Rust backend + Svelte/TypeScript frontend) - **ML pipeline:** Python sidecar process (faster-whisper, pyannote.audio, wav2vec2) - **Database:** SQLite (via rusqlite in Rust) -- **AI providers:** LiteLLM, OpenAI, Anthropic, Ollama (local) +- **Local AI:** Bundled llama-server (llama.cpp) — default, no install needed +- **Cloud AI providers:** LiteLLM, OpenAI, Anthropic (optional, user-configured) - **Caption export:** pysubs2 (Python) - **Audio UI:** wavesurfer.js - **Transcript editor:** TipTap (ProseMirror) @@ -15,7 +16,9 @@ Desktop app for transcribing audio/video with speaker identification. Runs local ## Key Architecture Decisions - Python sidecar communicates with Rust via JSON-line IPC (stdin/stdout) - All ML models must work on CPU. GPU (CUDA) is optional acceleration. -- AI cloud providers are optional. Local models (Ollama) are a first-class option. +- AI cloud providers are optional. Bundled llama-server (llama.cpp) is the default local AI — no separate install needed. 
+- Rust backend manages llama-server lifecycle (start/stop/port allocation). +- Project is open source (MIT license). - SQLite database is per-project, stored alongside media files. - Word-level timestamps are required for click-to-seek playback sync. diff --git a/LICENSE b/LICENSE new file mode 100644 index 0000000..db9356a --- /dev/null +++ b/LICENSE @@ -0,0 +1,21 @@ +MIT License + +Copyright (c) 2026 Voice to Notes Contributors + +Permission is hereby granted, free of charge, to any person obtaining a copy +of this software and associated documentation files (the "Software"), to deal +in the Software without restriction, including without limitation the rights +to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +copies of the Software, and to permit persons to whom the Software is +furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all +copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +SOFTWARE. diff --git a/README.md b/README.md index c87b17a..740f612 100644 --- a/README.md +++ b/README.md @@ -27,4 +27,4 @@ A desktop application that transcribes audio/video recordings with speaker ident ## License -TBD +MIT diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md index dc49a05..f6480e8 100644 --- a/docs/ARCHITECTURE.md +++ b/docs/ARCHITECTURE.md @@ -162,11 +162,12 @@ src/ The Rust layer is intentionally thin. It handles: -1. **Process Management** — Spawn, monitor, and kill the Python sidecar +1. 
**Process Management** — Spawn, monitor, and kill the Python sidecar and llama-server 2. **IPC Relay** — Forward messages between frontend and Python process 3. **File Operations** — Read/write project files, manage media 4. **SQLite** — All database operations via rusqlite 5. **System Info** — Detect GPU, RAM, CPU for hardware recommendations +6. **llama-server Lifecycle** — Start/stop bundled llama-server, manage port allocation ``` src-tauri/ @@ -179,6 +180,7 @@ src-tauri/ ai.rs # AI provider commands settings.rs # App settings and preferences system.rs # Hardware detection + llama_server.rs # llama-server process lifecycle db/ mod.rs # SQLite connection pool schema.rs # Table definitions and migrations @@ -215,7 +217,7 @@ python/ litellm_provider.py # LiteLLM (multi-provider gateway) openai_provider.py # Direct OpenAI SDK anthropic_provider.py # Direct Anthropic SDK - ollama_provider.py # Local Ollama models + local_provider.py # Bundled llama-server (OpenAI-compatible API) hardware/ __init__.py detect.py # GPU/CPU detection, VRAM estimation @@ -399,12 +401,33 @@ class AIProvider(ABC): ### Supported Providers -| Provider | Package | Use Case | -|----------|---------|----------| +| Provider | Package / Binary | Use Case | +|----------|-----------------|----------| +| **llama-server** (bundled) | llama.cpp binary | Default local AI — bundled with app, no install needed. OpenAI-compatible API on localhost. | | **LiteLLM** | `litellm` | Gateway to 100+ providers via unified API | | **OpenAI** | `openai` | Direct OpenAI API (GPT-4o, etc.) | | **Anthropic** | `anthropic` | Direct Anthropic API (Claude) | -| **Ollama** | HTTP to localhost:11434 | Local models (Llama, Mistral, Phi, etc.) | + +#### Local AI via llama-server (llama.cpp) + +The app bundles `llama-server` from the llama.cpp project (MIT license). This is the default AI provider — it runs entirely on the user's machine with no internet connection or separate install required. + +**How it works:** +1. 
Rust backend spawns `llama-server` as a managed subprocess on app launch (or on first AI use) +2. llama-server exposes an OpenAI-compatible REST API on `localhost:{dynamic_port}` +3. Python sidecar talks to it using the same OpenAI SDK interface as cloud providers +4. On app exit, Rust backend cleanly shuts down the llama-server process + +**Model management:** +- Models stored in `~/.voicetonotes/models/` (GGUF format) +- First-run setup downloads a recommended small model (e.g., Phi-3-mini, Llama-3-8B Q4) +- Users can download additional models or point to their own GGUF files +- Model selection in Settings UI with size/quality tradeoffs shown + +**Hardware utilization:** +- CPU: Works on any machine, uses all available cores +- NVIDIA GPU: CUDA acceleration when available +- The same CPU/GPU auto-detection used for Whisper applies here ### Context Window Strategy @@ -417,14 +440,14 @@ class AIProvider(ABC): ### Configuration -Users configure AI providers in Settings. API keys stored in OS keychain (libsecret on Linux, Windows Credential Manager). Local models (Ollama) require no keys. +Users configure AI providers in Settings. API keys for cloud providers stored in OS keychain (libsecret on Linux, Windows Credential Manager). The bundled llama-server requires no keys or internet. ```json { "ai": { - "default_provider": "ollama", + "default_provider": "local", "providers": { - "ollama": { "base_url": "http://localhost:11434", "model": "llama3:8b" }, + "local": { "model": "phi-3-mini-Q4_K_M.gguf", "gpu_layers": "auto" }, "openai": { "model": "gpt-4o" }, "anthropic": { "model": "claude-sonnet-4-20250514" }, "litellm": { "model": "gpt-4o" } @@ -530,7 +553,8 @@ Add AI provider support for Q&A and summarization. 
**Deliverables:** - Provider configuration UI with API key management -- Ollama local model support +- Bundled llama-server for local AI (default, no internet required) +- Model download manager for local GGUF models - OpenAI and Anthropic direct SDK support - LiteLLM gateway support - Chat panel for asking questions about the transcript @@ -563,6 +587,6 @@ For parallel development, the codebase splits into these independent workstreams | **Agent 5: Diarization Pipeline** | pyannote.audio integration, speaker-word alignment, combined pipeline | Agent 4 (transcription) | | **Agent 6: Audio Player + Transcript UI** | wavesurfer.js integration, TipTap transcript editor, playback-transcript sync | Agent 1 (shell), Agent 3 (DB) | | **Agent 7: Export System** | pysubs2 caption export, text formatters, export UI | Agent 2 (IPC), Agent 3 (DB) | -| **Agent 8: AI Provider System** | Provider abstraction, LiteLLM/OpenAI/Anthropic/Ollama adapters, chat UI | Agent 2 (IPC), Agent 1 (shell) | +| **Agent 8: AI Provider System** | Provider abstraction, bundled llama-server, LiteLLM/OpenAI/Anthropic adapters, chat UI | Agent 2 (IPC), Agent 1 (shell) | Agents 1, 2, and 3 can start immediately in parallel. Agents 4-8 follow once their dependencies are in place.
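
---

As an illustrative sketch of the lifecycle design described above (not part of the patch): the Rust backend allocates a dynamic localhost port and spawns `llama-server` against it. The idea is shown here in Python for brevity; the real implementation lives in `llama_server.rs`. The flag names follow llama.cpp's server CLI (`--model`, `--port`, `--n-gpu-layers`) but should be verified against the bundled binary's `--help`, and `llama_server_args` is a hypothetical helper name.

```python
import socket

def allocate_port() -> int:
    """Ask the OS for a free TCP port by binding to port 0, then release it.

    The port could in principle be reclaimed by another process before
    llama-server binds it, so a real implementation would retry on failure.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]

def llama_server_args(binary: str, model_path: str, port: int,
                      gpu_layers: int = 0) -> list:
    """Build an argument vector for spawning llama-server as a subprocess."""
    return [
        binary,
        "--model", model_path,          # GGUF model file
        "--port", str(port),            # dynamically allocated above
        "--n-gpu-layers", str(gpu_layers),  # 0 = pure CPU; >0 offloads to GPU
    ]

port = allocate_port()
args = llama_server_args("llama-server", "phi-3-mini-Q4_K_M.gguf", port)
```

On shutdown, the backend would terminate the child process it spawned with these arguments, mirroring the "cleanly shuts down the llama-server process" step in the patch.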
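And a sketch of step 3 of "How it works" (again illustrative, not part of the patch): because llama-server exposes the OpenAI chat-completions convention, the sidecar's `local_provider.py` can build the same request shape it uses for cloud providers, just pointed at localhost. This uses only the standard library; `build_chat_request` is a hypothetical helper, and the base URL/port are assumptions.

```python
import json
import urllib.request

def build_chat_request(base_url: str, model: str,
                       messages: list) -> urllib.request.Request:
    """Build a POST against the local server's OpenAI-compatible endpoint."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",   # OpenAI-compatible path
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request(
    "http://localhost:8080",                 # assumed dynamic port
    "phi-3-mini-Q4_K_M.gguf",
    [{"role": "user", "content": "Summarize this transcript."}],
)
# req is ready for urllib.request.urlopen(req) once llama-server is running
```

Because the request shape is identical to the cloud providers', no API key is attached for the local case, matching the "requires no keys or internet" note in the patch.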