Update README, add User Guide and Contributing docs
- README: Updated to reflect current architecture (decoupled app/sidecar), Ollama as local AI, CUDA support, split CI workflows
- USER_GUIDE.md: Complete how-to including first-time setup, transcription workflow, speaker detection setup, Ollama configuration, export formats, keyboard shortcuts, and troubleshooting
- CONTRIBUTING.md: Dev setup, project structure, conventions, CI/CD overview

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CONTRIBUTING.md (new file, 140 lines)
@@ -0,0 +1,140 @@
# Contributing to Voice to Notes

Thank you for your interest in contributing! This guide covers how to set up the project for development and submit changes.

## Development Setup

### Prerequisites

- **Node.js 20+** and npm
- **Rust** (stable toolchain)
- **Python 3.11+** with [uv](https://docs.astral.sh/uv/) (recommended) or pip
- **System libraries (Linux only):**

  ```bash
  sudo apt install libgtk-3-dev libwebkit2gtk-4.1-dev libappindicator3-dev librsvg2-dev patchelf xdg-utils
  ```

### Clone and Install

```bash
git clone https://repo.anhonesthost.net/MacroPad/voice-to-notes.git
cd voice-to-notes

# Frontend
npm install

# Python sidecar
cd python && pip install -e ".[dev]" && cd ..
```

### Running in Dev Mode

```bash
npm run tauri:dev
```

This runs the Svelte dev server + Tauri with hot-reload. The Python sidecar runs from your system Python (no PyInstaller needed in dev mode).
### Building

```bash
# Build the Python sidecar (frozen binary)
cd python && python build_sidecar.py --cpu-only && cd ..

# Build the full app
npm run tauri build
```

## Project Structure

```
src/                  # Svelte 5 frontend
  lib/components/     # Reusable UI components
  lib/stores/         # Svelte stores (app state)
  routes/             # SvelteKit pages
src-tauri/            # Rust backend (Tauri v2)
  src/sidecar/        # Python sidecar lifecycle (download, extract, IPC)
  src/commands/       # Tauri command handlers
  src/db/             # SQLite database layer
python/               # Python ML sidecar
  voice_to_notes/     # Main package
    services/         # Transcription, diarization, AI, export
    ipc/              # JSON-line IPC protocol
    hardware/         # GPU/CPU detection
.gitea/workflows/     # CI/CD pipelines
docs/                 # Documentation
```

## How It Works

The app has three layers:

1. **Frontend (Svelte)** — UI, audio playback (wavesurfer.js), transcript editing (TipTap)
2. **Backend (Rust/Tauri)** — Desktop integration, file access, SQLite, sidecar process management
3. **Sidecar (Python)** — ML inference (faster-whisper, pyannote.audio), AI chat, export

Rust and Python communicate via **JSON-line IPC** over stdin/stdout pipes. Each request has an `id`, `type`, and `payload`. The Python sidecar runs as a child process managed by `SidecarManager` in Rust.
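As a sketch of the wire format (the `transcribe` message type and payload fields here are hypothetical examples, not the project's actual schema), each message is one newline-terminated JSON object:

```python
import json

def encode_request(req_id: str, msg_type: str, payload: dict) -> bytes:
    """Serialize one IPC request as a single newline-terminated JSON line."""
    msg = {"id": req_id, "type": msg_type, "payload": payload}
    return (json.dumps(msg) + "\n").encode("utf-8")

def decode_message(line: bytes) -> dict:
    """Parse one newline-delimited JSON message and check the envelope."""
    msg = json.loads(line.decode("utf-8"))
    for key in ("id", "type", "payload"):
        if key not in msg:
            raise ValueError(f"malformed IPC message: missing {key!r}")
    return msg

# A request survives a round-trip unchanged:
wire = encode_request("42", "transcribe", {"path": "audio.wav"})
assert decode_message(wire)["type"] == "transcribe"
```

Newline framing keeps the protocol easy to read from a pipe: each side reads one line at a time and never needs a length prefix.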
## Conventions

### Rust

- Follow standard Rust conventions
- Run `cargo fmt` and `cargo clippy` before committing
- Tauri commands go in `src-tauri/src/commands/`

### Python

- Python 3.11+, type hints everywhere
- Use `ruff` for linting: `ruff check python/`
- Tests with pytest: `cd python && pytest`
- IPC messages: JSON-line format with `id`, `type`, `payload` fields

### TypeScript / Svelte

- Svelte 5 runes (`$state`, `$derived`, `$effect`)
- Strict TypeScript
- Components in `src/lib/components/`
- State in `src/lib/stores/`

### General

- All timestamps in milliseconds (integer)
- UUIDs as primary keys in the database
- Don't bundle API keys or secrets — those are user-configured
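To illustrate the timestamp and key conventions (a sketch only; the field names and the real schema in `src-tauri/src/db/` may differ), a new segment record would look like this:

```python
import uuid

def to_millis(seconds: float) -> int:
    """Convert a float seconds timestamp (e.g. from Whisper) to integer milliseconds."""
    return round(seconds * 1000)

def new_segment_row(start_s: float, end_s: float, text: str) -> dict:
    """Build a record following the conventions: UUID primary key, integer-ms times."""
    return {
        "id": str(uuid.uuid4()),   # UUID primary key
        "start_ms": to_millis(start_s),
        "end_ms": to_millis(end_s),
        "text": text,
    }

row = new_segment_row(1.234, 2.5, "hello")
assert row["start_ms"] == 1234 and row["end_ms"] == 2500
```

Storing integer milliseconds everywhere avoids float-precision drift when timestamps cross the Rust/Python/TypeScript boundary.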
## Submitting Changes

1. Fork the repository
2. Create a feature branch: `git checkout -b my-feature`
3. Make your changes
4. Test locally with `npm run tauri:dev`
5. Run linters: `cargo fmt && cargo clippy`, `ruff check python/`
6. Commit with a clear message describing the change
7. Open a Pull Request against `main`

## CI/CD

Pushes to `main` automatically:

- Bump the app version and create a release (`release.yml`)
- Build app installers for all platforms

Changes to `python/` also trigger sidecar builds (`build-sidecar.yml`).

## Areas for Contribution

- UI/UX improvements
- New export formats
- Additional AI provider integrations
- Performance optimizations
- Accessibility improvements
- Documentation and translations
- Bug reports and testing on different platforms

## Reporting Issues

Open an issue on the [repository](https://repo.anhonesthost.net/MacroPad/voice-to-notes/issues) with:

- Steps to reproduce
- Expected vs actual behavior
- Platform and version info
- Sidecar logs (`%LOCALAPPDATA%\com.voicetonotes.app\sidecar.log` on Windows)

## License

By contributing, you agree that your contributions will be licensed under the [MIT License](LICENSE).
README.md (modified, 117 lines)
@@ -1,32 +1,55 @@
 # Voice to Notes
 
-A desktop application that transcribes audio/video recordings with speaker identification, producing editable transcriptions with synchronized audio playback.
+A desktop application that transcribes audio and video recordings with speaker identification, synchronized playback, and AI-powered analysis. Export to SRT, WebVTT, ASS captions, plain text, or Markdown.
 
 ## Features
 
-- **Speech-to-Text Transcription** — Accurate transcription via faster-whisper (Whisper models) with word-level timestamps
-- **Speaker Identification (Diarization)** — Detect and distinguish between speakers using pyannote.audio
-- **Synchronized Playback** — Click any word to seek to that point in the audio (Web Audio API for instant playback)
-- **AI Integration** — Ask questions about your transcript via OpenAI, Anthropic, or any OpenAI-compatible API (LiteLLM proxies, Ollama, vLLM)
-- **Export Formats** — SRT, WebVTT, ASS captions, plain text, and Markdown with speaker labels
-- **Cross-Platform** — Builds for Linux, Windows, and macOS (Apple Silicon)
+- **Speech-to-Text** — Accurate transcription via faster-whisper with word-level timestamps. Supports 99 languages.
+- **Speaker Identification** — Detect and label speakers using pyannote.audio. Rename speakers for clean exports.
+- **GPU Acceleration** — CUDA support for NVIDIA GPUs (Windows/Linux). Falls back to CPU automatically.
+- **Synchronized Playback** — Click any word to seek. Waveform visualization via wavesurfer.js.
+- **AI Chat** — Ask questions about your transcript. Works with Ollama (local), OpenAI, Anthropic, or any OpenAI-compatible API.
+- **Export** — SRT, WebVTT, ASS, plain text, Markdown — all with speaker labels.
+- **Cross-Platform** — Linux, Windows, macOS (Apple Silicon).
+
+## Quick Start
+
+1. Download the installer from [Releases](https://repo.anhonesthost.net/MacroPad/voice-to-notes/releases)
+2. On first launch, choose **CPU** or **CUDA** sidecar (the AI engine downloads separately, ~500MB–2GB)
+3. Import an audio/video file and click **Transcribe**
+
+See the full [User Guide](docs/USER_GUIDE.md) for detailed setup and usage instructions.
 
 ## Platform Support
 
-| Platform | Architecture | Status |
-|----------|-------------|--------|
-| Linux | x86_64 | Supported |
-| Windows | x86_64 | Supported |
-| macOS | ARM (Apple Silicon) | Supported |
+| Platform | Architecture | Installers |
+|----------|-------------|------------|
+| Linux | x86_64 | .deb, .rpm |
+| Windows | x86_64 | .msi, .exe (NSIS) |
+| macOS | ARM (Apple Silicon) | .dmg |
+
+## Architecture
+
+The app is split into two independently versioned components:
+
+- **App** (v0.2.x) — Tauri desktop shell with Svelte frontend. Small installer (~50MB).
+- **Sidecar** (v1.x) — Python ML engine (faster-whisper, pyannote.audio). Downloaded on first launch. CPU (~500MB) or CUDA (~2GB) variants.
+
+This separation means app UI updates don't require re-downloading the sidecar, and sidecar updates don't require reinstalling the app.
 
 ## Tech Stack
 
-- **Desktop shell:** Tauri v2 (Rust backend + Svelte 5 / TypeScript frontend)
-- **ML pipeline:** Python sidecar (faster-whisper, pyannote.audio) — frozen via PyInstaller for distribution
-- **Audio playback:** wavesurfer.js with Web Audio API backend
-- **AI providers:** OpenAI, Anthropic, OpenAI-compatible endpoints (local or remote)
-- **Local AI:** Bundled llama-server (llama.cpp)
-- **Caption export:** pysubs2
+| Component | Technology |
+|-----------|-----------|
+| Desktop shell | Tauri v2 (Rust + Svelte 5 / TypeScript) |
+| Transcription | faster-whisper (CTranslate2) |
+| Speaker ID | pyannote.audio 3.1 |
+| Audio UI | wavesurfer.js |
+| Transcript editor | TipTap (ProseMirror) |
+| AI (local) | Ollama (any model) |
+| AI (cloud) | OpenAI, Anthropic, OpenAI-compatible |
+| Caption export | pysubs2 |
+| Database | SQLite (rusqlite) |
 
 ## Development
 
@@ -34,8 +57,8 @@ A desktop application that transcribes audio/video recordings with speaker ident
 
 - Node.js 20+
 - Rust (stable)
-- Python 3.11+ with ML dependencies
-- System: `libgtk-3-dev`, `libwebkit2gtk-4.1-dev` (Linux)
+- Python 3.11+ with uv or pip
+- Linux: `libgtk-3-dev`, `libwebkit2gtk-4.1-dev`, `libappindicator3-dev`, `librsvg2-dev`
 
 ### Getting Started
 
@@ -44,47 +67,61 @@ A desktop application that transcribes audio/video recordings with speaker ident
 npm install
 
 # Install Python sidecar dependencies
-cd python && pip install -e . && cd ..
+cd python && pip install -e ".[dev]" && cd ..
 
 # Run in dev mode (uses system Python for the sidecar)
 npm run tauri:dev
 ```
 
-### Building for Distribution
+### Building
 
 ```bash
-# Build the frozen Python sidecar
-npm run sidecar:build
+# Build the frozen Python sidecar (CPU-only)
+cd python && python build_sidecar.py --cpu-only && cd ..
+
+# Build with CUDA support
+cd python && python build_sidecar.py --with-cuda && cd ..
 
-# Build the Tauri app (requires sidecar in src-tauri/binaries/)
+# Build the Tauri app
 npm run tauri build
 ```
 
 ### CI/CD
 
-Gitea Actions workflows are in `.gitea/workflows/`. The build pipeline:
+Two Gitea Actions workflows in `.gitea/workflows/`:
 
-1. **Build sidecar** — PyInstaller-frozen Python binary per platform (CPU-only PyTorch)
-2. **Build Tauri app** — Bundles the sidecar via `externalBin`, produces .deb/.AppImage (Linux), .msi (Windows), .dmg (macOS)
+**`release.yml`** — Triggers on push to main:
+
+1. Bumps app version (patch), creates git tag and Gitea release
+2. Builds lightweight app installers for all platforms (no sidecar bundled)
+
+**`build-sidecar.yml`** — Triggers on changes to `python/` or manual dispatch:
+
+1. Bumps sidecar version, creates `sidecar-v*` tag and release
+2. Builds CPU + CUDA variants for Linux/Windows, CPU for macOS
+3. Uploads as separate release assets
 
 #### Required Secrets
 
-| Secret | Purpose | Required? |
-|--------|---------|-----------|
-| `TAURI_SIGNING_PRIVATE_KEY` | Signs Tauri update bundles | Optional (for auto-updates) |
+| Secret | Purpose |
+|--------|---------|
+| `BUILD_TOKEN` | Gitea API token for creating releases and pushing tags |
+
+No other secrets are needed for building. AI provider API keys and HuggingFace tokens are configured by end users in the app's Settings.
 
 ### Project Structure
 
 ```
 src/                  # Svelte 5 frontend
-src-tauri/            # Rust backend (Tauri commands, sidecar manager, SQLite)
-python/               # Python sidecar (transcription, diarization, AI)
-  voice_to_notes/     # Python package
-  build_sidecar.py    # PyInstaller build script
-  voice_to_notes.spec # PyInstaller spec
-.gitea/workflows/     # Gitea Actions CI/CD
+  lib/components/     # UI components (waveform, transcript editor, settings, etc.)
+  lib/stores/         # Svelte stores (settings, transcript state)
+  routes/             # SvelteKit pages
+src-tauri/            # Rust backend
+  src/sidecar/        # Sidecar process manager (download, extract, IPC)
+  src/commands/       # Tauri command handlers
+  nsis-hooks.nsh      # Windows uninstall cleanup
+python/               # Python sidecar
+  voice_to_notes/     # Python package (transcription, diarization, AI, export)
+  build_sidecar.py    # PyInstaller build script
+  voice_to_notes.spec # PyInstaller spec
+.gitea/workflows/     # CI/CD (release.yml, build-sidecar.yml)
+docs/                 # Documentation
 ```
 
 ## License
docs/USER_GUIDE.md (new file, 203 lines)
@@ -0,0 +1,203 @@
# Voice to Notes — User Guide

## Getting Started

### Installation

Download the installer for your platform from the [Releases](https://repo.anhonesthost.net/MacroPad/voice-to-notes/releases) page:

- **Windows:** `.msi` or `-setup.exe`
- **Linux:** `.deb` or `.rpm`
- **macOS:** `.dmg`

### First-Time Setup

On first launch, Voice to Notes will prompt you to download its AI engine (the "sidecar"):

1. Choose **Standard (CPU)** (~500 MB) or **GPU Accelerated (CUDA)** (~2 GB)
   - Choose CUDA if you have an NVIDIA GPU for significantly faster transcription
   - CPU works on all computers
2. Click **Download & Install** and wait for the download to complete
3. The app will proceed to the main interface once the sidecar is ready

The sidecar only needs to be downloaded once. Updates are detected automatically on launch.

---

## Basic Workflow

### 1. Import Audio

- Click **Import Audio** or press **Ctrl+O** (Cmd+O on Mac)
- Supported formats: MP3, WAV, FLAC, OGG, M4A, AAC, WMA, MP4, MKV, AVI, MOV, WebM

### 2. Transcribe

After importing, click **Transcribe** to start the transcription pipeline:

- **Transcription:** Converts speech to text with word-level timestamps
- **Speaker Detection:** Identifies different speakers (if configured — see [Speaker Detection](#speaker-detection))
- A progress bar shows the current stage and percentage

### 3. Review and Edit

- The **waveform** displays at the top — click anywhere to seek
- The **transcript** shows below with speaker labels and timestamps
- **Click any word** in the transcript to jump to that point in the audio
- The current word highlights during playback
- **Edit text** directly in the transcript — word timings are preserved
### 4. Export

Click **Export** and choose a format:

| Format | Extension | Best For |
|--------|-----------|----------|
| SRT | `.srt` | Video subtitles (most compatible) |
| WebVTT | `.vtt` | Web video players, HTML5 |
| ASS/SSA | `.ass` | Styled subtitles with speaker colors |
| Plain Text | `.txt` | Reading, sharing, pasting |
| Markdown | `.md` | Documentation, notes |

All formats include speaker labels when speaker detection is enabled.
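To show what a labeled cue looks like, here is a minimal sketch of the SRT format (the app exports via pysubs2; prefixing the speaker name to the cue text is an illustrative assumption):

```python
def srt_timestamp(ms: int) -> str:
    """Format integer milliseconds as an SRT timestamp (HH:MM:SS,mmm)."""
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, millis = divmod(rem, 1_000)
    return f"{h:02}:{m:02}:{s:02},{millis:03}"

def srt_cue(index: int, start_ms: int, end_ms: int, speaker: str, text: str) -> str:
    """One SRT cue block with the speaker label prefixed to the text."""
    return (
        f"{index}\n"
        f"{srt_timestamp(start_ms)} --> {srt_timestamp(end_ms)}\n"
        f"{speaker}: {text}\n"
    )

print(srt_cue(1, 0, 2500, "Alice", "Welcome, everyone."))
# 1
# 00:00:00,000 --> 00:00:02,500
# Alice: Welcome, everyone.
```

Note the SRT quirk: a comma (not a period) separates seconds from milliseconds.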
### 5. Save Project

- **Ctrl+S** (Cmd+S) saves the current project as a `.vtn` file
- This preserves the full transcript, speaker assignments, and edits
- Reopen later to continue editing or re-export

---

## Playback Controls

| Action | Shortcut |
|--------|----------|
| Play / Pause | **Space** |
| Skip back 5s | **Left Arrow** |
| Skip forward 5s | **Right Arrow** |
| Seek to word | Click any word in the transcript |
| Import audio | **Ctrl+O** / **Cmd+O** |
| Open settings | **Ctrl+,** / **Cmd+,** |

---

## Speaker Detection

Speaker detection (diarization) identifies who is speaking at each point in the audio. It requires a one-time setup:

### Setup

1. Go to **Settings > Speakers**
2. Create a free account at [huggingface.co](https://huggingface.co/join)
3. Accept the license on **all three** model pages:
   - [pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1)
   - [pyannote/segmentation-3.0](https://huggingface.co/pyannote/segmentation-3.0)
   - [pyannote/speaker-diarization-community-1](https://huggingface.co/pyannote/speaker-diarization-community-1)
4. Create a token at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) (read access is sufficient)
5. Paste the token in Settings and click **Test & Download Model**

### Speaker Options

- **Number of speakers:** Set to auto-detect or specify a fixed number for faster results
- **Skip speaker detection:** Check this to only transcribe without identifying speakers

### Managing Speakers

After transcription, speakers appear as "Speaker 1", "Speaker 2", etc. in the left sidebar. Double-click a speaker name to rename it — the new name appears throughout the transcript and in exports.

---

## AI Chat

The AI chat panel lets you ask questions about your transcript. The AI sees the full transcript with speaker labels as context.

Example prompts:

- "Summarize this conversation"
- "What were the key action items?"
- "What did Speaker 1 say about the budget?"

### Setting Up Ollama (Local AI)

[Ollama](https://ollama.com) runs AI models locally on your computer — no API keys or internet required.

1. **Install Ollama:**
   - Download from [ollama.com](https://ollama.com)
   - Or on Linux: `curl -fsSL https://ollama.com/install.sh | sh`

2. **Pull a model:**

   ```bash
   ollama pull llama3.2
   ```

   Other good options: `mistral`, `gemma2`, `phi3`

3. **Configure in Voice to Notes:**
   - Go to **Settings > AI Provider**
   - Select **Ollama**
   - URL: `http://localhost:11434` (default, usually no change needed)
   - Model: `llama3.2` (or whichever model you pulled)

4. **Use:** Open the AI chat panel (right sidebar) and start asking questions
### Cloud AI Providers

If you prefer cloud-based AI:

**OpenAI:**

- Select **OpenAI** in Settings > AI Provider
- Enter your API key from [platform.openai.com/api-keys](https://platform.openai.com/api-keys)
- Default model: `gpt-4o-mini`

**Anthropic:**

- Select **Anthropic** in Settings > AI Provider
- Enter your API key from [console.anthropic.com](https://console.anthropic.com)
- Default model: `claude-sonnet-4-6`

**OpenAI Compatible:**

- For any provider with an OpenAI-compatible API (vLLM, LiteLLM, etc.)
- Enter the API base URL, key, and model name

---

## Settings Reference

### Transcription

| Setting | Options | Default |
|---------|---------|---------|
| Whisper Model | tiny, base, small, medium, large-v3 | base |
| Device | CPU, CUDA | CPU |
| Language | Auto-detect, or specify (en, es, fr, etc.) | Auto-detect |

**Model recommendations:**

- **tiny/base:** Fast, good for clear audio with one speaker
- **small:** Best balance of speed and accuracy
- **medium:** Better accuracy, noticeably slower
- **large-v3:** Best accuracy, requires 8GB+ VRAM (GPU) or 16GB+ RAM (CPU)

### Debug

- **Enable Developer Tools:** Opens the browser inspector for debugging

---

## Troubleshooting

### Transcription is slow

- Use a smaller model (tiny or base)
- If you have an NVIDIA GPU, select CUDA in Settings > Transcription > Device
- Ensure you downloaded the CUDA sidecar during setup

### Speaker detection not working

- Verify your HuggingFace token in Settings > Speakers
- Click "Test & Download Model" to re-download
- Make sure you accepted the license on all three model pages

### Audio won't play / No waveform

- Check that the audio file still exists at its original location
- Try re-importing the file
- Supported formats: MP3, WAV, FLAC, OGG, M4A, AAC, WMA

### App shows "Setting up Voice to Notes"

- This is the first-launch sidecar download — it only happens once
- If it fails, check your internet connection and click Retry