Update README, add User Guide and Contributing docs

- README: Updated to reflect current architecture (decoupled app/sidecar), Ollama as local AI, CUDA support, split CI workflows
- USER_GUIDE.md: Complete how-to including first-time setup, transcription workflow, speaker detection setup, Ollama configuration, export formats, keyboard shortcuts, and troubleshooting
- CONTRIBUTING.md: Dev setup, project structure, conventions, CI/CD overview

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-22 12:06:10 -07:00
parent f022c6dfe0
commit 35173c54ce
3 changed files with 420 additions and 40 deletions
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -0,0 +1,140 @@
+# Contributing to Voice to Notes
+
+Thank you for your interest in contributing! This guide covers how to set up the project for development and submit changes.
+
+## Development Setup
+
+### Prerequisites
+
+- **Node.js 20+** and npm
+- **Rust** (stable toolchain)
+- **Python 3.11+** with [uv](https://docs.astral.sh/uv/) (recommended) or pip
+- **System libraries (Linux only):**
+  ```bash
+  sudo apt install libgtk-3-dev libwebkit2gtk-4.1-dev libappindicator3-dev librsvg2-dev patchelf xdg-utils
+  ```
+
+### Clone and Install
+
+```bash
+git clone https://repo.anhonesthost.net/MacroPad/voice-to-notes.git
+cd voice-to-notes
+
+# Frontend
+npm install
+
+# Python sidecar
+cd python && pip install -e ".[dev]" && cd ..
+```
+
+### Running in Dev Mode
+
+```bash
+npm run tauri:dev
+```
+
+This starts the Svelte dev server and Tauri with hot reload. The Python sidecar runs from your system Python, so no PyInstaller build is needed in dev mode.
+
+### Building
+
+```bash
+# Build the Python sidecar (frozen binary)
+cd python && python build_sidecar.py --cpu-only && cd ..
+
+# Build the full app
+npm run tauri build
+```
+
+## Project Structure
+
+```
+src/                        # Svelte 5 frontend
+  lib/components/           # Reusable UI components
+  lib/stores/               # Svelte stores (app state)
+  routes/                   # SvelteKit pages
+src-tauri/                  # Rust backend (Tauri v2)
+  src/sidecar/              # Python sidecar lifecycle (download, extract, IPC)
+  src/commands/             # Tauri command handlers
+  src/db/                   # SQLite database layer
+python/                     # Python ML sidecar
+  voice_to_notes/           # Main package
+    services/               # Transcription, diarization, AI, export
+    ipc/                    # JSON-line IPC protocol
+    hardware/               # GPU/CPU detection
+.gitea/workflows/           # CI/CD pipelines
+docs/                       # Documentation
+```
+
+## How It Works
+
+The app has three layers:
+
+1. **Frontend (Svelte)** — UI, audio playback (wavesurfer.js), transcript editing (TipTap)
+2. **Backend (Rust/Tauri)** — Desktop integration, file access, SQLite, sidecar process management
+3. **Sidecar (Python)** — ML inference (faster-whisper, pyannote.audio), AI chat, export
+
+Rust and Python communicate via **JSON-line IPC** over stdin/stdout pipes. Each request has an `id`, `type`, and `payload`. The Python sidecar runs as a child process managed by `SidecarManager` in Rust.
+
+## Conventions
+
+### Rust
+- Follow standard Rust conventions
+- Run `cargo fmt` and `cargo clippy` before committing
+- Tauri commands go in `src-tauri/src/commands/`
+
+### Python
+- Python 3.11+, type hints everywhere
+- Use `ruff` for linting: `ruff check python/`
+- Tests with pytest: `cd python && pytest`
+- IPC messages: JSON-line format with `id`, `type`, `payload` fields
+
+### TypeScript / Svelte
+- Svelte 5 runes (`$state`, `$derived`, `$effect`)
+- Strict TypeScript
+- Components in `src/lib/components/`
+- State in `src/lib/stores/`
+
+### General
+- All timestamps in milliseconds (integer)
+- UUIDs as primary keys in the database
+- Don't bundle API keys or secrets — those are user-configured
+
+## Submitting Changes
+
+1. Fork the repository
+2. Create a feature branch: `git checkout -b my-feature`
+3. Make your changes
+4. Test locally with `npm run tauri:dev`
+5. Run linters: `cargo fmt && cargo clippy`, `ruff check python/`
+6. Commit with a clear message describing the change
+7. Open a Pull Request against `main`
+
+## CI/CD
+
+Pushes to `main` automatically:
+- Bump the app version and create a release (`release.yml`)
+- Build app installers for all platforms
+
+Changes to `python/` also trigger sidecar builds (`build-sidecar.yml`).
+
+## Areas for Contribution
+
+- UI/UX improvements
+- New export formats
+- Additional AI provider integrations
+- Performance optimizations
+- Accessibility improvements
+- Documentation and translations
+- Bug reports and testing on different platforms
+
+## Reporting Issues
+
+Open an issue on the [repository](https://repo.anhonesthost.net/MacroPad/voice-to-notes/issues) with:
+- Steps to reproduce
+- Expected vs actual behavior
+- Platform and version info
+- Sidecar logs (`%LOCALAPPDATA%\com.voicetonotes.app\sidecar.log` on Windows)
+
+## License
+
+By contributing, you agree that your contributions will be licensed under the [MIT License](LICENSE).
--- a/README.md
+++ b/README.md
@@ -1,32 +1,55 @@
 # Voice to Notes

-A desktop application that transcribes audio/video recordings with speaker identification, producing editable transcriptions with synchronized audio playback.
+A desktop application that transcribes audio and video recordings with speaker identification, synchronized playback, and AI-powered analysis. Export to SRT, WebVTT, ASS captions, plain text, or Markdown.

 ## Features

- **Speech-to-Text Transcription** — Accurate transcription via faster-whisper (Whisper models) with word-level timestamps
- **Speaker Identification (Diarization)** — Detect and distinguish between speakers using pyannote.audio
- **Synchronized Playback** — Click any word to seek to that point in the audio (Web Audio API for instant playback)
- **AI Integration** — Ask questions about your transcript via OpenAI, Anthropic, or any OpenAI-compatible API (LiteLLM proxies, Ollama, vLLM)
- **Export Formats** — SRT, WebVTT, ASS captions, plain text, and Markdown with speaker labels
- **Cross-Platform** — Builds for Linux, Windows, and macOS (Apple Silicon)
+- **Speech-to-Text** — Accurate transcription via faster-whisper with word-level timestamps. Supports 99 languages.
+- **Speaker Identification** — Detect and label speakers using pyannote.audio. Rename speakers for clean exports.
+- **GPU Acceleration** — CUDA support for NVIDIA GPUs (Windows/Linux). Falls back to CPU automatically.
+- **Synchronized Playback** — Click any word to seek. Waveform visualization via wavesurfer.js.
+- **AI Chat** — Ask questions about your transcript. Works with Ollama (local), OpenAI, Anthropic, or any OpenAI-compatible API.
+- **Export** — SRT, WebVTT, ASS, plain text, Markdown — all with speaker labels.
+- **Cross-Platform** — Linux, Windows, macOS (Apple Silicon).
+
+## Quick Start
+
+1. Download the installer from [Releases](https://repo.anhonesthost.net/MacroPad/voice-to-notes/releases)
+2. On first launch, choose the **CPU** or **CUDA** sidecar (the AI engine, downloaded separately: ~500 MB for CPU, ~2 GB for CUDA)
+3. Import an audio/video file and click **Transcribe**
+
+See the full [User Guide](docs/USER_GUIDE.md) for detailed setup and usage instructions.

 ## Platform Support

-| Platform | Architecture | Status |
-|----------|-------------|--------|
-| Linux    | x86_64      | Supported |
-| Windows  | x86_64      | Supported |
-| macOS    | ARM (Apple Silicon) | Supported |
+| Platform | Architecture | Installers |
+|----------|-------------|------------|
+| Linux    | x86_64      | .deb, .rpm |
+| Windows  | x86_64      | .msi, .exe (NSIS) |
+| macOS    | ARM (Apple Silicon) | .dmg |
+
+## Architecture
+
+The app is split into two independently versioned components:
+
+- **App** (v0.2.x) — Tauri desktop shell with Svelte frontend. Small installer (~50MB).
+- **Sidecar** (v1.x) — Python ML engine (faster-whisper, pyannote.audio). Downloaded on first launch. CPU (~500MB) or CUDA (~2GB) variants.
+
+This separation means app UI updates don't require re-downloading the sidecar, and sidecar updates don't require reinstalling the app.

 ## Tech Stack

- **Desktop shell:** Tauri v2 (Rust backend + Svelte 5 / TypeScript frontend)
- **ML pipeline:** Python sidecar (faster-whisper, pyannote.audio) — frozen via PyInstaller for distribution
- **Audio playback:** wavesurfer.js with Web Audio API backend
- **AI providers:** OpenAI, Anthropic, OpenAI-compatible endpoints (local or remote)
- **Local AI:** Bundled llama-server (llama.cpp)
- **Caption export:** pysubs2
+| Component | Technology |
+|-----------|-----------|
+| Desktop shell | Tauri v2 (Rust + Svelte 5 / TypeScript) |
+| Transcription | faster-whisper (CTranslate2) |
+| Speaker ID | pyannote.audio 3.1 |
+| Audio UI | wavesurfer.js |
+| Transcript editor | TipTap (ProseMirror) |
+| AI (local) | Ollama (any model) |
+| AI (cloud) | OpenAI, Anthropic, OpenAI-compatible |
+| Caption export | pysubs2 |
+| Database | SQLite (rusqlite) |

 ## Development

@@ -34,8 +57,8 @@ A desktop application that transcribes audio/video recordings with speaker ident

 - Node.js 20+
 - Rust (stable)
- Python 3.11+ with ML dependencies
- System: `libgtk-3-dev`, `libwebkit2gtk-4.1-dev` (Linux)
+- Python 3.11+ with uv or pip
+- Linux: `libgtk-3-dev`, `libwebkit2gtk-4.1-dev`, `libappindicator3-dev`, `librsvg2-dev`

 ### Getting Started

@@ -44,47 +67,61 @@ A desktop application that transcribes audio/video recordings with speaker ident
 npm install

 # Install Python sidecar dependencies
-cd python && pip install -e . && cd ..
+cd python && pip install -e ".[dev]" && cd ..

 # Run in dev mode (uses system Python for the sidecar)
 npm run tauri:dev
 ```

-### Building for Distribution
+### Building

 ```bash
-# Build the frozen Python sidecar
-npm run sidecar:build
+# Build the frozen Python sidecar (CPU-only)
+cd python && python build_sidecar.py --cpu-only && cd ..

-# Build the Tauri app (requires sidecar in src-tauri/binaries/)
+# Build with CUDA support
+cd python && python build_sidecar.py --with-cuda && cd ..
+
+# Build the Tauri app
 npm run tauri build
 ```

 ### CI/CD

-Gitea Actions workflows are in `.gitea/workflows/`. The build pipeline:
+Two Gitea Actions workflows in `.gitea/workflows/`:

-1. **Build sidecar** — PyInstaller-frozen Python binary per platform (CPU-only PyTorch)
-2. **Build Tauri app** — Bundles the sidecar via `externalBin`, produces .deb/.AppImage (Linux), .msi (Windows), .dmg (macOS)
+**`release.yml`** — Triggers on push to main:
+1. Bumps app version (patch), creates git tag and Gitea release
+2. Builds lightweight app installers for all platforms (no sidecar bundled)
+
+**`build-sidecar.yml`** — Triggers on changes to `python/` or manual dispatch:
+1. Bumps sidecar version, creates `sidecar-v*` tag and release
+2. Builds CPU + CUDA variants for Linux/Windows, CPU for macOS
+3. Uploads as separate release assets

 #### Required Secrets

-| Secret | Purpose | Required? |
-|--------|---------|-----------|
-| `TAURI_SIGNING_PRIVATE_KEY` | Signs Tauri update bundles | Optional (for auto-updates) |
-
-No other secrets are needed for building. AI provider API keys and HuggingFace tokens are configured by end users in the app's Settings.
+| Secret | Purpose |
+|--------|---------|
+| `BUILD_TOKEN` | Gitea API token for creating releases and pushing tags |

 ### Project Structure

 ```
-src/                    # Svelte 5 frontend
-src-tauri/              # Rust backend (Tauri commands, sidecar manager, SQLite)
-python/                 # Python sidecar (transcription, diarization, AI)
-  voice_to_notes/       # Python package
-  build_sidecar.py      # PyInstaller build script
-  voice_to_notes.spec   # PyInstaller spec
-.gitea/workflows/       # Gitea Actions CI/CD
+src/                        # Svelte 5 frontend
+  lib/components/           # UI components (waveform, transcript editor, settings, etc.)
+  lib/stores/               # Svelte stores (settings, transcript state)
+  routes/                   # SvelteKit pages
+src-tauri/                  # Rust backend
+  src/sidecar/              # Sidecar process manager (download, extract, IPC)
+  src/commands/             # Tauri command handlers
+  nsis-hooks.nsh            # Windows uninstall cleanup
+python/                     # Python sidecar
+  voice_to_notes/           # Python package (transcription, diarization, AI, export)
+  build_sidecar.py          # PyInstaller build script
+  voice_to_notes.spec       # PyInstaller spec
+.gitea/workflows/           # CI/CD (release.yml, build-sidecar.yml)
+docs/                       # Documentation
 ```

 ## License
--- a/docs/USER_GUIDE.md
+++ b/docs/USER_GUIDE.md
@@ -0,0 +1,203 @@
+# Voice to Notes — User Guide
+
+## Getting Started
+
+### Installation
+
+Download the installer for your platform from the [Releases](https://repo.anhonesthost.net/MacroPad/voice-to-notes/releases) page:
+
+- **Windows:** `.msi` or `-setup.exe`
+- **Linux:** `.deb` or `.rpm`
+- **macOS:** `.dmg`
+
+### First-Time Setup
+
+On first launch, Voice to Notes will prompt you to download its AI engine (the "sidecar"):
+
+1. Choose **Standard (CPU)** (~500 MB) or **GPU Accelerated (CUDA)** (~2 GB)
+   - Choose CUDA if you have an NVIDIA GPU; transcription will be significantly faster
+   - CPU works on all computers
+2. Click **Download & Install** and wait for the download to complete
+3. The app will proceed to the main interface once the sidecar is ready
+
+The sidecar only needs to be downloaded once. Updates are detected automatically on launch.
+
+---
+
+## Basic Workflow
+
+### 1. Import Audio
+
+- Click **Import Audio** or press **Ctrl+O** (Cmd+O on Mac)
+- Supported formats: MP3, WAV, FLAC, OGG, M4A, AAC, WMA, MP4, MKV, AVI, MOV, WebM
+
+### 2. Transcribe
+
+After importing, click **Transcribe** to start the transcription pipeline:
+
+- **Transcription:** Converts speech to text with word-level timestamps
+- **Speaker Detection:** Identifies different speakers (if configured — see [Speaker Detection](#speaker-detection))
+- A progress bar shows the current stage and percentage
+
+### 3. Review and Edit
+
+- The **waveform** displays at the top — click anywhere to seek
+- The **transcript** shows below with speaker labels and timestamps
+- **Click any word** in the transcript to jump to that point in the audio
+- The current word highlights during playback
+- **Edit text** directly in the transcript — word timings are preserved
+
+### 4. Export
+
+Click **Export** and choose a format:
+
+| Format | Extension | Best For |
+|--------|-----------|----------|
+| SRT | `.srt` | Video subtitles (most compatible) |
+| WebVTT | `.vtt` | Web video players, HTML5 |
+| ASS/SSA | `.ass` | Styled subtitles with speaker colors |
+| Plain Text | `.txt` | Reading, sharing, pasting |
+| Markdown | `.md` | Documentation, notes |
+
+All formats include speaker labels when speaker detection is enabled.
+
+### 5. Save Project
+
+- **Ctrl+S** (Cmd+S on Mac) saves the current project as a `.vtn` file
+- This preserves the full transcript, speaker assignments, and edits
+- Reopen later to continue editing or re-export
+
+---
+
+## Playback Controls and Keyboard Shortcuts
+
+| Action | Shortcut |
+|--------|----------|
+| Play / Pause | **Space** |
+| Skip back 5s | **Left Arrow** |
+| Skip forward 5s | **Right Arrow** |
+| Seek to word | Click any word in the transcript |
+| Import audio | **Ctrl+O** / **Cmd+O** |
+| Open settings | **Ctrl+,** / **Cmd+,** |
+
+---
+
+## Speaker Detection
+
+Speaker detection (diarization) identifies who is speaking at each point in the audio. It requires a one-time setup:
+
+### Setup
+
+1. Go to **Settings > Speakers**
+2. Create a free account at [huggingface.co](https://huggingface.co/join)
+3. Accept the license on **all three** model pages:
+   - [pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1)
+   - [pyannote/segmentation-3.0](https://huggingface.co/pyannote/segmentation-3.0)
+   - [pyannote/speaker-diarization-community-1](https://huggingface.co/pyannote/speaker-diarization-community-1)
+4. Create a token at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) (read access is sufficient)
+5. Paste the token in Settings and click **Test & Download Model**
+
+### Speaker Options
+
+- **Number of speakers:** Set to auto-detect or specify a fixed number for faster results
+- **Skip speaker detection:** Check this to only transcribe without identifying speakers
+
+### Managing Speakers
+
+After transcription, speakers appear as "Speaker 1", "Speaker 2", etc. in the left sidebar. Double-click a speaker name to rename it — the new name appears throughout the transcript and in exports.
+
+---
+
+## AI Chat
+
+The AI chat panel lets you ask questions about your transcript. The AI sees the full transcript with speaker labels as context.
+
+Example prompts:
+- "Summarize this conversation"
+- "What were the key action items?"
+- "What did Speaker 1 say about the budget?"
+
+### Setting Up Ollama (Local AI)
+
+[Ollama](https://ollama.com) runs AI models locally on your computer. No API keys are needed, and no internet connection is required once a model has been downloaded.
+
+1. **Install Ollama:**
+   - Download from [ollama.com](https://ollama.com)
+   - Or on Linux: `curl -fsSL https://ollama.com/install.sh | sh`
+
+2. **Pull a model:**
+   ```bash
+   ollama pull llama3.2
+   ```
+   Other good options: `mistral`, `gemma2`, `phi3`
+
+3. **Configure in Voice to Notes:**
+   - Go to **Settings > AI Provider**
+   - Select **Ollama**
+   - URL: `http://localhost:11434` (default, usually no change needed)
+   - Model: `llama3.2` (or whichever model you pulled)
+
+4. **Use:** Open the AI chat panel (right sidebar) and start asking questions
+
+### Cloud AI Providers
+
+If you prefer cloud-based AI:
+
+**OpenAI:**
+- Select **OpenAI** in Settings > AI Provider
+- Enter your API key from [platform.openai.com/api-keys](https://platform.openai.com/api-keys)
+- Default model: `gpt-4o-mini`
+
+**Anthropic:**
+- Select **Anthropic** in Settings > AI Provider
+- Enter your API key from [console.anthropic.com](https://console.anthropic.com)
+- Default model: `claude-sonnet-4-6`
+
+**OpenAI Compatible:**
+- For any provider with an OpenAI-compatible API (vLLM, LiteLLM, etc.)
+- Enter the API base URL, key, and model name
+
+---
+
+## Settings Reference
+
+### Transcription
+
+| Setting | Options | Default |
+|---------|---------|---------|
+| Whisper Model | tiny, base, small, medium, large-v3 | base |
+| Device | CPU, CUDA | CPU |
+| Language | Auto-detect, or specify (en, es, fr, etc.) | Auto-detect |
+
+**Model recommendations:**
+- **tiny/base:** Fast, good for clear audio with one speaker
+- **small:** Best balance of speed and accuracy
+- **medium:** Better accuracy, noticeably slower
+- **large-v3:** Best accuracy, requires 8GB+ VRAM (GPU) or 16GB+ RAM (CPU)
+
+### Debug
+
+- **Enable Developer Tools:** Opens the browser inspector for debugging
+
+---
+
+## Troubleshooting
+
+### Transcription is slow
+- Use a smaller model (tiny or base)
+- If you have an NVIDIA GPU, select CUDA in Settings > Transcription > Device
+- Ensure you downloaded the CUDA sidecar during setup
+
+### Speaker detection not working
+- Verify your HuggingFace token in Settings > Speakers
+- Click "Test & Download Model" to re-download
+- Make sure you accepted the license on all three model pages
+
+### Audio won't play / No waveform
+- Check that the audio file still exists at its original location
+- Try re-importing the file
+- Confirm the file is in a supported format: MP3, WAV, FLAC, OGG, M4A, AAC, WMA, MP4, MKV, AVI, MOV, WebM
+
+### App shows "Setting up Voice to Notes"
+- This is the first-launch sidecar download — it only happens once
+- If it fails, check your internet connection and click Retry