From 35173c54ce3163f3330a9fb31c41c6318630cf52 Mon Sep 17 00:00:00 2001 From: Claude Date: Sun, 22 Mar 2026 12:06:10 -0700 Subject: [PATCH] Update README, add User Guide and Contributing docs - README: Updated to reflect current architecture (decoupled app/sidecar), Ollama as local AI, CUDA support, split CI workflows - USER_GUIDE.md: Complete how-to including first-time setup, transcription workflow, speaker detection setup, Ollama configuration, export formats, keyboard shortcuts, and troubleshooting - CONTRIBUTING.md: Dev setup, project structure, conventions, CI/CD overview Co-Authored-By: Claude Opus 4.6 --- CONTRIBUTING.md | 140 +++++++++++++++++++++++++++++++ README.md | 117 +++++++++++++++++--------- docs/USER_GUIDE.md | 203 +++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 420 insertions(+), 40 deletions(-) create mode 100644 CONTRIBUTING.md create mode 100644 docs/USER_GUIDE.md diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md new file mode 100644 index 0000000..4d85598 --- /dev/null +++ b/CONTRIBUTING.md @@ -0,0 +1,140 @@ +# Contributing to Voice to Notes + +Thank you for your interest in contributing! This guide covers how to set up the project for development and submit changes. + +## Development Setup + +### Prerequisites + +- **Node.js 20+** and npm +- **Rust** (stable toolchain) +- **Python 3.11+** with [uv](https://docs.astral.sh/uv/) (recommended) or pip +- **System libraries (Linux only):** + ```bash + sudo apt install libgtk-3-dev libwebkit2gtk-4.1-dev libappindicator3-dev librsvg2-dev patchelf xdg-utils + ``` + +### Clone and Install + +```bash +git clone https://repo.anhonesthost.net/MacroPad/voice-to-notes.git +cd voice-to-notes + +# Frontend +npm install + +# Python sidecar +cd python && pip install -e ".[dev]" && cd .. +``` + +### Running in Dev Mode + +```bash +npm run tauri:dev +``` + +This runs the Svelte dev server + Tauri with hot-reload. The Python sidecar runs from your system Python (no PyInstaller needed in dev mode). 
+ +### Building + +```bash +# Build the Python sidecar (frozen binary) +cd python && python build_sidecar.py --cpu-only && cd .. + +# Build the full app +npm run tauri build +``` + +## Project Structure + +``` +src/ # Svelte 5 frontend + lib/components/ # Reusable UI components + lib/stores/ # Svelte stores (app state) + routes/ # SvelteKit pages +src-tauri/ # Rust backend (Tauri v2) + src/sidecar/ # Python sidecar lifecycle (download, extract, IPC) + src/commands/ # Tauri command handlers + src/db/ # SQLite database layer +python/ # Python ML sidecar + voice_to_notes/ # Main package + services/ # Transcription, diarization, AI, export + ipc/ # JSON-line IPC protocol + hardware/ # GPU/CPU detection +.gitea/workflows/ # CI/CD pipelines +docs/ # Documentation +``` + +## How It Works + +The app has three layers: + +1. **Frontend (Svelte)** — UI, audio playback (wavesurfer.js), transcript editing (TipTap) +2. **Backend (Rust/Tauri)** — Desktop integration, file access, SQLite, sidecar process management +3. **Sidecar (Python)** — ML inference (faster-whisper, pyannote.audio), AI chat, export + +Rust and Python communicate via **JSON-line IPC** over stdin/stdout pipes. Each request has an `id`, `type`, and `payload`. The Python sidecar runs as a child process managed by `SidecarManager` in Rust. 
+ +## Conventions + +### Rust +- Follow standard Rust conventions +- Run `cargo fmt` and `cargo clippy` before committing +- Tauri commands go in `src-tauri/src/commands/` + +### Python +- Python 3.11+, type hints everywhere +- Use `ruff` for linting: `ruff check python/` +- Tests with pytest: `cd python && pytest` +- IPC messages: JSON-line format with `id`, `type`, `payload` fields + +### TypeScript / Svelte +- Svelte 5 runes (`$state`, `$derived`, `$effect`) +- Strict TypeScript +- Components in `src/lib/components/` +- State in `src/lib/stores/` + +### General +- All timestamps in milliseconds (integer) +- UUIDs as primary keys in the database +- Don't bundle API keys or secrets — those are user-configured + +## Submitting Changes + +1. Fork the repository +2. Create a feature branch: `git checkout -b my-feature` +3. Make your changes +4. Test locally with `npm run tauri:dev` +5. Run linters: `cargo fmt && cargo clippy`, `ruff check python/` +6. Commit with a clear message describing the change +7. Open a Pull Request against `main` + +## CI/CD + +Pushes to `main` automatically: +- Bump the app version and create a release (`release.yml`) +- Build app installers for all platforms + +Changes to `python/` also trigger sidecar builds (`build-sidecar.yml`). + +## Areas for Contribution + +- UI/UX improvements +- New export formats +- Additional AI provider integrations +- Performance optimizations +- Accessibility improvements +- Documentation and translations +- Bug reports and testing on different platforms + +## Reporting Issues + +Open an issue on the [repository](https://repo.anhonesthost.net/MacroPad/voice-to-notes/issues) with: +- Steps to reproduce +- Expected vs actual behavior +- Platform and version info +- Sidecar logs (`%LOCALAPPDATA%\com.voicetonotes.app\sidecar.log` on Windows) + +## License + +By contributing, you agree that your contributions will be licensed under the [MIT License](LICENSE). 
diff --git a/README.md b/README.md index 350a55d..8d35730 100644 --- a/README.md +++ b/README.md @@ -1,32 +1,55 @@ # Voice to Notes -A desktop application that transcribes audio/video recordings with speaker identification, producing editable transcriptions with synchronized audio playback. +A desktop application that transcribes audio and video recordings with speaker identification, synchronized playback, and AI-powered analysis. Export to SRT, WebVTT, ASS captions, plain text, or Markdown. ## Features -- **Speech-to-Text Transcription** — Accurate transcription via faster-whisper (Whisper models) with word-level timestamps -- **Speaker Identification (Diarization)** — Detect and distinguish between speakers using pyannote.audio -- **Synchronized Playback** — Click any word to seek to that point in the audio (Web Audio API for instant playback) -- **AI Integration** — Ask questions about your transcript via OpenAI, Anthropic, or any OpenAI-compatible API (LiteLLM proxies, Ollama, vLLM) -- **Export Formats** — SRT, WebVTT, ASS captions, plain text, and Markdown with speaker labels -- **Cross-Platform** — Builds for Linux, Windows, and macOS (Apple Silicon) +- **Speech-to-Text** — Accurate transcription via faster-whisper with word-level timestamps. Supports 99 languages. +- **Speaker Identification** — Detect and label speakers using pyannote.audio. Rename speakers for clean exports. +- **GPU Acceleration** — CUDA support for NVIDIA GPUs (Windows/Linux). Falls back to CPU automatically. +- **Synchronized Playback** — Click any word to seek. Waveform visualization via wavesurfer.js. +- **AI Chat** — Ask questions about your transcript. Works with Ollama (local), OpenAI, Anthropic, or any OpenAI-compatible API. +- **Export** — SRT, WebVTT, ASS, plain text, Markdown — all with speaker labels. +- **Cross-Platform** — Linux, Windows, macOS (Apple Silicon). + +## Quick Start + +1. 
Download the installer from [Releases](https://repo.anhonesthost.net/MacroPad/voice-to-notes/releases) +2. On first launch, choose **CPU** or **CUDA** sidecar (the AI engine downloads separately, ~500MB–2GB) +3. Import an audio/video file and click **Transcribe** + +See the full [User Guide](docs/USER_GUIDE.md) for detailed setup and usage instructions. ## Platform Support -| Platform | Architecture | Status | -|----------|-------------|--------| -| Linux | x86_64 | Supported | -| Windows | x86_64 | Supported | -| macOS | ARM (Apple Silicon) | Supported | +| Platform | Architecture | Installers | +|----------|-------------|------------| +| Linux | x86_64 | .deb, .rpm | +| Windows | x86_64 | .msi, .exe (NSIS) | +| macOS | ARM (Apple Silicon) | .dmg | + +## Architecture + +The app is split into two independently versioned components: + +- **App** (v0.2.x) — Tauri desktop shell with Svelte frontend. Small installer (~50MB). +- **Sidecar** (v1.x) — Python ML engine (faster-whisper, pyannote.audio). Downloaded on first launch. CPU (~500MB) or CUDA (~2GB) variants. + +This separation means app UI updates don't require re-downloading the sidecar, and sidecar updates don't require reinstalling the app. 
## Tech Stack -- **Desktop shell:** Tauri v2 (Rust backend + Svelte 5 / TypeScript frontend) -- **ML pipeline:** Python sidecar (faster-whisper, pyannote.audio) — frozen via PyInstaller for distribution -- **Audio playback:** wavesurfer.js with Web Audio API backend -- **AI providers:** OpenAI, Anthropic, OpenAI-compatible endpoints (local or remote) -- **Local AI:** Bundled llama-server (llama.cpp) -- **Caption export:** pysubs2 +| Component | Technology | +|-----------|-----------| +| Desktop shell | Tauri v2 (Rust + Svelte 5 / TypeScript) | +| Transcription | faster-whisper (CTranslate2) | +| Speaker ID | pyannote.audio 3.1 | +| Audio UI | wavesurfer.js | +| Transcript editor | TipTap (ProseMirror) | +| AI (local) | Ollama (any model) | +| AI (cloud) | OpenAI, Anthropic, OpenAI-compatible | +| Caption export | pysubs2 | +| Database | SQLite (rusqlite) | ## Development @@ -34,8 +57,8 @@ A desktop application that transcribes audio/video recordings with speaker ident - Node.js 20+ - Rust (stable) -- Python 3.11+ with ML dependencies -- System: `libgtk-3-dev`, `libwebkit2gtk-4.1-dev` (Linux) +- Python 3.11+ with uv or pip +- Linux: `libgtk-3-dev`, `libwebkit2gtk-4.1-dev`, `libappindicator3-dev`, `librsvg2-dev` ### Getting Started @@ -44,47 +67,61 @@ A desktop application that transcribes audio/video recordings with speaker ident npm install # Install Python sidecar dependencies -cd python && pip install -e . && cd .. +cd python && pip install -e ".[dev]" && cd .. # Run in dev mode (uses system Python for the sidecar) npm run tauri:dev ``` -### Building for Distribution +### Building ```bash -# Build the frozen Python sidecar -npm run sidecar:build +# Build the frozen Python sidecar (CPU-only) +cd python && python build_sidecar.py --cpu-only && cd .. -# Build the Tauri app (requires sidecar in src-tauri/binaries/) +# Build with CUDA support +cd python && python build_sidecar.py --with-cuda && cd .. 
+ +# Build the Tauri app npm run tauri build ``` ### CI/CD -Gitea Actions workflows are in `.gitea/workflows/`. The build pipeline: +Two Gitea Actions workflows in `.gitea/workflows/`: -1. **Build sidecar** — PyInstaller-frozen Python binary per platform (CPU-only PyTorch) -2. **Build Tauri app** — Bundles the sidecar via `externalBin`, produces .deb/.AppImage (Linux), .msi (Windows), .dmg (macOS) +**`release.yml`** — Triggers on push to main: +1. Bumps app version (patch), creates git tag and Gitea release +2. Builds lightweight app installers for all platforms (no sidecar bundled) + +**`build-sidecar.yml`** — Triggers on changes to `python/` or manual dispatch: +1. Bumps sidecar version, creates `sidecar-v*` tag and release +2. Builds CPU + CUDA variants for Linux/Windows, CPU for macOS +3. Uploads as separate release assets #### Required Secrets -| Secret | Purpose | Required? | -|--------|---------|-----------| -| `TAURI_SIGNING_PRIVATE_KEY` | Signs Tauri update bundles | Optional (for auto-updates) | - -No other secrets are needed for building. AI provider API keys and HuggingFace tokens are configured by end users in the app's Settings. +| Secret | Purpose | +|--------|---------| +| `BUILD_TOKEN` | Gitea API token for creating releases and pushing tags | ### Project Structure ``` -src/ # Svelte 5 frontend -src-tauri/ # Rust backend (Tauri commands, sidecar manager, SQLite) -python/ # Python sidecar (transcription, diarization, AI) - voice_to_notes/ # Python package - build_sidecar.py # PyInstaller build script - voice_to_notes.spec # PyInstaller spec -.gitea/workflows/ # Gitea Actions CI/CD +src/ # Svelte 5 frontend + lib/components/ # UI components (waveform, transcript editor, settings, etc.) 
+ lib/stores/ # Svelte stores (settings, transcript state) + routes/ # SvelteKit pages +src-tauri/ # Rust backend + src/sidecar/ # Sidecar process manager (download, extract, IPC) + src/commands/ # Tauri command handlers + nsis-hooks.nsh # Windows uninstall cleanup +python/ # Python sidecar + voice_to_notes/ # Python package (transcription, diarization, AI, export) + build_sidecar.py # PyInstaller build script + voice_to_notes.spec # PyInstaller spec +.gitea/workflows/ # CI/CD (release.yml, build-sidecar.yml) +docs/ # Documentation ``` ## License diff --git a/docs/USER_GUIDE.md b/docs/USER_GUIDE.md new file mode 100644 index 0000000..bee5eb0 --- /dev/null +++ b/docs/USER_GUIDE.md @@ -0,0 +1,203 @@ +# Voice to Notes — User Guide + +## Getting Started + +### Installation + +Download the installer for your platform from the [Releases](https://repo.anhonesthost.net/MacroPad/voice-to-notes/releases) page: + +- **Windows:** `.msi` or `-setup.exe` +- **Linux:** `.deb` or `.rpm` +- **macOS:** `.dmg` + +### First-Time Setup + +On first launch, Voice to Notes will prompt you to download its AI engine (the "sidecar"): + +1. Choose **Standard (CPU)** (~500 MB) or **GPU Accelerated (CUDA)** (~2 GB) + - Choose CUDA if you have an NVIDIA GPU for significantly faster transcription + - CPU works on all computers +2. Click **Download & Install** and wait for the download to complete +3. The app will proceed to the main interface once the sidecar is ready + +The sidecar only needs to be downloaded once. Updates are detected automatically on launch. + +--- + +## Basic Workflow + +### 1. Import Audio + +- Click **Import Audio** or press **Ctrl+O** (Cmd+O on Mac) +- Supported formats: MP3, WAV, FLAC, OGG, M4A, AAC, WMA, MP4, MKV, AVI, MOV, WebM + +### 2. 
Transcribe + +After importing, click **Transcribe** to start the transcription pipeline: + +- **Transcription:** Converts speech to text with word-level timestamps +- **Speaker Detection:** Identifies different speakers (if configured — see [Speaker Detection](#speaker-detection)) +- A progress bar shows the current stage and percentage + +### 3. Review and Edit + +- The **waveform** displays at the top — click anywhere to seek +- The **transcript** shows below with speaker labels and timestamps +- **Click any word** in the transcript to jump to that point in the audio +- The current word highlights during playback +- **Edit text** directly in the transcript — word timings are preserved + +### 4. Export + +Click **Export** and choose a format: + +| Format | Extension | Best For | +|--------|-----------|----------| +| SRT | `.srt` | Video subtitles (most compatible) | +| WebVTT | `.vtt` | Web video players, HTML5 | +| ASS/SSA | `.ass` | Styled subtitles with speaker colors | +| Plain Text | `.txt` | Reading, sharing, pasting | +| Markdown | `.md` | Documentation, notes | + +All formats include speaker labels when speaker detection is enabled. + +### 5. Save Project + +- **Ctrl+S** (Cmd+S) saves the current project as a `.vtn` file +- This preserves the full transcript, speaker assignments, and edits +- Reopen later to continue editing or re-export + +--- + +## Playback Controls + +| Action | Shortcut | +|--------|----------| +| Play / Pause | **Space** | +| Skip back 5s | **Left Arrow** | +| Skip forward 5s | **Right Arrow** | +| Seek to word | Click any word in the transcript | +| Import audio | **Ctrl+O** / **Cmd+O** | +| Open settings | **Ctrl+,** / **Cmd+,** | + +--- + +## Speaker Detection + +Speaker detection (diarization) identifies who is speaking at each point in the audio. It requires a one-time setup: + +### Setup + +1. Go to **Settings > Speakers** +2. Create a free account at [huggingface.co](https://huggingface.co/join) +3. 
Accept the license on **all three** model pages: + - [pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1) + - [pyannote/segmentation-3.0](https://huggingface.co/pyannote/segmentation-3.0) + - [pyannote/speaker-diarization-community-1](https://huggingface.co/pyannote/speaker-diarization-community-1) +4. Create a token at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) (read access is sufficient) +5. Paste the token in Settings and click **Test & Download Model** + +### Speaker Options + +- **Number of speakers:** Set to auto-detect or specify a fixed number for faster results +- **Skip speaker detection:** Check this to only transcribe without identifying speakers + +### Managing Speakers + +After transcription, speakers appear as "Speaker 1", "Speaker 2", etc. in the left sidebar. Double-click a speaker name to rename it — the new name appears throughout the transcript and in exports. + +--- + +## AI Chat + +The AI chat panel lets you ask questions about your transcript. The AI sees the full transcript with speaker labels as context. + +Example prompts: +- "Summarize this conversation" +- "What were the key action items?" +- "What did Speaker 1 say about the budget?" + +### Setting Up Ollama (Local AI) + +[Ollama](https://ollama.com) runs AI models locally on your computer — no API keys or internet required. + +1. **Install Ollama:** + - Download from [ollama.com](https://ollama.com) + - Or on Linux: `curl -fsSL https://ollama.com/install.sh | sh` + +2. **Pull a model:** + ```bash + ollama pull llama3.2 + ``` + Other good options: `mistral`, `gemma2`, `phi3` + +3. **Configure in Voice to Notes:** + - Go to **Settings > AI Provider** + - Select **Ollama** + - URL: `http://localhost:11434` (default, usually no change needed) + - Model: `llama3.2` (or whichever model you pulled) + +4. 
**Use:** Open the AI chat panel (right sidebar) and start asking questions + +### Cloud AI Providers + +If you prefer cloud-based AI: + +**OpenAI:** +- Select **OpenAI** in Settings > AI Provider +- Enter your API key from [platform.openai.com/api-keys](https://platform.openai.com/api-keys) +- Default model: `gpt-4o-mini` + +**Anthropic:** +- Select **Anthropic** in Settings > AI Provider +- Enter your API key from [console.anthropic.com](https://console.anthropic.com) +- Default model: `claude-sonnet-4-6` + +**OpenAI Compatible:** +- For any provider with an OpenAI-compatible API (vLLM, LiteLLM, etc.) +- Enter the API base URL, key, and model name + +--- + +## Settings Reference + +### Transcription + +| Setting | Options | Default | +|---------|---------|---------| +| Whisper Model | tiny, base, small, medium, large-v3 | base | +| Device | CPU, CUDA | CPU | +| Language | Auto-detect, or specify (en, es, fr, etc.) | Auto-detect | + +**Model recommendations:** +- **tiny/base:** Fast, good for clear audio with one speaker +- **small:** Best balance of speed and accuracy +- **medium:** Better accuracy, noticeably slower +- **large-v3:** Best accuracy, requires 8GB+ VRAM (GPU) or 16GB+ RAM (CPU) + +### Debug + +- **Enable Developer Tools:** Opens the browser inspector for debugging + +--- + +## Troubleshooting + +### Transcription is slow +- Use a smaller model (tiny or base) +- If you have an NVIDIA GPU, select CUDA in Settings > Transcription > Device +- Ensure you downloaded the CUDA sidecar during setup + +### Speaker detection not working +- Verify your HuggingFace token in Settings > Speakers +- Click "Test & Download Model" to re-download +- Make sure you accepted the license on all three model pages + +### Audio won't play / No waveform +- Check that the audio file still exists at its original location +- Try re-importing the file +- Supported formats: MP3, WAV, FLAC, OGG, M4A, AAC, WMA + +### App shows "Setting up Voice to Notes" +- This is the first-launch 
sidecar download — it only happens once +- If it fails, check your internet connection and click **Retry**
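
---

## Appendix: SRT Cue Format

For reference, the SRT files produced by **Export** use the standard `HH:MM:SS,mmm` timestamp syntax. A minimal sketch of how one cue could be assembled from millisecond timings — an illustrative helper, not the app's actual exporter (which uses pysubs2):

```python
def srt_timestamp(ms: int) -> str:
    """Format integer milliseconds as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"

def srt_cue(index: int, start_ms: int, end_ms: int, speaker: str, text: str) -> str:
    """Build one numbered SRT cue with a speaker label."""
    return (
        f"{index}\n"
        f"{srt_timestamp(start_ms)} --> {srt_timestamp(end_ms)}\n"
        f"{speaker}: {text}\n"
    )

print(srt_cue(1, 61230, 64500, "Speaker 1", "Welcome, everyone."))
# 1
# 00:01:01,230 --> 00:01:04,500
# Speaker 1: Welcome, everyone.
```

Renaming a speaker in the sidebar changes the label prefix (`Speaker 1:` above) in every exported cue.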