
Voice to Notes — User Guide

Getting Started

Installation

Download the installer for your platform from the Releases page:

  • Windows: .msi or -setup.exe
  • Linux: .deb or .rpm
  • macOS: .dmg

First-Time Setup

On first launch, Voice to Notes will prompt you to download its AI engine (the "sidecar"):

  1. Choose Standard (CPU) (~500 MB) or GPU Accelerated (CUDA) (~2 GB)
    • Choose CUDA for significantly faster transcription if you have an NVIDIA GPU
    • CPU works on all computers
  2. Click Download & Install and wait for the download to complete
  3. The app will proceed to the main interface once the sidecar is ready

The sidecar only needs to be downloaded once. Updates are detected automatically on launch.


Basic Workflow

1. Import Audio or Video

  • Click Import Audio or press Ctrl+O (Cmd+O on Mac)
  • Audio formats: MP3, WAV, FLAC, OGG, M4A, AAC, WMA
  • Video formats: MP4, MKV, AVI, MOV, WebM — audio is automatically extracted

Note: Video file import requires FFmpeg to be installed on your system.
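The app runs this extraction automatically when FFmpeg is available. If you want to test FFmpeg or pre-extract the audio yourself, a typical command looks like the following (illustrative flags — the app's own extraction settings may differ):

```shell
# Extract the audio track from a video into a mono 16 kHz WAV
# (illustrative flags; the app's own invocation may differ)
ffmpeg -i recording.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 recording.wav
```

The resulting recording.wav can then be imported like any other audio file.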

2. Transcribe

After importing, click Transcribe to start the transcription pipeline:

  • Transcription: Converts speech to text with word-level timestamps
  • Speaker Detection: Identifies different speakers (if configured — see Speaker Detection)
  • A progress bar shows the current stage and percentage

3. Review and Edit

  • The waveform displays at the top — click anywhere to seek
  • The transcript shows below with speaker labels and timestamps
  • Click any word in the transcript to jump to that point in the audio
  • The current word highlights during playback
  • Edit text directly in the transcript — word timings are preserved

4. Export

Click Export and choose a format:

Format       Extension   Best For
SRT          .srt        Video subtitles (most compatible)
WebVTT       .vtt        Web video players, HTML5
ASS/SSA      .ass        Styled subtitles with speaker colors
Plain Text   .txt        Reading, sharing, pasting
Markdown     .md         Documentation, notes

All formats include speaker labels when speaker detection is enabled.

5. Save Project

  • Ctrl+S (Cmd+S) saves the current project as a .vtn file
  • This preserves the full transcript, speaker assignments, and edits
  • Reopen later to continue editing or re-export

Playback Controls

Action            Shortcut
Play / Pause      Space
Skip back 5s      Left Arrow
Skip forward 5s   Right Arrow
Seek to word      Click any word in the transcript
Import audio      Ctrl+O / Cmd+O
Open settings     Ctrl+, / Cmd+,

Speaker Detection

Speaker detection (diarization) identifies who is speaking at each point in the audio. It requires a one-time setup:

Setup

  1. Go to Settings > Speakers
  2. Create a free account at huggingface.co
  3. Accept the license on all three model pages
  4. Create a token at huggingface.co/settings/tokens (read access is sufficient)
  5. Paste the token in Settings and click Test & Download Model

Speaker Options

  • Number of speakers: Set to auto-detect or specify a fixed number for faster results
  • Skip speaker detection: Check this to only transcribe without identifying speakers

Managing Speakers

After transcription, speakers appear as "Speaker 1", "Speaker 2", etc. in the left sidebar. Double-click a speaker name to rename it — the new name appears throughout the transcript and in exports.


AI Chat

The AI chat panel lets you ask questions about your transcript. The AI sees the full transcript with speaker labels as context.

Example prompts:

  • "Summarize this conversation"
  • "What were the key action items?"
  • "What did Speaker 1 say about the budget?"

Setting Up Ollama (Local AI)

Ollama runs AI models locally on your computer — no API keys or internet required.

  1. Install Ollama:

    • Download from ollama.com
    • Or on Linux: curl -fsSL https://ollama.com/install.sh | sh
  2. Pull a model:

    ollama pull llama3.2
    

    Other good options: mistral, gemma2, phi3

  3. Configure in Voice to Notes:

    • Go to Settings > AI Provider
    • Select Ollama
    • URL: http://localhost:11434 (default, usually no change needed)
    • Model: llama3.2 (or whichever model you pulled)
  4. Use: Open the AI chat panel (right sidebar) and start asking questions
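Before configuring, you can confirm the Ollama server is reachable on its default port by querying its model list (an optional sanity check, not something the app requires):

```shell
# Returns a JSON list of installed models if Ollama is running on
# the default port; a connection error means the server is not up
curl -s http://localhost:11434/api/tags
```
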

Cloud AI Providers

If you prefer cloud-based AI:

OpenAI:

  • Select OpenAI in Settings > AI Provider
  • Enter your API key from platform.openai.com

Anthropic:

  • Select Anthropic in Settings > AI Provider
  • Enter your API key from console.anthropic.com
  • Default model: claude-sonnet-4-6

OpenAI Compatible:

  • For any provider with an OpenAI-compatible API (vLLM, LiteLLM, etc.)
  • Enter the API base URL, key, and model name
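To sanity-check such an endpoint outside the app, you can send a minimal chat request by hand. BASE_URL, API_KEY, and MODEL below are placeholders — substitute your provider's actual values:

```shell
# Minimal chat-completion request against an OpenAI-compatible API.
# All three values are placeholders for your provider's settings:
BASE_URL="http://localhost:8000"
API_KEY="sk-placeholder"
MODEL="my-model"

curl -s "$BASE_URL/v1/chat/completions" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"$MODEL\", \"messages\": [{\"role\": \"user\", \"content\": \"Hello\"}]}"
```

A JSON response with a `choices` array means the endpoint is compatible and your key is accepted.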

Settings Reference

Transcription

Setting         Options                                      Default
Whisper Model   tiny, base, small, medium, large-v3          base
Device          CPU, CUDA                                    CPU
Language        Auto-detect, or specify (en, es, fr, etc.)   Auto-detect

Model recommendations:

  • tiny/base: Fast, good for clear audio with one speaker
  • small: Best balance of speed and accuracy
  • medium: Better accuracy, noticeably slower
  • large-v3: Best accuracy, requires 8GB+ VRAM (GPU) or 16GB+ RAM (CPU)

Debug

  • Enable Developer Tools: Opens the browser inspector for debugging

Installing FFmpeg

FFmpeg is required for importing video files (MP4, MKV, AVI, etc.). It's used to extract the audio track before transcription.

Windows:

winget install ffmpeg

Or download a build from ffmpeg.org/download.html and add it to your PATH.

macOS:

brew install ffmpeg

Linux (Debian/Ubuntu):

sudo apt install ffmpeg

Linux (Fedora/RHEL):

sudo dnf install ffmpeg

After installing, restart Voice to Notes. FFmpeg is not needed for audio-only files (MP3, WAV, FLAC, etc.).
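To confirm the install worked, check that the ffmpeg binary is visible on your PATH:

```shell
# Prints the FFmpeg version line if the binary is on PATH,
# otherwise reports that it was not found
command -v ffmpeg >/dev/null 2>&1 && ffmpeg -version | head -n 1 \
  || echo "ffmpeg not found on PATH"
```

If it prints "ffmpeg not found on PATH", revisit the install steps above before retrying a video import.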


Troubleshooting

Video import fails / "FFmpeg not found"

  • Install FFmpeg using the instructions above
  • Make sure ffmpeg is in your system PATH
  • Restart Voice to Notes after installing

Transcription is slow

  • Use a smaller model (tiny or base)
  • If you have an NVIDIA GPU, select CUDA in Settings > Transcription > Device
  • Ensure you downloaded the CUDA sidecar during setup

Speaker detection not working

  • Verify your HuggingFace token in Settings > Speakers
  • Click "Test & Download Model" to re-download
  • Make sure you accepted the license on all three model pages
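If you want to check the token itself outside the app, HuggingFace exposes a whoami endpoint. Replace hf_xxx below with your token — a valid token returns your account details, an invalid one returns an error:

```shell
# Checks a HuggingFace access token; hf_xxx is a placeholder
curl -s -H "Authorization: Bearer hf_xxx" https://huggingface.co/api/whoami-v2
```
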

Audio won't play / No waveform

  • Check that the audio file still exists at its original location
  • Try re-importing the file
  • Supported formats: MP3, WAV, FLAC, OGG, M4A, AAC, WMA

App shows "Setting up Voice to Notes"

  • This is the first-launch sidecar download — it only happens once
  • If it fails, check your internet connection and click Retry