
Voice to Notes — User Guide

Getting Started

Installation

Download the installer for your platform from the Releases page:

  • Windows: .msi or -setup.exe
  • Linux: .deb or .rpm
  • macOS: .dmg

First-Time Setup

On first launch, Voice to Notes will prompt you to download its AI engine (the "sidecar"):

  1. Choose Standard (CPU) (~500 MB) or GPU Accelerated (CUDA) (~2 GB)
    • Choose CUDA for significantly faster transcription if you have an NVIDIA GPU
    • CPU works on all computers
  2. Click Download & Install and wait for the download to complete
  3. The app will proceed to the main interface once the sidecar is ready

The sidecar only needs to be downloaded once. Updates are detected automatically on launch.


Basic Workflow

1. Import Audio or Video

  • Click Import Audio or press Ctrl+O (Cmd+O on Mac)
  • Audio formats: MP3, WAV, FLAC, OGG, M4A, AAC, WMA
  • Video formats: MP4, MKV, AVI, MOV, WebM — audio is automatically extracted

Note: Video file import requires FFmpeg to be installed on your system.
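The app runs this extraction automatically when FFmpeg is available. If you want to test FFmpeg or pre-extract the audio yourself, a typical command looks like the following (illustrative flags — the app's own extraction settings may differ):

```shell
# Extract the audio track from a video into a mono 16 kHz WAV
# (illustrative flags; the app's own invocation may differ)
ffmpeg -i recording.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 recording.wav
```

The resulting recording.wav can then be imported like any other audio file.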

2. Transcribe

After importing, click Transcribe to start the transcription pipeline:

  • Transcription: Converts speech to text with word-level timestamps
  • Speaker Detection: Identifies different speakers (if configured — see Speaker Detection)
  • A progress bar shows the current stage and percentage

3. Review and Edit

  • The waveform displays at the top — click anywhere to seek
  • The transcript shows below with speaker labels and timestamps
  • Click any word in the transcript to jump to that point in the audio
  • The current word highlights during playback
  • Edit text directly in the transcript — word timings are preserved

4. Export

Click Export and choose a format:

Format       Extension   Best For
SRT          .srt        Video subtitles (most compatible)
WebVTT       .vtt        Web video players, HTML5
ASS/SSA      .ass        Styled subtitles with speaker colors
Plain Text   .txt        Reading, sharing, pasting
Markdown     .md         Documentation, notes

All formats include speaker labels when speaker detection is enabled.

5. Save Project

  • Ctrl+S (Cmd+S) saves the current project as a .vtn file
  • This preserves the full transcript, speaker assignments, and edits
  • Reopen later to continue editing or re-export

Playback Controls

Action            Shortcut
Play / Pause      Space
Skip back 5s      Left Arrow
Skip forward 5s   Right Arrow
Seek to word      Click any word in the transcript
Import audio      Ctrl+O / Cmd+O
Open settings     Ctrl+, / Cmd+,

Speaker Detection

Speaker detection (diarization) identifies who is speaking at each point in the audio. It requires a one-time setup:

Setup

  1. Go to Settings > Speakers
  2. Create a free account at huggingface.co
  3. Accept the license on all three model pages
  4. Create a token at huggingface.co/settings/tokens (read access is sufficient)
  5. Paste the token in Settings and click Test & Download Model

Speaker Options

  • Number of speakers: Set to auto-detect or specify a fixed number for faster results
  • Skip speaker detection: Check this to only transcribe without identifying speakers

Managing Speakers

After transcription, speakers appear as "Speaker 1", "Speaker 2", etc. in the left sidebar. Double-click a speaker name to rename it — the new name appears throughout the transcript and in exports.


AI Chat

The AI chat panel lets you ask questions about your transcript. The AI sees the full transcript with speaker labels as context.

Example prompts:

  • "Summarize this conversation"
  • "What were the key action items?"
  • "What did Speaker 1 say about the budget?"

Setting Up Ollama (Local AI)

Ollama runs AI models locally on your computer — no API keys or internet required.

  1. Install Ollama:

    • Download from ollama.com
    • Or on Linux: curl -fsSL https://ollama.com/install.sh | sh
  2. Pull a model:

    ollama pull llama3.2
    

    Other good options: mistral, gemma2, phi3

  3. Configure in Voice to Notes:

    • Go to Settings > AI Provider
    • Select Ollama
    • URL: http://localhost:11434 (default, usually no change needed)
    • Model: llama3.2 (or whichever model you pulled)
  4. Use: Open the AI chat panel (right sidebar) and start asking questions
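Before configuring, you can confirm the Ollama server is reachable on its default port by querying its model list (an optional sanity check, not something the app requires):

```shell
# Returns a JSON list of installed models if Ollama is running on
# the default port; a connection error means the server is not up
curl -s http://localhost:11434/api/tags
```
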

Cloud AI Providers

If you prefer cloud-based AI:

OpenAI:

  • Select OpenAI in Settings > AI Provider
  • Enter your API key from platform.openai.com

Anthropic:

  • Select Anthropic in Settings > AI Provider
  • Enter your API key from console.anthropic.com
  • Default model: claude-sonnet-4-6

OpenAI Compatible:

  • For any provider with an OpenAI-compatible API (vLLM, LiteLLM, etc.)
  • Enter the API base URL, key, and model name
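To sanity-check such an endpoint outside the app, you can send a minimal chat request by hand. BASE_URL, API_KEY, and MODEL below are placeholders — substitute your provider's actual values:

```shell
# Minimal chat-completion request against an OpenAI-compatible API.
# All three values are placeholders for your provider's settings:
BASE_URL="http://localhost:8000"
API_KEY="sk-placeholder"
MODEL="my-model"

curl -s "$BASE_URL/v1/chat/completions" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"$MODEL\", \"messages\": [{\"role\": \"user\", \"content\": \"Hello\"}]}"
```

A JSON response with a `choices` array means the endpoint is compatible and your key is accepted.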

Settings Reference

Transcription

Setting         Options                                      Default
Whisper Model   tiny, base, small, medium, large-v3          base
Device          CPU, CUDA                                    CPU
Language        Auto-detect, or specify (en, es, fr, etc.)   Auto-detect

Model recommendations:

  • tiny/base: Fast, good for clear audio with one speaker
  • small: Best balance of speed and accuracy
  • medium: Better accuracy, noticeably slower
  • large-v3: Best accuracy, requires 8GB+ VRAM (GPU) or 16GB+ RAM (CPU)

Debug

  • Enable Developer Tools: Opens the browser inspector for debugging

Installing FFmpeg

FFmpeg is required for importing video files (MP4, MKV, AVI, etc.). It's used to extract the audio track before transcription.

Windows:

winget install ffmpeg

Or download a build from ffmpeg.org/download.html and add it to your PATH.

macOS:

brew install ffmpeg

Linux (Debian/Ubuntu):

sudo apt install ffmpeg

Linux (Fedora/RHEL):

sudo dnf install ffmpeg

After installing, restart Voice to Notes. FFmpeg is not needed for audio-only files (MP3, WAV, FLAC, etc.).
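To confirm the install worked, check that the ffmpeg binary is visible on your PATH:

```shell
# Prints the FFmpeg version line if the binary is on PATH,
# otherwise reports that it was not found
command -v ffmpeg >/dev/null 2>&1 && ffmpeg -version | head -n 1 \
  || echo "ffmpeg not found on PATH"
```

If it prints "ffmpeg not found on PATH", revisit the install steps above before retrying a video import.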


Troubleshooting

Video import fails / "FFmpeg not found"

  • Install FFmpeg using the instructions above
  • Make sure ffmpeg is in your system PATH
  • Restart Voice to Notes after installing

Transcription is slow

  • Use a smaller model (tiny or base)
  • If you have an NVIDIA GPU, select CUDA in Settings > Transcription > Device
  • Ensure you downloaded the CUDA sidecar during setup

Speaker detection not working

  • Verify your HuggingFace token in Settings > Speakers
  • Click "Test & Download Model" to re-download
  • Make sure you accepted the license on all three model pages
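If you want to check the token itself outside the app, HuggingFace exposes a whoami endpoint. Replace hf_xxx below with your token — a valid token returns your account details, an invalid one returns an error:

```shell
# Checks a HuggingFace access token; hf_xxx is a placeholder
curl -s -H "Authorization: Bearer hf_xxx" https://huggingface.co/api/whoami-v2
```
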

Audio won't play / No waveform

  • Check that the audio file still exists at its original location
  • Try re-importing the file
  • Supported formats: MP3, WAV, FLAC, OGG, M4A, AAC, WMA

App shows "Setting up Voice to Notes"

  • This is the first-launch sidecar download — it only happens once
  • If it fails, check your internet connection and click Retry