# Voice to Notes — User Guide

## Getting Started

### Installation

Download the installer for your platform from the [Releases](https://repo.anhonesthost.net/MacroPad/voice-to-notes/releases) page:

- **Windows:** `.msi` or `-setup.exe`
- **Linux:** `.deb` or `.rpm`
- **macOS:** `.dmg`

### First-Time Setup

On first launch, Voice to Notes will prompt you to download its AI engine (the "sidecar"):

1. Choose **Standard (CPU)** (~500 MB) or **GPU Accelerated (CUDA)** (~2 GB)
   - Choose CUDA if you have an NVIDIA GPU for significantly faster transcription
   - CPU works on all computers
2. Click **Download & Install** and wait for the download to complete
3. The app will proceed to the main interface once the sidecar is ready

The sidecar only needs to be downloaded once. Updates are detected automatically on launch.

---

## Basic Workflow

### 1. Import Audio

- Click **Import Audio** or press **Ctrl+O** (Cmd+O on Mac)
- Supported formats: MP3, WAV, FLAC, OGG, M4A, AAC, WMA, MP4, MKV, AVI, MOV, WebM
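
The format list above amounts to a simple file-extension check. A rough sketch (the set below just restates the list; `is_supported` is a hypothetical helper, not part of the app):

```python
from pathlib import Path

# Formats listed above: audio files plus video containers
SUPPORTED = {
    ".mp3", ".wav", ".flac", ".ogg", ".m4a", ".aac", ".wma",
    ".mp4", ".mkv", ".avi", ".mov", ".webm",
}

def is_supported(filename: str) -> bool:
    """Return True if the file extension is one the importer accepts."""
    return Path(filename).suffix.lower() in SUPPORTED

print(is_supported("interview.MP3"))  # case-insensitive match
print(is_supported("notes.pdf"))
```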

### 2. Transcribe

After importing, click **Transcribe** to start the transcription pipeline:

- **Transcription:** Converts speech to text with word-level timestamps
- **Speaker Detection:** Identifies different speakers (if configured — see [Speaker Detection](#speaker-detection))
- A progress bar shows the current stage and percentage

### 3. Review and Edit

- The **waveform** displays at the top — click anywhere to seek
- The **transcript** shows below with speaker labels and timestamps
- **Click any word** in the transcript to jump to that point in the audio
- The current word is highlighted during playback
- **Edit text** directly in the transcript — word timings are preserved
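
Click-to-seek and playback highlighting both follow from the word-level timestamps produced during transcription. A minimal sketch of the time-to-word lookup, assuming each word carries a start time (the data shape here is illustrative, not the app's actual format):

```python
import bisect

# Illustrative word-level timestamps: each word has a start time in seconds
words = [
    {"word": "Hello", "start": 0.0},
    {"word": "and", "start": 0.6},
    {"word": "welcome", "start": 0.9},
    {"word": "everyone", "start": 1.5},
]
starts = [w["start"] for w in words]

def word_at(t: float) -> str:
    """Binary-search for the word being spoken at playback time t (seconds)."""
    i = bisect.bisect_right(starts, t) - 1
    return words[max(i, 0)]["word"]

print(word_at(1.0))  # "welcome" started at 0.9 and is still current at 1.0
```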

### 4. Export

Click **Export** and choose a format:

| Format | Extension | Best For |
|--------|-----------|----------|
| SRT | `.srt` | Video subtitles (most compatible) |
| WebVTT | `.vtt` | Web video players, HTML5 |
| ASS/SSA | `.ass` | Styled subtitles with speaker colors |
| Plain Text | `.txt` | Reading, sharing, pasting |
| Markdown | `.md` | Documentation, notes |

All formats include speaker labels when speaker detection is enabled.
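
For a feel of how the two subtitle formats above differ: SRT uses comma-separated milliseconds and no header, while WebVTT adds a `WEBVTT` header line and uses a dot separator. A small illustrative converter (not the app's exporter):

```python
def srt_to_vtt(srt: str) -> str:
    """Convert SRT cue text to WebVTT: prepend the WEBVTT header and
    switch the millisecond separator in timing lines from comma to dot."""
    out = []
    for line in srt.splitlines():
        if "-->" in line:  # timing lines look like 00:00:01,000 --> 00:00:02,500
            line = line.replace(",", ".")
        out.append(line)
    return "WEBVTT\n\n" + "\n".join(out)

srt = "1\n00:00:01,000 --> 00:00:02,500\nSpeaker 1: Hello everyone."
print(srt_to_vtt(srt))
```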

### 5. Save Project

- **Ctrl+S** (Cmd+S) saves the current project as a `.vtn` file
- This preserves the full transcript, speaker assignments, and edits
- Reopen later to continue editing or re-export

---

## Playback Controls

| Action | Shortcut |
|--------|----------|
| Play / Pause | **Space** |
| Skip back 5s | **Left Arrow** |
| Skip forward 5s | **Right Arrow** |
| Seek to word | Click any word in the transcript |
| Import audio | **Ctrl+O** / **Cmd+O** |
| Open settings | **Ctrl+,** / **Cmd+,** |
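
The arrow-key skips amount to a clamped seek. A sketch, assuming (not confirmed by this guide) that skipping past either end of the audio stops at the boundary:

```python
def skip(current: float, delta: float, duration: float) -> float:
    """Move the playhead by delta seconds, clamped to [0, duration]."""
    return min(max(current + delta, 0.0), duration)

print(skip(2.0, -5.0, 60.0))  # Left Arrow near the start stops at 0.0
print(skip(58.0, 5.0, 60.0))  # Right Arrow near the end stops at the duration
```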

---

## Speaker Detection

Speaker detection (diarization) identifies who is speaking at each point in the audio. It requires a one-time setup:

### Setup

1. Go to **Settings > Speakers**
2. Create a free account at [huggingface.co](https://huggingface.co/join)
3. Accept the license on **all three** model pages:
   - [pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1)
   - [pyannote/segmentation-3.0](https://huggingface.co/pyannote/segmentation-3.0)
   - [pyannote/speaker-diarization-community-1](https://huggingface.co/pyannote/speaker-diarization-community-1)
4. Create a token at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) (read access is sufficient)
5. Paste the token in Settings and click **Test & Download Model**

### Speaker Options

- **Number of speakers:** Set to auto-detect, or specify a fixed number for faster results
- **Skip speaker detection:** Check this to transcribe only, without identifying speakers

### Managing Speakers

After transcription, speakers appear as "Speaker 1", "Speaker 2", etc. in the left sidebar. Double-click a speaker name to rename it — the new name appears throughout the transcript and in exports.
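
Conceptually, diarization produces a list of timed speaker segments, and each transcribed word gets the label of the segment it falls in. A simplified sketch with made-up data (the real pipeline's data shapes may differ):

```python
# Illustrative diarization output: (start, end, speaker) segments
segments = [(0.0, 2.0, "Speaker 1"), (2.0, 5.0, "Speaker 2")]
# Illustrative transcription output: (word, start, end)
words = [("Hi", 0.1, 0.4), ("there", 0.5, 0.8), ("thanks", 2.2, 2.6)]

def label_words(words, segments):
    """Assign each word the speaker whose segment covers the word's midpoint."""
    labeled = []
    for text, start, end in words:
        mid = (start + end) / 2
        speaker = next((s for a, b, s in segments if a <= mid < b), "Unknown")
        labeled.append((speaker, text))
    return labeled

print(label_words(words, segments))
```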

---

## AI Chat

The AI chat panel lets you ask questions about your transcript. The AI sees the full transcript with speaker labels as context.

Example prompts:

- "Summarize this conversation"
- "What were the key action items?"
- "What did Speaker 1 say about the budget?"
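
Under the hood, this amounts to prepending the labeled transcript to your question. A rough sketch of such prompt assembly (illustrative only; the app's actual prompt template is not documented here):

```python
def build_prompt(transcript: str, question: str) -> str:
    """Combine the speaker-labeled transcript and the user's question."""
    return (
        "You are answering questions about the following transcript.\n\n"
        f"{transcript}\n\n"
        f"Question: {question}"
    )

transcript = "Speaker 1: The budget is too tight.\nSpeaker 2: Agreed."
print(build_prompt(transcript, "What did Speaker 1 say about the budget?"))
```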

### Setting Up Ollama (Local AI)

[Ollama](https://ollama.com) runs AI models locally on your computer — no API keys or internet required.

1. **Install Ollama:**
   - Download from [ollama.com](https://ollama.com)
   - Or on Linux: `curl -fsSL https://ollama.com/install.sh | sh`

2. **Pull a model:**

   ```bash
   ollama pull llama3.2
   ```

   Other good options: `mistral`, `gemma2`, `phi3`

3. **Configure in Voice to Notes:**
   - Go to **Settings > AI Provider**
   - Select **Ollama**
   - URL: `http://localhost:11434` (default, usually no change needed)
   - Model: `llama3.2` (or whichever model you pulled)

4. **Use:** Open the AI chat panel (right sidebar) and start asking questions
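
For reference, a local Ollama server can also be queried directly over HTTP at the same URL configured above, via its `/api/generate` endpoint. A sketch of a non-streaming request; `ask_ollama` is a hypothetical helper and requires `ollama serve` to be running:

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434"  # same default URL as in Settings

def build_payload(prompt: str, model: str = "llama3.2") -> dict:
    """Request body for Ollama's /api/generate endpoint (non-streaming)."""
    return {"model": model, "prompt": prompt, "stream": False}

def ask_ollama(prompt: str, model: str = "llama3.2") -> str:
    """POST the prompt to a locally running Ollama server and return its reply."""
    data = json.dumps(build_payload(prompt, model)).encode()
    req = request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:  # fails unless Ollama is running
        return json.load(resp)["response"]
```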

### Cloud AI Providers

If you prefer cloud-based AI:

**OpenAI:**

- Select **OpenAI** in Settings > AI Provider
- Enter your API key from [platform.openai.com/api-keys](https://platform.openai.com/api-keys)
- Default model: `gpt-4o-mini`

**Anthropic:**

- Select **Anthropic** in Settings > AI Provider
- Enter your API key from [console.anthropic.com](https://console.anthropic.com)
- Default model: `claude-sonnet-4-6`

**OpenAI Compatible:**

- For any provider with an OpenAI-compatible API (vLLM, LiteLLM, etc.)
- Enter the API base URL, key, and model name
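
OpenAI-compatible providers share one request shape: the model name plus a `messages` list, POSTed to `<base URL>/chat/completions`. A sketch of building such a request (the URL and model name below are placeholders):

```python
def chat_request(base_url: str, model: str, question: str):
    """Build the endpoint URL and JSON body for an OpenAI-compatible
    chat-completions call (works for vLLM, LiteLLM, and similar servers)."""
    url = base_url.rstrip("/") + "/chat/completions"
    body = {
        "model": model,
        "messages": [{"role": "user", "content": question}],
    }
    return url, body

url, body = chat_request("http://localhost:8000/v1", "my-model", "Summarize this")
print(url)  # the base URL gains the standard /chat/completions suffix
```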

---

## Settings Reference

### Transcription

| Setting | Options | Default |
|---------|---------|---------|
| Whisper Model | tiny, base, small, medium, large-v3 | base |
| Device | CPU, CUDA | CPU |
| Language | Auto-detect, or specify (en, es, fr, etc.) | Auto-detect |

**Model recommendations:**

- **tiny/base:** Fast, good for clear audio with one speaker
- **small:** Best balance of speed and accuracy
- **medium:** Better accuracy, noticeably slower
- **large-v3:** Best accuracy, requires 8GB+ VRAM (GPU) or 16GB+ RAM (CPU)

### Debug

- **Enable Developer Tools:** Opens the browser inspector for debugging

---

## Troubleshooting

### Transcription is slow

- Use a smaller model (tiny or base)
- If you have an NVIDIA GPU, select CUDA in Settings > Transcription > Device
- Ensure you downloaded the CUDA sidecar during setup

### Speaker detection not working

- Verify your HuggingFace token in Settings > Speakers
- Click **Test & Download Model** to re-download
- Make sure you accepted the license on all three model pages

### Audio won't play / No waveform

- Check that the audio file still exists at its original location
- Try re-importing the file
- Supported formats: MP3, WAV, FLAC, OGG, M4A, AAC, WMA

### App shows "Setting up Voice to Notes"

- This is the first-launch sidecar download — it only happens once
- If it fails, check your internet connection and click **Retry**