streamer-tools/local-transcription

Developer 37a029d1c6

Show app version from Tauri instead of sidecar

The version label was reading from backendStore.version which comes
from the sidecar's version.py (hardcoded at build time). Now uses
Tauri's getVersion() API which reads from tauri.conf.json -- the
actual app version that gets bumped by the release workflow.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-08 13:45:53 -07:00

.claude

Phase 6: Add Deepgram remote transcription (managed + BYOK modes)

2026-04-05 11:45:30 -07:00

.gitea/workflows

Fix release workflow false failure on successful dispatch

2026-04-08 12:49:53 -07:00

backend

Fix Start button not updating: unblock the event loop

2026-04-08 12:43:49 -07:00

client

Fix Deepgram broken pipe: wait for WebSocket before starting audio

2026-04-08 12:18:47 -07:00

config

Phase 6: Add Deepgram remote transcription (managed + BYOK modes)

2026-04-05 11:45:30 -07:00

gui

Phase 6: Add Deepgram remote transcription (managed + BYOK modes)

2026-04-05 11:45:30 -07:00

hooks

Fix PyInstaller hook error for webrtcvad package

2025-12-28 19:49:45 -08:00

LocalTranscription.iconset

Add application icon support for GUI and compiled executables

2025-12-28 18:59:24 -08:00

server

Add user-configurable colors for transcription display

2026-01-20 20:59:13 -08:00

src

Show app version from Tauri instead of sidecar

2026-04-08 13:45:53 -07:00

src-tauri

chore: bump version to 2.0.12 [skip ci]

2026-04-08 20:22:08 +00:00

.gitignore

Remove Zone.Identifier files that break Windows checkout

2026-04-06 14:02:11 -07:00

2025-live-transcription-research.md

Migrate to RealtimeSTT for advanced VAD-based transcription

2025-12-28 18:48:29 -08:00

build.bat

Fix enum34 error by excluding it in PyInstaller spec

2025-12-28 19:32:23 -08:00

BUILD.md

Simplify build process: CUDA support now included by default

2025-12-28 19:09:36 -08:00

build.sh

Fix enum34 error by excluding it in PyInstaller spec

2025-12-28 19:32:23 -08:00

check_cuda.py

Add CUDA diagnostic script for troubleshooting GPU detection

2025-12-26 12:00:37 -08:00

CLAUDE.md

Split CI workflows into per-OS files for independent re-runs

2026-04-06 17:35:25 -07:00

create_icons.py

Add application icon support for GUI and compiled executables

2025-12-28 18:59:24 -08:00

DEBUG_4_SECOND_LAG.md

Fix multi-user server sync performance and integration

2025-12-26 16:44:55 -08:00

DEEPGRAM_PROXY_PLAN.md

Phase 6: Add Deepgram remote transcription (managed + BYOK modes)

2026-04-05 11:45:30 -07:00

FIX_2_SECOND_HTTP_DELAY.md

Fix multi-user server sync performance and integration

2025-12-26 16:44:55 -08:00

FIXES_APPLIED.md

Fix multi-user server sync performance and integration

2025-12-26 16:44:55 -08:00

index.html

Add Tauri v2 + Svelte 5 frontend and headless Python backend

2026-04-06 10:20:25 -07:00

INSTALL_REALTIMESTT.md

Migrate to RealtimeSTT for advanced VAD-based transcription

2025-12-28 18:48:29 -08:00

INSTALL.md

Migrate to RealtimeSTT for advanced VAD-based transcription

2025-12-28 18:48:29 -08:00

LATENCY_GUIDE.md

Fix multi-user server sync performance and integration

2025-12-26 16:44:55 -08:00

local-transcription-cloud.spec

Add cloud-only sidecar variant (~50MB vs 500MB-2GB)

2026-04-07 16:57:43 -07:00

local-transcription-headless.spec

Add Gitea CI/CD workflows for cross-platform builds

2026-04-06 11:44:34 -07:00

local-transcription.spec

Set console=False for production builds

2025-12-28 20:46:31 -08:00

LocalTranscription.ico

Add application icon support for GUI and compiled executables

2025-12-28 18:59:24 -08:00

LocalTranscription.png

Add application icon support for GUI and compiled executables

2025-12-28 18:59:24 -08:00

main_cli.py

Migrate to RealtimeSTT for advanced VAD-based transcription

2025-12-28 18:48:29 -08:00

main.py

Add unified per-speaker font support and remote transcription service

2026-01-11 19:09:57 -08:00

NEXT_STEPS.md

Initial commit: Local Transcription App v1.0

2025-12-25 18:48:23 -08:00

package-lock.json

Add test suite (63 tests) and CI workflow, fix Settings API bugs

2026-04-07 07:48:36 -07:00

package.json

chore: bump version to 2.0.12 [skip ci]

2026-04-08 20:22:08 +00:00

PERFORMANCE_FIX.md

Fix multi-user server sync performance and integration

2025-12-26 16:44:55 -08:00

pyproject.toml

chore: bump sidecar version to 1.0.10 [skip ci]

2026-04-08 20:27:00 +00:00

README.md

Document macOS quarantine workaround in README

2026-04-08 11:02:55 -07:00

requirements.txt

Initial commit: Local Transcription App v1.0

2025-12-25 18:48:23 -08:00

SESSION_SUMMARY.md

Fix multi-user server sync performance and integration

2025-12-26 16:44:55 -08:00

svelte.config.js

Add Tauri v2 + Svelte 5 frontend and headless Python backend

2026-04-06 10:20:25 -07:00

test_components.py

Initial commit: Local Transcription App v1.0

2025-12-25 18:48:23 -08:00

test-server-timing.sh

Fix multi-user server sync performance and integration

2025-12-26 16:44:55 -08:00

tsconfig.json

Add Tauri v2 + Svelte 5 frontend and headless Python backend

2026-04-06 10:20:25 -07:00

version.py

chore: bump version to 2.0.12 [skip ci]

2026-04-08 20:22:08 +00:00

vite.config.ts

Add test suite (63 tests) and CI workflow, fix Settings API bugs

2026-04-07 07:48:36 -07:00

README.md

Local Transcription

A real-time speech-to-text desktop application for streamers. It runs locally on your machine on GPU or CPU, displays transcriptions through an OBS browser source, and can optionally sync with other users through a multi-user server.

Version 1.4.0

Features

Real-Time Transcription: Live speech-to-text using Whisper models with minimal latency
Cross-Platform: Native desktop app for Windows, macOS, and Linux via Tauri
Dual Transcription Modes: Local (Whisper) or cloud (Deepgram) with managed billing or BYOK
CPU & GPU Support: Automatic detection of CUDA (NVIDIA), MPS (Apple Silicon), or CPU fallback
Advanced Voice Detection: Dual-layer VAD (WebRTC + Silero) for accurate speech detection
OBS Integration: Built-in web server for browser source capture at http://localhost:8080
Multi-User Sync: Optional Node.js server to sync transcriptions across multiple users
Custom Fonts: Support for system fonts, web-safe fonts, Google Fonts, and custom font files
Customizable Colors: User-configurable colors for name, text, and background
Noise Suppression: Built-in audio preprocessing to reduce background noise
Auto-Updates: Automatic update checking with release notes display

Architecture

The application uses a two-process architecture:

Tauri Shell (Svelte 5 frontend) — lightweight native window (~50MB) rendering the UI
Python Backend (sidecar) — headless process running transcription, audio capture, and the OBS web server

The Tauri frontend communicates with the Python backend through a REST API and a WebSocket connection, following the same pattern as voice-to-notes.

Tauri App (user launches this)
  └─ Spawns Python backend as sidecar
       ├─ FastAPI REST API (control endpoints)
       ├─ WebSocket /ws/control (real-time state + transcriptions)
       ├─ OBS web display at http://localhost:8080
       └─ Transcription engine (Whisper or Deepgram)
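
Because the backend runs as a separate process, a client that starts it has to wait until its server is actually listening before sending requests. The sketch below shows one generic way to do that with a TCP readiness probe. It is an illustration only: the real Tauri shell is written in Rust and its startup logic is not described here, and the function name `wait_for_port` and its parameters are hypothetical.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 10.0,
                  interval: float = 0.1) -> bool:
    """Return True once a TCP connection to (host, port) succeeds.

    Hypothetical helper: polls until the backend accepts connections
    or the timeout expires, in which case it returns False.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet (or refused); pause briefly and retry.
            time.sleep(interval)
    return False
```

For example, `wait_for_port("127.0.0.1", 8080)` would block until the OBS web display port answers, assuming the default port from this README.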

Legacy GUI: The original PySide6/Qt desktop GUI (main.py) still works alongside the new Tauri frontend during the transition period.

Quick Start

Running from Source

# Install Python dependencies
uv sync

# Run the Tauri app (frontend + backend)
npm install
npm run tauri dev

# Or run just the headless backend (for development)
uv run python -m backend.main_headless

# Or run the legacy PySide6 GUI
uv run python main.py

Using Pre-Built Executables

Download the latest release from the releases page:

App installer (Tauri shell): .msi (Windows), .dmg (macOS), .deb/.rpm/.AppImage (Linux)
Sidecar (Python backend): Download the matching sidecar-* zip for your platform (CUDA or CPU)

Building from Source

# Build the Tauri app
npm install
npm run tauri build
# Output: src-tauri/target/release/bundle/

# Build the Python sidecar (headless, no Qt)
uv sync
uv run pyinstaller local-transcription-headless.spec
# Output: dist/local-transcription-backend/

# Build the legacy PySide6 app (Linux)
./build.sh
# Build the legacy PySide6 app (Windows)
build.bat

For detailed build instructions, see BUILD.md.

Usage

Standalone Mode

Launch the application
Select your microphone from the audio device dropdown
Choose a Whisper model (smaller = faster, larger = more accurate):
- tiny.en / tiny — Fastest, good for quick captions
- base.en / base — Balanced speed and accuracy
- small.en / small — Better accuracy
- medium.en / medium — High accuracy
- large-v3 — Best accuracy (requires more resources)
Click Start to begin transcription
Transcriptions appear in the main window and at http://localhost:8080

Remote Transcription (Deepgram)

Instead of local Whisper models, you can use cloud-based transcription:

Managed mode: Sign up via the transcription proxy for metered billing
BYOK mode: Bring your own Deepgram API key for direct access

Configure in Settings > Remote Transcription.

OBS Browser Source Setup

Start the Local Transcription app
In OBS, add a Browser source
Set URL to http://localhost:8080
Set dimensions (e.g., 1920x300)
Check "Shutdown source when not visible" for performance

Multi-User Mode (Optional)

For syncing transcriptions across multiple users (e.g., multi-host streams or translation teams):

Deploy the Node.js server (see server/nodejs/README.md)
In the app settings, enable Server Sync
Enter the server URL (e.g., http://your-server:3000/api/send)
Set a room name and passphrase (shared with other users)

In OBS, use the server's display URL with your room name:

http://your-server:3000/display?room=YOURROOM&timestamps=true&maxlines=50
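
If room names contain spaces or other special characters, the display URL needs to be URL-encoded. A minimal sketch of building it in Python follows; the query parameters `room`, `timestamps`, and `maxlines` come from the example above, while the helper name `display_url` is made up for illustration.

```python
from urllib.parse import urlencode


def display_url(base: str, room: str, timestamps: bool = True,
                maxlines: int = 50) -> str:
    """Build the multi-user display URL for an OBS browser source.

    Hypothetical helper; only the parameter names are taken from the README.
    """
    query = urlencode({
        "room": room,
        "timestamps": "true" if timestamps else "false",
        "maxlines": maxlines,
    })
    return f"{base.rstrip('/')}/display?{query}"


print(display_url("http://your-server:3000", "YOURROOM"))
# prints http://your-server:3000/display?room=YOURROOM&timestamps=true&maxlines=50
```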

Configuration

Settings are stored at ~/.local-transcription/config.yaml and can be modified through the GUI settings panel or the REST API.

Key Settings

| Setting | Description | Default |
| --- | --- | --- |
| `transcription.model` | Whisper model to use | `base.en` |
| `transcription.device` | Processing device (auto/cuda/cpu) | `auto` |
| `transcription.enable_realtime_transcription` | Show preview while speaking | `false` |
| `transcription.silero_sensitivity` | VAD sensitivity (0-1, lower = more sensitive) | `0.4` |
| `transcription.post_speech_silence_duration` | Silence before finalizing (seconds) | `0.3` |
| `transcription.continuous_mode` | Fast speaker mode for quick talkers | `false` |
| `remote.mode` | Transcription mode (local/managed/byok) | `local` |
| `display.show_timestamps` | Show timestamps with transcriptions | `true` |
| `display.fade_after_seconds` | Fade out time (0 = never) | `10` |
| `display.font_source` | Font type (System Font/Web-Safe/Google Font/Custom File) | `System Font` |
| `web_server.port` | Local web server port | `8080` |

See config/default_config.yaml for all available options.
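
The dotted setting names above suggest a nested layout in the YAML file. The sketch below shows how a dotted key could be resolved against a user config with a fallback to defaults. It assumes the nesting mirrors the dotted names (not confirmed here), uses plain dictionaries instead of parsing YAML, and the names `DEFAULTS` and `get_setting` are hypothetical. The default values are copied from the table.

```python
from typing import Any

# Defaults copied from the Key Settings table; the nested layout is an assumption.
DEFAULTS = {
    "transcription": {
        "model": "base.en",
        "device": "auto",
        "silero_sensitivity": 0.4,
        "post_speech_silence_duration": 0.3,
    },
    "remote": {"mode": "local"},
    "display": {"show_timestamps": True, "fade_after_seconds": 10},
    "web_server": {"port": 8080},
}


def get_setting(config: dict, dotted_key: str, defaults: dict = DEFAULTS) -> Any:
    """Look up 'section.name' in config, falling back to defaults."""
    for source in (config, defaults):
        node = source
        for part in dotted_key.split("."):
            if not isinstance(node, dict) or part not in node:
                break  # key missing in this source; try the next one
            node = node[part]
        else:
            return node  # every path segment was found
    raise KeyError(dotted_key)


user_config = {"transcription": {"model": "small.en"}}
print(get_setting(user_config, "transcription.model"))  # user override
print(get_setting(user_config, "web_server.port"))      # falls back to default
```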

Project Structure

local-transcription/
├── src/                             # Svelte 5 frontend (Tauri UI)
│   ├── App.svelte                   # Main app shell
│   ├── lib/components/              # UI components
│   │   ├── Header.svelte
│   │   ├── StatusBar.svelte
│   │   ├── Controls.svelte
│   │   ├── TranscriptionDisplay.svelte
│   │   └── Settings.svelte
│   └── lib/stores/                  # Reactive state management
│       ├── backend.ts               # WebSocket + REST API client
│       ├── config.ts                # App configuration
│       └── transcriptions.ts        # Transcription data
├── src-tauri/                       # Tauri v2 Rust shell
│   ├── src/main.rs
│   └── tauri.conf.json
├── backend/                         # Headless Python backend (sidecar)
│   ├── app_controller.py            # Orchestration logic (engine, sync, config)
│   ├── api_server.py                # FastAPI REST + WebSocket control API
│   └── main_headless.py             # Headless entry point
├── client/                          # Core transcription modules
│   ├── audio_capture.py             # Audio input handling
│   ├── transcription_engine_realtime.py  # RealtimeSTT / Whisper
│   ├── deepgram_transcription.py    # Deepgram cloud transcription
│   ├── noise_suppression.py         # VAD and noise reduction
│   ├── device_utils.py              # CPU/GPU/MPS detection
│   ├── config.py                    # Configuration management
│   ├── server_sync.py               # Multi-user server client
│   └── update_checker.py            # Auto-update functionality
├── gui/                             # Legacy PySide6/Qt GUI
│   ├── main_window_qt.py
│   ├── settings_dialog_qt.py
│   └── transcription_display_qt.py
├── server/                          # Web servers
│   ├── web_display.py               # Local FastAPI server for OBS
│   └── nodejs/                      # Multi-user sync server
├── .gitea/workflows/                # CI/CD
│   ├── release.yml                  # Tauri app builds (all platforms)
│   └── build-sidecar.yml            # Python sidecar builds (CUDA + CPU)
├── config/
│   └── default_config.yaml          # Default settings template
├── main.py                          # Legacy GUI entry point
├── main_cli.py                      # CLI version (for testing)
├── local-transcription.spec         # PyInstaller config (legacy, with PySide6)
├── local-transcription-headless.spec # PyInstaller config (headless sidecar)
├── pyproject.toml                   # Python dependencies
└── package.json                     # Node.js / Tauri dependencies

Technology Stack

Frontend (Tauri)

Tauri v2 — Native cross-platform shell (Rust)
Svelte 5 — Reactive UI framework (TypeScript)
Vite — Frontend build tool

Backend (Python Sidecar)

Python 3.9+
FastAPI + Uvicorn — REST API and WebSocket server
RealtimeSTT — Real-time speech-to-text with advanced VAD
faster-whisper — Optimized Whisper model inference (CTranslate2)
PyTorch — ML framework (CUDA-enabled builds available)
sounddevice — Cross-platform audio capture
webrtcvad + silero_vad — Voice activity detection

Multi-User Server (Optional)

Node.js + Express + WebSocket — Real-time sync server

Build & CI/CD

PyInstaller — Python sidecar packaging
Tauri CLI — App bundling (.msi, .dmg, .deb, .rpm, .AppImage)
Gitea Actions — Automated cross-platform builds
uv — Fast Python package manager

CI/CD

Two Gitea Actions workflows in .gitea/workflows/:

| Workflow | Trigger | Produces |
| --- | --- | --- |
| `release.yml` | Push to `main` | Tauri app installers for all platforms |
| `build-sidecar.yml` | Changes to `client/`, `server/`, `backend/`, or `pyproject.toml` | Python sidecar zips (CUDA + CPU) |

Both workflows require a BUILD_TOKEN secret in the repo settings (Gitea API token with release write access).
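
To reason about whether a given commit would rebuild the sidecar, the path rule for `build-sidecar.yml` can be approximated locally. This is a rough sketch only: the actual workflow uses Gitea Actions path filters, whose exact patterns are not shown in this README, and the names below are hypothetical.

```python
# Paths taken from the trigger description above; matching logic is an approximation.
SIDECAR_DIRS = ("client/", "server/", "backend/")
SIDECAR_FILES = {"pyproject.toml"}


def triggers_sidecar_build(changed_files: list[str]) -> bool:
    """Return True if any changed path falls under the sidecar trigger paths."""
    return any(
        path in SIDECAR_FILES or path.startswith(SIDECAR_DIRS)
        for path in changed_files
    )


print(triggers_sidecar_build(["README.md", "src/App.svelte"]))  # False
print(triggers_sidecar_build(["client/server_sync.py"]))        # True
```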

Release Artifacts

| Platform | App Installer | Sidecar (CUDA) | Sidecar (CPU) |
| --- | --- | --- | --- |
| Linux x86_64 | `.deb`, `.rpm`, `.AppImage` | `sidecar-linux-x86_64-cuda.zip` | `sidecar-linux-x86_64-cpu.zip` |
| Windows x86_64 | `.msi`, `-setup.exe` | `sidecar-windows-x86_64-cuda.zip` | `sidecar-windows-x86_64-cpu.zip` |
| macOS ARM64 | `.dmg` | n/a | `sidecar-macos-aarch64-cpu.zip` |
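
The sidecar file names follow a regular `sidecar-<os>-<arch>-<variant>.zip` pattern, so the matching download can be derived from the platform. The sketch below encodes only the combinations listed in the table. The function name `sidecar_asset` and the mapping tables are hypothetical, and the inputs are meant to be the values returned by Python's `platform.system()` and `platform.machine()`.

```python
# Map platform.system() / platform.machine() values to the names used in releases.
_OS = {"Linux": "linux", "Windows": "windows", "Darwin": "macos"}
_ARCH = {"x86_64": "x86_64", "AMD64": "x86_64", "arm64": "aarch64", "aarch64": "aarch64"}

# Variants available per platform, as listed in the Release Artifacts table.
_AVAILABLE = {
    ("linux", "x86_64"): ("cuda", "cpu"),
    ("windows", "x86_64"): ("cuda", "cpu"),
    ("macos", "aarch64"): ("cpu",),
}


def sidecar_asset(system: str, machine: str, want_cuda: bool) -> str:
    """Return the sidecar zip name for a platform, preferring CUDA if offered."""
    key = (_OS.get(system), _ARCH.get(machine))
    variants = _AVAILABLE.get(key)
    if variants is None:
        raise ValueError(f"no sidecar build listed for {system}/{machine}")
    variant = "cuda" if want_cuda and "cuda" in variants else "cpu"
    return f"sidecar-{key[0]}-{key[1]}-{variant}.zip"


print(sidecar_asset("Windows", "AMD64", want_cuda=True))
# prints sidecar-windows-x86_64-cuda.zip
```

On macOS the function returns the CPU build even when CUDA is requested, since no CUDA sidecar is listed for that platform.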

System Requirements

Minimum

4GB RAM
Any modern CPU

Recommended (for local real-time transcription)

8GB+ RAM
NVIDIA GPU with CUDA support (for GPU acceleration)

For Building

Tauri app: Node.js 20+, Rust stable, platform SDK (see Tauri prerequisites)
Python sidecar: Python 3.9+, uv, PyInstaller
Linux: libgtk-3-dev, libwebkit2gtk-4.1-dev, libappindicator3-dev, librsvg2-dev, patchelf
Windows: Visual Studio Build Tools, WebView2
macOS: Xcode Command Line Tools

Troubleshooting

macOS: "App is damaged and can't be opened"

macOS Gatekeeper blocks unsigned applications. Since the app is not yet signed with an Apple Developer certificate, you need to remove the quarantine flag before opening:

xattr -cr "/Applications/Local Transcription.app"

Then open the app normally. You only need to do this once after downloading.

Model Loading Issues

Models download automatically on first use to ~/.cache/huggingface/
The first run requires an internet connection
Check disk space (models range from 75MB to 3GB)
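
A quick way to check free space before a model download is sketched below. The cache path and the 3GB upper bound come from the notes above; the helper name `has_room_for_model` is hypothetical and this is not part of the app itself.

```python
import shutil
from pathlib import Path


def has_room_for_model(cache_dir: Path, required_bytes: int) -> bool:
    """Return True if the filesystem holding cache_dir has enough free space."""
    probe = cache_dir.expanduser()
    # The cache directory may not exist before the first download,
    # so walk up to the nearest existing parent directory.
    while not probe.exists() and probe != probe.parent:
        probe = probe.parent
    return shutil.disk_usage(probe).free >= required_bytes


# 3 GB is the largest model size quoted above.
print(has_room_for_model(Path("~/.cache/huggingface"), 3 * 10**9))
```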

Audio Device Issues

# List available audio devices
uv run python main_cli.py --list-devices

Ensure microphone permissions are granted (especially on macOS)
Try different device indices in settings

GPU Not Detected

# Check CUDA availability
uv run python -c "import torch; print(torch.cuda.is_available())"

Install NVIDIA drivers (CUDA toolkit is bundled in CUDA sidecar builds)
The app automatically falls back to CPU if no GPU is available
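
The CUDA, MPS, then CPU fallback order can be sketched as follows. This is an illustration of the idea, not the code in client/device_utils.py: the function names are hypothetical, and accepting "mps" as an explicit preference is an assumption. The PyTorch calls used (`torch.cuda.is_available()` and `torch.backends.mps.is_available()`) are standard, and the probe degrades to CPU when PyTorch is not installed.

```python
def select_device(preferred: str, cuda_ok: bool, mps_ok: bool) -> str:
    """Pick a device name, falling back to CPU when the preferred one is missing."""
    if preferred in ("auto", "cuda") and cuda_ok:
        return "cuda"
    if preferred in ("auto", "mps") and mps_ok:
        return "mps"
    return "cpu"


def probe_backends() -> tuple[bool, bool]:
    """Return (cuda_available, mps_available); both False without PyTorch."""
    try:
        import torch
    except ImportError:
        return False, False
    cuda_ok = torch.cuda.is_available()
    # Older PyTorch builds have no torch.backends.mps attribute.
    mps = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps is not None and mps.is_available())
    return cuda_ok, mps_ok


print(select_device("auto", *probe_backends()))
```

Keeping the decision in a pure function separate from the hardware probe makes the fallback order easy to test on machines without a GPU.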

Web Server Port Conflicts

Default port is 8080; the app tries ports 8080-8084 automatically
Change the port in settings or edit web_server.port in the config file
Check for conflicts: lsof -i :8080 (Linux/macOS) or netstat -ano | findstr :8080 (Windows)
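
The port fallback described above (trying 8080 through 8084) can be approximated with a short probe. This sketch is an assumption about the general technique, not the app's actual implementation, and the name `first_free_port` is hypothetical.

```python
import socket
from typing import Optional


def first_free_port(start: int = 8080, count: int = 5,
                    host: str = "127.0.0.1") -> Optional[int]:
    """Return the first port in [start, start + count) that can be bound, else None."""
    for port in range(start, start + count):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            try:
                probe.bind((host, port))
            except OSError:
                continue  # port is in use or not permitted; try the next one
            return port
    return None


print(first_free_port())
```

A probe like this is inherently racy: another process can take the port between the check and the real server start, so a server should still handle a bind failure when it actually starts.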

Use Cases

Live Streaming Captions: Add real-time captions to your Twitch/YouTube streams
Multi-Language Translation: Multiple translators transcribing in different languages
Accessibility: Provide captions for deaf and hard-of-hearing viewers
Podcast Recording: Real-time transcription for multi-host shows
Gaming Commentary: Track who said what in multiplayer sessions

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests at the repository.

License

MIT License

Acknowledgments

OpenAI Whisper for the speech recognition model
RealtimeSTT for real-time transcription capabilities
faster-whisper for optimized inference
Tauri for the cross-platform desktop framework
Deepgram for cloud transcription API

Releases 8

Sidecar v1.0.15 Latest

2026-04-12 17:44:51 +00:00

Languages

Python 66.4%

Svelte 11.8%

JavaScript 8.4%

Rust 7.1%

TypeScript 4.5%

Other 1.7%