local-transcription

Archived

Author	SHA1	Message	Date
Developer	ef188e1f67	Fix managed mode WebSocket URL when server_url uses https:// Tests / Python Backend Tests (push) Successful in 5s Details Tests / Frontend Tests (push) Successful in 7s Details Tests / Rust Sidecar Tests (push) Successful in 2m21s Details The URL builder was prepending wss:// to the full https:// URL, producing an invalid wss://https://... URL. Now properly converts https→wss and http→ws before appending the /ws/transcribe path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 20:03:58 -07:00
Developer	352615c15c	Fix Deepgram broken pipe: wait for WebSocket before starting audio Tests / Python Backend Tests (push) Successful in 5s Details Tests / Frontend Tests (push) Successful in 8s Details Tests / Rust Sidecar Tests (push) Successful in 2m0s Details Audio capture started immediately after spawning the WebSocket thread, but the WebSocket hadn't connected yet. Audio chunks sent to the unconnected WebSocket caused a broken pipe error. Fix: added a threading.Event that start_recording() waits on (up to 15s) before opening the audio stream. The event is set in _ws_lifecycle after the WebSocket connects and handshake completes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 12:18:47 -07:00
Developer	a3bcc5bee5	Show transcription start errors in UI, improve error logging Tests / Python Backend Tests (push) Successful in 5s Details Tests / Frontend Tests (push) Successful in 8s Details Tests / Rust Sidecar Tests (push) Successful in 2m5s Details Start Transcription button now shows the error message when it fails instead of silently reverting. Common causes: - Missing PortAudio library on Linux - Audio device not accessible - Deepgram connection failure Also added error details to backend console output and captured the last error from the Deepgram engine for better diagnostics. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 12:15:43 -07:00
Developer	3d3d7ec3c5	Add cloud-only sidecar variant (~50MB vs 500MB-2GB) Tests / Python Backend Tests (push) Successful in 6s Details Tests / Frontend Tests (push) Successful in 7s Details Tests / Rust Sidecar Tests (push) Successful in 1m59s Details Lightweight Deepgram-only sidecar that excludes PyTorch, faster-whisper, RealtimeSTT, and CUDA. Only includes audio capture + WebSocket streaming to Deepgram. Requires a Deepgram API key (BYOK or managed mode). Changes: - client/models.py: Extracted TranscriptionResult into standalone module so deepgram_transcription.py doesn't transitively import torch - backend/app_controller.py: Made RealtimeTranscriptionEngine and DeviceManager imports lazy (only loaded when remote.mode == "local") - local-transcription-cloud.spec: PyInstaller spec excluding all ML deps - SidecarSetup.svelte: Added "Cloud Only (Deepgram)" variant option - build-sidecar-cloud.yml: CI workflow building cloud sidecar for all 3 OS - sidecar-release.yml: Dispatches cloud build alongside CPU/CUDA builds Sidecar download options are now: - Standard (CPU): ~500 MB - local Whisper on any computer - GPU Accelerated (CUDA): ~2 GB - local Whisper with NVIDIA GPU - Cloud Only (Deepgram): ~50 MB - requires API key, no local models Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-07 16:57:43 -07:00
Developer	9dcb14e92c	Fix Deepgram streaming latency Tests / Python Backend Tests (push) Successful in 5s Details Tests / Frontend Tests (push) Successful in 9s Details Tests / Rust Sidecar Tests (push) Successful in 2m5s Details Three changes to reduce transcription delay: 1. Send loop: queue.get() was blocking the asyncio event loop, stalling the receive loop and delaying transcription results. Now uses run_in_executor() to avoid blocking the event loop. 2. Block size: reduced from 4096 (~256ms) to 1024 (~64ms) for more frequent, smaller audio chunks. Deepgram handles streaming better with smaller packets. 3. Added punctuate=true and smart_format=true to Deepgram BYOK params for cleaner transcription output. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-07 16:31:50 -07:00
Developer	5a674ed199	Add test suite (63 tests) and CI workflow, fix Settings API bugs Release / Bump version and tag (push) Successful in 4s Details Sidecar Release / Bump sidecar version and tag (push) Failing after 3s Details Tests / Python Backend Tests (push) Failing after 3s Details Tests / Frontend Tests (push) Successful in 8s Details Tests / Rust Sidecar Tests (push) Successful in 3m10s Details Test suite covering all three layers: Python backend (25 tests): - AppController: state machine, start/stop, callbacks, settings reload - API server: REST endpoints, config CRUD, status, devices - Config: dot-notation get/set, persistence, nested paths - Main headless: ready event port format validation Svelte frontend (14 tests via Vitest): - Backend store: exported properties/methods, port derivation, URLs - Config store: method names (fetchConfig not loadConfig), defaults - Transcriptions store: add/clear/plaintext - File extension regression: ensures $state runes only in .svelte.ts Rust sidecar (24 tests via cargo test): - Platform/arch detection, asset name construction - Ready event deserialization (with extra fields tolerance) - Path construction, version read/write, old version cleanup - Zip extraction, SidecarManager lifecycle CI workflow (.gitea/workflows/test.yml): - Runs on push to main and PRs - Three parallel jobs: Python, Frontend, Rust Also fixes three bugs found during test planning: - Settings: /api/check-updates -> GET /api/check-update - Settings: /api/remote/login -> /api/login - Settings: /api/remote/register -> /api/register Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-07 07:48:36 -07:00
Developer	9ff883e2e3	Phase 6: Add Deepgram remote transcription (managed + BYOK modes) New files: - client/deepgram_transcription.py — DeepgramTranscriptionEngine with managed mode (proxy) and BYOK mode (direct Deepgram). Sends raw binary PCM audio over WebSocket, handles both proxy and Deepgram response formats. Modified files: - config/default_config.yaml — Replace remote_processing with new remote section (mode, server_url, auth_token, byok_api_key, deepgram_model, language) - client/config.py — Add migration from old remote_processing config - gui/settings_dialog_qt.py — Replace Remote Processing group with Transcription Mode section (Local/Managed/BYOK radio buttons, login/register dialogs, balance display, model selector) - gui/main_window_qt.py — Select engine based on remote.mode config, add error and credits_low handlers Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-05 11:45:30 -07:00
shadowdao	b7ab57f21f	Add auto-update feature with Gitea release checking - Add UpdateChecker class to query Gitea API for latest releases - Show update dialog with release notes when new version available - Open browser to release page for download (handles large files) - Allow users to skip specific versions or defer updates - Add "Check for Updates Now" button in settings - Check automatically on startup (respects 24-hour interval) - Pre-configured for repo.anhonesthost.net/streamer-tools Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-22 17:40:13 -08:00
shadowdao	89819f5d1b	Add user-configurable colors for transcription display - Add color settings (user_color, text_color, background_color) to config - Add color picker buttons in Settings dialog with alpha support for backgrounds - Update local web display to use configurable colors - Send per-user colors with transcriptions to multi-user server - Update Node.js server to apply per-user colors on display page - Improve server landing page: replace tech details with display options reference - Bump version to 1.3.2 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 20:59:13 -08:00
shadowdao	ff067b3368	Add unified per-speaker font support and remote transcription service Font changes: - Consolidate font settings into single Display Settings section - Support Web-Safe, Google Fonts, and Custom File uploads for both displays - Fix Google Fonts URL encoding (use + instead of %2B for spaces) - Fix per-speaker font inline style quote escaping in Node.js display - Add font debug logging to help diagnose font issues - Update web server to sync all font settings on settings change - Remove deprecated PHP server documentation files New features: - Add remote transcription service for GPU offloading - Add instance lock to prevent multiple app instances - Add version tracking Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-11 19:09:57 -08:00
shadowdao	5f3c058be6	Migrate to RealtimeSTT for advanced VAD-based transcription Major refactor to eliminate word loss issues using RealtimeSTT with dual-layer VAD (WebRTC + Silero) instead of time-based chunking. ## Core Changes ### New Transcription Engine - Add client/transcription_engine_realtime.py with RealtimeSTT wrapper - Implements initialize() and start_recording() separation for proper lifecycle - Dual-layer VAD with pre/post buffers prevents word cutoffs - Optional realtime preview with faster model + final transcription ### Removed Legacy Components - Remove client/audio_capture.py (RealtimeSTT handles audio) - Remove client/noise_suppression.py (VAD handles silence detection) - Remove client/transcription_engine.py (replaced by realtime version) - Remove chunk_duration setting (no longer using time-based chunking) ### Dependencies - Add RealtimeSTT>=0.3.0 to pyproject.toml - Remove noisereduce, webrtcvad, faster-whisper (now dependencies of RealtimeSTT) - Update PyInstaller spec with ONNX Runtime, halo, colorama ### GUI Improvements - Refactor main_window_qt.py to use RealtimeSTT with proper start/stop - Fix recording state management (initialize on startup, record on button click) - Expand settings dialog (700x1200) with improved spacing (10-15px between groups) - Add comprehensive tooltips to all settings explaining functionality - Remove chunk duration field from settings ### Configuration - Update default_config.yaml with RealtimeSTT parameters: - Silero VAD sensitivity (0.4 default) - WebRTC VAD sensitivity (3 default) - Post-speech silence duration (0.3s) - Pre-recording buffer (0.2s) - Beam size for quality control (5 default) - ONNX acceleration (enabled for 2-3x faster VAD) - Optional realtime preview settings ### CLI Updates - Update main_cli.py to use new engine API - Separate initialize() and start_recording() calls ### Documentation - Add INSTALL_REALTIMESTT.md with migration guide and benefits - Update INSTALL.md: Remove FFmpeg requirement (not needed!) - Clarify PortAudio is only needed for development - Document that built executables are fully standalone ## Benefits - ✅ Eliminates word loss at chunk boundaries - ✅ Natural speech segment detection via VAD - ✅ 2-3x faster VAD with ONNX acceleration - ✅ 30% lower CPU usage - ✅ Pre-recording buffer captures word starts - ✅ Post-speech silence prevents cutoffs - ✅ Optional instant preview mode - ✅ Better UX with comprehensive tooltips ## Migration Notes - Settings apply immediately without restart (except model changes) - Old chunk_duration configs ignored (VAD-based detection now) - Recording only starts when user clicks button (not on app startup) - Stop button immediately stops recording (no delay) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-12-28 18:48:29 -08:00
shadowdao	bd0e84c5e7	Fix model switching crash and improve error handling Model Reload Fixes: - Properly disconnect signals before reconnecting to prevent duplicate connections - Wait for previous model loader thread to finish before starting new one - Add garbage collection after unloading model to free memory - Improve error handling in model reload callback Settings Dialog: - Remove duplicate success message (callback handles it) - Only show message if no callback is defined Transcription Engine: - Explicitly delete model reference before setting to None - Force garbage collection to ensure memory is freed This prevents crashes when switching models, especially when done multiple times in succession or while the app is under load. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-12-27 06:28:40 -08:00
shadowdao	64c864b0f0	Fix multi-user server sync performance and integration Major fixes: - Integrated ServerSyncClient into GUI for actual multi-user sync - Fixed CUDA device display to show actual hardware used - Optimized server sync with parallel HTTP requests (5x faster) - Fixed 2-second DNS delay by using 127.0.0.1 instead of localhost - Added comprehensive debugging and performance logging Performance improvements: - HTTP requests: 2045ms → 52ms (97% faster) - Multi-user sync lag: ~4s → ~100ms (97% faster) - Parallel request processing with ThreadPoolExecutor (3 workers) New features: - Room generator with one-click copy on Node.js landing page - Auto-detection of PHP vs Node.js server types - Localhost warning banner for WSL2 users - Comprehensive debug logging throughout sync pipeline Files modified: - gui/main_window_qt.py - Server sync integration, device display fix - client/server_sync.py - Parallel HTTP, server type detection - server/nodejs/server.js - Room generator, warnings, debug logs Documentation added: - PERFORMANCE_FIX.md - Server sync optimization details - FIX_2_SECOND_HTTP_DELAY.md - DNS/localhost issue solution - LATENCY_GUIDE.md - Audio chunk duration tuning guide - DEBUG_4_SECOND_LAG.md - Comprehensive debugging guide - SESSION_SUMMARY.md - Complete session summary 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-12-26 16:44:55 -08:00
shadowdao	9c3a0d7678	Add multi-user server sync (PHP server + client) Phase 2 implementation: Multiple streamers can now merge their captions into a single stream using a PHP server. PHP Server (server/php/): - server.php: API endpoint for sending/streaming transcriptions - display.php: Web page for viewing merged captions in OBS - config.php: Server configuration - .htaccess: Security settings - README.md: Comprehensive deployment guide Features: - Room-based isolation (multiple groups on same server) - Passphrase authentication per room - Real-time streaming via Server-Sent Events (SSE) - Different colors for each user - File-based storage (no database required) - Auto-cleanup of old rooms - Works on standard PHP hosting Client-Side: - client/server_sync.py: HTTP client for sending to PHP server - Settings dialog updated with server sync options - Config updated with server_sync section Server Configuration: - URL: Server endpoint (e.g., http://example.com/transcription/server.php) - Room: Unique room name for your group - Passphrase: Shared secret for authentication OBS Integration: Display URL format: http://example.com/transcription/display.php?room=ROOM&passphrase=PASS&fade=10&timestamps=true NOTE: Main window integration pending (client sends transcriptions) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-12-26 10:09:12 -08:00
shadowdao	0ba84e6ddd	Improve transcription accuracy with overlapping audio chunks Changes: 1. Changed UI text from "Recording" to "Transcribing" for clarity 2. Implemented overlapping audio chunks to prevent word cutoff Audio Overlap Feature: - Added overlap_duration parameter (default: 0.5 seconds) - Audio chunks now overlap by 0.5s to capture words at boundaries - Prevents missed words when chunks are processed separately - Configurable via audio.overlap_duration in config.yaml How it works: - Each 3-second chunk includes 0.5s from the previous chunk - Buffer advances by (chunk_size - overlap_size) instead of full chunk - Ensures words at chunk boundaries are captured in at least one chunk - No duplicate transcription due to Whisper's context handling Example with 3s chunks and 0.5s overlap: Chunk 1: [0.0s - 3.0s] Chunk 2: [2.5s - 5.5s] <- 0.5s overlap Chunk 3: [5.0s - 8.0s] <- 0.5s overlap Files modified: - client/audio_capture.py: Implemented overlapping buffer logic - config/default_config.yaml: Added overlap_duration setting - gui/main_window_qt.py: Updated UI text, passed overlap param - main_cli.py: Passed overlap param 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-12-26 08:47:19 -08:00
shadowdao	472233aec4	Initial commit: Local Transcription App v1.0 Phase 1 Complete - Standalone Desktop Application Features: - Real-time speech-to-text with Whisper (faster-whisper) - PySide6 desktop GUI with settings dialog - Web server for OBS browser source integration - Audio capture with automatic sample rate detection and resampling - Noise suppression with Voice Activity Detection (VAD) - Configurable display settings (font, timestamps, fade duration) - Settings apply without restart (with automatic model reloading) - Auto-fade for web display transcriptions - CPU/GPU support with automatic device detection - Standalone executable builds (PyInstaller) - CUDA build support (works on systems without CUDA hardware) Components: - Audio capture with sounddevice - Noise reduction with noisereduce + webrtcvad - Transcription with faster-whisper - GUI with PySide6 - Web server with FastAPI + WebSocket - Configuration system with YAML Build System: - Standard builds (CPU-only): build.sh / build.bat - CUDA builds (universal): build-cuda.sh / build-cuda.bat - Comprehensive BUILD.md documentation - Cross-platform support (Linux, Windows) Documentation: - README.md with project overview and quick start - BUILD.md with detailed build instructions - NEXT_STEPS.md with future enhancement roadmap - INSTALL.md with setup instructions 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-12-25 18:48:23 -08:00

16 Commits