Initial commit: Local Transcription App v1.0
Phase 1 Complete - Standalone Desktop Application

Features:
- Real-time speech-to-text with Whisper (faster-whisper)
- PySide6 desktop GUI with settings dialog
- Web server for OBS browser source integration
- Audio capture with automatic sample rate detection and resampling
- Noise suppression with Voice Activity Detection (VAD)
- Configurable display settings (font, timestamps, fade duration)
- Settings apply without restart (with automatic model reloading)
- Auto-fade for web display transcriptions
- CPU/GPU support with automatic device detection
- Standalone executable builds (PyInstaller)
- CUDA build support (works on systems without CUDA hardware)

Components:
- Audio capture with sounddevice
- Noise reduction with noisereduce + webrtcvad
- Transcription with faster-whisper
- GUI with PySide6
- Web server with FastAPI + WebSocket
- Configuration system with YAML

Build System:
- Standard builds (CPU-only): build.sh / build.bat
- CUDA builds (universal): build-cuda.sh / build-cuda.bat
- Comprehensive BUILD.md documentation
- Cross-platform support (Linux, Windows)

Documentation:
- README.md with project overview and quick start
- BUILD.md with detailed build instructions
- NEXT_STEPS.md with future enhancement roadmap
- INSTALL.md with setup instructions

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
.gitignore (vendored, new file, 56 lines)
@@ -0,0 +1,56 @@
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# Virtual environments
venv/
env/
ENV/
.venv/
.venv

# uv
uv.lock
.python-version

# IDE
.vscode/
.idea/
*.swp
*.swo
*~

# OS
.DS_Store
Thumbs.db

# Application specific
*.log
config/*.yaml
!config/default_config.yaml
.local-transcription/

# Model cache
models/
.cache/

# PyInstaller
*.spec.lock
BUILD.md (new file, 259 lines)
@@ -0,0 +1,259 @@
# Building Local Transcription

This guide explains how to build standalone executables for Linux and Windows.

## Prerequisites

1. **Python 3.9+** installed on your system
2. **uv** package manager (install from https://docs.astral.sh/uv/)
3. All project dependencies installed (`uv sync`)

## Building for Linux

### Standard Build (CPU-only):

```bash
# Make the build script executable (first time only)
chmod +x build.sh

# Run the build script
./build.sh
```

### CUDA Build (GPU Support):

Build with CUDA support even without NVIDIA hardware:

```bash
# Make the build script executable (first time only)
chmod +x build-cuda.sh

# Run the CUDA build script
./build-cuda.sh
```

This will:
- Install PyTorch with CUDA 12.1 support
- Bundle CUDA runtime libraries (~600MB extra)
- Create an executable that works on both GPU and CPU systems
- Automatically fall back to CPU if no CUDA GPU is available

The executable will be created in `dist/LocalTranscription/LocalTranscription`.

### Manual build:
```bash
# Clean previous builds
rm -rf build dist

# Build with PyInstaller
uv run pyinstaller local-transcription.spec
```

### Distribution:
```bash
cd dist
tar -czf LocalTranscription-Linux.tar.gz LocalTranscription/
```

## Building for Windows

### Standard Build (CPU-only):

```cmd
REM Run the build script
build.bat
```

### CUDA Build (GPU Support):

Build with CUDA support even without NVIDIA hardware:

```cmd
REM Run the CUDA build script
build-cuda.bat
```

This will:
- Install PyTorch with CUDA 12.1 support
- Bundle CUDA runtime libraries (~600MB extra)
- Create an executable that works on both GPU and CPU systems
- Automatically fall back to CPU if no CUDA GPU is available

The executable will be created in `dist\LocalTranscription\LocalTranscription.exe`.

### Manual build:
```cmd
REM Clean previous builds
rmdir /s /q build
rmdir /s /q dist

REM Build with PyInstaller
uv run pyinstaller local-transcription.spec
```

### Distribution:
- Compress the `dist\LocalTranscription` folder to a ZIP file
- Or use an installer creator like NSIS or Inno Setup

## Important Notes

### Cross-Platform Building

**You cannot cross-compile!**
- Linux executables must be built on Linux
- Windows executables must be built on Windows
- Mac executables must be built on macOS

### First Run

On the first run, the application will:
1. Create a config directory at `~/.local-transcription/` (Linux) or `%USERPROFILE%\.local-transcription\` (Windows)
2. Download the Whisper model (if not already present)
3. Cache the model in `~/.cache/huggingface/` by default

### Executable Size

The built executable will be large (300MB - 2GB+) because it includes:
- Python runtime
- PySide6 (Qt framework)
- PyTorch/faster-whisper
- NumPy, SciPy, and other dependencies

### Console Window

By default, the console window is visible (for debugging). To hide it:

1. Edit `local-transcription.spec`
2. Change `console=True` to `console=False` in the `EXE` section
3. Rebuild
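
The relevant spec change looks roughly like this (a sketch of a typical PyInstaller `EXE` section; the exact contents of `local-transcription.spec` may differ):

```python
# Excerpt of a typical PyInstaller spec EXE section (illustrative).
exe = EXE(
    pyz,
    a.scripts,
    exclude_binaries=True,
    name='LocalTranscription',
    upx=True,
    console=False,  # was True; False hides the console window
    icon=None,
)
```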

### GPU Support

#### Building with CUDA (Recommended for Distribution)

**Yes, you CAN build with CUDA support on systems without NVIDIA GPUs!**

PyTorch provides CUDA-enabled builds that bundle the CUDA runtime libraries. This means:

1. **You don't need NVIDIA hardware** to create CUDA-enabled builds
2. **The executable will work everywhere** - on systems with or without NVIDIA GPUs
3. **Automatic fallback** - the app detects available hardware and uses GPU if available, CPU otherwise
4. **Larger file size** - adds ~600MB-1GB to the executable size

**How it works:**
```bash
# Linux
./build-cuda.sh

# Windows
build-cuda.bat
```

The build script will:
- Install PyTorch with bundled CUDA 12.1 runtime
- Package all CUDA libraries into the executable
- Create a universal build that runs on any system

**When users run the executable:**
- If they have an NVIDIA GPU with drivers: uses GPU acceleration
- If they don't have an NVIDIA GPU: automatically uses the CPU
- No configuration needed - it just works!
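
The fallback itself is only a few lines. A minimal sketch of how such detection can work (illustrative; the app's actual logic may differ):

```python
# Illustrative device-selection fallback; not the app's exact code.
def pick_device() -> str:
    """Return "cuda" when a usable NVIDIA GPU is present, else "cpu"."""
    try:
        import torch
        if torch.cuda.is_available():
            return "cuda"
    except Exception:
        pass  # torch missing or CUDA runtime broken: use the CPU
    return "cpu"

# faster-whisper accepts the chosen device directly:
# model = WhisperModel("base", device=pick_device())
```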

#### Alternative: CPU-Only Builds

If you only want CPU support (smaller file size):
```bash
# Linux
./build.sh

# Windows
build.bat
```

#### AMD GPU Support

- **ROCm**: Requires special PyTorch builds from AMD
- Not recommended for general distribution
- Better to use the CUDA build (works on all systems) or the CPU build

### Optimizations

To reduce size:

1. **Remove unused model sizes**: The app downloads models on demand, so you don't need to bundle them
2. **Use UPX compression**: Already enabled in the spec file
3. **Exclude dev dependencies**: Only build dependencies are needed

## Testing the Build

After building, test the executable:

### Linux:
```bash
cd dist/LocalTranscription
./LocalTranscription
```

### Windows:
```cmd
cd dist\LocalTranscription
LocalTranscription.exe
```

## Troubleshooting

### Missing modules error
If you get "No module named X" errors, add the missing module to the `hiddenimports` list in `local-transcription.spec`.
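
For example, if the build failed to find `scipy.signal` at runtime, the fix would look like this in the spec's `Analysis` section (module names here are illustrative):

```python
# Excerpt from local-transcription.spec (illustrative): list any module
# that PyInstaller's static analysis misses under hiddenimports.
a = Analysis(
    ['main.py'],
    hiddenimports=[
        'scipy.signal',  # example: imported dynamically at runtime
        'webrtcvad',
    ],
)
```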

### DLL errors (Windows)
Make sure the Visual C++ Redistributable is installed on the target system:
https://aka.ms/vs/17/release/vc_redist.x64.exe

### Audio device errors
The application needs access to audio devices. Ensure:
- Microphone permissions are granted
- Audio drivers are installed
- PulseAudio (Linux) or Windows Audio is running

### Model download fails
Ensure an internet connection on first run. Models are downloaded from:
https://huggingface.co/guillaumekln/faster-whisper-base

## Advanced: Adding an Icon

1. Create or obtain an `.ico` file (Windows) or `.png` file (Linux)
2. Edit `local-transcription.spec`
3. Change `icon=None` to `icon='path/to/your/icon.ico'`
4. Rebuild

## Advanced: Creating an Installer

### Windows (using Inno Setup):

1. Install Inno Setup: https://jrsoftware.org/isinfo.php
2. Create an `.iss` script file
3. Build the installer

### Linux (using AppImage):

```bash
# Install appimagetool
wget https://github.com/AppImage/AppImageKit/releases/download/continuous/appimagetool-x86_64.AppImage
chmod +x appimagetool-x86_64.AppImage

# Create AppDir structure
mkdir -p LocalTranscription.AppDir/usr/bin
cp -r dist/LocalTranscription/* LocalTranscription.AppDir/usr/bin/

# Create a .desktop file and icon inside the AppDir as needed

# Build AppImage
./appimagetool-x86_64.AppImage LocalTranscription.AppDir
```

## Support

For build issues, check:
1. PyInstaller documentation: https://pyinstaller.org/
2. Project issues: https://github.com/anthropics/claude-code/issues
INSTALL.md (new file, 194 lines)
@@ -0,0 +1,194 @@
# Installation Guide

## Prerequisites

- **Python 3.9 or higher**
- **uv** (Python package installer)
- **FFmpeg** (required by faster-whisper)
- **CUDA-capable GPU** (optional, for GPU acceleration)

### Installing uv

If you don't have `uv` installed:

```bash
# On macOS and Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# On Windows
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

# Or with pip
pip install uv
```

### Installing FFmpeg

#### On Ubuntu/Debian:
```bash
sudo apt update
sudo apt install ffmpeg
```

#### On macOS (with Homebrew):
```bash
brew install ffmpeg
```

#### On Windows:
Download from [ffmpeg.org](https://ffmpeg.org/download.html) and add to PATH.

## Installation Steps

### 1. Navigate to Project Directory

```bash
cd /home/jknapp/code/local-transcription
```

### 2. Install Dependencies with uv

```bash
# uv will automatically create a virtual environment and install dependencies
uv sync
```

This single command will:
- Create a virtual environment (`.venv/`)
- Install all dependencies from `pyproject.toml`
- Lock dependencies for reproducibility

**Note for CUDA users:** If you have an NVIDIA GPU, install PyTorch with CUDA support:

```bash
# For CUDA 11.8
uv pip install torch --index-url https://download.pytorch.org/whl/cu118

# For CUDA 12.1
uv pip install torch --index-url https://download.pytorch.org/whl/cu121
```

### 3. Run the Application

```bash
# Option 1: Using uv run (automatically uses the venv)
uv run python main.py

# Option 2: Activate venv manually
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
python main.py
```

On first run, the application will:
- Download the Whisper model (this may take a few minutes)
- Create a configuration file at `~/.local-transcription/config.yaml`

## Quick Start Commands

```bash
# Install everything
uv sync

# Run the application
uv run python main.py

# Install with server dependencies (for Phase 2+)
uv sync --extra server

# Update dependencies
uv sync --upgrade
```

## Configuration

Settings can be changed through the GUI (Settings button) or by editing:
```
~/.local-transcription/config.yaml
```
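
As an illustration, a config of this kind typically looks like the following; the key names here are hypothetical, and the authoritative defaults live in `config/default_config.yaml`:

```yaml
# Hypothetical example of ~/.local-transcription/config.yaml;
# actual keys are defined by config/default_config.yaml.
user:
  display_name: "Streamer"
transcription:
  model_size: base     # tiny | base | small | medium | large
  device: auto         # auto | cpu | cuda
  language: en
audio:
  input_device: default
  chunk_duration: 3.0  # seconds; smaller = lower latency
  vad_enabled: true
web:
  port: 8080
display:
  show_timestamps: true
  fade_duration: 5.0   # seconds before web captions fade
```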

## Troubleshooting

### Audio Device Issues

If no audio devices are detected:
```bash
uv run python -c "import sounddevice as sd; print(sd.query_devices())"
```

### GPU Not Detected

Check if CUDA is available:
```bash
uv run python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
```

### Model Download Fails

Models are downloaded to `~/.cache/huggingface/`. If the download fails:
- Check your internet connection
- Ensure sufficient disk space (~1-3 GB depending on model size)

### uv Command Not Found

Make sure uv is in your PATH:
```bash
# Add to ~/.bashrc or ~/.zshrc
export PATH="$HOME/.cargo/bin:$PATH"
```

## Performance Tips

For best real-time performance:

1. **Use GPU if available** - 5-10x faster than CPU
2. **Start with smaller models**:
   - `tiny`: Fastest, ~39M parameters, 1-2s latency
   - `base`: Good balance, ~74M parameters, 2-3s latency
   - `small`: Better accuracy, ~244M parameters, 3-5s latency
3. **Enable VAD** (Voice Activity Detection) to skip silent audio
4. **Adjust chunk duration**: Smaller = lower latency, larger = better accuracy

## System Requirements

### Minimum:
- CPU: Dual-core 2GHz+
- RAM: 4GB
- Model: tiny or base

### Recommended:
- CPU: Quad-core 3GHz+ or GPU (NVIDIA GTX 1060+)
- RAM: 8GB
- Model: base or small

### For Best Performance:
- GPU: NVIDIA RTX 2060 or better
- RAM: 16GB
- Model: small or medium

## Development

### Install development dependencies:
```bash
uv sync --extra dev
```

### Run tests:
```bash
uv run pytest
```

### Format code:
```bash
uv run black .
uv run ruff check .
```

## Why uv?

`uv` is significantly faster than pip:
- **10-100x faster** dependency resolution
- **Automatic virtual environment** management
- **Reproducible builds** with a lockfile
- **Drop-in replacement** for pip commands

Learn more at [astral.sh/uv](https://astral.sh/uv)
NEXT_STEPS.md (new file, 440 lines)
@@ -0,0 +1,440 @@
# Next Steps for Local Transcription

This document outlines potential future enhancements and features for the Local Transcription application.

## Current Status: Phase 1 Complete ✅

The application currently has:
- ✅ Desktop GUI with PySide6
- ✅ Real-time transcription with Whisper (faster-whisper)
- ✅ Audio capture with automatic sample rate detection and resampling
- ✅ Noise suppression with Voice Activity Detection (VAD)
- ✅ Web server for OBS browser source integration
- ✅ Configurable display settings (font, timestamps, fade duration)
- ✅ Settings apply without restart
- ✅ Auto-fade for web display
- ✅ Standalone executable builds for Linux and Windows
- ✅ CUDA support (with automatic CPU fallback)

## Phase 2: Multi-User Server Architecture (Optional)

If you want to enable multiple users to sync their transcriptions to a shared display:

### Server Components

1. **WebSocket Server** (see the sketch after this list)
   - Accept connections from multiple clients
   - Aggregate transcriptions from all connected users
   - Broadcast to web display clients
   - Handle user authentication/authorization
   - Rate limiting and abuse prevention

2. **Database/Storage** (Optional)
   - Store transcription history
   - User management
   - Session logs for later review
   - Consider: SQLite, PostgreSQL, or Redis

3. **Web Admin Interface**
   - Monitor connected clients
   - View active sessions
   - Manage users and permissions
   - Export transcription logs
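
A minimal sketch of the aggregation hub, assuming FastAPI (already used for the web display); the endpoint paths and message shape here are assumptions:

```python
# Illustrative aggregation hub; paths and message format are assumptions.
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()
display_clients: set[WebSocket] = set()  # connected OBS browser sources

@app.websocket("/ws/ingest")
async def ingest(ws: WebSocket):
    """Each transcription client connects here and streams JSON chunks."""
    await ws.accept()
    try:
        while True:
            chunk = await ws.receive_json()  # e.g. {"user": ..., "text": ...}
            # Fan the chunk out to every connected display client.
            for client in list(display_clients):
                try:
                    await client.send_json(chunk)
                except Exception:
                    display_clients.discard(client)
    except WebSocketDisconnect:
        pass

@app.websocket("/ws/display")
async def display(ws: WebSocket):
    """Web display pages (OBS browser sources) subscribe here."""
    await ws.accept()
    display_clients.add(ws)
    try:
        while True:
            await ws.receive_text()  # keep the connection alive
    except WebSocketDisconnect:
        display_clients.discard(ws)
```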

### Client Updates

1. **Server Sync Toggle**
   - Enable/disable server sync in Settings
   - Server URL configuration
   - API key/authentication setup
   - Connection status indicator

2. **Network Handling** (see the reconnect sketch after this list)
   - Auto-reconnect on connection loss
   - Queue transcriptions when offline
   - Sync when connection restored
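
A sketch of the reconnect-with-offline-queue behavior using the `websockets` client library named elsewhere in these docs; the URL and payload shape are assumptions:

```python
# Illustrative client-side sync loop; URL and payload shape are assumptions.
import asyncio
import collections
import json

import websockets

outbox: collections.deque = collections.deque(maxlen=1000)  # offline queue

def submit(user: str, text: str) -> None:
    """Called by the transcription engine for each finished chunk."""
    outbox.append({"user": user, "text": text})

async def sync_loop(url: str = "ws://localhost:8080/ws/ingest") -> None:
    while True:
        try:
            async with websockets.connect(url) as ws:
                while True:
                    # Drain anything queued while the server was unreachable.
                    while outbox:
                        await ws.send(json.dumps(outbox[0]))
                        outbox.popleft()  # drop only after a successful send
                    await asyncio.sleep(0.1)
        except (OSError, websockets.exceptions.WebSocketException):
            await asyncio.sleep(2)  # connection lost: back off, then retry
```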

### Implementation Technologies

- **Server Framework**: FastAPI (already used for web display)
- **WebSocket**: Already integrated
- **Database**: SQLAlchemy + SQLite/PostgreSQL
- **Deployment**: Docker container for easy deployment

**Estimated Effort**: 2-3 weeks for full implementation

---

## Phase 3: Enhanced Features

### Transcription Improvements

1. **Multi-Language Support**
   - Automatic language detection
   - Real-time language switching
   - Translation between languages
   - Per-user language settings

2. **Speaker Diarization**
   - Detect and label different speakers
   - Use pyannote.audio or similar
   - Automatically assign speaker IDs

3. **Custom Vocabulary**
   - Add gaming terms, streamer names
   - Technical jargon support
   - Proper noun correction

4. **Punctuation & Formatting**
   - Automatic punctuation insertion
   - Sentence capitalization
   - Better text formatting

### Display Enhancements

1. **Theme System**
   - Light/dark themes
   - Custom color schemes
   - User-created themes (JSON/YAML)
   - Per-element styling

2. **Animation Options**
   - Different fade effects
   - Slide in/out animations
   - Configurable transition speeds
   - Particle effects (optional)

3. **Layout Modes**
   - Karaoke-style (word highlighting)
   - Ticker tape (scrolling bottom)
   - Multi-column for multiple users
   - Picture-in-picture mode

4. **Web Display Customization**
   - CSS customization interface
   - Live preview in settings
   - Save/load custom styles
   - Community theme sharing

### Audio Processing

1. **Advanced Noise Reduction**
   - RNNoise integration
   - Custom noise profiles
   - Adaptive filtering
   - Echo cancellation

2. **Audio Effects**
   - Equalization presets
   - Compression/normalization
   - Voice enhancement filters

3. **Multi-Input Support**
   - Multiple microphones simultaneously
   - Virtual audio cable integration
   - Audio routing/mixing

---

## Phase 4: Integration & Automation

### OBS Integration

1. **OBS Plugin** (Advanced)
   - Native OBS plugin instead of browser source
   - Lower resource usage
   - Better performance
   - Tighter integration

2. **Scene Integration**
   - Auto-show/hide based on speech
   - Integrate with OBS scene switcher
   - Hotkey support

### Streaming Platform Integration

1. **Twitch Integration**
   - Send captions to Twitch chat
   - Twitch API integration
   - Custom Twitch bot

2. **YouTube Integration**
   - Live caption upload
   - YouTube API integration

3. **Discord Integration** (see the webhook sketch after this list)
   - Send transcriptions to a Discord webhook
   - Discord bot for voice chat transcription
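
The webhook half is nearly trivial. A sketch, assuming a standard Discord webhook (the URL below is a placeholder):

```python
# Illustrative Discord webhook post; the URL is a placeholder.
import requests

WEBHOOK_URL = "https://discord.com/api/webhooks/<id>/<token>"

def post_transcription(user: str, text: str) -> None:
    # Discord webhooks accept a JSON body with a "content" field.
    requests.post(WEBHOOK_URL, json={"content": f"**{user}**: {text}"}, timeout=5)
```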

### Automation

1. **Hotkey Support**
   - Global hotkeys for start/stop
   - Toggle display visibility
   - Quick settings access

2. **Voice Commands**
   - "Hey Transcription, start/stop"
   - Command detection in audio stream
   - Configurable wake words

3. **Auto-Start Options**
   - Start with OBS
   - Start on system boot
   - Auto-detect streaming software

---

## Phase 5: Advanced Features

### AI Enhancements

1. **Summarization**
   - Real-time conversation summaries
   - Key point extraction
   - Topic detection

2. **Sentiment Analysis**
   - Detect tone/emotion
   - Highlight important moments
   - Filter profanity (optional)

3. **Context Awareness**
   - Remember conversation context
   - Better transcription accuracy
   - Adaptive vocabulary

### Analytics & Insights

1. **Usage Statistics**
   - Words per minute
   - Speaking time per user
   - Most common words/phrases
   - Accuracy metrics

2. **Export Options**
   - Export to SRT/VTT for video captions
   - PDF/Word document export
   - CSV for data analysis
   - JSON API for custom tools

3. **Search & Filter**
   - Search transcription history
   - Filter by user, date, keyword
   - Highlight search results

### Accessibility

1. **Screen Reader Support**
   - Full NVDA/JAWS compatibility
   - Keyboard navigation
   - Voice feedback

2. **High Contrast Modes**
   - Enhanced visibility options
   - Color blind friendly palettes

3. **Text-to-Speech**
   - Read back transcriptions
   - Multiple voice options
   - Speed control

---

## Performance Optimizations

### Current Considerations

1. **Model Optimization** (see the quantization example after this list)
   - Quantization (int8, int4)
   - Smaller model variants
   - TensorRT optimization (NVIDIA)
   - ONNX Runtime support

2. **Caching**
   - Cache common phrases
   - Model warm-up on startup
   - Preload frequently used resources

3. **Resource Management**
   - Dynamic batch sizing
   - Memory pooling
   - Thread pool optimization
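
Quantization is already a one-line change with faster-whisper:

```python
# int8 quantization with faster-whisper: lower memory use and faster
# CPU inference, at a small accuracy cost.
from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu", compute_type="int8")
```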

### Future Optimizations

1. **Distributed Processing**
   - Offload to cloud GPU
   - Share processing across multiple machines
   - Load balancing

2. **Edge Computing**
   - Run on edge devices (Raspberry Pi)
   - Mobile app support
   - Embedded systems

---

## Community Features

### Sharing & Collaboration

1. **Theme Marketplace**
   - Share custom themes
   - Download community themes
   - Rating system

2. **Plugin System**
   - Allow community plugins
   - Custom audio filters
   - Display widgets
   - Integration modules

3. **Documentation**
   - Video tutorials
   - Wiki/knowledge base
   - API documentation
   - Developer guides

### User Support

1. **In-App Help**
   - Contextual help tooltips
   - Getting started wizard
   - Troubleshooting guide

2. **Community Forum**
   - GitHub Discussions
   - Discord server
   - Reddit community

---

## Technical Debt & Maintenance

### Code Quality

1. **Testing**
   - Unit tests for core modules
   - Integration tests
   - End-to-end tests
   - Performance benchmarks

2. **Documentation**
   - API documentation
   - Code comments
   - Architecture diagrams
   - Developer setup guide

3. **CI/CD**
   - Automated builds
   - Automated testing
   - Release automation
   - Cross-platform testing

### Security

1. **Security Audits**
   - Dependency scanning
   - Vulnerability assessment
   - Code security review

2. **Data Privacy**
   - Local-first by default
   - Optional cloud features
   - GDPR compliance (if applicable)
   - Clear privacy policy

---

## Immediate Quick Wins

These are small enhancements that could be implemented quickly:

### Easy (< 1 day)

- [ ] Add application icon
- [ ] Add "About" dialog with version info
- [ ] Add keyboard shortcuts (Ctrl+S for settings, etc.)
- [ ] Add system tray icon
- [ ] Save window position/size
- [ ] Add "Check for Updates" feature
- [ ] Export transcriptions to text file

### Medium (1-3 days)

- [ ] Add profanity filter (optional)
- [ ] Add confidence score display
- [ ] Add audio level meter
- [ ] Multiple language support in UI
- [ ] Dark/light theme toggle
- [ ] Backup/restore settings
- [ ] Recent transcriptions history

### Larger (1+ weeks)

- [ ] Cloud sync for settings
- [ ] Mobile companion app
- [ ] Browser extension
- [ ] API server mode
- [ ] Plugin architecture
- [ ] Advanced audio visualization

---

## Resources & References

### Documentation
- [Faster-Whisper](https://github.com/guillaumekln/faster-whisper)
- [PySide6 Documentation](https://doc.qt.io/qtforpython/)
- [FastAPI Documentation](https://fastapi.tiangolo.com/)
- [PyInstaller Manual](https://pyinstaller.org/en/stable/)

### Similar Projects
- [whisper.cpp](https://github.com/ggerganov/whisper.cpp) - C++ implementation
- [Buzz](https://github.com/chidiwilliams/buzz) - Desktop transcription tool
- [OpenAI Whisper](https://github.com/openai/whisper) - Original implementation

### Community
- Create GitHub Discussions for feature requests
- Set up issue templates
- Contributing guidelines
- Code of conduct

---

## Decision Log

Track major architectural decisions here:

### 2025-12-25: PyInstaller for Distribution
- **Decision**: Use PyInstaller for creating standalone executables
- **Rationale**: Good PySide6 support, active development, cross-platform
- **Alternatives Considered**: cx_Freeze, Nuitka, py2exe
- **Impact**: Users can run the app without a Python installation

### 2025-12-25: CUDA Build Strategy
- **Decision**: Provide CUDA-enabled builds that bundle the CUDA runtime
- **Rationale**: Universal builds work everywhere, automatic GPU detection
- **Trade-off**: Larger file size (~600MB extra) for better UX
- **Impact**: Single build for both GPU and CPU users

### 2025-12-25: Web Server Always Running
- **Decision**: Remove the enable/disable toggle; always run the web server
- **Rationale**: Simplifies UX, no configuration needed for OBS
- **Impact**: Uses one local port (8080 by default), minimal overhead

---

## Contact & Contribution

When this project is public:
- **Issues**: Report bugs and request features on GitHub Issues
- **Pull Requests**: Contributions welcome! See CONTRIBUTING.md
- **Discussions**: Join GitHub Discussions for questions and ideas
- **License**: [To be determined - consider MIT or Apache 2.0]

---

*Last Updated: 2025-12-25*
*Version: 1.0.0 (Phase 1 Complete)*
README.md (new file, 494 lines)
@@ -0,0 +1,494 @@
# Local Transcription for Streamers

A local speech-to-text application designed for streamers that provides real-time transcription using Whisper or similar models. Multiple users can run the application locally and sync their transcriptions to a centralized web stream that can be easily captured in OBS or other streaming software.

## Features

- **Standalone Desktop Application**: Use locally with built-in GUI display - no server required
- **Local Transcription**: Run Whisper (or compatible models) locally on your machine
- **CPU/GPU Support**: Choose between CPU or GPU processing based on your hardware
- **Real-time Processing**: Live audio transcription with minimal latency
- **Noise Suppression**: Built-in audio preprocessing to reduce background noise
- **User Configuration**: Set your display name and preferences through the GUI
- **Optional Multi-user Sync**: Connect to a server to sync transcriptions with other users
- **OBS Integration**: Web-based output designed for easy browser source capture
- **Privacy-First**: All processing happens locally; only transcription text is shared
- **Customizable**: Configure model size, language, and streaming settings

## Quick Start

### Running from Source

```bash
# Install dependencies
uv sync

# Run the application
uv run python main.py
```

### Building Standalone Executables

To create standalone executables for distribution:

**Linux:**
```bash
./build.sh
```

**Windows:**
```cmd
build.bat
```

For detailed build instructions, see [BUILD.md](BUILD.md).

## Architecture Overview

The application can run in two modes:

### Standalone Mode (No Server Required):
1. **Desktop Application**: Captures audio, performs speech-to-text, and displays transcriptions locally in a GUI window

### Multi-user Sync Mode (Optional):
1. **Local Transcription Client**: Captures audio, performs speech-to-text, and sends results to the web server
2. **Centralized Web Server**: Aggregates transcriptions from multiple clients and serves a web stream
3. **Web Stream Interface**: Browser-accessible page displaying synchronized transcriptions (for OBS capture)

## Use Cases

- **Multi-language Streams**: Multiple translators transcribing in different languages
- **Accessibility**: Provide real-time captions for viewers
- **Collaborative Podcasts**: Multiple hosts with separate transcriptions
- **Gaming Commentary**: Track who said what in multiplayer sessions

---

## Implementation Plan

### Phase 1: Standalone Desktop Application

**Objective**: Build a fully functional standalone transcription app with GUI that works without any server

#### Components:
1. **Audio Capture Module** (see the pipeline sketch after this list)
   - Capture system audio or microphone input
   - Support multiple audio sources (virtual audio cables, physical devices)
   - Real-time audio buffering with configurable chunk sizes
   - **Noise Suppression**: Preprocess audio to reduce background noise
   - Libraries: `pyaudio`, `sounddevice`, `noisereduce`, `webrtcvad`

2. **Noise Suppression Engine**
   - Real-time noise reduction using RNNoise or noisereduce
   - Adjustable noise reduction strength
   - Optional VAD (Voice Activity Detection) to skip silent segments
   - Libraries: `noisereduce`, `rnnoise-python`, `webrtcvad`

3. **Transcription Engine**
   - Integrate OpenAI Whisper (or alternatives: faster-whisper, whisper.cpp)
   - Support multiple model sizes (tiny, base, small, medium, large)
   - CPU and GPU inference options
   - Model management and automatic downloading
   - Libraries: `openai-whisper`, `faster-whisper`, `torch`

4. **Device Selection**
   - Auto-detect available compute devices (CPU, CUDA, MPS for Mac)
   - Allow user to specify preferred device via GUI
   - Graceful fallback if GPU unavailable
   - Display device status and performance metrics

5. **Desktop GUI Application**
   - Cross-platform GUI using PyQt6, Tkinter, or CustomTkinter
   - Main transcription display window (scrolling text area)
   - Settings panel for configuration:
     - User name input field
     - Audio input device selector
     - Model size selector
     - CPU/GPU toggle
   - Start/Stop transcription button
   - Optional: System tray integration
   - Libraries: `PyQt6`, `customtkinter`, or `tkinter`

6. **Local Display**
   - Real-time transcription display in GUI window
   - Scrolling text with timestamps
   - User name/label shown with transcriptions
   - Copy transcription to clipboard
   - Optional: Save transcription to file (TXT, SRT, VTT)
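
A condensed sketch of how the first three components fit together, assuming the libraries listed above (sounddevice, webrtcvad, faster-whisper); chunk length and model size are example values:

```python
# Illustrative capture -> VAD -> transcribe pipeline; parameter values
# are examples, not the app's actual configuration.
import numpy as np
import sounddevice as sd
import webrtcvad
from faster_whisper import WhisperModel

SAMPLE_RATE = 16000      # Whisper expects 16 kHz mono audio
CHUNK_SECONDS = 3
vad = webrtcvad.Vad(2)   # aggressiveness 0-3
model = WhisperModel("base", device="auto")

def has_speech(chunk: np.ndarray) -> bool:
    """Run 30 ms frames through WebRTC VAD; True if any frame has speech."""
    pcm = (chunk * 32767).astype(np.int16).tobytes()
    frame = int(SAMPLE_RATE * 0.03) * 2  # bytes per 30 ms of int16 audio
    return any(
        vad.is_speech(pcm[i:i + frame], SAMPLE_RATE)
        for i in range(0, len(pcm) - frame, frame)
    )

while True:
    audio = sd.rec(SAMPLE_RATE * CHUNK_SECONDS, samplerate=SAMPLE_RATE,
                   channels=1, dtype="float32")
    sd.wait()
    chunk = audio.flatten()
    if not has_speech(chunk):
        continue  # VAD: skip silent chunks entirely
    segments, _ = model.transcribe(chunk, language="en")
    for seg in segments:
        print(f"[{seg.start:.1f}s] {seg.text.strip()}")
```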

#### Tasks:
- [ ] Set up project structure and dependencies
- [ ] Implement audio capture with device selection
- [ ] Add noise suppression and VAD preprocessing
- [ ] Integrate Whisper model loading and inference
- [ ] Add CPU/GPU device detection and selection logic
- [ ] Create real-time audio buffer processing pipeline
- [ ] Design and implement GUI layout (main window)
- [ ] Add settings panel with user name configuration
- [ ] Implement local transcription display area
- [ ] Add start/stop controls and status indicators
- [ ] Test transcription accuracy and latency
- [ ] Test noise suppression effectiveness

---

### Phase 2: Web Server and Sync System

**Objective**: Create a centralized server to aggregate and serve transcriptions

#### Components:
1. **Web Server**
   - FastAPI or Flask-based REST API
   - WebSocket support for real-time updates
   - User/client registration and management
   - Libraries: `fastapi`, `uvicorn`, `websockets`

2. **Transcription Aggregator**
   - Receive transcription chunks from multiple clients
   - Associate transcriptions with user IDs/names
   - Timestamp management and synchronization
   - Buffer management for smooth streaming

3. **Database/Storage** (Optional)
   - Store transcription history (SQLite for simplicity)
   - Session management
   - Export functionality (SRT, VTT, TXT formats)

#### API Endpoints (a skeleton sketch follows this list):
- `POST /api/register` - Register a new client
- `POST /api/transcription` - Submit transcription chunk
- `WS /api/stream` - WebSocket for real-time transcription stream
- `GET /stream` - Web page for OBS browser source
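
A skeleton of those four endpoints in FastAPI; the request/response models are assumptions, not the project's actual schema:

```python
# Skeleton of the four endpoints above; models are illustrative.
import uuid
from fastapi import FastAPI, WebSocket
from fastapi.responses import HTMLResponse
from pydantic import BaseModel

app = FastAPI()

class Registration(BaseModel):
    display_name: str

class Chunk(BaseModel):
    client_id: str
    text: str
    timestamp: float

@app.post("/api/register")
async def register(reg: Registration):
    return {"client_id": str(uuid.uuid4()), "display_name": reg.display_name}

@app.post("/api/transcription")
async def submit(chunk: Chunk):
    # Aggregation/broadcast logic would go here.
    return {"accepted": True}

@app.websocket("/api/stream")
async def stream(ws: WebSocket):
    await ws.accept()
    # Aggregated transcriptions would be pushed to this client here,
    # e.g. await ws.send_json(chunk) as chunks arrive.

@app.get("/stream", response_class=HTMLResponse)
async def page():
    return "<html><body><!-- OBS browser source page --></body></html>"
```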

#### Tasks:
- [ ] Set up FastAPI server with CORS support
- [ ] Implement WebSocket handler for real-time streaming
- [ ] Create client registration system
- [ ] Build transcription aggregation logic
- [ ] Add timestamp synchronization
- [ ] Create data models for clients and transcriptions

---

### Phase 3: Client-Server Communication (Optional Multi-user Mode)

**Objective**: Add optional server connectivity to enable multi-user transcription sync

#### Components:
1. **HTTP/WebSocket Client**
   - Register client with server on startup
   - Send transcription chunks as they're generated
   - Handle connection drops and reconnection
   - Libraries: `requests`, `websockets`

2. **Configuration System**
   - Config file for server URL, API keys, user settings
   - Model preferences (size, language)
   - Audio input settings
   - Format: YAML or JSON

3. **Status Monitoring**
   - Connection status indicator
   - Transcription queue health
   - Error handling and logging

#### Tasks:
- [ ] Add "Enable Server Sync" toggle to GUI
- [ ] Add server URL configuration field in settings
- [ ] Implement WebSocket client for sending transcriptions
- [ ] Add configuration file support (YAML/JSON)
- [ ] Create connection management with auto-reconnect
- [ ] Add local logging and error handling
- [ ] Add server connection status indicator to GUI
- [ ] Allow app to function normally if server is unavailable

---

### Phase 4: Web Stream Interface (OBS Integration)

**Objective**: Create a web page that displays synchronized transcriptions for OBS

#### Components:
1. **Web Frontend**
   - HTML/CSS/JavaScript page for displaying transcriptions
   - Responsive design with customizable styling
   - Auto-scroll with configurable retention window
   - Libraries: Vanilla JS or a lightweight framework (Alpine.js, htmx)

2. **Styling Options**
   - Customizable fonts, colors, sizes
   - Background transparency for OBS chroma key
   - User name/ID display options
   - Timestamp display (optional)

3. **Display Modes**
   - Scrolling captions (like live TV captions)
   - Multi-user panel view (separate sections per user)
   - Overlay mode (minimal UI for transparency)

#### Tasks:
- [ ] Create HTML template for transcription display
- [ ] Implement WebSocket client in JavaScript
- [ ] Add CSS styling with OBS-friendly transparency
- [ ] Create customization controls (URL parameters or UI)
- [ ] Test with OBS browser source
- [ ] Add configurable retention/scroll behavior

---

### Phase 5: Advanced Features

**Objective**: Enhance functionality and user experience

#### Features:
1. **Language Detection**
   - Auto-detect spoken language
   - Multi-language support in a single stream
   - Language selector in GUI

2. **Speaker Diarization** (Optional)
   - Identify different speakers
   - Label transcriptions by speaker
   - Useful for multi-host streams

3. **Profanity Filtering**
   - Optional word filtering/replacement
   - Customizable filter lists
   - Toggle in GUI settings

4. **Advanced Noise Profiles**
   - Save and load custom noise profiles
   - Adaptive noise suppression
   - Different profiles for different environments

5. **Export Functionality**
   - Save transcriptions in multiple formats (TXT, SRT, VTT, JSON)
   - Export button in GUI
   - Automatic session saving

6. **Hotkey Support**
   - Global hotkeys to start/stop transcription
   - Mute/unmute hotkey
   - Quick save hotkey

7. **Docker Support**
   - Containerized server deployment
   - Docker Compose for easy multi-component setup
   - Pre-built images for easy deployment

8. **Themes and Customization**
   - Dark/light theme toggle
   - Customizable font sizes and colors for display
   - OBS-friendly transparent overlay mode

#### Tasks:
- [ ] Add language detection and multi-language support
- [ ] Implement speaker diarization
- [ ] Create optional profanity filter
- [ ] Add export functionality (SRT, VTT, plain text, JSON)
- [ ] Implement global hotkey support
- [ ] Create Docker containers for server component
- [ ] Add theme customization options
- [ ] Create advanced noise profile management

---

## Technology Stack

### Local Client:
- **Python 3.9+**
- **GUI**: PyQt6 / CustomTkinter / tkinter
- **Audio**: PyAudio / sounddevice
- **Noise Suppression**: noisereduce / rnnoise-python
- **VAD**: webrtcvad
- **ML Framework**: PyTorch (for Whisper)
- **Transcription**: openai-whisper / faster-whisper
- **Networking**: websockets, requests (optional, for server sync)
- **Config**: PyYAML / json

### Server:
- **Backend**: FastAPI / Flask
- **WebSocket**: python-websockets / FastAPI WebSockets
- **Server**: Uvicorn / Gunicorn
- **Database** (optional): SQLite / PostgreSQL
- **CORS**: fastapi-cors

### Web Interface:
- **Frontend**: HTML5, CSS3, JavaScript (ES6+)
- **Real-time**: WebSocket API
- **Styling**: CSS Grid/Flexbox for layout

---

## Project Structure

```
local-transcription/