Add unified per-speaker font support and remote transcription service

Font changes:
- Consolidate font settings into single Display Settings section
- Support Web-Safe, Google Fonts, and Custom File uploads for both displays
- Fix Google Fonts URL encoding (use + instead of %2B for spaces)
- Fix per-speaker font inline style quote escaping in Node.js display
- Add font debug logging to help diagnose font issues
- Update web server to sync all font settings on settings change
- Remove deprecated PHP server documentation files

New features:
- Add remote transcription service for GPU offloading
- Add instance lock to prevent multiple app instances
- Add version tracking

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-11 18:56:12 -08:00
parent f035bdb927
commit ff067b3368
23 changed files with 2486 additions and 1160 deletions


@@ -0,0 +1,173 @@
# Remote Transcription Service
A standalone GPU-accelerated transcription service that accepts audio streams over WebSocket and returns transcriptions. Designed for offloading transcription processing from client machines to a GPU-equipped server.
## Features
- WebSocket-based audio streaming
- API key authentication
- GPU acceleration (CUDA)
- Multiple simultaneous clients
- Health check endpoints
## Requirements
- Python 3.10+
- NVIDIA GPU with CUDA support (recommended)
- 4GB+ VRAM for base model, 8GB+ for large models
## Installation
```bash
cd server/transcription-service
# Create virtual environment
python -m venv venv
source venv/bin/activate # Linux/Mac
# or: venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
# For GPU support, install CUDA version of PyTorch
pip install torch --index-url https://download.pytorch.org/whl/cu121
```
## Configuration
Set environment variables before starting:
```bash
# Recommended: API key(s) for authentication
# (if no key is set, the server logs a warning and accepts all connections)
export TRANSCRIPTION_API_KEY="your-secret-key"
# Or multiple keys (comma-separated)
export TRANSCRIPTION_API_KEYS="key1,key2,key3"
# Optional: Model selection (default: base.en)
export TRANSCRIPTION_MODEL="base.en"
```
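To check which keys a given environment would yield before starting the service, the merging rule can be sketched in a few lines. This is an illustrative helper that mirrors the loading logic in `server.py`; the name `parse_api_keys` is hypothetical and not part of the service.

```python
import os
from typing import Mapping, Set


def parse_api_keys(env: Mapping[str, str] = os.environ) -> Set[str]:
    """Collect API keys from both variables (hypothetical helper, not in server.py)."""
    # Comma-separated list: trim whitespace and drop empty entries
    raw = env.get("TRANSCRIPTION_API_KEYS", "")
    keys = {key.strip() for key in raw.split(",") if key.strip()}
    # The single-key variable is merged into the same set
    single = env.get("TRANSCRIPTION_API_KEY", "")
    if single:
        keys.add(single)
    return keys
```

An empty result means the service would start without authentication and accept any key.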
## Running
```bash
# Start the service
python server.py --host 0.0.0.0 --port 8765
# Or with custom model
python server.py --host 0.0.0.0 --port 8765 --model medium.en
```
## API Endpoints
### Health Check
```
GET /
GET /health
```
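Because the model loads in a background thread, a client may want to wait until the service reports that it is ready. A minimal sketch using only the standard library follows; the field names `status` and `initialized` are taken from the `/health` handler in `server.py`, while `fetch_health` and `is_ready` are hypothetical helper names.

```python
import json
import urllib.request


def is_ready(payload: dict) -> bool:
    """Return True once /health reports a fully initialized engine."""
    return payload.get("status") == "healthy" and bool(payload.get("initialized"))


def fetch_health(base_url: str, timeout: float = 5.0) -> dict:
    """GET <base_url>/health and decode the JSON body."""
    url = f"{base_url.rstrip('/')}/health"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

For example, `is_ready(fetch_health("http://your-server:8765"))` could be polled before opening the WebSocket.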
### WebSocket Transcription
```
WS /ws/transcribe
```
## WebSocket Protocol
1. **Authentication**
```json
// Client sends
{"type": "auth", "api_key": "your-key"}
// Server responds
{"type": "auth_result", "success": true, "message": "..."}
```
2. **Send Audio**
```json
// Client sends (audio as base64-encoded raw float32 samples, e.g. the bytes of a float32 numpy array)
{"type": "audio", "data": "base64...", "sample_rate": 16000}
// Server responds
{"type": "transcription", "text": "Hello world", "is_preview": false, "timestamp": "..."}
```
3. **Keep-alive**
```json
// Client sends
{"type": "ping"}
// Server responds
{"type": "pong"}
```
4. **Disconnect**
```json
// Client sends
{"type": "end"}
```
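The messages above can be built without any third-party dependency. The sketch below is illustrative only: the function names are hypothetical, and it assumes little-endian float32 samples, which matches what `np.frombuffer(..., dtype=np.float32)` reads on typical x86 and ARM servers.

```python
import base64
import json
import struct
from typing import List, Sequence


def auth_message(api_key: str) -> str:
    """Build the first message a client sends after connecting."""
    return json.dumps({"type": "auth", "api_key": api_key})


def audio_message(samples: Sequence[float], sample_rate: int = 16000) -> str:
    """Pack samples as little-endian float32 and wrap them in an audio message."""
    raw = struct.pack(f"<{len(samples)}f", *samples)
    return json.dumps({
        "type": "audio",
        "data": base64.b64encode(raw).decode("ascii"),
        "sample_rate": sample_rate,
    })


def decode_audio(message: str) -> List[float]:
    """Inverse of audio_message; handy for checking a payload locally."""
    payload = json.loads(message)
    raw = base64.b64decode(payload["data"])
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))
```

The resulting strings can be sent as text frames with any WebSocket client library; the transport itself is left out of this sketch.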
## Client Integration
The Local Transcription app includes a remote transcription client. Configure in Settings:
1. Enable "Remote Processing"
2. Set Server URL: `ws://your-server:8765/ws/transcribe`
3. Enter your API key
## Deployment
### Docker
```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY server.py .
ENV TRANSCRIPTION_MODEL=base.en
EXPOSE 8765
CMD ["python", "server.py", "--host", "0.0.0.0", "--port", "8765"]
```
### Systemd Service
```ini
[Unit]
Description=Remote Transcription Service
After=network.target
[Service]
Type=simple
User=transcription
WorkingDirectory=/opt/transcription-service
Environment=TRANSCRIPTION_API_KEY=your-key
Environment=TRANSCRIPTION_MODEL=base.en
ExecStart=/opt/transcription-service/venv/bin/python server.py
Restart=always
[Install]
WantedBy=multi-user.target
```
## Models
Available Whisper models (larger = better quality, slower):
| Model | Parameters | VRAM | Speed |
|-------|-----------|------|-------|
| tiny.en | 39M | ~1GB | Fastest |
| base.en | 74M | ~1GB | Fast |
| small.en | 244M | ~2GB | Moderate |
| medium.en | 769M | ~5GB | Slow |
| large-v3 | 1550M | ~10GB | Slowest |
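
As a rough aid for choosing a model, the table can be turned into a lookup that returns the largest model whose approximate VRAM figure fits a given budget. This is a sketch under the assumption that the table's figures are accurate enough for a first guess; `pick_model` is a hypothetical helper, and real usage should leave headroom for other GPU workloads.

```python
from typing import Optional

# (model name, approximate VRAM in GB), copied from the table above, smallest first
MODEL_VRAM_GB = [
    ("tiny.en", 1),
    ("base.en", 1),
    ("small.en", 2),
    ("medium.en", 5),
    ("large-v3", 10),
]


def pick_model(available_vram_gb: float) -> Optional[str]:
    """Return the largest listed model that fits, or None if none does."""
    best = None
    for name, vram_gb in MODEL_VRAM_GB:
        if vram_gb <= available_vram_gb:
            best = name  # later entries are larger, so keep overwriting
    return best
```

The returned name can be passed to `--model` or exported as `TRANSCRIPTION_MODEL`.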
## Security Notes
- Always use API key authentication in production
- Use HTTPS/WSS in production (via reverse proxy)
- Rate limit connections if needed
- Monitor GPU usage to prevent overload
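
The service does not include rate limiting. One possible approach is a small sliding-window limiter keyed by client address, sketched below. The class name, its parameters, and the default limits are assumptions made for illustration and are not part of `server.py`.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class ConnectionRateLimiter:
    """Allow at most max_attempts connections per client within window_seconds."""

    def __init__(
        self,
        max_attempts: int = 5,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.max_attempts = max_attempts
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._attempts: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, client_ip: str) -> bool:
        now = self._clock()
        attempts = self._attempts[client_ip]
        # Drop attempts that have fallen out of the window
        while attempts and now - attempts[0] >= self.window_seconds:
            attempts.popleft()
        if len(attempts) >= self.max_attempts:
            return False
        attempts.append(now)
        return True
```

In the WebSocket handler, a check such as `limiter.allow(websocket.client.host)` could run before authentication, closing the connection when it returns `False`. A reverse proxy can enforce the same policy without touching the service.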


@@ -0,0 +1,8 @@
fastapi>=0.100.0
uvicorn>=0.22.0
websockets>=11.0
numpy>=1.24.0
pydantic>=2.0.0
faster-whisper>=0.10.0
RealtimeSTT>=0.1.0
torch>=2.0.0


@@ -0,0 +1,366 @@
"""
Remote Transcription Service

A standalone FastAPI WebSocket server that accepts audio streams and returns transcriptions.
Designed to run on a GPU-equipped server for offloading transcription processing.

Usage:
    python server.py [--host HOST] [--port PORT] [--model MODEL]

Environment variables:
    TRANSCRIPTION_API_KEY: API key for authentication (all connections are accepted if no key is set)
    TRANSCRIPTION_API_KEYS: Comma-separated list of accepted API keys
    TRANSCRIPTION_MODEL: Whisper model to use (default: base.en)
"""
import asyncio
import argparse
import os
import sys
import json
import base64
import logging
from datetime import datetime
from pathlib import Path
from typing import Optional, Dict, Set
from threading import Thread, Lock
import numpy as np
from fastapi import FastAPI, WebSocket, WebSocketDisconnect, HTTPException, Depends
from fastapi.responses import JSONResponse
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
import uvicorn
# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

# API Key authentication
API_KEYS: Set[str] = set()


def load_api_keys():
    """Load API keys from environment variable."""
    global API_KEYS
    keys_env = os.environ.get('TRANSCRIPTION_API_KEYS', '')
    if keys_env:
        API_KEYS = set(key.strip() for key in keys_env.split(',') if key.strip())
    # Also support single key
    single_key = os.environ.get('TRANSCRIPTION_API_KEY', '')
    if single_key:
        API_KEYS.add(single_key)
    if not API_KEYS:
        logger.warning("No API keys configured. Set TRANSCRIPTION_API_KEY or TRANSCRIPTION_API_KEYS environment variable.")
        logger.warning("Service will accept all connections (INSECURE for production).")


def verify_api_key(api_key: str) -> bool:
    """Verify if the API key is valid."""
    if not API_KEYS:
        return True  # No authentication if no keys configured
    return api_key in API_KEYS


app = FastAPI(
    title="Remote Transcription Service",
    description="GPU-accelerated speech-to-text transcription service",
    version="1.0.0"
)

# Enable CORS for all origins (configure appropriately for production)
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
class TranscriptionEngine:
    """Manages the transcription engine with thread-safe access."""

    def __init__(self, model: str = "base.en", device: str = "auto"):
        self.model_name = model
        self.device = device
        self.recorder = None
        self.lock = Lock()
        self.is_initialized = False

    def initialize(self):
        """Initialize the transcription engine."""
        if self.is_initialized:
            return True
        try:
            from RealtimeSTT import AudioToTextRecorder

            # Determine device
            if self.device == "auto":
                import torch
                if torch.cuda.is_available():
                    self.device = "cuda"
                else:
                    self.device = "cpu"
            logger.info(f"Initializing transcription engine with model={self.model_name}, device={self.device}")

            # Create recorder with minimal configuration
            # We'll feed audio directly, not capture from microphone
            self.recorder = AudioToTextRecorder(
                model=self.model_name,
                language="en",
                device=self.device,
                compute_type="default",
                use_microphone=False,  # No mic capture; input_device_index=None alone only selects the default device
                input_device_index=None,
                silero_sensitivity=0.4,
                webrtc_sensitivity=3,
                post_speech_silence_duration=0.3,
                min_length_of_recording=0.5,
                enable_realtime_transcription=True,
                realtime_model_type="tiny.en",
            )
            self.is_initialized = True
            logger.info("Transcription engine initialized successfully")
            return True
        except Exception as e:
            logger.error(f"Failed to initialize transcription engine: {e}")
            return False

    def transcribe(self, audio_data: np.ndarray, sample_rate: int = 16000) -> Optional[str]:
        """
        Transcribe audio data.

        Args:
            audio_data: Audio data as a float32 numpy array
            sample_rate: Sample rate of the audio (no resampling is done; 16000 Hz is expected)

        Returns:
            Transcribed text or None if failed
        """
        with self.lock:
            if not self.is_initialized:
                return None
            try:
                # Use faster-whisper directly for one-shot transcription
                from faster_whisper import WhisperModel

                if not hasattr(self, '_whisper_model'):
                    self._whisper_model = WhisperModel(
                        self.model_name,
                        device=self.device,
                        compute_type="default"
                    )

                # The waveform is passed through unchanged, so warn on unexpected rates
                if sample_rate != 16000:
                    logger.warning(f"Expected 16000 Hz audio, got {sample_rate} Hz; results may be inaccurate")

                # Transcribe
                segments, info = self._whisper_model.transcribe(
                    audio_data,
                    beam_size=5,
                    language="en"
                )

                # Combine segments
                text = " ".join(segment.text for segment in segments)
                return text.strip()
            except Exception as e:
                logger.error(f"Transcription error: {e}")
                return None
# Global transcription engine
engine: Optional[TranscriptionEngine] = None


class ClientConnection:
    """Represents an active client connection."""

    def __init__(self, websocket: WebSocket, client_id: str):
        self.websocket = websocket
        self.client_id = client_id
        self.audio_buffer = []
        self.sample_rate = 16000
        self.connected_at = datetime.now()


# Active connections
active_connections: Dict[str, ClientConnection] = {}


@app.on_event("startup")
async def startup_event():
    """Initialize service on startup."""
    load_api_keys()
    global engine
    model = os.environ.get('TRANSCRIPTION_MODEL', 'base.en')
    engine = TranscriptionEngine(model=model)

    # Initialize in background thread to not block startup
    def init_engine():
        engine.initialize()

    Thread(target=init_engine, daemon=True).start()
    logger.info("Remote Transcription Service started")
@app.get("/")
async def root():
    """Health check endpoint."""
    return {
        "service": "Remote Transcription Service",
        "status": "running",
        "model": engine.model_name if engine else "not loaded",
        "device": engine.device if engine else "unknown",
        "active_connections": len(active_connections)
    }


@app.get("/health")
async def health():
    """Detailed health check."""
    return {
        "status": "healthy" if engine and engine.is_initialized else "initializing",
        "model": engine.model_name if engine else None,
        "device": engine.device if engine else None,
        "initialized": engine.is_initialized if engine else False,
        "connections": len(active_connections)
    }
@app.websocket("/ws/transcribe")
async def websocket_transcribe(websocket: WebSocket):
    """
    WebSocket endpoint for audio transcription.

    Protocol:
    1. Client sends: {"type": "auth", "api_key": "your-key"}
    2. Server responds: {"type": "auth_result", "success": true/false}
    3. Client sends audio chunks: {"type": "audio", "data": base64_audio, "sample_rate": 16000}
    4. Server responds with transcription: {"type": "transcription", "text": "...", "is_preview": false}
    5. Client can send: {"type": "ping"} and receives {"type": "pong"}
    6. Client can send: {"type": "end"} to close connection
    """
    await websocket.accept()
    client_id = f"client_{id(websocket)}_{datetime.now().timestamp()}"
    authenticated = False
    logger.info(f"New WebSocket connection: {client_id}")
    try:
        while True:
            data = await websocket.receive_text()
            message = json.loads(data)
            msg_type = message.get("type", "")

            if msg_type == "auth":
                # Authenticate client
                api_key = message.get("api_key", "")
                if verify_api_key(api_key):
                    authenticated = True
                    active_connections[client_id] = ClientConnection(websocket, client_id)
                    await websocket.send_json({
                        "type": "auth_result",
                        "success": True,
                        "message": "Authentication successful"
                    })
                    logger.info(f"Client {client_id} authenticated")
                else:
                    await websocket.send_json({
                        "type": "auth_result",
                        "success": False,
                        "message": "Invalid API key"
                    })
                    logger.warning(f"Client {client_id} failed authentication")
                    await websocket.close(code=4001, reason="Invalid API key")
                    return

            elif msg_type == "audio":
                if not authenticated:
                    await websocket.send_json({
                        "type": "error",
                        "message": "Not authenticated"
                    })
                    continue

                # Decode audio data
                audio_b64 = message.get("data", "")
                sample_rate = message.get("sample_rate", 16000)
                if audio_b64:
                    try:
                        audio_bytes = base64.b64decode(audio_b64)
                        audio_data = np.frombuffer(audio_bytes, dtype=np.float32)

                        # Transcribe in a worker thread so the blocking call
                        # does not stall the event loop for other clients
                        if engine and engine.is_initialized:
                            text = await asyncio.to_thread(engine.transcribe, audio_data, sample_rate)
                            if text:
                                await websocket.send_json({
                                    "type": "transcription",
                                    "text": text,
                                    "is_preview": False,
                                    "timestamp": datetime.now().isoformat()
                                })
                        else:
                            await websocket.send_json({
                                "type": "error",
                                "message": "Transcription engine not ready"
                            })
                    except Exception as e:
                        logger.error(f"Audio processing error: {e}")
                        await websocket.send_json({
                            "type": "error",
                            "message": f"Audio processing error: {str(e)}"
                        })

            elif msg_type == "end":
                logger.info(f"Client {client_id} requested disconnect")
                break

            elif msg_type == "ping":
                await websocket.send_json({"type": "pong"})

    except WebSocketDisconnect:
        logger.info(f"Client {client_id} disconnected")
    except Exception as e:
        logger.error(f"WebSocket error for {client_id}: {e}")
    finally:
        if client_id in active_connections:
            del active_connections[client_id]
def main():
    """Main entry point."""
    parser = argparse.ArgumentParser(description="Remote Transcription Service")
    parser.add_argument("--host", default="0.0.0.0", help="Host to bind to")
    parser.add_argument("--port", type=int, default=8765, help="Port to bind to")
    parser.add_argument(
        "--model",
        default=os.environ.get('TRANSCRIPTION_MODEL', 'base.en'),
        help="Whisper model to use (default: TRANSCRIPTION_MODEL or base.en)"
    )
    args = parser.parse_args()

    # An explicit --model takes precedence over the environment variable;
    # the startup hook reads the model name back from the environment
    os.environ['TRANSCRIPTION_MODEL'] = args.model

    logger.info(f"Starting Remote Transcription Service on {args.host}:{args.port}")
    logger.info(f"Model: {args.model}")
    uvicorn.run(
        app,
        host=args.host,
        port=args.port,
        log_level="info"
    )


if __name__ == "__main__":
    main()