WhisperLiveKit
Real-time, Fully Local Speech-to-Text with Speaker Identification
Real-time transcription directly to your browser, with a ready-to-use backend+server and a simple frontend.
Powered by Leading Research:
- Simul-Whisper/Streaming (SOTA 2025) - Ultra-low latency transcription using AlignAtt policy
- NLLW (2025), based on distilled NLLB (2022, 2024) - Simultaneous translation from & to 200 languages.
- WhisperStreaming (SOTA 2023) - Low latency transcription using LocalAgreement policy
- Streaming Sortformer (SOTA 2025) - Advanced real-time speaker diarization
- Diart (SOTA 2021) - Real-time speaker diarization
- Silero VAD (2024) - Enterprise-grade Voice Activity Detection
Why not just run a simple Whisper model on every audio batch? Whisper is designed for complete utterances, not real-time chunks. Processing small segments loses context, cuts off words mid-syllable, and produces poor transcription. WhisperLiveKit uses state-of-the-art simultaneous speech research for intelligent buffering and incremental processing.
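For intuition, the LocalAgreement policy from WhisperStreaming commits a word only once two consecutive hypotheses over the growing buffer agree on it. A minimal sketch of that idea (illustrative only, not WhisperLiveKit's actual implementation):

```python
# Sketch of the LocalAgreement idea: commit only the prefix on which two
# consecutive hypotheses over the growing audio buffer agree.
def local_agreement(prev_hyp: list[str], new_hyp: list[str]) -> list[str]:
    committed = []
    for prev_word, new_word in zip(prev_hyp, new_hyp):
        if prev_word != new_word:
            break  # first disagreement: everything after is still unstable
        committed.append(new_word)
    return committed

print(local_agreement(["the", "cat", "sat"], ["the", "cat", "sat", "on"]))
# -> ['the', 'cat', 'sat']  (agreed prefix is committed)
print(local_agreement(["the", "cat", "sad"], ["the", "cat", "sat", "on"]))
# -> ['the', 'cat']         (trailing disagreement is held back)
```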
Architecture
The backend supports multiple concurrent users. Voice Activity Detection reduces overhead when no voice is detected.
Installation & Quick Start
```bash
pip install whisperlivekit
```

You can also clone the repo and run `pip install -e .` for the latest version.
Quick Start

1. Start the transcription server:

   ```bash
   whisperlivekit-server --model base --language en
   ```

2. Open your browser and navigate to http://localhost:8000. Start speaking and watch your words appear in real-time!
- See tokenizer.py for the list of all available languages.
- For HTTPS requirements, see the Parameters section for SSL configuration options.
Browser extension: use it to capture audio from web pages. See chrome-extension for instructions.
Optional Dependencies
| Optional | pip install |
|---|---|
| Speaker diarization | git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr] |
| Apple Silicon optimizations | mlx-whisper |
| Translation | nllw |
| [Not recommended] Speaker diarization with Diart | diart |
| [Not recommended] Original Whisper backend | whisper |
| [Not recommended] Improved timestamps backend | whisper-timestamped |
| OpenAI API backend | openai |
See Parameters & Configuration below on how to use them.
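To check which optional backends are importable in your environment, a small probe can help. This is a sketch; the module names are inferred from the pip package names in the table above (the nllw module name in particular is an assumption):

```python
import importlib.util

# Feature -> module name to probe. Module names are inferred from the pip
# package names in the table above and may differ.
OPTIONAL = {
    "Apple Silicon (mlx-whisper)": "mlx_whisper",
    "Translation (nllw)": "nllw",
    "Diart diarization": "diart",
    "OpenAI API backend": "openai",
}

for feature, module in OPTIONAL.items():
    status = "installed" if importlib.util.find_spec(module) else "missing"
    print(f"{feature}: {status}")
```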
Usage Examples
Command-line Interface: Start the transcription server with various options:

```bash
# Large model, translating from French to Danish
whisperlivekit-server --model large-v3 --language fr --target-language da

# Diarization enabled, server listening on */80
whisperlivekit-server --host 0.0.0.0 --port 80 --model medium --diarization --language fr
```
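For quick scripted tests without the browser UI, a minimal WebSocket client can push audio to the /asr endpoint. This is a sketch, not part of the package: it assumes the server was started with --pcm-input (so raw s16le bytes are accepted directly), 16 kHz mono audio, and the third-party websockets library:

```python
# Minimal sketch of a test client (assumes a server started with --pcm-input).
import asyncio
import websockets

async def stream_pcm(path: str, url: str = "ws://localhost:8000/asr"):
    async with websockets.connect(url) as ws:
        async def receive():
            # Print JSON transcription updates as the server emits them.
            async for message in ws:
                print(message)

        recv_task = asyncio.create_task(receive())
        with open(path, "rb") as f:
            # 3200 bytes ~ 100 ms of 16 kHz mono s16le audio (assumed format).
            while chunk := f.read(3200):
                await ws.send(chunk)
                await asyncio.sleep(0.1)  # pace the stream roughly in real time
        await asyncio.sleep(2)  # allow final results to arrive
        recv_task.cancel()

asyncio.run(stream_pcm("speech.raw"))
```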
Python API Integration: Check basic_server for a more complete example of how to use the functions and classes.
```python
from whisperlivekit import TranscriptionEngine, AudioProcessor, parse_args
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from fastapi.responses import HTMLResponse
from contextlib import asynccontextmanager
import asyncio

transcription_engine = None

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load the shared engine once at startup; it is reused across connections.
    global transcription_engine
    transcription_engine = TranscriptionEngine(model="medium", diarization=True, lan="en")
    yield

app = FastAPI(lifespan=lifespan)

async def handle_websocket_results(websocket: WebSocket, results_generator):
    async for response in results_generator:
        await websocket.send_json(response)
    await websocket.send_json({"type": "ready_to_stop"})

@app.websocket("/asr")
async def websocket_endpoint(websocket: WebSocket):
    global transcription_engine
    # Create a new AudioProcessor for each connection, passing the shared engine
    audio_processor = AudioProcessor(transcription_engine=transcription_engine)
    results_generator = await audio_processor.create_tasks()
    results_task = asyncio.create_task(handle_websocket_results(websocket, results_generator))
    await websocket.accept()
    try:
        while True:
            message = await websocket.receive_bytes()
            await audio_processor.process_audio(message)
    except WebSocketDisconnect:
        results_task.cancel()
```
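If the example above is saved as basic_server.py, it can be launched the usual FastAPI way, e.g. `uvicorn basic_server:app --host localhost --port 8000` (the module name here is just this example's).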
Frontend Implementation: The package includes an HTML/JavaScript implementation here. You can also import it with `from whisperlivekit import get_inline_ui_html` and render it via `page = get_inline_ui_html()`.
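To serve that bundled UI directly from the FastAPI app above, a minimal sketch (the root path is an arbitrary choice, not something the package mandates):

```python
from fastapi.responses import HTMLResponse
from whisperlivekit import get_inline_ui_html

@app.get("/")  # `app` is the FastAPI instance from the example above
async def serve_ui():
    # Return the packaged HTML/JS frontend as a plain HTML response.
    return HTMLResponse(get_inline_ui_html())
```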
Parameters & Configuration
| Parameter | Description | Default |
|---|---|---|
| `--model` | Whisper model size. List and recommendations here | `small` |
| `--model-path` | `.pt` file/directory containing the Whisper model. Overrides `--model`. Recommendations here | `None` |
| `--language` | List here. With `auto`, the model attempts to detect the language automatically, but it tends to bias towards English. | `auto` |
| `--target-language` | If set, translate to the given language using NLLB, e.g. `fr`. 200 languages available. To translate to English, prefer `--task translate`, since Whisper can do it directly. | `None` |
| `--task` | Set to `translate` to translate only to English, using Whisper translation. | `transcribe` |
| `--diarization` | Enable speaker identification | `False` |
| `--backend` | Processing backend. You can switch to `faster-whisper` if `simulstreaming` does not work correctly | `simulstreaming` |
| `--no-vac` | Disable Voice Activity Controller | `False` |
| `--no-vad` | Disable Voice Activity Detection | `False` |
| `--warmup-file` | Audio file path for model warmup | `jfk.wav` |
| `--host` | Server host address | `localhost` |
| `--port` | Server port | `8000` |
| `--ssl-certfile` | Path to the SSL certificate file (for HTTPS support) | `None` |
| `--ssl-keyfile` | Path to the SSL private key file (for HTTPS support) | `None` |
| `--forwarded-allow-ips` | IP or IPs allowed to reverse-proxy the whisperlivekit-server. Supported forms: IP addresses (e.g. `127.0.0.1`), IP networks (e.g. `10.100.0.0/16`), or literals (e.g. `/path/to/socket.sock`) | `None` |
| `--pcm-input` | Expect raw PCM (s16le) input and bypass FFmpeg. The frontend will use AudioWorklet instead of MediaRecorder | `False` |
| Translation options | Description | Default |
|---|---|---|
| `--nllb-backend` | `transformers` or `ctranslate2` | `ctranslate2` |
| `--nllb-size` | `600M` or `1.3B` | `600M` |
| Diarization options | Description | Default |
|---|---|---|
| `--diarization-backend` | `diart` or `sortformer` | `sortformer` |
| `--disable-punctuation-split` | Disable punctuation-based splits. See #214 | `False` |
| `--segmentation-model` | Hugging Face model ID for the Diart segmentation model. Available models | `pyannote/segmentation-3.0` |
| `--embedding-model` | Hugging Face model ID for the Diart embedding model. Available models | `speechbrain/spkrec-ecapa-voxceleb` |
| SimulStreaming backend options | Description | Default |
|---|---|---|
| `--disable-fast-encoder` | Disable the Faster Whisper or MLX Whisper encoder backends (if installed). Inference can be slower, but this helps when GPU memory is limited | `False` |
| `--custom-alignment-heads` | Use your own alignment heads, useful when `--model-dir` is used | `None` |
| `--frame-threshold` | AlignAtt frame threshold (lower = faster, higher = more accurate) | `25` |
| `--beams` | Number of beams for beam search (`1` = greedy decoding) | `1` |
| `--decoder` | Force decoder type (`beam` or `greedy`) | `auto` |
| `--audio-max-len` | Maximum audio buffer length (seconds) | `30.0` |
| `--audio-min-len` | Minimum audio length to process (seconds) | `0.0` |
| `--cif-ckpt-path` | Path to a CIF model for word boundary detection | `None` |
| `--never-fire` | Never truncate incomplete words | `False` |
| `--init-prompt` | Initial prompt for the model | `None` |
| `--static-init-prompt` | Static prompt that doesn't scroll | `None` |
| `--max-context-tokens` | Maximum context tokens | `None` |
| `--preload-model-count` | Optional. Number of models to preload in memory to speed up loading (set up to the expected number of concurrent users) | `1` |
| WhisperStreaming backend options | Description | Default |
|---|---|---|
| `--confidence-validation` | Use confidence scores for faster validation | `False` |
| `--buffer_trimming` | Buffer trimming strategy (`sentence` or `segment`) | `segment` |
For diarization using Diart, you need to accept the user conditions here for the pyannote/segmentation model, here for the pyannote/segmentation-3.0 model, and here for the pyannote/embedding model. Then, log in to Hugging Face: `huggingface-cli login`
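You can also authenticate programmatically instead of using the CLI; a sketch using huggingface_hub, which the CLI ships with (the token value is a placeholder):

```python
from huggingface_hub import login

login()  # interactive prompt; or login(token="hf_...") with a real token
```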
🚀 Deployment Guide
To deploy WhisperLiveKit in production:

1. Server Setup: Install a production ASGI server and launch with multiple workers:

   ```bash
   pip install uvicorn gunicorn
   gunicorn -k uvicorn.workers.UvicornWorker -w 4 your_app:app
   ```

2. Frontend: Host your customized version of the html example and ensure the WebSocket connection points to the correct URL.

3. Nginx Configuration (recommended for production):

   ```nginx
   server {
       listen 80;
       server_name your-domain.com;

       location / {
           proxy_pass http://localhost:8000;
           proxy_set_header Upgrade $http_upgrade;
           proxy_set_header Connection "upgrade";
           proxy_set_header Host $host;
       }
   }
   ```

4. HTTPS Support: For secure deployments, use "wss://" instead of "ws://" in the WebSocket URL.
🐋 Docker
Deploy the application easily using Docker with GPU or CPU support.
Prerequisites
- Docker installed on your system
- For GPU support: NVIDIA Docker runtime installed
Quick Start

With GPU acceleration (recommended):

```bash
docker build -t wlk .
docker run --gpus all -p 8000:8000 --name wlk wlk
```

CPU only:

```bash
docker build -f Dockerfile.cpu -t wlk .
docker run -p 8000:8000 --name wlk wlk
```

Advanced Usage

Custom configuration:

```bash
# Example with custom model and language
docker run --gpus all -p 8000:8000 --name wlk wlk --model large-v3 --language fr
```
Memory Requirements
- Large models: Ensure your Docker runtime has sufficient memory allocated
Customization

`--build-arg` options:
- `EXTRAS="whisper-timestamped"` - Add extras to the image's installation (no spaces). Remember to set the necessary container options!
- `HF_PRECACHE_DIR="./.cache/"` - Pre-load a model cache for faster first-time start
- `HF_TKN_FILE="./token"` - Add your Hugging Face Hub access token to download gated models
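For example, to bake the whisper-timestamped extra into the image: `docker build --build-arg EXTRAS="whisper-timestamped" -t wlk .`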
🔮 Use Cases
Capture discussions in real time for meeting transcription; help hearing-impaired users follow conversations through accessibility tools; transcribe podcasts or videos automatically for content creation; transcribe support calls with speaker identification for customer service...