# WhisperLiveKit API Reference This document describes all APIs: the WebSocket streaming API, the OpenAI-compatible REST API, and the CLI. --- ## REST API (OpenAI-compatible) ### POST /v1/audio/transcriptions Drop-in replacement for the OpenAI Audio Transcriptions API. Accepts the same parameters. ```bash curl http://localhost:8000/v1/audio/transcriptions \ -F file=@audio.wav \ -F response_format=json ``` **Parameters (multipart form):** | Parameter | Type | Default | Description | |--------------------------|----------|---------|-------------| | `file` | file | required | Audio file (any format ffmpeg can decode) | | `model` | string | `""` | Accepted but ignored (uses server's backend) | | `language` | string | `null` | ISO 639-1 language code or null for auto-detection | | `prompt` | string | `""` | Accepted for compatibility, not yet used | | `response_format` | string | `"json"` | `json`, `verbose_json`, `text`, `srt`, `vtt` | | `timestamp_granularities`| array | `null` | Accepted for compatibility | **Response formats:** `json` (default): ```json {"text": "Hello world, how are you?"} ``` `verbose_json`: ```json { "task": "transcribe", "language": "en", "duration": 7.16, "text": "Hello world", "words": [{"word": "Hello", "start": 0.0, "end": 0.5}, ...], "segments": [{"id": 0, "start": 0.0, "end": 3.5, "text": "Hello world"}] } ``` `text`: Plain text response. `srt` / `vtt`: Subtitle format. ### GET /v1/models List the currently loaded model. ```bash curl http://localhost:8000/v1/models ``` ### GET /health Server health check. ```bash curl http://localhost:8000/health ``` --- ## Deepgram-Compatible WebSocket API ### WS /v1/listen Drop-in compatible with Deepgram's Live Transcription WebSocket. Connect using any Deepgram client SDK pointed at your local server. ```python from deepgram import DeepgramClient, LiveOptions deepgram = DeepgramClient(api_key="unused", config={"url": "localhost:8000"}) connection = deepgram.listen.websocket.v("1") connection.start(LiveOptions(model="nova-2", language="en")) ``` **Query Parameters:** Same as Deepgram (`language`, `punctuate`, `interim_results`, `vad_events`, etc.). **Client Messages:** - Binary audio frames - `{"type": "KeepAlive"}` — keep connection alive - `{"type": "CloseStream"}` — graceful close - `{"type": "Finalize"}` — flush pending audio **Server Messages:** - `Metadata` — sent once at connection start - `Results` — transcription results with `is_final`/`speech_final` flags - `UtteranceEnd` — silence detected after speech - `SpeechStarted` — speech begins (requires `vad_events=true`) **Limitations vs Deepgram:** - No authentication (self-hosted) - Word timestamps are interpolated from segment boundaries - Confidence scores are 0.0 (not available) --- ## CLI ### `wlk` / `wlk serve` Start the transcription server. ```bash wlk # Start with defaults wlk --backend voxtral --model base # Specific backend wlk serve --port 9000 --lan fr # Explicit serve command ``` ### `wlk listen` Live microphone transcription. Requires `sounddevice` (`pip install sounddevice`). ```bash wlk listen # Transcribe from microphone wlk listen --backend voxtral # Use specific backend wlk listen --language fr # Force French wlk listen --diarization # With speaker identification wlk listen -o transcript.txt # Save to file on exit ``` Committed lines print as they are finalized. The current buffer (partial transcription) is shown in gray and updates in-place. Press Ctrl+C to stop; remaining audio is flushed before exit. ### `wlk run` Auto-pull model if not downloaded, then start the server. ```bash wlk run voxtral # Pull voxtral + start server wlk run large-v3 # Pull large-v3 + start server wlk run faster-whisper:base # Specific backend + model wlk run qwen3:1.7b # Qwen3-ASR wlk run voxtral --lan fr --port 9000 # Extra server options passed through ``` ### `wlk transcribe` Transcribe audio files offline (no server needed). ```bash wlk transcribe audio.wav # Plain text output wlk transcribe --format srt audio.wav # SRT subtitles wlk transcribe --format json audio.wav # JSON output wlk transcribe --backend voxtral audio.wav # Specific backend wlk transcribe --model large-v3 --language fr *.wav # Multiple files wlk transcribe --output result.srt --format srt audio.wav ``` ### `wlk bench` Benchmark speed (RTF) and accuracy (WER) on standard test audio. ```bash wlk bench # Benchmark with defaults wlk bench --backend faster-whisper # Specific backend wlk bench --model large-v3 # Larger model wlk bench --json results.json # Export results ``` Downloads test audio from LibriSpeech on first run. Reports WER (Word Error Rate) and RTF (Real-Time Factor: processing time / audio duration). ### `wlk diagnose` Run pipeline diagnostics on an audio file. Feeds audio through the full pipeline while probing internal backend state at regular intervals. Produces a timeline, flags anomalies, and prints health checks. ```bash wlk diagnose audio.wav # Diagnose with default backend wlk diagnose audio.wav --backend voxtral # Diagnose specific backend wlk diagnose --speed 0 --probe-interval 1 # Instant feed, probe every 1s wlk diagnose # Use built-in test sample ``` Useful for debugging issues like: no output appearing, slow transcription, stuck pipelines, or generate thread errors. ### `wlk models` List available backends, installation status, and downloaded models. ```bash wlk models ``` ### `wlk pull` Download models for offline use. ```bash wlk pull base # Download for best available backend wlk pull faster-whisper:large-v3 # Specific backend + model wlk pull voxtral # Voxtral HF model wlk pull qwen3:1.7b # Qwen3-ASR 1.7B ``` ### `wlk rm` Delete downloaded models to free disk space. ```bash wlk rm base # Delete base model wlk rm voxtral # Delete Voxtral model wlk rm faster-whisper:large-v3 # Delete specific backend model ``` ### `wlk check` Verify system dependencies (Python, ffmpeg, torch, etc.). ### `wlk version` Print the installed version. ### Python Client (OpenAI SDK) WhisperLiveKit's REST API is compatible with the OpenAI Python SDK: ```python from openai import OpenAI client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused") with open("audio.wav", "rb") as f: result = client.audio.transcriptions.create( model="whisper-base", # ignored, uses server's backend file=f, response_format="verbose_json", ) print(result.text) ``` ### Programmatic Python API For direct in-process usage without a server: ```python import asyncio from whisperlivekit import TranscriptionEngine, AudioProcessor async def transcribe(audio_path): engine = TranscriptionEngine(model_size="base", lan="en") # ... use AudioProcessor for full pipeline control ``` Or use the TestHarness for simpler usage: ```python import asyncio from whisperlivekit import TestHarness async def main(): async with TestHarness(model_size="base", lan="en") as h: await h.feed("audio.wav", speed=0) result = await h.finish() print(result.text) asyncio.run(main()) ``` --- ## WebSocket Streaming API This section describes the WebSocket API for clients that want to stream audio and receive real-time transcription results from a WhisperLiveKit server. --- ## Connection ### Endpoint ``` ws://:/asr ``` ### Query Parameters | Parameter | Type | Default | Description | |------------|--------|----------|-------------| | `language` | string | _(none)_ | Per-session language override. ISO 639-1 code (e.g. `fr`, `en`) or `"auto"` for automatic detection. When omitted, uses the server-wide language setting. Multiple sessions with different languages work concurrently. | | `mode` | string | `"full"` | Output mode. `"full"` sends complete state on every update. `"diff"` sends incremental diffs after an initial snapshot. | Example: ``` ws://localhost:8000/asr?language=fr&mode=diff ``` ### Connection Flow 1. Client opens a WebSocket connection to `/asr`. 2. Server accepts the connection and immediately sends a **config message**. 3. Client streams binary audio frames to the server. 4. Server sends transcription updates as JSON messages. 5. Client sends empty bytes (`b""`) to signal end of audio. 6. Server finishes processing remaining audio and sends a **ready_to_stop** message. --- ## Server to Client Messages ### Config Message Sent once, immediately after the connection is accepted. ```json { "type": "config", "useAudioWorklet": true, "mode": "full" } ``` | Field | Type | Description | |-------------------|--------|-------------| | `type` | string | Always `"config"`. | | `useAudioWorklet` | bool | `true` when the server expects PCM s16le 16kHz mono input (started with `--pcm-input`). `false` when the server expects encoded audio (decoded server-side via FFmpeg). | | `mode` | string | `"full"` or `"diff"`, echoing the requested mode. | ### Transcription Update (full mode) Sent repeatedly as audio is processed. This message has **no `type` field**. ```json { "status": "active_transcription", "lines": [ { "speaker": 1, "text": "Hello world, how are you?", "start": "0:00:00", "end": "0:00:03" }, { "speaker": 2, "text": "I am fine, thanks.", "start": "0:00:04", "end": "0:00:06", "translation": "Je vais bien, merci.", "detected_language": "en" } ], "buffer_transcription": "And you", "buffer_diarization": "", "buffer_translation": "", "remaining_time_transcription": 1.2, "remaining_time_diarization": 0.5 } ``` | Field | Type | Description | |--------------------------------|--------|-------------| | `status` | string | `"active_transcription"` during normal operation. `"no_audio_detected"` when no speech has been detected yet. | | `lines` | array | Committed transcription segments. Each update sends the **full list** of all committed lines (not incremental). | | `buffer_transcription` | string | Ephemeral transcription text not yet committed to a line. Displayed in real time but overwritten on every update. | | `buffer_diarization` | string | Ephemeral text waiting for speaker attribution. | | `buffer_translation` | string | Ephemeral translation text for the current buffer. | | `remaining_time_transcription` | float | Seconds of audio waiting to be transcribed (processing lag). | | `remaining_time_diarization` | float | Seconds of audio waiting for speaker diarization. | | `error` | string | Only present when an error occurred (e.g. FFmpeg failure). | #### Line Object Each element in `lines` has the following shape: | Field | Type | Presence | Description | |---------------------|--------|-------------|-------------| | `speaker` | int | Always | Speaker ID. Normally `1`, `2`, `3`, etc. The special value `-2` indicates a silence segment. When diarization is disabled, defaults to `1`. | | `text` | string | Always | The transcribed text for this segment. `null` for silence segments. | | `start` | string | Always | Start timestamp formatted as `H:MM:SS` (e.g. `"0:00:03"`). | | `end` | string | Always | End timestamp formatted as `H:MM:SS`. | | `translation` | string | Conditional | Present only when translation is enabled and available for this line. | | `detected_language` | string | Conditional | Present only when language detection produced a result for this line (e.g. `"en"`). | ### Snapshot (diff mode) When `mode=diff`, the first transcription message is always a snapshot containing the full state. It has the same fields as a full-mode transcription update, plus metadata fields. ```json { "type": "snapshot", "seq": 1, "status": "active_transcription", "lines": [ ... ], "buffer_transcription": "", "buffer_diarization": "", "buffer_translation": "", "remaining_time_transcription": 0.0, "remaining_time_diarization": 0.0 } ``` | Field | Type | Description | |--------|--------|-------------| | `type` | string | `"snapshot"`. | | `seq` | int | Monotonically increasing sequence number, starting at 1. | | _(remaining fields)_ | | Same as a full-mode transcription update. | ### Diff (diff mode) All messages after the initial snapshot are diffs. ```json { "type": "diff", "seq": 4, "status": "active_transcription", "n_lines": 5, "lines_pruned": 1, "new_lines": [ { "speaker": 1, "text": "This is a new line.", "start": "0:00:12", "end": "0:00:14" } ], "buffer_transcription": "partial text", "buffer_diarization": "", "buffer_translation": "", "remaining_time_transcription": 0.3, "remaining_time_diarization": 0.1 } ``` | Field | Type | Presence | Description | |--------------------------------|--------|-------------|-------------| | `type` | string | Always | `"diff"`. | | `seq` | int | Always | Sequence number. | | `status` | string | Always | Same as full mode. | | `n_lines` | int | Always | Total number of lines the client should have after applying this diff. Use this to verify sync. | | `lines_pruned` | int | Conditional | Number of lines to remove from the **front** of the client's line list. Only present when > 0. | | `new_lines` | array | Conditional | Lines to append to the **end** of the client's line list. Only present when there are new lines. | | `buffer_transcription` | string | Always | Replaces the previous buffer value. | | `buffer_diarization` | string | Always | Replaces the previous buffer value. | | `buffer_translation` | string | Always | Replaces the previous buffer value. | | `remaining_time_transcription` | float | Always | Replaces the previous value. | | `remaining_time_diarization` | float | Always | Replaces the previous value. | | `error` | string | Conditional | Only present on error. | ### Ready to Stop Sent after all audio has been processed (i.e., after the client sent the end-of-audio signal and the server finished processing the remaining audio). ```json { "type": "ready_to_stop" } ``` --- ## Client to Server Messages ### Audio Frames Send binary WebSocket frames containing audio data. **When `useAudioWorklet` is `true` (server started with `--pcm-input`):** - PCM signed 16-bit little-endian, 16 kHz, mono (`s16le`). - Any chunk size works. A typical chunk is 0.5 seconds (16,000 bytes). **When `useAudioWorklet` is `false`:** - Raw encoded audio bytes (any format FFmpeg can decode: WAV, MP3, FLAC, OGG, etc.). - The server pipes these bytes through FFmpeg for decoding. ### End-of-Audio Signal Send an empty binary frame (`b""`) to tell the server that no more audio will follow. The server will finish processing any remaining audio and then send a `ready_to_stop` message. --- ## Diff Protocol: Client Reconstruction Clients using `mode=diff` must maintain a local list of lines and apply diffs incrementally. ### Algorithm ```python def reconstruct_state(msg, lines): """Apply a snapshot or diff message to a local lines list. Args: msg: The parsed JSON message from the server. lines: The client's mutable list of line objects. Returns: A full-state dict with all fields. """ if msg["type"] == "snapshot": lines.clear() lines.extend(msg.get("lines", [])) return msg # Apply diff n_pruned = msg.get("lines_pruned", 0) if n_pruned > 0: del lines[:n_pruned] new_lines = msg.get("new_lines", []) lines.extend(new_lines) # Volatile fields are replaced wholesale return { "status": msg.get("status", ""), "lines": lines[:], "buffer_transcription": msg.get("buffer_transcription", ""), "buffer_diarization": msg.get("buffer_diarization", ""), "buffer_translation": msg.get("buffer_translation", ""), "remaining_time_transcription": msg.get("remaining_time_transcription", 0), "remaining_time_diarization": msg.get("remaining_time_diarization", 0), } ``` ### Verification After applying a diff, check that `len(lines) == msg["n_lines"]`. A mismatch indicates the client fell out of sync and should reconnect. --- ## Silence Representation Silence segments are represented as lines with `speaker` set to `-2` and `text` set to `null`: ```json { "speaker": -2, "text": null, "start": "0:00:10", "end": "0:00:12" } ``` Silence segments are only generated for pauses longer than 5 seconds. --- ## Per-Session Language The `language` query parameter creates an isolated language context for the session using `SessionASRProxy`. The proxy temporarily overrides the shared ASR backend's language during transcription calls, protected by a lock. This means: - Each WebSocket session can transcribe in a different language. - Sessions are thread-safe and do not interfere with each other. - Pass `"auto"` to use automatic language detection for the session regardless of the server-wide setting.