WhisperLiveKit API Reference
This document describes all APIs: the WebSocket streaming API, the OpenAI-compatible REST API, and the CLI.
REST API (OpenAI-compatible)
POST /v1/audio/transcriptions
Drop-in replacement for the OpenAI Audio Transcriptions API. Accepts the same parameters.
curl http://localhost:8000/v1/audio/transcriptions \
-F file=@audio.wav \
-F response_format=json
Parameters (multipart form):
| Parameter | Type | Default | Description |
|---|---|---|---|
| file | file | required | Audio file (any format ffmpeg can decode) |
| model | string | "" | Accepted but ignored (uses the server's backend) |
| language | string | null | ISO 639-1 language code, or null for auto-detection |
| prompt | string | "" | Accepted for compatibility, not yet used |
| response_format | string | "json" | json, verbose_json, text, srt, vtt |
| timestamp_granularities | array | null | Accepted for compatibility |
Response formats:
json (default):
{"text": "Hello world, how are you?"}
verbose_json:
{
"task": "transcribe",
"language": "en",
"duration": 7.16,
"text": "Hello world",
"words": [{"word": "Hello", "start": 0.0, "end": 0.5}, ...],
"segments": [{"id": 0, "start": 0.0, "end": 3.5, "text": "Hello world"}]
}
text: Plain text response.
srt / vtt: Subtitle formats (SubRip / WebVTT).
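To illustrate how the verbose_json segments map onto the subtitle output, here is a minimal converter. It is a hypothetical helper written for this document, not part of WhisperLiveKit:

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments):
    """Render a verbose_json 'segments' list as an SRT document."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}"
        )
    return "\n\n".join(blocks) + "\n"

print(segments_to_srt([{"id": 0, "start": 0.0, "end": 3.5, "text": "Hello world"}]))
```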
GET /v1/models
List the currently loaded model.
curl http://localhost:8000/v1/models
GET /health
Server health check.
curl http://localhost:8000/health
Deepgram-Compatible WebSocket API
WS /v1/listen
Drop-in compatible with Deepgram's Live Transcription WebSocket. Connect using any Deepgram client SDK pointed at your local server.
from deepgram import DeepgramClient, LiveOptions
deepgram = DeepgramClient(api_key="unused", config={"url": "localhost:8000"})
connection = deepgram.listen.websocket.v("1")
connection.start(LiveOptions(model="nova-2", language="en"))
Query Parameters: Same as Deepgram (language, punctuate, interim_results, vad_events, etc.).
Client Messages:
- Binary audio frames
- {"type": "KeepAlive"} — keep connection alive
- {"type": "CloseStream"} — graceful close
- {"type": "Finalize"} — flush pending audio
Server Messages:
- Metadata — sent once at connection start
- Results — transcription results with is_final/speech_final flags
- UtteranceEnd — silence detected after speech
- SpeechStarted — speech begins (requires vad_events=true)
Limitations vs Deepgram:
- No authentication (self-hosted)
- Word timestamps are interpolated from segment boundaries
- Confidence scores are 0.0 (not available)
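A Results message can also be consumed without the Deepgram SDK. The sketch below follows the shape of Deepgram's Live Transcription response (channel → alternatives → transcript); treat the exact payload this server emits as an assumption to verify:

```python
import json

def extract_transcript(raw):
    """Pull (transcript, is_final) out of a Deepgram-style Results message.

    Returns None for non-Results messages (Metadata, UtteranceEnd, ...).
    """
    msg = json.loads(raw)
    if msg.get("type") != "Results":
        return None
    alternatives = msg["channel"]["alternatives"]
    return alternatives[0]["transcript"], bool(msg.get("is_final"))

raw = json.dumps({
    "type": "Results",
    "is_final": True,
    "channel": {"alternatives": [{"transcript": "hello world", "confidence": 0.0}]},
})
print(extract_transcript(raw))  # ('hello world', True)
```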
CLI
wlk / wlk serve
Start the transcription server.
wlk # Start with defaults
wlk --backend voxtral --model base # Specific backend
wlk serve --port 9000 --lan fr # Explicit serve command
wlk listen
Live microphone transcription. Requires sounddevice (pip install sounddevice).
wlk listen # Transcribe from microphone
wlk listen --backend voxtral # Use specific backend
wlk listen --language fr # Force French
wlk listen --diarization # With speaker identification
wlk listen -o transcript.txt # Save to file on exit
Committed lines print as they are finalized. The current buffer (partial transcription) is shown in gray and updates in-place. Press Ctrl+C to stop; remaining audio is flushed before exit.
wlk run
Auto-pull the model if it is not already downloaded, then start the server.
wlk run voxtral # Pull voxtral + start server
wlk run large-v3 # Pull large-v3 + start server
wlk run faster-whisper:base # Specific backend + model
wlk run qwen3:1.7b # Qwen3-ASR
wlk run voxtral --lan fr --port 9000 # Extra server options passed through
wlk transcribe
Transcribe audio files offline (no server needed).
wlk transcribe audio.wav # Plain text output
wlk transcribe --format srt audio.wav # SRT subtitles
wlk transcribe --format json audio.wav # JSON output
wlk transcribe --backend voxtral audio.wav # Specific backend
wlk transcribe --model large-v3 --language fr *.wav # Multiple files
wlk transcribe --output result.srt --format srt audio.wav
wlk bench
Benchmark speed (RTF) and accuracy (WER) on standard test audio.
wlk bench # Benchmark with defaults
wlk bench --backend faster-whisper # Specific backend
wlk bench --model large-v3 # Larger model
wlk bench --json results.json # Export results
Downloads test audio from LibriSpeech on first run. Reports WER (Word Error Rate) and RTF (Real-Time Factor: processing time / audio duration).
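WER is the word-level edit distance between reference and hypothesis, divided by the number of reference words. A minimal sketch of that computation (illustrative only; the bench command's internal scoring may normalize text differently):

```python
def wer(reference, hypothesis):
    """Word Error Rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / len(ref)

print(wer("hello world how are you", "hello word how are you"))  # 0.2
```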
wlk diagnose
Run pipeline diagnostics on an audio file. Feeds audio through the full pipeline while probing internal backend state at regular intervals. Produces a timeline, flags anomalies, and prints health checks.
wlk diagnose audio.wav # Diagnose with default backend
wlk diagnose audio.wav --backend voxtral # Diagnose specific backend
wlk diagnose --speed 0 --probe-interval 1 # Instant feed, probe every 1s
wlk diagnose # Use built-in test sample
Useful for debugging issues such as no output appearing, slow transcription, stuck pipelines, or generate-thread errors.
wlk models
List available backends, installation status, and downloaded models.
wlk models
wlk pull
Download models for offline use.
wlk pull base # Download for best available backend
wlk pull faster-whisper:large-v3 # Specific backend + model
wlk pull voxtral # Voxtral HF model
wlk pull qwen3:1.7b # Qwen3-ASR 1.7B
wlk rm
Delete downloaded models to free disk space.
wlk rm base # Delete base model
wlk rm voxtral # Delete Voxtral model
wlk rm faster-whisper:large-v3 # Delete specific backend model
wlk check
Verify system dependencies (Python, ffmpeg, torch, etc.).
wlk version
Print the installed version.
Python Client (OpenAI SDK)
WhisperLiveKit's REST API is compatible with the OpenAI Python SDK:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
with open("audio.wav", "rb") as f:
result = client.audio.transcriptions.create(
model="whisper-base", # ignored, uses server's backend
file=f,
response_format="verbose_json",
)
print(result.text)
Programmatic Python API
For direct in-process usage without a server:
import asyncio
from whisperlivekit import TranscriptionEngine, AudioProcessor
async def transcribe(audio_path):
engine = TranscriptionEngine(model_size="base", lan="en")
# ... use AudioProcessor for full pipeline control
Or use the TestHarness for simpler usage:
import asyncio
from whisperlivekit import TestHarness
async def main():
async with TestHarness(model_size="base", lan="en") as h:
await h.feed("audio.wav", speed=0)
result = await h.finish()
print(result.text)
asyncio.run(main())
WebSocket Streaming API
This section describes the WebSocket API for clients that want to stream audio and receive real-time transcription results from a WhisperLiveKit server.
Connection
Endpoint
ws://<host>:<port>/asr
Query Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| language | string | (none) | Per-session language override. ISO 639-1 code (e.g. fr, en) or "auto" for automatic detection. When omitted, uses the server-wide language setting. Multiple sessions with different languages work concurrently. |
| mode | string | "full" | Output mode. "full" sends the complete state on every update; "diff" sends incremental diffs after an initial snapshot. |
Example:
ws://localhost:8000/asr?language=fr&mode=diff
Connection Flow
1. Client opens a WebSocket connection to /asr.
2. Server accepts the connection and immediately sends a config message.
3. Client streams binary audio frames to the server.
4. Server sends transcription updates as JSON messages.
5. Client sends empty bytes (b"") to signal end of audio.
6. Server finishes processing the remaining audio and sends a ready_to_stop message.
Server to Client Messages
Config Message
Sent once, immediately after the connection is accepted.
{
"type": "config",
"useAudioWorklet": true,
"mode": "full"
}
| Field | Type | Description |
|---|---|---|
| type | string | Always "config". |
| useAudioWorklet | bool | true when the server expects PCM s16le 16 kHz mono input (started with --pcm-input); false when the server expects encoded audio (decoded server-side via FFmpeg). |
| mode | string | "full" or "diff", echoing the requested mode. |
Transcription Update (full mode)
Sent repeatedly as audio is processed. This message has no type field.
{
"status": "active_transcription",
"lines": [
{
"speaker": 1,
"text": "Hello world, how are you?",
"start": "0:00:00",
"end": "0:00:03"
},
{
"speaker": 2,
"text": "I am fine, thanks.",
"start": "0:00:04",
"end": "0:00:06",
"translation": "Je vais bien, merci.",
"detected_language": "en"
}
],
"buffer_transcription": "And you",
"buffer_diarization": "",
"buffer_translation": "",
"remaining_time_transcription": 1.2,
"remaining_time_diarization": 0.5
}
| Field | Type | Description |
|---|---|---|
| status | string | "active_transcription" during normal operation; "no_audio_detected" when no speech has been detected yet. |
| lines | array | Committed transcription segments. Each update sends the full list of all committed lines (not incremental). |
| buffer_transcription | string | Ephemeral transcription text not yet committed to a line. Displayed in real time but overwritten on every update. |
| buffer_diarization | string | Ephemeral text waiting for speaker attribution. |
| buffer_translation | string | Ephemeral translation text for the current buffer. |
| remaining_time_transcription | float | Seconds of audio waiting to be transcribed (processing lag). |
| remaining_time_diarization | float | Seconds of audio waiting for speaker diarization. |
| error | string | Only present when an error occurred (e.g. FFmpeg failure). |
Line Object
Each element in lines has the following shape:
| Field | Type | Presence | Description |
|---|---|---|---|
| speaker | int | Always | Speaker ID. Normally 1, 2, 3, etc. The special value -2 indicates a silence segment. When diarization is disabled, defaults to 1. |
| text | string | Always | The transcribed text for this segment; null for silence segments. |
| start | string | Always | Start timestamp formatted as H:MM:SS (e.g. "0:00:03"). |
| end | string | Always | End timestamp formatted as H:MM:SS. |
| translation | string | Conditional | Present only when translation is enabled and available for this line. |
| detected_language | string | Conditional | Present only when language detection produced a result for this line (e.g. "en"). |
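Given the line shape above, a client-side renderer might look like the following. The formatting choices are illustrative, not prescribed by the protocol:

```python
def render_line(line):
    """Render one line object for display, handling silence and optional fields."""
    if line["speaker"] == -2:
        # Silence segment: text is null, so show only the time range.
        return f"[{line['start']} - {line['end']}] (silence)"
    out = f"[{line['start']} - {line['end']}] Speaker {line['speaker']}: {line['text']}"
    if "translation" in line:
        out += f"  => {line['translation']}"
    return out

print(render_line({"speaker": 1, "text": "Hello", "start": "0:00:00", "end": "0:00:03"}))
print(render_line({"speaker": -2, "text": None, "start": "0:00:10", "end": "0:00:12"}))
```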
Snapshot (diff mode)
When mode=diff, the first transcription message is always a snapshot containing the full state. It has the same fields as a full-mode transcription update, plus metadata fields.
{
"type": "snapshot",
"seq": 1,
"status": "active_transcription",
"lines": [ ... ],
"buffer_transcription": "",
"buffer_diarization": "",
"buffer_translation": "",
"remaining_time_transcription": 0.0,
"remaining_time_diarization": 0.0
}
| Field | Type | Description |
|---|---|---|
| type | string | Always "snapshot". |
| seq | int | Monotonically increasing sequence number, starting at 1. |
| (remaining fields) | | Same as a full-mode transcription update. |
Diff (diff mode)
All messages after the initial snapshot are diffs.
{
"type": "diff",
"seq": 4,
"status": "active_transcription",
"n_lines": 5,
"lines_pruned": 1,
"new_lines": [
{
"speaker": 1,
"text": "This is a new line.",
"start": "0:00:12",
"end": "0:00:14"
}
],
"buffer_transcription": "partial text",
"buffer_diarization": "",
"buffer_translation": "",
"remaining_time_transcription": 0.3,
"remaining_time_diarization": 0.1
}
| Field | Type | Presence | Description |
|---|---|---|---|
| type | string | Always | Always "diff". |
| seq | int | Always | Sequence number. |
| status | string | Always | Same as full mode. |
| n_lines | int | Always | Total number of lines the client should have after applying this diff. Use this to verify sync. |
| lines_pruned | int | Conditional | Number of lines to remove from the front of the client's line list. Only present when > 0. |
| new_lines | array | Conditional | Lines to append to the end of the client's line list. Only present when there are new lines. |
| buffer_transcription | string | Always | Replaces the previous buffer value. |
| buffer_diarization | string | Always | Replaces the previous buffer value. |
| buffer_translation | string | Always | Replaces the previous buffer value. |
| remaining_time_transcription | float | Always | Replaces the previous value. |
| remaining_time_diarization | float | Always | Replaces the previous value. |
| error | string | Conditional | Only present on error. |
Ready to Stop
Sent after all audio has been processed (i.e., after the client sent the end-of-audio signal and the server finished processing the remaining audio).
{
"type": "ready_to_stop"
}
Client to Server Messages
Audio Frames
Send binary WebSocket frames containing audio data.
When useAudioWorklet is true (server started with --pcm-input):
- PCM signed 16-bit little-endian, 16 kHz, mono (s16le).
- Any chunk size works. A typical chunk is 0.5 seconds (16,000 bytes).
When useAudioWorklet is false:
- Raw encoded audio bytes (any format FFmpeg can decode: WAV, MP3, FLAC, OGG, etc.).
- The server pipes these bytes through FFmpeg for decoding.
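The 16,000-byte figure for PCM input comes from 0.5 s x 16,000 samples/s x 2 bytes per s16le sample. A sketch that packs half a second of silence into one frame:

```python
import struct

SAMPLE_RATE = 16_000   # Hz, as required by the server in PCM mode
CHUNK_SECONDS = 0.5
n_samples = int(SAMPLE_RATE * CHUNK_SECONDS)

# s16le: one signed 16-bit little-endian value per mono sample.
frame = struct.pack(f"<{n_samples}h", *([0] * n_samples))
print(len(frame))  # 16000 bytes -> one typical 0.5 s chunk
```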
End-of-Audio Signal
Send an empty binary frame (b"") to tell the server that no more audio will follow. The server will finish processing any remaining audio and then send a ready_to_stop message.
Diff Protocol: Client Reconstruction
Clients using mode=diff must maintain a local list of lines and apply diffs incrementally.
Algorithm
def reconstruct_state(msg, lines):
"""Apply a snapshot or diff message to a local lines list.
Args:
msg: The parsed JSON message from the server.
lines: The client's mutable list of line objects.
Returns:
A full-state dict with all fields.
"""
if msg["type"] == "snapshot":
lines.clear()
lines.extend(msg.get("lines", []))
return msg
# Apply diff
n_pruned = msg.get("lines_pruned", 0)
if n_pruned > 0:
del lines[:n_pruned]
new_lines = msg.get("new_lines", [])
lines.extend(new_lines)
# Volatile fields are replaced wholesale
return {
"status": msg.get("status", ""),
"lines": lines[:],
"buffer_transcription": msg.get("buffer_transcription", ""),
"buffer_diarization": msg.get("buffer_diarization", ""),
"buffer_translation": msg.get("buffer_translation", ""),
"remaining_time_transcription": msg.get("remaining_time_transcription", 0),
"remaining_time_diarization": msg.get("remaining_time_diarization", 0),
}
Verification
After applying a diff, check that len(lines) == msg["n_lines"]. A mismatch indicates the client fell out of sync and should reconnect.
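A worked run of the snapshot-then-diff sequence, with the apply step inlined so the example is self-contained (the message contents are illustrative):

```python
def apply(msg, lines):
    """Apply one snapshot or diff message to a mutable lines list, in place."""
    if msg["type"] == "snapshot":
        lines[:] = msg.get("lines", [])
    else:
        del lines[: msg.get("lines_pruned", 0)]   # prune from the front
        lines.extend(msg.get("new_lines", []))    # append new lines
    return lines

lines = []
apply({"type": "snapshot", "seq": 1, "lines": [{"speaker": 1, "text": "a"}]}, lines)
apply({"type": "diff", "seq": 2, "n_lines": 2,
       "new_lines": [{"speaker": 1, "text": "b"}]}, lines)
msg = {"type": "diff", "seq": 3, "n_lines": 1, "lines_pruned": 1}
apply(msg, lines)

assert len(lines) == msg["n_lines"]  # in-sync check; reconnect on mismatch
print([line["text"] for line in lines])  # ['b']
```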
Silence Representation
Silence segments are represented as lines with speaker set to -2 and text set to null:
{
"speaker": -2,
"text": null,
"start": "0:00:10",
"end": "0:00:12"
}
Silence segments are only generated for pauses longer than 5 seconds.
Per-Session Language
The language query parameter creates an isolated language context for the session using SessionASRProxy. The proxy temporarily overrides the shared ASR backend's language during transcription calls, protected by a lock. This means:
- Each WebSocket session can transcribe in a different language.
- Sessions are thread-safe and do not interfere with each other.
- Pass "auto" to use automatic language detection for the session, regardless of the server-wide setting.