WhisperLiveKit/docs/API.md
2026-03-08 15:14:00 +01:00

WhisperLiveKit API Reference

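This document describes every WhisperLiveKit interface: the OpenAI-compatible REST API, the Deepgram-compatible WebSocket API, the CLI, the Python client and programmatic APIs, and the native WebSocket streaming API.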


REST API (OpenAI-compatible)

POST /v1/audio/transcriptions

Drop-in replacement for the OpenAI Audio Transcriptions API. It accepts the same parameters, although some are accepted only for compatibility and are ignored (see the table below).

curl http://localhost:8000/v1/audio/transcriptions \
  -F file=@audio.wav \
  -F response_format=json

Parameters (multipart form):

| Parameter | Type | Default | Description |
|---|---|---|---|
| file | file | required | Audio file (any format ffmpeg can decode) |
| model | string | "" | Accepted but ignored (uses server's backend) |
| language | string | null | ISO 639-1 language code or null for auto-detection |
| prompt | string | "" | Accepted for compatibility, not yet used |
| response_format | string | "json" | json, verbose_json, text, srt, vtt |
| timestamp_granularities | array | null | Accepted for compatibility |

Response formats:

json (default):

{"text": "Hello world, how are you?"}

verbose_json:

{
  "task": "transcribe",
  "language": "en",
  "duration": 7.16,
  "text": "Hello world",
  "words": [{"word": "Hello", "start": 0.0, "end": 0.5}, ...],
  "segments": [{"id": 0, "start": 0.0, "end": 3.5, "text": "Hello world"}]
}

text: Plain text response.

srt / vtt: Subtitle format.
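
As a rough illustration of consuming the verbose_json payload above, the sketch below pulls word timings out of a parsed response. The helper name word_timings is hypothetical, and the sample payload only mirrors the field names shown in this section, with made-up values.

```python
import json

def word_timings(payload: dict) -> list[tuple[str, float, float]]:
    """Return (word, start, end) tuples from a parsed verbose_json response."""
    return [
        (w["word"], float(w["start"]), float(w["end"]))
        for w in payload.get("words", [])
    ]

# Sample shaped like the verbose_json example; the values are illustrative.
sample = json.loads("""
{
  "task": "transcribe",
  "language": "en",
  "duration": 7.16,
  "text": "Hello world",
  "words": [
    {"word": "Hello", "start": 0.0, "end": 0.5},
    {"word": "world", "start": 0.5, "end": 1.0}
  ],
  "segments": [{"id": 0, "start": 0.0, "end": 3.5, "text": "Hello world"}]
}
""")

timings = word_timings(sample)
```

Keep in mind that word timestamps may be approximate, depending on the backend.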

GET /v1/models

List the currently loaded model.

curl http://localhost:8000/v1/models

GET /health

Server health check.

curl http://localhost:8000/health

Deepgram-Compatible WebSocket API

WS /v1/listen

Drop-in compatible with Deepgram's Live Transcription WebSocket. Connect using any Deepgram client SDK pointed at your local server.

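# Assumes the Deepgram Python SDK v3, where client options are passed
# as a DeepgramClientOptions object rather than a plain dict.
from deepgram import DeepgramClient, DeepgramClientOptions, LiveOptions

config = DeepgramClientOptions(url="localhost:8000")
deepgram = DeepgramClient("unused", config)  # the API key is not checked
connection = deepgram.listen.websocket.v("1")
connection.start(LiveOptions(model="nova-2", language="en"))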

Query Parameters: Same as Deepgram (language, punctuate, interim_results, vad_events, etc.).

Client Messages:

  • Binary audio frames
  • {"type": "KeepAlive"} — keep connection alive
  • {"type": "CloseStream"} — graceful close
  • {"type": "Finalize"} — flush pending audio
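
For clients that do not use a Deepgram SDK, the control messages listed above are small JSON objects. The helper below is a minimal sketch for building them; the name control_message is hypothetical, and sending the result as a WebSocket text frame is an assumption based on how Deepgram clients usually behave.

```python
import json

# The three control message types listed above.
CONTROL_TYPES = {"KeepAlive", "CloseStream", "Finalize"}

def control_message(kind: str) -> str:
    """Serialize a control message, rejecting unknown types early."""
    if kind not in CONTROL_TYPES:
        raise ValueError(f"unknown control message type: {kind!r}")
    return json.dumps({"type": kind})

keepalive = control_message("KeepAlive")
```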

Server Messages:

  • Metadata — sent once at connection start
  • Results — transcription results with is_final/speech_final flags
  • UtteranceEnd — silence detected after speech
  • SpeechStarted — speech begins (requires vad_events=true)
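
A client without an SDK can dispatch on the type field of each server message. The sketch below assumes the Results payload follows Deepgram's usual layout (channel.alternatives[0].transcript); that path is an assumption taken from Deepgram's schema, not something this document specifies, and handle_server_message is a hypothetical name.

```python
import json
from typing import Optional

def handle_server_message(raw: str) -> Optional[str]:
    """Turn one server message into a short display string, or None if unknown."""
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "Results":
        # Assumed Deepgram layout: channel -> alternatives -> transcript.
        alternatives = msg.get("channel", {}).get("alternatives", [])
        text = alternatives[0].get("transcript", "") if alternatives else ""
        tag = "final" if msg.get("is_final") else "interim"
        return f"[{tag}] {text}"
    if kind in ("Metadata", "UtteranceEnd", "SpeechStarted"):
        return f"<{kind}>"
    return None
```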

Limitations vs Deepgram:

  • No authentication (self-hosted)
  • Word timestamps are interpolated from segment boundaries
  • Confidence scores are 0.0 (not available)

CLI

wlk / wlk serve

Start the transcription server.

wlk                                    # Start with defaults
wlk --backend voxtral --model base     # Specific backend
wlk serve --port 9000 --lan fr         # Explicit serve command

wlk listen

Live microphone transcription. Requires sounddevice (pip install sounddevice).

wlk listen                             # Transcribe from microphone
wlk listen --backend voxtral           # Use specific backend
wlk listen --language fr               # Force French
wlk listen --diarization               # With speaker identification
wlk listen -o transcript.txt           # Save to file on exit

Committed lines print as they are finalized. The current buffer (partial transcription) is shown in gray and updates in place. Press Ctrl+C to stop; remaining audio is flushed before exit.

wlk run

Pull the model if it is not already downloaded, then start the server.

wlk run voxtral                        # Pull voxtral + start server
wlk run large-v3                       # Pull large-v3 + start server
wlk run faster-whisper:base            # Specific backend + model
wlk run qwen3:1.7b                     # Qwen3-ASR
wlk run voxtral --lan fr --port 9000   # Extra server options passed through

wlk transcribe

Transcribe audio files offline (no server needed).

wlk transcribe audio.wav                                   # Plain text output
wlk transcribe --format srt audio.wav                      # SRT subtitles
wlk transcribe --format json audio.wav                     # JSON output
wlk transcribe --backend voxtral audio.wav                 # Specific backend
wlk transcribe --model large-v3 --language fr *.wav        # Multiple files
wlk transcribe --output result.srt --format srt audio.wav  # Write output to a file

wlk bench

Benchmark speed (RTF) and accuracy (WER) on standard test audio.

wlk bench                              # Benchmark with defaults
wlk bench --backend faster-whisper     # Specific backend
wlk bench --model large-v3             # Larger model
wlk bench --json results.json          # Export results

Downloads test audio from LibriSpeech on first run. Reports WER (Word Error Rate) and RTF (Real-Time Factor: processing time / audio duration).
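
To make the two metrics concrete, here is a minimal sketch of how they are conventionally computed. RTF follows the definition given above. WER uses the standard word-level edit distance divided by the number of reference words. This is not the code wlk bench runs, and the function names are hypothetical.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

rtf = real_time_factor(2.0, 8.0)
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Benchmarks commonly normalize case and punctuation before scoring. Whether wlk bench does so is not stated here, so treat the numbers from this sketch as indicative only.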

wlk diagnose

Run pipeline diagnostics on an audio file. Feeds audio through the full pipeline while probing internal backend state at regular intervals. Produces a timeline, flags anomalies, and prints health checks.

wlk diagnose audio.wav                       # Diagnose with default backend
wlk diagnose audio.wav --backend voxtral     # Diagnose specific backend
wlk diagnose --speed 0 --probe-interval 1    # Instant feed, probe every 1s
wlk diagnose                                 # Use built-in test sample

Useful for debugging issues such as missing output, slow transcription, stuck pipelines, or errors in the generate thread.

wlk models

List available backends, installation status, and downloaded models.

wlk models

wlk pull

Download models for offline use.

wlk pull base                      # Download for best available backend
wlk pull faster-whisper:large-v3   # Specific backend + model
wlk pull voxtral                   # Voxtral HF model
wlk pull qwen3:1.7b                # Qwen3-ASR 1.7B

wlk rm

Delete downloaded models to free disk space.

wlk rm base                        # Delete base model
wlk rm voxtral                     # Delete Voxtral model
wlk rm faster-whisper:large-v3     # Delete specific backend model

wlk check

Verify system dependencies (Python, ffmpeg, torch, etc.).

wlk version

Print the installed version.

Python Client (OpenAI SDK)

WhisperLiveKit's REST API is compatible with the OpenAI Python SDK:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

with open("audio.wav", "rb") as f:
    result = client.audio.transcriptions.create(
        model="whisper-base",  # ignored, uses server's backend
        file=f,
        response_format="verbose_json",
    )
print(result.text)

Programmatic Python API

For direct in-process usage without a server:

import asyncio
from whisperlivekit import TranscriptionEngine, AudioProcessor

async def transcribe(audio_path):
    engine = TranscriptionEngine(model_size="base", lan="en")
    # ... use AudioProcessor for full pipeline control

Or use the TestHarness for simpler usage:

import asyncio
from whisperlivekit import TestHarness

async def main():
    async with TestHarness(model_size="base", lan="en") as h:
        await h.feed("audio.wav", speed=0)
        result = await h.finish()
        print(result.text)

asyncio.run(main())

WebSocket Streaming API

This section describes the WebSocket API for clients that want to stream audio and receive real-time transcription results from a WhisperLiveKit server.


Connection

Endpoint

ws://<host>:<port>/asr

Query Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| language | string | (none) | Per-session language override. ISO 639-1 code (e.g. fr, en) or "auto" for automatic detection. When omitted, uses the server-wide language setting. Multiple sessions with different languages work concurrently. |
| mode | string | "full" | Output mode. "full" sends complete state on every update. "diff" sends incremental diffs after an initial snapshot. |

Example:

ws://localhost:8000/asr?language=fr&mode=diff
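
A small helper can assemble this URL from the two query parameters. This is a sketch using only the standard library; build_asr_url is a hypothetical name.

```python
from typing import Optional
from urllib.parse import urlencode

def build_asr_url(host: str, port: int,
                  language: Optional[str] = None,
                  mode: Optional[str] = None) -> str:
    """Build the /asr WebSocket URL, omitting parameters left as None."""
    if mode is not None and mode not in ("full", "diff"):
        raise ValueError('mode must be "full" or "diff"')
    params = {}
    if language is not None:
        params["language"] = language
    if mode is not None:
        params["mode"] = mode
    base = f"ws://{host}:{port}/asr"
    query = urlencode(params)
    return f"{base}?{query}" if query else base

url = build_asr_url("localhost", 8000, language="fr", mode="diff")
```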

Connection Flow

  1. Client opens a WebSocket connection to /asr.
  2. Server accepts the connection and immediately sends a config message.
  3. Client streams binary audio frames to the server.
  4. Server sends transcription updates as JSON messages.
  5. Client sends empty bytes (b"") to signal end of audio.
  6. Server finishes processing remaining audio and sends a ready_to_stop message.
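
The flow above can be sketched independently of any particular WebSocket library. In the example below, send and recv are hypothetical coroutines standing in for the transport (for instance thin wrappers around a WebSocket client), and run_session is not part of WhisperLiveKit. For brevity it sends all audio before reading updates. A real client would normally read concurrently so that updates are displayed as they arrive.

```python
import asyncio
import json

async def run_session(send, recv, chunks):
    """Walk the connection flow once: config, audio, end signal, drain updates."""
    # Step 2: the first server message is the config message.
    config = json.loads(await recv())
    if config.get("type") != "config":
        raise RuntimeError("expected a config message first")

    # Step 3: stream binary audio frames.
    for chunk in chunks:
        await send(chunk)

    # Step 5: an empty binary frame signals end of audio.
    await send(b"")

    # Steps 4 and 6: collect updates until the server says it is done.
    updates = []
    while True:
        msg = json.loads(await recv())
        if msg.get("type") == "ready_to_stop":
            return config, updates
        updates.append(msg)

async def demo():
    """Drive run_session with an in-memory fake transport."""
    inbox, sent = asyncio.Queue(), []
    for msg in ({"type": "config", "useAudioWorklet": True, "mode": "full"},
                {"status": "active_transcription", "lines": []},
                {"type": "ready_to_stop"}):
        inbox.put_nowait(json.dumps(msg))

    async def send(data):
        sent.append(data)

    config, updates = await run_session(send, inbox.get, [b"\x00\x00" * 8000])
    return config, updates, sent

config, updates, sent = asyncio.run(demo())
```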

Server to Client Messages

Config Message

Sent once, immediately after the connection is accepted.

{
  "type": "config",
  "useAudioWorklet": true,
  "mode": "full"
}

| Field | Type | Description |
|---|---|---|
| type | string | Always "config". |
| useAudioWorklet | bool | true when the server expects PCM s16le 16kHz mono input (started with --pcm-input). false when the server expects encoded audio (decoded server-side via FFmpeg). |
| mode | string | "full" or "diff", echoing the requested mode. |

Transcription Update (full mode)

Sent repeatedly as audio is processed. This message has no type field.

{
  "status": "active_transcription",
  "lines": [
    {
      "speaker": 1,
      "text": "Hello world, how are you?",
      "start": "0:00:00",
      "end": "0:00:03"
    },
    {
      "speaker": 2,
      "text": "I am fine, thanks.",
      "start": "0:00:04",
      "end": "0:00:06",
      "translation": "Je vais bien, merci.",
      "detected_language": "en"
    }
  ],
  "buffer_transcription": "And you",
  "buffer_diarization": "",
  "buffer_translation": "",
  "remaining_time_transcription": 1.2,
  "remaining_time_diarization": 0.5
}

| Field | Type | Description |
|---|---|---|
| status | string | "active_transcription" during normal operation. "no_audio_detected" when no speech has been detected yet. |
| lines | array | Committed transcription segments. Each update sends the full list of all committed lines (not incremental). |
| buffer_transcription | string | Ephemeral transcription text not yet committed to a line. Displayed in real time but overwritten on every update. |
| buffer_diarization | string | Ephemeral text waiting for speaker attribution. |
| buffer_translation | string | Ephemeral translation text for the current buffer. |
| remaining_time_transcription | float | Seconds of audio waiting to be transcribed (processing lag). |
| remaining_time_diarization | float | Seconds of audio waiting for speaker diarization. |
| error | string | Only present when an error occurred (e.g. FFmpeg failure). |

Line Object

Each element in lines has the following shape:

| Field | Type | Presence | Description |
|---|---|---|---|
| speaker | int | Always | Speaker ID. Normally 1, 2, 3, etc. The special value -2 indicates a silence segment. When diarization is disabled, defaults to 1. |
| text | string or null | Always | The transcribed text for this segment. null for silence segments. |
| start | string | Always | Start timestamp formatted as H:MM:SS (e.g. "0:00:03"). |
| end | string | Always | End timestamp formatted as H:MM:SS. |
| translation | string | Conditional | Present only when translation is enabled and available for this line. |
| detected_language | string | Conditional | Present only when language detection produced a result for this line (e.g. "en"). |
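
As an illustration of how a client might display a full-mode update, the sketch below renders committed lines followed by the ephemeral buffer. The output layout and the name render_update are arbitrary choices for this example.

```python
SILENCE_SPEAKER = -2  # special speaker value marking a silence segment

def render_update(update: dict) -> str:
    """Format committed lines, then the uncommitted buffer, as plain text."""
    rows = []
    for line in update.get("lines", []):
        # Silence segments carry no text, so they are skipped here.
        if line["speaker"] == SILENCE_SPEAKER or line.get("text") is None:
            continue
        row = f'[{line["start"]} - {line["end"]}] Speaker {line["speaker"]}: {line["text"]}'
        if "translation" in line:  # conditional field
            row += f' ({line["translation"]})'
        rows.append(row)
    buffer_text = update.get("buffer_transcription", "")
    if buffer_text:
        rows.append(f"... {buffer_text}")  # ephemeral, replaced on every update
    return "\n".join(rows)

update = {
    "status": "active_transcription",
    "lines": [
        {"speaker": 1, "text": "Hello world, how are you?",
         "start": "0:00:00", "end": "0:00:03"},
        {"speaker": 2, "text": "I am fine, thanks.",
         "start": "0:00:04", "end": "0:00:06",
         "translation": "Je vais bien, merci.", "detected_language": "en"},
    ],
    "buffer_transcription": "And you",
}
rendered = render_update(update)
```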

Snapshot (diff mode)

When mode=diff, the first transcription message is always a snapshot containing the full state. It has the same fields as a full-mode transcription update, plus the type and seq metadata fields.

{
  "type": "snapshot",
  "seq": 1,
  "status": "active_transcription",
  "lines": [ ... ],
  "buffer_transcription": "",
  "buffer_diarization": "",
  "buffer_translation": "",
  "remaining_time_transcription": 0.0,
  "remaining_time_diarization": 0.0
}

| Field | Type | Description |
|---|---|---|
| type | string | "snapshot". |
| seq | int | Monotonically increasing sequence number, starting at 1. |
| (remaining fields) | | Same as a full-mode transcription update. |

Diff (diff mode)

All messages after the initial snapshot are diffs.

{
  "type": "diff",
  "seq": 4,
  "status": "active_transcription",
  "n_lines": 5,
  "lines_pruned": 1,
  "new_lines": [
    {
      "speaker": 1,
      "text": "This is a new line.",
      "start": "0:00:12",
      "end": "0:00:14"
    }
  ],
  "buffer_transcription": "partial text",
  "buffer_diarization": "",
  "buffer_translation": "",
  "remaining_time_transcription": 0.3,
  "remaining_time_diarization": 0.1
}

| Field | Type | Presence | Description |
|---|---|---|---|
| type | string | Always | "diff". |
| seq | int | Always | Sequence number. |
| status | string | Always | Same as full mode. |
| n_lines | int | Always | Total number of lines the client should have after applying this diff. Use this to verify sync. |
| lines_pruned | int | Conditional | Number of lines to remove from the front of the client's line list. Only present when > 0. |
| new_lines | array | Conditional | Lines to append to the end of the client's line list. Only present when there are new lines. |
| buffer_transcription | string | Always | Replaces the previous buffer value. |
| buffer_diarization | string | Always | Replaces the previous buffer value. |
| buffer_translation | string | Always | Replaces the previous buffer value. |
| remaining_time_transcription | float | Always | Replaces the previous value. |
| remaining_time_diarization | float | Always | Replaces the previous value. |
| error | string | Conditional | Only present on error. |

Ready to Stop

Sent after all audio has been processed (i.e., after the client sent the end-of-audio signal and the server finished processing the remaining audio).

{
  "type": "ready_to_stop"
}

Client to Server Messages

Audio Frames

Send binary WebSocket frames containing audio data.

When useAudioWorklet is true (server started with --pcm-input):

  • PCM signed 16-bit little-endian, 16 kHz, mono (s16le).
  • Any chunk size works. A typical chunk is 0.5 seconds (16,000 bytes).

When useAudioWorklet is false:

  • Raw encoded audio bytes (any format FFmpeg can decode: WAV, MP3, FLAC, OGG, etc.).
  • The server pipes these bytes through FFmpeg for decoding.
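
For the PCM case, the chunk arithmetic works out as 16,000 samples/s × 2 bytes × 0.5 s = 16,000 bytes. The sketch below converts float samples to s16le and slices a buffer into frames of that size; the helper names are hypothetical.

```python
import struct

SAMPLE_RATE = 16_000   # Hz, mono
BYTES_PER_SAMPLE = 2   # signed 16-bit

def floats_to_s16le(samples):
    """Convert floats in [-1.0, 1.0] to little-endian signed 16-bit PCM bytes."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)

def pcm_chunks(pcm: bytes, seconds: float = 0.5):
    """Yield consecutive frames of `seconds` of audio; the last may be shorter."""
    size = int(SAMPLE_RATE * seconds) * BYTES_PER_SAMPLE
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]

one_second = floats_to_s16le([0.0] * SAMPLE_RATE)  # one second of digital silence
frames = list(pcm_chunks(one_second))
```

Each element of frames can then be sent as one binary WebSocket frame.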

End-of-Audio Signal

Send an empty binary frame (b"") to tell the server that no more audio will follow. The server will finish processing any remaining audio and then send a ready_to_stop message.


Diff Protocol: Client Reconstruction

Clients using mode=diff must maintain a local list of lines and apply diffs incrementally.

Algorithm

def reconstruct_state(msg, lines):
    """Apply a snapshot or diff message to a local lines list.

    Args:
        msg: The parsed JSON message from the server.
        lines: The client's mutable list of line objects.

    Returns:
        A full-state dict with all fields.
    """
    if msg["type"] == "snapshot":
        lines.clear()
        lines.extend(msg.get("lines", []))
        return msg

    # Apply diff
    n_pruned = msg.get("lines_pruned", 0)
    if n_pruned > 0:
        del lines[:n_pruned]

    new_lines = msg.get("new_lines", [])
    lines.extend(new_lines)

    # Volatile fields are replaced wholesale
    return {
        "status": msg.get("status", ""),
        "lines": lines[:],
        "buffer_transcription": msg.get("buffer_transcription", ""),
        "buffer_diarization": msg.get("buffer_diarization", ""),
        "buffer_translation": msg.get("buffer_translation", ""),
        "remaining_time_transcription": msg.get("remaining_time_transcription", 0),
        "remaining_time_diarization": msg.get("remaining_time_diarization", 0),
    }

Verification

After applying a diff, check that len(lines) == msg["n_lines"]. A mismatch indicates the client fell out of sync and should reconnect.
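
One way to package this check is a small tracker that applies each message and raises when the line count disagrees. The sketch below also rejects sequence numbers that fail to increase, which is an extra precaution based on seq being described as monotonically increasing. DiffTracker and OutOfSync are hypothetical names.

```python
class OutOfSync(Exception):
    """Raised when local state no longer matches what the server expects."""

class DiffTracker:
    def __init__(self):
        self.lines = []
        self.last_seq = 0

    def apply(self, msg: dict) -> list:
        """Apply a snapshot or diff message and verify the resulting state."""
        seq = msg["seq"]
        if seq <= self.last_seq:
            raise OutOfSync(f"seq {seq} does not follow {self.last_seq}")
        self.last_seq = seq

        if msg["type"] == "snapshot":
            self.lines = list(msg.get("lines", []))
            return self.lines

        # Diff: prune from the front, then append new lines at the end.
        del self.lines[:msg.get("lines_pruned", 0)]
        self.lines.extend(msg.get("new_lines", []))
        if len(self.lines) != msg["n_lines"]:
            raise OutOfSync(
                f"have {len(self.lines)} lines, server expects {msg['n_lines']}"
            )
        return self.lines

tracker = DiffTracker()
tracker.apply({"type": "snapshot", "seq": 1,
               "lines": [{"text": "a"}, {"text": "b"}]})
tracker.apply({"type": "diff", "seq": 2, "n_lines": 2,
               "lines_pruned": 1, "new_lines": [{"text": "c"}]})
```

On OutOfSync, the client would discard the tracker, reconnect, and start again from the next snapshot.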


Silence Representation

Silence segments are represented as lines with speaker set to -2 and text set to null:

{
  "speaker": -2,
  "text": null,
  "start": "0:00:10",
  "end": "0:00:16"
}

Silence segments are only generated for pauses longer than 5 seconds.
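
Clients that want to report pause lengths can derive them from the H:MM:SS timestamps on silence lines. This is a minimal sketch with hypothetical helper names and made-up sample lines.

```python
def parse_timestamp(value: str) -> int:
    """Convert an H:MM:SS string to whole seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def silence_seconds(lines) -> int:
    """Total duration of all silence segments (speaker == -2) in a line list."""
    return sum(
        parse_timestamp(line["end"]) - parse_timestamp(line["start"])
        for line in lines
        if line["speaker"] == -2
    )

lines = [
    {"speaker": 1, "text": "Hello.", "start": "0:00:00", "end": "0:00:03"},
    {"speaker": -2, "text": None, "start": "0:00:10", "end": "0:00:16"},
    {"speaker": -2, "text": None, "start": "0:01:00", "end": "0:01:08"},
]
total = silence_seconds(lines)
```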


Per-Session Language

The language query parameter creates an isolated language context for the session using SessionASRProxy. The proxy temporarily overrides the shared ASR backend's language during transcription calls, protected by a lock. This means:

  • Each WebSocket session can transcribe in a different language.
  • Sessions are thread-safe and do not interfere with each other.
  • Pass "auto" to use automatic language detection for the session regardless of the server-wide setting.
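
The override-under-lock pattern described above can be illustrated as follows. This is a sketch of the general technique, not the actual SessionASRProxy implementation. FakeBackend, LanguageOverrideProxy, and their attributes are hypothetical stand-ins.

```python
import threading

class FakeBackend:
    """Stand-in for a shared ASR backend with a mutable language setting."""
    def __init__(self, language: str):
        self.language = language

    def transcribe(self, audio: bytes) -> str:
        return f"{self.language}:{len(audio)}"

class LanguageOverrideProxy:
    """Wraps a shared backend and swaps its language for the duration of a call."""
    def __init__(self, backend, language: str, lock: threading.Lock):
        self._backend = backend
        self._language = language
        self._lock = lock  # shared by every proxy that wraps this backend

    def transcribe(self, audio: bytes) -> str:
        with self._lock:  # only one session may touch the backend at a time
            previous = self._backend.language
            self._backend.language = self._language
            try:
                return self._backend.transcribe(audio)
            finally:
                self._backend.language = previous  # always restore the default

backend = FakeBackend("en")
lock = threading.Lock()
french = LanguageOverrideProxy(backend, "fr", lock)
german = LanguageOverrideProxy(backend, "de", lock)
results = (french.transcribe(b"abcd"), german.transcribe(b"ab"))
```

A consequence of this pattern is that calls through the shared lock run one at a time, so sessions stay isolated at the cost of serialized backend access.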