LLM/WhisperLiveKit

mirror of https://github.com/QuentinFuxa/WhisperLiveKit.git synced 2026-03-22 00:50:44 +00:00

Quentin Fuxa 10d85ff65f Update docs, CI, and architecture diagram

2026-03-08 15:14:00 +01:00

WhisperLiveKit API Reference

This document describes every WhisperLiveKit interface: the OpenAI-compatible REST API, the Deepgram-compatible WebSocket API, the CLI, the Python APIs, and the native WebSocket streaming API.

REST API (OpenAI-compatible)

POST /v1/audio/transcriptions

Drop-in replacement for the OpenAI Audio Transcriptions API. Accepts the same parameters, although some are accepted only for compatibility and are ignored (see the table below).

curl http://localhost:8000/v1/audio/transcriptions \
  -F file=@audio.wav \
  -F response_format=json

Parameters (multipart form):

Parameter	Type	Default	Description
`file`	file	required	Audio file (any format ffmpeg can decode)
`model`	string	`""`	Accepted but ignored (uses server's backend)
`language`	string	`null`	ISO 639-1 language code or null for auto-detection
`prompt`	string	`""`	Accepted for compatibility, not yet used
`response_format`	string	`"json"`	`json`, `verbose_json`, `text`, `srt`, `vtt`
`timestamp_granularities`	array	`null`	Accepted for compatibility

Response formats:

json (default):

{"text": "Hello world, how are you?"}

verbose_json:

{
  "task": "transcribe",
  "language": "en",
  "duration": 7.16,
  "text": "Hello world",
  "words": [{"word": "Hello", "start": 0.0, "end": 0.5}, ...],
  "segments": [{"id": 0, "start": 0.0, "end": 3.5, "text": "Hello world"}]
}

text: Plain text response.

srt / vtt: Subtitle format.
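
To make the relationship between verbose_json and the subtitle formats concrete, here is a minimal sketch that renders verbose_json-style segments as SRT cues on the client side. It assumes only the segment fields shown above (start, end, text); the helper names are hypothetical, and the server's own formatter may differ in details such as line wrapping.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Render segments as SRT cues, numbered from 1 and separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(cues)


srt_text = segments_to_srt(
    [{"id": 0, "start": 0.0, "end": 3.5, "text": "Hello world"}]
)
```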

GET /v1/models

List the currently loaded model.

curl http://localhost:8000/v1/models

GET /health

Server health check.

curl http://localhost:8000/health

Deepgram-Compatible WebSocket API

WS /v1/listen

Drop-in compatible with Deepgram's Live Transcription WebSocket. Connect using any Deepgram client SDK pointed at your local server.

from deepgram import DeepgramClient, LiveOptions

deepgram = DeepgramClient(api_key="unused", config={"url": "localhost:8000"})
connection = deepgram.listen.websocket.v("1")
connection.start(LiveOptions(model="nova-2", language="en"))

Query Parameters: Same as Deepgram (language, punctuate, interim_results, vad_events, etc.).

Client Messages:

Binary audio frames
{"type": "KeepAlive"} — keep connection alive
{"type": "CloseStream"} — graceful close
{"type": "Finalize"} — flush pending audio

Server Messages:

Metadata — sent once at connection start
Results — transcription results with is_final/speech_final flags
UtteranceEnd — silence detected after speech
SpeechStarted — speech begins (requires vad_events=true)
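
For clients that do not use a Deepgram SDK, the message types listed above can be handled with a few lines of JSON code. The sketch below builds the client control messages and classifies incoming server messages by their type field. The function names are hypothetical, and it assumes control messages are sent as JSON text frames, as Deepgram's protocol does.

```python
import json

# Message types taken from the lists above.
CONTROL_TYPES = {"KeepAlive", "CloseStream", "Finalize"}
SERVER_TYPES = {"Metadata", "Results", "UtteranceEnd", "SpeechStarted"}


def control_message(kind: str) -> str:
    """Serialize one of the client control messages as a JSON text frame."""
    if kind not in CONTROL_TYPES:
        raise ValueError(f"unknown control message: {kind!r}")
    return json.dumps({"type": kind})


def classify_server_message(raw: str) -> str:
    """Return the server message type, or "Unknown" for anything unexpected."""
    kind = json.loads(raw).get("type")
    return kind if kind in SERVER_TYPES else "Unknown"


keep_alive = control_message("KeepAlive")
```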

Limitations vs Deepgram:

No authentication (self-hosted)
Word timestamps are interpolated from segment boundaries
Confidence scores are 0.0 (not available)
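
To illustrate what interpolated word timestamps mean in practice, the sketch below spreads a segment's time span across its words in proportion to word length. This is one plausible scheme chosen for illustration, not necessarily the one the server uses; the function name is hypothetical.

```python
def interpolate_word_times(text: str, start: float, end: float) -> list[dict]:
    """Assign each word a share of [start, end] proportional to its length.

    Assumption: character count is a rough proxy for spoken duration.
    """
    words = text.split()
    total_chars = sum(len(word) for word in words)
    if total_chars == 0:
        return []
    result = []
    cursor = start
    for word in words:
        span = (end - start) * len(word) / total_chars
        result.append({"word": word, "start": cursor, "end": cursor + span})
        cursor += span
    return result


words = interpolate_word_times("Hello world", 0.0, 3.5)
```

Timestamps produced this way are only as accurate as the segment boundaries, so they should not be relied on for frame-accurate alignment.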

CLI

`wlk` / `wlk serve`

Start the transcription server.

wlk                                    # Start with defaults
wlk --backend voxtral --model base     # Specific backend
wlk serve --port 9000 --lan fr         # Explicit serve command

`wlk listen`

Live microphone transcription. Requires sounddevice (pip install sounddevice).

wlk listen                             # Transcribe from microphone
wlk listen --backend voxtral           # Use specific backend
wlk listen --language fr               # Force French
wlk listen --diarization               # With speaker identification
wlk listen -o transcript.txt           # Save to file on exit

Committed lines print as they are finalized. The current buffer (partial transcription) is shown in gray and updates in-place. Press Ctrl+C to stop; remaining audio is flushed before exit.

`wlk run`

Download the model if it is not already present, then start the server.

wlk run voxtral                        # Pull voxtral + start server
wlk run large-v3                       # Pull large-v3 + start server
wlk run faster-whisper:base            # Specific backend + model
wlk run qwen3:1.7b                     # Qwen3-ASR
wlk run voxtral --lan fr --port 9000   # Extra server options passed through

`wlk transcribe`

Transcribe audio files offline (no server needed).

wlk transcribe audio.wav                             # Plain text output
wlk transcribe --format srt audio.wav                # SRT subtitles
wlk transcribe --format json audio.wav               # JSON output
wlk transcribe --backend voxtral audio.wav           # Specific backend
wlk transcribe --model large-v3 --language fr *.wav  # Multiple files
wlk transcribe --output result.srt --format srt audio.wav

`wlk bench`

Benchmark speed (RTF) and accuracy (WER) on standard test audio.

wlk bench                              # Benchmark with defaults
wlk bench --backend faster-whisper     # Specific backend
wlk bench --model large-v3             # Larger model
wlk bench --json results.json          # Export results

Downloads test audio from LibriSpeech on first run. Reports WER (Word Error Rate) and RTF (Real-Time Factor: processing time / audio duration).
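
For readers unfamiliar with the two metrics, the sketch below computes them from their standard definitions: WER as word-level edit distance divided by the number of reference words, and RTF as processing time divided by audio duration. It is not the benchmark's own code; the function names are hypothetical, and it applies no text normalization (case, punctuation), which a real benchmark may do.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF below 1.0 means faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


wer = word_error_rate("hello world how are you", "hello word how you")
rtf = real_time_factor(2.0, 8.0)
```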

`wlk diagnose`

Run pipeline diagnostics on an audio file. Feeds audio through the full pipeline while probing internal backend state at regular intervals. Produces a timeline, flags anomalies, and prints health checks.

wlk diagnose audio.wav                        # Diagnose with default backend
wlk diagnose audio.wav --backend voxtral      # Diagnose specific backend
wlk diagnose --speed 0 --probe-interval 1     # Instant feed, probe every 1s
wlk diagnose                                   # Use built-in test sample

Useful for debugging issues such as missing output, slow transcription, stuck pipelines, or errors in the generate thread.

`wlk models`

List available backends, installation status, and downloaded models.

wlk models

`wlk pull`

Download models for offline use.

wlk pull base                      # Download for best available backend
wlk pull faster-whisper:large-v3   # Specific backend + model
wlk pull voxtral                   # Voxtral HF model
wlk pull qwen3:1.7b               # Qwen3-ASR 1.7B

`wlk rm`

Delete downloaded models to free disk space.

wlk rm base                        # Delete base model
wlk rm voxtral                     # Delete Voxtral model
wlk rm faster-whisper:large-v3     # Delete specific backend model

`wlk check`

Verify system dependencies (Python, ffmpeg, torch, etc.).

`wlk version`

Print the installed version.

Python Client (OpenAI SDK)

WhisperLiveKit's REST API is compatible with the OpenAI Python SDK:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

with open("audio.wav", "rb") as f:
    result = client.audio.transcriptions.create(
        model="whisper-base",  # ignored, uses server's backend
        file=f,
        response_format="verbose_json",
    )
print(result.text)

Programmatic Python API

For direct in-process usage without a server:

import asyncio
from whisperlivekit import TranscriptionEngine, AudioProcessor

async def transcribe(audio_path):
    engine = TranscriptionEngine(model_size="base", lan="en")
    # ... use AudioProcessor for full pipeline control

Or use the TestHarness for simpler usage:

import asyncio
from whisperlivekit import TestHarness

async def main():
    async with TestHarness(model_size="base", lan="en") as h:
        await h.feed("audio.wav", speed=0)
        result = await h.finish()
        print(result.text)

asyncio.run(main())

WebSocket Streaming API

This section describes the WebSocket API for clients that want to stream audio and receive real-time transcription results from a WhisperLiveKit server.

Connection

Endpoint

ws://<host>:<port>/asr

Query Parameters

Parameter	Type	Default	Description
`language`	string	(none)	Per-session language override. ISO 639-1 code (e.g. `fr`, `en`) or `"auto"` for automatic detection. When omitted, uses the server-wide language setting. Multiple sessions with different languages work concurrently.
`mode`	string	`"full"`	Output mode. `"full"` sends complete state on every update. `"diff"` sends incremental diffs after an initial snapshot.

Example:

ws://localhost:8000/asr?language=fr&mode=diff

Connection Flow

1. Client opens a WebSocket connection to /asr.
2. Server accepts the connection and immediately sends a config message.
3. Client streams binary audio frames to the server.
4. Server sends transcription updates as JSON messages.
5. Client sends empty bytes (b"") to signal end of audio.
6. Server finishes processing remaining audio and sends a ready_to_stop message.
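
The flow above can be sketched as a small asyncio client. The function below works with any connection object that exposes async send() and recv() methods (for example, a connection from the third-party websockets package should fit, though that is an assumption). So that the sketch runs offline, it is exercised here against FakeConnection, a hypothetical stand-in that mimics the message order and is not part of WhisperLiveKit.

```python
import asyncio
import json


async def run_session(ws, chunks):
    """Drive one /asr session and collect the transcription updates."""
    config = json.loads(await ws.recv())  # the config message arrives first
    if config.get("type") != "config":
        raise RuntimeError("expected a config message first")

    async def send_audio():
        for chunk in chunks:   # binary audio frames
            await ws.send(chunk)
        await ws.send(b"")     # empty frame signals end of audio

    # Send in the background so updates can be read while audio is streaming.
    sender = asyncio.create_task(send_audio())
    updates = []
    while True:
        msg = json.loads(await ws.recv())
        if msg.get("type") == "ready_to_stop":
            break
        updates.append(msg)
    await sender
    return config, updates


class FakeConnection:
    """Offline stand-in for a server connection (illustration only)."""

    def __init__(self):
        self.outbox = asyncio.Queue()
        self.outbox.put_nowait(json.dumps(
            {"type": "config", "useAudioWorklet": True, "mode": "full"}))

    async def send(self, data):
        if data == b"":
            reply = {"type": "ready_to_stop"}
        else:
            reply = {"status": "active_transcription", "lines": []}
        await self.outbox.put(json.dumps(reply))

    async def recv(self):
        return await self.outbox.get()


async def demo():
    return await run_session(FakeConnection(), [b"\x00\x00" * 8000])


config, updates = asyncio.run(demo())
```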

Server to Client Messages

Config Message

Sent once, immediately after the connection is accepted.

{
  "type": "config",
  "useAudioWorklet": true,
  "mode": "full"
}

Field	Type	Description
`type`	string	Always `"config"`.
`useAudioWorklet`	bool	`true` when the server expects PCM s16le 16kHz mono input (started with `--pcm-input`). `false` when the server expects encoded audio (decoded server-side via FFmpeg).
`mode`	string	`"full"` or `"diff"`, echoing the requested mode.

Transcription Update (full mode)

Sent repeatedly as audio is processed. This message has no type field.

{
  "status": "active_transcription",
  "lines": [
    {
      "speaker": 1,
      "text": "Hello world, how are you?",
      "start": "0:00:00",
      "end": "0:00:03"
    },
    {
      "speaker": 2,
      "text": "I am fine, thanks.",
      "start": "0:00:04",
      "end": "0:00:06",
      "translation": "Je vais bien, merci.",
      "detected_language": "en"
    }
  ],
  "buffer_transcription": "And you",
  "buffer_diarization": "",
  "buffer_translation": "",
  "remaining_time_transcription": 1.2,
  "remaining_time_diarization": 0.5
}

Field	Type	Description
`status`	string	`"active_transcription"` during normal operation. `"no_audio_detected"` when no speech has been detected yet.
`lines`	array	Committed transcription segments. Each update sends the full list of all committed lines (not incremental).
`buffer_transcription`	string	Ephemeral transcription text not yet committed to a line. Displayed in real time but overwritten on every update.
`buffer_diarization`	string	Ephemeral text waiting for speaker attribution.
`buffer_translation`	string	Ephemeral translation text for the current buffer.
`remaining_time_transcription`	float	Seconds of audio waiting to be transcribed (processing lag).
`remaining_time_diarization`	float	Seconds of audio waiting for speaker diarization.
`error`	string	Only present when an error occurred (e.g. FFmpeg failure).
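
As an illustration of how a client might display a full-mode update, the sketch below joins the committed lines and appends the ephemeral buffer. The output layout and the function name are arbitrary choices for this example, not something the server prescribes.

```python
def render_update(update: dict) -> str:
    """Turn a full-mode update into display text: committed lines, then buffer."""
    parts = []
    for line in update.get("lines", []):
        if line.get("text"):  # skips silence lines, whose text is null
            parts.append(f"[speaker {line['speaker']}] {line['text']}")
    buffer_text = update.get("buffer_transcription", "")
    if buffer_text:
        parts.append(f"... {buffer_text}")  # not yet committed, may change
    return "\n".join(parts)


display = render_update({
    "status": "active_transcription",
    "lines": [
        {"speaker": 1, "text": "Hello world, how are you?",
         "start": "0:00:00", "end": "0:00:03"},
        {"speaker": 2, "text": "I am fine, thanks.",
         "start": "0:00:04", "end": "0:00:06"},
    ],
    "buffer_transcription": "And you",
})
```

Because every full-mode update carries the complete list of lines, the client can re-render from scratch on each message without keeping state.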

Line Object

Each element in lines has the following shape:

Field	Type	Presence	Description
`speaker`	int	Always	Speaker ID. Normally `1`, `2`, `3`, etc. The special value `-2` indicates a silence segment. When diarization is disabled, defaults to `1`.
`text`	string or null	Always	The transcribed text for this segment. `null` for silence segments.
`start`	string	Always	Start timestamp formatted as `H:MM:SS` (e.g. `"0:00:03"`).
`end`	string	Always	End timestamp formatted as `H:MM:SS`.
`translation`	string	Conditional	Present only when translation is enabled and available for this line.
`detected_language`	string	Conditional	Present only when language detection produced a result for this line (e.g. `"en"`).
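
Since start and end arrive as H:MM:SS strings rather than numbers, clients that need arithmetic on them must parse them first. A minimal sketch, with hypothetical helper names:

```python
def parse_timestamp(value: str) -> int:
    """Convert an H:MM:SS string such as "0:00:03" to whole seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def line_duration(line: dict) -> int:
    """Duration of a line object in seconds."""
    return parse_timestamp(line["end"]) - parse_timestamp(line["start"])


duration = line_duration({"speaker": 2, "text": "I am fine, thanks.",
                          "start": "0:00:04", "end": "0:00:06"})
```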

Snapshot (diff mode)

When mode=diff, the first transcription message is always a snapshot containing the full state. It has the same fields as a full-mode transcription update, plus metadata fields.

{
  "type": "snapshot",
  "seq": 1,
  "status": "active_transcription",
  "lines": [ ... ],
  "buffer_transcription": "",
  "buffer_diarization": "",
  "buffer_translation": "",
  "remaining_time_transcription": 0.0,
  "remaining_time_diarization": 0.0
}

Field	Type	Description
`type`	string	`"snapshot"`.
`seq`	int	Monotonically increasing sequence number, starting at 1.
(remaining fields)		Same as a full-mode transcription update.

Diff (diff mode)

All messages after the initial snapshot are diffs.

{
  "type": "diff",
  "seq": 4,
  "status": "active_transcription",
  "n_lines": 5,
  "lines_pruned": 1,
  "new_lines": [
    {
      "speaker": 1,
      "text": "This is a new line.",
      "start": "0:00:12",
      "end": "0:00:14"
    }
  ],
  "buffer_transcription": "partial text",
  "buffer_diarization": "",
  "buffer_translation": "",
  "remaining_time_transcription": 0.3,
  "remaining_time_diarization": 0.1
}

Field	Type	Presence	Description
`type`	string	Always	`"diff"`.
`seq`	int	Always	Sequence number.
`status`	string	Always	Same as full mode.
`n_lines`	int	Always	Total number of lines the client should have after applying this diff. Use this to verify sync.
`lines_pruned`	int	Conditional	Number of lines to remove from the front of the client's line list. Only present when > 0.
`new_lines`	array	Conditional	Lines to append to the end of the client's line list. Only present when there are new lines.
`buffer_transcription`	string	Always	Replaces the previous buffer value.
`buffer_diarization`	string	Always	Replaces the previous buffer value.
`buffer_translation`	string	Always	Replaces the previous buffer value.
`remaining_time_transcription`	float	Always	Replaces the previous value.
`remaining_time_diarization`	float	Always	Replaces the previous value.
`error`	string	Conditional	Only present on error.

Ready to Stop

Sent after all audio has been processed (i.e., after the client sent the end-of-audio signal and the server finished processing the remaining audio).

{
  "type": "ready_to_stop"
}

Client to Server Messages

Audio Frames

Send binary WebSocket frames containing audio data.

When useAudioWorklet is true (server started with --pcm-input):

PCM signed 16-bit little-endian, 16 kHz, mono (s16le).
Any chunk size works. A typical chunk is 0.5 seconds (16,000 bytes).

When useAudioWorklet is false:

Raw encoded audio bytes (any format FFmpeg can decode: WAV, MP3, FLAC, OGG, etc.).
The server pipes these bytes through FFmpeg for decoding.
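
For the PCM case, the chunk size arithmetic is: 16,000 samples per second × 2 bytes per sample × 0.5 seconds = 16,000 bytes. The sketch below splits a raw s16le buffer into chunks of that size; the function name is hypothetical.

```python
SAMPLE_RATE = 16_000   # Hz, as required for PCM input
BYTES_PER_SAMPLE = 2   # signed 16-bit samples, mono


def pcm_chunks(pcm: bytes, chunk_seconds: float = 0.5):
    """Yield successive chunks of raw PCM; the last one may be shorter."""
    chunk_bytes = int(SAMPLE_RATE * chunk_seconds) * BYTES_PER_SAMPLE
    if chunk_bytes <= 0:
        raise ValueError("chunk_seconds is too small")
    for offset in range(0, len(pcm), chunk_bytes):
        yield pcm[offset:offset + chunk_bytes]


# 1.25 seconds of silence: 40,000 bytes.
sizes = [len(chunk) for chunk in pcm_chunks(bytes(40_000))]
```

Computing the chunk size in whole samples keeps every chunk boundary aligned to a 2-byte sample.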

End-of-Audio Signal

Send an empty binary frame (b"") to tell the server that no more audio will follow. The server will finish processing any remaining audio and then send a ready_to_stop message.

Diff Protocol: Client Reconstruction

Clients using mode=diff must maintain a local list of lines and apply diffs incrementally.

Algorithm

def reconstruct_state(msg, lines):
    """Apply a snapshot or diff message to a local lines list.

    Args:
        msg: The parsed JSON message from the server.
        lines: The client's mutable list of line objects.

    Returns:
        A full-state dict with all fields.
    """
    if msg["type"] == "snapshot":
        lines.clear()
        lines.extend(msg.get("lines", []))
        return msg

    # Apply diff
    n_pruned = msg.get("lines_pruned", 0)
    if n_pruned > 0:
        del lines[:n_pruned]

    new_lines = msg.get("new_lines", [])
    lines.extend(new_lines)

    # Volatile fields are replaced wholesale
    return {
        "status": msg.get("status", ""),
        "lines": lines[:],
        "buffer_transcription": msg.get("buffer_transcription", ""),
        "buffer_diarization": msg.get("buffer_diarization", ""),
        "buffer_translation": msg.get("buffer_translation", ""),
        "remaining_time_transcription": msg.get("remaining_time_transcription", 0),
        "remaining_time_diarization": msg.get("remaining_time_diarization", 0),
    }

Verification

After applying a diff, check that len(lines) == msg["n_lines"]. A mismatch indicates the client fell out of sync and should reconnect.
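
This check can be written as a small guard that runs after each diff is applied. The sketch below raises a hypothetical OutOfSyncError on mismatch; how the client reconnects afterwards is left to the application.

```python
class OutOfSyncError(Exception):
    """Raised when the local line list no longer matches the server's count."""


def check_sync(lines: list, msg: dict) -> None:
    """Compare the local line count with n_lines after a diff was applied."""
    if msg.get("type") != "diff":
        return  # snapshots replace the list wholesale, nothing to verify
    expected = msg["n_lines"]
    if len(lines) != expected:
        raise OutOfSyncError(
            f"client has {len(lines)} lines, server expects {expected}")


local_lines = [{"speaker": 1, "text": "a"}, {"speaker": 1, "text": "b"}]
check_sync(local_lines, {"type": "diff", "n_lines": 2})  # passes silently
```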

Silence Representation

Silence segments are represented as lines with speaker set to -2 and text set to null:

{
  "speaker": -2,
  "text": null,
  "start": "0:00:10",
  "end": "0:00:12"
}

Silence segments are only generated for pauses longer than 5 seconds.
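
Clients that only want spoken text can filter on the speaker value. A minimal sketch with hypothetical helper names:

```python
SILENCE_SPEAKER = -2  # special speaker value documented above


def is_silence(line: dict) -> bool:
    """True when a line object represents a silence segment."""
    return line.get("speaker") == SILENCE_SPEAKER


def spoken_lines(lines: list[dict]) -> list[dict]:
    """Drop silence segments, keeping the original order."""
    return [line for line in lines if not is_silence(line)]


kept = spoken_lines([
    {"speaker": 1, "text": "Hello", "start": "0:00:00", "end": "0:00:03"},
    {"speaker": -2, "text": None, "start": "0:00:10", "end": "0:00:12"},
])
```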

Per-Session Language

The language query parameter creates an isolated language context for the session using SessionASRProxy. The proxy temporarily overrides the shared ASR backend's language during transcription calls, protected by a lock. This means:

Each WebSocket session can transcribe in a different language.
Sessions are thread-safe and do not interfere with each other.
Pass "auto" to use automatic language detection for the session regardless of the server-wide setting.
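
The override-under-lock pattern described above can be illustrated as follows. This is a simplified sketch of the idea, not the actual SessionASRProxy code: SharedBackend and LanguageOverrideProxy are hypothetical names, and the real backend interface will differ.

```python
import threading


class SharedBackend:
    """Hypothetical shared ASR backend with a mutable language setting."""

    def __init__(self, language: str):
        self.language = language

    def transcribe(self, audio: bytes) -> str:
        return f"<{self.language}> {len(audio)} bytes"


class LanguageOverrideProxy:
    """Per-session wrapper that swaps the backend language during a call."""

    def __init__(self, backend: SharedBackend, language: str,
                 lock: threading.Lock):
        self._backend = backend
        self._language = language
        self._lock = lock  # one lock shared by all sessions

    def transcribe(self, audio: bytes) -> str:
        with self._lock:  # no other session can change the language meanwhile
            previous = self._backend.language
            self._backend.language = self._language
            try:
                return self._backend.transcribe(audio)
            finally:
                self._backend.language = previous  # restore the shared setting


backend = SharedBackend("en")
lock = threading.Lock()
french_session = LanguageOverrideProxy(backend, "fr", lock)
result = french_session.transcribe(b"\x00\x00")
```

A consequence of this design is that transcription calls on the shared backend are serialized while the lock is held, which trades some parallelism for isolation between sessions.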
