mirror of
https://github.com/QuentinFuxa/WhisperLiveKit.git
synced 2026-04-05 04:51:20 +00:00
550 lines
18 KiB
Markdown
550 lines
18 KiB
Markdown
# WhisperLiveKit API Reference
|
|
|
|
This document describes all APIs: the WebSocket streaming API, the OpenAI-compatible REST API, and the CLI.
|
|
|
|
---
|
|
|
|
## REST API (OpenAI-compatible)
|
|
|
|
### POST /v1/audio/transcriptions
|
|
|
|
Drop-in replacement for the OpenAI Audio Transcriptions API. Accepts the same parameters.
|
|
|
|
```bash
|
|
curl http://localhost:8000/v1/audio/transcriptions \
|
|
-F file=@audio.wav \
|
|
-F response_format=json
|
|
```
|
|
|
|
**Parameters (multipart form):**
|
|
|
|
| Parameter | Type | Default | Description |
|
|
|--------------------------|----------|---------|-------------|
|
|
| `file` | file | required | Audio file (any format ffmpeg can decode) |
|
|
| `model` | string | `""` | Accepted but ignored (uses server's backend) |
|
|
| `language` | string | `null` | ISO 639-1 language code or null for auto-detection |
|
|
| `prompt` | string | `""` | Accepted for compatibility, not yet used |
|
|
| `response_format` | string | `"json"` | `json`, `verbose_json`, `text`, `srt`, `vtt` |
|
|
| `timestamp_granularities`| array | `null` | Accepted for compatibility |
|
|
|
|
**Response formats:**
|
|
|
|
`json` (default):
|
|
```json
|
|
{"text": "Hello world, how are you?"}
|
|
```
|
|
|
|
`verbose_json`:
|
|
```json
|
|
{
|
|
"task": "transcribe",
|
|
"language": "en",
|
|
"duration": 7.16,
|
|
"text": "Hello world",
|
|
"words": [{"word": "Hello", "start": 0.0, "end": 0.5}, ...],
|
|
"segments": [{"id": 0, "start": 0.0, "end": 3.5, "text": "Hello world"}]
|
|
}
|
|
```
|
|
|
|
`text`: Plain text response.
|
|
|
|
`srt` / `vtt`: Subtitle format.
|
|
|
|
### GET /v1/models
|
|
|
|
List the currently loaded model.
|
|
|
|
```bash
|
|
curl http://localhost:8000/v1/models
|
|
```
|
|
|
|
### GET /health
|
|
|
|
Server health check.
|
|
|
|
```bash
|
|
curl http://localhost:8000/health
|
|
```
|
|
|
|
---
|
|
|
|
## Deepgram-Compatible WebSocket API
|
|
|
|
### WS /v1/listen
|
|
|
|
Drop-in compatible with Deepgram's Live Transcription WebSocket. Connect using any Deepgram client SDK pointed at your local server.
|
|
|
|
```python
|
|
from deepgram import DeepgramClient, LiveOptions
|
|
|
|
deepgram = DeepgramClient(api_key="unused", config={"url": "localhost:8000"})
|
|
connection = deepgram.listen.websocket.v("1")
|
|
connection.start(LiveOptions(model="nova-2", language="en"))
|
|
```
|
|
|
|
**Query Parameters:** Same as Deepgram (`language`, `punctuate`, `interim_results`, `vad_events`, etc.).
|
|
|
|
**Client Messages:**
|
|
- Binary audio frames
|
|
- `{"type": "KeepAlive"}` — keep connection alive
|
|
- `{"type": "CloseStream"}` — graceful close
|
|
- `{"type": "Finalize"}` — flush pending audio
|
|
|
|
**Server Messages:**
|
|
- `Metadata` — sent once at connection start
|
|
- `Results` — transcription results with `is_final`/`speech_final` flags
|
|
- `UtteranceEnd` — silence detected after speech
|
|
- `SpeechStarted` — speech begins (requires `vad_events=true`)
|
|
|
|
**Limitations vs Deepgram:**
|
|
- No authentication (self-hosted)
|
|
- Word timestamps are interpolated from segment boundaries
|
|
- Confidence scores are 0.0 (not available)
|
|
|
|
---
|
|
|
|
## CLI
|
|
|
|
### `wlk` / `wlk serve`
|
|
|
|
Start the transcription server.
|
|
|
|
```bash
|
|
wlk # Start with defaults
|
|
wlk --backend voxtral --model base # Specific backend
|
|
wlk serve --port 9000 --lan fr # Explicit serve command
|
|
```
|
|
|
|
### `wlk listen`
|
|
|
|
Live microphone transcription. Requires `sounddevice` (`pip install sounddevice`).
|
|
|
|
```bash
|
|
wlk listen # Transcribe from microphone
|
|
wlk listen --backend voxtral # Use specific backend
|
|
wlk listen --language fr # Force French
|
|
wlk listen --diarization # With speaker identification
|
|
wlk listen -o transcript.txt # Save to file on exit
|
|
```
|
|
|
|
Committed lines print as they are finalized. The current buffer (partial transcription) is shown in gray and updates in-place. Press Ctrl+C to stop; remaining audio is flushed before exit.
|
|
|
|
### `wlk run`
|
|
|
|
Auto-pull model if not downloaded, then start the server.
|
|
|
|
```bash
|
|
wlk run voxtral # Pull voxtral + start server
|
|
wlk run large-v3 # Pull large-v3 + start server
|
|
wlk run faster-whisper:base # Specific backend + model
|
|
wlk run qwen3:1.7b # Qwen3-ASR
|
|
wlk run voxtral --lan fr --port 9000 # Extra server options passed through
|
|
```
|
|
|
|
### `wlk transcribe`
|
|
|
|
Transcribe audio files offline (no server needed).
|
|
|
|
```bash
|
|
wlk transcribe audio.wav # Plain text output
|
|
wlk transcribe --format srt audio.wav # SRT subtitles
|
|
wlk transcribe --format json audio.wav # JSON output
|
|
wlk transcribe --backend voxtral audio.wav # Specific backend
|
|
wlk transcribe --model large-v3 --language fr *.wav # Multiple files
|
|
wlk transcribe --output result.srt --format srt audio.wav
|
|
```
|
|
|
|
### `wlk bench`
|
|
|
|
Benchmark speed (RTF) and accuracy (WER) on standard test audio.
|
|
|
|
```bash
|
|
wlk bench # Benchmark with defaults
|
|
wlk bench --backend faster-whisper # Specific backend
|
|
wlk bench --model large-v3 # Larger model
|
|
wlk bench --json results.json # Export results
|
|
```
|
|
|
|
Downloads test audio from LibriSpeech on first run. Reports WER (Word Error Rate) and RTF (Real-Time Factor: processing time / audio duration).
|
|
|
|
### `wlk diagnose`
|
|
|
|
Run pipeline diagnostics on an audio file. Feeds audio through the full pipeline while probing internal backend state at regular intervals. Produces a timeline, flags anomalies, and prints health checks.
|
|
|
|
```bash
|
|
wlk diagnose audio.wav # Diagnose with default backend
|
|
wlk diagnose audio.wav --backend voxtral # Diagnose specific backend
|
|
wlk diagnose --speed 0 --probe-interval 1 # Instant feed, probe every 1s
|
|
wlk diagnose # Use built-in test sample
|
|
```
|
|
|
|
Useful for debugging issues like: no output appearing, slow transcription, stuck pipelines, or generate thread errors.
|
|
|
|
### `wlk models`
|
|
|
|
List available backends, installation status, and downloaded models.
|
|
|
|
```bash
|
|
wlk models
|
|
```
|
|
|
|
### `wlk pull`
|
|
|
|
Download models for offline use.
|
|
|
|
```bash
|
|
wlk pull base # Download for best available backend
|
|
wlk pull faster-whisper:large-v3 # Specific backend + model
|
|
wlk pull voxtral # Voxtral HF model
|
|
wlk pull qwen3:1.7b # Qwen3-ASR 1.7B
|
|
```
|
|
|
|
### `wlk rm`
|
|
|
|
Delete downloaded models to free disk space.
|
|
|
|
```bash
|
|
wlk rm base # Delete base model
|
|
wlk rm voxtral # Delete Voxtral model
|
|
wlk rm faster-whisper:large-v3 # Delete specific backend model
|
|
```
|
|
|
|
### `wlk check`
|
|
|
|
Verify system dependencies (Python, ffmpeg, torch, etc.).
|
|
|
|
### `wlk version`
|
|
|
|
Print the installed version.
|
|
|
|
### Python Client (OpenAI SDK)
|
|
|
|
WhisperLiveKit's REST API is compatible with the OpenAI Python SDK:
|
|
|
|
```python
|
|
from openai import OpenAI
|
|
|
|
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
|
|
|
|
with open("audio.wav", "rb") as f:
|
|
result = client.audio.transcriptions.create(
|
|
model="whisper-base", # ignored, uses server's backend
|
|
file=f,
|
|
response_format="verbose_json",
|
|
)
|
|
print(result.text)
|
|
```
|
|
|
|
### Programmatic Python API
|
|
|
|
For direct in-process usage without a server:
|
|
|
|
```python
|
|
import asyncio
|
|
from whisperlivekit import TranscriptionEngine, AudioProcessor
|
|
|
|
async def transcribe(audio_path):
|
|
engine = TranscriptionEngine(model_size="base", lan="en")
|
|
# ... use AudioProcessor for full pipeline control
|
|
```
|
|
|
|
Or use the TestHarness for simpler usage:
|
|
|
|
```python
|
|
import asyncio
|
|
from whisperlivekit import TestHarness
|
|
|
|
async def main():
|
|
async with TestHarness(model_size="base", lan="en") as h:
|
|
await h.feed("audio.wav", speed=0)
|
|
result = await h.finish()
|
|
print(result.text)
|
|
|
|
asyncio.run(main())
|
|
```
|
|
|
|
---
|
|
|
|
## WebSocket Streaming API
|
|
|
|
This section describes the WebSocket API for clients that want to stream audio and receive real-time transcription results from a WhisperLiveKit server.
|
|
|
|
---
|
|
|
|
## Connection
|
|
|
|
### Endpoint
|
|
|
|
```
|
|
ws://<host>:<port>/asr
|
|
```
|
|
|
|
### Query Parameters
|
|
|
|
| Parameter | Type | Default | Description |
|
|
|------------|--------|----------|-------------|
|
|
| `language` | string | _(none)_ | Per-session language override. ISO 639-1 code (e.g. `fr`, `en`) or `"auto"` for automatic detection. When omitted, uses the server-wide language setting. Multiple sessions with different languages work concurrently. |
|
|
| `mode` | string | `"full"` | Output mode. `"full"` sends complete state on every update. `"diff"` sends incremental diffs after an initial snapshot. |
|
|
|
|
Example:
|
|
```
|
|
ws://localhost:8000/asr?language=fr&mode=diff
|
|
```
|
|
|
|
### Connection Flow
|
|
|
|
1. Client opens a WebSocket connection to `/asr`.
|
|
2. Server accepts the connection and immediately sends a **config message**.
|
|
3. Client streams binary audio frames to the server.
|
|
4. Server sends transcription updates as JSON messages.
|
|
5. Client sends empty bytes (`b""`) to signal end of audio.
|
|
6. Server finishes processing remaining audio and sends a **ready_to_stop** message.
|
|
|
|
---
|
|
|
|
## Server to Client Messages
|
|
|
|
### Config Message
|
|
|
|
Sent once, immediately after the connection is accepted.
|
|
|
|
```json
|
|
{
|
|
"type": "config",
|
|
"useAudioWorklet": true,
|
|
"mode": "full"
|
|
}
|
|
```
|
|
|
|
| Field | Type | Description |
|
|
|-------------------|--------|-------------|
|
|
| `type` | string | Always `"config"`. |
|
|
| `useAudioWorklet` | bool | `true` when the server expects PCM s16le 16kHz mono input (started with `--pcm-input`). `false` when the server expects encoded audio (decoded server-side via FFmpeg). |
|
|
| `mode` | string | `"full"` or `"diff"`, echoing the requested mode. |
|
|
|
|
### Transcription Update (full mode)
|
|
|
|
Sent repeatedly as audio is processed. This message has **no `type` field**.
|
|
|
|
```json
|
|
{
|
|
"status": "active_transcription",
|
|
"lines": [
|
|
{
|
|
"speaker": 1,
|
|
"text": "Hello world, how are you?",
|
|
"start": "0:00:00",
|
|
"end": "0:00:03"
|
|
},
|
|
{
|
|
"speaker": 2,
|
|
"text": "I am fine, thanks.",
|
|
"start": "0:00:04",
|
|
"end": "0:00:06",
|
|
"translation": "Je vais bien, merci.",
|
|
"detected_language": "en"
|
|
}
|
|
],
|
|
"buffer_transcription": "And you",
|
|
"buffer_diarization": "",
|
|
"buffer_translation": "",
|
|
"remaining_time_transcription": 1.2,
|
|
"remaining_time_diarization": 0.5
|
|
}
|
|
```
|
|
|
|
| Field | Type | Description |
|
|
|--------------------------------|--------|-------------|
|
|
| `status` | string | `"active_transcription"` during normal operation. `"no_audio_detected"` when no speech has been detected yet. |
|
|
| `lines` | array | Committed transcription segments. Each update sends the **full list** of all committed lines (not incremental). |
|
|
| `buffer_transcription` | string | Ephemeral transcription text not yet committed to a line. Displayed in real time but overwritten on every update. |
|
|
| `buffer_diarization` | string | Ephemeral text waiting for speaker attribution. |
|
|
| `buffer_translation` | string | Ephemeral translation text for the current buffer. |
|
|
| `remaining_time_transcription` | float | Seconds of audio waiting to be transcribed (processing lag). |
|
|
| `remaining_time_diarization` | float | Seconds of audio waiting for speaker diarization. |
|
|
| `error` | string | Only present when an error occurred (e.g. FFmpeg failure). |
|
|
|
|
#### Line Object
|
|
|
|
Each element in `lines` has the following shape:
|
|
|
|
| Field | Type | Presence | Description |
|
|
|---------------------|--------|-------------|-------------|
|
|
| `speaker` | int | Always | Speaker ID. Normally `1`, `2`, `3`, etc. The special value `-2` indicates a silence segment. When diarization is disabled, defaults to `1`. |
|
|
| `text` | string | Always | The transcribed text for this segment. `null` for silence segments. |
|
|
| `start` | string | Always | Start timestamp formatted as `H:MM:SS` (e.g. `"0:00:03"`). |
|
|
| `end` | string | Always | End timestamp formatted as `H:MM:SS`. |
|
|
| `translation` | string | Conditional | Present only when translation is enabled and available for this line. |
|
|
| `detected_language` | string | Conditional | Present only when language detection produced a result for this line (e.g. `"en"`). |
|
|
|
|
### Snapshot (diff mode)
|
|
|
|
When `mode=diff`, the first transcription message is always a snapshot containing the full state. It has the same fields as a full-mode transcription update, plus metadata fields.
|
|
|
|
```json
|
|
{
|
|
"type": "snapshot",
|
|
"seq": 1,
|
|
"status": "active_transcription",
|
|
"lines": [ ... ],
|
|
"buffer_transcription": "",
|
|
"buffer_diarization": "",
|
|
"buffer_translation": "",
|
|
"remaining_time_transcription": 0.0,
|
|
"remaining_time_diarization": 0.0
|
|
}
|
|
```
|
|
|
|
| Field | Type | Description |
|
|
|--------|--------|-------------|
|
|
| `type` | string | `"snapshot"`. |
|
|
| `seq` | int | Monotonically increasing sequence number, starting at 1. |
|
|
| _(remaining fields)_ | | Same as a full-mode transcription update. |
|
|
|
|
### Diff (diff mode)
|
|
|
|
All messages after the initial snapshot are diffs.
|
|
|
|
```json
|
|
{
|
|
"type": "diff",
|
|
"seq": 4,
|
|
"status": "active_transcription",
|
|
"n_lines": 5,
|
|
"lines_pruned": 1,
|
|
"new_lines": [
|
|
{
|
|
"speaker": 1,
|
|
"text": "This is a new line.",
|
|
"start": "0:00:12",
|
|
"end": "0:00:14"
|
|
}
|
|
],
|
|
"buffer_transcription": "partial text",
|
|
"buffer_diarization": "",
|
|
"buffer_translation": "",
|
|
"remaining_time_transcription": 0.3,
|
|
"remaining_time_diarization": 0.1
|
|
}
|
|
```
|
|
|
|
| Field | Type | Presence | Description |
|
|
|--------------------------------|--------|-------------|-------------|
|
|
| `type` | string | Always | `"diff"`. |
|
|
| `seq` | int | Always | Sequence number. |
|
|
| `status` | string | Always | Same as full mode. |
|
|
| `n_lines` | int | Always | Total number of lines the client should have after applying this diff. Use this to verify sync. |
|
|
| `lines_pruned` | int | Conditional | Number of lines to remove from the **front** of the client's line list. Only present when > 0. |
|
|
| `new_lines` | array | Conditional | Lines to append to the **end** of the client's line list. Only present when there are new lines. |
|
|
| `buffer_transcription` | string | Always | Replaces the previous buffer value. |
|
|
| `buffer_diarization` | string | Always | Replaces the previous buffer value. |
|
|
| `buffer_translation` | string | Always | Replaces the previous buffer value. |
|
|
| `remaining_time_transcription` | float | Always | Replaces the previous value. |
|
|
| `remaining_time_diarization` | float | Always | Replaces the previous value. |
|
|
| `error` | string | Conditional | Only present on error. |
|
|
|
|
### Ready to Stop
|
|
|
|
Sent after all audio has been processed (i.e., after the client sent the end-of-audio signal and the server finished processing the remaining audio).
|
|
|
|
```json
|
|
{
|
|
"type": "ready_to_stop"
|
|
}
|
|
```
|
|
|
|
---
|
|
|
|
## Client to Server Messages
|
|
|
|
### Audio Frames
|
|
|
|
Send binary WebSocket frames containing audio data.
|
|
|
|
**When `useAudioWorklet` is `true` (server started with `--pcm-input`):**
|
|
- PCM signed 16-bit little-endian, 16 kHz, mono (`s16le`).
|
|
- Any chunk size works. A typical chunk is 0.5 seconds (16,000 bytes).
|
|
|
|
**When `useAudioWorklet` is `false`:**
|
|
- Raw encoded audio bytes (any format FFmpeg can decode: WAV, MP3, FLAC, OGG, etc.).
|
|
- The server pipes these bytes through FFmpeg for decoding.
|
|
|
|
### End-of-Audio Signal
|
|
|
|
Send an empty binary frame (`b""`) to tell the server that no more audio will follow. The server will finish processing any remaining audio and then send a `ready_to_stop` message.
|
|
|
|
---
|
|
|
|
## Diff Protocol: Client Reconstruction
|
|
|
|
Clients using `mode=diff` must maintain a local list of lines and apply diffs incrementally.
|
|
|
|
### Algorithm
|
|
|
|
```python
|
|
def reconstruct_state(msg, lines):
|
|
"""Apply a snapshot or diff message to a local lines list.
|
|
|
|
Args:
|
|
msg: The parsed JSON message from the server.
|
|
lines: The client's mutable list of line objects.
|
|
|
|
Returns:
|
|
A full-state dict with all fields.
|
|
"""
|
|
if msg["type"] == "snapshot":
|
|
lines.clear()
|
|
lines.extend(msg.get("lines", []))
|
|
return msg
|
|
|
|
# Apply diff
|
|
n_pruned = msg.get("lines_pruned", 0)
|
|
if n_pruned > 0:
|
|
del lines[:n_pruned]
|
|
|
|
new_lines = msg.get("new_lines", [])
|
|
lines.extend(new_lines)
|
|
|
|
# Volatile fields are replaced wholesale
|
|
return {
|
|
"status": msg.get("status", ""),
|
|
"lines": lines[:],
|
|
"buffer_transcription": msg.get("buffer_transcription", ""),
|
|
"buffer_diarization": msg.get("buffer_diarization", ""),
|
|
"buffer_translation": msg.get("buffer_translation", ""),
|
|
"remaining_time_transcription": msg.get("remaining_time_transcription", 0),
|
|
"remaining_time_diarization": msg.get("remaining_time_diarization", 0),
|
|
}
|
|
```
|
|
|
|
### Verification
|
|
|
|
After applying a diff, check that `len(lines) == msg["n_lines"]`. A mismatch indicates the client fell out of sync and should reconnect.
|
|
|
|
---
|
|
|
|
## Silence Representation
|
|
|
|
Silence segments are represented as lines with `speaker` set to `-2` and `text` set to `null`:
|
|
|
|
```json
|
|
{
|
|
"speaker": -2,
|
|
"text": null,
|
|
"start": "0:00:10",
|
|
"end": "0:00:12"
|
|
}
|
|
```
|
|
|
|
Silence segments are only generated for pauses longer than 5 seconds.
|
|
|
|
---
|
|
|
|
## Per-Session Language
|
|
|
|
The `language` query parameter creates an isolated language context for the session using `SessionASRProxy`. The proxy temporarily overrides the shared ASR backend's language during transcription calls, protected by a lock. This means:
|
|
|
|
- Each WebSocket session can transcribe in a different language.
|
|
- Sessions are thread-safe and do not interfere with each other.
|
|
- Pass `"auto"` to use automatic language detection for the session regardless of the server-wide setting.
|