# WhisperLiveKit WebSocket API Documentation

WhisperLiveKit (WLK) provides real-time speech transcription, speaker diarization, and translation through a WebSocket API. The server sends updates as audio is processed, allowing clients to display live transcription results with minimal latency.
## Endpoints

| Endpoint | Description |
|---|---|
| `/` | Main web interface with visual styling |
| `/text` | Simple text-based interface for easy copy/paste (debug/development) |
| `/asr` | WebSocket endpoint for audio streaming |
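A minimal sketch of opening the `/asr` socket from a browser client. The host, the ws/wss choice, and the handler body are assumptions to adapt to your deployment:

```javascript
// Build the WebSocket URL for the /asr endpoint.
// "host" (e.g. "localhost:8000") is an assumption about your deployment.
function buildAsrUrl(host, secure = false) {
  return `${secure ? "wss" : "ws"}://${host}/asr`;
}

// Browser-only sketch: open the socket and log incoming message types.
function connect(host) {
  const ws = new WebSocket(buildAsrUrl(host, location.protocol === "https:"));
  ws.onmessage = (event) => {
    const msg = JSON.parse(event.data); // server messages are JSON
    console.log(msg.type);
  };
  return ws;
}
```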
## Message Format

### Transcript Update (Server → Client)

```json
{
  "type": "transcript_update",
  "status": "active_transcription" | "no_audio_detected",
  "segments": [
    {
      "id": number,
      "speaker": number,
      "text": string,
      "start_speaker": string,  // HH:MM:SS format
      "start": string,          // HH:MM:SS format
      "end": string,            // HH:MM:SS format
      "language": string | null,
      "translation": string,
      "buffer": {
        "transcription": string,
        "diarization": string,
        "translation": string
      }
    }
  ],
  "metadata": {
    "remaining_time_transcription": float,
    "remaining_time_diarization": float
  }
}
```
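A client-side sketch of turning a `transcript_update` payload into display lines. The field names follow the schema above; the output format and function name are illustrative, not part of the WLK API:

```javascript
// Render each segment as one display line: validated text first,
// then any pending buffer text (tentative, overwritten on each update).
function renderSegments(update) {
  return update.segments.map((seg) => {
    if (seg.speaker === -2) {
      // Special speaker value -2 marks a silence segment.
      return `[silence ${seg.start} - ${seg.end}]`;
    }
    const pending = seg.buffer
      ? seg.buffer.transcription + seg.buffer.diarization
      : "";
    return `Speaker ${seg.speaker} (${seg.start}-${seg.end}): ${seg.text}${pending}`;
  });
}
```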
## Other Message Types

### Config Message (sent on connection)

```json
{
  "type": "config",
  "useAudioWorklet": true
}
```

`useAudioWorklet`: If `true`, the client should use AudioWorklet for PCM streaming. If `false`, use MediaRecorder for WebM.
### Ready to Stop Message (sent after processing completes)

```json
{
  "type": "ready_to_stop"
}
```

Indicates all audio has been processed and the client can safely close the connection.
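The three message types above can be dispatched in one place. A sketch, where the handler names are illustrative, not part of the WLK API:

```javascript
// Parse a raw server message and route it by "type".
function handleMessage(raw, handlers) {
  const msg = JSON.parse(raw);
  switch (msg.type) {
    case "config":
      // true → raw PCM via AudioWorklet; false → WebM via MediaRecorder
      return handlers.onConfig(msg.useAudioWorklet);
    case "transcript_update":
      return handlers.onTranscript(msg);
    case "ready_to_stop":
      // All audio processed; safe to close the socket now.
      return handlers.onReadyToStop();
    default:
      throw new Error(`unknown message type: ${msg.type}`);
  }
}
```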
## Field Descriptions

### Segment Fields

| Field | Type | Description |
|---|---|---|
| `id` | number | Unique identifier for this segment. |
| `speaker` | number | Speaker ID (1, 2, 3...). The special value -2 indicates silence. |
| `text` | string | Validated transcription text. |
| `start_speaker` | string | Timestamp (HH:MM:SS) when this speaker segment began. |
| `start` | string | Timestamp (HH:MM:SS) of the first word. |
| `end` | string | Timestamp (HH:MM:SS) of the last word. |
| `language` | string \| null | ISO language code (e.g., "en", "fr"); null until detected. |
| `translation` | string | Validated translation text. |
| `buffer` | object | Per-segment temporary buffers (see below). |
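Since the timestamp fields arrive as "HH:MM:SS" strings, a small client-side parser is handy; this sketch assumes no fractional seconds, matching the examples in this document:

```javascript
// "H:MM:SS" → total seconds.
function timestampToSeconds(ts) {
  const [h, m, s] = ts.split(":").map(Number);
  return h * 3600 + m * 60 + s;
}

// Duration of a segment, in seconds, from its start/end fields.
function segmentDuration(seg) {
  return timestampToSeconds(seg.end) - timestampToSeconds(seg.start);
}
```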
### Buffer Object (Per-Segment)

Buffers are ephemeral: they should be displayed to the user but are overwritten on each update. Only the last non-silent segment contains buffer content.

| Field | Type | Description |
|---|---|---|
| `transcription` | string | Text pending validation (waiting for more context). |
| `diarization` | string | Text pending speaker assignment (diarization hasn't caught up). |
| `translation` | string | Translation pending validation. |
### Metadata Fields

| Field | Type | Description |
|---|---|---|
| `remaining_time_transcription` | float | Seconds of audio waiting for transcription. |
| `remaining_time_diarization` | float | Seconds of audio waiting for diarization. |
### Status Values

| Status | Description |
|---|---|
| `active_transcription` | Normal operation; transcription is active. |
| `no_audio_detected` | No audio/speech has been detected yet. |
## Behavior Notes

### Silence Handling

- Short silences (< 2 seconds) are filtered out and not displayed.
- Only significant pauses appear as silence segments with `speaker: -2`.
- Consecutive same-speaker segments are merged even across short silences.
### Update Frequency

- Active transcription: ~20 updates/second (every 50 ms)
- During silence: ~2 updates/second (every 500 ms) to reduce bandwidth
### Token-by-Token Validation (Diarization Mode)

When diarization is enabled, text is validated token by token as soon as diarization covers each token, rather than waiting for punctuation. This provides:

- Faster text validation
- More responsive speaker attribution
- A buffer that contains only the tokens diarization hasn't processed yet
## Example Messages

### Normal Transcription
```json
{
  "type": "transcript_update",
  "status": "active_transcription",
  "segments": [
    {
      "id": 1,
      "speaker": 1,
      "text": "Hello, how are you today?",
      "start_speaker": "0:00:02",
      "start": "0:00:02",
      "end": "0:00:05",
      "language": "en",
      "translation": "",
      "buffer": {
        "transcription": " I'm doing",
        "diarization": "",
        "translation": ""
      }
    }
  ],
  "metadata": {
    "remaining_time_transcription": 0.5,
    "remaining_time_diarization": 0
  }
}
```
### With Diarization Buffer
```json
{
  "type": "transcript_update",
  "status": "active_transcription",
  "segments": [
    {
      "id": 1,
      "speaker": 1,
      "text": "The meeting starts at nine.",
      "start_speaker": "0:00:03",
      "start": "0:00:03",
      "end": "0:00:06",
      "language": "en",
      "translation": "",
      "buffer": {
        "transcription": "",
        "diarization": " Let me check my calendar",
        "translation": ""
      }
    }
  ],
  "metadata": {
    "remaining_time_transcription": 0.3,
    "remaining_time_diarization": 2.1
  }
}
```
### Silence Segment
```json
{
  "id": 5,
  "speaker": -2,
  "text": "",
  "start_speaker": "0:00:10",
  "start": "0:00:10",
  "end": "0:00:15",
  "language": null,
  "translation": "",
  "buffer": {
    "transcription": "",
    "diarization": "",
    "translation": ""
  }
}
```
## Text Transcript Endpoint (/text)

The `/text` endpoint provides a simple, monospace text interface designed for:

- Easy copy/paste of transcripts
- Debugging and development
- Integration testing

Output uses text markers instead of HTML styling:

```text
[METADATA transcription_lag=0.5s diarization_lag=1.2s]
[SPEAKER 1] 0:00:03 - 0:00:11 [LANG: en]
Hello world, how are you doing today?[DIAR_BUFFER] I'm doing fine[/DIAR_BUFFER]
[SILENCE 0:00:15 - 0:00:18]
[SPEAKER 2] 0:00:18 - 0:00:22 [LANG: en]
That's great to hear!
[TRANSLATION]C'est super à entendre![/TRANSLATION]
```
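A sketch of a line-oriented parser for this marker format. The regexes are assumptions derived from the sample output above, and the parser keeps only speaker and silence entries, folding buffer markers into plain text:

```javascript
// Marker patterns inferred from the /text sample output.
const SPEAKER_RE = /^\[SPEAKER (\d+)\] (\S+) - (\S+)(?: \[LANG: (\w+)\])?$/;
const SILENCE_RE = /^\[SILENCE (\S+) - (\S+)\]$/;

function parseTextTranscript(text) {
  const out = [];
  for (const line of text.split("\n")) {
    let m;
    if ((m = SPEAKER_RE.exec(line))) {
      out.push({ kind: "speaker", id: Number(m[1]),
                 start: m[2], end: m[3], lang: m[4] ?? null });
    } else if ((m = SILENCE_RE.exec(line))) {
      out.push({ kind: "silence", start: m[1], end: m[2] });
    } else if (out.length && line.trim() && !line.startsWith("[METADATA")) {
      // Transcript line: strip [DIAR_BUFFER]/[TRANSLATION]-style tags,
      // append the remaining text to the current speaker entry.
      const last = out[out.length - 1];
      last.text = ((last.text ?? "") + line.replace(/\[\/?\w+\]/g, "")).trim();
    }
  }
  return out;
}
```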
### Markers

| Marker | Description |
|---|---|
| `[SPEAKER N]` | Speaker label with ID |
| `[SILENCE start - end]` | Silence segment |
| `[LANG: xx]` | Detected language code |
| `[DIAR_BUFFER]...[/DIAR_BUFFER]` | Text pending speaker assignment |
| `[TRANS_BUFFER]...[/TRANS_BUFFER]` | Text pending validation |
| `[TRANSLATION]...[/TRANSLATION]` | Translation content |
| `[METADATA ...]` | Lag/timing information |