WhisperLiveKit WebSocket API Documentation

WhisperLiveKit (WLK) provides real-time speech transcription, speaker diarization, and translation through a WebSocket API. The server pushes updates as audio is processed, so clients can display live transcription results with minimal latency.


Endpoints

| Endpoint | Description |
|----------|-------------|
| / | Main web interface with visual styling |
| /text | Simple text-based interface for easy copy/paste (debug/development) |
| /asr | WebSocket endpoint for audio streaming |

Message Format

Transcript Update (Server → Client)

{
  "type": "transcript_update",
  "status": "active_transcription" | "no_audio_detected",
  "segments": [
    {
      "id": number,
      "speaker": number,
      "text": string,
      "start_speaker": string,    // H:MM:SS format, e.g. "0:00:02"
      "start": string,            // H:MM:SS format
      "end": string,              // H:MM:SS format
      "language": string | null,
      "translation": string,
      "buffer": {
        "transcription": string,
        "diarization": string,
        "translation": string
      }
    }
  ],
  "metadata": {
    "remaining_time_transcription": float,
    "remaining_time_diarization": float
  }
}
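
On the client side, the schema above could be mirrored with typed dictionaries. The following is a minimal sketch, not part of WLK itself: the class names and `parse_transcript_update` are hypothetical, and only the field names come from the schema above.

```python
import json
from typing import List, Optional, TypedDict


class Buffer(TypedDict):
    transcription: str
    diarization: str
    translation: str


class Segment(TypedDict):
    id: int
    speaker: int
    text: str
    start_speaker: str
    start: str
    end: str
    language: Optional[str]
    translation: str
    buffer: Buffer


class Metadata(TypedDict):
    remaining_time_transcription: float
    remaining_time_diarization: float


class TranscriptUpdate(TypedDict):
    type: str
    status: str
    segments: List[Segment]
    metadata: Metadata


def parse_transcript_update(raw: str) -> TranscriptUpdate:
    """Decode one server message and check that it is a transcript update.

    Hypothetical helper: it only checks the "type" field and does not
    validate the individual segment fields.
    """
    message = json.loads(raw)
    if message.get("type") != "transcript_update":
        raise ValueError(f"unexpected message type: {message.get('type')!r}")
    return message
```

The typed dictionaries add no runtime checks; they only document the expected shape for type checkers and editors.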

Other Message Types

Config Message (sent on connection)

{
  "type": "config",
  "useAudioWorklet": true
}
  • useAudioWorklet: If true, the client should use AudioWorklet to stream PCM. If false, it should use MediaRecorder to stream WebM.

Ready to Stop Message (sent after processing complete)

{
  "type": "ready_to_stop"
}

Indicates all audio has been processed and the client can safely close the connection.
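
A client typically needs to branch on the `type` field of each incoming message. The sketch below shows one way to do that for the three message types described above. `handle_message` and the keys stored in `state` are hypothetical names chosen for illustration.

```python
import json


def handle_message(raw: str, state: dict) -> str:
    """Update client state from one server message and return its type.

    Hypothetical dispatcher: `state` is a plain dict owned by the client.
    """
    message = json.loads(raw)
    kind = message.get("type")
    if kind == "config":
        # Remember which capture path the server asked for.
        state["use_audio_worklet"] = bool(message.get("useAudioWorklet", False))
    elif kind == "transcript_update":
        # Each update carries the full segment list, so replace the old one.
        state["segments"] = message.get("segments", [])
    elif kind == "ready_to_stop":
        # All audio has been processed; the connection can be closed.
        state["can_close"] = True
    else:
        return "unknown"
    return kind
```

Returning the message type lets the caller decide when to close the socket, for example once `"ready_to_stop"` is seen.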


Field Descriptions

Segment Fields

| Field | Type | Description |
|-------|------|-------------|
| id | number | Unique identifier for this segment. |
| speaker | number | Speaker ID (1, 2, 3...). Special value -2 indicates silence. |
| text | string | Validated transcription text. |
| start_speaker | string | Timestamp (H:MM:SS, e.g. "0:00:02") when this speaker segment began. |
| start | string | Timestamp (H:MM:SS) of the first word. |
| end | string | Timestamp (H:MM:SS) of the last word. |
| language | string \| null | ISO language code (e.g., "en", "fr"). null until detected. |
| translation | string | Validated translation text. |
| buffer | Object | Per-segment temporary buffers (see below). |

Buffer Object (Per-Segment)

Buffers are ephemeral: each update overwrites them, so clients should display their content as provisional text rather than accumulate it. Only the last non-silent segment contains buffer content.

| Field | Type | Description |
|-------|------|-------------|
| transcription | string | Text pending validation (waiting for more context). |
| diarization | string | Text pending speaker assignment (diarization hasn't caught up). |
| translation | string | Translation pending validation. |
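
One way to present a segment is to keep validated text and buffered text apart so the UI can style the provisional part differently. The sketch below assumes that the diarization buffer precedes the transcription buffer in reading order. That ordering is an assumption, not something this document specifies, and `render_segment` is a hypothetical helper.

```python
from typing import Tuple


def render_segment(segment: dict) -> Tuple[str, str]:
    """Return (committed, pending) text for one segment.

    Assumption: text awaiting speaker assignment comes before text
    awaiting validation, so the two buffers are joined in that order.
    """
    committed = segment.get("text", "")
    buffer = segment.get("buffer") or {}
    pending = buffer.get("diarization", "") + buffer.get("transcription", "")
    return committed, pending
```

Because buffers are overwritten on every update, the pending part should be recomputed from each new message rather than appended to earlier output.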

Metadata Fields

| Field | Type | Description |
|-------|------|-------------|
| remaining_time_transcription | float | Seconds of audio waiting for transcription. |
| remaining_time_diarization | float | Seconds of audio waiting for diarization. |

Status Values

| Status | Description |
|--------|-------------|
| active_transcription | Normal operation, transcription is active. |
| no_audio_detected | No audio/speech has been detected yet. |

Behavior Notes

Silence Handling

  • Short silences (< 2 seconds) are filtered out and not displayed.
  • Only significant pauses appear as silence segments with speaker: -2.
  • Consecutive same-speaker segments are merged even across short silences.
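
Since silence arrives as an ordinary segment with `speaker: -2`, a client has to special-case that value when labelling output. A minimal sketch follows; `describe_segments` and the output wording are hypothetical.

```python
from typing import Dict, List

SILENCE_SPEAKER = -2  # special speaker value used for silence segments


def describe_segments(segments: List[Dict]) -> List[str]:
    """Turn segments into display lines, labelling silence separately."""
    lines = []
    for segment in segments:
        span = f"{segment['start']} - {segment['end']}"
        if segment["speaker"] == SILENCE_SPEAKER:
            lines.append(f"(silence) {span}")
        else:
            lines.append(f"Speaker {segment['speaker']} {span}: {segment['text']}")
    return lines
```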

Update Frequency

  • Active transcription: ~20 updates/second (every 50 ms)
  • During silence: ~2 updates/second (every 500 ms) to reduce bandwidth

Token-by-Token Validation (Diarization Mode)

When diarization is enabled, text is validated token by token as soon as diarization covers each token, rather than waiting for punctuation. As a result:

  • Text is validated faster.
  • Speaker attribution is more responsive.
  • The buffer contains only tokens that diarization has not processed yet.
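
The idea can be illustrated with a small sketch: tokens whose end time falls within the audio that diarization has already covered are validated, and the rest stay in the buffer. This is not the server's actual code. `Token`, `split_validated`, and `diarization_covered_until` are hypothetical names, and the rule of stopping at the first uncovered token is an assumption made to keep the text in order.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    text: str
    end: float  # end time of the token, in seconds


def split_validated(
    tokens: List[Token], diarization_covered_until: float
) -> Tuple[str, str]:
    """Split tokens into (validated_text, buffered_text).

    Tokens are assumed to be in chronological order. Everything from the
    first token that diarization has not covered yet stays in the buffer.
    """
    cut = len(tokens)
    for index, token in enumerate(tokens):
        if token.end > diarization_covered_until:
            cut = index
            break
    validated = "".join(token.text for token in tokens[:cut])
    buffered = "".join(token.text for token in tokens[cut:])
    return validated, buffered
```

For example, with tokens ending at 0.5 s, 1.0 s, and 1.6 s and diarization covering the first 1.2 s, the first two tokens are validated and the third remains buffered.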

Example Messages

Normal Transcription

{
  "type": "transcript_update",
  "status": "active_transcription",
  "segments": [
    {
      "id": 1,
      "speaker": 1,
      "text": "Hello, how are you today?",
      "start_speaker": "0:00:02",
      "start": "0:00:02",
      "end": "0:00:05",
      "language": "en",
      "translation": "",
      "buffer": {
        "transcription": " I'm doing",
        "diarization": "",
        "translation": ""
      }
    }
  ],
  "metadata": {
    "remaining_time_transcription": 0.5,
    "remaining_time_diarization": 0
  }
}

With Diarization Buffer

{
  "type": "transcript_update",
  "status": "active_transcription",
  "segments": [
    {
      "id": 1,
      "speaker": 1,
      "text": "The meeting starts at nine.",
      "start_speaker": "0:00:03",
      "start": "0:00:03",
      "end": "0:00:06",
      "language": "en",
      "translation": "",
      "buffer": {
        "transcription": "",
        "diarization": " Let me check my calendar",
        "translation": ""
      }
    }
  ],
  "metadata": {
    "remaining_time_transcription": 0.3,
    "remaining_time_diarization": 2.1
  }
}

Silence Segment

{
  "id": 5,
  "speaker": -2,
  "text": "",
  "start_speaker": "0:00:10",
  "start": "0:00:10",
  "end": "0:00:15",
  "language": null,
  "translation": "",
  "buffer": {
    "transcription": "",
    "diarization": "",
    "translation": ""
  }
}
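
Timestamps in the examples above are strings such as "0:00:10". To compute durations, a client can convert them to seconds. The helpers below are a hypothetical sketch that assumes three colon-separated integer fields (hours, minutes, seconds), as in those examples.

```python
def timestamp_to_seconds(stamp: str) -> int:
    """Convert an "H:MM:SS" string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in stamp.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def segment_duration(segment: dict) -> int:
    """Return the length of a segment in seconds, from its start and end."""
    return timestamp_to_seconds(segment["end"]) - timestamp_to_seconds(segment["start"])
```

Applied to the silence segment above (start "0:00:10", end "0:00:15"), `segment_duration` returns 5.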

Text Transcript Endpoint (/text)

The /text endpoint provides a simple, monospace text interface designed for:

  • Easy copy/paste of transcripts
  • Debugging and development
  • Integration testing

Output uses text markers instead of HTML styling:

[METADATA transcription_lag=0.5s diarization_lag=1.2s]

[SPEAKER 1] 0:00:03 - 0:00:11 [LANG: en]
Hello world, how are you doing today?[DIAR_BUFFER] I'm doing fine[/DIAR_BUFFER]

[SILENCE 0:00:15 - 0:00:18]

[SPEAKER 2] 0:00:18 - 0:00:22 [LANG: en]
That's great to hear!
[TRANSLATION]C'est super à entendre![/TRANSLATION]

Markers

| Marker | Description |
|--------|-------------|
| [SPEAKER N] | Speaker label with ID |
| [SILENCE start - end] | Silence segment |
| [LANG: xx] | Detected language code |
| [DIAR_BUFFER]...[/DIAR_BUFFER] | Text pending speaker assignment |
| [TRANS_BUFFER]...[/TRANS_BUFFER] | Text pending validation |
| [TRANSLATION]...[/TRANSLATION] | Translation content |
| [METADATA ...] | Lag/timing information |
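
For integration tests, the markers can be picked apart with regular expressions. The sketch below is based only on the sample output shown above. The exact spacing, and treating `[LANG: xx]` as optional, are assumptions, and both function names are hypothetical.

```python
import re
from typing import Optional

# Matches lines such as "[SPEAKER 1] 0:00:03 - 0:00:11 [LANG: en]".
SPEAKER_HEADER = re.compile(
    r"^\[SPEAKER (?P<speaker>\d+)\] (?P<start>\S+) - (?P<end>\S+)"
    r"(?: \[LANG: (?P<lang>[^\]]+)\])?$"
)

# Matches a buffer marker pair together with the provisional text inside it.
BUFFER_TAGS = re.compile(r"\[(DIAR_BUFFER|TRANS_BUFFER)\].*?\[/\1\]")


def parse_speaker_header(line: str) -> Optional[dict]:
    """Parse a [SPEAKER N] header line, or return None if it is not one."""
    match = SPEAKER_HEADER.match(line.strip())
    if match is None:
        return None
    return {
        "speaker": int(match["speaker"]),
        "start": match["start"],
        "end": match["end"],
        "language": match["lang"],
    }


def strip_buffers(line: str) -> str:
    """Remove buffered (not yet validated) text, keeping validated text."""
    return BUFFER_TAGS.sub("", line)
```

Stripping buffers before comparing transcripts keeps test assertions stable, since buffered text can change between updates.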