Files
WhisperLiveKit/docs/API.md
Quentin Fuxa 971f8473eb update api doc
2025-10-05 11:09:47 +02:00

7.1 KiB

WhisperLiveKit WebSocket API Documentation

!! Note: The new API structure described in this document is currently under deployment. This documentation is intended for devs who want to build custom frontends.

WLK provides real-time speech transcription, speaker diarization, and translation through a WebSocket API. The server sends incremental updates as audio is processed, allowing clients to display live transcription results with minimal latency.


Legacy API (Current)

Message Structure

The current API sends complete state snapshots on each update (several time per second)

{
  "type": str,
  "status": str,
  "lines": [
    {
      "speaker": int,
      "text": str,
      "start": float,
      "end": float,
      "translation": str | null,
      "detected_language": str
    }
  ],
  "buffer_transcription": str,
  "buffer_diarization": str,
  "remaining_time_transcription": float,
  "remaining_time_diarization": float
}

New API (Under Development)

Philosophy

Principles:

  • Incremental Updates: Only updates and new segments are sent
  • Ephemeral Buffers: Temporary, unvalidated data displayed in real-time but overwritten on next update, at speaker level

Message Format

{
  "type": "transcript_update",
  "status": "active_transcription" | "no_audio_detected",
  "segments": [
    {
      "id": number,
      "speaker": number,
      "text": string,
      "start_speaker": float,
      "start": float,
      "end": float,
      "language": string | null,
      "translation": string,
      "words": [
        {
          "text": string,
          "start": float,
          "end": float,
          "validated": {
            "text": boolean,
            "speaker": boolean,
          }
        }
      ],
      "buffer": {
        "transcription": string,
        "diarization": string,
        "translation": string
      }
    }
  ],
  "metadata": {
    "remaining_time_transcription": float,
    "remaining_time_diarization": float
  }
}

Other Message Types

Config Message (sent on connection)

{
  "type": "config",
  "useAudioWorklet": true / false
}

Ready to Stop Message (sent after processing complete)

{
  "type": "ready_to_stop"
}

Field Descriptions

Segment Fields

Field Type Description
id number Unique identifier for this segment. Used by clients to update specific segments efficiently.
speaker number Speaker ID (1, 2, 3...). Special value -2 indicates silence.
text string Validated transcription text for this update. Should be appended to the segment's text on the client side.
start_speaker float Timestamp (seconds) when this speaker segment began.
start float Timestamp (seconds) of the first word in this update.
end float Timestamp (seconds) of the last word in this update.
language string | null ISO language code (e.g., "en", "fr"). null until language is detected.
translation string Validated translation text for this update. Should be appended to the segment's translation on the client side.
words Array Array of word-level objects with timing and validation information.
buffer Object Per-segment temporary buffers, see below

Word Object

Field Type Description
text string The word text.
start number Start timestamp (seconds) of this word.
end number End timestamp (seconds) of this word.
validated.text boolean Whether the transcription text has been validated. if false, word is also in buffer: transcription
validated.speaker boolean Whether the speaker assignment has been validated. if false, word is also in buffer: diarization
validated.language boolean Whether the language detection has been validated. if false, word is also in buffer: translation

Buffer Object (Per-Segment)

Buffers are ephemeral. They should be displayed to the user but not stored permanently in the frontend. Each update may contain a completely different buffer value, and previous buffer is likely to be in the next validated text.

Field Type Description
transcription string Pending transcription text. Displayed immediately but overwritten on next update.
diarization string Pending diarization text (text waiting for speaker assignment). Displayed immediately but overwritten on next update.
translation string Pending translation text. Displayed immediately but overwritten on next update.

Metadata Fields

Field Type Description
remaining_time_transcription float Seconds of audio waiting for transcription processing.
remaining_time_diarization float Seconds of audio waiting for speaker diarization.

Status Values

Status Description
active_transcription Normal operation, transcription is active.
no_audio_detected No audio has been detected yet.

Update Behavior

Incremental Updates

The API sends only changed or new segments. Clients should:

  1. Maintain a local map of segments by ID
  2. When receiving an update, merge/update segments by ID
  3. Render only the changed segments

Language Detection

When language is detected for a segment:

// Update 1: No language yet
{
  "segments": [
    {"id": 1, "speaker": 1, "text": "May see", "language": null}
  ]
}

// Update 2: Same segment ID, language now detected
{
  "segments": [
    {"id": 1, "speaker": 1, "text": "Merci", "language": "fr"}
  ]
}

Client behavior: Replace the existing segment with the same ID.

Buffer Behavior

Buffers are per-segment to handle multi-speaker scenarios correctly.

Example: Translation with diarization and translation

// Update 1
{
  "segments": [
    {
      "id": 1,
      "speaker": 1,
      "text": "Hello world, how are",
      "translation": "",
      "buffer": {
        "transcription": "",
        "diarization": " you on",
        "translation": "Bonjour le monde"
      }
    }
  ]
}


// ==== Frontend ====
// <SPEAKER>1</SPEAKER>
// <TRANSCRIPTION>Hello world, how are <DIARIZATION BUFFER> you on</DIARIZATION BUFFER></TRANSCRIPTION>
// <TRANSLATION><TRANSLATION BUFFER>Bonjour le monde</TRANSLATION BUFFER></TRANSLATION>


// Update 2
{
  "segments": [
    {
      "id": 1,
      "speaker": 1,
      "text": " you on this",
      "translation": "Bonjour tout le monde",
      "buffer": {
        "transcription": "",
        "diarization": " beautiful day",
        "translation": ",comment"
      }
    },
  ]
}


// ==== Frontend ====
// <SPEAKER>1</SPEAKER>
// <TRANSCRIPTION>Hello world, how are you on this<DIARIZATION BUFFER>  beautiful day</DIARIZATION BUFFER></TRANSCRIPTION>
// <TRANSLATION>Bonjour tout le monde<TRANSLATION BUFFER>, comment</TRANSLATION BUFFER><TRANSLATION>

Silence Segments

Silence is represented with the speaker id = -2:

{
  "id": 5,
  "speaker": -2,
  "text": "",
  "start": 10.5,
  "end": 12.3
}