mirror of https://github.com/QuentinFuxa/WhisperLiveKit.git synced 2026-03-07 06:14:05 +00:00


Quentin Fuxa 971f8473eb update api doc

2025-10-05 11:09:47 +02:00

7.1 KiB


WhisperLiveKit WebSocket API Documentation

!! Note: The new API structure described in this document is still being deployed. This documentation is intended for developers who want to build custom frontends.

WLK provides real-time speech transcription, speaker diarization, and translation through a WebSocket API. The server sends incremental updates as audio is processed, allowing clients to display live transcription results with minimal latency.

Legacy API (Current)

Message Structure

The current API sends a complete state snapshot on each update (several times per second):

{
  "type": str,
  "status": str,
  "lines": [
    {
      "speaker": int,
      "text": str,
      "start": float,
      "end": float,
      "translation": str | null,
      "detected_language": str
    }
  ],
  "buffer_transcription": str,
  "buffer_diarization": str,
  "remaining_time_transcription": float,
  "remaining_time_diarization": float
}
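Because every legacy message is a complete snapshot, a client can rebuild its display from the latest message alone. The following is a minimal sketch of that idea, not the project's frontend code. The function name `render_legacy_snapshot`, the output layout, and the order in which the two buffers are concatenated (diarization first, then transcription) are assumptions made for illustration.

```python
import json


def render_legacy_snapshot(raw: str) -> str:
    """Build a plain-text view from one legacy snapshot message.

    No client-side state is needed: each message replaces the previous view.
    """
    message = json.loads(raw)
    rendered = [
        f"Speaker {line['speaker']}: {line['text']}"
        for line in message.get("lines", [])
    ]
    # Assumption: diarization buffer is shown before the transcription buffer.
    pending = message.get("buffer_diarization", "") + message.get(
        "buffer_transcription", ""
    )
    if pending:
        rendered.append(f"(pending){pending}")
    return "\n".join(rendered)
```

Calling the function on each incoming message and overwriting the previous output is enough to keep the view current.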

New API (Under Development)

Philosophy

Principles:

Incremental Updates: Only updated and new segments are sent.
Ephemeral Buffers: Temporary, unvalidated data is displayed in real time and overwritten on the next update. Buffers are kept at the speaker level.

Message Format

{
  "type": "transcript_update",
  "status": "active_transcription" | "no_audio_detected",
  "segments": [
    {
      "id": number,
      "speaker": number,
      "text": string,
      "start_speaker": float,
      "start": float,
      "end": float,
      "language": string | null,
      "translation": string,
      "words": [
        {
          "text": string,
          "start": float,
          "end": float,
          "validated": {
            "text": boolean,
            "speaker": boolean
          }
        }
      ],
      "buffer": {
        "transcription": string,
        "diarization": string,
        "translation": string
      }
    }
  ],
  "metadata": {
    "remaining_time_transcription": float,
    "remaining_time_diarization": float
  }
}
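One way to work with this format on the client is to decode each `transcript_update` into typed objects. The sketch below is illustrative only: the `Segment` dataclass, the `parse_transcript_update` function, and the default values used for absent fields are assumptions, chosen because the shorter examples later in this document omit some fields.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Segment:
    id: int
    speaker: int
    text: str = ""
    translation: str = ""
    language: Optional[str] = None
    words: list = field(default_factory=list)
    buffer: dict = field(default_factory=dict)


def parse_transcript_update(message: dict[str, Any]) -> list[Segment]:
    """Convert a decoded transcript_update message into Segment objects."""
    if message.get("type") != "transcript_update":
        raise ValueError("not a transcript_update message")
    segments = []
    for raw in message.get("segments", []):
        segments.append(
            Segment(
                id=raw["id"],
                speaker=raw["speaker"],
                # Assumption: missing optional fields fall back to empty values.
                text=raw.get("text", ""),
                translation=raw.get("translation", ""),
                language=raw.get("language"),
                words=raw.get("words", []),
                buffer=raw.get("buffer", {}),
            )
        )
    return segments
```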

Other Message Types

Config Message (sent on connection)

{
  "type": "config",
  "useAudioWorklet": true | false
}

Ready to Stop Message (sent after processing complete)

{
  "type": "ready_to_stop"
}
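Since every message carries a `type` field, a client can route messages with a small dispatcher. This is a sketch under assumptions: the `dispatch` function and the handler-table shape are hypothetical, and the handlers in the usage lines only record what they received.

```python
import json
from typing import Callable

Handler = Callable[[dict], None]


def dispatch(raw: str, handlers: dict[str, Handler]) -> bool:
    """Decode one server message and call the handler registered for its type.

    Returns False when no handler is registered for that type.
    """
    message = json.loads(raw)
    handler = handlers.get(message.get("type"))
    if handler is None:
        return False
    handler(message)
    return True


# Usage: record which message types were seen.
seen = []
handlers = {
    "config": lambda m: seen.append(("config", m["useAudioWorklet"])),
    "transcript_update": lambda m: seen.append(("update", len(m["segments"]))),
    "ready_to_stop": lambda m: seen.append(("stop", None)),
}
dispatch('{"type": "config", "useAudioWorklet": true}', handlers)
dispatch('{"type": "ready_to_stop"}', handlers)
```

Returning a boolean instead of raising lets a client ignore message types it does not know about yet, which may be useful while the API is still changing.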

Field Descriptions

Segment Fields

Field	Type	Description
`id`	`number`	Unique identifier for this segment. Used by clients to update specific segments efficiently.
`speaker`	`number`	Speaker ID (1, 2, 3...). Special value `-2` indicates silence.
`text`	`string`	Validated transcription text for this update. Should be appended to the segment's text on the client side.
`start_speaker`	`float`	Timestamp (seconds) when this speaker segment began.
`start`	`float`	Timestamp (seconds) of the first word in this update.
`end`	`float`	Timestamp (seconds) of the last word in this update.
`language`	`string | null`	ISO language code (e.g., "en", "fr"). `null` until language is detected.
`translation`	`string`	Validated translation text for this update. Should be appended to the segment's translation on the client side.
`words`	`Array`	Array of word-level objects with timing and validation information.
`buffer`	`Object`	Per-segment temporary buffers, described under Buffer Object below.

Word Object

Field	Type	Description
`text`	`string`	The word text.
`start`	`float`	Start timestamp (seconds) of this word.
`end`	`float`	End timestamp (seconds) of this word.
`validated.text`	`boolean`	Whether the transcription text has been validated. If false, the word also appears in `buffer.transcription`.
`validated.speaker`	`boolean`	Whether the speaker assignment has been validated. If false, the word also appears in `buffer.diarization`.
`validated.language`	`boolean`	Whether the language detection has been validated. If false, the word also appears in `buffer.translation`.
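The validation flags let a client style confirmed and pending words differently. The sketch below splits a `words` array accordingly. The function name `split_words` is hypothetical, and treating a word with no `validated` object as pending is an assumption, not documented behavior.

```python
def split_words(words: list[dict]) -> tuple[list[str], list[str]]:
    """Return (validated, pending) word texts.

    A word counts as validated only when every flag present in its
    `validated` object is true.
    """
    validated, pending = [], []
    for word in words:
        flags = word.get("validated", {})
        # Assumption: a word without any flags is treated as pending.
        if flags and all(flags.values()):
            validated.append(word["text"])
        else:
            pending.append(word["text"])
    return validated, pending
```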

Buffer Object (Per-Segment)

Buffers are ephemeral. Display them to the user, but do not store them permanently in the frontend. Each update may carry a completely different buffer value, and the content of the previous buffer will likely reappear as validated text in the next update.

Field	Type	Description
`transcription`	`string`	Pending transcription text. Displayed immediately but overwritten on next update.
`diarization`	`string`	Pending diarization text (text waiting for speaker assignment). Displayed immediately but overwritten on next update.
`translation`	`string`	Pending translation text. Displayed immediately but overwritten on next update.

Metadata Fields

Field	Type	Description
`remaining_time_transcription`	`float`	Seconds of audio waiting for transcription processing.
`remaining_time_diarization`	`float`	Seconds of audio waiting for speaker diarization.

Status Values

Status	Description
`active_transcription`	Normal operation, transcription is active.
`no_audio_detected`	No audio has been detected yet.

Update Behavior

Incremental Updates

The API sends only changed or new segments. Clients should:

Maintain a local map of segments by ID
When receiving an update, merge/update segments by ID
Render only the changed segments
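The three steps above can be sketched as a small store keyed by segment ID. This is an illustration, not the reference client: `SegmentStore` and `APPENDED_FIELDS` are hypothetical names, and the merge rule (append `text` and `translation`, overwrite every other field) follows the Segment Fields table. The Language Detection example in this document shows a segment being replaced instead, so the merge rule may need adjusting once the API is final.

```python
from typing import Any

# Assumption: these fields accumulate across updates (see Segment Fields).
APPENDED_FIELDS = ("text", "translation")


class SegmentStore:
    """Local map of segments keyed by ID."""

    def __init__(self) -> None:
        self.segments: dict[int, dict[str, Any]] = {}

    def apply(self, message: dict[str, Any]) -> list[int]:
        """Merge one transcript_update and return the IDs that changed."""
        changed = []
        for incoming in message.get("segments", []):
            current = self.segments.setdefault(incoming["id"], {})
            for key, value in incoming.items():
                if key in APPENDED_FIELDS:
                    current[key] = current.get(key, "") + value
                else:
                    # Buffers, timestamps, language, etc. are overwritten.
                    current[key] = value
            changed.append(incoming["id"])
        return changed
```

The list returned by `apply` tells the rendering layer which segments to redraw, so unchanged segments can be left alone.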

Language Detection

When the language is detected for a segment, the server sends that segment again with the same ID:

// Update 1: No language yet
{
  "segments": [
    {"id": 1, "speaker": 1, "text": "May see", "language": null}
  ]
}

// Update 2: Same segment ID, language now detected
{
  "segments": [
    {"id": 1, "speaker": 1, "text": "Merci", "language": "fr"}
  ]
}

Client behavior: Replace the existing segment with the same ID.

Buffer Behavior

Buffers are per-segment to handle multi-speaker scenarios correctly.

Example: Transcription with diarization and translation

// Update 1
{
  "segments": [
    {
      "id": 1,
      "speaker": 1,
      "text": "Hello world, how are",
      "translation": "",
      "buffer": {
        "transcription": "",
        "diarization": " you on",
        "translation": "Bonjour le monde"
      }
    }
  ]
}


// ==== Frontend ====
// <SPEAKER>1</SPEAKER>
// <TRANSCRIPTION>Hello world, how are<DIARIZATION BUFFER> you on</DIARIZATION BUFFER></TRANSCRIPTION>
// <TRANSLATION><TRANSLATION BUFFER>Bonjour le monde</TRANSLATION BUFFER></TRANSLATION>


// Update 2
{
  "segments": [
    {
      "id": 1,
      "speaker": 1,
      "text": " you on this",
      "translation": "Bonjour tout le monde",
      "buffer": {
        "transcription": "",
        "diarization": " beautiful day",
        "translation": ", comment"
      }
    }
  ]
}


// ==== Frontend ====
// <SPEAKER>1</SPEAKER>
// <TRANSCRIPTION>Hello world, how are you on this<DIARIZATION BUFFER> beautiful day</DIARIZATION BUFFER></TRANSCRIPTION>
// <TRANSLATION>Bonjour tout le monde<TRANSLATION BUFFER>, comment</TRANSLATION BUFFER></TRANSLATION>
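To make the two updates concrete, the sketch below replays them for segment 1: validated `text` and `translation` are appended, while the buffer is simply overwritten. The helper names `apply_segment` and `render` are hypothetical, square brackets stand in for the buffer markup, and the translation buffer uses the spacing shown in the frontend rendering.

```python
def apply_segment(state: dict, update: dict) -> dict:
    """Append validated fields and overwrite the ephemeral buffer."""
    return {
        "text": state.get("text", "") + update.get("text", ""),
        "translation": state.get("translation", "") + update.get("translation", ""),
        # The previous buffer is dropped, never merged.
        "buffer": update.get("buffer", {}),
    }


def render(state: dict) -> tuple[str, str]:
    """Return (transcription, translation) with pending text in brackets."""
    buffer = state["buffer"]
    # Assumption: diarization buffer precedes the transcription buffer.
    pending = buffer.get("diarization", "") + buffer.get("transcription", "")
    transcription = state["text"] + (f"[{pending}]" if pending else "")
    pending_tr = buffer.get("translation", "")
    translation = state["translation"] + (f"[{pending_tr}]" if pending_tr else "")
    return transcription, translation


update_1 = {"text": "Hello world, how are", "translation": "",
            "buffer": {"transcription": "", "diarization": " you on",
                       "translation": "Bonjour le monde"}}
update_2 = {"text": " you on this", "translation": "Bonjour tout le monde",
            "buffer": {"transcription": "", "diarization": " beautiful day",
                       "translation": ", comment"}}

state = apply_segment({}, update_1)
first = render(state)
# first == ("Hello world, how are[ you on]", "[Bonjour le monde]")
state = apply_segment(state, update_2)
second = render(state)
# second == ("Hello world, how are you on this[ beautiful day]",
#            "Bonjour tout le monde[, comment]")
```

Note that the text " you on", shown as a buffer after the first update, comes back as validated text in the second update, which is why the client must not persist buffers.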

Silence Segments

Silence is represented by a segment whose `speaker` is `-2`:

{
  "id": 5,
  "speaker": -2,
  "text": "",
  "start": 10.5,
  "end": 12.3
}
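A client will usually want to treat these segments differently from speech, for example by hiding them or showing a pause marker. The following sketch separates silence from spoken segments and totals the silent time. The names `SILENCE_SPEAKER` and `split_silence` are hypothetical, and what to do with the result is left to the frontend.

```python
SILENCE_SPEAKER = -2  # special speaker value documented above


def split_silence(segments: list[dict]) -> tuple[list[dict], float]:
    """Return (spoken_segments, total_silence_seconds)."""
    spoken = [s for s in segments if s["speaker"] != SILENCE_SPEAKER]
    silence = sum(
        s["end"] - s["start"]
        for s in segments
        if s["speaker"] == SILENCE_SPEAKER
    )
    return spoken, silence
```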
