mirror of https://github.com/QuentinFuxa/WhisperLiveKit.git synced 2026-03-07 22:33:36 +00:00

Quentin Fuxa d9a4c8dcb2 Refactor transcription and diarization handling with token-by-token validation. Introduce segment buffers for ephemeral content and update API to return structured segment data. Enhance silence handling and improve web interface for text transcripts.

2025-11-30 16:39:27 +01:00

6.5 KiB

WhisperLiveKit WebSocket API Documentation

WhisperLiveKit (WLK) provides real-time speech transcription, speaker diarization, and translation through a WebSocket API. The server pushes updates as audio is processed, so clients can display live transcription results with minimal latency.

Endpoints

Endpoint	Description
`/`	Main web interface with visual styling
`/text`	Simple text-based interface for easy copy/paste (debug/development)
`/asr`	WebSocket endpoint for audio streaming

Message Format

Transcript Update (Server → Client)

{
  "type": "transcript_update",
  "status": "active_transcription" | "no_audio_detected",
  "segments": [
    {
      "id": number,
      "speaker": number,
      "text": string,
      "start_speaker": string,    // H:MM:SS format, e.g. "0:00:02"
      "start": string,            // H:MM:SS format, e.g. "0:00:02"
      "end": string,              // H:MM:SS format, e.g. "0:00:05"
      "language": string | null,
      "translation": string,
      "buffer": {
        "transcription": string,
        "diarization": string,
        "translation": string
      }
    }
  ],
  "metadata": {
    "remaining_time_transcription": float,
    "remaining_time_diarization": float
  }
}
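
On the client side, the schema above can be mirrored with a small typed model. The following is a minimal sketch in Python, not code from WhisperLiveKit: the names `Buffer`, `Segment`, and `parse_segment` are illustrative assumptions, and the sketch assumes that a missing or `null` `buffer` can be treated as three empty strings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Buffer:
    # Ephemeral, per-segment text that the next update may overwrite.
    transcription: str = ""
    diarization: str = ""
    translation: str = ""


@dataclass
class Segment:
    id: int
    speaker: int              # -2 marks a silence segment
    text: str
    start_speaker: str
    start: str
    end: str
    language: Optional[str]   # None until a language is detected
    translation: str
    buffer: Buffer


def parse_segment(raw: dict) -> Segment:
    """Build a Segment from one entry of the "segments" array."""
    return Segment(
        id=raw["id"],
        speaker=raw["speaker"],
        text=raw["text"],
        start_speaker=raw["start_speaker"],
        start=raw["start"],
        end=raw["end"],
        language=raw.get("language"),
        translation=raw.get("translation", ""),
        # Assumption: an absent or null buffer means "nothing pending".
        buffer=Buffer(**(raw.get("buffer") or {})),
    )
```

Required keys raise `KeyError` when absent, which surfaces schema drift early instead of silently rendering incomplete segments.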

Other Message Types

Config Message (sent on connection)

{
  "type": "config",
  "useAudioWorklet": true
}

`useAudioWorklet`: if `true`, the client should stream PCM audio using AudioWorklet. If `false`, it should stream WebM audio using MediaRecorder.

Ready to Stop Message (sent after processing complete)

{
  "type": "ready_to_stop"
}

Indicates all audio has been processed and the client can safely close the connection.
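
The three message types described so far can be routed on their `type` field. The sketch below shows one possible way to do this with only the standard library. The `handle_message` function and the keys of the `state` dictionary are hypothetical names chosen for illustration, and unknown message types are simply ignored.

```python
import json
from typing import Optional


def handle_message(raw: str, state: dict) -> Optional[str]:
    """Route one server message by its "type" field and return that type."""
    message = json.loads(raw)
    kind = message.get("type")
    if kind == "config":
        # True: stream PCM via AudioWorklet. False: stream WebM via MediaRecorder.
        state["use_audio_worklet"] = bool(message.get("useAudioWorklet", False))
    elif kind == "transcript_update":
        state["status"] = message.get("status")
        state["segments"] = message.get("segments", [])
        state["metadata"] = message.get("metadata", {})
    elif kind == "ready_to_stop":
        # All audio has been processed, so the caller may close the socket.
        state["done"] = True
    return kind
```

A receive loop would call `handle_message` for each text frame and close the connection once `state.get("done")` is true.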

Field Descriptions

Segment Fields

Field	Type	Description
`id`	`number`	Unique identifier for this segment.
`speaker`	`number`	Speaker ID (1, 2, 3...). Special value `-2` indicates silence.
`text`	`string`	Validated transcription text.
`start_speaker`	`string`	Timestamp (H:MM:SS, e.g. `0:00:02`) when this speaker segment began.
`start`	`string`	Timestamp (H:MM:SS) of the first word.
`end`	`string`	Timestamp (H:MM:SS) of the last word.
`language`	`string \| null`	ISO language code (e.g., "en", "fr"). `null` until detected.
`translation`	`string`	Validated translation text.
`buffer`	`Object`	Per-segment temporary buffers (see below).
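
Because the timestamps arrive as strings, a client that needs durations has to convert them first. The helper below is a small sketch that assumes whole-second timestamps with exactly three colon-separated fields, as in the example messages in this document. `timestamp_to_seconds` and `segment_duration` are illustrative names, not part of the API.

```python
def timestamp_to_seconds(stamp: str) -> int:
    """Convert an hours:minutes:seconds string such as "0:00:15" to seconds."""
    hours, minutes, seconds = (int(part) for part in stamp.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def segment_duration(segment: dict) -> int:
    """Seconds between the "start" and "end" timestamps of a segment."""
    return timestamp_to_seconds(segment["end"]) - timestamp_to_seconds(segment["start"])
```

For a silence segment that starts at `0:00:10` and ends at `0:00:15`, `segment_duration` returns 5.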

Buffer Object (Per-Segment)

Buffers are ephemeral: display them to the user, but expect every update to overwrite them. Only the last non-silent segment carries buffer content.

Field	Type	Description
`transcription`	`string`	Text pending validation (waiting for more context).
`diarization`	`string`	Text pending speaker assignment (diarization hasn't caught up).
`translation`	`string`	Translation pending validation.
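
One way to present a segment is to keep validated text and pending text separate so the UI can style them differently (for example, dimming the pending part). The sketch below assumes that diarization-pending text precedes transcription-pending text in reading order. That ordering is an assumption, not something this document specifies, and `split_display_text` is a hypothetical helper.

```python
from typing import Tuple


def split_display_text(segment: dict) -> Tuple[str, str]:
    """Return (validated, pending) text for one segment."""
    buffer = segment.get("buffer") or {}
    # Assumed order: text awaiting speaker assignment, then text awaiting validation.
    pending = buffer.get("diarization", "") + buffer.get("transcription", "")
    return segment.get("text", ""), pending
```

Since buffers are replaced on every update, the pending part should be re-rendered from scratch each time rather than appended to what was shown before.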

Metadata Fields

Field	Type	Description
`remaining_time_transcription`	`float`	Seconds of audio waiting for transcription.
`remaining_time_diarization`	`float`	Seconds of audio waiting for diarization.

Status Values

Status	Description
`active_transcription`	Normal operation, transcription is active.
`no_audio_detected`	No audio/speech has been detected yet.

Behavior Notes

Silence Handling

Short silences (< 2 seconds) are filtered out and not displayed.
Only significant pauses appear as silence segments with `speaker: -2`.
Consecutive same-speaker segments are merged even across short silences.
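
The three rules above can be illustrated with a short sketch. This is not the server's implementation: it only demonstrates the described behavior on a hypothetical list of segments whose `start` and `end` are already expressed in seconds. The function name and the constants are assumptions, with the 2-second threshold taken from the first rule.

```python
SILENCE_SPEAKER = -2
MIN_SILENCE_SECONDS = 2.0


def collapse_segments(segments: list) -> list:
    """Drop short silences, then merge neighbouring segments of the same speaker."""
    merged = []
    for seg in segments:
        is_silence = seg["speaker"] == SILENCE_SPEAKER
        if is_silence and seg["end"] - seg["start"] < MIN_SILENCE_SECONDS:
            continue  # short silence: filtered out, never shown
        if merged and merged[-1]["speaker"] == seg["speaker"]:
            # Same speaker as the previous kept segment: extend it.
            merged[-1]["end"] = seg["end"]
            merged[-1]["text"] += seg["text"]
        else:
            merged.append(dict(seg))  # copy so the input list is not mutated
    return merged
```

Dropping the short silence before comparing speakers is what lets two segments from the same speaker merge across a brief pause.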

Update Frequency

Active transcription: ~20 updates/second (every 50 ms)
During silence: ~2 updates/second (every 500 ms) to reduce bandwidth

Token-by-Token Validation (Diarization Mode)

When diarization is enabled, text is validated token by token as soon as diarization covers each token, rather than waiting for punctuation. As a result:

Text is validated faster.
Speaker attribution is more responsive.
The buffer contains only tokens that diarization has not processed yet.
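
The idea can be sketched as a split of the token stream at the point diarization has reached. The code below is a conceptual illustration only: the `Token` type, the `diarized_until` parameter, and the rule "a token is covered when its end time is at or before the diarization position" are assumptions made for the example, not details confirmed by this document.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    text: str
    start: float  # seconds
    end: float    # seconds


def split_by_diarization(tokens: List[Token], diarized_until: float) -> Tuple[str, str]:
    """Return (validated_text, diarization_buffer) for time-ordered tokens."""
    cut = 0
    for token in tokens:
        if token.end > diarized_until:
            break  # diarization has not covered this token yet
        cut += 1
    validated = "".join(t.text for t in tokens[:cut])
    pending = "".join(t.text for t in tokens[cut:])
    return validated, pending
```

With tokens ending at 0.3 s, 0.8 s, and 1.2 s and diarization at 0.9 s, the first two tokens are validated and the third stays in the buffer.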

Example Messages

Normal Transcription

{
  "type": "transcript_update",
  "status": "active_transcription",
  "segments": [
    {
      "id": 1,
      "speaker": 1,
      "text": "Hello, how are you today?",
      "start_speaker": "0:00:02",
      "start": "0:00:02",
      "end": "0:00:05",
      "language": "en",
      "translation": "",
      "buffer": {
        "transcription": " I'm doing",
        "diarization": "",
        "translation": ""
      }
    }
  ],
  "metadata": {
    "remaining_time_transcription": 0.5,
    "remaining_time_diarization": 0
  }
}

With Diarization Buffer

{
  "type": "transcript_update",
  "status": "active_transcription",
  "segments": [
    {
      "id": 1,
      "speaker": 1,
      "text": "The meeting starts at nine.",
      "start_speaker": "0:00:03",
      "start": "0:00:03",
      "end": "0:00:06",
      "language": "en",
      "translation": "",
      "buffer": {
        "transcription": "",
        "diarization": " Let me check my calendar",
        "translation": ""
      }
    }
  ],
  "metadata": {
    "remaining_time_transcription": 0.3,
    "remaining_time_diarization": 2.1
  }
}

Silence Segment

{
  "id": 5,
  "speaker": -2,
  "text": "",
  "start_speaker": "0:00:10",
  "start": "0:00:10",
  "end": "0:00:15",
  "language": null,
  "translation": "",
  "buffer": {
    "transcription": "",
    "diarization": "",
    "translation": ""
  }
}

Text Transcript Endpoint (`/text`)

The `/text` endpoint provides a simple, monospace text interface designed for:

Easy copy/paste of transcripts
Debugging and development
Integration testing

Output uses text markers instead of HTML styling:

[METADATA transcription_lag=0.5s diarization_lag=1.2s]

[SPEAKER 1] 0:00:03 - 0:00:11 [LANG: en]
Hello world, how are you doing today?[DIAR_BUFFER] I'm doing fine[/DIAR_BUFFER]

[SILENCE 0:00:15 - 0:00:18]

[SPEAKER 2] 0:00:18 - 0:00:22 [LANG: en]
That's great to hear!
[TRANSLATION]C'est super à entendre![/TRANSLATION]

Markers

Marker	Description
`[SPEAKER N]`	Speaker label with ID
`[SILENCE start - end]`	Silence segment
`[LANG: xx]`	Detected language code
`[DIAR_BUFFER]...[/DIAR_BUFFER]`	Text pending speaker assignment
`[TRANS_BUFFER]...[/TRANS_BUFFER]`	Transcription text pending validation
`[TRANSLATION]...[/TRANSLATION]`	Translation content
`[METADATA ...]`	Lag/timing information
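
For integration tests, the markers can be picked apart with regular expressions. The sketch below is based only on the sample output shown in this section, so the exact spacing it expects is an assumption. `parse_header` and `strip_buffers` are hypothetical helper names.

```python
import re
from typing import Optional

SPEAKER_RE = re.compile(
    r"^\[SPEAKER (?P<speaker>\d+)\] (?P<start>\S+) - (?P<end>\S+)"
    r"(?: \[LANG: (?P<lang>[^\]]+)\])?$"
)
SILENCE_RE = re.compile(r"^\[SILENCE (?P<start>\S+) - (?P<end>\S+)\]$")
BUFFER_RE = re.compile(r"\[(DIAR_BUFFER|TRANS_BUFFER)\](.*?)\[/\1\]")


def parse_header(line: str) -> Optional[dict]:
    """Parse a [SPEAKER ...] or [SILENCE ...] line, or return None for other lines."""
    line = line.strip()
    match = SPEAKER_RE.match(line)
    if match:
        return {
            "kind": "speaker",
            "speaker": int(match["speaker"]),
            "start": match["start"],
            "end": match["end"],
            "language": match["lang"],  # None when no [LANG: xx] marker is present
        }
    match = SILENCE_RE.match(line)
    if match:
        return {"kind": "silence", "start": match["start"], "end": match["end"]}
    return None


def strip_buffers(line: str) -> str:
    """Remove buffer markers and their pending text, keeping validated text only."""
    return BUFFER_RE.sub("", line)
```

Stripping buffers before comparing against expected transcripts keeps test assertions stable, since buffered text can change from one update to the next.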
