# WhisperLiveKit WebSocket API Documentation

WhisperLiveKit (WLK) provides real-time speech transcription, speaker diarization, and translation through a WebSocket API. The server pushes updates as audio is processed, so clients can display live transcription results with minimal latency.

---

## Endpoints

| Endpoint | Description |
|----------|-------------|
| `/` | Main web interface with visual styling |
| `/text` | Simple text-based interface for easy copy/paste (debug/development) |
| `/asr` | WebSocket endpoint for audio streaming |

---

## Message Format

### Transcript Update (Server → Client)

```typescript
{
  "type": "transcript_update",
  "status": "active_transcription" | "no_audio_detected",
  "segments": [
    {
      "id": number,
      "speaker": number,
      "text": string,
      "start_speaker": string,    // H:MM:SS format, e.g. "0:00:02"
      "start": string,            // H:MM:SS format, e.g. "0:00:02"
      "end": string,              // H:MM:SS format, e.g. "0:00:05"
      "language": string | null,
      "translation": string,
      "buffer": {
        "transcription": string,
        "diarization": string,
        "translation": string
      }
    }
  ],
  "metadata": {
    "remaining_time_transcription": number,  // seconds (float)
    "remaining_time_diarization": number     // seconds (float)
  }
}
```
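
As a sketch, the schema above can be written as TypeScript interfaces, together with a small guard for incoming frames. The interface names and the `parseTranscriptUpdate` helper are illustrative and not part of WLK. The timing fields are typed as `number` because TypeScript has no separate float type.

```typescript
// Shapes transcribed from the schema above; the names are illustrative.
interface SegmentBuffer {
  transcription: string;
  diarization: string;
  translation: string;
}

interface Segment {
  id: number;
  speaker: number;
  text: string;
  start_speaker: string;
  start: string;
  end: string;
  language: string | null;
  translation: string;
  buffer: SegmentBuffer;
}

interface TranscriptUpdate {
  type: "transcript_update";
  status: "active_transcription" | "no_audio_detected";
  segments: Segment[];
  metadata: {
    remaining_time_transcription: number;
    remaining_time_diarization: number;
  };
}

// Hypothetical helper: parses one raw text frame. Returns the update, or null
// when the frame is not JSON or is a different message type.
function parseTranscriptUpdate(raw: string): TranscriptUpdate | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const msg = data as Record<string, unknown>;
  if (msg.type !== "transcript_update" || !Array.isArray(msg.segments)) {
    return null;
  }
  // Only the discriminant and the segments array are checked here;
  // per-field validation is left out of this sketch.
  return msg as unknown as TranscriptUpdate;
}
```

The guard is deliberately shallow. A client that cannot trust the server may want to validate each segment field as well.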

### Other Message Types

#### Config Message (sent on connection)
```json
{
  "type": "config",
  "useAudioWorklet": true
}
```
- `useAudioWorklet`: If `true`, the client should use AudioWorklet for PCM streaming. If `false`, it should use MediaRecorder for WebM.

#### Ready to Stop Message (sent after processing is complete)
```json
{
  "type": "ready_to_stop"
}
```
Indicates all audio has been processed and the client can safely close the connection.
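
One way a client might dispatch on the three message types is sketched below. The `ClientState` shape, the `CaptureMode` labels, and the `handleServerMessage` function are assumptions made for illustration; only the `type` values and `useAudioWorklet` come from this document.

```typescript
// Minimal message shapes: only the fields this sketch reads.
type ServerMessage =
  | { type: "config"; useAudioWorklet: boolean }
  | { type: "ready_to_stop" }
  | { type: "transcript_update"; segments: unknown[] };

// Illustrative labels for the two capture paths described above.
type CaptureMode = "audioworklet-pcm" | "mediarecorder-webm";

interface ClientState {
  captureMode: CaptureMode | null; // unknown until the config message arrives
  canClose: boolean;               // true once ready_to_stop was received
  updatesSeen: number;
}

// Hypothetical reducer: returns the next client state for one server message.
function handleServerMessage(state: ClientState, msg: ServerMessage): ClientState {
  switch (msg.type) {
    case "config":
      return {
        ...state,
        captureMode: msg.useAudioWorklet ? "audioworklet-pcm" : "mediarecorder-webm",
      };
    case "ready_to_stop":
      return { ...state, canClose: true };
    case "transcript_update":
      return { ...state, updatesSeen: state.updatesSeen + 1 };
    default:
      // Unknown message types are ignored rather than treated as errors.
      return state;
  }
}
```

In a browser, this could be called from the socket's `onmessage` handler with `JSON.parse(event.data)`, closing the connection once `canClose` becomes `true`.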

---

## Field Descriptions

### Segment Fields

| Field | Type | Description |
|-------|------|-------------|
| `id` | `number` | Unique identifier for this segment. |
| `speaker` | `number` | Speaker ID (1, 2, 3...). Special value `-2` indicates silence. |
| `text` | `string` | Validated transcription text. |
| `start_speaker` | `string` | Timestamp (H:MM:SS, e.g. `0:00:02`) when this speaker segment began. |
| `start` | `string` | Timestamp (H:MM:SS) of the first word. |
| `end` | `string` | Timestamp (H:MM:SS) of the last word. |
| `language` | `string \| null` | ISO language code (e.g., "en", "fr"). `null` until detected. |
| `translation` | `string` | Validated translation text. |
| `buffer` | `Object` | Per-segment temporary buffers (see below). |

### Buffer Object (Per-Segment)

Buffers are **ephemeral**. Display them to the user, but do not persist them: each update overwrites them. Only the **last non-silent segment** contains buffer content.

| Field | Type | Description |
|-------|------|-------------|
| `transcription` | `string` | Text pending validation (waiting for more context). |
| `diarization` | `string` | Text pending speaker assignment (diarization hasn't caught up). |
| `translation` | `string` | Translation pending validation. |
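
A minimal sketch of how a client might separate validated text from provisional buffer text when rendering. The `splitDisplayText` helper is hypothetical, and showing the diarization buffer before the transcription buffer is an assumption about ordering, not something this document specifies.

```typescript
// Only the fields this sketch needs from a segment.
type BufferedSegment = {
  text: string;
  buffer: { transcription: string; diarization: string; translation: string };
};

// Hypothetical helper: committed text is safe to keep; provisional text
// should be redrawn from scratch on every update.
function splitDisplayText(seg: BufferedSegment): { committed: string; provisional: string } {
  return {
    committed: seg.text,
    // Assumed order: text awaiting speaker assignment, then text awaiting validation.
    provisional: seg.buffer.diarization + seg.buffer.transcription,
  };
}

const parts = splitDisplayText({
  text: "Hello, how are you today?",
  buffer: { transcription: " I'm doing", diarization: "", translation: "" },
});
// A UI could render `committed` normally and `provisional` in a dimmed style.
const line = parts.committed + parts.provisional;
```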

### Metadata Fields

| Field | Type | Description |
|-------|------|-------------|
| `remaining_time_transcription` | `number` | Seconds (float) of audio waiting for transcription. |
| `remaining_time_diarization` | `number` | Seconds (float) of audio waiting for diarization. |

### Status Values

| Status | Description |
|--------|-------------|
| `active_transcription` | Normal operation, transcription is active. |
| `no_audio_detected` | No audio/speech has been detected yet. |

---

## Behavior Notes

### Silence Handling

- **Short silences (< 2 seconds)** are filtered out and not displayed.
- Only significant pauses appear as silence segments with `speaker: -2`.
- Consecutive same-speaker segments are merged even across short silences.
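
On the client side, silence segments can be recognized by their speaker value. The sketch below assumes hypothetical helper names (`isSilence`, `lastSpokenSegment`); the `-2` value is the one given in the Segment Fields table.

```typescript
const SILENCE_SPEAKER = -2; // special speaker value that marks silence

// Only the fields this sketch needs from a segment.
type SpeakerSegment = { speaker: number; text: string };

function isSilence(seg: SpeakerSegment): boolean {
  return seg.speaker === SILENCE_SPEAKER;
}

// Returns the most recent segment that is not silence, which is the only one
// expected to carry buffer content. Returns undefined if there is none.
function lastSpokenSegment<T extends SpeakerSegment>(segments: T[]): T | undefined {
  return [...segments].reverse().find((seg) => !isSilence(seg));
}
```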

### Update Frequency

- **Active transcription**: ~20 updates/second (every 50ms)
- **During silence**: ~2 updates/second (every 500ms) to reduce bandwidth
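
A client whose rendering is expensive may not want to redraw on every update. The following is a sketch of a "latest wins" throttle, not something WLK provides: `createLatestWins` is a hypothetical helper, and the clock value is passed in by the caller so the logic stays independent of timers.

```typescript
// Keeps only the newest value and releases it at most once per interval.
function createLatestWins<T>(minIntervalMs: number) {
  let pending: T | undefined;
  let lastEmitMs = -Infinity;
  return {
    // Stores a new value, replacing any value not yet released.
    push(value: T): void {
      pending = value;
    },
    // Returns the newest pending value if the interval has elapsed,
    // otherwise undefined.
    poll(nowMs: number): T | undefined {
      if (pending === undefined || nowMs - lastEmitMs < minIntervalMs) {
        return undefined;
      }
      const out = pending;
      pending = undefined;
      lastEmitMs = nowMs;
      return out;
    },
  };
}
```

Because every update carries the full segment list in this format, dropping intermediate updates and rendering only the newest one should not lose validated text.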

### Token-by-Token Validation (Diarization Mode)

When diarization is enabled, text is validated **token-by-token** as soon as diarization covers each token, rather than waiting for punctuation. As a result:
- Text is validated faster.
- Speaker attribution is more responsive.
- The buffer contains only tokens that diarization has not processed yet.

---

## Example Messages

### Normal Transcription

```json
{
  "type": "transcript_update",
  "status": "active_transcription",
  "segments": [
    {
      "id": 1,
      "speaker": 1,
      "text": "Hello, how are you today?",
      "start_speaker": "0:00:02",
      "start": "0:00:02",
      "end": "0:00:05",
      "language": "en",
      "translation": "",
      "buffer": {
        "transcription": " I'm doing",
        "diarization": "",
        "translation": ""
      }
    }
  ],
  "metadata": {
    "remaining_time_transcription": 0.5,
    "remaining_time_diarization": 0
  }
}
```

### With Diarization Buffer

```json
{
  "type": "transcript_update",
  "status": "active_transcription",
  "segments": [
    {
      "id": 1,
      "speaker": 1,
      "text": "The meeting starts at nine.",
      "start_speaker": "0:00:03",
      "start": "0:00:03",
      "end": "0:00:06",
      "language": "en",
      "translation": "",
      "buffer": {
        "transcription": "",
        "diarization": " Let me check my calendar",
        "translation": ""
      }
    }
  ],
  "metadata": {
    "remaining_time_transcription": 0.3,
    "remaining_time_diarization": 2.1
  }
}
```

### Silence Segment

```json
{
  "id": 5,
  "speaker": -2,
  "text": "",
  "start_speaker": "0:00:10",
  "start": "0:00:10",
  "end": "0:00:15",
  "language": null,
  "translation": "",
  "buffer": {
    "transcription": "",
    "diarization": "",
    "translation": ""
  }
}
```
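
Timestamps arrive as strings, so computing a duration (for example, the length of the silence above) requires converting them first. This sketch assumes the `H:MM:SS` shape seen in the examples; `timestampToSeconds` is a hypothetical helper.

```typescript
// Converts "H:MM:SS" (e.g. "0:00:10") into a number of seconds.
function timestampToSeconds(ts: string): number {
  const parts = ts.split(":").map(Number);
  if (parts.length !== 3 || parts.some((p) => !Number.isFinite(p))) {
    throw new Error(`Unexpected timestamp: ${ts}`);
  }
  const [h = 0, m = 0, s = 0] = parts;
  return h * 3600 + m * 60 + s;
}

// Duration of the silence segment shown above, in seconds.
const silenceSeconds = timestampToSeconds("0:00:15") - timestampToSeconds("0:00:10");
```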

---

## Text Transcript Endpoint (`/text`)

The `/text` endpoint provides a simple, monospace text interface designed for:
- Easy copy/paste of transcripts
- Debugging and development
- Integration testing

Output uses text markers instead of HTML styling:

```
[METADATA transcription_lag=0.5s diarization_lag=1.2s]

[SPEAKER 1] 0:00:03 - 0:00:11 [LANG: en]
Hello world, how are you doing today?[DIAR_BUFFER] I'm doing fine[/DIAR_BUFFER]

[SILENCE 0:00:15 - 0:00:18]

[SPEAKER 2] 0:00:18 - 0:00:22 [LANG: en]
That's great to hear!
[TRANSLATION]C'est super à entendre![/TRANSLATION]
```

### Markers

| Marker | Description |
|--------|-------------|
| `[SPEAKER N]` | Speaker label with ID |
| `[SILENCE start - end]` | Silence segment |
| `[LANG: xx]` | Detected language code |
| `[DIAR_BUFFER]...[/DIAR_BUFFER]` | Text pending speaker assignment |
| `[TRANS_BUFFER]...[/TRANS_BUFFER]` | Text pending validation |
| `[TRANSLATION]...[/TRANSLATION]` | Translation content |
| `[METADATA ...]` | Lag/timing information |
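
For integration tests that read the `/text` output, the markers can be picked apart with regular expressions. The sketch below is based only on the sample output above. The helper names are hypothetical, and treating the `[LANG: xx]` part as optional is an assumption.

```typescript
interface SpeakerHeader {
  speaker: number;
  start: string;
  end: string;
  language: string | null;
}

// Matches e.g. "[SPEAKER 1] 0:00:03 - 0:00:11 [LANG: en]".
const SPEAKER_HEADER = /^\[SPEAKER (\d+)\] (\S+) - (\S+)(?: \[LANG: ([^\]]+)\])?$/;

// Returns the parsed header, or null if the line is not a speaker header.
function parseSpeakerHeader(line: string): SpeakerHeader | null {
  const m = SPEAKER_HEADER.exec(line.trim());
  if (m === null) return null;
  return {
    speaker: Number(m[1]),
    start: m[2] ?? "",
    end: m[3] ?? "",
    language: m[4] ?? null,
  };
}

// Removes pending (buffered) spans so that only validated text remains.
function stripBuffers(line: string): string {
  return line.replace(/\[(DIAR_BUFFER|TRANS_BUFFER)\][\s\S]*?\[\/\1\]/g, "");
}
```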