From 73f36cc0ef658bb637fcbe6b4c764ad67f017d50 Mon Sep 17 00:00:00 2001
From: Quentin Fuxa
Date: Thu, 2 Oct 2025 23:04:00 +0200
Subject: [PATCH] v0 doc new api

---
 docs/API.md | 272 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 272 insertions(+)
 create mode 100644 docs/API.md

diff --git a/docs/API.md b/docs/API.md
new file mode 100644
index 0000000..1430cae
--- /dev/null
+++ b/docs/API.md
@@ -0,0 +1,272 @@

# WhisperLiveKit WebSocket API Documentation

> **Note**: The new API structure described in this document is currently being deployed. This documentation is intended for developers who want to build custom frontends.

WLK provides real-time speech transcription, speaker diarization, and translation through a WebSocket API. The server sends incremental updates as audio is processed, allowing clients to display live transcription results with minimal latency.

---

## Legacy API (Current)

### Message Structure

The current API sends a complete state snapshot on each update (several times per second):

```json
{
  "type": "transcript_update",
  "status": "active_transcription",
  "lines": [
    {
      "speaker": 1,
      "text": "Complete transcription text",
      "start": "0:00:05",
      "end": "0:00:08",
      "translation": "Optional translation",
      "detected_language": "en"
    }
  ],
  "buffer_transcription": "pending transcription...",
  "buffer_diarization": "pending diarization...",
  "remaining_time_transcription": 0.5,
  "remaining_time_diarization": 0.3
}
```

---

## New API (Under Development)

### Philosophy

The new API is designed around the following principles:

1. **Incremental Updates**: Only updated and new segments are sent.
2. **Word-Level Granularity**: Each word includes timing and validation status for text and speaker.
3. **Per-Segment Buffers**: Buffers are associated with specific speakers.
4. **Efficient Client-Side Handling**: Segments carry IDs so the frontend can update them in place.
5. **Ephemeral Buffers**: Temporary, unvalidated data is displayed in real time but overwritten on the next update.

---
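Because every legacy message is a complete snapshot, a legacy client can stay stateless and simply re-render on each message. A minimal TypeScript sketch for contrast with the incremental model below (field names mirror the legacy JSON above; the helper name and output layout are illustrative, not part of WLK):

```typescript
// Sketch of a stateless renderer for legacy snapshot messages.
// Field names mirror the legacy JSON; the output format is illustrative.

interface LegacyLine {
  speaker: number;
  text: string;
  start: string;
  end: string;
  translation?: string;
  detected_language?: string;
}

interface LegacyMessage {
  type: string;
  status: string;
  lines: LegacyLine[];
  buffer_transcription: string;
  buffer_diarization: string;
}

function renderLegacy(msg: LegacyMessage): string {
  // Validated lines, one per speaker turn.
  const finished = msg.lines
    .map((l) => `[${l.start}-${l.end}] Speaker ${l.speaker}: ${l.text}`)
    .join("\n");
  // Pending (unvalidated) text is shown after the validated lines.
  const pending = (msg.buffer_diarization + msg.buffer_transcription).trim();
  return pending ? `${finished}\n… ${pending}` : finished;
}
```

This is simple, but since the full transcript is resent every time, the payload grows with session length, which is exactly what the incremental API avoids.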
## Message Format

### Transcript Update Message

```typescript
{
  "type": "transcript_update",
  "status": "active_transcription" | "no_audio_detected",
  "segments": [
    {
      "id": number,
      "speaker": number,
      "text": string,
      "start_speaker": float,
      "start": float,
      "end": float,
      "language": string | null,
      "translation": string,
      "words": [
        {
          "text": string,
          "start": float,
          "end": float,
          "validated": {
            "text": boolean,
            "speaker": boolean
          }
        }
      ],
      "buffer": {
        "transcription": string,
        "diarization": string,
        "translation": string
      }
    }
  ],
  "metadata": {
    "remaining_time_transcription": float,
    "remaining_time_diarization": float
  }
}
```

### Other Message Types

#### Config Message (sent on connection)
```json
{
  "type": "config",
  "useAudioWorklet": true / false
}
```

#### Ready to Stop Message (sent after processing is complete)
```json
{
  "type": "ready_to_stop"
}
```

---

## Field Descriptions

### Segment Fields

| Field | Type | Description |
|-------|------|-------------|
| `id` | `number` | Unique identifier for this segment. Used by clients to update specific segments efficiently. |
| `speaker` | `number` | Speaker ID (1, 2, 3...). The special value `-2` indicates silence. |
| `text` | `string` | Validated transcription text for this update. Should be **appended** to the segment's text on the client side. |
| `start_speaker` | `float` | Timestamp (seconds) at which this speaker segment began. |
| `start` | `float` | Timestamp (seconds) of the first word in this update. |
| `end` | `float` | Timestamp (seconds) of the last word in this update. |
| `language` | `string \| null` | ISO language code (e.g., "en", "fr"). `null` until the language is detected. |
| `translation` | `string` | Validated translation text for this update. Should be **appended** to the segment's translation on the client side. |
| `words` | `Array` | Array of word-level objects with timing and validation information. |
| `buffer` | `Object` | Per-segment temporary buffers. See below. |
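The append semantics of `text` and `translation` suggest a small client-side merge helper that keeps one accumulated object per segment ID. A sketch in TypeScript (`applyUpdate` and the state shape are illustrative names, not part of the API):

```typescript
// Sketch: merging one incremental segment update into client state.
// `text` and `translation` are deltas to append; `end`, `language`,
// and `buffer` simply replace the previous values.

interface Buffers {
  transcription: string;
  diarization: string;
  translation: string;
}

interface SegmentUpdate {
  id: number;
  speaker: number;
  text: string;
  translation: string;
  start: number;
  end: number;
  language: string | null;
  buffer: Buffers;
}

function applyUpdate(state: Map<number, SegmentUpdate>, u: SegmentUpdate): void {
  const prev = state.get(u.id);
  if (!prev) {
    state.set(u.id, { ...u }); // first update for this segment ID
    return;
  }
  prev.text += u.text;               // validated text arrives as a delta
  prev.translation += u.translation; // validated translation too
  prev.end = u.end;                  // timing extends to the newest word
  if (u.language !== null) prev.language = u.language;
  prev.buffer = u.buffer;            // buffers are ephemeral: always overwrite
}
```

Note that `buffer` is overwritten, never concatenated, per the ephemeral-buffer principle.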
### Word Object

| Field | Type | Description |
|-------|------|-------------|
| `text` | `string` | The word text. |
| `start` | `number` | Start timestamp (seconds) of this word. |
| `end` | `number` | End timestamp (seconds) of this word. |
| `validated.text` | `boolean` | Whether the transcription text has been validated. If `false`, the word also appears in the `transcription` buffer. |
| `validated.speaker` | `boolean` | Whether the speaker assignment has been validated. If `false`, the word also appears in the `diarization` buffer. |
| `validated.language` | `boolean` | Whether the language detection has been validated. If `false`, the word also appears in the `translation` buffer. |

### Buffer Object (Per-Segment)

Buffers are **ephemeral**. They should be displayed to the user but never stored permanently by the frontend. Each update may carry a completely different buffer value, and the previous buffer content is likely to appear in the next update's validated text.

| Field | Type | Description |
|-------|------|-------------|
| `transcription` | `string` | Pending transcription text. Displayed immediately but **overwritten** on the next update. |
| `diarization` | `string` | Pending diarization text (text waiting for speaker assignment). Displayed immediately but **overwritten** on the next update. |
| `translation` | `string` | Pending translation text. Displayed immediately but **overwritten** on the next update. |

### Metadata Fields

| Field | Type | Description |
|-------|------|-------------|
| `remaining_time_transcription` | `float` | Seconds of audio waiting for transcription processing. |
| `remaining_time_diarization` | `float` | Seconds of audio waiting for speaker diarization. |
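Putting the pieces together, the string a user sees for a segment is its accumulated validated text followed by the ephemeral buffers. A sketch of that composition, assuming diarization-buffer text (already transcribed, awaiting a speaker) is shown before transcription-buffer text; that ordering is an assumption on my part, not something the API specifies:

```typescript
// Sketch: composing the user-visible strings for one segment from its
// accumulated validated text plus the ephemeral per-segment buffers.
// Buffers are rendered but never stored.

interface Buffers {
  transcription: string;
  diarization: string;
  translation: string;
}

function displayLines(
  accText: string,
  accTranslation: string,
  buf: Buffers
): [string, string] {
  // Assumed order: text awaiting a speaker, then text awaiting transcription.
  const text = accText + buf.diarization + buf.transcription;
  const translation = accTranslation + buf.translation;
  return [text, translation];
}
```

Fed with the "Update 1" data from the buffer-behavior example later in this document, this yields the same two lines shown in that example's frontend block.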
### Status Values

| Status | Description |
|--------|-------------|
| `active_transcription` | Normal operation; transcription is active. |
| `no_audio_detected` | No audio has been detected yet. |

---

## Update Behavior

### Incremental Updates

The API sends **only changed or new segments**. Clients should:

1. Maintain a local map of segments keyed by ID
2. When an update arrives, merge or replace segments by ID
3. Re-render only the changed segments

### Language Detection

When the language of a segment is detected:

```json
// Update 1: No language yet
{
  "segments": [
    {"id": 1, "speaker": 1, "text": "May see", "language": null}
  ]
}

// Update 2: Same segment ID, language now detected
{
  "segments": [
    {"id": 1, "speaker": 1, "text": "Merci", "language": "fr"}
  ]
}
```

**Client behavior**: **Replace** the existing segment with the same ID.

### Buffer Behavior

Buffers are **per-segment** to handle multi-speaker scenarios correctly.

#### Example: Transcription with diarization and translation

```json
// Update 1
{
  "segments": [
    {
      "id": 1,
      "speaker": 1,
      "text": "Hello world, how are",
      "translation": "",
      "buffer": {
        "transcription": "",
        "diarization": " you on",
        "translation": "Bonjour le monde"
      }
    }
  ]
}

"""
== Frontend ==

1
Hello world, how are you on
Bonjour le monde
"""

// Update 2
{
  "segments": [
    {
      "id": 1,
      "speaker": 1,
      "text": " you on this",
      "translation": "Bonjour tout le monde",
      "buffer": {
        "transcription": "",
        "diarization": " beautiful day",
        "translation": ", comment"
      }
    }
  ]
}

"""
== Frontend ==

1
Hello world, how are you on this beautiful day
Bonjour tout le monde, comment
"""
```

### Silence Segments

Silence is represented with the speaker ID `-2`:

```json
{
  "id": 5,
  "speaker": -2,
  "text": "",
  "start": 10.5,
  "end": 12.3
}
```
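The language-detection and silence rules above can be folded into the client's merge logic: replace a segment when its language flips from `null` to detected, append otherwise, and hide speaker `-2` when rendering. A sketch with a reduced segment shape (`Seg` and the helper names are illustrative, not part of the API):

```typescript
// Sketch: replace-on-language-detection plus silence filtering,
// following the Update Behavior rules above.

interface Seg {
  id: number;
  speaker: number;
  text: string;
  language: string | null;
}

function mergeSegment(store: Map<number, Seg>, s: Seg): void {
  const prev = store.get(s.id);
  if (!prev) {
    store.set(s.id, { ...s }); // first sight of this segment ID
    return;
  }
  if (prev.language === null && s.language !== null) {
    // Language just detected: the server re-emits corrected text, so replace.
    store.set(s.id, { ...s });
    return;
  }
  prev.text += s.text; // normal incremental case: append the validated delta
  if (s.language !== null) prev.language = s.language;
}

function visibleTranscript(store: Map<number, Seg>): string[] {
  // Silence segments (speaker -2) keep their timing but are not rendered.
  return Array.from(store.values())
    .filter((s) => s.speaker !== -2)
    .map((s) => `Speaker ${s.speaker}: ${s.text}`);
}
```

With the language-detection example above, "May see" is replaced by "Merci" once `language` becomes `"fr"`, and a silence segment contributes nothing to the visible transcript.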