mirror of
https://github.com/QuentinFuxa/WhisperLiveKit.git
synced 2026-04-01 18:34:22 +00:00
252 lines
6.5 KiB
Markdown
252 lines
6.5 KiB
Markdown
# WhisperLiveKit WebSocket API Documentation
|
|
|
|
WLK provides real-time speech transcription, speaker diarization, and translation through a WebSocket API. The server sends updates as audio is processed, allowing clients to display live transcription results with minimal latency.
|
|
|
|
---
|
|
|
|
## Endpoints
|
|
|
|
| Endpoint | Description |
|
|
|----------|-------------|
|
|
| `/` | Main web interface with visual styling |
|
|
| `/text` | Simple text-based interface for easy copy/paste (debug/development) |
|
|
| `/asr` | WebSocket endpoint for audio streaming |
|
|
|
|
---
|
|
|
|
## Message Format
|
|
|
|
### Transcript Update (Server → Client)
|
|
|
|
```typescript
|
|
{
|
|
"type": "transcript_update",
|
|
"status": "active_transcription" | "no_audio_detected",
|
|
"segments": [
|
|
{
|
|
"id": number,
|
|
"speaker": number,
|
|
"text": string,
|
|
"start_speaker": string, // HH:MM:SS format
|
|
"start": string, // HH:MM:SS format
|
|
"end": string, // HH:MM:SS format
|
|
"language": string | null,
|
|
"translation": string,
|
|
"buffer": {
|
|
"transcription": string,
|
|
"diarization": string,
|
|
"translation": string
|
|
}
|
|
}
|
|
],
|
|
"metadata": {
|
|
"remaining_time_transcription": float,
|
|
"remaining_time_diarization": float
|
|
}
|
|
}
|
|
```
|
|
|
|
### Other Message Types
|
|
|
|
#### Config Message (sent on connection)
|
|
```json
|
|
{
|
|
"type": "config",
|
|
"useAudioWorklet": true
|
|
}
|
|
```
|
|
- `useAudioWorklet`: If `true`, client should use AudioWorklet for PCM streaming. If `false`, use MediaRecorder for WebM.
|
|
|
|
#### Ready to Stop Message (sent after processing complete)
|
|
```json
|
|
{
|
|
"type": "ready_to_stop"
|
|
}
|
|
```
|
|
Indicates all audio has been processed and the client can safely close the connection.
|
|
|
|
---
|
|
|
|
## Field Descriptions
|
|
|
|
### Segment Fields
|
|
|
|
| Field | Type | Description |
|
|
|-------|------|-------------|
|
|
| `id` | `number` | Unique identifier for this segment. |
|
|
| `speaker` | `number` | Speaker ID (1, 2, 3...). Special value `-2` indicates silence. |
|
|
| `text` | `string` | Validated transcription text. |
|
|
| `start_speaker` | `string` | Timestamp (HH:MM:SS) when this speaker segment began. |
|
|
| `start` | `string` | Timestamp (HH:MM:SS) of the first word. |
|
|
| `end` | `string` | Timestamp (HH:MM:SS) of the last word. |
|
|
| `language` | `string \| null` | ISO language code (e.g., "en", "fr"). `null` until detected. |
|
|
| `translation` | `string` | Validated translation text. |
|
|
| `buffer` | `Object` | Per-segment temporary buffers (see below). |
|
|
|
|
### Buffer Object (Per-Segment)
|
|
|
|
Buffers are **ephemeral**. They should be displayed to the user but are overwritten on each update. Only the **last non-silent segment** contains buffer content.
|
|
|
|
| Field | Type | Description |
|
|
|-------|------|-------------|
|
|
| `transcription` | `string` | Text pending validation (waiting for more context). |
|
|
| `diarization` | `string` | Text pending speaker assignment (diarization hasn't caught up). |
|
|
| `translation` | `string` | Translation pending validation. |
|
|
|
|
### Metadata Fields
|
|
|
|
| Field | Type | Description |
|
|
|-------|------|-------------|
|
|
| `remaining_time_transcription` | `float` | Seconds of audio waiting for transcription. |
|
|
| `remaining_time_diarization` | `float` | Seconds of audio waiting for diarization. |
|
|
|
|
### Status Values
|
|
|
|
| Status | Description |
|
|
|--------|-------------|
|
|
| `active_transcription` | Normal operation, transcription is active. |
|
|
| `no_audio_detected` | No audio/speech has been detected yet. |
|
|
|
|
---
|
|
|
|
## Behavior Notes
|
|
|
|
### Silence Handling
|
|
|
|
- **Short silences (< 2 seconds)** are filtered out and not displayed.
|
|
- Only significant pauses appear as silence segments with `speaker: -2`.
|
|
- Consecutive same-speaker segments are merged even across short silences.
|
|
|
|
### Update Frequency
|
|
|
|
- **Active transcription**: ~20 updates/second (every 50ms)
|
|
- **During silence**: ~2 updates/second (every 500ms) to reduce bandwidth
|
|
|
|
### Token-by-Token Validation (Diarization Mode)
|
|
|
|
When diarization is enabled, text is validated **token-by-token** as soon as diarization covers each token, rather than waiting for punctuation. This provides:
|
|
- Faster text validation
|
|
- More responsive speaker attribution
|
|
- Buffer only contains tokens that diarization hasn't processed yet
|
|
|
|
---
|
|
|
|
## Example Messages
|
|
|
|
### Normal Transcription
|
|
|
|
```json
|
|
{
|
|
"type": "transcript_update",
|
|
"status": "active_transcription",
|
|
"segments": [
|
|
{
|
|
"id": 1,
|
|
"speaker": 1,
|
|
"text": "Hello, how are you today?",
|
|
"start_speaker": "0:00:02",
|
|
"start": "0:00:02",
|
|
"end": "0:00:05",
|
|
"language": "en",
|
|
"translation": "",
|
|
"buffer": {
|
|
"transcription": " I'm doing",
|
|
"diarization": "",
|
|
"translation": ""
|
|
}
|
|
}
|
|
],
|
|
"metadata": {
|
|
"remaining_time_transcription": 0.5,
|
|
"remaining_time_diarization": 0
|
|
}
|
|
}
|
|
```
|
|
|
|
### With Diarization Buffer
|
|
|
|
```json
|
|
{
|
|
"type": "transcript_update",
|
|
"status": "active_transcription",
|
|
"segments": [
|
|
{
|
|
"id": 1,
|
|
"speaker": 1,
|
|
"text": "The meeting starts at nine.",
|
|
"start_speaker": "0:00:03",
|
|
"start": "0:00:03",
|
|
"end": "0:00:06",
|
|
"language": "en",
|
|
"translation": "",
|
|
"buffer": {
|
|
"transcription": "",
|
|
"diarization": " Let me check my calendar",
|
|
"translation": ""
|
|
}
|
|
}
|
|
],
|
|
"metadata": {
|
|
"remaining_time_transcription": 0.3,
|
|
"remaining_time_diarization": 2.1
|
|
}
|
|
}
|
|
```
|
|
|
|
### Silence Segment
|
|
|
|
```json
|
|
{
|
|
"id": 5,
|
|
"speaker": -2,
|
|
"text": "",
|
|
"start_speaker": "0:00:10",
|
|
"start": "0:00:10",
|
|
"end": "0:00:15",
|
|
"language": null,
|
|
"translation": "",
|
|
"buffer": {
|
|
"transcription": "",
|
|
"diarization": "",
|
|
"translation": ""
|
|
}
|
|
}
|
|
```
|
|
|
|
---
|
|
|
|
## Text Transcript Endpoint (`/text`)
|
|
|
|
The `/text` endpoint provides a simple, monospace text interface designed for:
|
|
- Easy copy/paste of transcripts
|
|
- Debugging and development
|
|
- Integration testing
|
|
|
|
Output uses text markers instead of HTML styling:
|
|
|
|
```
|
|
[METADATA transcription_lag=0.5s diarization_lag=1.2s]
|
|
|
|
[SPEAKER 1] 0:00:03 - 0:00:11 [LANG: en]
|
|
Hello world, how are you doing today?[DIAR_BUFFER] I'm doing fine[/DIAR_BUFFER]
|
|
|
|
[SILENCE 0:00:15 - 0:00:18]
|
|
|
|
[SPEAKER 2] 0:00:18 - 0:00:22 [LANG: en]
|
|
That's great to hear!
|
|
[TRANSLATION]C'est super à entendre![/TRANSLATION]
|
|
```
|
|
|
|
### Markers
|
|
|
|
| Marker | Description |
|
|
|--------|-------------|
|
|
| `[SPEAKER N]` | Speaker label with ID |
|
|
| `[SILENCE start - end]` | Silence segment |
|
|
| `[LANG: xx]` | Detected language code |
|
|
| `[DIAR_BUFFER]...[/DIAR_BUFFER]` | Text pending speaker assignment |
|
|
| `[TRANS_BUFFER]...[/TRANS_BUFFER]` | Text pending validation |
|
|
| `[TRANSLATION]...[/TRANSLATION]` | Translation content |
|
|
| `[METADATA ...]` | Lag/timing information |
|
|
|