Files
WhisperLiveKit/docs/alignement_principles.md

82 lines
2.8 KiB
Markdown

# Alignment Principles
This document explains how transcription tokens are aligned with diarization (speaker identification) segments.
---
## Token-by-Token Validation
When diarization is enabled, text is validated **token-by-token** rather than waiting for sentence boundaries. As soon as diarization covers a token's time range, that token is validated and assigned to the appropriate speaker.
### How It Works
1. **Transcription produces tokens** with timestamps (start, end)
2. **Diarization produces speaker segments** with timestamps
3. **For each token**: Check if diarization has caught up to that token's time
- If yes → Find speaker with maximum overlap, validate token
- If no → Keep token in "pending" (becomes diarization buffer)
```
Timeline: 0s -------- 5s -------- 10s -------- 15s
| | | |
Transcription: [Hello, how are you doing today?]
|_______|___|____|_____|_____|_____|
tok1 tok2 tok3 tok4 tok5 tok6
Diarization: [SPEAKER 1 ][SPEAKER 2 ]
|__________________|__________________|
0s 8s 15s
At time t when diarization covers up to 8s:
- Tokens 1-4 (0s-7s) → Validated as SPEAKER 1
- Tokens 5-6 (7s-10s) → In buffer (diarization hasn't caught up)
```
---
## Silence Handling
- **Short silences (< 2 seconds)**: Filtered out, not displayed
- **Significant silences (≥ 2 seconds)**: Displayed as silence segments with `speaker: -2`
- **Same speaker across gaps**: Segments are merged even if separated by short silences
```
Before filtering:
[SPK1 0:00-0:03] [SILENCE 0:03-0:04] [SPK1 0:04-0:08]
After filtering (silence < 2s):
[SPK1 0:00-0:08] ← Merged into single segment
```
---
## Buffer Types
| Buffer | Contains | Displayed When |
|--------|----------|----------------|
| `transcription` | Text awaiting validation (more context needed) | Always on last segment |
| `diarization` | Text awaiting speaker assignment | When diarization lags behind transcription |
| `translation` | Translation awaiting validation | When translation is enabled |
---
## Legacy: Punctuation-Based Splitting
The previous approach split segments at punctuation marks and aligned with diarization at those boundaries. This is now replaced by token-by-token validation for faster, more responsive results.
### Historical Examples (for reference)
Example of punctuation-based alignment:
```text
punctuations_segments : __#_______.__________________!____
diarization_segments:
SPK1 __#____________
SPK2 # ___________________
-->
ALIGNED SPK1 __#_______.
ALIGNED SPK2 # __________________!____
```
With token-by-token validation, the alignment happens continuously rather than at punctuation boundaries.