Alignment Principles
This document explains how transcription tokens are aligned with diarization (speaker identification) segments.
Token-by-Token Validation
When diarization is enabled, text is validated token-by-token rather than waiting for sentence boundaries. As soon as diarization covers a token's time range, that token is validated and assigned to the appropriate speaker.
How It Works
- Transcription produces tokens with timestamps (start, end)
- Diarization produces speaker segments with timestamps
- For each token, check whether diarization has caught up to that token's time range:
  - If yes → find the speaker with maximum overlap and validate the token
  - If no → keep the token pending (it becomes part of the diarization buffer)
Timeline:      0s -------- 5s -------- 10s -------- 15s
               |           |           |            |
Transcription: [Hello, how are you doing today?]
               |_______|___|____|_____|_____|_____|
                 tok1  tok2 tok3 tok4  tok5  tok6

Diarization:   [SPEAKER 1        ][SPEAKER 2        ]
               |__________________|__________________|
               0s                 8s                 15s
At time t when diarization covers up to 8s:
- Tokens 1-4 (0s-7s) → Validated as SPEAKER 1
- Tokens 5-6 (7s-10s) → In buffer (diarization hasn't caught up)
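The validation loop described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the `Token` and `SpeakerSegment` classes, the `validate_tokens` function, and the per-token timestamps are all hypothetical, and the rule that every token after the first pending one also stays pending is an assumption made to keep the output in order.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Token:  # hypothetical structure, for illustration only
    text: str
    start: float
    end: float
    speaker: Optional[int] = None


@dataclass
class SpeakerSegment:  # hypothetical structure, for illustration only
    speaker: int
    start: float
    end: float


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time ranges."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def validate_tokens(
    tokens: List[Token], segments: List[SpeakerSegment]
) -> Tuple[List[Token], List[Token]]:
    """Split tokens into (validated, pending) given the diarization so far."""
    if not segments:
        return [], list(tokens)
    covered_until = max(seg.end for seg in segments)
    validated: List[Token] = []
    pending: List[Token] = []
    for tok in tokens:
        # Assumption: once one token is pending, later tokens stay pending too.
        if pending or tok.end > covered_until:
            pending.append(tok)
            continue
        best = max(segments, key=lambda s: overlap(tok.start, tok.end, s.start, s.end))
        tok.speaker = best.speaker
        validated.append(tok)
    return validated, pending


# Made-up timestamps that mirror the example: tokens 1-4 span 0s-7s, tokens 5-6 span 7s-10s.
tokens = [
    Token("Hello,", 0.0, 1.5), Token("how", 1.5, 3.0), Token("are", 3.0, 5.0),
    Token("you", 5.0, 7.0), Token("doing", 7.0, 8.5), Token("today?", 8.5, 10.0),
]
diarization = [SpeakerSegment(speaker=1, start=0.0, end=8.0)]  # covers up to 8s
validated, pending = validate_tokens(tokens, diarization)
```

With this input, the first four tokens are validated as speaker 1, while "doing" and "today?" remain pending because the diarization has not yet covered their full time range.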
Silence Handling
- Short silences (< 2 seconds): Filtered out, not displayed
- Significant silences (≥ 2 seconds): Displayed as silence segments with `speaker: -2`
- Same speaker across gaps: Segments are merged even if separated by short silences
Before filtering:
[SPK1 0:00-0:03] [SILENCE 0:03-0:04] [SPK1 0:04-0:08]
After filtering (silence < 2s):
[SPK1 0:00-0:08] ← Merged into single segment
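The filtering and merging rules can be sketched as below. This is an illustrative sketch under stated assumptions: the `Segment` class, the `SILENCE` constant, and the `filter_silences` function are hypothetical names, and merging any two adjacent segments with the same speaker is one simple reading of the "same speaker across gaps" rule.

```python
from dataclasses import dataclass
from typing import List

SILENCE = -2  # speaker id used for silence segments, as described above


@dataclass(frozen=True)
class Segment:  # hypothetical structure, for illustration only
    speaker: int
    start: float
    end: float


def filter_silences(segments: List[Segment], min_silence: float = 2.0) -> List[Segment]:
    """Drop silences shorter than min_silence and merge same-speaker neighbours."""
    result: List[Segment] = []
    for seg in segments:
        if seg.speaker == SILENCE and seg.end - seg.start < min_silence:
            continue  # short silence: filtered out, not displayed
        if result and result[-1].speaker == seg.speaker:
            # Same speaker on both sides of the removed gap: extend the previous segment.
            result[-1] = Segment(seg.speaker, result[-1].start, seg.end)
        else:
            result.append(seg)
    return result


before = [
    Segment(1, 0.0, 3.0),
    Segment(SILENCE, 3.0, 4.0),   # 1s silence: dropped
    Segment(1, 4.0, 8.0),
    Segment(SILENCE, 8.0, 11.0),  # 3s silence: kept
    Segment(2, 11.0, 14.0),
]
after = filter_silences(before)
```

Here the two speaker 1 segments collapse into a single 0s-8s segment, while the 3-second silence is kept as its own segment with speaker `-2`.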
Buffer Types
| Buffer | Contains | Displayed When |
|---|---|---|
| `transcription` | Text awaiting validation (more context needed) | Always on last segment |
| `diarization` | Text awaiting speaker assignment | When diarization lags behind transcription |
| `translation` | Translation awaiting validation | When translation is enabled |
Legacy: Punctuation-Based Splitting
The previous approach split segments at punctuation marks and aligned them with diarization at those boundaries. It has been replaced by token-by-token validation, which gives faster, more responsive results.
Historical Examples (for reference)
Example of punctuation-based alignment:
punctuations_segments : __#_______.__________________!____
diarization_segments:
SPK1                    __#____________
SPK2                      #            ___________________
-->
ALIGNED SPK1            __#_______.
ALIGNED SPK2              #        __________________!____
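For contrast, one plausible reading of the legacy behaviour is sketched below: tokens are grouped into sentences at punctuation marks, and each complete sentence goes to the speaker whose diarization segment overlaps it most. The function names, the tuple layouts, and the sample data are hypothetical and are not taken from the project's code.

```python
from typing import List, Tuple

TokenT = Tuple[str, float, float]    # (text, start, end), hypothetical layout
SegmentT = Tuple[int, float, float]  # (speaker, start, end), hypothetical layout

PUNCTUATION = (".", "!", "?")


def split_at_punctuation(tokens: List[TokenT]) -> Tuple[List[List[TokenT]], List[TokenT]]:
    """Group tokens into complete sentences; return (sentences, unfinished tail)."""
    sentences: List[List[TokenT]] = []
    current: List[TokenT] = []
    for tok in tokens:
        current.append(tok)
        if tok[0].endswith(PUNCTUATION):
            sentences.append(current)
            current = []
    return sentences, current


def assign_speakers(
    sentences: List[List[TokenT]], segments: List[SegmentT]
) -> List[Tuple[int, str]]:
    """Give each sentence to the speaker segment with the largest time overlap."""
    aligned = []
    for sent in sentences:
        s_start, s_end = sent[0][1], sent[-1][2]
        best = max(
            segments,
            key=lambda seg: max(0.0, min(s_end, seg[2]) - max(s_start, seg[1])),
        )
        aligned.append((best[0], " ".join(tok[0] for tok in sent)))
    return aligned


# The speaker change at 15s falls mid-sentence; alignment snaps to the period at 10s.
legacy_tokens = [
    ("Hi", 0.0, 4.0), ("there.", 4.0, 10.0), ("How", 10.0, 18.0),
    ("are", 18.0, 26.0), ("you!", 26.0, 30.0), ("Fine", 30.0, 34.0),
]
legacy_segments = [(1, 0.0, 15.0), (2, 15.0, 34.0)]
sentences, tail = split_at_punctuation(legacy_tokens)
aligned = assign_speakers(sentences, legacy_segments)
```

The trailing token "Fine" stays unassigned until a punctuation mark arrives, which illustrates the added latency of this approach.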
With token-by-token validation, the alignment happens continuously rather than at punctuation boundaries.