# Alignment Principles

This document explains how transcription tokens are aligned with diarization (speaker identification) segments.

---

## Token-by-Token Validation

When diarization is enabled, text is validated **token-by-token** rather than waiting for sentence boundaries. As soon as diarization covers a token's time range, that token is validated and assigned to the appropriate speaker.

### How It Works

1. **Transcription produces tokens** with timestamps (start, end)
2. **Diarization produces speaker segments** with timestamps
3. **For each token**: Check whether diarization has caught up to that token's time range
   - If yes → Assign the token to the speaker whose segment overlaps it the most, then validate it
   - If no → Keep the token pending (pending tokens form the `diarization` buffer)

```
Timeline:        0s -------- 5s -------- 10s -------- 15s
                 |           |            |            |
Transcription:   [Hello, how are you doing today?]
                 |_______|___|____|_____|_____|_____|
                   tok1  tok2 tok3 tok4  tok5  tok6

Diarization:     [SPEAKER 1        ][SPEAKER 2        ]
                 |__________________|__________________|
                 0s               8s                  15s

Once diarization has covered up to 8s:
- Tokens 1-4 (0s-7s) → Validated as SPEAKER 1
- Tokens 5-6 (7s-10s) → Kept in the buffer (diarization hasn't caught up yet)
```
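
The steps above can be sketched in Python as follows. This is a minimal illustration, not the project's actual implementation: the `Token` and `SpeakerSegment` classes, the `validate_tokens` function, and the per-token timestamps are all assumptions made for the example. It also assumes tokens arrive in chronological order, so that once one token is pending, every later token stays pending too.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Token:  # hypothetical type, for illustration only
    text: str
    start: float
    end: float
    speaker: Optional[int] = None  # None until the token is validated


@dataclass
class SpeakerSegment:  # hypothetical type, for illustration only
    speaker: int
    start: float
    end: float


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time ranges (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def validate_tokens(
    tokens: List[Token], segments: List[SpeakerSegment]
) -> Tuple[List[Token], List[Token]]:
    """Split tokens into (validated, pending) given the diarization so far."""
    if not segments:
        return [], list(tokens)
    # Diarization has "caught up" to a token once it extends past the token's end.
    covered_until = max(s.end for s in segments)
    validated: List[Token] = []
    pending: List[Token] = []
    for tok in tokens:
        if pending or tok.end > covered_until:
            pending.append(tok)  # stays in the diarization buffer
            continue
        # Pick the speaker segment with the largest time overlap.
        best = max(segments, key=lambda s: overlap(tok.start, tok.end, s.start, s.end))
        tok.speaker = best.speaker
        validated.append(tok)
    return validated, pending


# Timestamps below are made up to match the timeline above.
tokens = [
    Token("Hello,", 0.0, 2.0), Token("how", 2.0, 3.0), Token("are", 3.0, 5.0),
    Token("you", 5.0, 7.0), Token("doing", 7.0, 8.5), Token("today?", 8.5, 10.0),
]
validated, pending = validate_tokens(tokens, [SpeakerSegment(1, 0.0, 8.0)])
print([t.text for t in validated])  # ['Hello,', 'how', 'are', 'you']
print([t.text for t in pending])    # ['doing', 'today?']
```

Calling `validate_tokens` again each time diarization advances would move tokens from the pending list to the validated list as coverage grows.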

---

## Silence Handling

- **Short silences (< 2 seconds)**: Filtered out, not displayed
- **Significant silences (≥ 2 seconds)**: Displayed as silence segments with `speaker: -2`
- **Same speaker across gaps**: Segments are merged even if separated by short silences

```
Before filtering:
[SPK1 0:00-0:03] [SILENCE 0:03-0:04] [SPK1 0:04-0:08]

After filtering (silence < 2s):
[SPK1 0:00-0:08]  ← Merged into single segment
```
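
A rough sketch of these rules in Python is shown below. The `Segment` class and the `filter_silences` function are hypothetical names chosen for the example; only the 2-second threshold and the `-2` speaker value come from the rules above.

```python
from dataclasses import dataclass
from typing import List

SILENCE = -2          # speaker value used for silence segments
MIN_SILENCE_S = 2.0   # silences shorter than this are filtered out


@dataclass
class Segment:  # hypothetical type, for illustration only
    speaker: int
    start: float
    end: float


def filter_silences(segments: List[Segment], min_silence: float = MIN_SILENCE_S) -> List[Segment]:
    """Drop short silences and merge neighbouring segments of the same speaker."""
    out: List[Segment] = []
    for seg in segments:
        if seg.speaker == SILENCE and seg.end - seg.start < min_silence:
            continue  # short silence: not displayed
        if out and out[-1].speaker == seg.speaker:
            # Same speaker as the previous kept segment: extend it across the gap.
            out[-1] = Segment(seg.speaker, out[-1].start, seg.end)
        else:
            out.append(Segment(seg.speaker, seg.start, seg.end))
    return out


before = [Segment(1, 0.0, 3.0), Segment(SILENCE, 3.0, 4.0), Segment(1, 4.0, 8.0)]
print(filter_silences(before))
# [Segment(speaker=1, start=0.0, end=8.0)]
```

With a 3-second silence in the middle instead, the silence segment would be kept and the two `SPK1` segments would stay separate.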

---

## Buffer Types

| Buffer | Contains | Displayed When |
|--------|----------|----------------|
| `transcription` | Text awaiting validation (more context needed) | Always on last segment |
| `diarization` | Text awaiting speaker assignment | When diarization lags behind transcription |
| `translation` | Translation awaiting validation | When translation is enabled |
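
One way to read the "Displayed When" column is as a small decision function. The sketch below is an interpretation of the table only; `visible_buffers` and its parameters are hypothetical and do not correspond to a documented API.

```python
from typing import Dict, List


def visible_buffers(
    buffers: Dict[str, str],   # buffer name -> buffered text (hypothetical shape)
    is_last_segment: bool,
    translation_enabled: bool,
) -> List[str]:
    """Return the names of the buffers a client would display for a segment."""
    shown: List[str] = []
    if is_last_segment:
        shown.append("transcription")   # always shown on the last segment
    if buffers.get("diarization"):
        shown.append("diarization")     # non-empty only while diarization lags
    if translation_enabled:
        shown.append("translation")
    return shown


print(visible_buffers({"transcription": "doing", "diarization": "today?"}, True, False))
# ['transcription', 'diarization']
```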

---

## Legacy: Punctuation-Based Splitting

The previous approach split segments at punctuation marks and aligned them with diarization at those boundaries. It has been replaced by token-by-token validation, which gives faster, more responsive results because it does not wait for a punctuation mark.

### Historical Examples (for reference)

Example of punctuation-based alignment:

```text
punctuations_segments : __#_______.__________________!____
diarization_segments:
SPK1                    __#____________
SPK2                      #            ___________________
-->
ALIGNED SPK1            __#_______.
ALIGNED SPK2              #        __________________!____
```

With token-by-token validation, the alignment happens continuously rather than at punctuation boundaries.