Alignment Principles

This document explains how transcription tokens are aligned with diarization (speaker identification) segments.

Token-by-Token Validation

When diarization is enabled, text is validated token-by-token rather than waiting for sentence boundaries. As soon as diarization covers a token's time range, that token is validated and assigned to the appropriate speaker.

How It Works

1. Transcription produces tokens with timestamps (start, end).
2. Diarization produces speaker segments with timestamps.
3. For each token, check whether diarization has caught up to that token's time range:
   - If yes → find the speaker with maximum overlap and validate the token.
   - If no → keep the token as "pending" (it becomes part of the diarization buffer).

Timeline:        0s -------- 5s -------- 10s -------- 15s
                 |           |            |            |
Transcription:   [Hello, how are you doing today?]
                 |_______|___|____|_____|_____|_____|
                   tok1  tok2 tok3 tok4  tok5  tok6

Diarization:     [SPEAKER 1        ][SPEAKER 2        ]
                 |__________________|__________________|
                 0s               8s                  15s

At time t when diarization covers up to 8s:
- Tokens 1-4 (0s-7s) → Validated as SPEAKER 1
- Tokens 5-6 (7s-10s) → In buffer (diarization hasn't caught up)
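
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual implementation: the names `Token`, `SpeakerSegment`, and `align_tokens` are hypothetical, the individual token boundaries are made up to match the timeline, and the sketch assumes that once one token is pending, every later token stays pending so that text order is preserved.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Token:
    text: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class SpeakerSegment:
    speaker: int
    start: float
    end: float


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time ranges (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def align_tokens(
    tokens: List[Token], segments: List[SpeakerSegment]
) -> Tuple[List[Tuple[Token, int]], List[Token]]:
    """Split tokens into (validated with speaker, pending)."""
    if not segments:
        return [], list(tokens)  # no diarization yet: everything is pending

    # Assumption: diarization has "caught up" to the end of its latest segment.
    covered_until = max(s.end for s in segments)

    validated: List[Tuple[Token, int]] = []
    pending: List[Token] = []
    for tok in tokens:
        # Assumption: once a token is pending, all later tokens stay pending.
        if pending or tok.end > covered_until:
            pending.append(tok)
            continue
        best = max(
            segments,
            key=lambda s: overlap(tok.start, tok.end, s.start, s.end),
        )
        validated.append((tok, best.speaker))
    return validated, pending


# Timeline example: diarization has only covered 0s-8s so far.
tokens = [
    Token("Hello,", 0.0, 2.0), Token("how", 2.0, 3.0), Token("are", 3.0, 5.0),
    Token("you", 5.0, 7.0), Token("doing", 7.0, 8.5), Token("today?", 8.5, 10.0),
]
validated, pending = align_tokens(tokens, [SpeakerSegment(1, 0.0, 8.0)])
print([t.text for t, _ in validated])  # tokens 1-4, all speaker 1
print([t.text for t in pending])       # tokens 5-6, held in the buffer
```

Comparing the token's end time against the covered range is one possible reading of "caught up"; a real implementation might use the token start or a tolerance instead.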

Silence Handling

- Short silences (< 2 seconds): Filtered out, not displayed
- Significant silences (≥ 2 seconds): Displayed as silence segments with speaker: -2
- Same speaker across gaps: Segments are merged even if separated by short silences

Before filtering:
[SPK1 0:00-0:03] [SILENCE 0:03-0:04] [SPK1 0:04-0:08]

After filtering (silence < 2s):
[SPK1 0:00-0:08]  ← Merged into single segment
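
The filtering and merging rule can be illustrated with a short Python sketch. It is an assumption-laden example rather than the project's code: `Segment` and `filter_and_merge` are hypothetical names, and only the 2-second threshold and the `-2` silence speaker value come from the rules above.

```python
from dataclasses import dataclass
from typing import List

SILENCE = -2         # speaker value used for silence segments
MIN_SILENCE_S = 2.0  # silences shorter than this are filtered out


@dataclass
class Segment:
    speaker: int
    start: float  # seconds
    end: float    # seconds


def filter_and_merge(
    segments: List[Segment], min_silence: float = MIN_SILENCE_S
) -> List[Segment]:
    """Drop short silences, then merge consecutive segments of the same speaker."""
    out: List[Segment] = []
    for seg in segments:
        if seg.speaker == SILENCE and (seg.end - seg.start) < min_silence:
            continue  # short silence: not displayed
        if out and out[-1].speaker == seg.speaker:
            # Same speaker on both sides of a dropped gap: extend the previous segment.
            out[-1] = Segment(seg.speaker, out[-1].start, seg.end)
        else:
            out.append(Segment(seg.speaker, seg.start, seg.end))
    return out


# The example above: a 1-second silence between two SPK1 segments.
before = [Segment(1, 0.0, 3.0), Segment(SILENCE, 3.0, 4.0), Segment(1, 4.0, 8.0)]
print(filter_and_merge(before))  # a single SPK1 segment spanning 0.0-8.0
```

With a silence of 2 seconds or more in the middle, the same function keeps all three segments, because the silence segment survives and separates the two speaker segments.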

Buffer Types

| Buffer | Contains | Displayed When |
|--------|----------|----------------|
| `transcription` | Text awaiting validation (more context needed) | Always on last segment |
| `diarization` | Text awaiting speaker assignment | When diarization lags behind transcription |
| `translation` | Translation awaiting validation | When translation is enabled |

Legacy: Punctuation-Based Splitting

The previous approach split segments at punctuation marks and aligned them with diarization at those boundaries. It has been replaced by token-by-token validation, which produces results sooner because text no longer waits for a sentence boundary.

Historical Examples (for reference)

Example of punctuation-based alignment:

punctuations_segments : __#_______.__________________!____
diarization_segments:
SPK1                    __#____________
SPK2                      #            ___________________
-->
ALIGNED SPK1            __#_______.
ALIGNED SPK2              #        __________________!____

With token-by-token validation, the alignment happens continuously rather than at punctuation boundaries.
