CLAUDE.md -- WhisperLiveKit

Build & Test

Install for development:

pip install -e ".[test]"

Test with real audio using TestHarness (requires models + audio files):

import asyncio
from whisperlivekit import TestHarness

async def main():
    async with TestHarness(model_size="base", lan="en", diarization=True) as h:
        await h.feed("audio.wav", speed=1.0)     # feed at real-time
        await h.drain(2.0)                         # let ASR catch up
        h.print_state()                            # see current output

        await h.silence(7.0, speed=1.0)            # 7s silence
        await h.wait_for_silence()                 # verify detection

        result = await h.finish()
        print(f"WER: {result.wer('expected text'):.2%}")
        print(f"Speakers: {result.speakers}")
        print(f"Text at 3s: {result.text_at(3.0)}")

asyncio.run(main())

Architecture

WhisperLiveKit is a real-time speech transcription system using WebSockets.

  • TranscriptionEngine (singleton) loads models once at startup and is shared across all sessions.
  • AudioProcessor is created per WebSocket session. It runs an async producer-consumer pipeline: FFmpeg decodes audio, Silero VAD detects speech, the ASR backend transcribes, and results stream back to the client.
  • Two streaming policies:
    • LocalAgreement (HypothesisBuffer) -- confirms tokens only when consecutive inferences agree (see the sketch after this list).
    • SimulStreaming (AlignAtt attention-based) -- emits tokens as soon as alignment attention is confident.
  • 6 ASR backends: WhisperASR, FasterWhisperASR, MLXWhisper, VoxtralMLX, VoxtralHF, Qwen3.
  • SessionASRProxy wraps the shared ASR with a per-session language override, using a lock to safely swap original_language during transcribe().
  • DiffTracker implements a snapshot-then-diff protocol for bandwidth-efficient incremental WebSocket updates (opt-in via ?mode=diff).
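
The LocalAgreement commit rule is small enough to sketch. The snippet below is illustrative only -- the real HypothesisBuffer also tracks word timestamps and buffer trimming, and committed_prefix is a hypothetical helper name:

def committed_prefix(prev_hypothesis: list[str], new_hypothesis: list[str]) -> list[str]:
    """Tokens confirmed because two consecutive inferences agree on them."""
    confirmed = []
    for prev_tok, new_tok in zip(prev_hypothesis, new_hypothesis):
        if prev_tok != new_tok:
            break                     # first disagreement ends the stable prefix
        confirmed.append(new_tok)
    return confirmed                  # safe to emit; agreed twice in a row

# Two consecutive inferences over a growing audio buffer:
committed_prefix(["the", "cat", "sat"], ["the", "cat", "sat", "on"])
# -> ["the", "cat", "sat"]  ("on" must be re-confirmed by the next inference)

SimulStreaming avoids this wait-for-agreement lag by emitting tokens as soon as the AlignAtt attention is confident, trading the double-confirmation guarantee for lower latency.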

Key Files

File                             Purpose
config.py                        WhisperLiveKitConfig dataclass -- single source of truth for configuration
core.py                          TranscriptionEngine singleton, online_factory(), diarization/translation factories
audio_processor.py               Per-session async pipeline (FFmpeg -> VAD -> ASR -> output)
basic_server.py                  FastAPI server: WebSocket /asr, REST /v1/audio/transcriptions, CLI wlk
timed_objects.py                 ASRToken, Segment, FrontData data structures
diff_protocol.py                 DiffTracker -- snapshot-then-diff WebSocket protocol
session_asr_proxy.py             SessionASRProxy -- thread-safe per-session language wrapper
parse_args.py                    CLI argument parser, returns WhisperLiveKitConfig
test_client.py                   Headless WebSocket test client (wlk-test)
test_harness.py                  In-process testing harness (TestHarness) for real E2E testing
local_agreement/online_asr.py    OnlineASRProcessor for LocalAgreement policy
simul_whisper/                   SimulStreaming policy implementation (AlignAtt)

Key Patterns

  • TranscriptionEngine uses double-checked locking for thread-safe singleton initialization. Never create a second instance in production; use TranscriptionEngine.reset() only in tests, to switch backends.
  • WhisperLiveKitConfig dataclass is the single source of truth. Use from_namespace() (from argparse) or from_kwargs() (programmatic). parse_args() returns a WhisperLiveKitConfig, not a raw Namespace.
  • online_factory() in core.py routes to the correct online processor class based on backend and policy.
  • FrontData.to_dict() is the canonical output format for WebSocket messages.
  • SessionASRProxy uses __getattr__ delegation -- it forwards everything except transcribe() to the wrapped ASR (see the sketch after this list).
  • For backward compatibility with AudioProcessor, the configuration is also exposed as a self.args Namespace on TranscriptionEngine.
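
The delegation pattern behind SessionASRProxy, sketched for illustration (the real class lives in session_asr_proxy.py; names and signatures here are simplified):

import threading

class SessionASRProxy:
    """Per-session wrapper: overrides the language only during transcribe()."""

    def __init__(self, asr, language, lock: threading.Lock):
        self._asr = asr
        self._language = language
        self._lock = lock

    def __getattr__(self, name):
        return getattr(self._asr, name)      # delegate everything else

    def transcribe(self, audio, init_prompt=""):
        with self._lock:                     # no session sees another's override
            saved = self._asr.original_language
            self._asr.original_language = self._language
            try:
                return self._asr.transcribe(audio, init_prompt)
            finally:
                self._asr.original_language = saved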

Adding a New ASR Backend

  1. Create whisperlivekit/my_backend.py with a class implementing:
    • transcribe(audio, init_prompt="") -- run inference on audio array
    • ts_words(result) -- extract timestamped words from result
    • segments_end_ts(result) -- extract segment end timestamps
    • use_vad() -- whether this backend needs external VAD
  2. Set required attributes on the class: sep, original_language, backend_choice, SAMPLING_RATE, confidence_validation, tokenizer, buffer_trimming, buffer_trimming_sec.
  3. Register in core.py:
    • Add an elif branch in TranscriptionEngine._do_init() to instantiate the backend.
    • Add a routing case in online_factory() to return the appropriate online processor.
  4. Add the backend choice to CLI args in parse_args.py.
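
A skeleton covering steps 1 and 2 (the method names and required attributes come from the steps above; the attribute values and docstrings are illustrative, not prescribed):

# whisperlivekit/my_backend.py
class MyBackendASR:
    # Required attributes (step 2); values shown are illustrative.
    sep = " "                        # separator used when joining tokens
    original_language = None
    backend_choice = "my_backend"
    SAMPLING_RATE = 16000
    confidence_validation = False
    tokenizer = None
    buffer_trimming = "segment"
    buffer_trimming_sec = 15

    def transcribe(self, audio, init_prompt=""):
        """Run inference on the audio array; return the backend's raw result."""
        raise NotImplementedError

    def ts_words(self, result):
        """Extract timestamped words (e.g. ASRToken objects) from the result."""
        raise NotImplementedError

    def segments_end_ts(self, result):
        """Extract segment end timestamps from the result."""
        raise NotImplementedError

    def use_vad(self):
        """Whether this backend needs external (Silero) VAD."""
        return True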

Testing with TestHarness

TestHarness wraps AudioProcessor in-process for full pipeline testing without a server.

Key methods:

  • feed(path, speed=1.0) -- feed audio at controlled speed (0 = instant)
  • silence(duration, speed=1.0) -- inject silence (>5s triggers silence detection)
  • drain(seconds) -- wait for ASR to catch up without feeding audio
  • finish(timeout) -- signal end-of-audio, wait for pipeline to drain
  • state -- current TestState with lines, buffers, speakers, timestamps
  • wait_for(predicate) / wait_for_text() / wait_for_silence() / wait_for_speakers(n)
  • snapshot_at(audio_time) -- historical state at a given audio position
  • on_update(callback) -- register callback for each state update
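
For example, event-driven waits avoid guessing sleep durations. In this sketch the audio file, phrase, and speaker count are placeholder values, and the wait_for_text() argument is an assumption:

import asyncio
from whisperlivekit import TestHarness

async def main():
    async with TestHarness(model_size="base", lan="en", diarization=True) as h:
        await h.feed("meeting.wav", speed=0)     # speed=0 feeds instantly
        await h.wait_for_text("good morning")    # block until the phrase appears
        await h.wait_for_speakers(2)             # block until 2 speakers are seen
        early = h.snapshot_at(3.0)               # state as of 3s of audio
        result = await h.finish()

asyncio.run(main())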

TestState provides:

  • text, committed_text -- full or committed-only transcription
  • speakers, n_speakers, has_silence -- speaker/silence info
  • line_at(time_s), speaker_at(time_s), text_at(time_s) -- query by timestamp
  • lines_between(start, end), text_between(start, end) -- query by time range
  • wer(reference), wer_detailed(reference) -- evaluation against ground truth
  • speech_lines, silence_segments -- filtered line lists
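
A short sketch of querying the finished state, continuing the result from finish() in the first example (the timestamps and reference transcript are placeholders):

print(result.committed_text)             # committed-only transcription
print(result.speaker_at(12.5))           # active speaker at 12.5s
print(result.text_between(10.0, 20.0))   # transcript for the 10-20s window
print(result.wer_detailed("placeholder reference transcript"))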

OpenAI-Compatible REST API

The server exposes an OpenAI-compatible batch transcription endpoint:

# Transcribe a file (drop-in replacement for OpenAI)
curl http://localhost:8000/v1/audio/transcriptions \
  -F file=@audio.mp3 \
  -F response_format=verbose_json

# Works with the OpenAI Python client
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
with open("audio.mp3", "rb") as f:
    result = client.audio.transcriptions.create(model="whisper-1", file=f)
print(result.text)

Supported response_format values: json, verbose_json, text, srt, vtt. The model parameter is accepted but ignored (uses the server's configured backend).
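
For instance, SRT subtitles can be requested through the same Python client shown above (a sketch; response_format is forwarded to the endpoint):

with open("audio.mp3", "rb") as f:
    srt = client.audio.transcriptions.create(
        model="whisper-1", file=f, response_format="srt"
    )
print(srt)  # SRT-formatted subtitle text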

Do NOT

  • Do not create a second TranscriptionEngine instance. It is a singleton; the constructor returns the existing instance after the first call.
  • Do not modify original_language on the shared ASR directly. Use SessionASRProxy for per-session language overrides.
  • Do not assume the frontend handles diff protocol messages. Diff mode is opt-in (?mode=diff) and ignored by default.
  • Do not write mock-based unit tests. Use TestHarness with real audio for pipeline testing.