# CLAUDE.md -- WhisperLiveKit
## Build & Test

Install for development:

```bash
pip install -e ".[test]"
```
Test with real audio using TestHarness (requires models + audio files):

```python
import asyncio
from whisperlivekit import TestHarness

async def main():
    async with TestHarness(model_size="base", lan="en", diarization=True) as h:
        await h.feed("audio.wav", speed=1.0)  # feed at real-time speed
        await h.drain(2.0)                    # let ASR catch up
        h.print_state()                       # see current output
        await h.silence(7.0, speed=1.0)       # inject 7 s of silence
        await h.wait_for_silence()            # verify silence detection
        result = await h.finish()
        print(f"WER: {result.wer('expected text'):.2%}")
        print(f"Speakers: {result.speakers}")
        print(f"Text at 3s: {result.text_at(3.0)}")

asyncio.run(main())
```
## Architecture
WhisperLiveKit is a real-time speech transcription system built on WebSockets.

- `TranscriptionEngine` (singleton) loads models once at startup and is shared across all sessions.
- `AudioProcessor` is created per WebSocket session. It runs an async producer-consumer pipeline: FFmpeg decodes audio, Silero VAD detects speech, the ASR backend transcribes, and results stream back to the client.
- Two streaming policies:
  - LocalAgreement (`HypothesisBuffer`) -- confirms tokens only when consecutive inferences agree.
  - SimulStreaming (AlignAtt, attention-based) -- emits tokens as soon as the alignment attention is confident.
- 6 ASR backends: WhisperASR, FasterWhisperASR, MLXWhisper, VoxtralMLX, VoxtralHF, Qwen3.
- `SessionASRProxy` wraps the shared ASR with a per-session language override, using a lock to safely swap `original_language` during `transcribe()`.
- `DiffTracker` implements a snapshot-then-diff protocol for bandwidth-efficient incremental WebSocket updates, opt-in via `?mode=diff` (sketched below).
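A minimal sketch of the snapshot-then-diff idea, assuming dict-shaped state payloads; the class and field names here are illustrative, not the real `DiffTracker` API in `diff_protocol.py`:

```python
# Minimal sketch of snapshot-then-diff: remember the last snapshot sent,
# emit only the keys whose values changed. Names are illustrative;
# see diff_protocol.py for the real DiffTracker.
from dataclasses import dataclass, field

@dataclass
class DiffSketch:
    _last: dict = field(default_factory=dict)

    def update(self, snapshot: dict) -> dict:
        """Return only the entries that changed since the last snapshot."""
        diff = {k: v for k, v in snapshot.items() if self._last.get(k) != v}
        self._last = dict(snapshot)
        return diff

tracker = DiffSketch()
tracker.update({"lines": ["hello"], "buffer": "wor"})    # first call: full payload
tracker.update({"lines": ["hello"], "buffer": "world"})  # -> {"buffer": "world"}
```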
## Key Files
| File | Purpose |
|---|---|
| `config.py` | `WhisperLiveKitConfig` dataclass -- single source of truth for configuration |
| `core.py` | `TranscriptionEngine` singleton, `online_factory()`, diarization/translation factories |
| `audio_processor.py` | Per-session async pipeline (FFmpeg -> VAD -> ASR -> output) |
| `basic_server.py` | FastAPI server: WebSocket `/asr`, REST `/v1/audio/transcriptions`, CLI `wlk` |
| `timed_objects.py` | `ASRToken`, `Segment`, `FrontData` data structures |
| `diff_protocol.py` | `DiffTracker` -- snapshot-then-diff WebSocket protocol |
| `session_asr_proxy.py` | `SessionASRProxy` -- thread-safe per-session language wrapper |
| `parse_args.py` | CLI argument parser, returns `WhisperLiveKitConfig` |
| `test_client.py` | Headless WebSocket test client (`wlk-test`) |
| `test_harness.py` | In-process testing harness (`TestHarness`) for real E2E testing |
| `local_agreement/online_asr.py` | `OnlineASRProcessor` for the LocalAgreement policy |
| `simul_whisper/` | SimulStreaming policy implementation (AlignAtt) |
## Key Patterns
- `TranscriptionEngine` uses double-checked locking for thread-safe singleton initialization. Never create a second instance in production. Use `TranscriptionEngine.reset()` in tests only, to switch backends.
- `WhisperLiveKitConfig` dataclass is the single source of truth. Use `from_namespace()` (from argparse) or `from_kwargs()` (programmatic). `parse_args()` returns a `WhisperLiveKitConfig`, not a raw `Namespace`.
- `online_factory()` in `core.py` routes to the correct online processor class based on backend and policy.
- `FrontData.to_dict()` is the canonical output format for WebSocket messages.
- `SessionASRProxy` uses `__getattr__` delegation -- it forwards everything except `transcribe()` to the wrapped ASR (see the sketch after this list).
- The server exposes `self.args` as a `Namespace` on `TranscriptionEngine` for backward compatibility with `AudioProcessor`.
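A minimal sketch of the `__getattr__` delegation plus the locked language swap described above, assuming attribute names like `original_language` from this document; the real implementation lives in `session_asr_proxy.py`:

```python
import threading

class ProxySketch:
    """Simplified sketch of SessionASRProxy-style delegation.

    Lock handling and attribute names here are assumptions; see
    session_asr_proxy.py for the real class.
    """
    def __init__(self, shared_asr, language):
        self._asr = shared_asr
        self._language = language
        self._lock = threading.Lock()  # serializes language swaps

    def __getattr__(self, name):
        # Called only for attributes not found on the proxy itself,
        # so everything except transcribe() is forwarded untouched.
        return getattr(self._asr, name)

    def transcribe(self, audio, init_prompt=""):
        # Swap the shared ASR's language only while this session runs
        # inference, then restore it, so concurrent sessions never see
        # each other's override.
        with self._lock:
            previous = self._asr.original_language
            self._asr.original_language = self._language
            try:
                return self._asr.transcribe(audio, init_prompt=init_prompt)
            finally:
                self._asr.original_language = previous
```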
## Adding a New ASR Backend
- Create `whisperlivekit/my_backend.py` with a class implementing (a skeleton follows this list):
  - `transcribe(audio, init_prompt="")` -- run inference on an audio array
  - `ts_words(result)` -- extract timestamped words from the result
  - `segments_end_ts(result)` -- extract segment end timestamps
  - `use_vad()` -- whether this backend needs external VAD
- Set the required attributes on the class: `sep`, `original_language`, `backend_choice`, `SAMPLING_RATE`, `confidence_validation`, `tokenizer`, `buffer_trimming`, `buffer_trimming_sec`.
- Register in `core.py`:
  - Add an `elif` branch in `TranscriptionEngine._do_init()` to instantiate the backend.
  - Add a routing case in `online_factory()` to return the appropriate online processor.
- Add the backend choice to the CLI args in `parse_args.py`.
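A skeleton for the first two steps, with the method names and required attributes taken from the list above; the attribute values and return shapes shown are illustrative assumptions, not the exact contract:

```python
# whisperlivekit/my_backend.py -- hedged skeleton of a new ASR backend.
# Method names and required attributes come from the list above; the
# values and return shapes below are illustrative assumptions.

class MyBackendASR:
    sep = " "
    original_language = "en"
    backend_choice = "my_backend"
    SAMPLING_RATE = 16000
    confidence_validation = False
    tokenizer = None
    buffer_trimming = "segment"
    buffer_trimming_sec = 15.0

    def transcribe(self, audio, init_prompt=""):
        # Run inference on the raw audio array; return whatever your
        # model produces, as long as ts_words()/segments_end_ts() can
        # consume it.
        raise NotImplementedError

    def ts_words(self, result):
        # Extract timestamped words, e.g. [(start_s, end_s, "word"), ...].
        raise NotImplementedError

    def segments_end_ts(self, result):
        # Extract segment end timestamps, e.g. [1.2, 3.4, ...].
        raise NotImplementedError

    def use_vad(self):
        # Whether this backend needs external (Silero) VAD.
        return True
```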
## Testing with TestHarness

`TestHarness` wraps `AudioProcessor` in-process for full pipeline testing without a server.

Key methods:

- `feed(path, speed=1.0)` -- feed audio at controlled speed (0 = instant)
- `silence(duration, speed=1.0)` -- inject silence (>5s triggers silence detection)
- `drain(seconds)` -- wait for ASR to catch up without feeding audio
- `finish(timeout)` -- signal end-of-audio, wait for the pipeline to drain
- `state` -- current `TestState` with lines, buffers, speakers, timestamps
- `wait_for(predicate)` / `wait_for_text()` / `wait_for_silence()` / `wait_for_speakers(n)`
- `snapshot_at(audio_time)` -- historical state at a given audio position
- `on_update(callback)` -- register a callback for each state update

`TestState` provides:

- `text`, `committed_text` -- full or committed-only transcription
- `speakers`, `n_speakers`, `has_silence` -- speaker/silence info
- `line_at(time_s)`, `speaker_at(time_s)`, `text_at(time_s)` -- query by timestamp
- `lines_between(start, end)`, `text_between(start, end)` -- query by time range
- `wer(reference)`, `wer_detailed(reference)` -- evaluation against ground truth
- `speech_lines`, `silence_segments` -- filtered line lists
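A short usage sketch combining these query methods; the audio file and reference transcript are placeholders:

```python
# Hedged sketch of the TestState query methods listed above;
# "meeting.wav" and the reference string are placeholders.
import asyncio
from whisperlivekit import TestHarness

async def inspect():
    async with TestHarness(model_size="base", lan="en") as h:
        await h.feed("meeting.wav", speed=0)     # 0 = feed instantly
        result = await h.finish()
        print(result.text_at(3.0))               # what was said around 3 s
        print(result.lines_between(10.0, 20.0))  # lines in a time window
        print(result.wer_detailed("expected reference transcript"))

asyncio.run(inspect())
```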
## OpenAI-Compatible REST API

The server exposes an OpenAI-compatible batch transcription endpoint:
```bash
# Transcribe a file (drop-in replacement for OpenAI)
curl http://localhost:8000/v1/audio/transcriptions \
  -F file=@audio.mp3 \
  -F response_format=verbose_json
```
```python
# Works with the OpenAI Python client
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
result = client.audio.transcriptions.create(model="whisper-1", file=open("audio.mp3", "rb"))
print(result.text)
```
Supported `response_format` values: `json`, `verbose_json`, `text`, `srt`, `vtt`.
The `model` parameter is accepted but ignored (the server's configured backend is used).
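For the non-JSON formats, a hedged sketch via the same OpenAI client; that the SDK hands back the raw SRT string here is an assumption to verify:

```python
# Hedged sketch: requesting SRT output through the OpenAI client.
# response_format maps to the form field shown in the curl example;
# the raw-string return type is an assumption to verify.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
srt = client.audio.transcriptions.create(
    model="whisper-1",               # accepted but ignored by the server
    file=open("audio.mp3", "rb"),
    response_format="srt",
)
print(srt)
```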
## Do NOT

- Do not create a second `TranscriptionEngine` instance. It is a singleton; the constructor returns the existing instance after the first call.
- Do not modify `original_language` on the shared ASR directly. Use `SessionASRProxy` for per-session language overrides.
- Do not assume the frontend handles diff protocol messages. Diff mode is opt-in (`?mode=diff`) and ignored by default.
- Do not write mock-based unit tests. Use `TestHarness` with real audio for pipeline testing.