Models and Model Paths

Defaults

Default Whisper Model: base
When no model is specified, WhisperLiveKit uses the base model, which provides a good balance of speed and accuracy for most use cases.

Default Model Cache Directory: ~/.cache/whisper
Models are downloaded automatically from OpenAI and cached in this directory. You can override the location with --model_cache_dir.

Default Translation Model: 600M (NLLB-200-distilled)
When translation is enabled, the 600M distilled NLLB model is used by default. This provides good quality with minimal resource usage.

Default Translation Backend: transformers
The translation backend defaults to Transformers. On Apple Silicon, this automatically uses MPS acceleration for better performance.
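
Both defaults are easy to inspect locally. A minimal sketch in Python (assuming torch is installed; the device-selection order shown is an illustration, not WhisperLiveKit's exact logic):

```python
import os

import torch

# Default Whisper model cache (override with --model_cache_dir).
cache_dir = os.path.expanduser("~/.cache/whisper")
print(f"Whisper cache directory: {cache_dir}")

# Check whether the Transformers translation backend could use Apple's MPS
# accelerator; otherwise fall back to CUDA or CPU.
if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"
print(f"Translation would run on: {device}")
```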


Available Whisper model sizes:

| Model | Speed | Accuracy | Multilingual | Translation | Hardware Requirements | Best Use Case |
| --- | --- | --- | --- | --- | --- | --- |
| tiny(.en) | Fastest | Basic | Yes/No | Yes/No | ~1GB VRAM | Real-time, low resources |
| base(.en) | Fast | Good | Yes/No | Yes/No | ~1GB VRAM | Balanced performance |
| small(.en) | Medium | Better | Yes/No | Yes/No | ~2GB VRAM | Quality on limited hardware |
| medium(.en) | Slow | High | Yes/No | Yes/No | ~5GB VRAM | High quality, moderate resources |
| large-v2 | Slowest | Excellent | Yes | Yes | ~10GB VRAM | Good overall accuracy & language support |
| large-v3 | Slowest | Excellent | Yes | Yes | ~10GB VRAM | Best overall accuracy & language support |
| large-v3-turbo | Fast | Excellent | Yes | No | ~6GB VRAM | Fast, high-quality transcription |

Yes/No entries read as: Yes for the multilingual variant, No for the .en variant.

How to choose?

Language Support

  • English only: Use the .en models (e.g. base.en) for better accuracy and faster processing when you only need English transcription
  • Multilingual: Use the standard models (no .en suffix) for any non-English or mixed-language audio

Special Cases

  • No translation needed: Use large-v3-turbo
    • Same transcription quality as large-v2, but significantly faster
    • Important: it does not translate correctly; it only transcribes

Additional Considerations

Model Performance:

  • Accuracy improves significantly from tiny to large models
  • English-only models are ~10-15% more accurate for English audio
  • Newer versions (v2, v3) have better punctuation and formatting

Audio Quality Impact:

  • Clean, clear audio: smaller models may suffice
  • Noisy, accented, or technical audio: larger models recommended
  • Phone/low-quality audio: use at least small model
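
To tie the table and the considerations above together, here is a hypothetical helper (not part of WhisperLiveKit) that picks the most capable model fitting a VRAM budget, using the approximate figures from the Hardware Requirements column:

```python
# Hypothetical helper: pick the most capable Whisper model that fits a given
# VRAM budget. Thresholds are the approximate figures from the table above.
MODELS_BY_CAPABILITY = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large-v3-turbo", 6),
    ("large-v3", 10),
]

def pick_model(vram_gb: float) -> str:
    best = "tiny"
    for name, required_gb in MODELS_BY_CAPABILITY:
        if required_gb <= vram_gb:
            best = name
    return best

print(pick_model(8))  # -> "large-v3-turbo"
```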

Custom Models:

The --model-path parameter accepts:

File Path

  • A local checkpoint in .pt / .bin / .safetensors format, loadable by PyTorch / safetensors.

Must contain:

  • a .pt / .bin / .safetensors file (required, used for the decoder)

May optionally contain:

  • a .bin file - a faster-whisper model used for the encoder (requires faster-whisper)
  • weights.npz or weights.safetensors - weights for the encoder (requires whisper-mlx)
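
As a sanity check before pointing --model-path at a local file or folder, here is a sketch (a hypothetical helper, not WhisperLiveKit API) that inspects a path against the layout above:

```python
from pathlib import Path

def inspect_model_path(path: str) -> None:
    """Hypothetical check of a --model-path target against the layout above."""
    p = Path(path)
    files = [f for f in p.iterdir() if f.is_file()] if p.is_dir() else [p]

    decoders = [f for f in files if f.suffix in (".pt", ".bin", ".safetensors")]
    if not decoders:
        raise FileNotFoundError(
            "no .pt / .bin / .safetensors checkpoint (required for the decoder)"
        )

    has_fw_encoder = any(f.suffix == ".bin" for f in files)
    has_mlx_encoder = any(
        f.name in ("weights.npz", "weights.safetensors") for f in files
    )
    print("decoder checkpoint(s):", [f.name for f in decoders])
    print("faster-whisper encoder (.bin):", has_fw_encoder)
    print("whisper-mlx encoder (weights.*):", has_mlx_encoder)

inspect_model_path("path/to/your/model")
```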

Hugging Face Repo ID

  • Provide the repo ID (e.g. openai/whisper-large-v3) and WhisperLiveKit will download and cache the snapshot automatically. For gated repos, authenticate via huggingface-cli login first.
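
If you prefer to pre-fetch a repo yourself (for example to warm the cache on a build machine), huggingface_hub offers the same snapshot mechanism directly. A sketch, assuming huggingface_hub is installed:

```python
from huggingface_hub import snapshot_download

# Download and cache the full repo snapshot; returns the local path.
local_dir = snapshot_download(repo_id="openai/whisper-large-v3")
print(f"Snapshot cached at: {local_dir}")
```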

To improve speed and reduce hallucinations, you can run scripts/determine_alignment_heads.py to determine the alignment heads for your model, then pass them to WLK with --custom-alignment-heads. If you don't, the alignment heads default to all heads in the last half of the decoder layers.
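
To make the default concrete, here is a small illustration of "all heads in the last half of the decoder layers" (the layer and head counts below match Whisper large-v3; WLK's internal representation may differ):

```python
# Illustration of the default alignment heads: every attention head in the
# last half of the decoder layers. Geometry shown is Whisper large-v3.
n_decoder_layers = 32
n_heads = 20

default_heads = [
    (layer, head)
    for layer in range(n_decoder_layers // 2, n_decoder_layers)
    for head in range(n_heads)
]
print(len(default_heads), "(layer, head) pairs")  # 320 for this geometry
```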


Translation Models and Backend

Language Support: ~200 languages

Distilled Model Sizes Available

| Model | Size (disk) | Parameters | VRAM (FP16) | VRAM (INT8) | Quality |
| --- | --- | --- | --- | --- | --- |
| 600M | 2.46 GB | 600M | ~1.5GB | ~800MB | Good, understandable |
| 1.3B | 5.48 GB | 1.3B | ~3GB | ~1.5GB | Better accuracy, context |

Quality Impact: the 1.3B model scores ~15-25% higher BLEU than the 600M model across language pairs.
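
For reference, the default 600M model can be exercised directly through the Transformers backend. A sketch assuming the facebook/nllb-200-distilled-600M checkpoint and NLLB's FLORES-200 language codes:

```python
from transformers import pipeline

# Load NLLB-200 distilled 600M and translate English to French.
translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="fra_Latn",
)
print(translator("The meeting starts in five minutes.")[0]["translation_text"])
```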

Backend Performance

| Backend | Speed vs. Baseline | Memory Usage | Quality Loss |
| --- | --- | --- | --- |
| CTranslate2 | 6-10x faster | 40-60% less | ~5% BLEU drop |
| Transformers | Baseline | High | None |
| Transformers + MPS (Apple Silicon) | 2x faster | Medium | None |

Metrics:

  • CTranslate2: 50-100+ tokens/sec
  • Transformers: 10-30 tokens/sec
  • Apple Silicon with MPS: Up to 2x faster than CTranslate2
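
Put together, a reasonable rule of thumb is: CTranslate2 when raw throughput matters and a ~5% BLEU drop is acceptable, Transformers when quality must be lossless, and Transformers with MPS on Apple Silicon. A hypothetical selection sketch along those lines:

```python
import importlib.util

import torch

def choose_backend(prefer_speed: bool = True) -> str:
    """Hypothetical backend choice following the trade-offs above."""
    on_mps = torch.backends.mps.is_available()
    has_ct2 = importlib.util.find_spec("ctranslate2") is not None
    # On Apple Silicon, Transformers + MPS can outrun CTranslate2 (see above).
    if prefer_speed and has_ct2 and not on_mps:
        return "ctranslate2"  # 6-10x faster, ~5% BLEU drop
    return "transformers"     # lossless quality; MPS-accelerated where available

print(choose_backend())
```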