WhisperLiveKit/available_models.md
Quentin Fuxa 1833e7c921 0.2.10
2025-09-16 23:45:00 +02:00


Available Whisper model sizes:

  • tiny.en (english only)
  • tiny
  • base.en (english only)
  • base
  • small.en (english only)
  • small
  • medium.en (english only)
  • medium
  • large-v1
  • large-v2
  • large-v3
  • large-v3-turbo

How to choose?

Language Support

  • English only: Use .en models for better accuracy and faster processing when you only need English transcription
  • Multilingual: Use the standard models (without the .en suffix)
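As a minimal sketch of the naming rule above, the helper below (hypothetical — not part of WhisperLiveKit's API) appends the .en suffix only for the sizes that actually have an English-only variant:

```python
# Hypothetical helper (not part of WhisperLiveKit): build a Whisper model
# name from a size and whether an English-only variant is wanted.
# Only tiny/base/small/medium ship .en variants; the large family does not.
ENGLISH_ONLY_SIZES = {"tiny", "base", "small", "medium"}

def model_name(size: str, english_only: bool = False) -> str:
    """Return the Whisper model identifier for a given size."""
    if english_only:
        if size not in ENGLISH_ONLY_SIZES:
            raise ValueError(f"no .en variant exists for {size!r}")
        return f"{size}.en"
    return size
```

For example, `model_name("small", english_only=True)` yields `"small.en"`, while asking for an .en variant of `"large-v3"` raises an error, since none exists.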

Resource Constraints

  • Limited GPU/CPU or need for very low latency: Choose small or smaller models
    • tiny: Fastest, lowest resource usage, acceptable quality for simple audio
    • base: Good balance of speed and accuracy for basic use cases
    • small: Better accuracy while still being resource-efficient
  • Good resources available: Use large models for best accuracy
    • large-v2: Excellent accuracy, good multilingual support
    • large-v3: Best overall accuracy and language support

Special Cases

  • No translation needed: Use large-v3-turbo
    • Same transcription quality as large-v2 but significantly faster
    • Important: Does not translate correctly, only transcribes

Model Comparison Table

| Model | Speed | Accuracy | Multilingual | Translation | Best Use Case |
|---|---|---|---|---|---|
| tiny(.en) | Fastest | Basic | Yes/No | Yes/No | Real-time, low resources |
| base(.en) | Fast | Good | Yes/No | Yes/No | Balanced performance |
| small(.en) | Medium | Better | Yes/No | Yes/No | Quality on limited hardware |
| medium(.en) | Slow | High | Yes/No | Yes/No | High quality, moderate resources |
| large-v2 | Slowest | Excellent | Yes | Yes | Best overall quality |
| large-v3 | Slowest | Excellent | Yes | Yes | Maximum accuracy |
| large-v3-turbo | Fast | Excellent | Yes | No | Fast, high-quality transcription |

Additional Considerations

Model Performance:

  • Accuracy improves significantly from tiny to large models
  • English-only models are ~10-15% more accurate for English audio
  • Newer versions (v2, v3) have better punctuation and formatting

Hardware Requirements:

  • tiny: ~1GB VRAM
  • base: ~1GB VRAM
  • small: ~2GB VRAM
  • medium: ~5GB VRAM
  • large: ~10GB VRAM
  • large-v3-turbo: ~6GB VRAM

Audio Quality Impact:

  • Clean, clear audio: smaller models may suffice
  • Noisy, accented, or technical audio: larger models recommended
  • Phone/low-quality audio: use at least small model

Quick Decision Tree

  1. English only? → Add .en to your choice (available for tiny through medium only)
  2. Limited resources or need speed? → small or smaller
  3. Good hardware and want best quality? → large-v3
  4. Need fast, high-quality transcription without translation? → large-v3-turbo
  5. Need translation capabilities? → large-v2 or large-v3 (avoid turbo)
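The decision tree above can be sketched as a small function. The names and structure here are illustrative (not part of WhisperLiveKit); it simply encodes the five steps:

```python
def pick_model(english_only: bool, limited_resources: bool,
               need_translation: bool) -> str:
    """Encode the quick decision tree above (illustrative only)."""
    if limited_resources:
        model = "small"            # step 2: small or smaller
    elif need_translation:
        model = "large-v3"         # step 5: avoid turbo for translation
    else:
        model = "large-v3-turbo"   # step 4: fast, transcription-only
    # Step 1: .en variants exist only for tiny through medium.
    if english_only and model in {"tiny", "base", "small", "medium"}:
        model += ".en"
    return model
```

So an English-only setup on limited hardware lands on `small.en`, while a multilingual setup needing translation lands on `large-v3`.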

Translation Models and Backend

Language Support: ~200 languages

Distilled Model Sizes Available

| Model | Size on Disk | Parameters | VRAM (FP16) | VRAM (INT8) | Quality |
|---|---|---|---|---|---|
| 600M | 2.46 GB | 600M | ~1.5GB | ~800MB | Good, understandable |
| 1.3B | 5.48 GB | 1.3B | ~3GB | ~1.5GB | Better accuracy, context |

Quality Impact: 1.3B has ~15-25% better BLEU scores vs 600M across language pairs.

Backend Performance

| Backend | Speed vs Baseline | Memory Usage | Quality Loss |
|---|---|---|---|
| CTranslate2 | 6-10x faster | 40-60% less | ~5% BLEU drop |
| Transformers | Baseline | High | None |
| Transformers + MPS (Apple Silicon) | 2x faster | Medium | None |

Metrics:

  • CTranslate2: 50-100+ tokens/sec
  • Transformers: 10-30 tokens/sec
  • Apple Silicon with MPS: Up to 2x faster than CTranslate2
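To make the throughput figures concrete, here is a rough back-of-the-envelope estimate. The ~1 token per output word ratio is an assumption, not from the source; the token rates are the midpoints of the ranges above:

```python
def seconds_to_translate(words: int, tokens_per_sec: float) -> float:
    """Rough latency estimate, assuming ~1 token per output word."""
    return words / tokens_per_sec

# A 100-word utterance:
ct2_secs = seconds_to_translate(100, 75)  # CTranslate2 at ~75 tok/s
hf_secs = seconds_to_translate(100, 20)   # Transformers at ~20 tok/s
```

Under these assumptions, CTranslate2 finishes the 100-word utterance in roughly 1.3 seconds versus about 5 seconds for the Transformers baseline, which is why it is the low-latency choice off Apple Silicon.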

Quick Decision Matrix

  • Choose 600M: limited resources, need near-zero lag
  • Choose 1.3B: quality matters more than latency
  • Choose Transformers: running on Apple Silicon (with MPS)