diff --git a/README.md b/README.md
index aa2d9d8..66698d1 100644
--- a/README.md
+++ b/README.md
@@ -8,7 +8,7 @@
@@ -92,10 +92,10 @@ See **Parameters & Configuration** below on how to use them.
 Start the transcription server with various options:
 
 ```bash
-# SimulStreaming backend for ultra-low latency
-whisperlivekit-server --backend simulstreaming --model large-v3
+# Use a better model than the default (small)
+whisperlivekit-server --model large-v3
 
-# Advanced configuration with diarization
+# Advanced configuration with diarization and language
 whisperlivekit-server --host 0.0.0.0 --port 8000 --model medium --diarization --language fr
 ```
 
@@ -146,6 +146,16 @@ The package includes an HTML/JavaScript implementation [here](https://github.com
 ### ⚙️ Parameters & Configuration
 
+Many parameters can be changed, but which ones *should* you change?
+- the `--model` size. List and recommendations [here](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/available_models.md)
+- the `--language`. List [here](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/simul_whisper/whisper/tokenizer.py)
+- the `--backend`: switch to `--backend faster-whisper` if `simulstreaming` does not work correctly or if you prefer to avoid its dual-license requirements
+- `--warmup-file`, if you have one
+- `--host`, `--port`, `--ssl-certfile`, `--ssl-keyfile`, if you set up a server
+- `--diarization`, if you want to use it
+
+I don't recommend changing the rest, but all the options are listed below.
+
 | Parameter | Description | Default |
 |-----------|-------------|---------|
 | `--model` | Whisper model size. | `small` |
@@ -187,7 +197,6 @@ The package includes an HTML/JavaScript implementation [here](https://github.com
 |-----------|-------------|---------|
 | `--diarization` | Enable speaker identification | `False` |
 | `--diarization-backend` | `diart` or `sortformer` | `sortformer` |
-| `--punctuation-split` | Use punctuation to improve speaker boundaries | `True` |
 | `--segmentation-model` | Hugging Face model ID for Diart segmentation model. [Available models](https://github.com/juanmc2005/diart/tree/main?tab=readme-ov-file#pre-trained-models) | `pyannote/segmentation-3.0` |
 | `--embedding-model` | Hugging Face model ID for Diart embedding model. [Available models](https://github.com/juanmc2005/diart/tree/main?tab=readme-ov-file#pre-trained-models) | `speechbrain/spkrec-ecapa-voxceleb` |
diff --git a/available_models.md b/available_models.md
new file mode 100644
index 0000000..cac0850
--- /dev/null
+++ b/available_models.md
@@ -0,0 +1,72 @@
+# Available model sizes
+
+- tiny.en (English only)
+- tiny
+- base.en (English only)
+- base
+- small.en (English only)
+- small
+- medium.en (English only)
+- medium
+- large-v1
+- large-v2
+- large-v3
+- large-v3-turbo
+
+## How to choose?
+
+### Language Support
+- **English only**: Use `.en` models for better accuracy and faster processing when you only need English transcription
+- **Multilingual**: Do not use `.en` models.
+
+### Resource Constraints
+- **Limited GPU/CPU or need for very low latency**: Choose `small` or smaller models
+  - `tiny`: Fastest, lowest resource usage, acceptable quality for simple audio
+  - `base`: Good balance of speed and accuracy for basic use cases
+  - `small`: Better accuracy while still being resource-efficient
+- **Good resources available**: Use `large` models for best accuracy
+  - `large-v2`: Excellent accuracy, good multilingual support
+  - `large-v3`: Best overall accuracy and language support
+
+### Special Cases
+- **No translation needed**: Use `large-v3-turbo`
+  - Same transcription quality as `large-v2` but significantly faster
+  - **Important**: Does not translate correctly, only transcribes
+
+### Model Comparison Table
+
+| Model | Speed | Accuracy | Multilingual | Translation | Best Use Case |
+|-------|-------|----------|--------------|-------------|---------------|
+| tiny(.en) | Fastest | Basic | Yes/No | Yes/No | Real-time, low resources |
+| base(.en) | Fast | Good | Yes/No | Yes/No | Balanced performance |
+| small(.en) | Medium | Better | Yes/No | Yes/No | Quality on limited hardware |
+| medium(.en) | Slow | High | Yes/No | Yes/No | High quality, moderate resources |
+| large-v2 | Slowest | Excellent | Yes | Yes | Best overall quality |
+| large-v3 | Slowest | Excellent | Yes | Yes | Maximum accuracy |
+| large-v3-turbo | Fast | Excellent | Yes | No | Fast, high-quality transcription |
+
+### Additional Considerations
+
+**Model Performance**:
+- Accuracy improves significantly from tiny to large models
+- English-only models are ~10-15% more accurate for English audio
+- Newer versions (v2, v3) have better punctuation and formatting
+
+**Hardware Requirements**:
+- `tiny`: ~1GB VRAM
+- `base`: ~1GB VRAM
+- `small`: ~2GB VRAM
+- `medium`: ~5GB VRAM
+- `large`: ~10GB VRAM
+
+**Audio Quality Impact**:
+- Clean, clear audio: smaller models may suffice
+- Noisy, accented, or technical audio: larger models recommended
+- Phone/low-quality audio: use at least the `small` model
+
+### Quick Decision Tree
+1. English only? → Add `.en` to your choice
+2. Limited resources or need speed? → `small` or smaller
+3. Good hardware and want best quality? → `large-v3`
+4. Need fast, high-quality transcription without translation? → `large-v3-turbo`
+5. Need translation capabilities? → `large-v2` or `large-v3` (avoid turbo)
\ No newline at end of file
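The quick decision tree in the new `available_models.md` maps cleanly to a few lines of code. The sketch below is purely illustrative — the `choose_model` helper is hypothetical, not part of WhisperLiveKit — and it assumes, as in the list above, that `.en` variants exist only up to `medium`:

```python
def choose_model(english_only: bool, low_resources: bool, need_translation: bool) -> str:
    """Pick a Whisper model size following the quick decision tree above."""
    if low_resources:
        size = "small"            # step 2: limited resources -> small or smaller
    elif need_translation:
        size = "large-v3"         # step 5: translation -> large-v2/v3, avoid turbo
    else:
        size = "large-v3-turbo"   # step 4: fast, high-quality transcription only
    if english_only and size in {"tiny", "base", "small", "medium"}:
        size += ".en"             # step 1: .en variants exist up to medium only
    return size

print(choose_model(english_only=True, low_resources=True, need_translation=False))
# -> small.en
```

The result plugs straight into the server flag, e.g. `whisperlivekit-server --model small.en`.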