diff --git a/README.md b/README.md
index 30e21a6..6662ac0 100644
--- a/README.md
+++ b/README.md
@@ -13,32 +13,32 @@ License

-## 🚀 Overview
+## Overview

 This project is based on [WhisperStreaming](https://github.com/ufal/whisper_streaming) and [SimulStreaming](https://github.com/ufal/SimulStreaming), allowing you to transcribe audio directly from your browser. WhisperLiveKit provides a complete backend solution for real-time speech transcription with a functional, simple and customizable frontend. Everything runs locally on your machine ✨

-### 🔄 Architecture
+### Architecture

 WhisperLiveKit consists of three main components:

-- **Frontend**: A basic html + JS interface that captures microphone audio and streams it to the backend via WebSockets. You can use and adapt the provided template at [whisperlivekit/web/live_transcription.html](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/web/live_transcription.html).
+- **Frontend**: A basic HTML + JS interface that captures microphone audio and streams it to the backend via WebSockets. You can use and adapt the [provided template](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/web/live_transcription.html).
 - **Backend (Web Server)**: A FastAPI-based WebSocket server that receives streamed audio data, processes it in real time, and returns transcriptions to the frontend. This is where the WebSocket logic and routing live.
 - **Core Backend (Library Logic)**: A server-agnostic core that handles audio processing, ASR, and diarization. It exposes reusable components that take in audio bytes and return transcriptions.
-### ✨ Key Features
+### Key Features

-- **🎙️ Real-time Transcription** - Locally (or on-prem) convert speech to text instantly as you speak
-- **👥 Speaker Diarization** - Identify different speakers in real-time using [Diart](https://github.com/juanmc2005/diart)
-- **🌐 Multi-User Support** - Handle multiple users simultaneously with a single backend/server
-- **🔇 Automatic Silence Chunking** – Automatically chunks when no audio is detected to limit buffer size
-- **✅ Confidence Validation** – Immediately validate high-confidence tokens for faster inference (WhisperStreaming only)
-- **👁️ Buffering Preview** – Displays unvalidated transcription segments (not compatible with SimulStreaming yet)
-- **✒️ Punctuation-Based Speaker Splitting [BETA]** - Align speaker changes with natural sentence boundaries for more readable transcripts
-- **⚡ SimulStreaming Backend** - Ultra-low latency transcription using state-of-the-art AlignAtt policy. The code is not directly included in the repo : To use, please copy [simul_whisper](https://github.com/ufal/SimulStreaming/tree/main/simul_whisper) content into `whisperlivekit/simul_whisper` . ⚠️ You must comply with the [Polyform license](https://github.com/ufal/SimulStreaming/blob/main/LICENCE.txt)
+- **Real-time Transcription** - Locally (or on-prem) convert speech to text instantly as you speak
+- **Speaker Diarization** - Identify different speakers in real time using [Diart](https://github.com/juanmc2005/diart)
+- **Multi-User Support** - Handle multiple users simultaneously with a single backend/server
+- **Automatic Silence Chunking** – Automatically chunks when no audio is detected to limit buffer size
+- **Confidence Validation** – Immediately validate high-confidence tokens for faster inference (WhisperStreaming only)
+- **Buffering Preview** – Displays unvalidated transcription segments (not compatible with SimulStreaming yet)
+- **Punctuation-Based Speaker Splitting [BETA]** - Align speaker changes with natural sentence boundaries for more readable transcripts
+- **SimulStreaming Backend** - Ultra-low latency transcription using the state-of-the-art AlignAtt policy. The code is not directly included in the repo: to use it, copy the [simul_whisper](https://github.com/ufal/SimulStreaming/tree/main/simul_whisper) content into `whisperlivekit/simul_whisper`. ⚠️ You must comply with the [Polyform license](https://github.com/ufal/SimulStreaming/blob/main/LICENCE.txt)

-## 📖 Quick Start
+## Quick Start

 ```bash
 # Install the package
@@ -53,25 +53,19 @@ whisperlivekit-server --model tiny.en

 That's it! Start speaking and watch your words appear on screen.

-## 🛠️ Installation Options
-
-### Install from PyPI (Recommended)
+## Installation

 ```bash
+# Install from PyPI (Recommended)
 pip install whisperlivekit
-```
-### Install from Source
-
-```bash
+# Install from Source
 git clone https://github.com/QuentinFuxa/WhisperLiveKit
 cd WhisperLiveKit
 pip install -e .
 ```
-### System Dependencies
-
-FFmpeg is required:
+### FFmpeg Dependency

 ```bash
 # Ubuntu/Debian
@@ -140,40 +134,30 @@ whisperlivekit-server --backend simulstreaming --model large-v3 --frame-threshol

 Check [basic_server.py](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/basic_server.py) for a complete example.

 ```python
-from whisperlivekit import TranscriptionEngine, AudioProcessor, get_web_interface_html, parse_args
+from whisperlivekit import TranscriptionEngine, AudioProcessor, parse_args
 from fastapi import FastAPI, WebSocket, WebSocketDisconnect
 from fastapi.responses import HTMLResponse
 from contextlib import asynccontextmanager
 import asyncio

-# Global variable for the transcription engine
 transcription_engine = None

 @asynccontextmanager
 async def lifespan(app: FastAPI):
     global transcription_engine
-    # Example: Initialize with specific parameters directly
+    transcription_engine = TranscriptionEngine(model="medium", diarization=True, lan="en")
     # You can also load from command-line arguments using parse_args()
     # args = parse_args()
     # transcription_engine = TranscriptionEngine(**vars(args))
-    transcription_engine = TranscriptionEngine(model="medium", diarization=True, lan="en")
     yield

 app = FastAPI(lifespan=lifespan)

-# Serve the web interface
-@app.get("/")
-async def get():
-    return HTMLResponse(get_web_interface_html())
-
 # Process WebSocket connections
 async def handle_websocket_results(websocket: WebSocket, results_generator):
-    try:
-        async for response in results_generator:
-            await websocket.send_json(response)
-        await websocket.send_json({"type": "ready_to_stop"})
-    except WebSocketDisconnect:
-        print("WebSocket disconnected during results handling.")
+    async for response in results_generator:
+        await websocket.send_json(response)
+    await websocket.send_json({"type": "ready_to_stop"})

 @app.websocket("/asr")
@@ -182,33 +166,19 @@ async def websocket_endpoint(websocket: WebSocket):
     # Create a new AudioProcessor for each connection, passing the shared engine
     audio_processor = AudioProcessor(transcription_engine=transcription_engine)
     results_generator = await audio_processor.create_tasks()
-    send_results_to_client = handle_websocket_results(websocket, results_generator)
-    results_task = asyncio.create_task(send_results_to_client)
+    results_task = asyncio.create_task(handle_websocket_results(websocket, results_generator))
     await websocket.accept()
-    try:
-        while True:
-            message = await websocket.receive_bytes()
-            await audio_processor.process_audio(message)
-    except WebSocketDisconnect:
-        print(f"Client disconnected: {websocket.client}")
-    except Exception as e:
-        await websocket.close(code=1011, reason=f"Server error: {e}")
-    finally:
-        results_task.cancel()
-        try:
-            await results_task
-        except asyncio.CancelledError:
-            logger.info("Results task successfully cancelled.")
+    while True:
+        message = await websocket.receive_bytes()
+        await audio_processor.process_audio(message)
 ```

 ### Frontend Implementation

-The package includes a simple HTML/JavaScript implementation that you can adapt for your project. You can find it in `whisperlivekit/web/live_transcription.html`, or load its content using the `get_web_interface_html()` function from `whisperlivekit`:
+The package includes a simple HTML/JavaScript implementation that you can adapt for your project. You can find it [here](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/web/live_transcription.html), or load its content using `get_web_interface_html()`:

 ```python
 from whisperlivekit import get_web_interface_html
-
-# ... later in your code where you need the HTML string ...
 html_content = get_web_interface_html()
 ```
@@ -257,11 +227,8 @@ WhisperLiveKit offers extensive configuration options:

 1. **Audio Capture**: Browser's MediaRecorder API captures audio in webm/opus format
 2. **Streaming**: Audio chunks are sent to the server via WebSocket
-3. **Processing**: Server decodes audio with FFmpeg and streams into Whisper for transcription
-4. **Real-time Output**:
-   - Partial transcriptions appear immediately in light gray (the 'aperçu')
-   - Finalized text appears in normal color
-   - (When enabled) Different speakers are identified and highlighted
+3. **Processing**: Server decodes audio with FFmpeg and streams it into the model for transcription
+4. **Real-time Output**: Partial transcriptions appear immediately in light gray (the 'aperçu') and finalized text appears in normal color

 ## 🚀 Deployment Guide
@@ -291,17 +258,14 @@ To deploy WhisperLiveKit in production:
     proxy_set_header Upgrade $http_upgrade;
     proxy_set_header Connection "upgrade";
     proxy_set_header Host $host;
-    }
-}
-```
+    }}
+```

 4. **HTTPS Support**: For secure deployments, use "wss://" instead of "ws://" in WebSocket URL

 ### 🐋 Docker

-A basic Dockerfile is provided which allows re-use of Python package installation options. See below usage examples:
+A basic Dockerfile is provided which allows re-use of the Python package installation options. ⚠️ For **large** models, ensure that your **Docker runtime** has enough **memory** available. See the usage examples below:
-
-**NOTE:** For **larger** models, ensure that your **docker runtime** has enough **memory** available.

 #### All defaults
 - Create a reusable image with only the basics and then run as a named container:
@@ -327,40 +291,11 @@ docker start -i whisperlivekit-base
 - `HF_TOKEN="./token"` - Add your Hugging Face Hub access token to download gated models

 ## 🔮 Use Cases
-
-- **Meeting Transcription**: Capture discussions in real-time
-- **Accessibility Tools**: Help hearing-impaired users follow conversations
-- **Content Creation**: Transcribe podcasts or videos automatically
-- **Customer Service**: Transcribe support calls with speaker identification
-
-## 📄 License
-
-This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
-
-**⚠️ Important**: When using the SimulStreaming backend, you must also comply with the **PolyForm Noncommercial License 1.0.0** that governs SimulStreaming. For commercial use of the SimulStreaming backend, obtain a commercial license from the [SimulStreaming authors](https://github.com/ufal/SimulStreaming#-licence-and-contributions).
-
-## 🤝 Contributing
-
-Contributions are welcome! Here's how to get started:
-
-1. Fork the repository
-2. Create a feature branch: `git checkout -b feature/amazing-feature`
-3. Commit your changes: `git commit -m 'Add amazing feature'`
-4. Push to your branch: `git push origin feature/amazing-feature`
-5. Open a Pull Request
+Capture discussions in real time for meeting transcription, help hearing-impaired users follow conversations through accessibility tools, automatically transcribe podcasts or videos for content creation, or transcribe support calls with speaker identification for customer service.

 ## 🙏 Acknowledgments

-This project builds upon the foundational work of:
-- [Whisper Streaming](https://github.com/ufal/whisper_streaming)
-- [SimulStreaming](https://github.com/ufal/SimulStreaming) (BETA backend)
-- [Diart](https://github.com/juanmc2005/diart)
-- [OpenAI Whisper](https://github.com/openai/whisper)
+We extend our gratitude to the original authors of:
-We extend our gratitude to the original authors for their contributions.
-
-## 🔗 Links
-
-- [GitHub Repository](https://github.com/QuentinFuxa/WhisperLiveKit)
-- [PyPI Package](https://pypi.org/project/whisperlivekit/)
-- [Issue Tracker](https://github.com/QuentinFuxa/WhisperLiveKit/issues)
\ No newline at end of file
+| [Whisper Streaming](https://github.com/ufal/whisper_streaming) | [SimulStreaming](https://github.com/ufal/SimulStreaming) | [Diart](https://github.com/juanmc2005/diart) | [OpenAI Whisper](https://github.com/openai/whisper) |
+| -------- | ------- | -------- | ------- |
\ No newline at end of file
diff --git a/whisperlivekit/whisper_streaming_custom/backends.py b/whisperlivekit/whisper_streaming_custom/backends.py
index 1b08c4e..851e39e 100644
--- a/whisperlivekit/whisper_streaming_custom/backends.py
+++ b/whisperlivekit/whisper_streaming_custom/backends.py
@@ -12,6 +12,11 @@ import numpy as np
 from whisperlivekit.timed_objects import ASRToken

 logger = logging.getLogger(__name__)
+SIMULSTREAMING_ERROR_AND_INSTALLATION_INSTRUCTIONS = ImportError(
+"""SimulStreaming dependencies are not available.
+Please install WhisperLiveKit using pip install "whisperlivekit[simulstreaming]".
+If you are building from source, you should also copy the content of the https://github.com/ufal/SimulStreaming/tree/main/simul_whisper directory into whisperlivekit/simul_whisper.
+""")
 try:
     from whisperlivekit.simul_whisper.config import AlignAttConfig
@@ -315,7 +320,7 @@ class SimulStreamingASR(ASRBase):
     def __init__(self, lan, modelsize=None, cache_dir=None, model_dir=None, logfile=sys.stderr, **kwargs):
         if not SIMULSTREAMING_AVAILABLE:
-            raise ImportError("""SimulStreaming dependencies are not available. Please install WhisperLiveKit using pip install "whisperlivekit[simulstreaming]". If you are building from source, you should also copy the content of the simul_whisper directory from the SimulStreaming repository into whisperlivekit/simul_whisper.""")
+            raise SIMULSTREAMING_ERROR_AND_INSTALLATION_INSTRUCTIONS
         with open("whisperlivekit/simul_whisper/dual_license_simulstreaming.md", "r") as f:
             print("*"*80 + f.read() + "*"*80)
         self.logfile = logfile
diff --git a/whisperlivekit/whisper_streaming_custom/whisper_online.py b/whisperlivekit/whisper_streaming_custom/whisper_online.py
index 4320a75..352ae1d 100644
--- a/whisperlivekit/whisper_streaming_custom/whisper_online.py
+++ b/whisperlivekit/whisper_streaming_custom/whisper_online.py
@@ -5,7 +5,7 @@ import librosa
 from functools import lru_cache
 import time
 import logging
-from .backends import FasterWhisperASR, MLXWhisper, WhisperTimestampedASR, OpenaiApiASR, SimulStreamingASR, SIMULSTREAMING_AVAILABLE
+from .backends import FasterWhisperASR, MLXWhisper, WhisperTimestampedASR, OpenaiApiASR, SimulStreamingASR, SIMULSTREAMING_AVAILABLE, SIMULSTREAMING_ERROR_AND_INSTALLATION_INSTRUCTIONS
 from .online_asr import OnlineASRProcessor, VACOnlineASRProcessor, SimulStreamingOnlineProcessor, SIMULSTREAMING_AVAILABLE as SIMULSTREAMING_ONLINE_AVAILABLE

 logger = logging.getLogger(__name__)
@@ -72,10 +72,7 @@ def backend_factory(args):
     elif backend == "simulstreaming":
         logger.debug("Using SimulStreaming backend.")
         if not SIMULSTREAMING_AVAILABLE:
-            raise ImportError(
-                "SimulStreaming backend is not available. Please install SimulStreaming dependencies. "
-                "See the documentation for installation instructions."
-            )
+            raise SIMULSTREAMING_ERROR_AND_INSTALLATION_INSTRUCTIONS
         simulstreaming_kwargs = {}
         for attr in ['frame_threshold', 'beams', 'decoder_type', 'audio_max_len', 'audio_min_len',
@@ -144,7 +141,7 @@ def backend_factory(args):
 def online_factory(args, asr, tokenizer, logfile=sys.stderr):
     if args.backend == "simulstreaming":
         if not SIMULSTREAMING_ONLINE_AVAILABLE:
-            raise ImportError("SimulStreaming online processor is not available.")
+            raise SIMULSTREAMING_ERROR_AND_INSTALLATION_INSTRUCTIONS
         logger.debug("Creating SimulStreaming online processor")
         online = SimulStreamingOnlineProcessor(