diff --git a/air_llm/README.md b/air_llm/README.md
index 3ee1381..39098bb 100644
--- a/air_llm/README.md
+++ b/air_llm/README.md
@@ -2,10 +2,19 @@ AirLLM optimizes inference memory usage, allowing 70B large language models to r
 AirLLM optimizes inference memory usage; a single 4GB GPU card can run 70B large language model inference, with no quantization, distillation, pruning, or other model compression that would degrade model performance.
 
+## Updates
+
+
+[2023/12/01] airllm 2.0. Supports compression: **3x run-time speedup!**
+
+[2023/11/20] airllm initial version!
+
+
+
 ## Quickstart
 
-### install package
+### 1. Install package
 
 First, install the airllm pip package.
 
@@ -20,7 +29,7 @@
 pip install -i https://pypi.org/simple/ airllm
 ```
 
-### Inference
+### 2. Inference
 
 Then, initialize AirLLMLlama2, pass in the huggingface repo ID of the model being used (or the local path), and inference can be performed similarly to a regular transformer model.
 
@@ -69,6 +78,34 @@ Note: During inference, the original model will first be decomposed and saved la
 Note: during inference, the original model will first be split by layer and re-saved. Please make sure there is enough disk space in the huggingface cache directory.
 
+
+### 3. Compression - 3x Inference Speed!
+
+We just added model compression based on block-wise quantization. It can further **speed up inference** by up to **3x**, with almost negligible accuracy loss (see the performance evaluation and why we use block-wise quantization in [this paper](https://arxiv.org/abs/2212.09720))!
+
+![speed_improvement](https://github.com/lyogavin/Anima/blob/main/assets/airllm2_time_improvement.png?raw=true)
+
+#### How to enable the model compression speedup:
+
+* Step 1. Make sure you have [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) installed: `pip install -U bitsandbytes`
+* Step 2. Make sure the airllm version is 2.0.0 or later: `pip install -U airllm`
+* Step 3. When initializing the model, pass the compression argument ('4bit' or '8bit'):
+
+```python
+model = AirLLMLlama2("garage-bAInd/Platypus2-70B-instruct",
+                     compression='4bit'  # specify '8bit' for 8-bit block-wise quantization
+                     )
+```
+
+### 4. All supported configurations
+
+When initializing the model, we support the following configurations:
+
+* **compression**: supported options: '4bit' or '8bit' for 4-bit or 8-bit block-wise quantization; defaults to None for no compression
+* **profiling_mode**: supported options: True to output time consumption; defaults to False
+* **layer_shards_saving_path**: optionally, another path to save the split model
+
+
 ## Acknowledgement
 
 A lot of the code is based on SimJeg's great work in the Kaggle exam competition. Big shoutout to SimJeg:
diff --git a/air_llm/setup.py b/air_llm/setup.py
index 9f39a4d..6297fad 100644
--- a/air_llm/setup.py
+++ b/air_llm/setup.py
@@ -5,7 +5,7 @@ with open("README.md", "r") as fh:
 setuptools.setup(
     name="airllm",
-    version="0.9.5",
+    version="2.0.0",
     author="Gavin Li",
     author_email="gavinli@animaai.cloud",
     description="AirLLM allows single 4GB GPU card to run 70B large language models without quantization, distillation or pruning.",
diff --git a/assets/airllm2_time_improvement.png b/assets/airllm2_time_improvement.png
new file mode 100644
index 0000000..d4ef9d8
Binary files /dev/null and b/assets/airllm2_time_improvement.png differ
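
For context on the Quickstart's Inference step (section 2 in the README hunk above): the diff shows only the introductory sentence, so here is a minimal end-to-end sketch of that flow. It assumes `AirLLMLlama2` exposes a transformers-style `tokenizer` attribute and `generate()` method, as the README implies; the prompt, lengths, and generation arguments are illustrative, not confirmed by the diff.

```python
# Minimal sketch of the section 2 inference flow.
# Assumes AirLLMLlama2 wraps a transformers-style tokenizer/generate API;
# argument values here are illustrative.
from airllm import AirLLMLlama2

MAX_LENGTH = 128

# huggingface repo ID, or a local path
model = AirLLMLlama2("garage-bAInd/Platypus2-70B-instruct")

input_text = ["What is the capital of the United States?"]

# Tokenize like a regular transformers tokenizer.
input_tokens = model.tokenizer(
    input_text,
    return_tensors="pt",
    truncation=True,
    max_length=MAX_LENGTH,
    padding=True,
)

# Generate; layers are loaded from disk on demand, which is how
# a 70B model fits within a single 4GB GPU.
generation_output = model.generate(
    input_tokens["input_ids"].cuda(),
    max_new_tokens=20,
    return_dict_in_generate=True,
)

print(model.tokenizer.decode(generation_output.sequences[0]))
```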
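
Similarly, a sketch combining the section 4 configuration options in one initialization. The parameter names come straight from the list in the diff; the saving path is a placeholder, and combining all three at once is an assumption.

```python
# Sketch: all documented init-time configurations together (section 4).
from airllm import AirLLMLlama2

model = AirLLMLlama2(
    "garage-bAInd/Platypus2-70B-instruct",
    compression='4bit',    # '4bit' / '8bit' block-wise quantization; None (default) disables it
    profiling_mode=True,   # output time consumption; default False
    layer_shards_saving_path="/mnt/big_disk/airllm_shards",  # placeholder path for the split layers
)
```

Note that `compression` requires bitsandbytes (step 1 of section 3), and `layer_shards_saving_path` is useful when the default huggingface cache directory sits on a small disk.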