publish 2.0.0
@@ -2,10 +2,19 @@ AirLLM optimizes inference memory usage, allowing 70B large language models to r
AirLLM optimizes inference memory usage, allowing a single 4GB GPU to run 70B large language model inference, without any quantization, distillation, pruning, or other model compression that would degrade model performance.
## Updates
[2023/12/01] airllm 2.0. Supports compression: **3x runtime speedup!**
[2023/11/20] airllm initial version!
## Quickstart
-### install package
+### 1. install package
First, install the airllm pip package.
@@ -20,7 +29,7 @@ pip install airllm
```
pip install -i https://pypi.org/simple/ airllm
```
-### Inference
+### 2. Inference
Then, initialize AirLLMLlama2, passing in the Hugging Face repo ID of the model (or its local path), and run inference much as you would with a regular transformers model.
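A minimal sketch of what that looks like, assuming AirLLMLlama2 exposes a transformers-style `tokenizer` and `generate` (the prompt and generation arguments here are illustrative and may vary by airllm version):

```python
from airllm import AirLLMLlama2

# Hugging Face repo ID; a local model path works here too.
model = AirLLMLlama2("garage-bAInd/Platypus2-70B-instruct")

# Tokenize with the model's bundled tokenizer, then generate much like
# a regular transformers model.
input_tokens = model.tokenizer(
    ["What is the capital of United States?"],
    return_tensors="pt",
)

# Move the prompt to the GPU and generate a short completion.
output_ids = model.generate(
    input_tokens["input_ids"].cuda(),
    max_new_tokens=20,
    use_cache=True,
)

print(model.tokenizer.decode(output_ids[0]))
```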
@@ -69,6 +78,34 @@ Note: During inference, the original model will first be decomposed and saved la
Note: during inference, the original model is first split by layer and re-saved. Please make sure the huggingface cache directory has enough disk space.
### 3. Compression - 3x Inference Speed!
We just added model compression based on block-wise quantization, which can further **speed up inference** by up to **3x** with almost negligible accuracy loss (see more performance evaluation, and why we use block-wise quantization, in [this paper](https://arxiv.org/abs/2212.09720))!

#### how to enable the model compression speedup:
* Step 1. make sure you have [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) installed: `pip install -U bitsandbytes`
* Step 2. make sure the airllm version is 2.0.0 or later: `pip install -U airllm`
* Step 3. when initializing the model, pass the compression argument ('4bit' or '8bit'):
```python
model = AirLLMLlama2(
    "garage-bAInd/Platypus2-70B-instruct",
    compression='4bit',  # specify '8bit' for 8-bit block-wise quantization
)
```
### 4. All supported configurations
When initializing the model, the following configurations are supported (a combined example follows the list):
* **compression**: supported options: '4bit' or '8bit' for 4-bit or 8-bit block-wise quantization, or None (default) for no compression
* **profiling_mode**: supported options: True to output time consumption, or False (default)
* **layer_shards_saving_path**: optionally, another path to save the split model
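A sketch combining these options; the repo ID matches the earlier example, and the saving path is a hypothetical placeholder:

```python
from airllm import AirLLMLlama2

model = AirLLMLlama2(
    "garage-bAInd/Platypus2-70B-instruct",
    compression='8bit',   # '4bit', '8bit', or None (default: no compression)
    profiling_mode=True,  # output time consumption during inference
    layer_shards_saving_path='/mnt/big_disk/airllm_shards',  # hypothetical path for the split layers
)
```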
## Acknowledgement
A lot of the code is based on SimJeg's great work in the Kaggle exam competition. Big shoutout to SimJeg:
@@ -5,7 +5,7 @@ with open("README.md", "r") as fh:
 setuptools.setup(
     name="airllm",
-    version="0.9.5",
+    version="2.0.0",
     author="Gavin Li",
     author_email="gavinli@animaai.cloud",
     description="AirLLM allows single 4GB GPU card to run 70B large language models without quantization, distillation or pruning.",
BIN assets/airllm2_time_improvement.png (new file, 45 KiB; binary file not shown)