publish 2.0.0
@@ -2,10 +2,19 @@ AirLLM optimizes inference memory usage, allowing 70B large language models to r
AirLLM optimizes inference memory usage, allowing a single 4GB GPU to run 70B large language model inference, without any quantization, distillation, pruning, or other model compression that would degrade model performance.
## Updates
[2023/12/01] airllm 2.0. Supports compression: **3x runtime speedup!**
[2023/11/20] airllm initial version!
## Quickstart
-### install package
+### 1. install package
First, install the airllm pip package.
@@ -20,7 +29,7 @@ pip install airllm
```
pip install -i https://pypi.org/simple/ airllm
```
-### Inference
+### 2. Inference
Then, initialize AirLLMLlama2, passing in the Hugging Face repo ID of the model (or its local path), and run inference much as you would with a regular transformers model.
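A minimal sketch of what that looks like, assuming AirLLMLlama2 exposes a transformers-style `tokenizer` and `generate` (the prompt and generation arguments here are illustrative and may vary by airllm version):

```python
from airllm import AirLLMLlama2

# Hugging Face repo ID; a local model path works here too.
model = AirLLMLlama2("garage-bAInd/Platypus2-70B-instruct")

# Tokenize with the model's bundled tokenizer, then generate much like
# a regular transformers model.
input_tokens = model.tokenizer(
    ["What is the capital of United States?"],
    return_tensors="pt",
)

# Move the prompt to the GPU and generate a short completion.
output_ids = model.generate(
    input_tokens["input_ids"].cuda(),
    max_new_tokens=20,
    use_cache=True,
)

print(model.tokenizer.decode(output_ids[0]))
```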
@@ -69,6 +78,34 @@ Note: During inference, the original model will first be decomposed and saved la
Note: during inference, the original model is first split by layer and re-saved. Please make sure the huggingface cache directory has enough disk space.
### 3. Compression - 3x Inference Speed!
We just added model compression based on block-wise quantization, which can further **speed up inference** by up to **3x** with almost negligible accuracy loss (see more performance evaluation, and why we use block-wise quantization, in [this paper](https://arxiv.org/abs/2212.09720))!

#### how to enable the model compression speedup:
* Step 1. make sure you have [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) installed: `pip install -U bitsandbytes`
* Step 2. make sure the airllm version is 2.0.0 or later: `pip install -U airllm`
* Step 3. when initializing the model, pass the compression argument ('4bit' or '8bit'):
```python
model = AirLLMLlama2(
    "garage-bAInd/Platypus2-70B-instruct",
    compression='4bit',  # specify '8bit' for 8-bit block-wise quantization
)
```
### 4. All supported configurations
When initializing the model, the following configurations are supported (a combined example follows the list):
* **compression**: supported options: '4bit' or '8bit' for 4-bit or 8-bit block-wise quantization, or None (default) for no compression
* **profiling_mode**: supported options: True to output time consumption, or False (default)
* **layer_shards_saving_path**: optionally, another path to save the split model
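A sketch combining these options; the repo ID matches the earlier example, and the saving path is a hypothetical placeholder:

```python
from airllm import AirLLMLlama2

model = AirLLMLlama2(
    "garage-bAInd/Platypus2-70B-instruct",
    compression='8bit',   # '4bit', '8bit', or None (default: no compression)
    profiling_mode=True,  # output time consumption during inference
    layer_shards_saving_path='/mnt/big_disk/airllm_shards',  # hypothetical path for the split layers
)
```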
## Acknowledgement
A lot of the code is based on SimJeg's great work in the Kaggle exam competition. Big shoutout to SimJeg:
@@ -5,7 +5,7 @@ with open("README.md", "r") as fh:
 setuptools.setup(
     name="airllm",
-    version="0.9.5",
+    version="2.0.0",
     author="Gavin Li",
     author_email="gavinli@animaai.cloud",
     description="AirLLM allows single 4GB GPU card to run 70B large language models without quantization, distillation or pruning.",
BIN assets/airllm2_time_improvement.png (new file, 45 KiB; binary file not shown)