update readme

This commit is contained in:
Yu Li
2024-07-29 15:47:13 -05:00
parent 9025ac0528
commit 411cb2a0af
5 changed files with 4 additions and 81 deletions


@@ -1,4 +1,4 @@
-![airllm_logo](https://github.com/lyogavin/Anima/blob/main/assets/airllm_logo_sm.png?v=3&raw=true)
+![airllm_logo](https://github.com/lyogavin/airllm/blob/main/assets/airllm_logo_sm.png?v=3&raw=true)
[**Quickstart**](#quickstart) |
[**Configurations**](#configurations) |


@@ -1,4 +1,4 @@
-![airllm_logo](https://github.com/lyogavin/Anima/blob/main/assets/airllm_logo_sm.png?v=3&raw=true)
+![airllm_logo](https://github.com/lyogavin/airllm/blob/main/assets/airllm_logo_sm.png?v=3&raw=true)
[**Quickstart**](#quickstart) |
[**Configurations**](#configurations) |
@@ -8,8 +8,6 @@
**AirLLM** optimizes inference memory usage, allowing 70B large language models to run inference on a single 4GB GPU card. No quantization, distillation, pruning or other model compression techniques that would result in degraded model performance are needed.
<a href="https://github.com/lyogavin/Anima/stargazers">![GitHub Repo stars](https://img.shields.io/github/stars/lyogavin/Anima?style=social)</a>
[![Downloads](https://static.pepy.tech/personalized-badge/airllm?period=total&units=international_system&left_color=grey&right_color=blue&left_text=downloads)](https://pepy.tech/project/airllm)
@@ -28,36 +26,22 @@ AirLLM优化inference内存4GB单卡GPU可以运行70B大语言模型推理
[2024/04/20] AirLLM already supports Llama3 natively. Run Llama3 70B on a single 4GB GPU.
[2023/12/25] v2.8.2: Support running 70B large language models on MacOS.
[2023/12/20] v2.7: Support AirLLMMixtral.
[2023/12/20] v2.6: Added AutoModel: automatically detects the model type, so there is no need to provide a model class to initialize the model.
[2023/12/18] v2.5: added prefetching to overlap the model loading and compute. 10% speed improvement.
[2023/12/03] added support of **ChatGLM**, **QWen**, **Baichuan**, **Mistral**, **InternLM**!
[2023/12/02] Added support for safetensors. Now supports all top-10 models on the open llm leaderboard.
[2023/12/01] airllm 2.0. Supports compression: **3x runtime speedup!**
[2023/11/20] airllm initial version!
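The v2.5 prefetching item above overlaps disk loading with compute. A toy sketch of that idea, assuming nothing about AirLLM's actual implementation: a background thread loads layer i+1 while the main thread computes with layer i (`load` and `compute` are illustrative stand-ins):

```python
import queue
import threading

def prefetch_layers(num_layers, load, compute):
    """Toy sketch: a loader thread reads the next layer while the main
    thread computes with the current one, hiding part of the load latency."""
    q = queue.Queue(maxsize=1)  # at most one prefetched layer in flight

    def loader():
        for i in range(num_layers):
            q.put(load(i))      # stand-in for reading one layer shard
        q.put(None)             # sentinel: no more layers

    threading.Thread(target=loader, daemon=True).start()
    outputs = []
    while (layer := q.get()) is not None:
        outputs.append(compute(layer))
    return outputs

print(prefetch_layers(3, lambda i: i, lambda layer: layer * 2))  # [0, 2, 4]
```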
## Table of Contents
* [Quick start](#quickstart)
@@ -75,13 +59,10 @@ airllm发布。
First, install the airllm pip package.
```bash
pip install airllm
```
If the package cannot be found, it may be due to the default pip index mirror. Try specifying the original index:
```bash
pip install -i https://pypi.org/simple/ airllm
```
@@ -90,12 +71,8 @@ pip install -i https://pypi.org/simple/ airllm
Then, initialize AirLLMLlama2, passing in the huggingface repo ID of the model being used (or a local path), and perform inference just like with a regular transformers model.
(*You can also specify the path for saving the split layered model via **layer_shards_saving_path** when initializing AirLLMLlama2.*)
```python
from airllm import AutoModel
```
@@ -133,15 +110,11 @@ print(output)
Note: During inference, the original model will first be decomposed and saved layer-wise. Please ensure there is sufficient disk space in the huggingface cache directory.
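The layer-wise idea behind this note can be sketched in a toy form, with no claim about AirLLM's internals: only one layer's weights are resident at a time, and each "layer" here is just a callable produced on demand:

```python
def run_layers(load_layer, num_layers, x):
    """Toy sketch of layered execution: each layer is loaded, applied,
    then released before the next one is loaded from disk."""
    for i in range(num_layers):
        layer = load_layer(i)  # stand-in for reading one layer's shard
        x = layer(x)           # apply the layer
        del layer              # release it before loading the next
    return x

# Four dummy "layers" that each add their index to the running value.
result = run_layers(lambda i: (lambda v, k=i: v + k), 4, 0)
print(result)  # 6
```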
## Model Compression - 3x Inference Speed Up!
We just added block-wise quantization-based model compression, which can further **speed up inference** by up to **3x** with **almost negligible accuracy loss!** (see more performance evaluation and why we use block-wise quantization in [this paper](https://arxiv.org/abs/2212.09720))
![speed_improvement](https://github.com/lyogavin/Anima/blob/main/assets/airllm2_time_improvement.png?v=2&raw=true)
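The block-wise scheme referenced above quantizes each small block of weights with its own scale, so a single outlier only degrades the block it lives in. A minimal pure-Python sketch of the concept (not AirLLM's implementation; block size and bit width are illustrative):

```python
def quantize_blockwise(values, block_size=4, levels=256):
    """Toy 8-bit block-wise quantization: each block keeps its own scale."""
    blocks = []
    half = levels // 2 - 1  # 127 for 8-bit
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale = max(abs(v) for v in block) or 1.0  # per-block scale
        blocks.append((scale, [round(v / scale * half) for v in block]))
    return blocks

def dequantize_blockwise(blocks, levels=256):
    half = levels // 2 - 1
    return [q * scale / half for scale, qs in blocks for q in qs]

# The outlier 100.0 only hurts its own block; the second block stays precise.
vals = [0.1, -0.2, 0.05, 100.0, 0.3, -0.4, 0.2, 0.1]
restored = dequantize_blockwise(quantize_blockwise(vals))
```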
#### How to enable the model compression speed-up:
@@ -166,8 +139,6 @@ While in our case the bottleneck is mainly at the disk loading, we only need to
When initializing the model, the following configurations are supported:
* **compression**: supported options: 4bit, 8bit for 4-bit or 8-bit block-wise quantization, or by default None for no compression
* **profiling_mode**: supported options: True to output time consumptions or by default False
* **layer_shards_saving_path**: optionally another path to save the split model
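These options map to keyword arguments at init time. A hedged sketch with illustrative values (the commented-out call assumes the AutoModel API shown earlier and needs airllm plus a GPU):

```python
# Illustrative values for the configuration options listed above.
init_kwargs = {
    "compression": "4bit",            # "4bit", "8bit", or None (default: no compression)
    "profiling_mode": False,          # True to print time consumption
    "layer_shards_saving_path": "/tmp/airllm_shards",  # hypothetical path for split layers
}
# model = AutoModel.from_pretrained("<repo-id-or-local-path>", **init_kwargs)
```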
@@ -194,51 +165,6 @@ Example colabs here:
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
## Supported Models
#### [HF open llm leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) top models
**Including but not limited to the following:** (Most of the open models are based on llama2, so they should be supported by default)
@12/01/23
| Rank | Model | Supported | Model Class |
| ------------- | ------------- | ------------- | ------------- |
| 1 | TigerResearch/tigerbot-70b-chat-v2 | ✅ | AirLLMLlama2 |
| 2 | upstage/SOLAR-0-70b-16bit | ✅ | AirLLMLlama2 |
| 3 | ICBU-NPU/FashionGPT-70B-V1.1 | ✅ | AirLLMLlama2 |
| 4 | sequelbox/StellarBright | ✅ | AirLLMLlama2 |
| 5 | bhenrym14/platypus-yi-34b | ✅ | AirLLMLlama2 |
| 6 | MayaPH/GodziLLa2-70B | ✅ | AirLLMLlama2 |
| 7 | 01-ai/Yi-34B | ✅ | AirLLMLlama2 |
| 8 | garage-bAInd/Platypus2-70B-instruct | ✅ | AirLLMLlama2 |
| 9 | jondurbin/airoboros-l2-70b-2.2.1 | ✅ | AirLLMLlama2 |
| 10 | chargoddard/Yi-34B-Llama | ✅ | AirLLMLlama2 |
| | mistralai/Mistral-7B-Instruct-v0.1 | ✅ | AirLLMMistral |
| | mistralai/Mixtral-8x7B-v0.1 | ✅ | AirLLMMixtral |
#### [opencompass leaderboard](https://opencompass.org.cn/leaderboard-llm) top models
**Including but not limited to the following:** (Most of the open models are based on llama2, so they should be supported by default)
@12/01/23
| Rank | Model | Supported | Model Class |
| ------------- | ------------- | ------------- | ------------- |
| 1 | GPT-4 | closed.ai😓 | N/A |
| 2 | TigerResearch/tigerbot-70b-chat-v2 | ✅ | AirLLMLlama2 |
| 3 | THUDM/chatglm3-6b-base | ✅ | AirLLMChatGLM |
| 4 | Qwen/Qwen-14B | ✅| AirLLMQWen |
| 5 | 01-ai/Yi-34B | ✅ | AirLLMLlama2 |
| 6 | ChatGPT | closed.ai😓 | N/A |
| 7 | OrionStarAI/OrionStar-Yi-34B-Chat | ✅ | AirLLMLlama2 |
| 8 | Qwen/Qwen-14B-Chat | ✅ | AirLLMQWen |
| 9 | Duxiaoman-DI/XuanYuan-70B | ✅ | AirLLMLlama2 |
| 10 | internlm/internlm-20b | ✅ | AirLLMInternLM |
| 26 | baichuan-inc/Baichuan2-13B-Chat | ✅ | AirLLMBaichuan |
#### Example of other models (ChatGLM, QWen, Baichuan, Mistral, etc.):
<details>
@@ -333,8 +259,6 @@ safetensors_rust.SafetensorError: Error while deserializing header: MetadataInco
If you run into this error, the most likely cause is that you have run out of disk space. The process of splitting the model is very disk-intensive. See [this](https://huggingface.co/TheBloke/guanaco-65B-GPTQ/discussions/12). You may need to extend your disk space, clear the huggingface [.cache](https://huggingface.co/docs/datasets/cache), and rerun.
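Before re-running, you can check the free space near the cache directory with the standard library (the default cache location below is an assumption; set `HF_HOME` if yours differs):

```python
import os
import shutil

# Default huggingface cache location unless HF_HOME points elsewhere.
cache_dir = os.environ.get("HF_HOME", os.path.expanduser("~/.cache/huggingface"))
# Probe the cache dir if it exists, otherwise the home directory's filesystem.
probe = cache_dir if os.path.isdir(cache_dir) else os.path.expanduser("~")
usage = shutil.disk_usage(probe)
print(f"free space at {probe}: {usage.free / 1e9:.1f} GB")
```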
### 2. ValueError: max() arg is an empty sequence
Most likely you are loading a QWen or ChatGLM model with the Llama2 class. Try the following:
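As a sketch of the idea, here is an illustrative (not official) mapping from a repo ID to the AirLLM model classes listed in the tables above, so the class matches the model family instead of defaulting to AirLLMLlama2:

```python
# Illustrative mapping; class names are those shown in the tables above.
MODEL_CLASS_BY_FAMILY = {
    "qwen": "AirLLMQWen",
    "chatglm": "AirLLMChatGLM",
    "baichuan": "AirLLMBaichuan",
    "internlm": "AirLLMInternLM",
    "mistral": "AirLLMMistral",
}

def pick_class(repo_id):
    """Return the AirLLM class name matching a huggingface repo ID."""
    lowered = repo_id.lower()
    for family, cls in MODEL_CLASS_BY_FAMILY.items():
        if family in lowered:
            return cls
    return "AirLLMLlama2"  # default for llama2-based models

print(pick_class("Qwen/Qwen-14B"))  # AirLLMQWen
```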
@@ -392,7 +316,6 @@ BibTex entry:
```
## Contribution
Contributions, ideas, and discussions are welcome!


@@ -5,13 +5,13 @@ with open("README.md", "r") as fh:
setuptools.setup(
name="airllm",
-version="2.8.3",
+version="2.8.6",
author="Gavin Li",
author_email="gavinli@animaai.cloud",
description="AirLLM allows single 4GB GPU card to run 70B large language models without quantization, distillation or pruning.",
long_description=long_description,
long_description_content_type="text/markdown",
-url="https://github.com/lyogavin/Anima/tree/main/air_llm",
+url="https://github.com/lyogavin/airllm",
packages=setuptools.find_packages(),
install_requires=[
'tqdm',

2 binary image files changed (not shown): one reduced from 32 KiB to 22 KiB, the other from 13 KiB to 11 KiB.