diff --git a/README.md b/README.md
index 1e23904..eb2dc45 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
-
+
[**Quickstart**](#quickstart) |
[**Configurations**](#configurations) |
diff --git a/air_llm/README.md b/air_llm/README.md
index 71fb2f9..eb2dc45 100644
--- a/air_llm/README.md
+++ b/air_llm/README.md
@@ -1,4 +1,4 @@
-
+
[**Quickstart**](#quickstart) |
[**Configurations**](#configurations) |
@@ -8,8 +8,6 @@
**AirLLM** optimizes inference memory usage, allowing 70B large language models to run inference on a single 4GB GPU card. No quantization, distillation, pruning or other model compression techniques that would result in degraded model performance are needed.
-AirLLM优化inference内存,4GB单卡GPU可以运行70B大语言模型推理。不需要任何损失模型性能的量化和蒸馏,剪枝等模型压缩。
-

[](https://pepy.tech/project/airllm)
@@ -28,36 +26,22 @@ AirLLM优化inference内存,4GB单卡GPU可以运行70B大语言模型推理
[2024/04/20] AirLLM now supports Llama3 natively. Run Llama3 70B on a single 4GB GPU.
-AirLLM天然支持Llama3 70B。4GB显存运行Llama3 70B大模型。
-
[2023/12/25] v2.8.2: Support running 70B large language models on macOS.
-支持苹果系统运行70B大模型!
-
[2023/12/20] v2.7: Support AirLLMMixtral.
[2023/12/20] v2.6: Added AutoModel, which automatically detects the model type; no need to provide a model class to initialize the model.
-提供AuoModel,自动根据repo参数检测模型类型,自动初始化模型。
-
[2023/12/18] v2.5: Added prefetching to overlap model loading and compute, for a 10% speed improvement.
[2023/12/03] added support of **ChatGLM**, **QWen**, **Baichuan**, **Mistral**, **InternLM**!
-支持ChatGLM, QWEN, Baichuan, Mistral, InternLM!
-
[2023/12/02] Added support for safetensors. Now supports all top 10 models on the open LLM leaderboard.
-支持safetensor系列模型,现在open llm leaderboard前10的模型都已经支持。
-
[2023/12/01] airllm 2.0. Supports compression: **3x runtime speed up!**
-airllm2.0。支持模型压缩,速度提升3倍。
-
[2023/11/20] Initial version of airllm!
-airllm发布。
-
## Table of Contents
* [Quick start](#quickstart)
@@ -75,13 +59,10 @@ airllm发布。
First, install the airllm pip package.
-首先安装airllm包。
-
```bash
pip install airllm
```
-如果找不到package,可能是因为默认的镜像问题。可以尝试制定原始镜像:
+If the package cannot be found, it may be a problem with your default pip index mirror; try the official PyPI index:
```bash
pip install -i https://pypi.org/simple/ airllm
```
@@ -90,12 +71,8 @@ pip install -i https://pypi.org/simple/ airllm
Then, initialize AirLLMLlama2, pass in the huggingface repo ID of the model being used, or the local path, and inference can be performed similarly to a regular transformer model.
-然后,初始化AirLLMLlama2,传入所使用模型的huggingface repo ID,或者本地路径即可类似于普通的transformer模型进行推理。
-
(*You can also specify the path to save the split layered model through **layer_shards_saving_path** when initializing AirLLMLlama2.*)
-*如果需要指定另外的路径来存储分层的模型可以在初始化AirLLMLlama2是传入参数:**layer_shards_saving_path**。*)
-
```python
from airllm import AutoModel
@@ -133,15 +110,11 @@ print(output)
Note: During inference, the original model will first be decomposed and saved layer-wise. Please ensure there is sufficient disk space in the huggingface cache directory.
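
The layer-wise scheme this note refers to can be sketched in plain Python (a toy illustration only, not airllm's actual implementation — the file layout, names, and "layers" here are all made up): the model is persisted one layer per file, and at inference time each layer is loaded, applied, and freed before the next is read, so peak memory is one layer rather than the whole model.

```python
import json
import os
import tempfile

# Toy sketch of layer-wise inference: only one layer's weights are
# resident in memory at any time.

def save_layers(layers, directory):
    """Persist each layer to its own file (stand-in for layer shards)."""
    for i, layer in enumerate(layers):
        with open(os.path.join(directory, f"layer_{i}.json"), "w") as f:
            json.dump(layer, f)
    return len(layers)

def run_layerwise(x, directory, num_layers):
    """Run the layers in sequence, loading each from disk on demand."""
    for i in range(num_layers):
        with open(os.path.join(directory, f"layer_{i}.json")) as f:
            w = json.load(f)                 # load just this layer's weights
        x = [w["scale"] * v + w["bias"] for v in x]  # apply the layer
        del w                                # free it before loading the next
    return x

with tempfile.TemporaryDirectory() as d:
    n = save_layers([{"scale": 2.0, "bias": 1.0},
                     {"scale": 0.5, "bias": 0.0}], d)
    print(run_layerwise([1.0, 2.0], d, n))   # → [1.5, 2.5]
```

The trade-off is the one the note warns about: disk space (and disk bandwidth) replaces GPU memory as the limiting resource.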
-注意:推理过程会首先将原始模型按层分拆,转存。请保证huggingface cache目录有足够的磁盘空间。
-
## Model Compression - 3x Inference Speed Up!
We just added model compression based on block-wise quantization, which can further **speed up inference** by up to **3x**, with **almost negligible accuracy loss!** (see more performance evaluation and why we use block-wise quantization in [this paper](https://arxiv.org/abs/2212.09720))
-我们增加了基于block-wise quantization的模型压缩,推理速度提升3倍几乎没有精度损失。精度评测可以参考此paper:[this paper](https://arxiv.org/abs/2212.09720)
-

#### How to enable the model compression speed-up:
@@ -166,8 +139,6 @@ While in our case the bottleneck is mainly at the disk loading, we only need to
When initializing the model, we support the following configurations:
-初始化model的时候,可以指定以下的配置参数:
-
* **compression**: supported options: 4bit, 8bit for 4-bit or 8-bit block-wise quantization, or by default None for no compression
* **profiling_mode**: supported options: True to output time consumptions or by default False
* **layer_shards_saving_path**: optionally another path to save the split model
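
Putting the options above together, initialization looks roughly like this (a sketch: the repo ID and the shard path are placeholders, and the keyword names follow the list above — check your installed version for the exact signature):

```python
from airllm import AutoModel

model = AutoModel.from_pretrained(
    "garage-bAInd/Platypus2-70B-instruct",  # any supported HF repo ID, or a local path
    compression='4bit',                      # or '8bit'; omit (None) for no compression
    profiling_mode=False,                    # True prints time consumption per stage
    layer_shards_saving_path='/data/airllm_shards',  # placeholder: where split layers go
)
```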
@@ -194,51 +165,6 @@ Example colabs here:
-## Supported Models
-
-#### [HF open llm leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) top models
-
-
-**Including but not limited to the following:** (Most of the open models are based on llama2, so should be supported by default)
-
-@12/01/23
-
-| Rank | Model | Supported | Model Class |
-| ------------- | ------------- | ------------- | ------------- |
-| 1 | TigerResearch/tigerbot-70b-chat-v2 | ✅ | AirLLMLlama2 |
-| 2 | upstage/SOLAR-0-70b-16bit | ✅ | AirLLMLlama2 |
-| 3 | ICBU-NPU/FashionGPT-70B-V1.1 | ✅ | AirLLMLlama2 |
-| 4 | sequelbox/StellarBright | ✅ | AirLLMLlama2 |
-| 5 | bhenrym14/platypus-yi-34b | ✅ | AirLLMLlama2 |
-| 6 | MayaPH/GodziLLa2-70B | ✅ | AirLLMLlama2 |
-| 7 | 01-ai/Yi-34B | ✅ | AirLLMLlama2 |
-| 8 | garage-bAInd/Platypus2-70B-instruct | ✅ | AirLLMLlama2 |
-| 9 | jondurbin/airoboros-l2-70b-2.2.1 | ✅ | AirLLMLlama2 |
-| 10 | chargoddard/Yi-34B-Llama | ✅ | AirLLMLlama2 |
-| ? | mistralai/Mistral-7B-Instruct-v0.1 | ✅ | AirLLMMistral |
-| ? | mistralai/Mixtral-8x7B-v0.1 | ✅ | AirLLMMixtral |
-
-
-#### [opencompass leaderboard](https://opencompass.org.cn/leaderboard-llm) top models
-
-**Including but not limited to the following:** (Most of the open models are based on llama2, so should be supported by default)
-
-@12/01/23
-
-| Rank | Model | Supported | Model Class |
-| ------------- | ------------- | ------------- | ------------- |
-| 1 | GPT-4 | closed.ai😓 | N/A |
-| 2 | TigerResearch/tigerbot-70b-chat-v2 | ✅ | AirLLMLlama2 |
-| 3 | THUDM/chatglm3-6b-base | ✅ | AirLLMChatGLM |
-| 4 | Qwen/Qwen-14B | ✅| AirLLMQWen |
-| 5 | 01-ai/Yi-34B | ✅ | AirLLMLlama2 |
-| 6 | ChatGPT | closed.ai😓 | N/A |
-| 7 | OrionStarAI/OrionStar-Yi-34B-Chat | ✅ | AirLLMLlama2 |
-| 8 | Qwen/Qwen-14B-Chat | ✅ | AirLLMQWen |
-| 9 | Duxiaoman-DI/XuanYuan-70B | ✅ | AirLLMLlama2 |
-| 10 | internlm/internlm-20b | ✅ | AirLLMInternLM |
-| 26 | baichuan-inc/Baichuan2-13B-Chat | ✅ | AirLLMBaichuan |
-
#### Example of other models (ChatGLM, QWen, Baichuan, Mistral, etc):
@@ -333,8 +259,6 @@ safetensors_rust.SafetensorError: Error while deserializing header: MetadataInco
If you run into this error, the most likely cause is that you have run out of disk space. Splitting the model consumes a lot of disk space. See [this](https://huggingface.co/TheBloke/guanaco-65B-GPTQ/discussions/12). You may need to extend your disk space, clear the huggingface [.cache](https://huggingface.co/docs/datasets/cache), and rerun.
-如果你碰到这个error,很有可能是空间不足。可以参考一下[这个](https://huggingface.co/TheBloke/guanaco-65B-GPTQ/discussions/12) 可能需要扩大硬盘空间,删除huggingface的[.cache](https://huggingface.co/docs/datasets/cache),然后重新run。
-
### 2. ValueError: max() arg is an empty sequence
Most likely you are loading a QWen or ChatGLM model with the Llama2 class. Try the following:
@@ -392,7 +316,6 @@ BibTex entry:
```
-
## Contribution
Contributions, ideas, and discussions are welcome!
diff --git a/air_llm/setup.py b/air_llm/setup.py
index fdffe05..7f26515 100644
--- a/air_llm/setup.py
+++ b/air_llm/setup.py
@@ -5,13 +5,13 @@ with open("README.md", "r") as fh:
setuptools.setup(
name="airllm",
- version="2.8.3",
+ version="2.8.6",
author="Gavin Li",
author_email="gavinli@animaai.cloud",
description="AirLLM allows single 4GB GPU card to run 70B large language models without quantization, distillation or pruning.",
long_description=long_description,
long_description_content_type="text/markdown",
- url="https://github.com/lyogavin/Anima/tree/main/air_llm",
+ url="https://github.com/lyogavin/airllm",
packages=setuptools.find_packages(),
install_requires=[
'tqdm',
diff --git a/assets/airllm_logo.png b/assets/airllm_logo.png
index 5693eba..a5f7196 100644
Binary files a/assets/airllm_logo.png and b/assets/airllm_logo.png differ
diff --git a/assets/airllm_logo_sm.png b/assets/airllm_logo_sm.png
index e2dfe04..1239a28 100644
Binary files a/assets/airllm_logo_sm.png and b/assets/airllm_logo_sm.png differ