Update README.md

Gavin Li
2024-07-29 15:06:52 -05:00
committed by GitHub
parent ee1cf64e82
commit 6d65cffba8

README.md

@@ -1,29 +1,20 @@
# Anima
![airllm_logo](https://github.com/lyogavin/Anima/blob/main/assets/airllm_logo_sm.png?v=3&raw=true)
![Anima Logo](https://github.com/lyogavin/Anima/blob/main/anima_logo.png?raw=true)
[**Quickstart**](#quickstart) |
[**Configurations**](#configurations) |
[**MacOS**](#macos) |
[**Example notebooks**](#example-python-notebook) |
[**FAQ**](#faq)
**AirLLM** optimizes inference memory usage, allowing 70B large language models to run inference on a single 4GB GPU card. No quantization, distillation, pruning or other model compression techniques that would result in degraded model performance are needed.
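This is the first open source 33B Chinese LLM, we also support DPO alignment training and we have open sourced a 100K context window model. The latest update is AirLLM, a library that helps you run inference on a 70B LLM from a single GPU with just 4GB of memory.
The first open source QLoRA-based 33B Chinese large language model, now with support for DPO-based alignment training.
We have also open sourced Anima100K, a model with a 100K input window, based on Llama2 and available for commercial use.
Most recently we open sourced AirLLM, which runs 70B models on a single GPU.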
*Read this in [English](README_en.md).*
*Read this in [Japanese](README_ja.md).*
<div align="left">
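AirLLM optimizes inference memory so that a single 4GB GPU can run inference on 70B large language models, without any quantization, distillation, pruning, or other model compression that would degrade model performance.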
<a href="https://github.com/lyogavin/Anima/stargazers">![GitHub Repo stars](https://img.shields.io/github/stars/lyogavin/Anima?style=social)</a>
[![Downloads](https://static.pepy.tech/personalized-badge/airllm?period=total&units=international_system&left_color=grey&right_color=blue&left_text=downloads)](https://pepy.tech/project/airllm)
[![Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-green.svg)](https://github.com/LianjiaTech/BELLE/blob/main/LICENSE)
[![Generic badge](https://img.shields.io/badge/wechat-Anima-brightgreen?logo=wechat)](https://static.aicompose.cn/static/wecom_barcode.png?t=1671918938)
[![Generic badge](https://img.shields.io/badge/🤗-Huggingface%20Repo-green.svg)](https://huggingface.co/lyogavin/Anima33B-merged)
[![Generic badge](https://img.shields.io/badge/🤗-Huggingface%20Repo-green.svg)](https://huggingface.co/lyogavin/Anima-7B-100K)
[![Discord](https://img.shields.io/discord/1175437549783760896?logo=discord&color=7289da
)](https://discord.gg/2xffU5sn)
[![PyPI - AirLLM](https://img.shields.io/pypi/format/airllm?logo=pypi&color=3571a3)
@@ -32,136 +23,380 @@ This is the first open source 33B Chinese LLM, we also support DPO alignment tra
[![Support me on Patreon](https://img.shields.io/endpoint.svg?url=https%3A%2F%2Fshieldsio-patreon.vercel.app%2Fapi%3Fusername%3Dgavinli%26type%3Dpatrons&style=flat)](https://patreon.com/gavinli)
[![GitHub Sponsors](https://img.shields.io/github/sponsors/lyogavin?logo=GitHub&color=lightgray)](https://github.com/sponsors/lyogavin)
</div>
## 🔄 Updates
## Updates
[2024/04/20] [AirLLM](https://github.com/lyogavin/Anima/tree/main/air_llm) now supports Llama3 natively. Run Llama3 70B on a single 4GB GPU.
[2024/04/20] AirLLM now supports Llama3 natively. Run Llama3 70B on a single 4GB GPU.
AirLLM natively supports Llama3 70B. Run the Llama3 70B model with 4GB of GPU memory.
[2024/03/07] Open source: Latte text2video Training - Train your own SORA!
[2023/12/25] v2.8.2: Support MacOS running 70B large language models.
The open source model closest to SORA is here: [Train your own SORA](https://github.com/lyogavin/train_your_own_sora)
Support for running 70B large models on macOS.
[2023/11/17] Open source: AirLLM, inference 70B LLM with 4GB single GPU.
[2023/12/20] v2.7: Support AirLLMMixtral.
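Open sourced AirLLM: run 70B large models on a single GPU with 4GB of memory, with no quantization and no model compression.
[2023/12/20] v2.6: Added AutoModel, which automatically detects the model type, so there is no need to provide the model class when initializing a model.
[2023/09/06] open source 100K context window Llama2 based LLM
AutoModel automatically detects the model type from the repo parameter and initializes the model.
Update: a Llama2-based, commercially usable large model supporting a 100K context.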
[2023/12/18] v2.5: added prefetching to overlap the model loading and compute. 10% speed improvement.
[2023/06/29] Open source alignment training based on DPO+QLORA
[2023/12/03] added support of **ChatGLM**, **QWen**, **Baichuan**, **Mistral**, **InternLM**!
Update: Human Feedback training based on DPO+QLoRA.
Support for ChatGLM, QWen, Baichuan, Mistral, InternLM!
[2023/06/12] Open source the first 33B Chinese Large language model
[2023/12/02] added support for safetensors. Now supports all top 10 models on the open llm leaderboard.
Open sourced the first QLoRA-based 33B Chinese large language model.
Support for safetensors models. All top 10 models on the open llm leaderboard are now supported.
[2023/12/01] airllm 2.0. Support compressions: **3x run time speed up!**
airllm 2.0. Supports model compression for a 3x speed-up.
[2023/11/20] airllm initial version!
airllm released.
## Table of Contents
* [Quick start](#quickstart)
* [Model Compression](#model-compression---3x-inference-speed-up)
* [Configurations](#configurations)
* [Run on MacOS](#macos)
* [Example notebooks](#example-python-notebook)
* [Supported Models](#supported-models)
* [Acknowledgement](#acknowledgement)
* [FAQ](#faq)
## Quickstart
### 1. Install package
First, install the airllm pip package.
```bash
pip install airllm
```
If the package cannot be found, the default mirror may be the cause. Try specifying the original index:
```bash
pip install -i https://pypi.org/simple/ airllm
```
### 2. Inference
Then, initialize AirLLMLlama2, passing in the huggingface repo ID of the model being used, or its local path. Inference can then be performed much like with a regular transformer model.
(*You can also specify the path where the split, layered model is saved through **layer_shards_saving_path** when initializing AirLLMLlama2.*)
```python
from airllm import AutoModel
MAX_LENGTH = 128
# could use hugging face model repo id:
model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct")
# or use model's local path...
#model = AutoModel.from_pretrained("/home/ubuntu/.cache/huggingface/hub/models--garage-bAInd--Platypus2-70B-instruct/snapshots/b585e74bcaae02e52665d9ac6d23f4d0dbc81a0f")
input_text = [
    'What is the capital of United States?',
    #'I like',
]
input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=False)
generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True)
output = model.tokenizer.decode(generation_output.sequences[0])
print(output)
```
Note: During inference, the original model is first decomposed and saved layer-wise. Please ensure there is sufficient disk space in the huggingface cache directory.
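Because the layer-wise split writes a second copy of the model to disk, it can help to check free space before the first run. The snippet below is a minimal sketch using only the Python standard library. The helper name `has_enough_disk`, the default cache path, and the "two copies" rule of thumb are assumptions for illustration and are not part of airllm.

```python
import os
import shutil

# Assumption: default Hugging Face cache location; adjust it if HF_HOME or
# layer_shards_saving_path points somewhere else.
DEFAULT_CACHE_DIR = os.path.expanduser("~/.cache/huggingface")


def has_enough_disk(model_size_bytes, path=DEFAULT_CACHE_DIR, copies=2):
    """Return True if `path` has room for `copies` copies of the model.

    copies=2 is a rule of thumb (original download + layer-wise split),
    not a number taken from airllm itself.
    """
    # Walk up to the nearest existing directory so disk_usage does not fail
    # when the cache directory has not been created yet.
    probe = os.path.abspath(path)
    while not os.path.exists(probe):
        probe = os.path.dirname(probe)
    free_bytes = shutil.disk_usage(probe).free
    return free_bytes >= model_size_bytes * copies


if __name__ == "__main__":
    approx_70b_fp16 = 140 * 1024**3  # rough size of a 70B model in 16-bit weights
    print("enough disk space:", has_enough_disk(approx_70b_fp16))
```

If the check fails, freeing space or pointing the split output at a larger disk before the first run avoids a failure partway through the split.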
## AirLLM, inference 70B LLM with 4GB single GPU
## Model Compression - 3x Inference Speed Up!
AirLLM optimizes inference memory usage, allowing 70B large language models to run inference on a single 4GB GPU card. No quantization, distillation, pruning or other model compression techniques that would result in degraded model performance are needed.
We just added model compression based on block-wise quantization, which can further **speed up inference** by up to **3x**, with **almost negligible accuracy loss!** (see more performance evaluation and why we use block-wise quantization in [this paper](https://arxiv.org/abs/2212.09720))
AirLLM optimizes inference memory so that a single 4GB GPU can run inference on 70B large language models, without any quantization, distillation, pruning, or other model compression that would degrade model performance.
We added model compression based on block-wise quantization: inference is up to 3x faster with almost no accuracy loss. For the accuracy evaluation, see [this paper](https://arxiv.org/abs/2212.09720)
![speed_improvement](https://github.com/lyogavin/Anima/blob/main/assets/airllm2_time_improvement.png?v=2&raw=true)
#### How to enable model compression speed up:
* Step 1. Make sure you have [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) installed: `pip install -U bitsandbytes`
* Step 2. Make sure your airllm version is later than 2.0.0: `pip install -U airllm`
* Step 3. When initializing the model, pass the argument compression ('4bit' or '8bit'):
```python
from airllm import AutoModel

model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct",
    compression='4bit' # specify '8bit' for 8-bit block-wise quantization
)
```
#### What are the differences between model compression and quantization?
Quantization normally needs to quantize both weights and activations to really speed things up, which makes it harder to maintain accuracy and avoid the impact of outliers across all kinds of inputs.
In our case the bottleneck is mainly disk loading, so we only need to make the model smaller to load. We therefore quantize only the weights, which makes it easier to preserve accuracy.
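To make the idea concrete, here is a minimal sketch of block-wise 8-bit absmax quantization applied to weights only. It is written in plain Python for illustration and is not the actual implementation used by airllm or bitsandbytes; the function names and the block size of 4 are assumptions.

```python
def quantize_blockwise(weights, block_size=4):
    """Quantize a flat list of floats to int8 codes, one scale per block."""
    codes, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        # Each block gets its own scale, so an outlier only reduces
        # the precision of the block it lives in.
        absmax = max(abs(w) for w in block) or 1.0
        scales.append(absmax)
        codes.extend(round(w / absmax * 127) for w in block)
    return codes, scales


def dequantize_blockwise(codes, scales, block_size=4):
    """Recover approximate floats from int8 codes and per-block scales."""
    return [
        code / 127 * scales[i // block_size]
        for i, code in enumerate(codes)
    ]


if __name__ == "__main__":
    # The second block contains large values; the first block stays precise.
    weights = [0.02, -0.01, 0.03, 0.015, 8.0, -4.0, 2.0, 1.0]
    codes, scales = quantize_blockwise(weights)
    restored = dequantize_blockwise(codes, scales)
    max_error = max(abs(a - b) for a, b in zip(weights, restored))
    print("scales:", scales)
    print("max reconstruction error:", max_error)
```

Only the stored weights shrink here; activations are untouched, which matches the point above that the goal is a smaller model to load from disk rather than faster arithmetic.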
## Configurations
When initializing the model, we support the following configurations:
* **compression**: supported options: 4bit, 8bit for 4-bit or 8-bit block-wise quantization, or by default None for no compression
* **profiling_mode**: supported options: True to output time consumption, or by default False
* **layer_shards_saving_path**: optionally another path to save the split model
* **hf_token**: huggingface token can be provided here if downloading gated models like: *meta-llama/Llama-2-7b-hf*
* **prefetching**: prefetching to overlap the model loading and compute. By default, turned on. For now, only AirLLMLlama2 supports this.
* **delete_original**: if you don't have much disk space, you can set delete_original to true to delete the original downloaded hugging face model and keep only the transformed one, saving half of the disk space.
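As a sketch of how these options combine, the hypothetical helper below collects them into keyword arguments for `AutoModel.from_pretrained`. The option names come from the list above; the helper itself, its validation, the `delete_original=False` default, and the example path are assumptions for illustration and are not part of airllm.

```python
def build_airllm_kwargs(compression=None, profiling_mode=False,
                        layer_shards_saving_path=None, hf_token=None,
                        prefetching=True, delete_original=False):
    """Validate the options and return kwargs, omitting unset optional values.

    Hypothetical helper: the defaults mirror the list above where stated;
    delete_original=False is an assumption.
    """
    if compression not in (None, "4bit", "8bit"):
        raise ValueError("compression must be None, '4bit' or '8bit'")
    kwargs = {
        "compression": compression,
        "profiling_mode": profiling_mode,
        "prefetching": prefetching,
        "delete_original": delete_original,
    }
    # Only pass the optional path and token when the caller provided them.
    if layer_shards_saving_path is not None:
        kwargs["layer_shards_saving_path"] = layer_shards_saving_path
    if hf_token is not None:
        kwargs["hf_token"] = hf_token
    return kwargs


if __name__ == "__main__":
    kwargs = build_airllm_kwargs(compression="4bit",
                                 layer_shards_saving_path="/data/airllm_shards")
    print(kwargs)
    # With airllm installed, these would be passed on as:
    # model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct", **kwargs)
```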
## MacOS
Just install airllm and run the code the same as on linux. See more in [Quick Start](#quickstart).
* make sure you have installed [mlx](https://github.com/ml-explore/mlx?tab=readme-ov-file#installation) and torch
* you probably need to install a native Python build; see more [here](https://stackoverflow.com/a/65432861/21230266)
* only [Apple silicon](https://support.apple.com/en-us/HT211814) is supported
Example [python notebook](https://github.com/lyogavin/Anima/blob/main/air_llm/examples/run_on_macos.ipynb)
Find out more [Here](https://github.com/lyogavin/Anima/tree/main/air_llm).
## Example Python Notebook
## Train your own SORA: Open source: Latte text2video Training
Example colabs here:
Train your own SORA:
<a target="_blank" href="https://colab.research.google.com/github/lyogavin/Anima/blob/main/air_llm/examples/run_all_types_of_models.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
Check out here: [https://github.com/lyogavin/train_your_own_sora](https://github.com/lyogavin/train_your_own_sora)
## Supported Models
## 100K context length LLM
We released the new Anima open source 7B model, supporting an input window length of 100K! It's based on LLama2, so it is available for commercial use!
With specifically curated long text question answering training data for the 100K input length, and a lot of memory optimizations, we enabled the LLama2 model to scale to 100K input length.
#### [HF open llm leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) top models
With a 100K input length, you can even put an entire knowledge base into the prompt, or put a whole book directly into it. No more laborious vectorization and text splitting.
**Including but not limited to the following:** (Most of the open models are based on llama2, so should be supported by default)
We stacked up the latest techniques: [XEntropy](https://github.com/NVIDIA/apex/tree/master/apex/contrib/xentropy), [Paged 8bit Adamw](https://github.com/TimDettmers/bitsandbytes), [LORA](https://github.com/huggingface/peft), [Flashattention2](https://github.com/Dao-AILab/flash-attention), and customized both the training and inference code specifically for long inputs, so that a single GPU with 100G of memory can train with a 100K window and a single GPU with 40G can run inference.
@12/01/23
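For training data, we selected long-text samples of 30K to 100K length, specifically targeting long inputs, from dozens of public datasets, and trained the model specifically for 100K input.
Find out more [Here](https://github.com/lyogavin/Anima/tree/main/anima_100k).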
[![Generic badge](https://img.shields.io/badge/🤗-Huggingface%20Repo-green.svg)](https://huggingface.co/lyogavin/Anima-7B-100K)
| Rank | Model | Supported | Model Class |
| ------------- | ------------- | ------------- | ------------- |
| 1 | TigerResearch/tigerbot-70b-chat-v2 | ✅ | AirLLMLlama2 |
| 2 | upstage/SOLAR-0-70b-16bit | ✅ | AirLLMLlama2 |
| 3 | ICBU-NPU/FashionGPT-70B-V1.1 | ✅ | AirLLMLlama2 |
| 4 | sequelbox/StellarBright | ✅ | AirLLMLlama2 |
| 5 | bhenrym14/platypus-yi-34b | ✅ | AirLLMLlama2 |
| 6 | MayaPH/GodziLLa2-70B | ✅ | AirLLMLlama2 |
| 7 | 01-ai/Yi-34B | ✅ | AirLLMLlama2 |
| 8 | garage-bAInd/Platypus2-70B-instruct | ✅ | AirLLMLlama2 |
| 9 | jondurbin/airoboros-l2-70b-2.2.1 | ✅ | AirLLMLlama2 |
| 10 | chargoddard/Yi-34B-Llama | ✅ | AirLLMLlama2 |
| | mistralai/Mistral-7B-Instruct-v0.1 | ✅ | AirLLMMistral |
| | mistralai/Mixtral-8x7B-v0.1 | ✅ | AirLLMMixtral |
## Anima 33B Chinese
#### [opencompass leaderboard](https://opencompass.org.cn/leaderboard-llm) top models
We believe the future of AI will be fully open and democratized. AI should be a tool that's accessible to everyone, instead of only the big monopolies (some of them have the term “open” in their names 😆). QLoRA might be an important step towards that future. To make a small contribution to the historical process of democratizing AI, we are open sourcing the 33B QLoRA model we trained: all the model parameters, code, datasets and evaluations are open! 🤗
**Including but not limited to the following:** (Most of the open models are based on llama2, so should be supported by default)
@12/01/23
| Rank | Model | Supported | Model Class |
| ------------- | ------------- | ------------- | ------------- |
| 1 | GPT-4 | closed.ai😓 | N/A |
| 2 | TigerResearch/tigerbot-70b-chat-v2 | ✅ | AirLLMLlama2 |
| 3 | THUDM/chatglm3-6b-base | ✅ | AirLLMChatGLM |
| 4 | Qwen/Qwen-14B | ✅| AirLLMQWen |
| 5 | 01-ai/Yi-34B | ✅ | AirLLMLlama2 |
| 6 | ChatGPT | closed.ai😓 | N/A |
| 7 | OrionStarAI/OrionStar-Yi-34B-Chat | ✅ | AirLLMLlama2 |
| 8 | Qwen/Qwen-14B-Chat | ✅ | AirLLMQWen |
| 9 | Duxiaoman-DI/XuanYuan-70B | ✅ | AirLLMLlama2 |
| 10 | internlm/internlm-20b | ✅ | AirLLMInternLM |
| 26 | baichuan-inc/Baichuan2-13B-Chat | ✅ | AirLLMBaichuan |
#### example of other models (ChatGLM, QWen, Baichuan, Mistral, etc):
<details>
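We therefore believe the [QLoRA](https://arxiv.org/abs/2305.14314) work is very important, so important that it may be a game changer. With QLoRA's optimizations, a 33B-scale model can for the first time be finetuned in a more democratized, lower-cost way and be widely used. We believe a 33B model can both deliver the stronger reasoning ability of large-scale models and be flexibly finetuned on private business-domain data, improving control over the LLM.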
* ChatGLM:
Find out more [Here](https://github.com/lyogavin/Anima/tree/main/training).
```python
from airllm import AutoModel
MAX_LENGTH = 128
model = AutoModel.from_pretrained("THUDM/chatglm3-6b-base")
input_text = ['What is the capital of China?',]
input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=True)
generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=5,
    use_cache=True,
    return_dict_in_generate=True)
model.tokenizer.decode(generation_output.sequences[0])
```
* QWen:
```python
from airllm import AutoModel
MAX_LENGTH = 128
model = AutoModel.from_pretrained("Qwen/Qwen-7B")
input_text = ['What is the capital of China?',]
input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH)
generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=5,
    use_cache=True,
    return_dict_in_generate=True)
model.tokenizer.decode(generation_output.sequences[0])
```
[![Generic badge](https://img.shields.io/badge/🤗-Huggingface%20Repo-green.svg)](https://huggingface.co/lyogavin/Anima33B-merged)
* Baichuan, InternLM, Mistral, etc:
```python
from airllm import AutoModel
MAX_LENGTH = 128
model = AutoModel.from_pretrained("baichuan-inc/Baichuan2-7B-Base")
#model = AutoModel.from_pretrained("internlm/internlm-20b")
#model = AutoModel.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
input_text = ['What is the capital of China?',]
input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH)
generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=5,
    use_cache=True,
    return_dict_in_generate=True)
model.tokenizer.decode(generation_output.sequences[0])
```
## Alignment training based on DPO and QLoRA
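We open sourced the latest alignment technique: DPO.
The Anima model now also open sources the latest DPO technique, based on QLoRA.
DPO is the latest and most efficient RLHF training method. RLHF has long been a hard problem in generative AI training and is regarded as OpenAI's closely guarded secret. DPO changes all of that and makes RLHF dead simple!
We open sourced a low-cost QLoRA implementation of RLHF: a single GPU machine is enough to train DPO on a 33B model!
Find out more [here](https://github.com/lyogavin/Anima/tree/main/rlhf).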
[![Generic badge](https://img.shields.io/badge/🤗-Huggingface%20Repo-green.svg)](https://huggingface.co/lyogavin/Anima33B-DPO-Belle-1k-merged)
</details>
## Stay Connected with Us
### Wechat Official Account
Scan the QR code:
![group](https://github.com/lyogavin/Anima/blob/main/assets/wechat_pub_account.jpg?raw=true)
#### To request other model support: [here](https://docs.google.com/forms/d/e/1FAIpQLSe0Io9ANMT964Zi-OQOq1TJmnvP-G3_ZgQDhP7SatN0IEdbOg/viewform?usp=sf_link)
### Wechat group
Scan the QR code to join the group:
## Acknowledgement
<img src="https://github.com/lyogavin/Anima/blob/main/assets/wechat_group.png?raw=true" alt="group" style="width:260px;"/>
A lot of the code is based on SimJeg's great work in the Kaggle exam competition. Big shoutout to SimJeg:
### Discord
[![Discord](https://img.shields.io/discord/1175437549783760896?logo=discord&color=7289da
)](https://discord.gg/2xffU5sn)
### Blog
[![Website](https://img.shields.io/website?up_message=blog&url=https%3A%2F%2Fmedium.com%2F%40lyo.gavin&logo=medium&color=black)](https://medium.com/@lyo.gavin)
[GitHub account @SimJeg](https://github.com/SimJeg),
[the code on Kaggle](https://www.kaggle.com/code/simjeg/platypus2-70b-with-wikipedia-rag),
[the associated discussion](https://www.kaggle.com/competitions/kaggle-llm-science-exam/discussion/446414).
## Contribution
## FAQ
Buy me a coffee please! Everyone is welcome to contribute to this project 🙏
### 1. MetadataIncompleteBuffer
**If you like our project, please give us a ⭐!**
safetensors_rust.SafetensorError: Error while deserializing header: MetadataIncompleteBuffer
If you run into this error, the most likely cause is that you have run out of disk space. Splitting the model is very disk-intensive. See [this](https://huggingface.co/TheBloke/guanaco-65B-GPTQ/discussions/12). You may need to extend your disk space, clear the huggingface [.cache](https://huggingface.co/docs/datasets/cache), and rerun.
### 2. ValueError: max() arg is an empty sequence
Most likely you are loading a QWen or ChatGLM model with the Llama2 class. Try the following:
For QWen model:
```python
from airllm import AutoModel #<----- instead of AirLLMLlama2
AutoModel.from_pretrained(...)
```
For ChatGLM model:
```python
from airllm import AutoModel #<----- instead of AirLLMLlama2
AutoModel.from_pretrained(...)
```
### 3. 401 Client Error....Repo model ... is gated.
Some models are gated and need a huggingface API token. You can provide hf_token:
```python
from airllm import AutoModel
model = AutoModel.from_pretrained("meta-llama/Llama-2-7b-hf", hf_token='HF_API_TOKEN')  # replace with your own token
```
### 4. ValueError: Asking to pad but the tokenizer does not have a padding token.
Some models' tokenizers don't have a padding token, so you can set a padding token or simply turn padding off:
```python
input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=False  #<----------- turn off padding
)
```
## Citing AirLLM
If you find
AirLLM useful in your research and wish to cite it, please use the following
BibTex entry:
```
@software{airllm2023,
author = {Gavin Li},
title = {AirLLM: scaling large language models on low-end commodity computers},
url = {https://github.com/lyogavin/Anima/tree/main/air_llm},
version = {0.0},
year = {2023},
}
```
## Contribution
Contributions, ideas and discussions are welcome!
If you find it useful, please ⭐ or buy me a coffee! 🙏
[!["Buy Me A Coffee"](https://www.buymeacoffee.com/assets/img/custom_images/orange_img.png)](https://bmc.link/lyogavinQ)
## ✍️ aiwrite.ai & Anima AI
This work is from [Anima AI LLC](https://animaai.cloud) and [aiwrite.ai](https://aiwrite.ai).