https://github.com/yuanzhoulvpi2017/zero_nlp

Chinese NLP solutions (large models, data, models, training, inference)
bert chatglm-6b clip gpt gpt2 huggingface-transformers llama llama2 llava nlp pytorch text-generation transformers

Host: GitHub
URL: https://github.com/yuanzhoulvpi2017/zero_nlp
Owner: yuanzhoulvpi2017
License: mit
Created: 2023-02-05T07:11:18.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2025-02-12T13:56:56.000Z (8 months ago)
Last Synced: 2025-05-11T01:01:45.680Z (5 months ago)
Topics: bert, chatglm-6b, clip, gpt, gpt2, huggingface-transformers, llama, llama2, llava, nlp, pytorch, text-generation, transformers
Language: Jupyter Notebook
Homepage:
Size: 50.8 MB
Stars: 3,434
Watchers: 31
Forks: 403
Open Issues: 100
Metadata Files:
- Readme: README.md
- License: LICENSE

# zero to nlp

## Features

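1. 🎯 `Goal`: an out-of-the-box training framework for Chinese NLP built on `pytorch` and `transformers`, offering a complete solution for training and fine-tuning models (including large models, text-embedding models, text-generation models, and multimodal models).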

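2. 💽 `Data`:

    - A large amount of training data has been collected from the open-source community so that users can get started quickly.

    - Training-data templates are also provided, so data from vertical domains can be processed quickly.

    - More efficient data-processing techniques such as multithreading and memory mapping make it easy to handle data even at the `100 GB` scale.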

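3. 💻 `Workflow`: every project includes the complete model-training steps, such as data cleaning, data processing, model construction, model training, model deployment, and model diagrams.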

4. 🔥 `Models`: currently supports `gpt2`, `clip`, `gpt-neox`, `dolly`, `llama`, `chatglm-6b`, `VisionEncoderDecoderModel`, and other large and multimodal models.

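5. 🚀 `Multi-GPU chaining`: most large models are now far larger than the memory of a single consumer GPU, so several GPUs must be chained together to train or deploy them. The structure of some models has therefore been modified to support multi-GPU chaining at both `training time` and `inference time`.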

6. ⚙️ `Model tools`: added tutorials on `vocabulary trimming` and `vocabulary expansion` for large models; see [model_modify](https://github.com/yuanzhoulvpi2017/zero_nlp/tree/main/model_modify).
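
To make the idea behind vocabulary trimming and expansion concrete, here is a minimal pure-Python sketch on toy data. It is not the code from the `model_modify` tutorial: the function names `trim_vocab` and `extend_vocab`, the list-of-lists embedding table, and the choice to initialise new rows with the mean of the existing rows are all assumptions made for illustration. It also assumes token ids are contiguous, starting at 0.

```python
from typing import Dict, List, Sequence, Tuple

Vocab = Dict[str, int]
Matrix = List[List[float]]


def trim_vocab(vocab: Vocab, embeddings: Sequence[Sequence[float]],
               keep_tokens: Sequence[str]) -> Tuple[Vocab, Matrix]:
    """Keep only `keep_tokens`; return the re-indexed vocab and matching rows."""
    # Sort by original id so low-id special tokens keep their relative order.
    kept = sorted({t for t in keep_tokens if t in vocab}, key=lambda t: vocab[t])
    new_vocab = {token: new_id for new_id, token in enumerate(kept)}
    # Copy the embedding row of every surviving token into its new position.
    new_embeddings = [list(embeddings[vocab[token]]) for token in kept]
    return new_vocab, new_embeddings


def extend_vocab(vocab: Vocab, embeddings: Sequence[Sequence[float]],
                 new_tokens: Sequence[str]) -> Tuple[Vocab, Matrix]:
    """Append unseen tokens; each new row starts as the mean of existing rows."""
    dim = len(embeddings[0])
    mean_row = [sum(row[i] for row in embeddings) / len(embeddings)
                for i in range(dim)]
    new_vocab = dict(vocab)
    new_embeddings = [list(row) for row in embeddings]
    for token in new_tokens:
        if token not in new_vocab:
            # Ids are assumed contiguous, so the next id equals the row count.
            new_vocab[token] = len(new_embeddings)
            new_embeddings.append(list(mean_row))
    return new_vocab, new_embeddings


# Toy vocabulary with a 2-dimensional embedding per token (hypothetical data).
vocab = {"<pad>": 0, "你": 1, "好": 2, "hello": 3, "world": 4}
embeddings = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

# Trim away the English tokens, then add one new Chinese token.
small_vocab, small_emb = trim_vocab(vocab, embeddings, ["<pad>", "你", "好"])
big_vocab, big_emb = extend_vocab(small_vocab, small_emb, ["模型", "你"])
```

With a real model the same re-indexing has to be applied consistently to the tokenizer, the input embedding matrix, and the output projection, which is why a dedicated tutorial is useful.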

## Contents


### Model training

| Name | Folder | Data | Data cleaning | Large model | Model deployment | Diagram |
|------|--------|------|---------------|-------------|------------------|---------|
| Chinese text classification | [chinese_classifier](https://github.com/yuanzhoulvpi2017/zero_nlp/tree/main/chinese_classifier) | ✅ | ✅ | ✅ | ❌ | ✅ |
| Chinese `gpt2` | [chinese_gpt2](https://github.com/yuanzhoulvpi2017/zero_nlp/tree/main/chinese_gpt2) | ✅ | ✅ | ✅ | ✅ | ❌ |
| Chinese `clip` | [chinese_clip](https://github.com/yuanzhoulvpi2017/zero_nlp/tree/main/chinese_clip_ddp) | ✅ | ✅ | ✅ | ❌ | ✅ |
| Chinese text generation from images | [VisionEncoderDecoderModel](https://github.com/yuanzhoulvpi2017/zero_nlp/tree/main/vit-gpt2-image-chinese-captioning) | ✅ | ✅ | ✅ | ❌ | ✅ |
| Introduction to the core vit source code | [vit model](https://github.com/yuanzhoulvpi2017/zero_nlp/tree/main/vit) | ❌ | ❌ | ❌ | ❌ | ✅ |
| `Thu-ChatGlm-6b` (`v1` version, deprecated) | [simple_thu_chatglm6b](https://github.com/yuanzhoulvpi2017/zero_nlp/tree/main/simple_thu_chatglm6b) | ✅ | ✅ | ✅ | ✅ | ❌ |
| 🌟chatglm-`v2`-6b🎉 | [chatglm_v2_6b_lora](https://github.com/yuanzhoulvpi2017/zero_nlp/tree/main/chatglm_v2_6b_lora) | ✅ | ✅ | ✅ | ❌ | ❌ |
| Chinese `dolly_v2_3b` | [dolly_v2_3b](https://github.com/yuanzhoulvpi2017/zero_nlp/tree/main/chinese_dolly_v2_3b) | ✅ | ✅ | ✅ | ❌ | ❌ |
| Chinese `llama` (deprecated) | [chinese_llama](https://github.com/yuanzhoulvpi2017/zero_nlp/tree/main/chinese_llama) | ✅ | ✅ | ✅ | ❌ | ❌ |
| Chinese `bloom` | [chinese_bloom](https://github.com/yuanzhoulvpi2017/zero_nlp/tree/main/chinese_bloom) | ✅ | ✅ | ✅ | ❌ | ❌ |
| Chinese `falcon` (note: the falcon model has a structure similar to bloom) | [chinese_bloom](https://github.com/yuanzhoulvpi2017/zero_nlp/tree/main/chinese_bloom) | ✅ | ✅ | ✅ | ❌ | ❌ |
| Chinese **pretraining** code | [model_clm](https://github.com/yuanzhoulvpi2017/zero_nlp/tree/main/model_clm) | ✅ | ✅ | ✅ | ❌ | ❌ |
| Baichuan large model | [model_baichuan](https://github.com/yuanzhoulvpi2017/zero_nlp/tree/main/model_baichuan) | ✅ | ✅ | ✅ | ✅ | ❌ |
| Model trimming ✂️ | [model_modify](https://github.com/yuanzhoulvpi2017/zero_nlp/tree/main/model_modify) | ✅ | ✅ | ✅ |  |  |
| llama2 pipeline parallelism | [pipeline](https://github.com/yuanzhoulvpi2017/zero_nlp/tree/main/pipeline) | ✅ | ✅ | ✅ | ❌ | ❌ |
| `dpo` for Baichuan2-7b-chat | [DPO baichuan2-7b-chat](https://github.com/yuanzhoulvpi2017/zero_nlp/tree/main/baichuan2_dpo) | ✅ | ✅ | ✅ | ❌ | ❌ |
| Changing data proportions during training | [train_data_sample](https://github.com/yuanzhoulvpi2017/zero_nlp/tree/main/train_data_sample) | ✅ | ✅ | ✅ | ❌ | ❌ |
| internlm-base sft | [internlm-sft](https://github.com/yuanzhoulvpi2017/zero_nlp/tree/main/internlm-sft) | ✅ | ✅ | ✅ | ❌ | ❌ |
| train qwen2 | [train_qwen2](https://github.com/yuanzhoulvpi2017/zero_nlp/tree/main/train_qwen) | ✅ | ✅ | ✅ | ✅ | ❌ |
| train llava | [train_llava](https://github.com/yuanzhoulvpi2017/zero_nlp/tree/main/train_llava) | ✅ | ✅ | ✅ | ✅ | ✅ |

### Engineering notes: debug vllm

1. How to debug vllm, to gain a more thorough understanding of its engineering: [debug vllm](https://github.com/yuanzhoulvpi2017/zero_nlp/tree/main/debug_vllm)

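Data flow diagrams

I have always found that a data flow is clearest when it is expressed as a diagram, so I try to provide a diagram for every task wherever possible.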

### Text classification data diagram

![](images/文本分类.003.png)

### Chinese gpt2

![](images/chinesegpt2_bot.png)

### Chinese clip

![model](images/clip001.png)

### Chinese text generation from images

![model](images/vision-encoder-decoder.png)

### vit source code

![](images/vit_architecture.jpg)

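# Sharing transformers source-code walkthroughs

I have been producing walkthroughs of the transformers source code; the videos are available on Bilibili 👉[良睦路程序员](https://space.bilibili.com/45156039)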

