https://github.com/lonepatient/clue_pytorch
CLUE baseline in PyTorch (the PyTorch version of the CLUE baselines)
- Host: GitHub
- URL: https://github.com/lonepatient/clue_pytorch
- Owner: lonePatient
- Created: 2019-10-15T14:32:34.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2020-04-03T14:11:51.000Z (about 6 years ago)
- Last Synced: 2025-04-10T09:39:56.896Z (about 1 year ago)
- Topics: albert, bert, chinese, classification, clue, ernie, glue, pytorch, roberta, xlnet
- Language: Python
- Homepage: https://github.com/CLUEbenchmark/CLUE
- Size: 340 KB
- Stars: 74
- Watchers: 2
- Forks: 17
- Open Issues: 0
Metadata Files:
- Readme: README.md
## CLUE_pytorch
Chinese Language Understanding Evaluation benchmark (CLUE)
**Note**: This is a personal development version (it currently supports all classification tasks). For the official version, see https://github.com/CLUEbenchmark/CLUE
## Updates
* **2020-03-08**: Models are now loaded through [Huggingface-Transformers](https://github.com/huggingface/transformers)
## Model list
| model type | model_name_or_path |
| :-------------------: | :----------------------------------------------------------: |
| albert | [`voidful/albert_chinese_base`](https://huggingface.co/voidful/albert_chinese_base) |
| albert | [`voidful/albert_chinese_large`](https://huggingface.co/voidful/albert_chinese_large) |
| albert | [`voidful/albert_chinese_small`](https://huggingface.co/voidful/albert_chinese_small) |
| albert | [`voidful/albert_chinese_tiny`](https://huggingface.co/voidful/albert_chinese_tiny) |
| albert | [`voidful/albert_chinese_xlarge`](https://huggingface.co/voidful/albert_chinese_xlarge) |
| albert | [`voidful/albert_chinese_xxlarge`](https://huggingface.co/voidful/albert_chinese_xxlarge) |
| bert | [`bert-base-chinese`](https://huggingface.co/bert-base-chinese) |
| bert-wwm-ext | [`hfl/chinese-bert-wwm-ext`](https://huggingface.co/hfl/chinese-bert-wwm-ext) |
| bert-wwm | [`hfl/chinese-bert-wwm`](https://huggingface.co/hfl/chinese-bert-wwm) |
| roberta-wwm-ext-large | [`hfl/chinese-roberta-wwm-ext-large`](https://huggingface.co/hfl/chinese-roberta-wwm-ext-large) |
| roberta-wwm-ext | [`hfl/chinese-roberta-wwm-ext`](https://huggingface.co/hfl/chinese-roberta-wwm-ext) |
| xlnet-base | [`hfl/chinese-xlnet-base`](https://huggingface.co/hfl/chinese-xlnet-base) |
| xlnet-mid | [`hfl/chinese-xlnet-mid`](https://huggingface.co/hfl/chinese-xlnet-mid) |
| rbt3 | [`hfl/rbt3`](https://huggingface.co/hfl/rbt3) |
| rbtl3 | [`hfl/rbtl3`](https://huggingface.co/hfl/rbtl3) |
| RoBERTa-tiny-clue | [`clue/roberta_chinese_clue_tiny`](https://huggingface.co/clue/roberta_chinese_clue_tiny) |
| RoBERTa-tiny-pair | [`clue/roberta_chinese_pair_tiny`](https://huggingface.co/clue/roberta_chinese_pair_tiny) |
| RoBERTa-tiny3L768-clue | [`clue/roberta_chinese_3L768_clue_tiny`](https://huggingface.co/clue/roberta_chinese_3L768_clue_tiny) |
| RoBERTa-tiny3L312-clue | [`clue/roberta_chinese_3L312_clue_tiny`](https://huggingface.co/clue/roberta_chinese_3L312_clue_tiny) |
| RoBERTa-large-clue | [`clue/roberta_chinese_clue_large`](https://huggingface.co/clue/roberta_chinese_clue_large) |
| RoBERTa-large-pair | [`clue/roberta_chinese_pair_large`](https://huggingface.co/clue/roberta_chinese_pair_large) |
## Code directory layout
```text
├── CLUEdatasets              # datasets
|   └── tnews
|   └── wsc
|   └── ...
├── metrics                   # metric computation
|   └── clue_compute_metrics.py
├── outputs                   # saved model outputs
|   └── tnews_output
|   └── wsc_output
|   └── ...
├── prev_trained_model        # pretrained models
|   └── albert_base
|   └── bert-wwm
|   └── ...
├── processors                # data processing
|   └── clue.py
|   └── ...
├── tools                     # general-purpose scripts
|   └── progressbar.py
|   └── ...
├── run_classifier.py         # main program
├── run_classifier_tnews.sh   # task run script
```
### Dependencies
- pytorch=1.1.0
- boto3=1.9
- regex
- sacremoses
- sentencepiece
- python3.6+
- transformers=2.5.1
### How to run
**1. Install Transformers**
```shell
pip install transformers==2.5.1
```
**2. Download the CLUE datasets by running the following command:**
```shell
python download_clue_data.py --data_dir=./CLUEdatasets --tasks=all
```
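The command above downloads all CLUE datasets by default. You can also pass `--tasks` to download only the datasets for specific tasks. Files are saved under `./CLUEdatasets/{task}` by default.
**Note**: If you use model weights that were already downloaded locally, the model folder must also contain `config.json` and `vocab.txt`, for example: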
```text
├── prev_trained_model # pretrained models
| └── bert-base
| | └── vocab.txt
| | └── config.json
| | └── pytorch_model.bin
```
To use existing local model weights, simply set `--model_name_or_path=your_local_model_weight_path`.
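Before pointing `--model_name_or_path` at a local folder, it can help to confirm that the folder is complete. The following is a minimal sketch, not part of this repository: the helper name `missing_model_files` is hypothetical, and it simply checks for the three files shown in the tree above.

```python
from pathlib import Path

# File names taken from the directory tree above.
EXPECTED_FILES = ("config.json", "vocab.txt", "pytorch_model.bin")


def missing_model_files(model_dir):
    """Return the expected file names that are absent from model_dir."""
    folder = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (folder / name).is_file()]


if __name__ == "__main__":
    # Hypothetical local path, matching the example tree.
    missing = missing_model_files("prev_trained_model/bert-base")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("Model folder looks complete.")
```

An empty list means the folder contains everything the example layout lists.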
**3. Run the shell script for the task you want, for example:**
```shell
sh run_classifier_tnews.sh
```
The script runs the following:
```shell
CURRENT_DIR=`pwd`
export CLUE_DIR=$CURRENT_DIR/CLUEdatasets
export OUTPUT_DIR=$CURRENT_DIR/outputs
TASK_NAME="iflytek"
python run_classifier.py \
  --model_type=albert \
  --model_name_or_path=voidful/albert_chinese_tiny \
  --task_name=$TASK_NAME \
  --do_train \
  --do_lower_case \
  --evaluate_during_training \
  --data_dir=$CLUE_DIR/${TASK_NAME}/ \
  --max_seq_length=128 \
  --per_gpu_train_batch_size=16 \
  --per_gpu_eval_batch_size=16 \
  --learning_rate=2e-4 \
  --num_train_epochs=6.0 \
  --logging_steps=759 \
  --save_steps=759 \
  --output_dir=$OUTPUT_DIR/${TASK_NAME}_output/ \
  --overwrite_output_dir \
  --seed=42
```
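If you prefer to launch training from Python rather than from a shell script, the same call can be assembled as an argument list. This is only a sketch under assumptions: the function `build_train_command` is hypothetical, and it uses only the flags and values that appear in the command above.

```python
import shlex
from pathlib import Path


def build_train_command(task_name, model_type, model_name_or_path, base_dir="."):
    """Assemble the run_classifier.py training call as an argument list."""
    base = Path(base_dir)
    data_dir = (base / "CLUEdatasets" / task_name).as_posix()
    output_dir = (base / "outputs" / f"{task_name}_output").as_posix()
    return [
        "python", "run_classifier.py",
        f"--model_type={model_type}",
        f"--model_name_or_path={model_name_or_path}",
        f"--task_name={task_name}",
        "--do_train",
        "--do_lower_case",
        "--evaluate_during_training",
        f"--data_dir={data_dir}/",
        "--max_seq_length=128",
        "--per_gpu_train_batch_size=16",
        "--per_gpu_eval_batch_size=16",
        "--learning_rate=2e-4",
        "--num_train_epochs=6.0",
        "--logging_steps=759",
        "--save_steps=759",
        f"--output_dir={output_dir}/",
        "--overwrite_output_dir",
        "--seed=42",
    ]


if __name__ == "__main__":
    cmd = build_train_command("iflytek", "albert", "voidful/albert_chinese_tiny")
    # Print the command in a copy-pasteable form; nothing is executed here.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` from the repository root would start the run. Keeping the arguments in a list avoids shell quoting problems.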
**Note**:
> `--model_name_or_path=voidful/albert_chinese_tiny` downloads `albert_chinese_tiny` automatically by default.
> Currently only Google's version of the Chinese ALBERT models is supported.
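The values `--logging_steps=759` and `--save_steps=759` in the training command are not explained in this README. One plausible reading is that they correspond to one log and one checkpoint per epoch. The sketch below shows the arithmetic. It assumes a single GPU and a training split of 12,133 examples for `iflytek`. That size is the figure reported in the CLUE paper, not something this README states, so treat it as an assumption.

```python
import math


def steps_per_epoch(num_examples, per_gpu_batch_size, n_gpu=1):
    """Number of batches in one pass over the training set."""
    effective_batch = per_gpu_batch_size * max(1, n_gpu)
    return math.ceil(num_examples / effective_batch)


# Assumed training-set size for iflytek (not stated in this README).
ASSUMED_IFLYTEK_TRAIN_SIZE = 12133

print(steps_per_epoch(ASSUMED_IFLYTEK_TRAIN_SIZE, per_gpu_batch_size=16))  # 759
```

With a different dataset, batch size, or number of GPUs, the step count per epoch changes, so the two flags should be adjusted accordingly.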
**4. Evaluation**
By default, the last checkpoint is used as the evaluation model. You can also pass `--predict_checkpoints` to evaluate a specific checkpoint, for example:
```shell
CURRENT_DIR=`pwd`
export CLUE_DIR=$CURRENT_DIR/CLUEdatasets
export OUTPUT_DIR=$CURRENT_DIR/outputs
TASK_NAME="copa"
python run_classifier.py \
  --model_type=albert \
  --model_name_or_path=voidful/albert_chinese_tiny \
  --task_name=$TASK_NAME \
  --do_predict \
  --predict_checkpoints=100 \
  --do_lower_case \
  --data_dir=$CLUE_DIR/${TASK_NAME}/ \
  --max_seq_length=128 \
  --per_gpu_train_batch_size=16 \
  --per_gpu_eval_batch_size=16 \
  --learning_rate=1e-5 \
  --num_train_epochs=2.0 \
  --logging_steps=50 \
  --save_steps=50 \
  --output_dir=$OUTPUT_DIR/${TASK_NAME}_output/ \
  --overwrite_output_dir \
  --seed=42
```
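The selection rule described above (use the last checkpoint unless a specific step is requested) can be sketched as follows. This is an illustration, not the repository's code. It assumes checkpoints are saved as sub-folders named `checkpoint-<step>` inside the output directory, and the helper names are hypothetical.

```python
import re
from pathlib import Path

# Assumed naming scheme for checkpoint folders: checkpoint-<global step>.
CHECKPOINT_PATTERN = re.compile(r"^checkpoint-(\d+)$")


def list_checkpoints(output_dir):
    """Return (step, path) pairs for checkpoint folders, sorted by step."""
    found = []
    for child in Path(output_dir).iterdir():
        match = CHECKPOINT_PATTERN.match(child.name)
        if match and child.is_dir():
            found.append((int(match.group(1)), child))
    # Sort numerically so that step 1000 comes after step 150.
    return sorted(found, key=lambda item: item[0])


def pick_checkpoint(output_dir, predict_checkpoint=None):
    """Return the folder for the requested step, or the last one by default."""
    checkpoints = list_checkpoints(output_dir)
    if not checkpoints:
        raise FileNotFoundError(f"no checkpoint folders found in {output_dir}")
    if predict_checkpoint is None:
        return checkpoints[-1][1]
    for step, path in checkpoints:
        if step == predict_checkpoint:
            return path
    raise ValueError(f"no checkpoint saved at step {predict_checkpoint}")
```

Under this assumption, `--predict_checkpoints=100` with `--save_steps=50` would refer to the second saved checkpoint.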
### Model classes
```python
"bert": (BertConfig, BertForSequenceClassification, BertTokenizer),
"ernie": (BertConfig, BertForSequenceClassification, BertTokenizer),
"xlnet": (XLNetConfig, XLNetForSequenceClassification, XLNetTokenizer),
"roberta": (BertConfig, BertForSequenceClassification, BertTokenizer),
"albert": (AlbertConfig, AlbertForSequenceClassification, BertTokenizer),
```
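Each entry in the mapping above is a `(config, model, tokenizer)` triple selected by `--model_type`. Note that `ernie` and `roberta` reuse the BERT classes, and `albert` uses `BertTokenizer`. The sketch below shows one way such a registry could be looked up. The names `MODEL_CLASS_NAMES`, `class_names_for`, and `load_classes` are hypothetical, and `transformers` is imported lazily so that the lookup itself has no heavy dependencies.

```python
import importlib

# Class names copied from the mapping above, keyed by --model_type.
MODEL_CLASS_NAMES = {
    "bert": ("BertConfig", "BertForSequenceClassification", "BertTokenizer"),
    "ernie": ("BertConfig", "BertForSequenceClassification", "BertTokenizer"),
    "xlnet": ("XLNetConfig", "XLNetForSequenceClassification", "XLNetTokenizer"),
    "roberta": ("BertConfig", "BertForSequenceClassification", "BertTokenizer"),
    "albert": ("AlbertConfig", "AlbertForSequenceClassification", "BertTokenizer"),
}


def class_names_for(model_type):
    """Return the (config, model, tokenizer) class names for a model type."""
    try:
        return MODEL_CLASS_NAMES[model_type]
    except KeyError:
        supported = ", ".join(sorted(MODEL_CLASS_NAMES))
        raise ValueError(
            f"unknown model_type {model_type!r}; expected one of: {supported}"
        ) from None


def load_classes(model_type):
    """Import transformers on demand and return the three classes."""
    transformers = importlib.import_module("transformers")
    return tuple(getattr(transformers, name) for name in class_names_for(model_type))


# Example use (requires transformers and network access, so left commented out):
# config_class, model_class, tokenizer_class = load_classes("albert")
# tokenizer = tokenizer_class.from_pretrained("voidful/albert_chinese_tiny")
```

Failing early on an unknown `model_type` gives a clearer error than a bare `KeyError` deep inside the training script.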
### Results
CLUEWSC2020: the Chinese version of the WSC Winograd Schema Challenge, new version released 2020-03-25. [Download the CLUEWSC2020 dataset](https://storage.googleapis.com/cluebenchmark/tasks/cluewsc2020_public.zip)
| Model | Dev set |
| :------- | :---------: |
| bert_base | 79.94 |