# LOT: A Benchmark for Evaluating Chinese Long Text Understanding and Generation
![](https://img.shields.io/github/last-commit/thu-coai/LOT-benchmark?color=blue)
[Tasks](#tasks) | [Datasets](#datasets) | [LongLM](#longlm) | [Baselines](#baselines) | [Paper](https://arxiv.org/abs/2108.12960)
## Introduction
LOT is a benchmark for evaluating Chinese long text modeling. LOT consists of two understanding tasks and two generation tasks. We construct new datasets for these tasks based on human-written Chinese stories.
Furthermore, we release LongLM, an encoder-decoder Chinese long-text pretrained model with up to 1 billion parameters. We pretrain LongLM on 120G of Chinese novels with two generative tasks: text infilling and conditional continuation. Extensive experiments show that LongLM substantially outperforms similar-sized pretrained models on both the understanding and generation tasks of LOT.
## Tasks
We design LOT as an aggregation of two understanding tasks, Cloze Test (ClozeT) and Sentence Position Prediction (SenPos), and two generation tasks, Plot Completion (PlotCom) and Outline-conditioned Generation (OutGen). The table below describes each task.
![](./figure/task.png)
## Datasets
We show the data statistics in the table below. The abbreviations **sent** and **len** stand for **sentence** and **length**, respectively. The datasets and evaluation scripts can be downloaded from [THUCloud](https://cloud.tsinghua.edu.cn/d/0cf033b0c7c049be855d/).
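The exact layout of the released files depends on the task; if a split is distributed as JSON Lines (one JSON object per line, a common choice for datasets like these), a minimal loader might look like the sketch below. The file name and field names here are hypothetical, so adjust them after inspecting the downloaded files.

```python
import json

def load_jsonl(path):
    """Load one JSON object per line, skipping blank lines."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                examples.append(json.loads(line))
    return examples
```

For example, `load_jsonl("clozet_train.jsonl")` (hypothetical name) would return a list of dicts, one per example.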
## LongLM
### 1. Parameters
- $d_m$: the dimension of hidden states
- $d_{ff}$: the dimension of feed forward layers
- $d_{kv}$: the dimension of the keys/values in the self-attention layers
- $n_h$: the number of attention heads
- $n_e$: the number of hidden layers of the encoder
- $n_d$: the number of hidden layers of the decoder
- \#P: the number of parameters

### 2. Pretraining Tasks
LongLM is pretrained with two generative tasks: text infilling and conditional continuation.
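As a rough, character-level illustration of the text-infilling objective (T5-style span corruption): a span of the input is replaced by a sentinel token in the source, and the target reconstructs the masked span. This is only a sketch; real preprocessing operates on subword tokens, and the sentinel name follows T5's convention.

```python
def make_infilling_example(tokens, span_start, span_len, sentinel="<extra_id_0>"):
    """Replace tokens[span_start:span_start+span_len] with a sentinel in the
    source; the target is the sentinel followed by the masked span."""
    source = tokens[:span_start] + [sentinel] + tokens[span_start + span_len:]
    target = [sentinel] + tokens[span_start:span_start + span_len]
    return source, target

src, tgt = make_infilling_example(list("他推开门走了进去"), 2, 3)
print(src)  # ['他', '推', '<extra_id_0>', '了', '进', '去']
print(tgt)  # ['<extra_id_0>', '开', '门', '走']
```

Conditional continuation is simpler still: the source is a story prefix and the target is its continuation.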
### 3. Pretraining Data
We collect 120G novels as the pretraining data for LongLM. Part of the pretraining data are [publicly available](https://cloud.tsinghua.edu.cn/d/a5a16f2381e7439eb475/).
### 4. Checkpoints
1. **Download:** The checkpoints and example data can be downloaded from [THUCloud](https://cloud.tsinghua.edu.cn/d/576f340a43964a23b1a5/) or [Hugging Face Model Card](https://huggingface.co/thu-coai). The training and generation scripts are under the directory `longlm`.
2. **Model Loading:**
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained('thu-coai/LongLM-large')
model = T5ForConditionalGeneration.from_pretrained('thu-coai/LongLM-large')
```

3. **Training:**
Execute `bash ./finetune.sh` to fine-tune LongLM. If DeepSpeed is available, you can execute `bash ./finetune_deepspped.sh` to accelerate training. You can also use the [official script](https://github.com/huggingface/transformers/tree/v4.6.0-release/examples/legacy/seq2seq) provided by Transformers to fine-tune the model.
```shell
# Flags: data_dir = directory of the data; train_name = file prefix of the
# training data; output_dir = where to save checkpoints; save_total_limit =
# maximum number of saved checkpoints; per_gpu_train/eval_batch_size = batch
# sizes for training/evaluation; num_train_epochs = number of training epochs;
# logging_steps = number of steps between loss logs; model_name_or_path = path
# to the pretrained model; warmup_steps = number of warmup steps;
# learning_rate = learning rate; n_val = number of validation examples;
# do_train/do_eval = run training/evaluation; evaluation_strategy = when to
# evaluate; gradient_accumulation_steps = steps of gradient accumulation.
env CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 CUDA_LAUNCH_BLOCKING=1 python3 -m torch.distributed.launch --nproc_per_node=8 \
    finetune_trainer.py \
    --data_dir=./data \
    --train_name=train \
    --output_dir=./save_model \
    --save_total_limit=10 \
    --per_gpu_train_batch_size=3 \
    --per_gpu_eval_batch_size=3 \
    --num_train_epochs=1 \
    --logging_steps=5 \
    --model_name_or_path=./LongLM-small \
    --warmup_steps=100 \
    --learning_rate=1e-4 \
    --n_val=100 \
    --do_train --do_eval \
    --evaluation_strategy steps \
    --gradient_accumulation_steps=40 \
    --overwrite_output_dir \
    --load_best_model_at_end
```

4. **Generation:**
```python
# `tokenizer` and `model` are the objects loaded above.
device = "cuda"  # or "cpu"
model = model.to(device)
input_ids = tokenizer("小咕噜对,", return_tensors="pt", padding=True, truncation=True, max_length=512).input_ids.to(device)
gen = model.generate(input_ids, do_sample=True, decoder_start_token_id=1, top_p=0.9, max_length=512)
```

### 5. Dependencies
```
datasets 1.6.2
deepspeed 0.3.16
huggingface-hub 0.0.8
jieba 0.42.1
jsonlines 2.0.0
nltk 3.5
numpy 1.19.5
pytorch-lightning 1.2.0
regex 2020.11.13
rouge 1.0.1
rouge-score 0.0.4
sacrebleu 1.5.0
scipy 1.5.4
sentencepiece 0.1.95
tokenizers 0.10.1
torch 1.8.1
torchaudio 0.8.0
torchmetrics 0.2.0
torchvision 0.9.0
transformers 4.6.1
```

## Baselines
### 1. Understanding Tasks
The example data and the training and evaluation scripts for fine-tuning LongLM on the understanding tasks are under `./baselines/understanding`. Execute `bash ./finetune.sh` to fine-tune LongLM and `bash ./eval.sh` to evaluate the fine-tuned model.
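Both understanding tasks are typically scored by accuracy. The released `eval.sh` handles this for you, but a minimal stand-alone scorer (independent of the released scripts) might look like:

```python
def accuracy(predictions, gold):
    """Fraction of examples where the prediction matches the gold label."""
    if len(predictions) != len(gold):
        raise ValueError("prediction/label counts differ")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.75
```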
### 2. Generation Tasks
The training script of LongLM for the generation tasks is the same as the pretraining script. The generation script and example data are under `./baseline/generation`. Execute `bash ./gen.sh` to generate.
## Citation
```bibtex
@misc{guan2021lot,
      title={LOT: A Benchmark for Evaluating Chinese Long Text Understanding and Generation},
      author={Jian Guan and Zhuoer Feng and Yamei Chen and Ruilin He and Xiaoxi Mao and Changjie Fan and Minlie Huang},
      year={2021},
      eprint={2108.12960},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```