https://github.com/freedomintelligence/cail2022
- Host: GitHub
- URL: https://github.com/freedomintelligence/cail2022
- Owner: FreedomIntelligence
- Created: 2023-12-05T09:12:35.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-04-09T06:57:27.000Z (about 2 years ago)
- Last Synced: 2025-03-11T22:19:40.218Z (about 1 year ago)
- Language: Python
- Size: 180 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# CAIL2022
## ⚡ Introduction
This is the code for the paper
## ⚒️ Training
### Install the dependencies
```
pip install -r requirements.txt
```
### Download the pretrained models
Download the pretrained XLNET and T5 models into `train-stage1` and `train-stage2`:
- **XLNET:** https://huggingface.co/hfl/chinese-xlnet-mid
- **T5:** https://huggingface.co/imxly/t5-pegasus
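
As one possible way to fetch the two checkpoints, the sketch below uses `snapshot_download` from the third-party `huggingface_hub` package. It assumes that XLNET belongs to the first stage and T5 to the second, and that each model sits in a subfolder named after its repository. Both are assumptions, and `plan_downloads` and `download_all` are hypothetical helper names, so adjust the target paths to whatever the training scripts actually expect.

```python
from pathlib import Path
from typing import List, Tuple

# (Hugging Face repo id, stage folder). The stage assignment is an assumption.
MODELS = [
    ("hfl/chinese-xlnet-mid", "train-stage1"),
    ("imxly/t5-pegasus", "train-stage2"),
]


def plan_downloads(root: str = ".") -> List[Tuple[str, Path]]:
    """Map each repo id to a target folder such as train-stage1/chinese-xlnet-mid."""
    return [
        (repo_id, Path(root) / stage / repo_id.split("/")[-1])
        for repo_id, stage in MODELS
    ]


def download_all(root: str = ".") -> None:
    """Download every model snapshot into its planned folder."""
    # Imported lazily so the planning helper works without the package installed.
    from huggingface_hub import snapshot_download

    for repo_id, target in plan_downloads(root):
        snapshot_download(repo_id=repo_id, local_dir=str(target))


if __name__ == "__main__":
    for repo_id, target in plan_downloads():
        print(f"{repo_id} -> {target}")
    # download_all()  # uncomment to actually download (requires network access)
```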
### First Stage
You can train the first-stage model by:
```
cd train-stage1
bash run.sh
```
After training, you can prepare the data for the second stage:
1️⃣ Since we want to extract the domain-related sentences, for each fold we take the first-stage model with the highest F1 score for the important label (1) on that fold's test data as the best model, and fill its index into `best_model_index_list` in `prepare_for_generate.py`.
2️⃣ Run the command to generate `train.jsonl`:
```
CUDA_VISIBLE_DEVICES=0 nohup python3 prepare_for_generate.py > generate_data.out &
```
3️⃣ Select the high-quality sentences based on your own threshold:
```
python3 select_text.py
```
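
The selection logic in the steps above can be sketched as follows. This is a minimal illustration, not the code in `prepare_for_generate.py` or `select_text.py`: the function names, the `text` and `score` fields, and the sample numbers are all hypothetical. It computes F1 for the important label (1), picks the best model index per fold, and filters sentences by a threshold.

```python
from typing import Dict, List, Sequence


def f1_for_label(y_true: Sequence[int], y_pred: Sequence[int], label: int = 1) -> float:
    """F1 score for a single label, computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_model_indices(f1_by_fold: List[List[float]]) -> List[int]:
    """For each fold, return the index of the model with the highest F1.

    Ties are resolved in favour of the earliest index.
    """
    return [max(range(len(scores)), key=scores.__getitem__) for scores in f1_by_fold]


def select_sentences(scored: List[Dict], threshold: float) -> List[str]:
    """Keep sentences whose (hypothetical) 'score' field reaches the threshold."""
    return [item["text"] for item in scored if item["score"] >= threshold]


if __name__ == "__main__":
    # Made-up F1 values: one inner list per fold, one entry per saved model.
    f1_by_fold = [[0.71, 0.78, 0.75], [0.80, 0.79, 0.82]]
    print(best_model_indices(f1_by_fold))  # [1, 2]

    scored = [
        {"text": "sentence A", "score": 0.9},
        {"text": "sentence B", "score": 0.3},
    ]
    print(select_sentences(scored, threshold=0.5))  # ['sentence A']
```

A higher threshold keeps fewer, more confidently labelled sentences for the second stage; a lower one keeps more context at the cost of more noise.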
### Second Stage
1️⃣ Put the data selected in the first stage (e.g. `train_stage2_0.5.jsonl`) and the evaluation data in the `train-stage2/data_dir/` folder.
2️⃣ Train the generative model:
```
cd ../train-stage2
bash run.sh
```
3️⃣ Select the best model on the evaluation set (if there is no evaluation set, choose the model that performed best during training):
```
CUDA_VISIBLE_DEVICES=0 PYTHONIOENCODING=UTF-8 nohup python3 main.py > main.out &
python3 evaluate.py > evaluate.out
```
### Evaluation
1️⃣ Put the extractive models in the `e2e/extractor_model/` folder, the abstractive model in the `e2e/generator_model` folder, and the test data in the `e2e/data_dir/` folder, then update the paths to your downloaded XLNET and T5 models.
2️⃣ Fill in `best_model_index_list` in `extractor.py` with the same model indices used in the **First Stage** of training.
3️⃣ Generate the two-stage summary:
```
cd ../e2e
bash run.sh
```
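
Conceptually, the end-to-end run chains the extractor and the generator. The sketch below shows that flow with plain callables standing in for the trained models. `two_stage_summary`, `score_sentence`, `generate`, the default threshold, and the fallback to the full document are all assumptions made for illustration, not the behaviour of the scripts in `e2e/`.

```python
from typing import Callable, List


def two_stage_summary(
    sentences: List[str],
    score_sentence: Callable[[str], float],
    generate: Callable[[str], str],
    threshold: float = 0.5,
) -> str:
    """Extract sentences scoring at or above the threshold, then summarize them."""
    selected = [s for s in sentences if score_sentence(s) >= threshold]
    # Assumption: if the extractor keeps nothing, fall back to the full document.
    if not selected:
        selected = sentences
    # Chinese text is concatenated without separators.
    return generate("".join(selected))


if __name__ == "__main__":
    # Toy stand-ins: a keyword-based scorer and a truncating "generator".
    def toy_scorer(sentence: str) -> float:
        return 1.0 if "法院" in sentence else 0.0

    def toy_generator(text: str) -> str:
        return text[:10]

    doc = ["今天天气很好。", "法院作出了判决。"]
    print(two_stage_summary(doc, toy_scorer, toy_generator))
```

Passing the two models in as callables keeps the pipeline logic testable without loading any checkpoints.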
## 🏆 Awards
Our team won the first prize in the 2022 CAIL Summary of Legal Public Opinion.

## 📕 Citation