Source code for the paper "Local Additivity Based Data Augmentation for Semi-supervised NER"
- Host: GitHub
- URL: https://github.com/SALT-NLP/LADA
- Owner: SALT-NLP
- License: mit
- Created: 2020-10-01T02:44:09.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2022-10-15T01:58:17.000Z (over 2 years ago)
- Last Synced: 2024-08-24T17:26:22.236Z (8 months ago)
- Topics: dataaugmentation, lada, mixup, ner, semisupervised-learning
- Language: Python
- Size: 950 KB
- Stars: 44
- Watchers: 6
- Forks: 5
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project:
- Awesome-Mixup
# LADA
This repo contains code for the following paper:

*Jiaao Chen\*, Zhenghui Wang\*, Ran Tian, Zichao Yang, Diyi Yang*: Local Additivity Based Data Augmentation for Semi-supervised NER. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020)
If you would like to refer to it, please cite the paper mentioned above.
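For convenience, here is a BibTeX entry assembled only from the information above (the citation key is arbitrary; fields such as pages are omitted rather than guessed):

```bibtex
@inproceedings{chen2020local,
  title     = {Local Additivity Based Data Augmentation for Semi-supervised NER},
  author    = {Chen, Jiaao and Wang, Zhenghui and Tian, Ran and Yang, Zichao and Yang, Diyi},
  booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year      = {2020}
}
```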
## Getting Started
These instructions will get you running the LADA code.

### Requirements
* Python 3.6 or higher
* PyTorch >= 1.4.0
* pytorch_transformers (also known as transformers)
* Pandas, NumPy, pickle, faiss, sentence-transformers
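A minimal sketch of one way to install the Python dependencies with pip (package choices such as `faiss-cpu` vs. `faiss-gpu` are assumptions, `pickle` ships with Python, and exact versions may need adjusting for your environment):

```shell
pip install "torch>=1.4.0" pytorch_transformers pandas numpy faiss-cpu sentence-transformers
```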
### Code Structure
```
├── code/
│   ├── BERT/
│   │   ├── back_translate.ipynb --> Jupyter Notebook for back-translating the dataset
│   │   ├── bert_models.py --> code for the LADA-based BERT models
│   │   ├── eval_utils.py --> code for evaluation
│   │   ├── knn.ipynb --> Jupyter Notebook for building the kNN index file
│   │   ├── read_data.py --> code for data pre-processing
│   │   ├── train.py --> code for training the BERT model
│   │   └── ...
│   ├── flair/
│   │   ├── train.py --> code for training the flair model
│   │   ├── knn.ipynb --> Jupyter Notebook for building the kNN index file
│   │   ├── flair/ --> the flair library
│   │   │   └── ...
│   │   ├── resources/
│   │   │   ├── docs/ --> flair library docs
│   │   │   ├── taggers/ --> saved evaluation results for the flair model
│   │   │   └── tasks/
│   │   │       └── conll_03/
│   │   │           ├── sent_id_knn_749.pkl --> kNN index file
│   │   │           └── ... --> CoNLL-2003 dataset
│   │   └── ...
├── data/
│   └── conll2003/
│       ├── de.pkl --> back-translated training data with German as the intermediate language
│       ├── labels.txt --> label index file
│       ├── sent_id_knn_700.pkl --> kNN index file
│       └── ... --> CoNLL-2003 dataset
├── eval/
│   └── conll2003/ --> saved evaluation results for the BERT model
└── README.md
```
## BERT models

### Downloading the data
Please download the CoNLL-2003 dataset and save it under `./data/conll2003/` as `train.txt`, `dev.txt`, and `test.txt`.
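For reference, CoNLL-2003 files are whitespace-separated with one token per line and four columns (token, POS tag, chunk tag, NER tag); sentences are separated by blank lines and documents by `-DOCSTART-` lines. Depending on the copy you obtain, the NER column may use IOB1 or BIO tags, and which columns `read_data.py` actually consumes is not spelled out here, so treat the snippet below purely as an illustration of the layout:

```
-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN B-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O
```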
### Pre-processing the data
We utilize [Fairseq](https://github.com/pytorch/fairseq) to perform back translation on the training dataset. Please refer to `./code/BERT/back_translate.ipynb` for details.
Here, we have put one example of back-translated data, `de.pkl`, in `./data/conll2003/`. You can directly use it for CoNLL-2003 or generate your own back-translated data following `./code/BERT/back_translate.ipynb`.
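The exact recipe lives in `./code/BERT/back_translate.ipynb`; as a rough sketch of the idea, back translation can be done with Fairseq's pre-trained WMT'19 models from `torch.hub` (the model names, beam size, and required extras such as `fastBPE`/`sacremoses` are assumptions, not necessarily what the notebook uses):

```python
import torch

# English->German and German->English WMT'19 models from the Fairseq torch.hub page.
en2de = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.en-de.single_model',
                       tokenizer='moses', bpe='fastbpe')
de2en = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.de-en.single_model',
                       tokenizer='moses', bpe='fastbpe')

def back_translate(sentence: str) -> str:
    """Paraphrase an English sentence by translating it to German and back."""
    german = en2de.translate(sentence, beam=5)
    return de2en.translate(german, beam=5)

print(back_translate('EU rejects German call to boycott British lamb .'))
```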
We also provide the kNN index file for the first 700 training sentences (5%), `./data/conll2003/sent_id_knn_700.pkl`. You can directly use it for CoNLL-2003 or generate your own kNN index file following `./code/BERT/knn.ipynb`.
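For the kNN index itself, `./code/BERT/knn.ipynb` is the authoritative recipe; the sketch below only illustrates the general approach with `sentence-transformers` and `faiss` (the embedding model name, the number of neighbours, the output file name, and the pickled structure are assumptions):

```python
import pickle
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

# Replace with the first 700 training sentences read from data/conll2003/train.txt.
sentences = [
    'EU rejects German call to boycott British lamb .',
    'Peter Blackburn',
    'BRUSSELS 1996-08-22',
]
k = 5  # number of neighbours to keep, matching --num-knn-k below

# Embed every sentence and index the embeddings with an exact L2 index.
model = SentenceTransformer('bert-base-nli-mean-tokens')
embeddings = np.asarray(model.encode(sentences), dtype='float32')
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

# For every sentence, retrieve its k nearest neighbours (position 0 is the sentence itself).
_, neighbour_ids = index.search(embeddings, k + 1)

with open('sent_id_knn_example.pkl', 'wb') as f:  # hypothetical output name
    pickle.dump(neighbour_ids, f)
```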
### Training models
This section contains instructions for training models on CoNLL-2003 using 5% of the training data.

#### Training BERT+Intra-LADA model
```shell
python ./code/BERT/train.py --data-dir 'data/conll2003' --model-type 'bert' \
--model-name 'bert-base-multilingual-cased' --output-dir 'eval/conll2003' --gpu '0,1' \
--labels 'data/conll2003/labels.txt' --max-seq-length 164 --overwrite-output-dir \
--do-train --do-eval --do-predict --evaluate-during-training --batch-size 16 \
--num-train-epochs 20 --save-steps 750 --seed 1 --train-examples 700 --eval-batch-size 128 \
--pad-subtoken-with-real-label --eval-pad-subtoken-with-first-subtoken-only --label-sep-cls \
--mix-layers-set 8 9 10 --beta 1.5 --alpha 60 --mix-option --use-knn-train-data \
--num-knn-k 5 --knn-mix-ratio 0.5 --intra-mix-ratio 1
```
#### Training BERT+Inter-LADA model
```shell
python ./code/BERT/train.py --data-dir 'data/conll2003' --model-type 'bert' \
--model-name 'bert-base-multilingual-cased' --output-dir 'eval/conll2003' --gpu '0,1' \
--labels 'data/conll2003/labels.txt' --max-seq-length 164 --overwrite-output-dir \
--do-train --do-eval --do-predict --evaluate-during-training --batch-size 16 \
--num-train-epochs 20 --save-steps 750 --seed 1 --train-examples 700 --eval-batch-size 128 \
--pad-subtoken-with-real-label --eval-pad-subtoken-with-first-subtoken-only --label-sep-cls \
--mix-layers-set 8 9 10 --beta 1.5 --alpha 60 --mix-option --use-knn-train-data \
--num-knn-k 5 --knn-mix-ratio 0.5 --intra-mix-ratio -1
```
#### Training BERT+Semi-Intra-LADA model
```shell
python ./code/BERT/train.py --data-dir 'data/conll2003' --model-type 'bert' \
--model-name 'bert-base-multilingual-cased' --output-dir 'eval/conll2003' --gpu '0,1' \
--labels 'data/conll2003/labels.txt' --max-seq-length 164 --overwrite-output-dir \
--do-train --do-eval --do-predict --evaluate-during-training --batch-size 16 \
--num-train-epochs 20 --save-steps 750 --seed 1 --train-examples 700 --eval-batch-size 128 \
--pad-subtoken-with-real-label --eval-pad-subtoken-with-first-subtoken-only --label-sep-cls \
--mix-layers-set 8 9 10 --beta 1.5 --alpha 60 --mix-option --use-knn-train-data \
--num-knn-k 5 --knn-mix-ratio 0.5 --intra-mix-ratio 1 \
--u-batch-size 32 --semi --T 0.6 --sharp --weight 0.05 --semi-pkl-file 'de.pkl' \
--semi-num 10000 --semi-loss 'mse' --ignore-last-n-label 4 --warmup-semi --num-semi-iter 1 \
--semi-loss-method 'origin'
```
#### Training BERT+Semi-Inter-LADA model
```shell
python ./code/BERT/train.py --data-dir 'data/conll2003' --model-type 'bert' \
--model-name 'bert-base-multilingual-cased' --output-dir 'eval/conll2003' --gpu '0,1' \
--labels 'data/conll2003/labels.txt' --max-seq-length 164 --overwrite-output-dir \
--do-train --do-eval --do-predict --evaluate-during-training --batch-size 16 \
--num-train-epochs 20 --save-steps 750 --seed 1 --train-examples 700 --eval-batch-size 128 \
--pad-subtoken-with-real-label --eval-pad-subtoken-with-first-subtoken-only --label-sep-cls \
--mix-layers-set 8 9 10 --beta 1.5 --alpha 60 --mix-option --use-knn-train-data \
--num-knn-k 5 --knn-mix-ratio 0.5 --intra-mix-ratio -1 \
--u-batch-size 32 --semi --T 0.6 --sharp --weight 0.05 --semi-pkl-file 'de.pkl' \
--semi-num 10000 --semi-loss 'mse' --ignore-last-n-label 4 --warmup-semi --num-semi-iter 1 \
--semi-loss-method 'origin'
```
## flair models

[flair](https://github.com/flairNLP/flair) is a BiLSTM-CRF sequence labeling model, and we provide code for flair+Inter-LADA.
### Downloading the data
Please download the CoNLL-2003 dataset and save it under `./code/flair/resources/tasks/conll_03/` as `eng.train`, `eng.testa` (dev), and `eng.testb` (test).

### Pre-processing the data
We also provide the kNN index file for the first 749 training sentences (5%, including the `-DOCSTART-` separator), `./code/flair/resources/tasks/conll_03/sent_id_knn_749.pkl`. You can directly use it for CoNLL-2003 or generate your own kNN index file following `./code/flair/knn.ipynb`.
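If you want to sanity-check the provided index file before training, it is a regular pickle; the sketch below just loads and inspects it (the exact structure of the stored object is whatever `./code/flair/knn.ipynb` produced, so no particular shape is assumed):

```python
import pickle

# Load the provided kNN index file for the first 749 training sentences.
with open('code/flair/resources/tasks/conll_03/sent_id_knn_749.pkl', 'rb') as f:
    sent_id_knn = pickle.load(f)

print(type(sent_id_knn))
```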
### Training models
This section contains instructions for training models on CoNLL-2003 using 5% of the training data.

#### Training flair+Inter-LADA model
```shell
CUDA_VISIBLE_DEVICES=1 python ./code/flair/train.py --use-knn-train-data --num-knn-k 5 \
--knn-mix-ratio 0.6 --train-examples 749 --mix-layer 2 --mix-option --alpha 60 --beta 1.5 \
--exp-save-name 'mix' --mini-batch-size 64 --patience 10 --use-crf
```