{"id":14286866,"url":"https://github.com/SALT-NLP/LADA","last_synced_at":"2025-08-15T07:31:22.166Z","repository":{"id":68595397,"uuid":"300119725","full_name":"SALT-NLP/LADA","owner":"SALT-NLP","description":"Source codes for the paper \"Local Additivity Based Data Augmentation for Semi-supervised NER\"","archived":false,"fork":false,"pushed_at":"2022-10-15T01:58:17.000Z","size":973,"stargazers_count":44,"open_issues_count":1,"forks_count":5,"subscribers_count":6,"default_branch":"master","last_synced_at":"2024-12-16T02:34:30.431Z","etag":null,"topics":["dataaugmentation","lada","mixup","ner","semisupervised-learning"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SALT-NLP.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2020-10-01T02:44:09.000Z","updated_at":"2024-07-03T01:13:27.000Z","dependencies_parsed_at":"2023-02-21T06:45:55.322Z","dependency_job_id":null,"html_url":"https://github.com/SALT-NLP/LADA","commit_stats":null,"previous_names":["gt-salt/lada"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/SALT-NLP/LADA","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SALT-NLP%2FLADA","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SALT-NLP%2FLADA/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SALT-NLP%2FLADA/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SALT-NLP%2FLADA/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SALT-NLP","download_url":"https://codeload.github
.com/SALT-NLP/LADA/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SALT-NLP%2FLADA/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":270539525,"owners_count":24603182,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-15T02:00:12.559Z","response_time":110,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataaugmentation","lada","mixup","ner","semisupervised-learning"],"created_at":"2024-08-23T17:01:05.276Z","updated_at":"2025-08-15T07:31:21.717Z","avatar_url":"https://github.com/SALT-NLP.png","language":"Python","readme":"# LADA\nThis repo contains the code for the following paper:\n\n*Jiaao Chen\\*, Zhenghui Wang\\*, Ran Tian, Zichao Yang, Diyi Yang*: Local Additivity Based Data Augmentation for Semi-supervised NER. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020)\n\nIf you refer to this work, please cite the paper above.
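\n\nA BibTeX entry for the paper (sketched from the reference above; the entry key is arbitrary):\n\n```bibtex\n@inproceedings{chen2020local,\n
  title     = {Local Additivity Based Data Augmentation for Semi-supervised {NER}},\n  author    = {Chen, Jiaao and Wang, Zhenghui and Tian, Ran and Yang, Zichao and Yang, Diyi},\n
  booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},\n  year      = {2020}\n}\n```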
\n\n## Getting Started\nThese instructions will get you running the LADA code.\n\n### Requirements\n* Python 3.6 or higher\n* PyTorch \u003e= 1.4.0\n* pytorch_transformers (also known as transformers)\n* Pandas, NumPy, Pickle, faiss, sentence-transformers\n\n### Code Structure\n```\n├── code/\n│   ├── BERT/\n│   │   ├── back_translate.ipynb --\u003e Jupyter Notebook for back translating the dataset\n│   │   ├── bert_models.py --\u003e Code for LADA-based BERT models\n│   │   ├── eval_utils.py --\u003e Code for evaluation\n│   │   ├── knn.ipynb --\u003e Jupyter Notebook for building the kNN index file\n│   │   ├── read_data.py --\u003e Code for data pre-processing\n│   │   ├── train.py --\u003e Code for training the BERT model\n│   │   └── ...\n│   ├── flair/\n
│   │   ├── train.py --\u003e Code for training the flair model\n│   │   ├── knn.ipynb --\u003e Jupyter Notebook for building the kNN index file\n│   │   ├── flair/ --\u003e the flair library\n│   │   │   └── ...\n│   │   ├── resources/\n│   │   │   ├── docs/ --\u003e flair library docs\n│   │   │   ├── taggers/ --\u003e evaluation results for the flair model\n│   │   │   └── tasks/\n│   │   │       └── conll_03/\n│   │   │           ├── sent_id_knn_749.pkl --\u003e kNN index file\n│   │   │           └── ... --\u003e CoNLL-2003 dataset\n│   │   └── ...\n├── data/\n│   └── conll2003/\n│       ├── de.pkl --\u003e Back-translated training data with German as the intermediate language\n│       ├── labels.txt --\u003e label index file\n│       ├── sent_id_knn_700.pkl\n│       └── ... --\u003e CoNLL-2003 dataset\n├── eval/\n│   └── conll2003/ --\u003e evaluation results for the BERT model\n└── README.md\n```\n## BERT models\n\n### Downloading the data\nPlease download the CoNLL-2003 dataset and save it under `./data/conll2003/` as `train.txt`, `dev.txt`, and `test.txt`.\n\n### Pre-processing the data\n\nWe use [Fairseq](https://github.com/pytorch/fairseq) to perform back translation on the training dataset. Please refer to `./code/BERT/back_translate.ipynb` for details.\n\n
As an example, we provide one back-translated dataset, `de.pkl`, in `./data/conll2003/`. You can use it directly for CoNLL-2003 or generate your own back-translated data following `./code/BERT/back_translate.ipynb`.\n\nWe also provide the kNN index file for the first 700 training sentences (5%), `./data/conll2003/sent_id_knn_700.pkl`. You can use it directly for CoNLL-2003 or generate your own kNN index file following `./code/BERT/knn.ipynb`.\n\n### Training models\nThis section contains instructions for training models on CoNLL-2003 using 5% of the training data.\n\n#### Training BERT+Intra-LADA model\n```shell\npython ./code/BERT/train.py --data-dir 'data/conll2003' --model-type 'bert' \\\n--model-name 'bert-base-multilingual-cased' --output-dir 'eval/conll2003' --gpu '0,1' \\\n--labels 'data/conll2003/labels.txt' --max-seq-length 164 --overwrite-output-dir \\\n--do-train --do-eval --do-predict --evaluate-during-training --batch-size 16 \\\n--num-train-epochs 20 --save-steps 750 --seed 1 --train-examples 700 --eval-batch-size 128 \\\n--pad-subtoken-with-real-label --eval-pad-subtoken-with-first-subtoken-only --label-sep-cls \\\n--mix-layers-set 8 9 10 --beta 1.5 --alpha 60 --mix-option --use-knn-train-data \\\n--num-knn-k 5 --knn-mix-ratio 0.5 --intra-mix-ratio 1\n```\n
#### Training BERT+Inter-LADA model\n```shell\npython ./code/BERT/train.py --data-dir 'data/conll2003' --model-type 'bert' \\\n--model-name 'bert-base-multilingual-cased' --output-dir 'eval/conll2003' --gpu '0,1' \\\n--labels 'data/conll2003/labels.txt' --max-seq-length 164 --overwrite-output-dir \\\n--do-train --do-eval --do-predict --evaluate-during-training --batch-size 16 \\\n--num-train-epochs 20 --save-steps 750 --seed 1 --train-examples 700 --eval-batch-size 128 \\\n--pad-subtoken-with-real-label --eval-pad-subtoken-with-first-subtoken-only --label-sep-cls \\\n--mix-layers-set 8 9 10 --beta 1.5 --alpha 60 --mix-option --use-knn-train-data \\\n--num-knn-k 5 --knn-mix-ratio 0.5 --intra-mix-ratio -1\n```\n
#### Training BERT+Semi-Intra-LADA model\n```shell\npython ./code/BERT/train.py --data-dir 'data/conll2003' --model-type 'bert' \\\n--model-name 'bert-base-multilingual-cased' --output-dir 'eval/conll2003' --gpu '0,1' \\\n--labels 'data/conll2003/labels.txt' --max-seq-length 164 --overwrite-output-dir \\\n--do-train --do-eval --do-predict --evaluate-during-training --batch-size 16 \\\n--num-train-epochs 20 --save-steps 750 --seed 1 --train-examples 700 --eval-batch-size 128 \\\n--pad-subtoken-with-real-label --eval-pad-subtoken-with-first-subtoken-only --label-sep-cls \\\n--mix-layers-set 8 9 10 --beta 1.5 --alpha 60 --mix-option --use-knn-train-data \\\n--num-knn-k 5 --knn-mix-ratio 0.5 --intra-mix-ratio 1 \\\n--u-batch-size 32 --semi --T 0.6 --sharp --weight 0.05 --semi-pkl-file 'de.pkl' \\\n--semi-num 10000 --semi-loss 'mse' --ignore-last-n-label 4 --warmup-semi --num-semi-iter 1 \\\n--semi-loss-method 'origin'\n```\n
#### Training BERT+Semi-Inter-LADA model\n```shell\npython ./code/BERT/train.py --data-dir 'data/conll2003' --model-type 'bert' \\\n--model-name 'bert-base-multilingual-cased' --output-dir 'eval/conll2003' --gpu '0,1' \\\n--labels 'data/conll2003/labels.txt' --max-seq-length 164 --overwrite-output-dir \\\n--do-train --do-eval --do-predict --evaluate-during-training --batch-size 16 \\\n--num-train-epochs 20 --save-steps 750 --seed 1 --train-examples 700 --eval-batch-size 128 \\\n--pad-subtoken-with-real-label --eval-pad-subtoken-with-first-subtoken-only --label-sep-cls \\\n--mix-layers-set 8 9 10 --beta 1.5 --alpha 60 --mix-option --use-knn-train-data \\\n--num-knn-k 5 --knn-mix-ratio 0.5 --intra-mix-ratio -1 \\\n--u-batch-size 32 --semi --T 0.6 --sharp --weight 0.05 --semi-pkl-file 'de.pkl' \\\n--semi-num 10000 --semi-loss 'mse' --ignore-last-n-label 4 --warmup-semi --num-semi-iter 1 \\\n--semi-loss-method 'origin'\n```\n\n
## flair models\n\n[flair](https://github.com/flairNLP/flair) is a BiLSTM-CRF sequence labeling model; we provide code for flair+Inter-LADA.\n\n### Downloading the data\nPlease download the CoNLL-2003 dataset and save it under `./code/flair/resources/tasks/conll_03/` as `eng.train`, `eng.testa` (dev), and `eng.testb` (test).\n\n### Pre-processing the data\n\nWe also provide the kNN index file for the first 749 training sentences (5%, including the `-DOCSTART-` separator), `./code/flair/resources/tasks/conll_03/sent_id_knn_749.pkl`. You can use it directly for CoNLL-2003 or generate your own kNN index file following `./code/flair/knn.ipynb`.\n\n### Training models\nThis section contains instructions for training models on CoNLL-2003 using 5% of the training data.\n\n#### Training flair+Inter-LADA model\n```shell\nCUDA_VISIBLE_DEVICES=1 python ./code/flair/train.py --use-knn-train-data --num-knn-k 5 \\\n--knn-mix-ratio 0.6 --train-examples 749 --mix-layer 2 --mix-option --alpha 60 --beta 1.5 \\\n--exp-save-name 'mix' --mini-batch-size 64 --patience 10 --use-crf\n```\n","funding_links":[],"categories":["Beyond Vision"],"sub_categories":["**NLP**"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FSALT-NLP%2FLADA","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FSALT-NLP%2FLADA","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FSALT-NLP%2FLADA/lists"}