{"id":19646686,"url":"https://github.com/amzn/amazon-weak-ner-needle","last_synced_at":"2025-04-28T15:30:59.398Z","repository":{"id":47795683,"uuid":"382814124","full_name":"amzn/amazon-weak-ner-needle","owner":"amzn","description":"Named Entity Recognition with Small Strongly Labeled and Large Weakly Labeled Data","archived":true,"fork":false,"pushed_at":"2023-07-25T00:47:50.000Z","size":106,"stargazers_count":100,"open_issues_count":9,"forks_count":27,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-03-14T19:52:55.063Z","etag":null,"topics":["bert","biomedical","biomedical-named-entity-recognition","distant-supervision","ner","weak-supervision","weakly-supervised-learning"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit-0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/amzn.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-07-04T09:40:16.000Z","updated_at":"2025-01-17T15:50:44.000Z","dependencies_parsed_at":"2024-11-11T14:41:34.495Z","dependency_job_id":null,"html_url":"https://github.com/amzn/amazon-weak-ner-needle","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amzn%2Famazon-weak-ner-needle","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amzn%2Famazon-weak-ner-needle/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amzn%2Famazon-weak-ner-needle/releases","manifests_url":"https://repos
.ecosyste.ms/api/v1/hosts/GitHub/repositories/amzn%2Famazon-weak-ner-needle/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/amzn","download_url":"https://codeload.github.com/amzn/amazon-weak-ner-needle/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251338580,"owners_count":21573582,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert","biomedical","biomedical-named-entity-recognition","distant-supervision","ner","weak-supervision","weakly-supervised-learning"],"created_at":"2024-11-11T14:39:47.480Z","updated_at":"2025-04-28T15:30:59.000Z","avatar_url":"https://github.com/amzn.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Named Entity Recognition with Small Strongly Labeled and Large Weakly Labeled Data\n\n[arXiv](https://arxiv.org/abs/2106.08977)\n\nThis is the code base for weakly supervised NER.\n\nWe provide a three stage framework:\n- Stage I: Domain continual pre-training;\n- Stage II: Noise-aware weakly supervised pre-training;\n- Stage III: Fine-tuning.\n\nIn this code base, we actually provide basic building blocks which allow arbitrary combination of different stages. 
We also provide example scripts for reproducing our results in BioMedical NER.\n\nSee details in [arXiv](https://arxiv.org/abs/2106.08977).\n\n\n- [Weakly Supervised NER](#weakly-supervised-ner)\n  - [Performance Benchmark](#performance-benchmark)\n  - [Dependency](#dependency)\n  - [File Structure:](#file-structure)\n  - [Data](#data)\n    - [Data Format](#data-format)\n    - [Pre-processed Data](#pre-processed-data)\n  - [Usage](#usage)\n    - [Hyperparameter Explanation](#hyperparameter-explanation)\n    - [More Rounds of Training, Try Different Combination](#more-rounds-of-training-try-different-combination)\n    - [Automate Experiments](#automate-experiments)\n    - [Others](#others)\n\n\n\n## Performance Benchmark\n\n**BioMedical NER**\n\n|Method (F1) | BC5CDR-chem | BC5CDR-disease | NCBI-disease |\n|-------|-------------|----------------|--------------|\n|BERT | 89.99 | 79.92 | 85.87 |\n|bioBERT | 92.85 | 84.70 | 89.13 |\n|PubMedBERT | 93.33 | 85.62 | 87.82 |\n|**Ours** | **94.17** | **90.69** | **92.28** |\n\nSee more in [bio_script/README.md](./bio_script/README.md#performance-benchmark)\n\n\n## Dependency\n\n```\npytorch==1.6.0\ntransformers==3.3.1\nallennlp==1.1.0\nflashtool==0.0.10\nray==0.8.7\n```\n\nInstall the requirements:\n```\npip install -r requirements.txt\n```\n\n(If `allennlp` and `transformers` are incompatible, install `allennlp` first and then update `transformers`. Since we only use a few small functions from `allennlp`, it should work fine. 
)\n\n## File Structure:\n\n```\n.\n├── bert-ner          #  Python Code for Training NER models\n│   └── ...\n└── bio_script        #  Shell Scripts for Training BioMedical NER models\n    └── ...\n```\n\n## Usage\nSee examples in `bio_script`.\n\n### Hyperparameter Explanation\n\nHere we explain the hyperparameters used by the scripts in `./bio_script`.\n\n#### Training Scripts:\n**Scripts**\n- `roberta_mlm_pretrain.sh`\n- `weak_weighted_selftrain.sh`\n- `finetune.sh`\n\n**Hyperparameter**\n- `GPUID`: the GPU to use for training. It can also be specified as a script argument, e.g., `xxx.sh 0,1,2,3`.\n- `MASTER_PORT`: automatically constructed (to avoid conflicts) for distributed training.\n- `DISTRIBUTE_GPU`: whether to use distributed training.\n- `PROJECT_ROOT`: automatically detected; the root path of the project folder.\n- `DATA_DIR`: the directory of the training data, which contains `train.txt`, `test.txt`, `dev.txt`, `labels.txt`, `weak_train.txt` (weak data), and `aug_train.txt` (optional).\n- `USE_DA`: whether to augment the training data, i.e., combine `train.txt` + `aug_train.txt` in `DATA_DIR` for training.\n- `BERT_MODEL`: the model backbone, e.g., `roberta-large`. See transformers for details.\n- `BERT_CKP`: see `BERT_MODEL_PATH`.\n- `BERT_MODEL_PATH`: the path of the model checkpoint that you want to load as the initialization. Usually used with `BERT_CKP`.\n- `LOSSFUNC`: `nll` is the normal loss function; `corrected_nll` is the noise-aware risk (i.e., it adds a weighted log-unlikelihood regularization: `wei*nll + (1-wei)*null`).\n- `MAX_WEIGHT`: the maximum weight of a sample in the loss.\n- `MAX_LENGTH`: the maximum sentence length.\n- `BATCH_SIZE`: batch size per GPU.\n- `NUM_EPOCHS`: number of training epochs.\n- `LR`: learning rate.\n- `WARMUP`: learning rate warmup steps.\n- `SAVE_STEPS`: the frequency of saving models.\n- `EVAL_STEPS`: the frequency of evaluating on the validation set.\n- `SEED`: random seed.\n- `OUTPUT_DIR`: the directory for saving the model and code. 
Some parameters will be automatically appended to the path.\n  - `roberta_mlm_pretrain.sh`: It's better to manually check where you want to save the model.\n  - `finetune.sh`: The model will be saved in `${BERT_MODEL_PATH}/finetune_xxxx`.\n  - `weak_weighted_selftrain.sh`: The model will be saved in `${BERT_MODEL_PATH}/selftrain/${FBA_RULE}_xxxx` (see `FBA_RULE` below).\n\nSome additional parameters need to be set for weakly supervised learning (`weak_weighted_selftrain.sh`).\n- `WEAK_RULE`: what kind of weakly supervised data to use. See [Weakly Supervised Data Refinement Script](#weakly-supervised-data-refinement-script) for details.\n\n#### Profiling Script\n\n**Scripts**\n- `profile.sh`\n\nThe profiling script uses the same entry point as the training scripts, `bert-ner/run_ner.py`, but only runs evaluation.\n\n**Hyperparameter**\nBasically the same as for the training scripts.\n- `PROFILE_FILE`: can be `train,dev,test` or a path to a specific `txt` data file. E.g., to profile the weak data:\n  \u003e `PROFILE_FILE=weak_train_100.txt`\n  \u003e `PROFILE_FILE=$DATA_DIR/$PROFILE_FILE`\n\n- `OUTPUT_DIR`: The results will be saved in `OUTPUT_DIR=${BERT_MODEL_PATH}/predict/profile`.\n\n#### Weakly Supervised Data Refinement Script\n\n**Scripts**\n- `profile2refinedweakdata.sh`\n\n**Hyperparameter**\n- `BERT_CKP`: see `BERT_MODEL_PATH`.\n- `BERT_MODEL_PATH`: the path of the model checkpoint that you want to load as the initialization. 
Usually used with `BERT_CKP`.\n- `WEI_RULE`: rule for generating the weight of each weak sample.\n  - `uni`: all weights are 1\n  - `avgaccu`: confidence estimate for new labels generated by `all_overwrite`\n  - `avgaccu_weak_non_O_promote`: confidence estimate for new labels generated by `non_O_overwrite`\n- `PRED_RULE`: rule for generating new weak labels.\n  - `non_O_overwrite`: non-entity ('O') labels are overwritten by the prediction\n  - `all_overwrite`: all tokens use the prediction, i.e., self-training\n  - `no`: use the original weak labels\n  - `non_O_overwrite_all_overwrite_over_accu_xx`: `non_O_overwrite`, plus, if the confidence is higher than `xx`, all tokens use the prediction as new labels\n\nThe generated data will be saved in `${BERT_MODEL_PATH}/predict/weak_${PRED_RULE}-WEI_${WEI_RULE}`.\nThe `WEAK_RULE` specified in `weak_weighted_selftrain.sh` is essentially the name of the folder `weak_${PRED_RULE}-WEI_${WEI_RULE}`.\n\n\n### More Rounds of Training, Try Different Combination\n\n1. To train with weakly supervised data from any model checkpoint directory:\n  - i) Set `BERT_CKP` appropriately;\n  - ii) Create profile data, e.g., run `./bio_script/profile.sh` for the dev set and the weak set;\n  - iii) Generate data with weak labels from the profile data, e.g., run `./bio_script/profile2refinedweakdata.sh`. You can use different rules to generate weights for each sample (`WEI_RULE`) and different rules to refine weak labels (`PRED_RULE`). See more details in `./bert-ner/profile2refinedweakdata.py`;\n  - iv) Train with `./bio_script/weak_weighted_selftrain.sh`.\n\n2. To fine-tune with human-labeled data from any model checkpoint directory:\n  - i) Set `BERT_CKP` appropriately;\n  - ii) Run `./bio_script/finetune.sh`.\n\n## Reference\n\n```\n@inproceedings{Jiang2021NamedER,\n  title={Named Entity Recognition with Small Strongly Labeled and Large Weakly Labeled Data},\n  author={Haoming Jiang and Danqing Zhang and Tianyue Cao and Bing Yin and T. 
Zhao},\n  booktitle={ACL/IJCNLP},\n  year={2021}\n}\n```\n\n## Security\n\nSee [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.\n\n## License\n\nThis library is licensed under the MIT-0 License. See the LICENSE file.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Famzn%2Famazon-weak-ner-needle","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Famzn%2Famazon-weak-ner-needle","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Famzn%2Famazon-weak-ner-needle/lists"}