{"id":23725997,"url":"https://github.com/voidful/bdg","last_synced_at":"2025-09-04T02:31:23.028Z","repository":{"id":46855062,"uuid":"301120682","full_name":"voidful/BDG","owner":"voidful","description":"Code for \"A BERT-based Distractor Generation Scheme with Multi-tasking and Negative Answer Training Strategies.\"","archived":false,"fork":false,"pushed_at":"2022-02-02T14:01:15.000Z","size":2309,"stargazers_count":27,"open_issues_count":5,"forks_count":4,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-06-09T06:05:35.772Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://voidful.github.io/DG-Showcase/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/voidful.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-10-04T12:10:31.000Z","updated_at":"2025-05-14T07:56:34.000Z","dependencies_parsed_at":"2022-09-11T15:10:29.988Z","dependency_job_id":null,"html_url":"https://github.com/voidful/BDG","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/voidful/BDG","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/voidful%2FBDG","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/voidful%2FBDG/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/voidful%2FBDG/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/voidful%2FBDG/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/voidful","download_url":"https://codeload.github.com/voidful/BDG/tar.gz/refs/heads/main","sbom_url":"https://repos
.ecosyste.ms/api/v1/hosts/GitHub/repositories/voidful%2FBDG/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":273541901,"owners_count":25124056,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-04T02:00:08.968Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-31T00:18:07.826Z","updated_at":"2025-09-04T02:31:22.077Z","avatar_url":"https://github.com/voidful.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# BDG(Distractor Generation)\nCode for \"A BERT-based Distractor Generation Scheme with Multi-tasking and Negative Answer Training Strategies.\"  \n[Paper](https://www.aclweb.org/anthology/2020.findings-emnlp.393/)\n\n## V2\nUpdated result using BART. 
The BART models are available on the Hugging Face model hub.\n\n| model         | BLEU1 | BLEU2 | BLEU3 | BLEU4 | ROUGEL |\n|---------------|-------|-------|-------|-------|--------|\n| BERT DG       | 35.30 | 20.65 | 13.66 | 9.53  | 31.11  |\n| BERT DG pm    | 39.81 | 24.81 | 17.66 | 13.56 | 34.01  |\n| BERT DG an+pm | 39.52 | 24.29 | 17.28 | 13.28 | 33.40  |\n| BART DG       | 40.76 | 26.40 | 19.14 | 14.65 | 35.53  |\n| BART DG pm    | 41.85 | 27.45 | 20.47 | 16.33 | 37.15  |\n| BART DG an+pm | 40.26 | 25.86 | 18.85 | 14.65 | 35.64  |\n* higher is better\n\n| model         | Count BLEU1 \u003e 0.95 |\n|---------------|--------------------|\n| BERT DG       | 115                |\n| BERT DG pm    | 57                 |\n| BERT DG an+pm | 43                 |\n| BART DG       | 110                |\n| BART DG pm    | 60                 |\n| BART DG an+pm | 23                 |\n| Gold          | 12                 |\n* lower is better\n\n## Trained Model and Code Example\n### BART\nDistractor: https://huggingface.co/voidful/bart-distractor-generation  \nDistractor PM: https://huggingface.co/voidful/bart-distractor-generation-pm  \nDistractor AN+PM: https://huggingface.co/voidful/bart-distractor-generation-both  \n\n### BERT \nThe trained model is available in the release:  \nhttps://github.com/voidful/BDG/releases/tag/v1.0\n\nColab notebook for using the pretrained model:  \nhttps://colab.research.google.com/drive/1yA3Rex9JHKJmc52E3YdsBQ4eQ_R6kEZB?usp=sharing\n\n## Citation\n\nIf you make use of the code in this repository, please cite the following paper:\n\n    @inproceedings{chung-etal-2020-BERT,\n    title = \"A {BERT}-based Distractor Generation Scheme with Multi-tasking and Negative Answer Training Strategies.\",\n    author = \"Chung, Ho-Lam  and\n      Chan, Ying-Hong  and\n      Fan, Yao-Chung\",\n    booktitle = \"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings\",\n    month = nov,\n    year = \"2020\",\n    address = \"Online\",\n   
 publisher = \"Association for Computational Linguistics\",\n    url = \"https://www.aclweb.org/anthology/2020.findings-emnlp.393\",\n    pages = \"4390--4400\",\n    abstract = \"In this paper, we investigate the following two limitations for the existing distractor generation (DG) methods. First, the quality of the existing DG methods are still far from practical use. There are still room for DG quality improvement. Second, the existing DG designs are mainly for single distractor generation. However, for practical MCQ preparation, multiple distractors are desired. Aiming at these goals, in this paper, we present a new distractor generation scheme with multi-tasking and negative answer training strategies for effectively generating \\textit{multiple} distractors. The experimental results show that (1) our model advances the state-of-the-art result from 28.65 to 39.81 (BLEU 1 score) and (2) the generated multiple distractors are diverse and shows strong distracting power for multiple choice question.\",\n    }\n\n\n## Environment Setup\n```bash\npip install -r requirement.txt\n```\n\n## Data Preprocessing   \nThe scripts are inside the `data_preprocessing` folder.  \nDownload the dataset [here](https://github.com/Yifan-Gao/Distractor-Generation-RACE) and put it into the `distractor` folder.    \nRun `convert_data.py` to preprocess the data.  \nRun `dataset_stat.py` for dataset statistics.  
\n\n## Train Distractor Generator\n### BART\nusing tfkit==0.7.0 and transformers==4.4.2  \n```bash\ntfkit-train --savedir ./race_cqa_gen_d_bart/ --train ./race_train_updated_cqa_dsep_a_bart.csv --test ./race_test_updated_cqa_dsep_a_bart.csv --model seq2seq  --config facebook/bart-base --batch 9 --epoch 10 --grad_accum 2 --no_eval;\ntfkit-train --savedir ./race_cqa_gen_d_bart_pm/ --train ./race_train_updated_cqa_dsep_a_bart.csv --test ./race_test_updated_cqa_dsep_a_bart.csv --model seq2seq  --config facebook/bart-base --batch 9 --epoch 10 --grad_accum 2 --no_eval --likelihood pos;\ntfkit-train --savedir ./race_cqa_gen_d_bart_both/ --train ./race_train_updated_cqa_dsep_a_bart.csv --test ./race_test_updated_cqa_dsep_a_bart.csv --model seq2seq  --config facebook/bart-base --batch 9 --epoch 10 --grad_accum 2 --no_eval --likelihood both;\n```\n\n### BERT\nusing environment from `requirement.txt`   \nrun the following in main dir:  \n### Train BDG Model\n```bash\ntfkit-train --maxlen 512 --savedir ./race_cqa_gen_d/ --train ./data_preprocessing/processed_data/race_train_updated_cqa_dsep_a.csv --test ./data_preprocessing/processed_data/race_test_updated_cqa_dsep_a.csv --model onebyone --tensorboard  --config bert-base-cased --batch 30 --epoch 6;\n```\n### Train BDG AN model\n```bash\ntfkit-train --maxlen 512 --savedir ./race_cqa_gen_d_an/ --train ./data_preprocessing/processed_data/race_train_updated_cqa_dsep_a.csv --test ./data_preprocessing/processed_data/race_test_updated_cqa_dsep_a.csv --model onebyone-neg --tensorboard  --config bert-base-cased --batch 30 --epoch 6;\n```\n### Train BDG PM model\n```bash\ntfkit-train --maxlen 512 --savedir ./race_cqa_gen_d_pm/ --train ./data_preprocessing/processed_data/race_train_updated_cqa_dsep_a.csv --test ./data_preprocessing/processed_data/race_test_updated_cqa_dsep_a.csv --model onebyone-pos --tensorboard  --config bert-base-cased --batch 30 --epoch 6;\n```\n### Train BDG AN+PM model\n```bash\ntfkit-train --maxlen 512 --savedir 
./race_cqa_gen_d_both/ --train ./data_preprocessing/processed_data/race_train_updated_cqa_dsep_a.csv --test ./data_preprocessing/processed_data/race_test_updated_cqa_dsep_a.csv --model onebyone-both --tensorboard  --config bert-base-cased --batch 30 --epoch 6;\n```\n### Eval generator   \n```bash\ntfkit-eval --model model_path --valid ./data_preprocessing/processed_data/race_test_updated_cqa_dall.csv --metric nlg\n```\n\n## Distractor Analysis\nInside `distractor analysis` folder\n-  `preprocess_model_result.py` preprocesses model results and computes statistics.\n-  `normalize_jsonl_file.py` merges results from different models that share the same question and context.\n-  `create_rank_dataset.py` prepares data for Entropy Maximization.\n\n## RACE MRC\n### Preparation\n```bash\ngit clone https://github.com/huggingface/transformers\n# then copy our transformer file into the cloned huggingface/transformers repository\n```\n\n### Training Multiple Choice Question Answering Model\nBased on the script `run_multiple_choice.py`.  \nDownload the RACE data, then train:  \n```bash\n# trained on 4 Tesla V100 (16GB) GPUs\nexport RACE_DIR=../RACE\npython ./examples/run_multiple_choice.py \\\n--model_type roberta \\\n--task_name race \\\n--model_name_or_path roberta-base-openai-detector  \\\n--do_train  \\\n--do_eval \\\n--data_dir $RACE_DIR \\\n--learning_rate 1e-5 \\\n--num_train_epochs 10 \\\n--max_seq_length 512 \\\n--output_dir ./roberta-base-openai-race \\\n--per_gpu_eval_batch_size=9 \\\n--per_gpu_train_batch_size=9 \\\n--gradient_accumulation_steps 2 \\\n--save_steps 5000 \\\n--eval_all_checkpoints \\\n--seed 77 \n```\n\n### Eval QA \u0026 Get entropy ensemble result\n```bash\nexport RACE_DIR=../multi_dist_normalized_jsonl/xxx.jsonl\npython ./examples/run_multiple_choice.py \\\n--model_type roberta \\\n--task_name race \\\n--model_name_or_path ../roberta-base-openai-race/  \\\n--do_test \\\n--data_dir $RACE_DIR \\\n--max_seq_length 512 \\\n--per_gpu_eval_batch_size=3 \\\n--output_dir ./race_test_result 
\\\n--overwrite_cache\n```\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvoidful%2Fbdg","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvoidful%2Fbdg","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvoidful%2Fbdg/lists"}