{"id":22015379,"url":"https://github.com/luohongyin/rgx","last_synced_at":"2025-05-07T00:46:22.577Z","repository":{"id":40995755,"uuid":"508088725","full_name":"luohongyin/RGX","owner":"luohongyin","description":"Synthetic QA generation for long documents.","archived":false,"fork":false,"pushed_at":"2022-07-22T02:06:45.000Z","size":82,"stargazers_count":14,"open_issues_count":1,"forks_count":2,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-31T04:41:13.826Z","etag":null,"topics":["language-generation","language-model","natural-language-processing","nlp","question-answering","self-training"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/luohongyin.png","metadata":{"files":{"readme":"Readme.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-06-27T23:15:56.000Z","updated_at":"2024-09-19T05:31:31.000Z","dependencies_parsed_at":"2022-09-20T20:01:26.735Z","dependency_job_id":null,"html_url":"https://github.com/luohongyin/RGX","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luohongyin%2FRGX","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luohongyin%2FRGX/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luohongyin%2FRGX/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luohongyin%2FRGX/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/luohongyin","download_url":"https://codeload.github.com/luohongyin/RGX/tar.gz/refs/heads/main","host":{"name":"GitHub",
"url":"https://github.com","kind":"github","repositories_count":252793565,"owners_count":21805054,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["language-generation","language-model","natural-language-processing","nlp","question-answering","self-training"],"created_at":"2024-11-30T04:21:35.619Z","updated_at":"2025-05-07T00:46:22.556Z","avatar_url":"https://github.com/luohongyin.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# RGX: question-answer generation for documents\n\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/cooperative-learning-of-zero-shot-machine/question-answering-on-mrqa-out-of-domain)](https://paperswithcode.com/sota/question-answering-on-mrqa-out-of-domain?p=cooperative-learning-of-zero-shot-machine)\n\nThis repo contains the software developed for the paper,\n\n[Cooperative Self-training for Machine Reading Comprehension](https://arxiv.org/pdf/2103.07449.pdf), Luo H., Li S.-W., Gao M., Yu S., Glass J., NAACL 2022.\n\nTry our [live demo](http://teddy.csail.mit.edu:8080/) with medium-length passages (long document version coming soon).\n\n# Dependency\nWe run this software using the following packages,\n- Python 3.8.13\n- NLTK 3.7\n- Stanza 1.4.0\n- PyTorch 1.11.0 + cu113\n- Transformers 4.19.2\n- Datasets 2.3.2\n\nThe pretrained models are available via this [Google Drive Link](https://drive.google.com/drive/folders/1pREUVN9FSL6RwamhkQBDb5_3oED3Qjlq?usp=sharing). 
Please download the models and move them under the `model_file/` directory if your machine cannot download the models from the Huggingface hub.\n- `model_file/ext_sqv2.pt`: Pretrained ELECTRA-large question answering model on SQuAD v2.0.\n- `model_file/ques_gen_squad.pt`: Pretrained BART-large question generation model on SQuAD v2.0.\n- `model_file/electra-tokenize.pt`: ELECTRA-large tokenizer provided by Huggingface.\n- `model_file/bart-tokenizer.pt`: BART-large tokenizer provided by Huggingface.\n\n# Quick (?) Start\nGenerate question-answer pairs on the example SQuAD passages we provide at `data/squad/doc_data_0.json` by running the following command,\n\n```\npython rgx_doc.py \\\n    --dataset_name squad \\\n    --data_split 0 \\\n    --output_dir tmp/rgx \\\n    --version_2_with_negative\n```\n\nThe generated data will be stored under `data_gen/squad`, including `rgx_0.json` and `qa_train_corpus_0.json`. The `--data_split` option (`$DATA_SPLIT`) supports distributed data generation, for example with Slurm. To generate QA pairs with a single process, simply use `--data_split 0`.\n\n## Data \u0026 File Locations\nAll data are stored in the `data/` and `data_gen/` directories.\n- `data/{$DATASET_NAME}/doc_data_{$DATA_SPLIT}.json`: unlabeled documents of the target dataset.\n- `data_gen/{$DATASET_NAME}/rgx_{$DATA_SPLIT}.json`: generated QA data aligned with each document from the corresponding dataset.\n- `data_gen/{$DATASET_NAME}/qa_train_corpus_{$DATA_SPLIT}.json`: generated QA training set of the given dataset. 
The training examples follow the SQuAD data format and are randomly shuffled.\n\n## Data Format\n- The format of the input file, `doc_data_{$DATA_SPLIT}.json`, is a list of dictionaries as\n```\n[\n    {\"context\": INPUT_DOC_TXT_0},\n    {\"context\": INPUT_DOC_TXT_1},\n    ...,\n    {\"context\": INPUT_DOC_TXT_N},\n]\n```\n- The format of the output file, `qa_train_corpus_{$DATA_SPLIT}.json`, is a list of dictionaries as\n```\n[\n    {\n        \"context\": INPUT_DOC_TXT_0,\n        \"question\": GEN_QUESTION_TXT_0,\n        \"answers\": {\n            \"text\": [ ANSWER_TXT ], # only one answer per question\n            \"answer_start\": [ ANSWER_ST_CHAR ]\n            # index of the starting character of the answer txt\n        }\n    },\n    {\n        ...\n    },\n]\n```\n- The format of the output file, `rgx_{$DATA_SPLIT}.json`, is a list of document-QA mappings,\n```\n[\n    [\n        $DOCUMENT_i,\n        $ANS2ITEM_LIST_i,\n        $GEN_QA_LIST_i\n    ],\n    ...\n]\n```\n`$DOCUMENT_i` has the same format as the input file. `$ANS2ITEM_LIST_i` is the metadata of all recognized answers and generated questions. Note that one answer can have multiple questions, and the questions can be either correct or not. The final output of the model is `$GEN_QA_LIST_i`, which is a list of dictionaries of generated QA pairs based on the input document,\n```\n[\n    {\n        \"question\": GEN_QUESTION_TXT_0,\n        \"answers\": {\n            \"text\": [ ANSWER_TXT ],\n            \"answer_start\": [ ANSWER_ST_CHAR ]\n        }\n    }\n]\n```\n\n# QA Generation for Your Documents\n- Run the following command, or manually create directories under the `data/` and `data_gen/` directories,\n```\nbash new_dataset.sh $NEW_DATASET_NAME\n```\n- Move the input file containing the target documents to `data/$NEW_DATASET_NAME/doc_data_0.json`. 
The format is described in the previous section.\n\n- Run the following command,\n```\npython rgx_doc.py \\\n    --dataset_name $NEW_DATASET_NAME \\\n    --data_split 0 \\\n    --output_dir tmp/rgx \\\n    --version_2_with_negative\n```\n\nThe generated files will be stored at `data_gen/{$NEW_DATASET_NAME}/`.\n\n# Fine-tuning QA Models with Synthetic Data\nWe suggest two approaches for QA fine-tuning with the generated QA pairs.\n- Secondary pretraining: Fine-tune the QA model on the synthetic corpus, then fine-tune it on SQuAD. The model can be evaluated on different domains.\n- Model mixing: Fine-tune two models, one on the generated corpus and one on SQuAD, and average all weights of the two models using the `mix_model.py` script with\n```\npython mix_model.py $MIX_RATE $SQUAD_MODEL_PATH $RGX_MODEL_PATH\n```\nfor example,\n```\npython mix_model.py 0.5 model_ft_file/ext_sq.pt model_ft_file/ext_rgx.pt\n```\nThe output model will be stored as `model_ft_file/ext_mixed.pt`.\n\n# Contact and Citation\n\nPlease contact the first author, Hongyin Luo (hyluo at mit dot edu), if you have any questions. If our system is applied in your work, please cite our paper\n\n```\n@article{luo2021cooperative,\n  title={Cooperative self-training machine reading comprehension},\n  author={Luo, Hongyin and Li, Shang-Wen and Gao, Mingye and Yu, Seunghak and Glass, James},\n  journal={arXiv preprint arXiv:2103.07449},\n  year={2021}\n}\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fluohongyin%2Frgx","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fluohongyin%2Frgx","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fluohongyin%2Frgx/lists"}