{"id":19932264,"url":"https://github.com/amazon-science/sc2qa-dril","last_synced_at":"2025-08-02T18:07:36.853Z","repository":{"id":139012892,"uuid":"475716300","full_name":"amazon-science/sc2qa-dril","owner":"amazon-science","description":"Code for Generating Self-Contained and Summary-Centric Question Answer Pairs via Differentiable Reward Imitation Learning, EMNLP 2021","archived":false,"fork":false,"pushed_at":"2023-07-26T02:32:56.000Z","size":223,"stargazers_count":4,"open_issues_count":2,"forks_count":1,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-04-02T09:11:13.603Z","etag":null,"topics":["answer-generation","imitation-learning","natural-language-generation","question-answer-generation","question-answering","question-generation"],"latest_commit_sha":null,"homepage":"https://arxiv.org/pdf/2109.04689.pdf","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/amazon-science.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-03-30T04:18:59.000Z","updated_at":"2024-08-12T20:21:59.000Z","dependencies_parsed_at":null,"dependency_job_id":"04f83efa-4fb8-482a-968a-fe109ec76a17","html_url":"https://github.com/amazon-science/sc2qa-dril","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/amazon-science/sc2qa-dril","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amazon-science%2Fsc2qa-dril","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amazon-science%2Fsc2qa-dril/tags","releases_u
rl":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amazon-science%2Fsc2qa-dril/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amazon-science%2Fsc2qa-dril/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/amazon-science","download_url":"https://codeload.github.com/amazon-science/sc2qa-dril/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amazon-science%2Fsc2qa-dril/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":268431637,"owners_count":24249413,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-02T02:00:12.353Z","response_time":74,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["answer-generation","imitation-learning","natural-language-generation","question-answer-generation","question-answering","question-generation"],"created_at":"2024-11-12T23:09:31.624Z","updated_at":"2025-08-02T18:07:36.843Z","avatar_url":"https://github.com/amazon-science.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Generating Self-Contained and Summary-Centric Question Answer Pairs via Differentiable Reward Imitation Learning\n\nThis repository contains code for our [EMNLP 2021](https://aclanthology.org/2021.emnlp-main.416/) paper: \n[Generating Self-Contained and Summary-Centric Question Answer Pairs via 
Differentiable Reward Imitation Learning](https://arxiv.org/pdf/2109.04689.pdf). Li Zhou, Kevin Small, Yong Zhang, Sandeep Atluri.\n\n## Table of Contents\n- [(SC)^2QA Dataset](#sc2qa-dataset)\n  - [Install dependencies](#install-dependencies)\n  - [Step 1 Collect Question-Article Pairs](#step-1-collect-question-article-pairs)\n  - [Step 2 Collect {Question, Article, Summary, Length Constraint} 4-Tuples](#step-2-collect-question-article-summary-length-constraint-4-tuples-as-training-and-validation-set)\n  - [Step 3 Collect Articles as Test Set (Optional)](#step-3-collect-articles-as-test-set-optional)\n- [D-S-DRIL Model](#d-s-dril-model)\n  - [Install dependencies](#install-dependencies-1)\n  - [Train an Answer Generation Model using DRIL](#train-an-answer-generation-model-using-dril)\n  - [Train a Question Generation Model](#train-a-question-generation-model)\n  - [Inference](#inference)\n- [How to Cite](#how-to-cite)\n\n## (SC)^2QA Dataset\nWe provide code and scripts to construct the public version of the (SC)^2QA dataset from [commoncrawl's news data stream](https://commoncrawl.org/2016/10/news-dataset-available/). Compared with the internal version of (SC)^2QA used in our paper, this public version is larger, including 50,441 question-article pairs and 65,332 {question, article, summary, length constraint} 4-tuples. We also provide an even larger question-article pairs dataset with 529,039 pairs.\n\nFor your convenience, we also provide the constructed dataset, which you can load directly with Hugging Face's load_dataset API.\n\n```python\n#!pip install datasets\nfrom datasets import load_dataset\nsc2qa_dataset = load_dataset(\"sc2qa/sc2qa_commoncrawl\")\n```\n\nThis will load {Question, Article, Summary, Length Constraint} 4-tuples. However, if you only want to use question-article pairs as a dataset (i.e. 
the output of step 1 below), you can do\n```python\n#!pip install datasets\nfrom datasets import load_dataset\nsc2q_dataset = load_dataset(\"sc2qa/sc2q_commoncrawl\")\n```\n\nWe also provide an even larger question-article pairs dataset (without summaries), which includes articles from an expanded domain list.\n```python\n#!pip install datasets\nfrom datasets import load_dataset\nsc2q_dataset_large = load_dataset(\"sc2qa/sc2q_commoncrawl_large\")\n```\nYou can skip the remainder of this section if you use the load_dataset API above.\n\nThe following are the steps to construct the dataset. Appendix A of our paper describes each step in detail.\n### Install dependencies\n```bash\npip3 install -r requirements.txt\n```\n### Step 1 Collect Question-Article Pairs\n\n```bash\ncd CommonCrawl_Question_Mining\nbash collect_qa.sh\n```\n\nThis script will download WARC files from 2019/01 to 2021/09 from commoncrawl's news data stream, and then filter out news articles based on a set of rules we defined in CommonCrawl_Question_Mining/collect_qa_step2.py.\n\n### Step 2 Collect {Question, Article, Summary, Length Constraint} 4-Tuples as Training and Validation Set\n\n```bash\nbash collect_qasl.sh\n```\n\nThis script will call BART, PEGASUS, and CTRLSum models to generate length-constrained summaries for the articles from step 1, and then use our pre-trained question-answering model to filter out summaries that are likely incorrect answers.\n\nIn total, we have 65,332 {Question, Article, Summary, Length Constraint} 4-tuples. We use the first 57,332 4-tuples as the training set and the last 8,000 4-tuples as the validation set. \n\n### Step 3 Collect Articles as Test Set (Optional)\n\n```bash\nbash collect_test_set_articles.sh\n```\n\nThis script will randomly sample 10,000 news articles in 2021/09 from commoncrawl's news data stream. 
There are no ground-truth questions for these articles.\n\n## D-S-DRIL Model\n\n### Install dependencies\n```bash\npip3 install -r requirements.txt\n```\n\n### Train an Answer Generation Model using DRIL\n\n```bash\ncd D-S-DRIL\nbash scripts/train_ag.sh\n```\n\nTrain an answer generation model using the DRIL method proposed in the paper. The model samples summaries (answers) during training and calculates gradients based on the question reconstruction loss.\n\nWe use Amazon EC2 p3dn.24xlarge GPU instances for training. Each instance has 8 GPUs with 32GB of memory each. Training takes about 8 hours. If you encounter a GPU out-of-memory issue, consider reducing the batch size.\n\nDuring training, the model checkpoints will be saved to `ag_model_output/checkpoint-*`. Each checkpoint folder has a `trainer_state.json` file showing the current best model checkpoint. At the end of training, the best model will be saved to `ag_model_output/`. However, you may encounter a GPU out-of-memory issue when saving the best model. 
If so, you can manually copy the best model checkpoint from `ag_model_output/checkpoint-*` to `ag_model_output/` and then copy the model configuration file from `scripts/model_config/config.json` to `ag_model_output/config.json`.\n### Train a Question Generation Model\n\n```bash\nbash scripts/train_qg.sh\n```\n\nTrain a question generation model that generates questions based on summaries of articles.\n\n### Inference\n\n```bash\nbash scripts/inference.sh [validation|test]\n```\nGenerate question-answer pairs for articles in the validation set or the test set.\n\n## How to Cite\nIf you find this repository useful, please cite the following paper.\n```\n@inproceedings{zhou-etal-2021-generating,\n    title = \"Generating Self-Contained and Summary-Centric Question Answer Pairs via Differentiable Reward Imitation Learning\",\n    author = \"Zhou, Li and Small, Kevin and Zhang, Yong and Atluri, Sandeep\",\n    booktitle = \"Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)\",\n    year = \"2021\",\n    pages = \"5103--5135\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://aclanthology.org/2021.emnlp-main.416\",\n}\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Famazon-science%2Fsc2qa-dril","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Famazon-science%2Fsc2qa-dril","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Famazon-science%2Fsc2qa-dril/lists"}