{"id":16283384,"url":"https://github.com/liamdugan/summary-qg","last_synced_at":"2025-03-20T02:30:44.460Z","repository":{"id":42424378,"uuid":"468516551","full_name":"liamdugan/summary-qg","owner":"liamdugan","description":"Code for the ACL 2022 Paper \"A Feasibility Study of Answer-Agnostic Question Generation for Education\"","archived":false,"fork":false,"pushed_at":"2022-07-05T18:31:49.000Z","size":492,"stargazers_count":17,"open_issues_count":0,"forks_count":6,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-02-28T22:35:38.167Z","etag":null,"topics":["natural-language-processing","nlp","question-answer-generation","question-answering","question-generation"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2203.08685","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/liamdugan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-03-10T21:30:47.000Z","updated_at":"2024-08-20T06:26:29.000Z","dependencies_parsed_at":"2022-09-16T08:20:13.568Z","dependency_job_id":null,"html_url":"https://github.com/liamdugan/summary-qg","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/liamdugan%2Fsummary-qg","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/liamdugan%2Fsummary-qg/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/liamdugan%2Fsummary-qg/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/liamdugan%2Fsummary-qg/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/h
osts/GitHub/owners/liamdugan","download_url":"https://codeload.github.com/liamdugan/summary-qg/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244041306,"owners_count":20388213,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["natural-language-processing","nlp","question-answer-generation","question-answering","question-generation"],"created_at":"2024-10-10T19:13:19.963Z","updated_at":"2025-03-20T02:30:44.049Z","avatar_url":"https://github.com/liamdugan.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Joint Summarization \u0026 Question Generation\n![/data/media/demo.gif](/data/media/demo.gif)\n\nThis repository contains the code for the ACL 2022 paper \"A Feasibility Study of Answer-Agnostic Question Generation for Education\". In our paper we show that running QG on summarized text results in higher quality questions.\n\n## Installation\n\nConda:\n```\nconda create -n sumqg_env python=3.9.7\nconda activate sumqg_env\npip install -r requirements.txt\npython -m nltk.downloader punkt\n```\nvenv:\n```\npython -m venv env\nsource env/bin/activate\npip install -r requirements.txt\npython -m nltk.downloader punkt\n```\n\n## Usage\n\nTo run QG on user input or a file, use `run_qg.py`. Add the `-s` flag to include automatic summarization in the pipeline before running QG (for use on longer inputs only). Add the `-f` flag to use the smaller and faster distilled versions of the models. 
The full options are listed below.\n```\n$ python run_qg.py -h\n  -s, --use_summary     Include summarization pre-processing\n  -f, --fast            Use the smaller and faster versions of the models\n  -i, --infile          The name of the text file to generate questions from.\n                        If no file is given, questions are generated on user input\n```\n\nExample (User Input):\n```\n$ python run_qg.py\n\u003eThe answer to life is 42. The answer to most other questions is unknowable.\n{'answer': '42', 'question': 'What is the answer to life?'}\n{'answer': 'unknowable', 'question': 'What is the answer to most other questions?'}\n```\n\nExample (File Input):\n```\n$ python run_qg.py -s -i data/text/slp_ch2.txt\n\nSummary: The dialogue above is from ELIZA, an early natural language \u003c...\u003e\n\n{'answer': 'Eliza', 'question': \"Who's mimicry of human conversation was remarkably successful?\"}\n{'answer': 'restaurants', 'question': 'Modern conversational agents can answer questions, book flights, or find what?'}\n{'answer': 'Regular expressions', 'question': 'What can be used to specify strings we might want to extract from a document?'}\n...\n```\n\nThese scripts will default to using GPU if it is available. It is highly recommended (but not required) to have access to a CUDA-capable GPU when running these models. They are quite large and take a long time to run on CPU.\n\n## Reproduction\n\nTo reproduce the results from the paper, use `reproduction/run_experiments.py`. This script will generate a file named `out.csv` that contains questions from all three sources (Automatic Summary, Original Text, Human Summary) separated by chapter subsection. 
If using the full-size models, this should take about 5-10 minutes on GPU.\n```\n$ python run_experiments.py -h\n  -s, --use_summary  Run automatic summarization rather than reading in\n                     automatic summary data from a file\n  -f, --fast         Use the smaller and faster versions of the models\n```\n\nFor example, this command will run the full QG model on all sources:\n```\n$ cd reproduction\n$ python run_experiments.py -s\n```\n\nTo reproduce the coverage analysis, use `reproduction/coverage.py`. This script will print out the percentage of bolded key terms from the textbook that appear in the question-answer pairs of a given input CSV file, broken down by textual source.\n```\n$ python coverage.py \u003ckeyword_file\u003e \u003cdata_file\u003e\n```\n\nFor example, this command will run a coverage analysis on the data included in the paper. You may also choose to set `data_file` to the `out.csv` file to verify the coverage of your generated questions.\n```\n$ python coverage.py ../data/keywords/keywords.csv ../data/questions/questions.csv\n```\n\nFinally, to reproduce our analysis of the collected annotations, use `reproduction/analyze_annotations.py`. This script will print out pairwise IAA and per-annotator statistics (Table 3) for each annotation question as well as a breakdown across chapters (Table 5). It will also output the plot used in Figure 3 as `summaries.pdf`.\n```\n$ python analyze_annotations.py\n```\n\n## Model Details\n\nThe QG models used and the inference code to run them come from [Suraj Patil's amazing question_generation repository](https://github.com/patil-suraj/question_generation). Many thanks to him for sharing his great work with the academic community. Please see our paper for more details about the training and model inference.\n\nBelow are the evaluation results for the `t5-base` and `t5-small` models on the SQuAD1.0 dev set. For decoding, beam search with `num_beams` set to 4 and a maximum decoding length of 32 was used. 
The [nlg-eval](https://github.com/Maluuba/nlg-eval) package was used to calculate the metrics.\n\n| Name                                                                       | BLEU-4  | METEOR  | ROUGE-L | QA-EM  | QA-F1  |\n|----------------------------------------------------------------------------|---------|---------|---------|--------|--------|\n| [t5-base-qa-qg-hl](https://huggingface.co/valhalla/t5-base-qa-qg-hl)       | 21.0141 | 26.9113 | 43.2484 | 82.46  | 90.272 |\n| [t5-small-qa-qg-hl](https://huggingface.co/valhalla/t5-small-qa-qg-hl)     | 18.9872 | 25.2217 | 40.7893 | 76.121 | 84.904 |\n\n\u003cbr/\u003eBelow are the evaluation results for the `bart-large` and `distilbart` models on the CNN/DailyMail test set.\n\n| Name                                                                       | ROUGE-2  | ROUGE-L |\n|----------------------------------------------------------------------------|---------|---------|\n| [facebook/bart-large-cnn](https://huggingface.co/facebook/bart-large-cnn)       | 21.06 | 30.63 |\n| [sshleifer/distilbart-cnn-6-6](https://huggingface.co/sshleifer/distilbart-cnn-6-6)     | 20.17 | 29.70 |\n\n## Citation\nIf you use our code or findings in your research, please cite us as:\n```\n@inproceedings{dugan-etal-2022-feasibility,\n    title = \"A Feasibility Study of Answer-Agnostic Question Generation for Education\",\n    author = \"Dugan, Liam  and\n      Miltsakaki, Eleni  and\n      Upadhyay, Shriyash  and\n      Ginsberg, Etan  and\n      Gonzalez, Hannah  and\n      Choi, DaHyeon  and\n      Yuan, Chuning  and\n      Callison-Burch, Chris\",\n    booktitle = \"Findings of the Association for Computational Linguistics: ACL 2022\",\n    month = may,\n    year = \"2022\",\n    address = \"Dublin, Ireland\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://aclanthology.org/2022.findings-acl.151\",\n    doi = \"10.18653/v1/2022.findings-acl.151\",\n    pages = 
\"1919--1926\",\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fliamdugan%2Fsummary-qg","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fliamdugan%2Fsummary-qg","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fliamdugan%2Fsummary-qg/lists"}