{"id":13709618,"url":"https://github.com/AMontgomerie/question_generator","last_synced_at":"2025-05-06T16:32:09.706Z","repository":{"id":37710535,"uuid":"281276564","full_name":"AMontgomerie/question_generator","owner":"AMontgomerie","description":"An NLP system for generating reading comprehension questions","archived":true,"fork":false,"pushed_at":"2024-02-06T22:18:21.000Z","size":111,"stargazers_count":269,"open_issues_count":13,"forks_count":72,"subscribers_count":6,"default_branch":"master","last_synced_at":"2024-08-03T23:16:11.640Z","etag":null,"topics":["bert","natural-language-generation","natural-language-processing","nlg","nlp","question-generation","t5","transformers"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AMontgomerie.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-07-21T02:37:01.000Z","updated_at":"2024-07-27T16:26:38.000Z","dependencies_parsed_at":"2022-07-12T15:18:05.264Z","dependency_job_id":null,"html_url":"https://github.com/AMontgomerie/question_generator","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AMontgomerie%2Fquestion_generator","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AMontgomerie%2Fquestion_generator/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AMontgomerie%2Fquestion_generator/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AMontgomerie%2Fquestion_generator/manifests","owner_url":"https://repo
s.ecosyste.ms/api/v1/hosts/GitHub/owners/AMontgomerie","download_url":"https://codeload.github.com/AMontgomerie/question_generator/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224513460,"owners_count":17323812,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert","natural-language-generation","natural-language-processing","nlg","nlp","question-generation","t5","transformers"],"created_at":"2024-08-02T23:00:42.541Z","updated_at":"2024-11-13T19:32:00.106Z","avatar_url":"https://github.com/AMontgomerie.png","language":"Python","funding_links":[],"categories":["Repository \u0026 Toolkit"],"sub_categories":[],"readme":"# question_generator\n\nQuestion Generator is an NLP system for generating reading comprehension-style questions from texts such as news articles or excerpts from books. The system is built using pretrained models from [HuggingFace Transformers](https://github.com/huggingface/transformers). There are two models: the question generator itself, and the QA evaluator, which ranks and filters the question-answer pairs based on their acceptability.\n\n## Update 2021/11/29\n\n### Updated training scripts\n\nThe training notebooks have been updated with training scripts. To run:\n\n```bash\npython question_generator/training/qg_train.py\n```\n\n```bash\npython question_generator/training/qa_eval_train.py\n```\n\nHyperparameters can be changed using command-line arguments. 
See the scripts for the list of available arguments.\n\n### Datasets uploaded to Huggingface Hub\n\nThe datasets have been uploaded to the Huggingface Hub:\n\n- [question generator training and validation data](https://huggingface.co/datasets/iarfmoose/question_generator)\n- [qa evaluator training and validation data](https://huggingface.co/datasets/iarfmoose/qa_evaluator)\n\n## Usage\n\nThe easiest way to generate some questions is to clone the GitHub repo and then run `run_qg.py` like this:\n\n```\ngit clone https://github.com/amontgomerie/question_generator\ncd question_generator\npip install -r requirements.txt -qq\npython run_qg.py --text_file articles/twitter_hack.txt\n```\n\nThis will generate 10 question-answer pairs of mixed style (full-sentence and multiple choice) based on the article specified in `--text_file` and print them to the console. For more information see the qg_commandline_example notebook.\n\nThe `QuestionGenerator` class can also be instantiated and used like this:\n\n```python\nfrom questiongenerator import QuestionGenerator\nqg = QuestionGenerator()\nqg.generate(text, num_questions=10)\n```\n\nThis will generate 10 questions of mixed style and return a list of dictionaries containing question-answer pairs. In the case of multiple choice questions, the answer will contain a list of dictionaries containing the answers and a boolean value stating if the answer is correct or not. The output can be easily printed using the `print_qa()` function. For more information see the question_generation_example notebook.\n\n### Choosing the number of questions\n\nThe desired number of questions can be passed as a command line argument using `--num_questions` or as an argument when calling `qg.generate(text, num_questions=20)`. If the chosen number of questions is too large, then the model may not be able to generate enough. 
The maximum number of questions will depend on the length of the input text, or more specifically the number of sentences and named entities contained within the text. Note that the quality of some of the outputs will decrease for larger numbers of questions, as the QA Evaluator ranks generated questions and returns the best ones.\n\n### Answer styles\n\nThe system can generate questions with full-sentence answers (`'sentences'`), questions with multiple-choice answers (`'multiple_choice'`), or a mix of both (`'all'`). This can be selected using the `--answer_style` or `qg.generate(answer_style=\u003cstyle\u003e)` arguments.\n\n## Models\n\n### Question Generator\n\nThe question generator model takes a text as input and outputs a series of question and answer pairs. The answers are sentences and phrases extracted from the input text. The extracted phrases can be either full sentences or named entities extracted using [spaCy](https://spacy.io/). Named entities are used for multiple-choice answers. The wrong answers will be other entities of the same type found in the text. The questions are generated by concatenating the extracted answer with the full text (up to a maximum of 512 tokens) as context in the following format:\n\n```\nanswer_token \u003cextracted answer\u003e context_token \u003ccontext\u003e\n```\n\nThe concatenated string is then encoded and fed into the question generator model. The model architecture is `t5-base`. The pretrained model was finetuned as a sequence-to-sequence model on a dataset made up of several well-known QA datasets ([SQuAD](https://rajpurkar.github.io/SQuAD-explorer/), [RACE](http://www.cs.cmu.edu/~glai1/data/race/), [CoQA](https://stanfordnlp.github.io/coqa/), and [MSMARCO](https://microsoft.github.io/msmarco/)). The datasets were restructured by concatenating the answer and context fields into the previously mentioned format. 
The concatenated answer and context were then used as the input for training, and the question field became the target.\n\nThe datasets can be found [here](https://drive.google.com/drive/folders/1JtliZ5FyCmczc7e-iJXUoRplKVaWql8s?usp=sharing).\n\n### QA Evaluator\n\nThe QA evaluator takes a question-answer pair as an input and outputs a value representing its prediction about whether the input was a valid question and answer pair or not. The model is `bert-base-cased` with a sequence classification head. The pretrained model was finetuned on the same data as the question generator model, but the context was removed. The question and answer were concatenated 50% of the time. In the other 50% of the time a corruption operation was performed (either swapping the answer for an unrelated answer or copying part of the question into the answer). The model was then trained to predict whether the input sequence represented one of the original QA pairs or a corrupted input.\n\nThe input for the QA evaluator follows the format for `BertForSequenceClassification`, but using the question and answer as the two sequences. It has the following format:\n\n```\n[CLS] \u003cquestion\u003e [SEP] \u003canswer\u003e [SEP]\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FAMontgomerie%2Fquestion_generator","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FAMontgomerie%2Fquestion_generator","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FAMontgomerie%2Fquestion_generator/lists"}