{"id":15014115,"url":"https://github.com/kristiyanvachev/question-generation","last_synced_at":"2025-04-05T15:06:32.079Z","repository":{"id":44006885,"uuid":"163272580","full_name":"KristiyanVachev/Question-Generation","owner":"KristiyanVachev","description":"Generating multiple choice questions from text using Machine Learning.","archived":false,"fork":false,"pushed_at":"2021-10-27T07:02:06.000Z","size":20142,"stargazers_count":448,"open_issues_count":2,"forks_count":113,"subscribers_count":16,"default_branch":"master","last_synced_at":"2024-02-12T23:46:42.423Z","etag":null,"topics":["ai","cosine-similarity","machine-learning","naive-bayes","nlp","question-generation","question-generator","questions-and-answers","quiz","spacy","spacy-nlp","word-embeddings"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/KristiyanVachev.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-12-27T09:07:00.000Z","updated_at":"2024-02-12T23:46:34.000Z","dependencies_parsed_at":"2022-09-16T19:01:44.769Z","dependency_job_id":null,"html_url":"https://github.com/KristiyanVachev/Question-Generation","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KristiyanVachev%2FQuestion-Generation","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KristiyanVachev%2FQuestion-Generation/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KristiyanVachev%2FQuestion-Generation/releases","manifests_url":"https://repos.ec
osyste.ms/api/v1/hosts/GitHub/repositories/KristiyanVachev%2FQuestion-Generation/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/KristiyanVachev","download_url":"https://codeload.github.com/KristiyanVachev/Question-Generation/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247353745,"owners_count":20925329,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","cosine-similarity","machine-learning","naive-bayes","nlp","question-generation","question-generator","questions-and-answers","quiz","spacy","spacy-nlp","word-embeddings"],"created_at":"2024-09-24T19:45:12.791Z","updated_at":"2025-04-05T15:06:32.060Z","avatar_url":"https://github.com/KristiyanVachev.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n#  Question Generation\n\nThis project was originally intended for an AI course at Sofia University. During its execution, I was constrained by time and couldn't implement all the ideas I had, but I plan to continue working on it... and I did pick up the topic for my Master's thesis, using **T5 Transformers to generate question-answer pairs along with distractors**. Check it out in the [Question-Generation-Transformers](https://github.com/KristiyanVachev/Question-Generation-Transformers) repository. 
\n\nThe approach for identifying keywords used as target answers has been accepted at the RANLP2021 conference - [Generating Answer Candidates for Quizzes and Answer-Aware Question Generators](https://arxiv.org/abs/2108.12898v1).\n\n\n## General idea\nThe idea is to generate multiple choice questions from text by splitting this complex problem into simpler steps:\n\n - **Identify keywords** from the text and use them as answers to the questions.\n - **Replace the answer** in the sentence with a *blank space* and use it as the base for the question.\n - **Transform the sentence** with a blank space in place of the answer into a more *question-like sentence*.\n - **Generate distractors**, words that are similar to the answer, as *incorrect answers*.\n\n![Question generation step by step gif](https://media.giphy.com/media/1n4JPydITD3mGvTZBZ/giphy.gif)\n\n## Installation\n\n### Creating a virtual environment *(optional)*\nTo avoid conflicts with Python packages from other projects, it is good practice to create a [virtual environment](https://docs.python.org/3/library/venv.html) in which the packages will be installed. If you do not want to do this, you can skip the next commands and directly install the packages from the requirements.txt file. \n\nCreate a virtual environment:\n\n    python -m venv venv\n\nEnter the virtual environment:\n\n*Windows:*\n\n    . .\\venv\\Scripts\\activate\n\n*Linux or macOS:*\n\n    source venv/bin/activate\n\nRegister an IPython kernel for the venv:\n\n    ipython kernel install --user --name=.venv\n\nInstall jupyter lab inside the venv:\n\n    pip install jupyterlab\n\n### Installing packages\n\n    pip install -r requirements.txt \n    \n### Run jupyter\n\n    jupyter lab\n\n## Execution\n\n### Data Exploration\nBefore I could do anything, I wanted to understand more about how questions are made and what kinds of words their answers are.\n\nI used the [SQuAD 1.0](https://rajpurkar.github.io/SQuAD-explorer/) dataset, which has about 100 000 questions generated from Wikipedia articles.\n\nYou can read about the insights I've found in the *Data Exploration* jupyter notebook.\n\n### Identifying answers\nMy assumption was that **words from the text would be great answers for questions**. All I needed to do was to decide which words, or short phrases, are good enough to become answers.\n\nI decided to do a binary classification on each word from the text. [spaCy](https://spacy.io/) really helped me with the word tagging.\n\n#### Feature engineering\nI pretty much needed to create the entire dataset for the binary classification. \nI extracted each non-stop word from the paragraphs of each question in the SQuAD dataset and added some features to it, such as:\n\n - **Part of speech**\n - Is it a **Named entity**\n - Are only **alpha characters** used\n - **Shape** - whether it's only alpha characters, digits, has punctuation (xxxx, dddd, Xxx X. Xxxx)\n - **Word count**\n\nAnd the label **isAnswer** - whether the word extracted from the paragraph is the same as, and in the same place as, the answer to the SQuAD question. 
\n\nSome other features like **TF-IDF** score and **cosine similarity** *to the title* would be great, but I didn't have the time to add them.\n\nOther than those, it's up to our imagination to create new features - maybe whether it's in the start, middle or end of a sentence, information about the words surrounding it and more... Though before adding more features it would be nice to have a metric to assess whether a feature is going to be useful or not.\n\n#### Model training\nI found the problem similar to *spam filtering*, where a common approach is to tag each word of an email as coming from a spam or a non-spam email.\n\nI used scikit-learn's **Gaussian Naive Bayes** algorithm to classify whether each word is an answer.\n\nThe results were surprisingly good - at a quick glance, the algorithm classified most of the words as answers. The ones it didn't were in fact unfit.\n\nThe cool thing about *Naive Bayes* is that you get the **probability** for each word. In the demo I've used that to order the words from the most likely answer to the least likely.\n\n### Creating questions\nAnother assumption I had was that **the sentence of an answer could easily be turned into a question**. Just by placing a *blank space* in the position of the answer in the text I get a **\"cloze\" question** *(sentence with a blank space for the missing word)*.\n\n**Answer:** \nOxygen\n\n**Question:**\n \\_____ is a chemical element with symbol O and atomic number 8.\n\nI decided it wasn't worth it to transform the cloze question into a more question-looking sentence, but I imagine it could be done with a **seq2seq neural network**, similarly to the way text is translated from one language to another.\n\n### Generating incorrect answers\nThis part turned out really well. 
\n\nFor each answer I generate its most similar words using **word embeddings** and **cosine similarity**.\n\n![Most similar words to oxygen](https://i.gyazo.com/175b9f86b3defc0798800cb06169cc3f.png)\n\nMost of the words are just fine and could easily be mistaken for the correct answer. But there are some which are obviously not appropriate.\n\nSince I didn't have a dataset with incorrect answers I fell back on a more classical approach.\n\nI removed the words that **weren't the same part of speech** or **the same named entity** as the answer, and added some more context from the question.\n\nI would like to find a dataset with multiple choice answers and see if I can create a *ML model* for generating better incorrect answers.\n\n## Results\nAfter adding a Demo project, the generated questions aren't really fit to go into a classroom instantly, but they aren't bad either. \n\nThe cool thing is the **simplicity** and **modularity** of the approach, where you could find where it's doing badly (*say it's classifying verbs*) and plug a fix into it. \n\nA complex Neural Network (*like all the papers on the topic use*) will probably do better, especially in the age we're living in. But the great thing I found out about this approach is that it's like a *gateway for a software engineer*, with a software engineering mindset, to get into the field of AI and see meaningful results. \n\n## Future work (*updated*)\nI find this topic quite interesting and with a lot of potential. I would probably continue working in this field.\n\nI even enrolled in a *Master's in Data Mining* and will probably do some similar projects. I will link anything useful here.\n\nI've already put some more time into finishing the project, but I would like to turn it into more of a tutorial about getting into the field of AI while having the ability to easily extend it with new custom features. 
\n\n## Updates\n\n**Update - 29.12.19:** \nThe repository has become pretty popular, so I added a new notebook (*Demo.ipynb*) that combines all the modules and generates questions for any text. I reordered the other notebooks and documented the code (a bit better). \n\n**Update - 09.03.21:** \nAdded a requirements.txt file with instructions to run a virtual environment and fixed the bug with *ValueError: operands could not be broadcast together with shapes (230, 121) (83, )*\n\nI have also started working on my Master's thesis with a similar topic of Question Generation. \n\n**Update - 27.10.21:** \nI have uploaded the code for my Master's thesis in the [Question-Generation-Transformers](https://github.com/KristiyanVachev/Question-Generation-Transformers) repository. I highly encourage you to check it out. \n\nAdditionally, the approach using a classifier to pick the answer candidates has been accepted as a student paper at the RANLP2021 conference. [Paper here](https://arxiv.org/abs/2108.12898v1).\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkristiyanvachev%2Fquestion-generation","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkristiyanvachev%2Fquestion-generation","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkristiyanvachev%2Fquestion-generation/lists"}