{"id":13535109,"url":"https://github.com/krishna-sharma19/SBU-QA","last_synced_at":"2025-04-02T00:32:29.442Z","repository":{"id":68517365,"uuid":"161274766","full_name":"krishna-sharma19/SBU-QA","owner":"krishna-sharma19","description":"This repository uses pretrain BERT embeddings for transfer learning in QA domain","archived":false,"fork":false,"pushed_at":"2018-12-18T06:24:36.000Z","size":9270,"stargazers_count":29,"open_issues_count":0,"forks_count":9,"subscribers_count":7,"default_branch":"master","last_synced_at":"2024-11-02T23:32:13.505Z","etag":null,"topics":["bert-model","fine-tuning","question-answering","question-generation","tensorflow","transfer-learning"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/krishna-sharma19.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2018-12-11T03:58:11.000Z","updated_at":"2020-10-16T12:44:58.000Z","dependencies_parsed_at":"2023-03-07T06:00:18.604Z","dependency_job_id":null,"html_url":"https://github.com/krishna-sharma19/SBU-QA","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/krishna-sharma19%2FSBU-QA","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/krishna-sharma19%2FSBU-QA/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/krishna-sharma19%2FSBU-QA/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/krishna-sharma19%2FSBU-QA/manifests","owner_url":"https:/
/repos.ecosyste.ms/api/v1/hosts/GitHub/owners/krishna-sharma19","download_url":"https://codeload.github.com/krishna-sharma19/SBU-QA/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246735340,"owners_count":20825220,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert-model","fine-tuning","question-answering","question-generation","tensorflow","transfer-learning"],"created_at":"2024-08-01T08:00:49.805Z","updated_at":"2025-04-02T00:32:24.406Z","avatar_url":"https://github.com/krishna-sharma19.png","language":"Jupyter Notebook","funding_links":[],"categories":["BERT QA \u0026 RC task:"],"sub_categories":[],"readme":"                                                 ONLINE DEMO COMING SOON....\n\n# SOURCES\nWe’ve taken pre-trained embeddings from the BERT model - https://github.com/google-research/bert           \nAutomatic Q/A generation from TheGadFlyProject - https://github.com/TheGadflyProject/TheGadflyProject                 \n# REQUIREMENTS\nYou’ll need TensorFlow, spaCy, Python 3.6, and NLTK.           \n# INTRODUCTION\nDue to the high volume of data on the internet, it is becoming increasingly difficult to search for relevant information. Search engines can be useful for finding such information, but they sometimes struggle to answer questions asked in natural language. Question-answering systems trained on large datasets can give great results, but training a deep neural network on a small dataset leads to poor results. One such example is a question-answering system for Stony Brook University. 
To develop a QA system for such a scenario, we propose learning weights from other large datasets and then fine-tuning them on Stony Brook University website data. We built a QA system for SBU using BERT base pre-trained weights. We used several other techniques to retrieve the correct context paragraphs and to evaluate our system. \n\nWe took the pre-trained weights of the BERT base model and fine-tuned them on the SQuAD QA training data. To decide which paragraph contains the answer, we used multiway classification and InferSent.\n\nMost of our work is in 3 files - \n- doc_classifier.ipynb               \n- fine_tuning_squad.ipynb                     \n- fine_tunign_sbu_squad.ipynb                    \n### doc_classifier.ipynb (Jupyter notebook)                                             \n  To find the context for a given question, we wrote the doc_retrieval module. It contains a Jupyter notebook that generates JSON that can be passed to the trained model for prediction.\n### fine_tuning_squad.ipynb (Colab notebook)                                                                  \nIn the BERT module, we load the pre-trained embeddings and run run_squad.py, which ships with the BERT repository.\nYou can also run the Colab notebook - https://colab.research.google.com/drive/1vaWITP0EmlmDn-bxAckzrxM8T_gbCAdj\n### DownloadWebsite.py\nThis file downloads Wikipedia data about Stony Brook University and converts it so that it can be used during pre-training.\n\n# SETUP\nWe used Google Colaboratory to explore the BERT model, and since our experiments went well we decided to fine-tune our network on Colab. It therefore does not matter where you keep the files locally; you need to either upload them to the notebook's directory structure or upload them to your drive and then mount your drive. The folder BERT/bert_reqs contains all the requirements you'll need, so make sure you're using it correctly in Colab. 
There are 2 options - \n- Upload files directly from your local system\n- Mount Google cloud on Colab and upload files from Google cloud (RECOMMENDED)           \nWe recommend using cloud because we found it faster, and it's frustrating to browse for the requirements every time the runtime environment or the notebook resets.\n# FAQs \n### How can I train using run_squad.py?                      \nYou can fine-tune your own network with pre-trained embeddings on SQuAD using the following command. \nThis command is also present in fine_tuning_squad.ipynb.                   \n```\npython run_squad.py \\\n  --vocab_file=$BERT_BASE_DIR/vocab.txt \\\n  --bert_config_file=$BERT_BASE_DIR/bert_config.json \\\n  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \\\n  --do_train=True \\\n  --train_file=train-v1.2.json \\\n  --do_predict=True \\\n  --predict_file='handmade_qa_sbu.json' \\\n  --train_batch_size=8 \\\n  --learning_rate=3e-5 \\\n  --num_train_epochs=2.0 \\\n  --max_seq_length=384 \\\n  --doc_stride=128 \\\n  --output_dir=$OUTPUT_DIR \n  ```\n### How can I pre-train using Stony Brook data?     \nYou can use sbu_small_pretrain.tfrecord, generated by downloadData.py, to pre-train bert_base. Here is a command to do that. \n  ```\npython run_pretraining.py \\\n  --input_file=sbu_small_pretrain.tfrecord \\\n  --output_dir=./temp/ \\\n  --do_train=True \\\n  --do_eval=True \\\n  --bert_config_file=bert_config.json \\\n  --init_checkpoint=bert_model.ckpt \\\n  --train_batch_size=32 \\\n  --max_seq_length=128 \\\n  --max_predictions_per_seq=20 \\\n  --num_train_steps=20 \\\n  --num_warmup_steps=10 \\\n  --learning_rate=2e-4\n  ```\n### How can I predict my own questions and contexts?   \nYou just need to generate JSON from doc_classifier.ipynb by passing the lists of questions and contexts to the create_json() method, and then run the fine-tuned network for prediction.            
\nNOTE: The output folder must not be empty; it should contain the checkpoint and data files.  \n#### Generating a SQuAD-style JSON file - \n```\n# You can also generate JSON for your own context paragraphs and questions\ncustom_para = ['''Domestic Student Health Insurance Plan (SHIP) .Benefits and Highlights of the SHIP.SHIP \nhas been developed especially for Stony Brook students (and their dependents) to provide access to comprehensive\ncare that complements the quality health services on campus.The details of the plan are reviewed and recommended\neach year by committee members to ensure that the coverage is well-suited to the needs of the Stony Brook students \nand respectful of their budgets. SHIP is administered by United Healthcare. The Plans meet all of the student health\ninsurance standards developed by the American College Health Association.SHIP is tailor-made for the college\npopulation.Provides continuous coverage at a reasonable cost for most on or off-campus medical care over \nFall/Winter and Spring/Summer Semesters.Covers pre-existing medical conditions \u0026 preventative care.\nAnnual deductible $200 for an individual.Annual out of pocket limit of $3,000 which includes deductibles, \ncopays and coinsurance.Covers inpatient and outpatient mental health care.No deductible applied to prescription \ndrug coverage.Please note: Office visits for Primary Care and Specialists have a $35 copayment \nwith 0% coinsurance with a referral and 30% coinsurance without a referral.''']\ncustom_ques = [\"What is the annual deductible amount for SHIP?\"]\njson_file = create_json(custom_para,custom_ques)\n```\n\nWe used InferSent to find the context paragraph containing the answer. The basic idea is that the question embedding should be closer to the embedding of the paragraph containing the answer than to those of the other paragraphs. 
The primary code change is in Infersent/encoder/Infersent.ipynb.\n\n\n#### Generating predictions for this file\n```\npython run_squad.py \\\n  --vocab_file=$BERT_BASE_DIR/vocab.txt \\\n  --bert_config_file=$BERT_BASE_DIR/bert_config.json \\\n  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \\\n  --do_train=False \\\n  --train_file=train-v1.2.json \\\n  --do_predict=True \\\n  --predict_file='qna_sbu_test.json' \\\n  --train_batch_size=8 \\\n  --learning_rate=3e-5 \\\n  --num_train_epochs=2.0 \\\n  --max_seq_length=384 \\\n  --doc_stride=128 \\\n  --output_dir=$OUTPUT_DIR \n  ```\n#### See your answer -\n```\n!cat output_small/predictions.json\n```\n### How can I test on your hand-annotated test JSON file?  \nYou can test the fine-tuned network on our hand-annotated dataset with the following command:\n```\npython run_squad.py \\\n  --vocab_file=$BERT_BASE_DIR/vocab.txt \\\n  --bert_config_file=$BERT_BASE_DIR/bert_config.json \\\n  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \\\n  --do_train=False \\\n  --train_file=train-v1.2.json \\\n  --do_predict=True \\\n  --predict_file='handmade_qa_sbu.json' \\\n  --train_batch_size=8 \\\n  --learning_rate=3e-5 \\\n  --num_train_epochs=2.0 \\\n  --max_seq_length=384 \\\n  --doc_stride=128 \\\n  --output_dir=$OUTPUT_DIR \n  ```\n### TODO\n- [x] Fine-tune BERT on SQuAD                                                          \n- [x] Create a document classifier to get context                                               \n- [ ] ADD DEMO                                                \n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkrishna-sharma19%2FSBU-QA","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkrishna-sharma19%2FSBU-QA","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkrishna-sharma19%2FSBU-QA/lists"}