{"id":26488037,"url":"https://github.com/mdtanvirhossaintusher/qa_classifier","last_synced_at":"2026-05-06T04:32:22.068Z","repository":{"id":191828201,"uuid":"685384578","full_name":"MdTanvirHossainTusher/QA_Classifier","owner":"MdTanvirHossainTusher","description":"A multi-label text classifier that can classify 69 different programming related question category based on question description.","archived":false,"fork":false,"pushed_at":"2023-09-01T02:54:44.000Z","size":16493,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-10-24T08:15:45.503Z","etag":null,"topics":["blurr","flask","gradio","huggingface","multilabel-classification","natural-language-processing","render","text-classification"],"latest_commit_sha":null,"homepage":"https://multilabel-question-category-classifier.onrender.com/","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MdTanvirHossainTusher.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-08-31T05:35:36.000Z","updated_at":"2023-09-01T15:40:57.000Z","dependencies_parsed_at":"2025-07-14T09:31:15.569Z","dependency_job_id":"35085a8f-8c85-4cac-95b9-fdfe597511d3","html_url":"https://github.com/MdTanvirHossainTusher/QA_Classifier","commit_stats":null,"previous_names":["mdtanvirhossaintusher/qa_classifier"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/MdTanvirHossainTusher/QA_Classifier","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Md
TanvirHossainTusher%2FQA_Classifier","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MdTanvirHossainTusher%2FQA_Classifier/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MdTanvirHossainTusher%2FQA_Classifier/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MdTanvirHossainTusher%2FQA_Classifier/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MdTanvirHossainTusher","download_url":"https://codeload.github.com/MdTanvirHossainTusher/QA_Classifier/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MdTanvirHossainTusher%2FQA_Classifier/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32678604,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-06T02:33:58.958Z","status":"ssl_error","status_checked_at":"2026-05-06T02:33:39.611Z","response_time":117,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["blurr","flask","gradio","huggingface","multilabel-classification","natural-language-processing","render","text-classification"],"created_at":"2025-03-20T06:55:41.320Z","updated_at":"2026-05-06T04:32:22.049Z","avatar_url":"https://github.com/MdTanvirHossainTusher.png","language":"Jupyter 
Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# MultiLabel-Question-Category-Classifier (QA_Classifier)\n\nAn end-to-end text classification project covering data collection, model training, and deployment.\nThe model assigns each question one or more of 69 question categories.\nThe keys of `deployment\\category_types_encoded.json` list the question categories.\n\n# Data Collection\n\nData was collected from the questions section of Stack Overflow: https://stackoverflow.com/questions\nThe data collection process is divided into two steps:\n\n1. **Question URL Scraping:** The question URLs were scraped with `scraper\\questions_url_scraper.py` and stored along with the question titles in `data\\questions_urls.csv`.\n\n2. **Question Details Scraping:** Using the URLs, the full questions and their categories/tags were scraped with `scraper\\questions_details_scraper.py` and stored in `data\\questions_details.csv`. \n\nIn total, I scraped 22257 question URLs and 22124 question details. Some URLs didn't point to a valid page, so those were ignored. \n\n\n# Data Preprocessing\n\nInitially there were 10634 different categories in the dataset. After some analysis, I found that 10565 of them were rare (each had only a small number of related questions), so I removed them, leaving 69 categories. After removing the samples that had only rare categories, 17011 samples were left in total. The dataset didn't have any null values.\n\n# Model Training\nFine-tuned the `distilroberta-base` and `distilbert-base-uncased` models from HuggingFace Transformers using Fastai and Blurr. 
The model training notebook can be viewed in the `notebooks` folder of this branch.\n\n# Result Analysis\nThe table below shows the multilabel accuracy and the F1 scores (micro \u0026 macro) for the two models.\n\u003ctable\u003e\n  \u003ctr\u003e\n    \u003cth\u003eModel\u003c/th\u003e\n    \u003cth\u003eAccuracy_multi (%)\u003c/th\u003e\n    \u003cth\u003eF1 Score, Micro (%)\u003c/th\u003e\n    \u003cth\u003eF1 Score, Macro (%)\u003c/th\u003e\n  \u003c/tr\u003e\n  \n  \u003ctr\u003e\n    \u003ctd\u003edistilroberta-base\u003c/td\u003e\n    \u003ctd\u003e98.4\u003c/td\u003e\n    \u003ctd\u003e67.03\u003c/td\u003e\n    \u003ctd\u003e53.34\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003edistilbert-base-uncased\u003c/td\u003e\n    \u003ctd\u003e98.3\u003c/td\u003e\n    \u003ctd\u003e64.44\u003c/td\u003e\n    \u003ctd\u003e52.74\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\nThe table shows that multilabel accuracy is very close for the two models, but both the micro and macro F1 scores of `distilroberta-base` are higher than those of `distilbert-base-uncased`. Because each question carries only a few of the 69 labels, per-label accuracy is dominated by true negatives, so the F1 scores are the more informative comparison. On that basis, `distilroberta-base` performed slightly better on this dataset.\n\n# Model Compression and ONNX Inference\nThe trained model is 300+ MB in size. I compressed it using ONNX quantization and brought it down to ~78.8 MB.\n\n# Model Deployment\n\nThe compressed model is deployed as a Gradio app on HuggingFace Spaces. The implementation can be found in the `deployment` folder, or see it live [here.](https://huggingface.co/spaces/MdTanvirHossain/QA_Classifier)\n\n\u003cimg src=\"deployment/gradio_app_2.PNG\" alt=\"Gradio app screenshot\" style=\"width:1600px;height:400px;\"\u003e \u003cbr\u003e\n\n\n# Web Deployment\nDeployed a Flask app that takes a question description as input and shows the predicted categories as output. Check the `flask` branch for details. 
The website is live [here.](https://multilabel-question-category-classifier.onrender.com)\n\u003cbr\u003e\u003cbr\u003e\n\u003cimg src=\"deployment/flask_app_home.PNG\" alt=\"Flask app home page\" style=\"width:1000px;height:500px;\"\u003e\u003cbr\u003e\u003cbr\u003e\n\u003cimg src=\"deployment/flask_app_result.PNG\" alt=\"Flask app result page\" style=\"width:1000px;height:500px;\"\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmdtanvirhossaintusher%2Fqa_classifier","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmdtanvirhossaintusher%2Fqa_classifier","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmdtanvirhossaintusher%2Fqa_classifier/lists"}