https://github.com/mdtanvirhossaintusher/qa_classifier

A multi-label text classifier that assigns programming-related questions to 69 different categories based on the question description.

blurr flask gradio huggingface multilabel-classification natural-language-processing render text-classification

Host: GitHub
URL: https://github.com/mdtanvirhossaintusher/qa_classifier
Owner: MdTanvirHossainTusher
License: mit
Created: 2023-08-31T05:35:36.000Z (almost 3 years ago)
Default Branch: main
Last Pushed: 2023-09-01T02:54:44.000Z (almost 3 years ago)
Last Synced: 2025-10-24T08:15:45.503Z (9 months ago)
Topics: blurr, flask, gradio, huggingface, multilabel-classification, natural-language-processing, render, text-classification
Language: Jupyter Notebook
Homepage: https://multilabel-question-category-classifier.onrender.com/
Size: 15.7 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# MultiLabel-Question-Category-Classifier(QA_Classifier)

An end-to-end text classification project covering data collection, model training, and deployment.
The model can classify questions into 69 different categories.
The keys of `deployment\category_types_encoded.json` contain the question categories.
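
As a minimal sketch of how such a category mapping might be used, the snippet below reads a JSON object whose keys are category names and turns a vector of per-category probabilities into a list of predicted categories. The sample JSON content, the assumption that each value is the category's integer index, the helper names `load_categories` and `decode_predictions`, and the 0.5 threshold are all illustrative and are not taken from the repository.

```python
import json

# Hypothetical stand-in for deployment/category_types_encoded.json.
# Assumption: keys are category names, values are integer indices.
SAMPLE_JSON = '{"python": 0, "javascript": 1, "pandas": 2}'


def load_categories(json_text):
    """Return category names ordered by their encoded index."""
    mapping = json.loads(json_text)
    return sorted(mapping, key=mapping.get)


def decode_predictions(probabilities, categories, threshold=0.5):
    """Keep every category whose probability reaches the threshold."""
    return [c for c, p in zip(categories, probabilities) if p >= threshold]


categories = load_categories(SAMPLE_JSON)
predicted = decode_predictions([0.91, 0.08, 0.77], categories)
print(predicted)
```

Because the task is multi-label, more than one category can pass the threshold for a single question.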

# Data Collection

Data was collected from the questions section of Stack Overflow: https://stackoverflow.com/questions
The data collection process is divided into two steps:

1. **Question URL Scraping:** The question URLs were scraped with `scraper\questions_url_scraper.py` and stored, along with the question titles, in `data\questions_urls.csv`.

2. **Question Details Scraping:** Using those URLs, the full question text and its categories/tags were scraped with `scraper\questions_details_scraper.py` and stored in `data\questions_details.csv`.

In total, I scraped 22257 question URLs and 22124 question details. Some URLs did not lead to a valid page, so those were ignored.
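
The second step of this flow can be sketched as below. This is only an illustration under stated assumptions: `FAKE_PAGES` and `fetch_details` are hypothetical stand-ins for real HTTP requests and HTML parsing, the column names are invented, and the CSV files are held in memory. It is not the code in the repository's scraper scripts.

```python
import csv
import io

# Hypothetical stand-in for real HTTP fetching and HTML parsing: maps a URL
# to (question_text, tags); a missing URL means there is no valid page.
FAKE_PAGES = {
    "https://example.com/q/1": ("How do I merge two dicts?", ["python", "dictionary"]),
    "https://example.com/q/3": ("Why is my div not centered?", ["css", "html"]),
}


def fetch_details(url):
    return FAKE_PAGES.get(url)


def scrape_details(urls_csv, details_csv, fetch=fetch_details):
    """Read (title, url) rows, write question details, skip invalid pages."""
    reader = csv.DictReader(urls_csv)
    writer = csv.writer(details_csv)
    writer.writerow(["title", "url", "question", "tags"])
    kept = skipped = 0
    for row in reader:
        details = fetch(row["url"])
        if details is None:  # URL did not lead to a valid question page
            skipped += 1
            continue
        question, tags = details
        writer.writerow([row["title"], row["url"], question, ",".join(tags)])
        kept += 1
    return kept, skipped


# Output of step 1 (title + URL), kept in memory instead of a CSV on disk.
urls_csv = io.StringIO(
    "title,url\n"
    "Merge dicts,https://example.com/q/1\n"
    "Deleted question,https://example.com/q/2\n"
    "Center a div,https://example.com/q/3\n"
)
details_csv = io.StringIO()
kept, skipped = scrape_details(urls_csv, details_csv)
print(kept, skipped)
```

Passing the fetch function as a parameter keeps the skip logic testable without network access.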

# Data Preprocessing

Initially there were 10634 different categories in the dataset. After some analysis, I found that 10565 of them were rare (each had only a small number of related questions). I removed those categories, which left 69. After removing the data with rare categories, 17011 samples remained in total. Fortunately, the dataset didn't have any null values.
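
A minimal sketch of this kind of rare-category filtering is shown below. The toy samples, the `MIN_COUNT` cutoff, and the choice to drop a sample only when none of its tags survive are assumptions made for illustration; the README does not state the exact threshold or rule that was used.

```python
from collections import Counter

# Toy stand-in for the scraped dataset: each sample is a list of tags.
samples = [
    ["python", "pandas"],
    ["python", "obscure-lib"],
    ["javascript"],
    ["javascript", "python"],
    ["rare-tag"],
]

MIN_COUNT = 2  # hypothetical cutoff; the actual one is not given


def drop_rare_categories(samples, min_count):
    """Remove tags seen fewer than min_count times, then drop empty samples."""
    counts = Counter(tag for tags in samples for tag in tags)
    frequent = {tag for tag, n in counts.items() if n >= min_count}
    filtered = [[t for t in tags if t in frequent] for tags in samples]
    return [tags for tags in filtered if tags], sorted(frequent)


kept_samples, kept_categories = drop_rare_categories(samples, MIN_COUNT)
print(kept_categories)
print(len(kept_samples))
```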

# Model Training
Fine-tuned a `distilroberta-base` and a `distilbert-base-uncased` model from HuggingFace Transformers using Fastai and Blurr. The model training notebooks can be viewed in the `notebooks` folder of this branch.

# Result Analysis
The table below shows the multilabel accuracy and the F1 scores (micro and macro) for the two models.

| Model | Accuracy_multi | F1 Score (Micro) | F1 Score (Macro) |
|---|---|---|---|
| distilroberta-base | 98.4 | 67.03 | 53.34 |
| distilbert-base-uncased | 98.3 | 64.44 | 52.74 |

From the table above, the multilabel accuracy is very close for both models. However, both the micro and macro F1 scores of `distilroberta-base` are higher than those of `distilbert-base-uncased`, so `distilroberta-base` performed slightly better on this dataset.
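
To illustrate why multilabel accuracy can sit near the top of its range while the F1 scores are much lower, here is a small pure-Python sketch of the three metrics. It assumes that multilabel accuracy is the fraction of individual (sample, label) entries predicted correctly, and the toy label matrices are invented for the example. With many labels and only a few active per question, most entries are easy true negatives, which inflates accuracy but does not help F1.

```python
def multilabel_metrics(y_true, y_pred):
    """Return (elementwise accuracy, micro F1, macro F1) for 0/1 matrices."""
    n_labels = len(y_true[0])
    tp = [0] * n_labels
    fp = [0] * n_labels
    fn = [0] * n_labels
    correct = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(true_row, pred_row)):
            correct += int(t == p)
            tp[j] += int(t == 1 and p == 1)
            fp[j] += int(t == 0 and p == 1)
            fn[j] += int(t == 1 and p == 0)

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0  # no support counts as 0

    accuracy = correct / (len(y_true) * n_labels)
    micro = f1(sum(tp), sum(fp), sum(fn))  # pool counts over all labels
    macro = sum(f1(*c) for c in zip(tp, fp, fn)) / n_labels  # mean per label
    return accuracy, micro, macro


# Toy example: 4 questions, 5 labels, two missed labels.
y_true = [[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [1, 0, 0, 0, 0], [0, 0, 1, 0, 0]]
y_pred = [[1, 0, 0, 0, 0], [0, 0, 0, 0, 0], [1, 0, 0, 0, 0], [0, 0, 0, 0, 0]]
accuracy, micro_f1, macro_f1 = multilabel_metrics(y_true, y_pred)
print(accuracy, micro_f1, macro_f1)
```

In this toy case 18 of the 20 entries match, yet two of the three labels that actually occur are never predicted, so macro F1 drops well below micro F1.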

# Model Compression and ONNX Inference
The trained model is over 300 MB in size. I compressed it using ONNX quantization, which brought it down to ~78.8 MB.
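
The idea behind this kind of size reduction can be sketched in plain Python: each 32-bit float weight (4 bytes) is stored as an 8-bit integer (1 byte) plus a shared scale and zero point, which gives roughly a 4x reduction, the same order as the figures above. This is a simplified illustration of affine 8-bit quantization with invented weights and helper names, not the ONNX tooling's implementation and not the code used in this project.

```python
def quantize_uint8(weights):
    """Affine-quantize floats to integers in [0, 255]."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255 or 1.0  # avoid a zero scale for constant input
    zero_point = round(-lo / scale)
    q = [min(255, max(0, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Map the stored integers back to approximate float values."""
    return [(v - zero_point) * scale for v in q]


weights = [-0.52, -0.1, 0.0, 0.03, 0.27, 0.75]  # invented example weights
q, scale, zero_point = quantize_uint8(weights)
restored = dequantize(q, scale, zero_point)

float32_bytes = 4 * len(weights)  # 4 bytes per float32 weight
uint8_bytes = 1 * len(weights)    # 1 byte per quantized weight
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(float32_bytes / uint8_bytes)
print(max_error <= scale)
```

The price of the smaller file is a rounding error per weight, bounded here by the quantization step `scale`.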

# Model Deployment

The compressed model is deployed as a Gradio app on HuggingFace Spaces. The implementation can be found in the `deployment` folder, or see it live [here.](https://huggingface.co/spaces/MdTanvirHossain/QA_Classifier)

# Web Deployment
A Flask app takes a question description as input and shows the predicted categories as output. Check the `flask` branch for details. The website is live [here.](https://multilabel-question-category-classifier.onrender.com)
