MIDAS@IIITD NLP Task
- Host: GitHub
- URL: https://github.com/praful932/midas
- Owner: Praful932
- Created: 2021-04-10T11:11:03.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2021-04-10T17:45:50.000Z (over 4 years ago)
- Last Synced: 2025-03-24T11:56:55.354Z (7 months ago)
- Topics: midas, nlp
- Language: Jupyter Notebook
- Homepage:
- Size: 72.3 KB
- Stars: 7
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
## [MIDAS Lab](http://midas.iiitd.edu.in/) Task-3 NLP
# Contents
- [Files to Refer](#files-to-refer)
- [Models Used](#models-used)
- [Things tried & Further Improvements](#things-tried--further-improvements)
- [References](#references)

## Files to Refer
- The repo works best in Colab.
- [Notebook1 - Cleaning, EDA & Preparation for Modelling](https://colab.research.google.com/drive/1c26l-TR899pfLr09p_Ol3Jnq-Fshv9f1?usp=sharing)
- [Notebook2 - Modelling](https://colab.research.google.com/drive/1ofOkfCJKriBfMRwv0PNZpJBgxavmynEM?usp=sharing)
- [Drive Folder](https://drive.google.com/drive/folders/1GEq7QE_wejY6o_U8yFj6jnb1lrSVpP0f?usp=sharing)
- `data.csv` - Raw Dataset Provided.
- `processed_data.csv` - Processed Dataset generated by Notebook1.
- `below_thresh_index.txt` - Indices of examples whose category was rare in the dataset, generated by Notebook1. More details in Notebook1.
- `Models/Pretrained-bert` - Saved pretrained model, generated by Notebook2 when `TRAIN = True` and used for loading and inference in Notebook2.
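As a rough illustration of that `TRAIN` flow, a minimal sketch using the Hugging Face `transformers` save/load API; the directory name comes from the repo, while the checkpoint name and the `num_labels` placeholder are assumptions, not the notebook's exact code:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_DIR = "Models/Pretrained-bert"   # Drive folder referenced above
TRAIN = False                          # set to True to fine-tune from scratch
num_labels = 20                        # placeholder: number of product categories

if TRAIN:
    # Start from the public DistilBert checkpoint, fine-tune (loop omitted),
    # then persist model + tokenizer so they can be reloaded for inference.
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=num_labels)
    # ... fine-tuning happens here ...
    model.save_pretrained(MODEL_DIR)
    tokenizer.save_pretrained(MODEL_DIR)
else:
    # Reload the already fine-tuned model saved in the Drive folder.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
```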
## Models Used
- **Random Forest Classifier**, Weighted F1 Score - `0.9764`
- **DistilBert Uncased**, Weighted F1 Score - `0.8970`
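For reference, a minimal sketch of the first pipeline (TF-IDF features fed to a Random Forest, scored with weighted F1). The column names, split, and hyperparameters are assumptions, not the exact notebook code:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

df = pd.read_csv("processed_data.csv")  # output of Notebook1
# "text" / "category" are placeholder names for the concatenated text
# features and the target label.
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["category"], test_size=0.2,
    stratify=df["category"], random_state=42)

clf = Pipeline([
    ("tfidf", TfidfVectorizer()),  # yields a large sparse matrix (~47k features)
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
])
clf.fit(X_train, y_train)

# Weighted F1 accounts for the class imbalance in the category labels.
print(f1_score(y_test, clf.predict(X_test), average="weighted"))
```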
## Things tried & Further Improvements
- In Notebook1 (preprocessing), lemmatization with spaCy was tried on all the text features. It was dropped since it made little difference given the vocabulary, and the pipeline took too long (>30 min for ~20k samples).
- The `description` feature reads more like a specification than a description with semantic sense, so `product_specifications` seemed more useful for fine-tuning a pretrained model for Sequence Classification.
- Using TF-IDF for the 1st model generated around 47k features. SparsePCA was tried to reduce them, but since the dataset was too large, Colab crashed. As the 1st model was already giving a decent score, IncrementalPCA, which could have overcome the memory issue, wasn't tried; a rough sketch of that idea is included after this list.
- For the pretrained model, DistilBert was used, which gave a decent score with ~20% of the examples (BERT memory issue). Only one feature, `product_specifications`, was used for this 2nd model as it had a semantic order; see the fine-tuning sketch after this list.
- For both the Random Forest and the Sequence Classification model, the Weighted F1 Score is calculated so that the imbalance of the dataset is taken care of.
- It is interesting to see the predictions of both models on the discarded examples (those which did not have a target). It is amazing what Transfer Learning can do with just 20 examples for each category.

- To improve performance, hyperparameter tuning can be done and the pretrained model can be trained on more data.
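The IncrementalPCA idea above was not implemented in the repo; a rough sketch of how it could reduce the ~47k-dimensional TF-IDF matrix batch by batch without exhausting Colab memory. Here `X_tfidf` is assumed to be the sparse output of `TfidfVectorizer`, and the component count and batch size are arbitrary:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

# X_tfidf: sparse matrix produced by TfidfVectorizer (~20k rows x ~47k columns).
ipca = IncrementalPCA(n_components=300)
batch_size = 1000

# Fit batch by batch so only one dense chunk lives in memory at a time.
for start in range(0, X_tfidf.shape[0], batch_size):
    batch = X_tfidf[start:start + batch_size]
    if batch.shape[0] < ipca.n_components:  # partial_fit needs >= n_components rows
        break
    ipca.partial_fit(batch.toarray())

# Transform in batches as well, then stack the reduced features.
X_reduced = np.vstack([
    ipca.transform(X_tfidf[start:start + batch_size].toarray())
    for start in range(0, X_tfidf.shape[0], batch_size)
])
```

Likewise, a hedged sketch of fine-tuning DistilBert on `product_specifications` with the Hugging Face `Trainer` and a weighted-F1 metric; the column names, label handling, and hyperparameters are assumptions and the notebooks may do this differently:

```python
import numpy as np
import pandas as pd
from datasets import Dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

df = pd.read_csv("processed_data.csv")
labels = sorted(df["category"].unique())        # placeholder label column
label2id = {label: i for i, label in enumerate(labels)}

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocess(batch):
    # Tokenize the specification text and attach integer labels.
    enc = tokenizer(batch["product_specifications"], truncation=True)
    enc["label"] = [label2id[c] for c in batch["category"]]
    return enc

dataset = Dataset.from_pandas(df).map(preprocess, batched=True)
splits = dataset.train_test_split(test_size=0.2, seed=42)

def compute_metrics(eval_pred):
    # Weighted F1, matching the imbalance-aware metric used for both models.
    logits, y_true = eval_pred
    return {"weighted_f1": f1_score(y_true, np.argmax(logits, axis=-1),
                                    average="weighted")}

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(labels))

trainer = Trainer(
    model=model,
    args=TrainingArguments("Models/Pretrained-bert", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())
```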
## References
- [Text Classification on GLUE](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/text_classification.ipynb)