MIDAS@IIITD NLP Task
- Host: GitHub
- URL: https://github.com/praful932/midas
- Owner: Praful932
- Created: 2021-04-10T11:11:03.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2021-04-10T17:45:50.000Z (over 4 years ago)
- Last Synced: 2025-03-24T11:56:55.354Z (7 months ago)
- Topics: midas, nlp
- Language: Jupyter Notebook
- Homepage:
- Size: 72.3 KB
- Stars: 7
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
## [MIDAS Lab](http://midas.iiitd.edu.in/) Task-3 NLP
# Contents
- [Files to Refer](#files-to-refer)
- [Models Used](#models-used)
- [Things tried & Further Improvements](#things-tried--further-improvements)
- [References](#references)

## Files to Refer
- The repo works best in Colab.
- [Notebook1 - Cleaning, EDA & Preparation for Modelling](https://colab.research.google.com/drive/1c26l-TR899pfLr09p_Ol3Jnq-Fshv9f1?usp=sharing)
- [Notebook2 - Modelling](https://colab.research.google.com/drive/1ofOkfCJKriBfMRwv0PNZpJBgxavmynEM?usp=sharing)
- [Drive Folder](https://drive.google.com/drive/folders/1GEq7QE_wejY6o_U8yFj6jnb1lrSVpP0f?usp=sharing)
- `data.csv` - Raw Dataset Provided.
- `processed_data.csv` - Processed Dataset generated by Notebook1.
- `below_thresh_index.txt` - Indices of examples whose category was rare in the dataset, generated by Notebook1. More details in Notebook1.
- `Models/Pretrained-bert` - Saved pretrained model, generated by Notebook2 when `TRAIN = True` and used for loading and inference in Notebook2.
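As a rough illustration of that `TRAIN` flow, a minimal sketch using the Hugging Face `transformers` save/load API; the directory name comes from the repo, while the checkpoint name and the `num_labels` placeholder are assumptions, not the notebook's exact code:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_DIR = "Models/Pretrained-bert"   # Drive folder referenced above
TRAIN = False                          # set to True to fine-tune from scratch
num_labels = 20                        # placeholder: number of product categories

if TRAIN:
    # Start from the public DistilBert checkpoint, fine-tune (loop omitted),
    # then persist model + tokenizer so they can be reloaded for inference.
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=num_labels)
    # ... fine-tuning happens here ...
    model.save_pretrained(MODEL_DIR)
    tokenizer.save_pretrained(MODEL_DIR)
else:
    # Reload the already fine-tuned model saved in the Drive folder.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
```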
## Models Used
- **Random Forest Classifier**, Weighted F1 Score - `0.9764`
- **DistilBert Uncased**, Weighted F1 Score - `0.8970`
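For reference, a minimal sketch of the first pipeline (TF-IDF features fed to a Random Forest, scored with weighted F1). The column names, split, and hyperparameters are assumptions, not the exact notebook code:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

df = pd.read_csv("processed_data.csv")  # output of Notebook1
# "text" / "category" are placeholder names for the concatenated text
# features and the target label.
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["category"], test_size=0.2,
    stratify=df["category"], random_state=42)

clf = Pipeline([
    ("tfidf", TfidfVectorizer()),  # yields a large sparse matrix (~47k features)
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
])
clf.fit(X_train, y_train)

# Weighted F1 accounts for the class imbalance in the category labels.
print(f1_score(y_test, clf.predict(X_test), average="weighted"))
```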
## Things tried & Further Improvements
- In Notebook1 (preprocessing), lemmatization with spaCy was tried on all the text features. It was dropped since it made little difference given the vocabulary, and the pipeline took too long (>30 min for ~20k samples).
- The `description` feature reads more like a specification than a description with semantic sense, so `product_specifications` seemed more useful for fine-tuning a pretrained model for Sequence Classification.
- Using TF-IDF for the 1st model generated around 47k features. SparsePCA was tried to reduce them, but since the dataset was too large, Colab crashed. As the 1st model was already giving a decent score, IncrementalPCA, which could have overcome the memory issue, wasn't tried; a rough sketch of that idea is included after this list.
- For the pretrained model, DistilBert was used, which gave a decent score with ~20% of the examples (BERT memory issue). Only one feature, `product_specifications`, was used for this 2nd model as it had a semantic order; see the fine-tuning sketch after this list.
- For both the Random Forest and the Sequence Classification model, the Weighted F1 Score is calculated so that the imbalance of the dataset is taken care of.
- It is interesting to see the predictions of both models on the discarded examples (those which did not have a target). It is amazing what Transfer Learning can do with just 20 examples for each category.

- To improve performance, hyperparameter tuning can be done and the pretrained model can be trained on more data.
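The IncrementalPCA idea above was not implemented in the repo; a rough sketch of how it could reduce the ~47k-dimensional TF-IDF matrix batch by batch without exhausting Colab memory. Here `X_tfidf` is assumed to be the sparse output of `TfidfVectorizer`, and the component count and batch size are arbitrary:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

# X_tfidf: sparse matrix produced by TfidfVectorizer (~20k rows x ~47k columns).
ipca = IncrementalPCA(n_components=300)
batch_size = 1000

# Fit batch by batch so only one dense chunk lives in memory at a time.
for start in range(0, X_tfidf.shape[0], batch_size):
    batch = X_tfidf[start:start + batch_size]
    if batch.shape[0] < ipca.n_components:  # partial_fit needs >= n_components rows
        break
    ipca.partial_fit(batch.toarray())

# Transform in batches as well, then stack the reduced features.
X_reduced = np.vstack([
    ipca.transform(X_tfidf[start:start + batch_size].toarray())
    for start in range(0, X_tfidf.shape[0], batch_size)
])
```

Likewise, a hedged sketch of fine-tuning DistilBert on `product_specifications` with the Hugging Face `Trainer` and a weighted-F1 metric; the column names, label handling, and hyperparameters are assumptions and the notebooks may do this differently:

```python
import numpy as np
import pandas as pd
from datasets import Dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

df = pd.read_csv("processed_data.csv")
labels = sorted(df["category"].unique())        # placeholder label column
label2id = {label: i for i, label in enumerate(labels)}

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocess(batch):
    # Tokenize the specification text and attach integer labels.
    enc = tokenizer(batch["product_specifications"], truncation=True)
    enc["label"] = [label2id[c] for c in batch["category"]]
    return enc

dataset = Dataset.from_pandas(df).map(preprocess, batched=True)
splits = dataset.train_test_split(test_size=0.2, seed=42)

def compute_metrics(eval_pred):
    # Weighted F1, matching the imbalance-aware metric used for both models.
    logits, y_true = eval_pred
    return {"weighted_f1": f1_score(y_true, np.argmax(logits, axis=-1),
                                    average="weighted")}

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(labels))

trainer = Trainer(
    model=model,
    args=TrainingArguments("Models/Pretrained-bert", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())
```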
## References
- [Text Classification on GLUE](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/text_classification.ipynb)