{"id":16739600,"url":"https://github.com/praful932/midas","last_synced_at":"2025-04-10T13:14:00.815Z","repository":{"id":130280871,"uuid":"356556836","full_name":"Praful932/MIDAS","owner":"Praful932","description":"MIDAS@IIITD NLP Task","archived":false,"fork":false,"pushed_at":"2021-04-10T17:45:50.000Z","size":74,"stargazers_count":7,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-24T11:56:55.354Z","etag":null,"topics":["midas","nlp"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Praful932.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-04-10T11:11:03.000Z","updated_at":"2021-04-13T05:33:03.000Z","dependencies_parsed_at":"2023-03-10T17:25:04.978Z","dependency_job_id":null,"html_url":"https://github.com/Praful932/MIDAS","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Praful932%2FMIDAS","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Praful932%2FMIDAS/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Praful932%2FMIDAS/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Praful932%2FMIDAS/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Praful932","download_url":"https://codeload.github.com/Praful932/MIDAS/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https:
//github.com","kind":"github","repositories_count":248225653,"owners_count":21068078,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["midas","nlp"],"created_at":"2024-10-13T00:52:22.966Z","updated_at":"2025-04-10T13:14:00.342Z","avatar_url":"https://github.com/Praful932.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp style=\"text-align: center;\"\u003e\u003ca href=\"https://colab.research.google.com/github/Praful932/MIDAS/blob/main/\"\u003e\n  \u003cimg src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/\u003e\n\u003c/a\u003e\u003c/p\u003e\n\n## [MIDAS Lab](http://midas.iiitd.edu.in/) Task-3 NLP\n\n# Contents\n- [Files to Refer](#files-to-refer)\n- [Models Used](#models-used)\n- [Things tried \u0026 Further Improvements](#things-tried--further-improvements)\n- [References](#references)\n\n## Files to Refer\n- The repo works best in Colab.\n- [Notebook1 - Cleaning, EDA \u0026 Preparation for Modelling](https://colab.research.google.com/drive/1c26l-TR899pfLr09p_Ol3Jnq-Fshv9f1?usp=sharing)\n- [Notebook2 - Modelling](https://colab.research.google.com/drive/1ofOkfCJKriBfMRwv0PNZpJBgxavmynEM?usp=sharing)\n- [Drive Folder](https://drive.google.com/drive/folders/1GEq7QE_wejY6o_U8yFj6jnb1lrSVpP0f?usp=sharing)\n    - `data.csv` - Raw dataset provided.\n    - `processed_data.csv` - Processed dataset generated by Notebook1.\n    - `below_thresh_index.txt` - Indexes of examples whose category was rare in the dataset, generated by Notebook1. 
More details are in Notebook1.\n    - `Models/Pretrained-bert` - Saved pretrained model, generated by Notebook2 if `TRAIN = True` and used for loading and inference in Notebook2.\n\n## Models Used\n- **Random Forest Classifier**, Weighted F1 Score - `0.9764`\n- **DistilBert Uncased**, Weighted F1 Score - `0.8970`\n\n## Things tried \u0026 Further Improvements\n- In Notebook1 (preprocessing), lemmatization with spaCy was tried on all the text features. It was dropped because it changed the vocabulary very little and the pipeline took \u003e30 mins to lemmatize ~20k samples.\n- The `description` feature read more like a specification than a description with semantic sense, so `product_specifications` was deemed more useful for fine-tuning a pretrained model for Sequence Classification.\n- Using TF-IDF for the 1st model generated ~47k features. SparsePCA was tried to reduce them, but the dataset was too large and Colab crashed. Since the 1st model was already giving a decent score, IncrementalPCA, which could have overcome the memory issue, wasn't tried.\n- For the pretrained model, DistilBert was used; it gave a decent score with only ~20% of the examples (because of the Bert memory issue). Only one feature, `product_specifications`, was used for the 2nd model, as it had a semantic order.\n- For both Random Forest \u0026 Seq Classification, the Weighted F1 Score is calculated to account for the imbalance in the dataset.\n- It is interesting to see the predictions of both models on the discarded examples (those which did not have a target). 
Amazing what transfer learning can do with just 20 examples for each category.\n![image](https://user-images.githubusercontent.com/45713796/114277037-0dcfa680-9a47-11eb-829b-e07fb97b5b80.png)\n- To improve performance, hyperparameter tuning can be done and the pretrained model can be trained with more data.\n\n## References\n- [Text Classification on GLUE](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/text_classification.ipynb)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpraful932%2Fmidas","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpraful932%2Fmidas","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpraful932%2Fmidas/lists"}