{"id":24075833,"url":"https://github.com/anurima-saha/text-classification","last_synced_at":"2026-01-28T01:33:36.791Z","repository":{"id":251968853,"uuid":"838995834","full_name":"anurima-saha/Text-Classification","owner":"anurima-saha","description":"Wine reviews used to determine the type of wine training on imbalanced data using classification algorithms like SVM, Naive Bayes and Random Forest Classifier. Neural Network (CNN, RNN and LSTM)  and LLM models (DistilBERT and RoBERTa) were also used followed by error analysis using SHAP.","archived":false,"fork":false,"pushed_at":"2024-08-06T19:34:42.000Z","size":3195,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-01T17:22:46.791Z","etag":null,"topics":["distill-bert","llm","llm-finetuning","naive-bayes-classifier","random-forest-classifier","rnn-pytorch","roberta-model","shapley-value","svm-classifier","text-classification","transfer-learning"],"latest_commit_sha":null,"homepage":"","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/anurima-saha.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-06T18:53:05.000Z","updated_at":"2024-12-18T03:00:29.000Z","dependencies_parsed_at":"2024-08-06T22:39:28.149Z","dependency_job_id":"a83623b1-92b9-4850-ae22-11bb38682d16","html_url":"https://github.com/anurima-saha/Text-Classification","commit_stats":null,"previous_names":["anurima-saha/text-classification"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/anurima-saha/
Text-Classification","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anurima-saha%2FText-Classification","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anurima-saha%2FText-Classification/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anurima-saha%2FText-Classification/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anurima-saha%2FText-Classification/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/anurima-saha","download_url":"https://codeload.github.com/anurima-saha/Text-Classification/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anurima-saha%2FText-Classification/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28831679,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-27T23:29:49.665Z","status":"ssl_error","status_checked_at":"2026-01-27T23:25:58.379Z","response_time":168,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["distill-bert","llm","llm-finetuning","naive-bayes-classifier","random-forest-classifier","rnn-pytorch","roberta-model","shapley-value","svm-classifier","text-classification","transfer-learning"],"created_at":"2025-01-09T19:29:34.683Z","updated_at":"2026-01-28T01:33:36.766Z","avatar_url":"https://github.com/anurima-saha.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Text-Classification\nIn this project, wine reviews are used to predict the type of wine, training on an imbalanced dataset with classification algorithms such as SVM, Naive Bayes and Random Forest Classifier. Neural networks (CNN, RNN and LSTM) and LLMs (DistilBERT and RoBERTa) were also used, followed by error analysis using SHAP.\n\n## Overview:\nWe have been provided with a wine reviews dataset with two columns, “review_text” and “wine_variant”, and the goal is to create a wine recommendation system using text classification.\n#### Data:\n* Target variable – ‘wine_variant’\n* Categories – 8 Types - 'Pinot Noir', 'Sauvignon Blanc', 'Cabernet Sauvignon', 'Chardonnay', 'Syrah', 'Riesling', 'Merlot', 'Zinfandel'\n* Train data – 10000 observations, of which 25% (2500) were held out as a test set. Stratified sampling was used for appropriate representation of the above-mentioned classes. 
An additional \n  validation set with 5000 observations has been used.\n* Distribution – In percentage\n  \u003e![image](https://github.com/user-attachments/assets/772877d8-cd17-4014-bb28-6bb1cc005dc6)\n  \u003e![image](https://github.com/user-attachments/assets/a328d8b2-2e42-419c-ac37-dd28bdcc8df2)\n## Models and Algorithms\n#### Embedding:\n* TF-IDF vectorization\n* Latent Semantic Analysis\n* Sentence Transformer (all-mpnet-base-v2)\n* torchtext.vocab\n  \u003cbr\u003e\n#### Algorithms:\n##### Supervised ML\n* Linear and Non-linear SVM\n* SGD Classifier\n* Multinomial Naive Bayes\n* Random Forest Classifier\n##### Neural Network\n* CNN\n* LSTM\n##### LLM\n* DistilBERT\n* RoBERTa\n## Conclusion\nFrom the above results, the four best classifiers are listed below in descending order of macro-average F1 score on the validation set:\n1. RoBERTa (0.80)\n2. DistilBERT (0.79)\n3. TFIDF Vectorization + Linear SVC (with hyperparameter tuning) (0.78)\n4. CNN (0.77)\nWe can conclude three things from the above analysis:\n1. Given the size of the training set, the transfer learning algorithms (RoBERTa and DistilBERT) are likely to provide the best results, as seen in the ranking above.\n2. Given the class imbalance in the dataset, the best way to group the categories is on the basis of domain knowledge as stated above. Grouping on the basis of taste and flavour is more appropriate when building a wine recommendation system than looking at the distribution of the target variable. This has led to a significant improvement in results, raising classification accuracy from the low 70s to almost 80%.\n3. Although our model has shown a significant improvement in results over the baseline SVC model, the macro F1 score does not go above 80% even after working with multiple models. This suggests that more training data is needed to improve the classification results.\n\n## Error Analysis\nWe have used the RoBERTa model for performing error analysis using SHAP. 
We have taken a sample of 30 mis-predicted observations from the provided test set of sample size 500 for this analysis. We will look into a few samples for our report; for a more detailed analysis please refer to the code.\n\n#### Example 1: “Medium to Full-bodied Reds” classified as “Bold Reds”\nWhile words like “light” and “oak” incline the results towards “Medium to Full-bodied Reds”, the final outcome seems to be influenced by the use of “powerful”, “refrain” and “berries”.\n\u003e![image](https://github.com/user-attachments/assets/b8ba2752-4fa3-44c0-81a7-29b55016251f)\n\n#### Example 2: “Medium to Full-bodied Reds” classified as “Bold Reds”\nIn this example we see that the use of words like “TONS” and “more fruit” has pushed the classifier to predict “Bold Reds”.\n\u003e![image](https://github.com/user-attachments/assets/8138acb9-f1f9-4dc7-84cf-d7222de99ad6)\n\n#### Example 3: “Bold Reds” classified as “Medium to Full-bodied Reds”\nIn the given scenario, the word “medium” clearly influences the result.\n\u003e![image](https://github.com/user-attachments/assets/97ab1b25-db08-4a8e-893a-216e29253f2e)\n\n#### Example 4: “Light-bodied, Crisp Whites” classified as “Full-bodied Whites”\nThe use of the word “champagne”, which is a “Full-bodied White”, has steered the prediction towards that class.\nFrom the above analysis we see errors that are primarily domain knowledge related. However, the reviews also contain text that is redundant and does not contribute to the classification with respect to the taste or quality of the wine, as seen below. 
Hence, a recommendation from this would be to carefully curate the samples that are used to train the wine-recommendation model in order to obtain more accurate results.\n\u003e![image](https://github.com/user-attachments/assets/b183f4f0-de95-4447-9d9d-a8186e30a5f9)\n\nFor more details please refer to [Project Report](https://github.com/anurima-saha/Text-Classification/blob/main/Project%20Report.pdf)\n\n\n  \n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fanurima-saha%2Ftext-classification","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fanurima-saha%2Ftext-classification","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fanurima-saha%2Ftext-classification/lists"}