{"id":19772015,"url":"https://github.com/danieldacosta/text-classification-exploration","last_synced_at":"2025-02-28T04:44:37.729Z","repository":{"id":207877688,"uuid":"720323959","full_name":"DanielDaCosta/Text-Classification-Exploration","owner":"DanielDaCosta","description":"Explore text classification with Logistic Regression and Naive Bayes models. Implementing from scratch, we compare feature engineering techniques like Bag-of-Words, TF-IDF, and Word Embedding for accurate labeling","archived":false,"fork":false,"pushed_at":"2024-07-20T18:14:55.000Z","size":37320,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-01-11T01:10:38.261Z","etag":null,"topics":["logistic-regression","naive-bayes","nlp","tf-idf","wordembedding"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DanielDaCosta.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-11-18T05:53:13.000Z","updated_at":"2024-07-20T18:15:00.000Z","dependencies_parsed_at":"2025-01-11T01:19:59.647Z","dependency_job_id":null,"html_url":"https://github.com/DanielDaCosta/Text-Classification-Exploration","commit_stats":{"total_commits":6,"total_committers":1,"mean_commits":6.0,"dds":0.0,"last_synced_commit":"4eaa76b0edf7de3793200073726a7d7eaa01b522"},"previous_names":["danieldacosta/text-classification-exploration"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DanielD
aCosta%2FText-Classification-Exploration","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DanielDaCosta%2FText-Classification-Exploration/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DanielDaCosta%2FText-Classification-Exploration/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DanielDaCosta%2FText-Classification-Exploration/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/DanielDaCosta","download_url":"https://codeload.github.com/DanielDaCosta/Text-Classification-Exploration/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241101664,"owners_count":19909943,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["logistic-regression","naive-bayes","nlp","tf-idf","wordembedding"],"created_at":"2024-11-12T05:05:06.179Z","updated_at":"2025-02-28T04:44:37.712Z","avatar_url":"https://github.com/DanielDaCosta.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Exploring the Efficacy of Simple Models and Feature Engineering Techniques in Text Classification: A Comparative Analysis\n\nPaper: [Exploring the Efficacy of Simple Models and Feature Engineering Techniques in Text Classification.pdf](https://github.com/DanielDaCosta/Text-Classification-Exploration/blob/main/Exploring%20the%20Efficacy%20of%20Simple%20Models%20and%20Feature%20Engineering%20Techniques%20in%20Text%20Classification.pdf)\n\n# Abstract\nText classification is one of the core tasks of Natural 
Language Processing. Despite the complexity of the task, reasonable results can be obtained with simple models such as Logistic Regression and Naive Bayes. Beyond the choice of model, the feature engineering method also plays an important role in predicting accurate labels. In this paper, we implement these models from scratch using their default configurations and perform a comparative analysis of feature engineering techniques such as Bag-of-Words, TF-IDF (Term Frequency - Inverse Document Frequency), and Word Embedding.\n\n# Introduction\nThe primary goal of this paper is to provide a comprehensive exploration of the methodologies and strategies involved in constructing Logistic Regression and Naive Bayes models from scratch. When building these models, it is important to focus not only on implementing them correctly but also on selecting the feature engineering techniques that yield the best results across all four datasets. To achieve this goal, we not only develop the models but also analyze the feature engineering methods.\n\n# Usage\n## Train\n- 4dim\n```\npython train.py -m naivebayes -i datasets/4dim.train.txt -o nb.4dim.model\n\npython train.py -m logreg -i datasets/4dim.train.txt -o logreg.4dim.model\n\npython train.py -m logreg_word2vec -i datasets/4dim.train.txt -o logreg_word2vec.4dim.model -e word2vec_embedding.wordvectors\n\npython train.py -m naivebayes_tfidf -i datasets/4dim.train.txt -o naivebayes_tfidf.4dim.model\n```\n- odiya\n```\npython train.py -m naivebayes -i datasets/odiya.train.txt -o nb.odiya.model\n\npython train.py -m logreg -i datasets/odiya.train.txt -o logreg.odiya.model\n\npython train.py -m logreg_word2vec -i datasets/odiya.train.txt -o logreg_word2vec.odiya.model -e word2vec_embedding.wordvectors\n\npython train.py -m naivebayes_tfidf -i datasets/odiya.train.txt -o naivebayes_tfidf.odiya.model\n```\n- products\n```\npython train.py -m naivebayes -i 
datasets/products.train.txt -o nb.products.model\n\npython train.py -m logreg -i datasets/products.train.txt -o logreg.products.model\n\npython train.py -m logreg_word2vec -i datasets/products.train.txt -o logreg_word2vec.products.model -e word2vec_embedding.wordvectors\n\npython train.py -m naivebayes_tfidf -i datasets/products.train.txt -o naivebayes_tfidf.products.model\n```\n- questions\n```\npython train.py -m naivebayes -i datasets/questions.train.txt -o nb.questions.model\n\npython train.py -m logreg -i datasets/questions.train.txt -o logreg.questions.model\n\npython train.py -m logreg_word2vec -i datasets/questions.train.txt -o logreg_word2vec.questions.model -e word2vec_embedding.wordvectors\n\npython train.py -m naivebayes_tfidf -i datasets/questions.train.txt -o naivebayes_tfidf.questions.model\n```\n## Classify\n- 4dim\n```\npython classify.py -m nb.4dim.model -i datasets/4dim.val.test.txt -o datasets/4dim.val.pred.txt\n\npython classify.py -m logreg.4dim.model -i datasets/4dim.val.test.txt -o datasets/4dim.val.pred.txt\n\npython classify.py -m logreg_word2vec.4dim.model -i datasets/4dim.val.test.txt -o datasets/4dim.val.pred.txt\n\npython classify.py -m naivebayes_tfidf.4dim.model -i datasets/4dim.val.test.txt -o datasets/4dim.val.pred.txt\n```\n- odiya\n```\npython classify.py -m nb.odiya.model -i datasets/odiya.val.test.txt -o datasets/odiya.val.pred.txt\n\npython classify.py -m logreg.odiya.model -i datasets/odiya.val.test.txt -o datasets/odiya.val.pred.txt\n\npython classify.py -m logreg_word2vec.odiya.model -i datasets/odiya.val.test.txt -o datasets/odiya.val.pred.txt\n\npython classify.py -m naivebayes_tfidf.odiya.model -i datasets/odiya.val.test.txt -o datasets/odiya.val.pred.txt\n```\n- questions\n```\npython classify.py -m nb.questions.model -i datasets/questions.val.test.txt -o datasets/questions.val.pred.txt\n\npython classify.py -m logreg.questions.model -i datasets/questions.val.test.txt -o datasets/questions.val.pred.txt\n\npython classify.py -m 
logreg_word2vec.questions.model -i datasets/questions.val.test.txt -o datasets/questions.val.pred.txt\n\npython classify.py -m naivebayes_tfidf.questions.model -i datasets/questions.val.test.txt -o datasets/questions.val.pred.txt\n```\n\n- products\n```\npython classify.py -m nb.products.model -i datasets/products/val.test.txt -o datasets/products.val.pred.txt\n\npython classify.py -m logreg.products.model -i datasets/products/val.test.txt -o datasets/products/val.pred.txt\n\npython classify.py -m logreg_word2vec.products.model -i datasets/products/val.test.txt -o datasets/products/val.pred.txt\n\npython classify.py -m naivebayes_tfidf.products.model -i datasets/products/val.test.txt -o datasets/products/val.pred.txt\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdanieldacosta%2Ftext-classification-exploration","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdanieldacosta%2Ftext-classification-exploration","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdanieldacosta%2Ftext-classification-exploration/lists"}