{"id":13617461,"url":"https://github.com/yassersouri/classify-text","last_synced_at":"2026-01-10T08:28:48.872Z","repository":{"id":3613107,"uuid":"4678315","full_name":"yassersouri/classify-text","owner":"yassersouri","description":"\"20 Newsgroups\" text classification with python","archived":true,"fork":false,"pushed_at":"2016-11-30T10:24:03.000Z","size":6369,"stargazers_count":151,"open_issues_count":0,"forks_count":66,"subscribers_count":15,"default_branch":"master","last_synced_at":"2024-08-01T20:47:32.574Z","etag":null,"topics":["machine-learning","text-classification"],"latest_commit_sha":null,"homepage":"http://yassersouri.github.com/classify-text/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/yassersouri.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2012-06-15T17:58:27.000Z","updated_at":"2024-07-01T06:03:53.000Z","dependencies_parsed_at":"2022-08-20T11:50:52.118Z","dependency_job_id":null,"html_url":"https://github.com/yassersouri/classify-text","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yassersouri%2Fclassify-text","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yassersouri%2Fclassify-text/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yassersouri%2Fclassify-text/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yassersouri%2Fclassify-text/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/yassersouri","download_url":"https://codeload.github.com/y
assersouri/classify-text/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223621797,"owners_count":17174756,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["machine-learning","text-classification"],"created_at":"2024-08-01T20:01:42.091Z","updated_at":"2026-01-10T08:28:48.821Z","avatar_url":"https://github.com/yassersouri.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"\u003ccenter\u003eSalam\u003c/center\u003e\n\n## Text Classification with Python\n\nThis is an experiment. We want to classify text with Python.\n\n### Dataset\n\nFor the dataset I used the famous \"Twenty Newsgroups\" dataset. It is freely available [here](http://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups).\n\nI've included a subset of the dataset in the repo, located in the `dataset/` directory. This subset includes 6 of the 20 newsgroups: `space`, `electronics`, `crypt`, `hockey`, `motorcycles` and `forsale`.\n\nWhen you run `main.py` it asks you for the root of the dataset. You can supply your own dataset, provided it has a similar directory structure.\n\n#### UTF-8 incompatibility\n\nSome of the supplied text files are not valid UTF-8!\n\nEven TextEdit.app can't open those files, and they caused problems in the code. 
So I delete them as part of the preprocessing.\n\n### Requirements\n\n* python 2.7\n\n* python modules:\n\n  * scikit-learn (v 0.11)\n  * scipy (v 0.10.1)\n  * colorama\n  * termcolor\n  * matplotlib (for use in `plot.py`)\n\n### The code\n\nThe code is pretty straightforward and well documented.\n\n#### Running the code\n\n\tpython main.py\n\n### Experiments\n\nFor the experiments I used the subset of the dataset described above. I assume that we like the `hockey`, `crypt` and `electronics` newsgroups and dislike the others.\n\nFor each experiment we use a \"feature vector\", a \"classifier\" and a train-test splitting strategy.\n\n#### Experiment 1: BOW - NB - 20% test\n\nIn this experiment we use a Bag Of Words (**BOW**) representation of each document. We pair it with a Naive Bayes (**NB**) classifier.\n\nWe split the data so that **20%** of it is held out for testing.\n\n__Results__:\n\n```\n             precision    recall  f1-score   support\n\n   dislikes       0.95      0.99      0.97       575\n      likes       0.99      0.95      0.97       621\n\navg / total       0.97      0.97      0.97      1196\n```\n\n#### Experiment 2: TF - NB - 20% test\n\nIn this experiment we use a Term Frequency (**TF**) representation of each document. We pair it with a Naive Bayes (**NB**) classifier.\n\nWe split the data so that **20%** of it is held out for testing.\n\n__Results__:\n\n```\n             precision    recall  f1-score   support\n\n   dislikes       0.97      0.92      0.94       633\n      likes       0.91      0.97      0.94       563\n\navg / total       0.94      0.94      0.94      1196\n```\n\n#### Experiment 3: TFIDF - NB - 20% test\n\nIn this experiment we use a **TFIDF** representation of each document. 
We pair it with a Naive Bayes (**NB**) classifier.\n\nWe split the data so that **20%** of it is held out for testing.\n\n__Results__:\n\n```\n             precision    recall  f1-score   support\n\n   dislikes       0.96      0.95      0.95       584\n      likes       0.95      0.96      0.96       612\n\navg / total       0.95      0.95      0.95      1196\n```\n\n#### Experiment 4: TFIDF - SVM - 20% test\n\nIn this experiment we use a **TFIDF** representation of each document. We pair it with a linear Support Vector Machine (**SVM**) classifier.\n\nWe split the data so that **20%** of it is held out for testing.\n\n__Results__:\n\n```\n             precision    recall  f1-score   support\n\n   dislikes       0.96      0.97      0.97       587\n      likes       0.97      0.96      0.97       609\n\navg / total       0.97      0.97      0.97      1196\n```\n\n#### Experiment 5: TFIDF - SVM - KFOLD\n\nIn this experiment we use a **TFIDF** representation of each document. We pair it with a linear Support Vector Machine (**SVM**) classifier.\n\nWe split the data using the Stratified **K-Fold** algorithm with **k = 5**.\n\n__Results__:\n\n```\nMean accuracy: 0.977 (+/- 0.002 std)\n```\n\n#### Experiment 5: BOW - NB - KFOLD\n\nIn this experiment we use a Bag Of Words (**BOW**) representation of each document. We pair it with a Naive Bayes (**NB**) classifier.\n\nWe split the data using the Stratified **K-Fold** algorithm with **k = 5**.\n\n__Results__:\n\n```\nMean accuracy: 0.968 (+/- 0.002 std)\n```\n\n#### Experiment 6: TFIDF - SVM - 90% test\n\nIn this experiment we use a **TFIDF** representation of each document. We pair it with a linear Support Vector Machine (**SVM**) classifier.\n\nWe split the data so that **90%** of it is held out for testing! 
Only 10% of the dataset is used for training!\n\n__Results__:\n\n```\n             precision    recall  f1-score   support\n\n   dislikes       0.90      0.95      0.93      2689\n      likes       0.95      0.90      0.92      2693\n\navg / total       0.92      0.92      0.92      5382\n```\n\n#### Experiment 7: TFIDF - SVM - KFOLD - 20 classes\n\nIn this experiment we use a **TFIDF** representation of each document. We pair it with a linear Support Vector Machine (**SVM**) classifier.\n\nWe split the data using the Stratified **K-Fold** algorithm with **k = 5**.\n\nWe also use the whole \"Twenty Newsgroups\" dataset, which has **20** classes.\n\n__Results__:\n\n```\nMean accuracy: 0.892 (+/- 0.001 std)\n```\n\n#### Experiment 7: BOW - NB - KFOLD - 20 classes\n\nIn this experiment we use a Bag Of Words (**BOW**) representation of each document. We pair it with a Naive Bayes (**NB**) classifier.\n\nWe split the data using the Stratified **K-Fold** algorithm with **k = 5**.\n\nWe also use the whole \"Twenty Newsgroups\" dataset, which has **20** classes.\n\n__Results__:\n\n```\nMean accuracy: 0.839 (+/- 0.003 std)\n```\n\n#### Experiment 8: TFIDF - 5-NN - Distance Weights - 20% test\n\nIn this experiment we use a **TFIDF** representation of each document. We pair it with a K Nearest Neighbors (**KNN**) classifier with **k = 5** and **distance weights**.\n\nWe split the data so that **20%** of it is held out for testing.\n\n__Results__:\n\n```\n             precision    recall  f1-score   support\n\n   dislikes       0.93      0.88      0.90       608\n      likes       0.88      0.93      0.90       588\n\navg / total       0.90      0.90      0.90      1196\n```\n\n#### Experiment 9: TFIDF - 5-NN - Uniform Weights - 20% test\n\nIn this experiment we use a **TFIDF** representation of each document. 
We pair it with a K Nearest Neighbors (**KNN**) classifier with **k = 5** and **uniform weights**.\n\nWe split the data so that **20%** of it is held out for testing.\n\n__Results__:\n\n```\n             precision    recall  f1-score   support\n\n   dislikes       0.95      0.90      0.92       581\n      likes       0.91      0.95      0.93       615\n\navg / total       0.93      0.93      0.93      1196\n```\n\n#### Experiment 10: TFIDF - 5-NN - Distance Weights - KFOLD\n\nIn this experiment we use a **TFIDF** representation of each document. We pair it with a K Nearest Neighbors (**KNN**) classifier with **k = 5** and **distance weights**.\n\nWe split the data using the Stratified **K-Fold** algorithm with **k = 5**.\n\n__Results__:\n\n```\nMean accuracy: 0.908 (+/- 0.003 std)\n```\n\n#### Experiment 11: TFIDF - 5-NN - Distance Weights - KFOLD - 20 classes\n\nIn this experiment we use a **TFIDF** representation of each document. We pair it with a K Nearest Neighbors (**KNN**) classifier with **k = 5** and **distance weights**.\n\nWe split the data using the Stratified **K-Fold** algorithm with **k = 5**.\n\nWe also use the whole \"Twenty Newsgroups\" dataset, which has **20** classes.\n\n__Results__:\n\n```\nMean accuracy: 0.745 (+/- 0.002 std)\n```\n\n### So What?\n\nThese experiments show that text classification can be done effectively with simple tools like TFIDF and SVM.\n\n#### Any Conclusion?\n\nWe have found that TFIDF with SVM has the best performance.\n\nTFIDF with SVM performs well on both the 2-class problem and the 20-class problem.\n\nIf you want a suggestion from me, use **TFIDF with SVM**.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyassersouri%2Fclassify-text","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fyassersouri%2Fclassify-text","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyassersouri%2Fclassify-text/lists"}