{"id":20612870,"url":"https://github.com/arnoldgaius/text_classifier","last_synced_at":"2026-05-19T02:04:18.478Z","repository":{"id":57474627,"uuid":"93394244","full_name":"ArnoldGaius/Text_Classifier","owner":"ArnoldGaius","description":"基于sklearn的文本分类器 Text classifier based on sklearn","archived":false,"fork":false,"pushed_at":"2017-06-08T05:00:34.000Z","size":78,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-02-19T01:35:08.396Z","etag":null,"topics":["pypi","scikit-learn","text-classifier"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ArnoldGaius.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-06-05T10:50:02.000Z","updated_at":"2022-10-21T03:17:53.000Z","dependencies_parsed_at":"2022-09-10T02:21:33.729Z","dependency_job_id":null,"html_url":"https://github.com/ArnoldGaius/Text_Classifier","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArnoldGaius%2FText_Classifier","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArnoldGaius%2FText_Classifier/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArnoldGaius%2FText_Classifier/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArnoldGaius%2FText_Classifier/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ArnoldGaius","download_url":"https://codeload.github.com/ArnoldGaius/Text_Classifier/ta
r.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":242261155,"owners_count":20098706,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["pypi","scikit-learn","text-classifier"],"created_at":"2024-11-16T11:08:11.752Z","updated_at":"2026-05-19T02:04:18.439Z","avatar_url":"https://github.com/ArnoldGaius.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![PyPI version](https://img.shields.io/badge/python-2.7-blue.svg)](https://badge.fury.io/py/TextClassifier)\n[![PyPI version](https://badge.fury.io/py/TextClassifier.svg)](https://badge.fury.io/py/TextClassifier)\n\n文本分类器 Text classifier\n=======================================================\nText classifier based on NumPy, scikit-learn, pandas, and Matplotlib\n\nTrain Data Format\n----------------------\n| **type** | **Text** |\n|:-----------:|:---------------------------------------------------:|\n| game | The LoL champions pro players would ban forever |\n| society | In Beijing you should keep the rules |\n| etc. | etc. |\n\nSample Usage\n----------------------\n```python\n\u003e\u003e\u003e import TextClassifier\n\n # create the classifier container\n\u003e\u003e\u003e tc = TextClassifier.classifier_container()\n\n # load data\n # '../data/Train_data.txt' is the data path\n # sep defaults to ','; you can change it to '\\t', etc.\n\u003e\u003e\u003e tc.load_Data('../data/Train_data.txt',sep=',')\n\n # train the model\n\u003e\u003e\u003e tc.train()\n\n # prediction. 
Input list or text-String\n\u003e\u003e\u003e print tc.predict('Faker is the first League of Legends player to earn over $1 million in prize money')\n [u'game']\n\u003e\u003e\u003e print tc.predict(['Faker is the first League of Legends player to earn over $1 million in prize money',\n '18-year-old youth killed 88-year-old veteran',\n 'Take you into the real North Korea'])\n [u'game',u'society',u'world']\n\n #get X_train, X_test, y_train, y_test\n\u003e\u003e\u003e from sklearn import cross_validation\n\u003e\u003e\u003e X_train, X_test, y_train, y_test = cross_validation.train_test_split(original_data['Text'], original_data['Categorization'], test_size=0.3, random_state=0)\n\n #get TrainData Accuracy\n\u003e\u003e\u003e tc.Accuracy(X_train, y_train)\n Accuracy:\n 0.917504310503\n```\n\n```python\n #get Confusion Matrix\n\u003e\u003e\u003e Y_predict = tc.predict(X_test)\n\u003e\u003e\u003e tc.confusion_matrix(y_test, Y_predict)\n Confusion Matrix :\n military baby car game food sports finance discovery regimen travel fashion history society story tech world entertainment essay\nmilitary 2831 5 3 16 9 4 8 10 0 15 8 24 9 3 6 42 6 1\nbaby 0 2932 3 3 26 0 1 0 10 7 10 3 16 4 3 7 20 4\ncar 6 10 2813 3 6 8 13 3 1 13 10 3 39 1 11 5 24 4\ngame 10 11 6 2843 5 9 2 4 1 11 13 3 8 4 25 3 31 3\nfood 0 38 0 3 2799 1 5 1 67 34 16 7 9 3 4 8 14 10\nsports 2 7 6 13 6 2803 9 0 1 13 24 5 10 1 5 19 42 4\nfinance 12 10 13 4 15 6 2692 1 2 21 5 3 18 2 79 47 12 8\ndiscovery 8 2 0 3 3 2 5 1155 1 5 1 1 1 0 13 9 0 1\nregimen 0 59 0 0 63 0 2 0 1093 0 3 3 4 2 0 1 5 0\ntravel 9 19 8 8 23 4 9 8 0 2741 19 20 19 7 13 55 14 12\nfashion 2 21 5 9 14 9 1 5 13 18 2772 5 7 1 6 11 77 7\nhistory 49 9 2 3 6 3 3 6 4 28 3 2813 12 20 2 35 21 6\nsociety 27 77 50 7 43 7 42 5 16 78 27 13 2414 29 36 36 58 15\nstory 3 17 1 3 7 2 2 2 2 7 5 12 19 1120 4 6 14 11\ntech 16 8 19 21 6 3 52 13 3 6 5 4 14 0 2787 9 17 7\nworld 52 33 12 8 9 16 33 24 2 35 27 37 50 8 20 2583 30 4\nentertainment 5 14 3 28 6 13 4 3 1 9 120 29 17 3 
12 10 2708 8\nessay 7 23 5 3 12 1 8 6 4 15 22 11 7 2 5 2 11 1010\n```\n\n```python\n #get sub_result and Figure\n\u003e\u003e\u003e tc.plot_display(y_test, Y_predict)\n Plot display...\n Test count: Predict count: Sub Result: Sub_Abs Result:\nbaby 3049 3295 246 246\ncar 2973 2949 -24 24\ndiscovery 1210 1246 36 36\nentertainment 2993 3104 111 111\nessay 1154 1115 -39 39\nfashion 2983 3090 107 107\nfinance 2950 2891 -59 59\nfood 3019 3058 39 39\ngame 2992 2978 -14 14\nhistory 3025 2996 -29 29\nmilitary 3000 3039 39 39\nregimen 1235 1221 -14 14\nsociety 2980 2673 -307 307\nsports 2970 2891 -79 79\nstory 1237 1210 -27 27\ntech 2990 3031 41 41\ntravel 2988 3056 68 68\nworld 2983 2888 -95 95\n```\n\n![image](https://github.com/ArnoldGaius/Text_Classifier/blob/master/image/Figure.png)\n\nPerformance\n----------------------\n- Train set: 156k news headlines with 18 labels\n- Test set: 36k news headlines with 18 labels\n- Compared with SVM, naive Bayes, and SGD (loss = 'perceptron') from [scikit-learn](https://github.com/scikit-learn/scikit-learn)\n\n| Classifier | Accuracy | Time cost (s) |\n|:------------------------:|:---------:|:--------------:|\n| scikit-learn(svm) | 71.6% | 241 |\n| scikit-learn(nb) | 72.7% | 12 |\n| scikit-learn(SGD) | 72.4% | 197 |\n| **TextClassifier** | **76.8%** | **8** |\n\nInstallation\n----------------------\n $ pip install TextClassifier\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farnoldgaius%2Ftext_classifier","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Farnoldgaius%2Ftext_classifier","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farnoldgaius%2Ftext_classifier/lists"}