{"id":20612870,"url":"https://github.com/arnoldgaius/text_classifier","last_synced_at":"2026-05-19T02:04:18.478Z","repository":{"id":57474627,"uuid":"93394244","full_name":"ArnoldGaius/Text_Classifier","owner":"ArnoldGaius","description":"基于sklearn的文本分类器 Text classifier based on sklearn","archived":false,"fork":false,"pushed_at":"2017-06-08T05:00:34.000Z","size":78,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-02-19T01:35:08.396Z","etag":null,"topics":["pypi","scikit-learn","text-classifier"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ArnoldGaius.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-06-05T10:50:02.000Z","updated_at":"2022-10-21T03:17:53.000Z","dependencies_parsed_at":"2022-09-10T02:21:33.729Z","dependency_job_id":null,"html_url":"https://github.com/ArnoldGaius/Text_Classifier","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArnoldGaius%2FText_Classifier","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArnoldGaius%2FText_Classifier/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArnoldGaius%2FText_Classifier/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArnoldGaius%2FText_Classifier/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ArnoldGaius","download_url":"https://codeload.github.com/ArnoldGaius/Text_Classifier/ta
r.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":242261155,"owners_count":20098706,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["pypi","scikit-learn","text-classifier"],"created_at":"2024-11-16T11:08:11.752Z","updated_at":"2026-05-19T02:04:18.439Z","avatar_url":"https://github.com/ArnoldGaius.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![Python version](https://img.shields.io/badge/python-2.7-blue.svg)](https://badge.fury.io/py/TextClassifier)\n[![PyPI version](https://badge.fury.io/py/TextClassifier.svg)](https://badge.fury.io/py/TextClassifier)\n\n文本分类器 Text classifier\n=======================================================\nText classifier based on NumPy, scikit-learn, pandas, and Matplotlib\n\nTrain Data Format\n----------------------\n|   **type**  |                     **Text**                        |\n|:-----------:|:---------------------------------------------------:|\n|     game    |   The LoL champions pro players would ban forever   |\n|     society |   In Beijing you should keep the rules              |\n|     etc.    |   etc.                                              |\n\nSample Usage\n----------------------\n```python\n\u003e\u003e\u003e import TextClassifier\n\n    # create classifier container\n\u003e\u003e\u003e tc = TextClassifier.classifier_container()\n\n    # load data\n    # '../data/Train_data.txt' is the data path\n    # sep defaults to ','; you can change it to '\\t', etc.  
\n\u003e\u003e\u003e tc.load_Data('../data/Train_data.txt',sep=',')\n\n    # train the model\n\u003e\u003e\u003e tc.train()\n\n    # prediction. Input list or text-String\n\u003e\u003e\u003e print tc.predict('Faker is the first League of Legends player to earn over $1 million in prize money')\n    [u'game']\n\u003e\u003e\u003e print tc.predict(['Faker is the first League of Legends player to earn over $1 million in prize money',\n                    '18-year-old youth killed 88-year-old veteran',\n                    'Take you into the real North Korea'])\n    [u'game',u'society',u'world']\n\n    #get X_train, X_test, y_train, y_test\n\u003e\u003e\u003e from sklearn import cross_validation\n\u003e\u003e\u003e X_train, X_test, y_train, y_test = cross_validation.train_test_split(original_data['Text'], original_data['Categorization'], test_size=0.3, random_state=0)\n\n    #get TrainData Accuracy\n\u003e\u003e\u003e tc.Accuracy(X_train, y_train)\n    Accuracy:\n    0.917504310503\n```\n\n```python\n    #get Confusion Matrix\n\u003e\u003e\u003e Y_predict = tc.predict(X_test)\n\u003e\u003e\u003e tc.confusion_matrix(y_test, Y_predict)\n    Confusion Matrix :\n               military  baby   car  game  food  sports  finance  discovery  regimen  travel  fashion  history  society  story  tech  world  entertainment  essay\nmilitary           2831     5     3    16     9       4        8         10        0      15        8       24        9      3     6     42              6      1\nbaby                  0  2932     3     3    26       0        1          0       10       7       10        3       16      4     3      7             20      4\ncar                   6    10  2813     3     6       8       13          3        1      13       10        3       39      1    11      5             24      4\ngame                 10    11     6  2843     5       9        2          4        1      11       13        3        8      4    25      3             31      3\nfood          
        0    38     0     3  2799       1        5          1       67      34       16        7        9      3     4      8             14     10\nsports                2     7     6    13     6    2803        9          0        1      13       24        5       10      1     5     19             42      4\nfinance              12    10    13     4    15       6     2692          1        2      21        5        3       18      2    79     47             12      8\ndiscovery             8     2     0     3     3       2        5       1155        1       5        1        1        1      0    13      9              0      1\nregimen               0    59     0     0    63       0        2          0     1093       0        3        3        4      2     0      1              5      0\ntravel                9    19     8     8    23       4        9          8        0    2741       19       20       19      7    13     55             14     12\nfashion               2    21     5     9    14       9        1          5       13      18     2772        5        7      1     6     11             77      7\nhistory              49     9     2     3     6       3        3          6        4      28        3     2813       12     20     2     35             21      6\nsociety              27    77    50     7    43       7       42          5       16      78       27       13     2414     29    36     36             58     15\nstory                 3    17     1     3     7       2        2          2        2       7        5       12       19   1120     4      6             14     11\ntech                 16     8    19    21     6       3       52         13        3       6        5        4       14      0  2787      9             17      7\nworld                52    33    12     8     9      16       33         24        2      35       27       37       50      8    20   2583             30      4\nentertainment         5    14     3    28     6      13   
     4          3        1       9      120       29       17      3    12     10           2708      8\nessay                 7    23     5     3    12       1        8          6        4      15       22       11        7      2     5      2             11   1010\n```\n\n```python\n    #get sub_result and Figure\n\u003e\u003e\u003e tc.plot_display(y_test, Y_predict)\n    Plot display...\n               Test count:  Predict count:  Sub Result:  Sub_Abs Result:\nbaby                  3049            3295          246              246\ncar                   2973            2949          -24               24\ndiscovery             1210            1246           36               36\nentertainment         2993            3104          111              111\nessay                 1154            1115          -39               39\nfashion               2983            3090          107              107\nfinance               2950            2891          -59               59\nfood                  3019            3058           39               39\ngame                  2992            2978          -14               14\nhistory               3025            2996          -29               29\nmilitary              3000            3039           39               39\nregimen               1235            1221          -14               14\nsociety               2980            2673         -307              307\nsports                2970            2891          -79               79\nstory                 1237            1210          -27               27\ntech                  2990            3031           41               41\ntravel                2988            3056           68               68\nworld                 2983            2888          -95               95\n```\n\n![image](https://github.com/ArnoldGaius/Text_Classifier/blob/master/image/Figure.png)\n\nPerformance\n----------------------\n- Train set: 156k news headline with 18 labels\n- Test set: 36k 
news headline with 18 labels\n- Compared with SVM, naive Bayes, and SGD (loss = 'perceptron') from [Scikit-learn](https://github.com/scikit-learn/scikit-learn)\n\n|         Classifier       | Accuracy  |  Time cost (s) |\n|:------------------------:|:---------:|:--------------:|\n|     scikit-learn(svm)    |   71.6%   |     241        |\n|     scikit-learn(nb)     |   72.7%   |     12         |\n|     scikit-learn(SGD)    |   72.4%   |     197        |\n|     **TextClassifier**   | **76.8%** |     **8**      |\n\nInstallation\n----------------------\n    $ pip install TextClassifier\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farnoldgaius%2Ftext_classifier","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Farnoldgaius%2Ftext_classifier","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farnoldgaius%2Ftext_classifier/lists"}