{"id":16271054,"url":"https://github.com/seanlee97/clfzoo","last_synced_at":"2025-10-04T06:30:36.734Z","repository":{"id":92904632,"uuid":"157158680","full_name":"SeanLee97/clfzoo","owner":"SeanLee97","description":"A deep text classifiers library.","archived":false,"fork":false,"pushed_at":"2018-11-14T01:43:27.000Z","size":160,"stargazers_count":36,"open_issues_count":0,"forks_count":8,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-01-10T22:59:04.037Z","etag":null,"topics":["nlp","tensorflow","text-classification"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SeanLee97.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-11-12T04:58:58.000Z","updated_at":"2023-01-01T08:15:14.000Z","dependencies_parsed_at":"2023-04-13T04:55:40.598Z","dependency_job_id":null,"html_url":"https://github.com/SeanLee97/clfzoo","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SeanLee97%2Fclfzoo","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SeanLee97%2Fclfzoo/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SeanLee97%2Fclfzoo/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SeanLee97%2Fclfzoo/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SeanLee97","download_url":"https://codeload.github.com/SeanLee97/clfzoo/tar.gz/refs/h
eads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":235222568,"owners_count":18955330,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["nlp","tensorflow","text-classification"],"created_at":"2024-10-10T18:12:17.962Z","updated_at":"2025-10-04T06:30:31.292Z","avatar_url":"https://github.com/SeanLee97.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e/ clfzoo /\u003c/p\u003e\n\nEng / [CN](https://github.com/SeanLee97/clfzoo/blob/master/docs/ZH_README.md)\n\nclfzoo is a toolkit for text classification. We have implemented some baseline models, such as TextCNN, TextRNN, RCNN, Transformer, HAN, and DPCNN. We have also designed a unified and friendly API to train / predict / test the models. We look forward to your code contributions and suggestions.\n\n## Requirements\n```\npython3+\nnumpy\nsklearn\ntensorflow\u003e=1.6.0\n```\n\n## Installation\n```\ngit clone https://github.com/SeanLee97/clfzoo.git\ncd clfzoo\n```\n\n## Overview\n```\nproject\n│    README.md\n│\n└─── docs\n│\n└─── clfzoo    # models\n│   │  base.py       # base model template\n│   │  config.py     # default configuration\n│   │  dataloader.py\n│   │  instance.py   # data instance\n│   │  vocab.py      # vocabulary\n│   │  libs          # layers and functions\n│   │  dpcnn         # DPCNN model implementation\n│   │   │  __init__.py  # model apis\n│   │   │  model.py     # model\n│   │  ...           # implement other models\n└───examples\n    │   ...\n```\n\n### Data Preparation\nEach line is a document. 
The line format is \"label \\t sentence\". The default tokenizer splits on whitespace, so the words in each sentence must be separated by spaces.\n\nEnglish sample:\n\n```\ngreeting    how are you.\n```\n\nChinese sample:\n```\n打招呼  你 最近 过得 怎样 啊 ？\n```\n\n### Usage\n\n#### train\n```python\n# import model api\nimport clfzoo.textcnn as clf  \n\n# import model config\nfrom clfzoo.config import ConfigTextCNN\n\n\"\"\"define model config\n\nYou can assign values to the hyperparameters defined in the base model config (here, ConfigTextCNN).\n\"\"\"\n\nclass Config(ConfigTextCNN):\n    def __init__(self):\n        # calling the parent __init__ is required\n        super(Config, self).__init__()\n\n    # the dataset paths are required\n    train_file = '/path/to/train'\n    dev_file = '/path/to/test'\n    \n    # ... other hyperparameters\n\n# `training` is a flag that indicates training mode.\nclf.model(Config(), training=True)\n\n# start training\nclf.train()\n```\n\nThe training log is written to `log.txt`; the model weights and checkpoint summaries are written to the `models` folder.\n\n#### predict\n\nPredict the labels and probability scores.\n\n```python\nimport clfzoo.textcnn as clf\nfrom clfzoo.config import ConfigTextCNN\n\nclass Config(ConfigTextCNN):\n    def __init__(self):\n        super(Config, self).__init__()\n    \n    # the same hyperparameters as in training\n\n# inject the config into the model\nclf.model(Config())\n\n\"\"\"\nInput: a list\n    each item is a sentence string with words separated by spaces (for Chinese sentences, segment the input into words first)\n\"\"\"\ndatas = ['how are u ?', 'what is the weather ?', ...]\n\n\"\"\"\nReturn: a list\n    [('label 1', 'score 1'), ('label 2', 'score 2'), ...]\n\"\"\"\npreds = clf.predict(datas)\n```\n\n#### test\n\nPredict the labels and probability scores, and compute evaluation metrics. 
To calculate metrics, you must provide the ground-truth labels.\n\n```python\nimport clfzoo.textcnn as clf\nfrom clfzoo.config import ConfigTextCNN\n\nclass Config(ConfigTextCNN):\n    def __init__(self):\n        super(Config, self).__init__()\n    \n    # the same hyperparameters as in training\n\n# inject the config into the model\nclf.model(Config())\n\n\"\"\"\nInput: a list\n    each item is a sentence string with words separated by spaces (for Chinese sentences, segment the input into words first)\n\"\"\"\ndatas = ['how are u ?', 'what is the weather ?', ...]\nlabels = ['greeting', 'weather', ...]\n\n\"\"\"\nReturn: a tuple\n    - predicts: a list\n        [('label 1', 'score 1'), ('label 2', 'score 2'), ...]\n    - metrics: a dict\n        {'recall': '', 'precision': '', 'f1': '', 'accuracy': ''}\n\"\"\"\npreds, metrics = clf.test(datas, labels)\n```\n\n\n## Benchmark Results\nHere we use the [smp2017-ECDT](https://arxiv.org/abs/1709.10217) dataset as an example; it is a multi-label (31 labels), short-text, Chinese dataset.\n\nWe train all models for 20 epochs and calculate metrics with sklearn's metrics functions. [fasttext](https://github.com/facebookresearch/fastText) is a well-known strong baseline for text classification, so we also report its result.\n\n| Models | Precision | Recall | F1 |\n| ------------ | ------------ | ------------ | ------------ |\n| fasttext | 0.81 | 0.81 | 0.81 |\n| TextCNN | 0.83 | 0.84 | 0.83 |\n| TextRNN | 0.84 | 0.83 | 0.82 |\n| RCNN | 0.86 | 0.85 | 0.85 |\n| DPCNN | 0.87 | 0.85 | 0.85 |\n| Transformer | 0.74 | 0.67 | 0.68 |\n| HAN | TODO | TODO | TODO |\n\n**Attention!** Transformer and HAN do not perform well yet. We will fix the bugs and update their results later.\n\n## Contributors\n- sean lee\n    - a single coder \n    - [seanlee97@github.io](https://seanlee97.github.io/)\n- x.m. 
li\n    - an undergraduate student from Shanxi University\n    - [holahack@github](https://github.com/holahack)\n- ...\n\n## References\nSome code modules come from\n\n- [transformer](https://github.com/Kyubyong/transformer)\n- [artf](https://github.com/SeanLee97/artf)\n\nPapers\n\n- TextCNN: [Convolutional Neural Networks for Sentence Classification](https://arxiv.org/abs/1408.5882)\n- DPCNN: [Deep Pyramid Convolutional Neural Networks for Text Categorization](https://ai.tencent.com/ailab/media/publications/ACL3-Brady.pdf)\n- Transformer: [Attention Is All You Need](https://arxiv.org/abs/1706.03762)\n- HAN: [Hierarchical Attention Networks for Document Classification](https://www.cs.cmu.edu/~hovy/papers/16HLT-hierarchical-attention-networks.pdf)\n\n## Contact Us\nFor any questions, please email xmlee97#gmail.com\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fseanlee97%2Fclfzoo","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fseanlee97%2Fclfzoo","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fseanlee97%2Fclfzoo/lists"}