Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/seanlee97/clfzoo
A deep text classifiers library.
https://github.com/seanlee97/clfzoo
nlp tensorflow text-classification
Last synced: 7 days ago
JSON representation
A deep text classifiers library.
- Host: GitHub
- URL: https://github.com/seanlee97/clfzoo
- Owner: SeanLee97
- License: mit
- Created: 2018-11-12T04:58:58.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2018-11-14T01:43:27.000Z (about 6 years ago)
- Last Synced: 2024-12-29T13:42:54.209Z (20 days ago)
- Topics: nlp, tensorflow, text-classification
- Language: Python
- Size: 156 KB
- Stars: 36
- Watchers: 6
- Forks: 8
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
/ clfzoo /
Eng / [CN](https://github.com/SeanLee97/clfzoo/blob/master/docs/ZH_README.md)
clfzoo is a toolkit for text classification. We have implemented some baseline models, such as TextCNN, TextRNN, RCNN, Transformer, HAN, DPCNN. And We have designed a unified and friendly API to train / predict / test the models. Looking forward to your code contributions and suggestions.
## Requiements
```
python3+
numpy
sklearn
tensorflow>=1.6.0
```## Installation
```
git clone https://github.com/SeanLee97/clfzoo.git
cd clfzoo
```## Overview
```
project
│ README.md
│
└─── docs
│
└─── clfzoo # models
│ │ base.py # base model template
│ │ config.py # default configure
│ │ dataloader.py
│ │ instance.py # data instance
│ │ vocab.py # vocabulary
│ │ libs # layers and functions
│ │ dpcnn # implement dpcnn model
│ │ │ __init__.py # model apis
│ │ │ model.py # model
│ │ ... # implement other models
└───examples
│ ...
```### Data Prepare
Each line is a document. The line format is "label \t sentence". The default word tokenizer is split by blank space, so words in sentence should split by blank space.for english sample
```
greeting how are you.
```for chinese sample
```
打招呼 你 最近 过得 怎样 啊 ?
```### Usage
#### train
```python
# import model api
import clfzoo.textcnn as clf# import model config
from clfzoo.config import ConfigTextCNN"""define model config
You can assign value to hy-params defined on base model config (here is ConfigTextCNN)
"""class Config(ConfigTextCNN):
def __init__(self):
# it is required to implement super() function
super(Config, self).__init__()# it is required to provide dataset
train_file = '/path/to/train'
dev_file = '/path/to/test'
# ... other hy-params# `training` is flag to indicate train mode.
clf.model(Config(), training=True)# start to train
clf.train()
```The train log will output to `log.txt`, the model weights and checkpoint summaries will output to `models` folder.
#### predict
Predit the labels and probability scores.
```python
import clfzoo.textcnn as clf
from clfzoo.config import ConfigTextCNNclass Config(ConfigTextCNN):
def __init__(self):
super(Config, self).__init__()
# the same hy-params as train# inject config to model
clf.model(Config())"""
Input: a list
each item in list is a sentence string split by blank space (for chinese sentence you should prepare your input data first)
"""
datas = ['how are u ?', 'what is the weather ?', ...]"""
Return: a list
[('label 1', 'score 1'), ('label 2', 'score 2'), ...]
"""
preds = clf.predict(datas)
```#### test
Predit the labels and probability scores and get result metrics. In order to calculate metrics you should provide ground-truth label.
```python
import clfzoo.textcnn as clf
from clfzoo.config import ConfigTextCNNclass Config(ConfigTextCNN):
def __init__(self):
super(Config, self).__init__()
# the same hy-params as train# inject config to model
clf.model(Config())"""
Input: a list
each item in list is a sentence string split by blank space (for chinese sentence you should prepare your input data first)
"""
datas = ['how are u ?', 'what is the weather ?', ...]
labels = ['greeting', 'weather', ...]"""
Return: a tuple
- predicts: a list
[('label 1', 'score 1'), ('label 2', 'score 2'), ...]
- metrics: a dict
{'recall': '', 'precision': '', 'f1': , 'accuracy': ''}
"""
preds, metrics = clf.test(datas, labels)
```## Benchmark Results
here we use [smp2017-ECDT](https://arxiv.org/abs/1709.10217) dataset as an example, which is a multi-label (31 labels)、short-text and chinese dataset.We train all models in 20 epochs, and calculate metrics by sklearn metrics functions. As we all know [fasttext](https://github.com/facebookresearch/fastText) is a strong baseline in text-classification, so here we give the result on fasttext
| Models | Precision | Recall | F1 |
| ------------ | ------------ | ------------ | ------------ |
| fasttext | 0.81 | 0.81 | 0.81 |
| TextCNN | 0.83 | 0.84 | 0.83 |
| TextRNN | 0.84 | 0.83 | 0.82 |
| RCNN | 0.86 | 0.85 | 0.85 |
| DPCNN | 0.87 | 0.85 | 0.85 |
| Transformer | 0.74 | 0.67 | 0.68 |
| HAN | TODO | TODO | TODO |**Attention!** It seems that Transformer and HAN can`t perform well now, We will fix bugs and update their result later.
## Contributors
- sean lee
- a single coder
- [[email protected]](https://seanlee97.github.io/)
- x.m. li
- a undergraduate student from Shanxi University
- [holahack@github](https://github.com/holahack)
- ...## Refrence
Some code modules from- [transformer](https://github.com/Kyubyong/transformer)
- [artf](https://github.com/SeanLee97/artf)Papers
- TextCNN: [Convolutional Neural Networks for Sentence Classification](https://arxiv.org/abs/1408.5882)
- DPCNN: [Deep Pyramid Convolutional Neural Networks for Text Categorization](https://ai.tencent.com/ailab/media/publications/ACL3-Brady.pdf)
- Transformer: [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
- HAN: [Hierarchical Attention Networks for Document Classification](https://www.cs.cmu.edu/~hovy/papers/16HLT-hierarchical-attention-networks.pdf)## Contact Us
Any questions please mailto xmlee97#gmail.com