https://github.com/raghakot/keras-text
Text Classification Library in Keras
deep-learning keras machine-learning neural-network tensorflow text-classification theano
- Host: GitHub
- URL: https://github.com/raghakot/keras-text
- Owner: raghakot
- License: mit
- Created: 2017-08-27T18:59:02.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2018-06-04T19:44:35.000Z (over 6 years ago)
- Last Synced: 2024-10-22T16:46:41.683Z (19 days ago)
- Topics: deep-learning, keras, machine-learning, neural-network, tensorflow, text-classification, theano
- Language: Python
- Homepage: https://raghakot.github.io/keras-text/
- Size: 11.6 MB
- Stars: 420
- Watchers: 22
- Forks: 97
- Open Issues: 17
Metadata Files:
- Readme: README.md
- License: LICENSE
# Keras Text Classification Library
[![Build Status](https://travis-ci.org/raghakot/keras-text.svg?branch=master)](https://travis-ci.org/raghakot/keras-text)
[![license](https://img.shields.io/github/license/mashape/apistatus.svg?maxAge=2592000)](https://github.com/raghakot/keras-text/blob/master/LICENSE)
[![Slack](https://img.shields.io/badge/slack-discussion-E01563.svg)](https://join.slack.com/t/keras-text/shared_invite/MjMzNDU3NDAxODMxLTE1MDM4NTg0MTktNzgxZTNjM2E4Zg)

keras-text is a one-stop text classification library implementing various state-of-the-art models with a clean and
extendable interface to implement custom architectures.

## Quick start
### Create a tokenizer to build your vocabulary
- To represent your dataset as `(docs, words)`, use `WordTokenizer`.
- To represent your dataset as `(docs, sentences, words)`, use `SentenceWordTokenizer`.
- To create arbitrary hierarchies, extend `Tokenizer` and implement the `token_generator` method.

```python
from keras_text.processing import WordTokenizer

# `texts` is a list of raw document strings.
tokenizer = WordTokenizer()
tokenizer.build_vocab(texts)
```

Want to tokenize with character tokens to leverage character models? Use `CharTokenizer`.
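
To make the idea of a vocabulary concrete, here is a minimal pure-Python sketch of what building a token index involves. It does not use keras-text or spacy: the `build_token_index` and `encode` helpers, the whitespace tokenization, and the choice to reserve index 0 for padding are all assumptions made for illustration, not the library's implementation.

```python
from collections import Counter


def build_token_index(texts, min_count=1):
    """Map each token to an integer id, most frequent tokens first.

    Hypothetical helper: ids start at 1 because index 0 is assumed
    to be reserved for padding.
    """
    counts = Counter(token for text in texts for token in text.lower().split())
    # Sort by descending frequency, then alphabetically, so ids are deterministic.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    kept = [token for token, count in ordered if count >= min_count]
    return {token: index for index, token in enumerate(kept, start=1)}


def encode(text, token_index):
    """Encode one document as a list of token ids, skipping unknown tokens."""
    return [token_index[t] for t in text.lower().split() if t in token_index]


texts = ["the cat sat", "the dog sat down"]
token_index = build_token_index(texts)
encoded = [encode(text, token_index) for text in texts]
```

Here `encoded` is the `(docs, words)` view of the data: one list of ids per document. A `(docs, sentences, words)` view would add one more level of nesting by splitting each document into sentences before encoding.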
### Build a dataset
A dataset encapsulates the tokenizer, X, y and the test set. This allows you to focus your efforts on
trying various architectures/hyperparameters without having to worry about inconsistent evaluation. A dataset can be
saved to and loaded from disk.

```python
from keras_text.data import Dataset

ds = Dataset(X, y, tokenizer=tokenizer)
ds.update_test_indices(test_size=0.1)
ds.save('dataset')
```

The `update_test_indices` method automatically stratifies multi-class or multi-label data.
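
For readers unfamiliar with stratification, the sketch below shows the idea for the single-label, multi-class case only: each class contributes roughly the same fraction of its samples to the test set. The `stratified_test_indices` function, its `seed` parameter, and the "at least one sample per class" rule are assumptions for illustration. This is not how keras-text implements `update_test_indices`, and it does not cover multi-label data.

```python
import random
from collections import defaultdict


def stratified_test_indices(labels, test_size=0.1, seed=0):
    """Pick test indices so each class contributes about `test_size` of its samples."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    test_indices = []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Assumption: keep at least one sample per class in the test set.
        n_test = max(1, round(len(indices) * test_size))
        test_indices.extend(indices[:n_test])
    return sorted(test_indices)


labels = ["pos"] * 20 + ["neg"] * 10
test_indices = stratified_test_indices(labels, test_size=0.1)
```

With 20 `pos` and 10 `neg` samples and `test_size=0.1`, the test set holds two `pos` samples and one `neg` sample, so the 2:1 class ratio of the full data is preserved.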
### Build text classification models
See the `tests/` folder for usage.
#### Word based models
When the dataset is represented as `(docs, words)`, word-based models can be created using `TokenModelFactory`.
```python
from keras_text.models import TokenModelFactory
from keras_text.models import YoonKimCNN, AttentionRNN, StackedRNN

# RNN models can use `max_tokens=None` to allow a variable number of words per mini-batch.
factory = TokenModelFactory(1, tokenizer.token_index, max_tokens=100, embedding_type='glove.6B.100d')
word_encoder_model = YoonKimCNN()
model = factory.build_model(token_encoder_model=word_encoder_model)
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.summary()
```

Currently supported models include:
- [Yoon Kim CNN](https://arxiv.org/abs/1408.5882)
- Stacked RNNs
- Attention (with/without context) based RNN encoders.

`TokenModelFactory.build_model` uses the provided word encoder, whose output is then classified via a `Dense` block.
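
The `max_tokens` argument in the example above decides how many tokens each document is cut or padded to. The pure-Python sketch below illustrates the difference between a fixed length and `max_tokens=None`. The `pad_or_truncate` and `pad_batch` helpers, the padding id 0, and padding on the right are assumptions for illustration; the library's own padding behaviour may differ (Keras' `pad_sequences`, for instance, pads at the front by default).

```python
def pad_or_truncate(sequence, max_tokens, pad_id=0):
    """Return `sequence` cut or right-padded to exactly `max_tokens` ids."""
    clipped = sequence[:max_tokens]
    return clipped + [pad_id] * (max_tokens - len(clipped))


def pad_batch(sequences, max_tokens=None, pad_id=0):
    """Pad a batch to `max_tokens`, or to its longest sequence when it is None."""
    if max_tokens is None:
        max_tokens = max(len(sequence) for sequence in sequences)
    return [pad_or_truncate(sequence, max_tokens, pad_id) for sequence in sequences]


batch = [[2, 3, 1], [2, 4, 1, 5, 6]]
fixed = pad_batch(batch, max_tokens=4)  # every document has exactly 4 ids
variable = pad_batch(batch)             # length follows the longest document
```

A fixed length gives every mini-batch the same shape, which convolutional encoders generally need. Letting the length follow the longest document in each batch avoids wasted padding, which is why it suits RNN encoders.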
#### Sentence based models
When the dataset is represented as `(docs, sentences, words)`, sentence-based models can be created using `SentenceModelFactory`.
```python
from keras_text.models import SentenceModelFactory
from keras_text.models import YoonKimCNN, AttentionRNN, StackedRNN, AveragingEncoder

# Pad max sentences per doc to 500 and max words per sentence to 200.
# Can also use `max_sents=None` to allow variable sized max_sents per mini-batch.
factory = SentenceModelFactory(10, tokenizer.token_index, max_sents=500, max_tokens=200, embedding_type='glove.6B.100d')
word_encoder_model = AttentionRNN()
sentence_encoder_model = AttentionRNN()

# Allows you to compose arbitrary word encoders followed by a sentence encoder.
model = factory.build_model(word_encoder_model, sentence_encoder_model)
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.summary()
```

Currently supported models include:
- [Yoon Kim CNN](https://arxiv.org/abs/1408.5882)
- Stacked RNNs
- Attention (with/without context) based RNN encoders.

`SentenceModelFactory.build_model` creates a tiered model where the words within a sentence are first encoded using
`word_encoder_model`. All such encodings per sentence are then encoded using `sentence_encoder_model`.

- [Hierarchical attention networks](http://www.cs.cmu.edu/~./hovy/papers/16HLT-hierarchical-attention-networks.pdf)
(HANs) can be built by composing two attention-based RNN models. This is useful when a document is very large.
- For smaller documents, a reasonable way to encode sentences is to average the words within them. This can be done by using
`token_encoder_model=AveragingEncoder()`.
- Mix and match encoders as you see fit for your problem.

## Resources
TODO: Update documentation and add notebook examples.
Stay tuned for better documentation and examples.
Until then, the best resource is the [API docs](https://raghakot.github.io/keras-text/).

## Installation
1) Install [keras](https://github.com/fchollet/keras/blob/master/README.md#installation)
with the theano or tensorflow backend. Note that this library requires Keras > 2.0.

2) Install keras-text

> From sources

```bash
sudo python setup.py install
```

> PyPI package

```bash
sudo pip install keras-text
```

3) Download the target spacy model

keras-text uses the excellent spacy library for tokenization. See the instructions on how to
[download a model](https://spacy.io/docs/usage/models#download) for the target language.

## Citation
Please cite keras-text in your publications if it helped your research. Here is an example BibTeX entry:
```
@misc{raghakotkerastext,
title={keras-text},
author={Kotikalapudi, Raghavendra and contributors},
year={2017},
publisher={GitHub},
howpublished={\url{https://github.com/raghakot/keras-text}},
}
```