https://github.com/IndicoDataSolutions/Passage
A little library for text analysis with RNNs.
- Host: GitHub
- URL: https://github.com/IndicoDataSolutions/Passage
- Owner: IndicoDataSolutions
- License: mit
- Created: 2015-01-15T17:51:38.000Z (about 11 years ago)
- Default Branch: master
- Last Pushed: 2018-09-10T17:31:35.000Z (over 7 years ago)
- Last Synced: 2024-10-31T21:35:40.797Z (over 1 year ago)
- Language: Python
- Size: 46.9 KB
- Stars: 531
- Watchers: 78
- Forks: 134
- Open Issues: 12
Metadata Files:
- Readme: README.md
- Changelog: CHANGES.txt
- License: LICENSE
Awesome Lists containing this project
- awesome-rnn - Passage
README
**Passage**
===================
A little library for text analysis with RNNs.
Warning: very alpha, work in progress.
## Install
via GitHub (version under active development)
```
git clone http://github.com/IndicoDataSolutions/passage.git
cd passage
python setup.py develop
```
or via pip
```
sudo pip install passage
```
## Example
Using Passage to do binary classification of text, this example:
* Tokenizes some training text, converting it to a format Passage can use.
* Defines the model's structure as a list of layers.
* Creates the model with that structure and a cost to be optimized.
* Trains the model for one iteration over the training text.
* Uses the model and tokenizer to predict on new text.
* Saves and loads the model.
```
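from passage.preprocessing import Tokenizer
from passage.layers import Embedding, GatedRecurrent, Dense
from passage.models import RNN
from passage.utils import save, load

# Tokenize the training text into a format Passage can use.
tokenizer = Tokenizer()
train_tokens = tokenizer.fit_transform(train_text)

# Define the model's structure as a list of layers.
layers = [
    Embedding(size=128, n_features=tokenizer.n_features),
    GatedRecurrent(size=128),
    Dense(size=1, activation='sigmoid')
]

# Create the model with that structure and a cost to be optimized.
model = RNN(layers=layers, cost='BinaryCrossEntropy')

# Train for one iteration over the training text.
model.fit(train_tokens, train_labels)

# Predict on new text, reusing the fitted tokenizer.
model.predict(tokenizer.transform(test_text))

# Save the model to disk and load it back.
save(model, 'save_test.pkl')
model = load('save_test.pkl')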
```
Where:
* `train_text` is a list of strings, e.g. `['hello world', 'foo bar']`
* `train_labels` is a list of labels, e.g. `[0, 1]`
* `test_text` is another list of strings
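To make the shape of these inputs concrete, here is a minimal standalone sketch of what "tokenizing" text into a model-ready format can look like: each string becomes a list of integer ids. This is not Passage's `Tokenizer` implementation. The helpers `fit_vocab` and `encode`, the whitespace splitting, and the choice of id `0` for unknown tokens are all assumptions made for illustration.

```python
from collections import Counter


def fit_vocab(texts, min_count=1):
    """Assign an integer id to every token seen at least min_count times.

    Hypothetical helper: ids start at 1 so that 0 can stand for unknown tokens.
    """
    counts = Counter(tok for text in texts for tok in text.lower().split())
    vocab = {}
    for tok, n in sorted(counts.items()):  # sorted for a deterministic mapping
        if n >= min_count:
            vocab[tok] = len(vocab) + 1
    return vocab


def encode(texts, vocab):
    """Turn each string into a list of integer ids, using 0 for unseen tokens."""
    return [[vocab.get(tok, 0) for tok in text.lower().split()] for text in texts]


train_text = ['hello world', 'foo bar']
train_labels = [0, 1]
test_text = ['hello foo', 'unseen words']

vocab = fit_vocab(train_text)
train_ids = encode(train_text, vocab)
test_ids = encode(test_text, vocab)

# Vocabulary size plus one slot for the unknown-token id.
n_features = len(vocab) + 1
```

The vocabulary is fitted on the training text only and then reused for the test text, which mirrors the `fit_transform` / `transform` split in the example above. Tokens that never appeared in training all map to the same unknown id.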
## Datasets
Without sizeable datasets, RNNs have difficulty achieving better results than traditional sparse linear models. Below are a few appropriately sized datasets that are useful for experimentation. Hopefully this list will grow over time. Please feel free to propose new datasets for inclusion through either an issue or a pull request.
**__Note__**: __None of these datasets were created by indico, nor should their inclusion here indicate any kind of endorsement.__
Blogger Dataset: http://www.cs.biu.ac.il/~koppel/blogs/blogs.zip (Age and gender data)