Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/chakki-works/chariot
Deliver the ready-to-train data to your NLP model.
keras natural-language-processing preprocessing python tensorflow
Last synced: 3 days ago
- Host: GitHub
- URL: https://github.com/chakki-works/chariot
- Owner: chakki-works
- License: apache-2.0
- Created: 2018-06-10T22:57:09.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2022-07-15T18:40:51.000Z (over 2 years ago)
- Last Synced: 2024-04-24T23:42:37.601Z (7 months ago)
- Topics: keras, natural-language-processing, preprocessing, python, tensorflow
- Language: Jupyter Notebook
- Homepage: https://chakki-works.github.io/chariot/
- Size: 5.61 MB
- Stars: 122
- Watchers: 7
- Forks: 9
- Open Issues: 3
- Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# chariot
[![PyPI version](https://badge.fury.io/py/chariot.svg)](https://badge.fury.io/py/chariot)
[![Build Status](https://travis-ci.org/chakki-works/chariot.svg?branch=master)](https://travis-ci.org/chakki-works/chariot)
[![codecov](https://codecov.io/gh/chakki-works/chariot/branch/master/graph/badge.svg)](https://codecov.io/gh/chakki-works/chariot)

**Deliver the ready-to-train data to your NLP model.**
* Prepare Dataset
  * You can prepare typical NLP datasets through [chazutsu](https://github.com/chakki-works/chazutsu).
* Build & Run Preprocess
  * You can build the preprocess pipeline like a [scikit-learn Pipeline](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html).
  * Preprocesses for each dataset column are executed in parallel by [Joblib](https://pythonhosted.org/joblib/index.html).
  * Multi-language text tokenization is supported by [spaCy](https://spacy.io/).
* Format Batch
  * Sample a batch from the preprocessed dataset and format it for training the model (padding etc.).
  * You can use pre-trained word vectors through [chakin](https://github.com/chakki-works/chakin).

**chariot** enables you to concentrate on training your model!
![chariot flow](./docs/images/chariot_feature.gif)
## Install
```
pip install chariot
```

## Prepare dataset
You can download various datasets by using [chazutsu](https://github.com/chakki-works/chazutsu).
```py
import chazutsu
from chariot.storage import Storage

storage = Storage("your/data/root")
r = chazutsu.datasets.MovieReview.polarity().download(storage.path("raw"))
df = storage.chazutsu(r.root).data()
df.head(5)
```

Then
```
polarity review
0 0 synopsis : an aging master art thief , his sup...
1 0 plot : a separated , glamorous , hollywood cou...
2 0 a friend invites you to a movie . this film wo...
```

The `Storage` class manages the directory structure that follows [cookiecutter data science](https://drivendata.github.io/cookiecutter-data-science/).
```
Project root
└── data
    ├── external    <- Data from third party sources (ex. word vectors).
    ├── interim     <- Intermediate data that has been transformed.
    ├── processed   <- The final, canonical datasets for modeling.
    └── raw         <- The original, immutable data dump.
```
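As a small sketch of how this layout is used in the snippets above and below, the `Storage` instance resolves relative locations inside the `data` directory (the exact resolution is an assumption inferred from this README; `your/data/root` is the placeholder root used above):

```py
from chariot.storage import Storage

# Placeholder root, as in the download example above.
storage = Storage("your/data/root")

# Relative locations within the layout shown above.
raw_dir = storage.path("raw")                             # the original, immutable data dump
vector_file = storage.path("external/glove.6B.50d.txt")   # third-party word vectors
```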
## Build & Run Preprocess

### Build a preprocess pipeline
All preprocessors are defined at `chariot.transformer`.
Transformers are implemented by extending [scikit-learn `Transformer`](https://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html).
Because of this, the Transformer API will be familiar to you, and you can mix in [scikit-learn's preprocessors](https://scikit-learn.org/stable/modules/preprocessing.html).

```py
import chariot.transformer as ct
from chariot.preprocessor import Preprocessor

preprocessor = Preprocessor()
preprocessor\
.stack(ct.text.UnicodeNormalizer())\
.stack(ct.Tokenizer("en"))\
.stack(ct.token.StopwordFilter("en"))\
.stack(ct.Vocabulary(min_df=5, max_df=0.5))\
.fit(train_data)

preprocessor.save("my_preprocessor.pkl")
loaded = Preprocessor.load("my_preprocessor.pkl")
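# Applying the fitted pipeline is an assumption, not shown in the original
# README: because transformers follow the scikit-learn convention, the stacked
# Preprocessor should expose transform() to run every stacked step on new text, e.g.
# preprocessed = loaded.transform(["A new review to preprocess."])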
```

There are 6 types of transformers prepared in chariot (a short sketch mapping them to concrete classes follows the list).
* TextPreprocessor
  * Preprocess the text before tokenization.
  * `TextNormalizer`: Normalize text (replace some characters etc.).
  * `TextFilter`: Filter the text (delete some spans in the text etc.).
* Tokenizer
  * Tokenize the texts.
  * It is powered by [spaCy](https://spacy.io/), and you can choose [MeCab](https://github.com/taku910/mecab) or [Janome](https://github.com/mocobeta/janome) for Japanese.
* TokenPreprocessor
  * Normalize/Filter the tokens after tokenization.
  * `TokenNormalizer`: Normalize tokens (to lowercase, to original form etc.).
  * `TokenFilter`: Filter tokens (extract only nouns etc.).
* Vocabulary
  * Make a vocabulary and convert tokens to indices.
* Formatter
  * Format (preprocessed) data for training your model.
* Generator
  * Generate target data to train your (language) model.
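As a small sketch, the classes used elsewhere in this README map onto these types as follows (parameters are illustrative; no Generator example appears in this README, so none is shown here):

```py
import chariot.transformer as ct
from chariot.transformer.formatter import Padding

ct.text.UnicodeNormalizer()          # TextPreprocessor: normalize raw text
ct.Tokenizer("en")                   # Tokenizer: split text into tokens (spaCy)
ct.token.StopwordFilter("en")        # TokenPreprocessor: drop stopwords
ct.Vocabulary(min_df=5, max_df=0.5)  # Vocabulary: map tokens to indices
Padding(length=100)                  # Formatter: pad index sequences to a fixed length
```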
### Build a preprocess for dataset

When you want to apply preprocessing to each column of your dataset, you can use `DatasetPreprocessor`.
```py
from chariot.dataset_preprocessor import DatasetPreprocessor
from chariot.transformer.formatter import Padding

dp = DatasetPreprocessor()
dp.process("review")\
.by(ct.text.UnicodeNormalizer())\
.by(ct.Tokenizer("en"))\
.by(ct.token.StopwordFilter("en"))\
.by(ct.Vocabulary(min_df=5, max_df=0.5))\
.by(Padding(length=pad_length))\
.fit(train_data["review"])
dp.process("polarity")\
.by(ct.formatter.CategoricalLabel(num_class=3))

preprocessed = dp.preprocess(data)
# DatasetPreprocessor holds multiple preprocessors.
# Because of this, the save file format is `tar.gz`.
dp.save("my_dataset_preprocessor.tar.gz")
loaded = DatasetPreprocessor.load("my_dataset_preprocessor.tar.gz")
```

## Train your model with chariot
`chariot` has features to help you train your model.
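The `model` in the snippets below is assumed to be a compiled Keras model; the following is a hypothetical, minimal definition matching the padded review indices and the 3-class polarity labels prepared above (layer sizes and vocabulary size are illustrative, not part of chariot):

```py
from tensorflow.keras import layers, models

model = models.Sequential([
    # input_dim must cover the vocabulary built above; 10000 is illustrative.
    layers.Embedding(input_dim=10000, output_dim=50),
    layers.GlobalAveragePooling1D(),
    # Three output units, matching CategoricalLabel(num_class=3) above.
    layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```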
```py
formatted = dp(train_data).preprocess().format().processed

model.fit(formatted["review"], formatted["polarity"], batch_size=32,
          validation_split=0.2, epochs=15, verbose=2)
```

If you want to train the model by mini-batches, you can iterate over the preprocessed dataset:
```py
for batch in dp(train_data).preprocess().iterate(batch_size=32, epoch=10):
    model.train_on_batch(batch["review"], batch["polarity"])
```
You can use pre-trained word vectors downloaded via [chakin](https://github.com/chakki-works/chakin).
```py
from chariot.storage import Storage
from chariot.transformer.vocabulary import Vocabulary

# Download word vectors
storage = Storage("your/data/root")
storage.chakin(name="GloVe.6B.50d")

# Make the embedding matrix
vocab = Vocabulary()
vocab.set(["you", "loaded", "word", "vector", "now"])
embed = vocab.make_embedding(storage.path("external/glove.6B.50d.txt"))
print(embed.shape) # (len(vocab.count), 50)
```
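As a follow-up sketch (an assumption, not part of chariot's API), the resulting matrix can be plugged into a Keras `Embedding` layer as fixed pre-trained weights:

```py
from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(
    input_dim=embed.shape[0],   # vocabulary size
    output_dim=embed.shape[1],  # vector dimension (50 for GloVe.6B.50d)
    weights=[embed],            # initialize with the pre-trained matrix
    trainable=False)            # keep the vectors fixed during training
```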