https://github.com/uiur/demae

A framework to build a machine learning batch

Host: GitHub
URL: https://github.com/uiur/demae
Owner: uiur
License: mit
Created: 2017-09-11T11:14:04.000Z (over 7 years ago)
Default Branch: master
Last Pushed: 2017-10-20T09:26:21.000Z (about 7 years ago)
Last Synced: 2024-10-07T19:07:27.799Z (3 months ago)
Language: Python
Size: 13.7 KB
Stars: 7
Watchers: 1
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# demae

[![Build Status](https://travis-ci.org/uiureo/demae.svg?branch=master)](https://travis-ci.org/uiureo/demae)

[![PyPI version](https://badge.fury.io/py/demae.svg)](https://badge.fury.io/py/demae)

demae is a framework for building batch programs that use machine learning.

It makes it easier to deploy your ML model into production.

Main features:

- easy handling of data sources and destinations
- support for parallel execution
- printed stats of execution time

The following example fetches input from S3, transforms it, and pushes the output back to S3.

`S3 -> transform -> S3`

```python
import re

from demae import Base
from demae.source import S3Source
from demae.dest import S3Dest

"""
Requires `source`, `dest` and `transform` to be implemented.
"""
class Batch(Base):
    """
    Set the data source.
    This reads input from files under the prefix in the configured bucket.
    Input files must be in TSV format.
    """
    source = S3Source(
        bucket='bucket',
        prefix='{env}/example_input/{date}/example_input.tsv',
        columns=['id', 'text'],
    )

    """
    Specify the output destination in S3.
    key_map : a function (input key -> output key)
    This example maps input:
      from: development/example_input/2017-12-24/example_input.0000_part_00.gz
      to:   development/example_output/2017-12-24/example_output.0000_part_00.gz
    """
    dest = S3Dest(
        key_map=lambda key: re.sub('_input', '_output', key)
    )

    """
    Write your inference code here.
    data : pandas DataFrame
        Its columns are set automatically from source.columns.
    Must return an array-like object (DataFrame, numpy array or list).
    """
    def transform(self, data):
        # `predict` stands for your own inference function;
        # it is not defined in this example.
        output = predict(data.loc[:, 'text'])
        return output
```
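The `key_map` substitution can be checked on its own, without S3 or demae. This is a minimal sketch that reuses the `re.sub` call from the example and the sample key from its comment; the standalone function name `key_map` is only for illustration.

```python
import re


def key_map(key):
    # Same substitution as the lambda passed to S3Dest in the example:
    # every occurrence of "_input" in the key becomes "_output".
    return re.sub('_input', '_output', key)


input_key = 'development/example_input/2017-12-24/example_input.0000_part_00.gz'
output_key = key_map(input_key)
print(output_key)
# development/example_output/2017-12-24/example_output.0000_part_00.gz
```

Note that both the directory name and the file name contain `_input`, so both are rewritten.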

To run the batch:

```python
batch = Batch(
    env='development',
    date='2017-02-13',
)
batch.run()
```
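The `env` and `date` arguments presumably fill the `{env}` and `{date}` placeholders in the source prefix. The README does not show how demae does this internally, so the sketch below assumes plain `str.format` substitution; `render_prefix` is a hypothetical helper, not part of demae.

```python
def render_prefix(template, **params):
    # Hypothetical helper (not a demae API): fill {name} placeholders
    # in the template from keyword arguments using str.format.
    return template.format(**params)


prefix = render_prefix(
    '{env}/example_input/{date}/example_input.tsv',
    env='development',
    date='2017-02-13',
)
print(prefix)
# development/example_input/2017-02-13/example_input.tsv
```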

## Parallel execution

Parallel execution is supported through environment variables whose names are specified in `parallel_env`.

Each batch process handles only its corresponding part of the input.

```python
source = S3Source(
    bucket='bucket',
    prefix='development/foo/foo.tsv',
    columns=['id', 'text'],
    parallel_env={'index': 'PARALLEL_INDEX', 'size': 'PARALLEL_SIZE'},
)
```

For example, suppose the input files are `input.tsv.part0`, `input.tsv.part1` and `input.tsv.part2`.

When `PARALLEL_INDEX=1` and `PARALLEL_SIZE=3` are provided, the batch handles only `input.tsv.part1`.
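The exact rule demae uses to assign files to workers is not stated here. As a rough sketch, one assignment that reproduces the example above is a round-robin split over the sorted keys. The functions `select_keys` and `select_keys_from_env` are hypothetical and only illustrate the idea.

```python
import os


def select_keys(keys, index, size):
    """Hypothetical round-robin split: worker `index` takes every `size`-th key."""
    if not 0 <= index < size:
        raise ValueError('index must satisfy 0 <= index < size')
    return [key for i, key in enumerate(sorted(keys)) if i % size == index]


def select_keys_from_env(keys, parallel_env, environ=os.environ):
    # parallel_env maps 'index' and 'size' to environment variable names,
    # in the same shape as the parallel_env argument shown above.
    index = int(environ[parallel_env['index']])
    size = int(environ[parallel_env['size']])
    return select_keys(keys, index, size)


keys = ['input.tsv.part0', 'input.tsv.part1', 'input.tsv.part2']
parallel_env = {'index': 'PARALLEL_INDEX', 'size': 'PARALLEL_SIZE'}

# A plain dict stands in for os.environ so the sketch runs anywhere.
fake_environ = {'PARALLEL_INDEX': '1', 'PARALLEL_SIZE': '3'}
selected = select_keys_from_env(keys, parallel_env, environ=fake_environ)
print(selected)  # ['input.tsv.part1']
```

With this scheme, running one process per index from `0` to `PARALLEL_SIZE - 1` covers every input file exactly once.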

## License

MIT

This software was developed while working for [Cookpad Inc.](https://github.com/cookpad)