Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/uiur/demae
A framework to build a machine learning batch
Last synced: 3 months ago
- Host: GitHub
- URL: https://github.com/uiur/demae
- Owner: uiur
- License: mit
- Created: 2017-09-11T11:14:04.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2017-10-20T09:26:21.000Z (about 7 years ago)
- Last Synced: 2024-10-07T19:07:27.799Z (3 months ago)
- Language: Python
- Size: 13.7 KB
- Stars: 7
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# demae
[![Build Status](https://travis-ci.org/uiureo/demae.svg?branch=master)](https://travis-ci.org/uiureo/demae)
[![PyPI version](https://badge.fury.io/py/demae.svg)](https://badge.fury.io/py/demae)demae is a framework to build a batch program using Machine Learning.
Makes it easier to deploy your ML model into production.Main features:
- handle data source and destination easily
- support parallel execution
- print stats of execution timeThis example is to fetch input from S3, transform it and push output to S3.
`S3 -> transform -> S3`
```python
import re

from demae import Base
from demae.source import S3Source
from demae.dest import S3Dest


class Batch(Base):
    """
    Requires `source`, `dest` and `transform` to be implemented.
    """

    # Set the data source.
    # This reads input from files with the prefix in the `redshift-copy-buffer` bucket.
    # Input files must be in tsv format.
    source = S3Source(
        bucket='bucket',
        prefix='{env}/example_input/{date}/example_input.tsv',
        columns=['id', 'text'],
    )

    # Specify the output destination in S3.
    # key_map: a function (input key -> output key)
    # This example maps input
    #   from: development/example_input/2017-12-24/example_input.0000_part_00.gz
    #   to:   development/example_output/2017-12-24/example_output.0000_part_00.gz
    dest = S3Dest(
        key_map=lambda key: re.sub('_input', '_output', key)
    )

    def transform(self, data):
        """
        Write your inference code here.
        data: a pandas DataFrame; its columns are set automatically from source.columns.
        Must return an array-like object (DataFrame, numpy array or list).
        """
        # `predict` is your own model's inference function.
        output = predict(data['text'])
        return output
```
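Under the hood, `run()` presumably reads from the source, applies `transform`, and writes the result to the destination. A minimal sketch of that read-transform-write loop (hypothetical stand-in classes, not demae's actual internals):

```python
# Hypothetical sketch of a read -> transform -> write batch loop.
# ListSource/ListDest are stand-ins for S3Source/S3Dest, not demae's real API.

class ListSource:
    """Stand-in source that yields rows from an in-memory list."""
    def __init__(self, rows):
        self.rows = rows

    def read(self):
        return self.rows


class ListDest:
    """Stand-in destination that collects output rows."""
    def __init__(self):
        self.written = []

    def write(self, rows):
        self.written.extend(rows)


class SimpleBatch:
    def __init__(self, source, dest):
        self.source = source
        self.dest = dest

    def transform(self, data):
        # Users override this with their inference code.
        return [row.upper() for row in data]

    def run(self):
        data = self.source.read()
        output = self.transform(data)
        self.dest.write(output)
        return output


dest = ListDest()
batch = SimpleBatch(ListSource(['a', 'b']), dest)
batch.run()
# dest.written is now ['A', 'B']
```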
To run:
```python
batch = Batch(
env='development',
date='2017-02-13'
)
batch.run()
```

## Parallel execution
Parallel execution is supported by providing the environment variables specified in `parallel_env`. Each batch process then handles only its corresponding part of the input.
```python
source = S3Source(
bucket='bucket',
prefix='development/foo/foo.tsv',
columns=['id', 'text'],
parallel_env={'index': 'PARALLEL_INDEX', 'size': 'PARALLEL_SIZE'},
)
```

For example, with input files `input.tsv.part0`, `input.tsv.part1` and `input.tsv.part2`: when `PARALLEL_INDEX=1` and `PARALLEL_SIZE=3` are provided, the batch handles only `input.tsv.part1`.
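The index/size selection above can be sketched as a simple modulo partition over the sorted input keys (an assumed mechanism; demae's actual partitioning logic may differ):

```python
# Sketch of how one worker might pick its share of input files
# (assumed mechanism, not necessarily demae's implementation).

def select_parts(keys, index, size):
    """Give each of `size` workers every `size`-th key, offset by `index`."""
    return [key for i, key in enumerate(sorted(keys)) if i % size == index]

parts = ['input.tsv.part0', 'input.tsv.part1', 'input.tsv.part2']
print(select_parts(parts, index=1, size=3))  # -> ['input.tsv.part1']
```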
## License
MIT
This software was developed while working at [Cookpad Inc.](https://github.com/cookpad)