https://github.com/mbalatsko/mlmr

This library will help you easily parallelize your python code for all kind of data transformations in MapReduce fashion.

mapreduce ml parallel parallel-computing sklearn-library

Host: GitHub
URL: https://github.com/mbalatsko/mlmr
Owner: mbalatsko
License: mit
Created: 2020-05-14T19:38:22.000Z (about 6 years ago)
Default Branch: master
Last Pushed: 2020-05-18T15:11:19.000Z (about 6 years ago)
Last Synced: 2025-01-21T01:12:48.280Z (over 1 year ago)
Topics: mapreduce, ml, parallel, parallel-computing, sklearn-library
Language: Jupyter Notebook
Homepage:
Size: 11.7 KB
Stars: 2
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt

# MLMR

[![PyPI version](https://badge.fury.io/py/mlmr.svg)](https://badge.fury.io/py/mlmr)

[![Downloads](https://pepy.tech/badge/mlmr)](https://pepy.tech/project/mlmr)

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

This library helps you parallelize Python code for all kinds of data transformations.

The core functions follow the MapReduce paradigm. The map step is parallelized with Python's built-in `multiprocessing` module.
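
To make the split, map, and reduce steps concrete, here is a minimal sketch in plain Python. It is not `mlmr`'s implementation: `split_into_chunks`, `sum_of_squares`, and `map_reduce_sketch` are hypothetical names chosen for illustration, and the serial fallback for `n_jobs=1` is an assumption of this sketch.

```python
from multiprocessing import Pool


def split_into_chunks(data, n_chunks):
    """Split a sequence into n_chunks contiguous, nearly equal parts."""
    items = list(data)
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks receive one additional item.
        end = start + base + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """Map step: collapse one chunk into a partial result."""
    return sum(x * x for x in chunk)


def map_reduce_sketch(data, map_func, reduce_func, n_jobs=1):
    """Hypothetical helper for illustration, not part of mlmr."""
    chunks = split_into_chunks(data, n_jobs)
    if n_jobs == 1:
        # Serial fallback: no worker processes are started.
        partials = [map_func(chunk) for chunk in chunks]
    else:
        # map_func must be a module-level function so that it can be
        # pickled and sent to the worker processes.
        with Pool(processes=n_jobs) as pool:
            partials = pool.map(map_func, chunks)
    # Reduce step: combine the list of partial results.
    return reduce_func(partials)
```

Called as `map_reduce_sketch([1, 2, 3, 4, 5], sum_of_squares, sum, n_jobs=2)` from under an `if __name__ == "__main__":` guard, the data is split into `[1, 2, 3]` and `[4, 5]`, the partial results are 14 and 41, and the reduce step returns 55.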

## Installation

```bash
pip install mlmr
```

## Usage

For the API specification and advanced usage, start with these short tutorials:

1. [Functional API tutorial](https://github.com/mbalatsko/mlmr/blob/master/tutorials/Function%20tutorial.ipynb)
2. [Sklearn integration tutorial](https://github.com/mbalatsko/mlmr/blob/master/tutorials/Sklearn%20integration%20tutorial.ipynb)

Below are several real-world applications of the `mlmr` API.

### Sum of squares in MapReduce fashion example

```python
import numpy as np
from mlmr.function import map_reduce

arr = [1, 2, 3, 4, 5]

def squares_of_slice(arr_slice):  # our map function, with partial reduction
    return sum(map(lambda x: x**2, arr_slice))

def get_split_data_func(n_slices):  # wrapper function of split data function
    def split_data(data):
        return np.array_split(data, n_slices)
    return split_data

n_jobs = 2

result = map_reduce(
    data=arr,
    data_split_func=get_split_data_func(n_jobs),  # split data into n_jobs slices
    map_func=squares_of_slice,
    reduce_func=sum,
    n_jobs=n_jobs
)
```

### Pandas apply parallelization in MapReduce fashion example

In this example, `transform_concat` applies a transformation to `df` (a `pd.DataFrame` or `pd.Series`) in parallel. The number of worker processes is derived from the `n_jobs` argument, and the data is divided evenly into that many slices. `our_transform_func` is then applied to each slice in parallel, with every process handling its own slice. Once all slices are processed, the results are flattened into a single result, which is returned.

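```python
import pandas as pd
from mlmr.function import transform_concat

def computation_costly_transformation(*_):  # placeholder for an expensive per-element function
    pass

def our_transform_func(df):
    return df.apply(computation_costly_transformation)

df = pd.Series(["a", "b", "c"])  # placeholder input; use your own pd.DataFrame or pd.Series

df_transformed = transform_concat(df, transform_func=our_transform_func, n_jobs=-1)
```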
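
The steps described for this example (derive a process count from `n_jobs`, split evenly, transform each slice, flatten) can be sketched on plain lists. This is an illustration only, not `mlmr`'s code. All three function names are hypothetical, the slices are processed serially to keep the sketch short, and treating `n_jobs=-1` as "use every available CPU" is an assumption borrowed from the joblib and scikit-learn convention.

```python
import os


def resolve_n_jobs(n_jobs):
    """Assumption: -1 means 'use every available CPU' (joblib convention)."""
    if n_jobs == -1:
        return os.cpu_count() or 1
    if n_jobs < 1:
        raise ValueError("n_jobs must be -1 or a positive integer")
    return n_jobs


def split_evenly(rows, n_slices):
    """Return contiguous slices whose lengths differ by at most one."""
    base, extra = divmod(len(rows), n_slices)
    slices, start = [], 0
    for i in range(n_slices):
        end = start + base + (1 if i < extra else 0)
        slices.append(rows[start:end])
        start = end
    return slices


def transform_concat_sketch(rows, transform_func, n_jobs=1):
    """Hypothetical stand-in that mirrors the described steps on lists."""
    slices = split_evenly(rows, resolve_n_jobs(n_jobs))
    # Serial in this sketch; the library runs this step in separate processes.
    transformed = [transform_func(part) for part in slices]
    # Flatten the per-slice results back into one list, preserving order.
    return [row for part in transformed for row in part]
```

Because the slices are contiguous and flattened in order, the output rows line up with the input rows regardless of how many slices were used.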

### Sklearn MapReduce transformer integration into Pipeline

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from mlmr.transformers import BaseMapReduceTransformer

def computation_costly_text_transformation(df):  # placeholder for expensive text preprocessing
    pass

class TextPreprocessor(BaseMapReduceTransformer):

    def transform_part(self, X):
        return computation_costly_text_transformation(X)

n_jobs = 4

text_classification_pipeline = Pipeline([
    ('text_preprocessor', TextPreprocessor(n_jobs=n_jobs)),
    ('vectorizer', TfidfVectorizer(analyzer="word", max_features=10000)),
    ('classifier', RandomForestClassifier(n_estimators=100, n_jobs=n_jobs))
])
```

Alternative implementation:

```python
import numpy as np  # needed by np.array_split below
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from mlmr.transformers import FunctionMapReduceTransformer

def get_split_data_func(n_slices):  # wrapper function of split data function
    def split_data(data):
        return np.array_split(data, n_slices)
    return split_data

def computation_costly_text_transformation(df):  # placeholder for expensive text preprocessing
    pass

n_jobs = 4

text_classification_pipeline = Pipeline([
    ('text_preprocessor', FunctionMapReduceTransformer(
        map_func=computation_costly_text_transformation,
        reduce_func=pd.concat,
        data_split_func=get_split_data_func(n_jobs),
        n_jobs=n_jobs
    )),
    ('vectorizer', TfidfVectorizer(analyzer="word", max_features=10000)),
    ('classifier', RandomForestClassifier(n_estimators=100, n_jobs=n_jobs))
])
```
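
Both pipeline variants rely on a transformer that exposes `fit` and `transform` and delegates the per-slice work to user code. One plausible shape for the subclassing variant is sketched below. `SketchMapReduceTransformer` and `LowercaseText` are hypothetical names, the class is not `mlmr`'s `BaseMapReduceTransformer`, and the slices are processed serially here rather than in worker processes.

```python
class SketchMapReduceTransformer:
    """Hypothetical stand-in for illustration; not mlmr's actual base class."""

    def __init__(self, n_jobs=1):
        self.n_jobs = n_jobs

    def fit(self, X, y=None):
        # Stateless preprocessing: nothing is learned from the data.
        return self

    def transform_part(self, X_part):
        # Hook that subclasses override to transform a single slice.
        raise NotImplementedError("subclasses transform one slice here")

    def transform(self, X):
        # Ceiling division yields at most n_jobs contiguous slices;
        # max(1, ...) keeps the step valid for empty input.
        size = max(1, -(-len(X) // self.n_jobs))
        parts = [X[i:i + size] for i in range(0, len(X), size)]
        out = []
        for part in parts:  # serial in this sketch
            out.extend(self.transform_part(part))
        return out


class LowercaseText(SketchMapReduceTransformer):
    def transform_part(self, X_part):
        return [text.lower() for text in X_part]
```

With this shape, a subclass only has to describe how one slice is transformed; splitting and recombining stay in the base class.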
