
DataFlow
========

[DEPRECATED] **You might want to take a look at [Luigi](https://github.com/spotify/luigi)**

I used to love working in [Blender's](https://www.blender.org) node editor in
my college days. That love was rekindled when I encountered Microsoft's
MLStudio, which I liked for the simple fact that I did not have to type much.
That is where DataFlow originates.

DataFlow provides nodes/boxes/operators that represent processing units. These
are connected via links/data paths/lines to show dependencies. There are four
main kinds of boxes (Source, Sink, DataData, DataEstimator). They are named so
because DataFlow tries to expose as much of the scikit-learn API as possible
via a data flow diagram.
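
Roughly, each kind of box corresponds to a generated Python function with a
characteristic signature. The sketch below is illustrative only (the function
names are made up, not DataFlow's actual output), modeled on the generated
script shown later in this README:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Source: creates something out of nothing, e.g. reading a dataset
# or constructing an estimator.
def read_csv_source():
    return pd.read_csv("data.csv")

def forest_source():
    return RandomForestClassifier(n_jobs=-1)

# DataData: takes a dataset and returns a transformed dataset.
def dropna_datadata(data=None):
    return data.dropna()

# DataEstimator: takes a dataset and an estimator and returns the
# trained estimator; typically a training block.
def train_dataestimator(est=None, data=None):
    X, y = data.drop("target", axis=1), data.target
    est.fit(X, y)
    return est

# Sink: consumes an input and produces no output, e.g. printing.
def print_sink(inp=None):
    print(inp)
```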

Todo, ideas, and notes
-------

- [x] Support flow drawing
- [x] Generate py code
- [ ] Run code on server?
- [ ] Run independent code on clusters?
- [ ] Support more languages?
- [ ] Need a better interface design. I need to learn more about front-end web development.
- [ ] Apologize to [sdrdis](https://github.com/sdrdis/jquery.flowchart) for cannibalizing the repo. Find a way to use it as a proper submodule.

Stability and Plans
---------

- This is a very early sketch of what came to mind. Things will change.
- In the future DataFlow may not remain limited to Python and may take the path of Jupyter. (Oh how I wish!)
- If you have a suggestion I'd love to hear. Open up an issue in the repo!
- If you have a PR, even better.

Get Started
-----------

Here's a typical `bash` session from start to finish.

```bash
virtualenv -p python3 env
source env/bin/activate
git clone https://github.com/theSage21/dataflow dataflow
cd dataflow
pip install -r requirements.txt
dataflow
```

This should start a Bottle web server at `0.0.0.0:8080`. When you navigate to that address in a browser, you will see a simple page.

1. There is a textbox at the top. This shows code associated with blocks/boxes. It is editable.
2. There is a workspace right in the middle of the screen. This is where all the diagrams go.
3. There are buttons on the left. Clicking one adds a box to the workspace; new boxes appear at the top-left corner.
4. You can move the boxes around. To connect boxes, click on an output and then on an input.
5. Once your diagram is done, click on the `make py` button at the bottom.
6. It will ask you for a name to give to your diagram.
7. You can now navigate to the `/scripts` url and see all the scripts.

**You can delete parts of the diagram by selecting and pressing 'x' / 'del'**

In case you don't like what is on offer, you can add a custom button using the
buttons given at the bottom. The letters included inside `[ ]` act as hotkeys
to add those types of boxes.

Trivia
-------------

- There are four main types of blocks:
  - **Source** blocks create something out of nothing. These are the blocks which read datasets and create estimators.
  - **D2D** (data-to-data) blocks transform data in some way.
  - **DE** (data-and-estimator) blocks take in a dataset and an estimator and return a transformed dataset and estimator. These are typically training blocks.
  - **Sink** blocks, like the `print` block, consume data and create no outputs.
- All blocks can be connected in a data flow diagram.
- Controls: select part of a diagram and press 'x' / 'del' to delete it; the letters inside `[ ]` act as hotkeys for adding boxes.

To use dataflow, run `dataflow` and navigate to `127.0.0.1:8080` in your browser. The data blocks can be added and connected by hand.

Once the `make py` command is issued, the script is generated in `static/scripts/`.
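
Since the generated scripts are plain Python files, they can be listed and run directly. A small illustrative sketch (the name `my_pipeline.py` is hypothetical, standing in for whatever name you gave your diagram):

```python
import subprocess
from pathlib import Path

# List every script DataFlow has generated so far.
for script in sorted(Path("static/scripts").glob("*.py")):
    print(script.name)

# Run one of them; "my_pipeline.py" is a made-up example name.
subprocess.run(["python", "static/scripts/my_pipeline.py"], check=True)
```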

![Main page screenshot](screenshots/main.png)

This flowchart generates the following code
--------------------------------------------

```python
# Generated on
# 2017-01-31 17:31:48.506545
# via DataFlow: https://github.com/theSage21/dataflow

import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score, accuracy_score, f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

#########################################################

def ReadCsv0():
    data = pd.read_csv("data.csv")
    return data

def RandomForestClassifier1():
    est = RandomForestClassifier(n_jobs=-1)
    return est

def TrTsSplit2(data=None):
    X, Y = data.drop("target", axis=1), data.target
    x_tr, x_ts, y_tr, y_ts = train_test_split(X, Y, test_size=0.25)
    x_tr["target"] = y_tr
    x_ts["target"] = y_ts
    return x_tr, x_ts

def TrainClassifier3(est=None, data=None):
    X, Y = data.drop("target", axis=1), data.target
    est.fit(X, Y)
    return est

def Score4(est=None, data=None):
    X, Y = data.drop("target", axis=1), data.target
    p = est.predict(X)
    score = roc_auc_score(Y, p)
    return score

def Print7(inp=None):
    print(inp)
    return

def Print8(inp=None):
    print(inp)
    return

def Print9(inp=None):
    print(inp)
    return

def Score5(est=None, data=None):
    X, Y = data.drop("target", axis=1), data.target
    p = est.predict(X)
    score = accuracy_score(Y, p)
    return score

def Score6(est=None, data=None):
    X, Y = data.drop("target", axis=1), data.target
    p = est.predict(X)
    score = f1_score(Y, p)
    return score

#########################
#MAIN
#########################
# Parts within steps can be run in parallel

# Step --------------------------<[1]>-

var1 = RandomForestClassifier1()
var0 = ReadCsv0()

# Step --------------------------<[2]>-

var2, var3 = TrTsSplit2(var0)

# Step --------------------------<[3]>-

var4 = TrainClassifier3(var1, var2)

# Step --------------------------<[4]>-

var7 = Score6(var4, var3)
var6 = Score5(var4, var3)
var5 = Score4(var4, var3)

# Step --------------------------<[5]>-

Print9(var5)
Print8(var6)
Print7(var7)
```
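
The `# Parts within steps can be run in parallel` comment reflects the step structure: blocks in the same step have no dependencies on each other. As a minimal, hypothetical sketch (not something DataFlow emits), step 4 above could be parallelized with the standard library:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical rewrite of step 4: the three Score blocks share their
# inputs (var4, var3) but not their outputs, so they can run side by side.
with ThreadPoolExecutor() as pool:
    f5 = pool.submit(Score4, var4, var3)  # roc_auc_score
    f6 = pool.submit(Score5, var4, var3)  # accuracy_score
    f7 = pool.submit(Score6, var4, var3)  # f1_score
    var5, var6, var7 = f5.result(), f6.result(), f7.result()
```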