# Ocean

A template creation tool for Machine Learning and Data Science projects.

πŸ‡·πŸ‡Ί [Π—Π΄Π΅ΡΡŒ](README_ru.md) Π»Π΅ΠΆΠΈΡ‚ русскоязычная вСрсия этого README.

## Table of contents

* [tldr](#tldr)
  * [Installation](#installation)
  * [Usage](#usage)
* [History and main features](#history-and-main-features)
  * [Cookiecutter-data-science](#cookiecutter-data-science)
  * [Experiments](#experiments)

## tldr

### Installation

1) Install Sphinx for automatic documentation support.

Follow [this link](http://www.sphinx-doc.org/en/1.4/install.html) for installation instructions. The preferred way to install it is via pip3: `pip3 install -U sphinx`.

2) Execute the following commands in a terminal:
```
sudo -i
git clone https://github.com/EnlightenedCSF/Ocean.git
cd Ocean
pip install --upgrade .
```

### Usage
Creating a new project:
```
# -n  ! must be provided !
# -a  default is `Surf`
# -v  default is `0.0.1`
# -d  default is ``
# -l  default is `MIT`
# -p  default is `.`
ocean project new -n "" \
                  -a "" \
                  -v "" \
                  -d "" \
                  -l "" \
                  -p ""
```

Install the project code as a package:
```
make -B package
```

Creating a new experiment in the project:
```
# -n  ! must be provided !
# -a  ! must be provided !
ocean exp new -n "" -a ""
```

## History and main features

### Cookiecutter-data-science

The project is based on the [cookiecutter-data-science](https://drivendata.github.io/cookiecutter-data-science/) template, but modifies it. Before continuing, I highly recommend following the link and taking a look, because many of the key points listed there are important.

---

Let's see how the original cookiecutter is structured:

```
β”œβ”€β”€ LICENSE
β”œβ”€β”€ Makefile           <- Makefile with commands like `make data` or `make train`
β”œβ”€β”€ README.md          <- The top-level README for developers using this project.
β”œβ”€β”€ data
β”‚   β”œβ”€β”€ external       <- Data from third party sources.
β”‚   β”œβ”€β”€ interim        <- Intermediate data that has been transformed.
β”‚   β”œβ”€β”€ processed      <- The final, canonical data sets for modeling.
β”‚   └── raw            <- The original, immutable data dump.
β”‚
β”œβ”€β”€ docs               <- A default Sphinx project; see sphinx-doc.org for details
β”‚
β”œβ”€β”€ models             <- Trained and serialized models, model predictions, or model summaries
β”‚
β”œβ”€β”€ notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
β”‚                         the creator's initials, and a short `-` delimited description, e.g.
β”‚                         `1.0-jqp-initial-data-exploration`.
β”‚
β”œβ”€β”€ references         <- Data dictionaries, manuals, and all other explanatory materials.
β”‚
β”œβ”€β”€ reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
β”‚   └── figures        <- Generated graphics and figures to be used in reporting
β”‚
β”œβ”€β”€ requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
β”‚                         generated with `pip freeze > requirements.txt`
β”‚
β”œβ”€β”€ setup.py           <- Make this project pip installable with `pip install -e`
β”œβ”€β”€ src                <- Source code for use in this project.
β”‚   β”œβ”€β”€ __init__.py    <- Makes src a Python module
β”‚   β”‚
β”‚   β”œβ”€β”€ data           <- Scripts to download or generate data
β”‚   β”‚   └── make_dataset.py
β”‚   β”‚
β”‚   β”œβ”€β”€ features       <- Scripts to turn raw data into features for modeling
β”‚   β”‚   └── build_features.py
β”‚   β”‚
β”‚   β”œβ”€β”€ models         <- Scripts to train models and then use trained models to make
β”‚   β”‚   β”‚                 predictions
β”‚   β”‚   β”œβ”€β”€ predict_model.py
β”‚   β”‚   └── train_model.py
β”‚   β”‚
β”‚   └── visualization  <- Scripts to create exploratory and results oriented visualizations
β”‚       └── visualize.py
β”‚
└── tox.ini            <- tox file with settings for running tox; see tox.testrun.org

```

---

Several things can be improved right away:
1. we added a `make docs` command for automatic generation of Sphinx documentation from the docstrings of the whole `src` module;
2. we added a convenient file logger (and a `logs` folder to go with it);
3. we added a coordinator entity for easy navigation around the project, removing the need to write `os.path.join`, `os.path.abspath`, or `os.path.dirname` every time (a sketch of the idea follows this list).
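
The coordinator ships with Ocean itself; the snippet below is only a minimal sketch of the idea, and the names used here (`Coordinator`, `data_raw`, `models`) are hypothetical rather than Ocean's actual API:

```python
from pathlib import Path


class Coordinator:
    """Hypothetical path coordinator: one object that knows the project layout."""

    def __init__(self, root: str) -> None:
        self.root = Path(root).resolve()

    def data_raw(self, *parts: str) -> Path:
        # e.g. coord.data_raw("dataset.csv") -> <root>/data/raw/dataset.csv
        return self.root.joinpath("data", "raw", *parts)

    def models(self, *parts: str) -> Path:
        return self.root.joinpath("models", *parts)


coord = Coordinator(".")
train_path = coord.data_raw("dataset.csv")    # no os.path.join chains in user code
model_path = coord.models("baseline.joblib")
```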

But what problems are there?

* The `data` folder can grow significantly, yet which script or notebook generated each file remains a mystery. The sheer number of files stored there can be confusing, and it is unclear whether any of them are useful when implementing a new feature, because there is no place for descriptions and explanations.
* The `data` folder lacks a `features` subfolder, which would be a good addition: it could store computed statistics, embeddings, and other features. There is [a nice write-up](https://www.logicalclocks.com/feature-store/) about this that I strongly recommend.
* The `src` folder is another problem. It mixes functionality that is relevant project-wide (like the `src.data` submodule) with functionality relevant only to concrete, often small sub-tasks (like `src.models`).
* The `references` folder exists, but it is an open question who should put records there, when, and how. And there is a lot to explain during development: which experiments have been done, what the results were, and what we are doing next.

To solve the listed problems, I introduce the _experiment_ entity.

### Experiments

So, an _experiment_ is a place that contains all the data relevant to checking a single hypothesis.

Including:
* What data was used
* What data (or artefacts) was produced
* Code version
* Timestamps of the beginning and end of the experiment
* Source file
* Parameters
* Metrics
* Logs

Many of these things can be logged with tracking utilities like [mlflow](https://mlflow.org/docs/latest/tracking.html), but that alone is not enough; we can improve the workflow further.
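
For instance, a typical tracking snippet inside a training script might look like the following. This is a minimal sketch: the experiment name, parameter values, metric values, and the artifact path are all made up for illustration.

```python
import mlflow

mlflow.set_experiment("exp-001-Tree-models")  # hypothetical experiment name

with mlflow.start_run():
    # parameters of this run
    mlflow.log_param("max_depth", 8)
    mlflow.log_param("n_estimators", 200)

    # ... train the model here ...

    # resulting metrics and artifacts
    mlflow.log_metric("val_accuracy", 0.87)
    mlflow.log_artifact("models/model.pkl")  # assumes the dumped model exists at this path
```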

This is what an example experiment looks like:

```

└── experiments
    β”œβ”€β”€ exp-001-Tree-models
    β”‚   β”œβ”€β”€ config           <- yaml files with grid search parameters or just model parameters
    β”‚   β”œβ”€β”€ models           <- dumped models
    β”‚   β”œβ”€β”€ notebooks        <- notebooks for research
    β”‚   β”œβ”€β”€ scripts          <- scripts like train.py or predict.py
    β”‚   β”œβ”€β”€ Makefile         <- for handling the experiment with just a few commands in the console
    β”‚   β”œβ”€β”€ requirements.txt <- dependent libraries
    β”‚   └── log.md           <- log of how the experiment is going
    β”‚
    β”œβ”€β”€ exp-002-Gradient-boosting
    ...
```

Let's take a look at the workflow for one experiment.
1. Notebooks are created in which the data is prepared for a model and the model's structure is worked out.
2. Once the code is ready, it is moved to `train.py`.
    - You might track model parameters from there (for instance, with `mlflow`).
    - Create a corresponding `config` file for the training configuration.
    - The code should be callable from the console.
    - It could take paths to the data, the `config` file, and the directory to dump the model to (see the sketch after this list).
3. Then the Makefile is modified so that training can be started from the console with a command like `make train`.
4. Many models are trained, and all the metrics and parameters are sent to `mlflow`. One can use `mlflow ui` to check the results.
5. Finally, all results are recorded in `log.md`. This has some [impact analysis](https://en.wikipedia.org/wiki/Change_impact_analysis) elements: the developer needs to point out which data was used and which data was generated. This information can then be used to automatically generate a readme file for the `data` folder, clarifying where each file is used.
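
As an illustration of step 2, a minimal `train.py` skeleton might look like the code below. This is only a sketch under assumptions of my own: the argument names, the yaml config format, the `target` column, and the scikit-learn/joblib choices are not prescribed by Ocean.

```python
# scripts/train.py -- sketch of a console-runnable training script.
import argparse
from pathlib import Path

import joblib
import pandas as pd
import yaml
from sklearn.ensemble import RandomForestClassifier  # example model


def main() -> None:
    parser = argparse.ArgumentParser(description="Train a model for one experiment.")
    parser.add_argument("--data", required=True, help="path to the training data (csv)")
    parser.add_argument("--config", required=True, help="path to the yaml config file")
    parser.add_argument("--model-dir", required=True, help="directory to dump the model to")
    args = parser.parse_args()

    # model parameters come from the experiment's config file
    with open(args.config) as f:
        params = yaml.safe_load(f)

    df = pd.read_csv(args.data)
    X, y = df.drop(columns=["target"]), df["target"]  # "target" column is an assumption

    model = RandomForestClassifier(**params)
    model.fit(X, y)

    out_dir = Path(args.model_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    joblib.dump(model, out_dir / "model.joblib")


if __name__ == "__main__":
    main()
```

A `train` target in the experiment's Makefile (step 3) would then simply invoke this script with the right paths.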