# Ocean

A template creation tool for Machine Learning and Data Science projects.

πŸ‡·πŸ‡Ί [Π—Π΄Π΅ΡΡŒ](README_ru.md) Π»Π΅ΠΆΠΈΡ‚ русскоязычная вСрсия этого README.

## Table of contents

* [tldr](#tldr)
  * [Installation](#installation)
  * [Usage](#usage)
* [History and main features](#history-and-main-features)
  * [Cookiecutter-data-science](#cookiecutter-data-science)
  * [Experiments](#experiments)

## tldr

### Installation

1) Install Sphinx for automatic documentation support.

Follow [this link](http://www.sphinx-doc.org/en/1.4/install.html) for installation instructions. The preferred way to install it is via pip3: `pip3 install -U sphinx`.

2) Execute the following commands in a terminal:
```
sudo -i
git clone https://github.com/EnlightenedCSF/Ocean.git
cd Ocean
pip install --upgrade .
```

### Usage
Creating a new project:
```
# -n  ! must be provided !
# -a  default is `Surf`
# -v  default is `0.0.1`
# -d  default is ``
# -l  default is `MIT`
# -p  default is `.`
ocean project new -n "" \
                  -a "" \
                  -v "" \
                  -d "" \
                  -l "" \
                  -p ""
```

Install the project code as a package:
```
make -B package
```

Creating a new experiment in the project:
```
# -n  ! must be provided !
# -a  ! must be provided !
ocean exp new -n "" -a ""
```

## History and main features

### Cookiecutter-data-science

The project is based on the [cookiecutter-data-science](https://drivendata.github.io/cookiecutter-data-science/) template, but modifies it. Before continuing, I highly recommend following the link and taking a look, because many of the key points listed there are important.

---

Let's see how the original cookiecutter is structured:

```
β”œβ”€β”€ LICENSE
β”œβ”€β”€ Makefile           <- Makefile with commands like `make data` or `make train`
β”œβ”€β”€ README.md          <- The top-level README for developers using this project.
β”œβ”€β”€ data
β”‚   β”œβ”€β”€ external       <- Data from third party sources.
β”‚   β”œβ”€β”€ interim        <- Intermediate data that has been transformed.
β”‚   β”œβ”€β”€ processed      <- The final, canonical data sets for modeling.
β”‚   └── raw            <- The original, immutable data dump.
β”‚
β”œβ”€β”€ docs               <- A default Sphinx project; see sphinx-doc.org for details
β”‚
β”œβ”€β”€ models             <- Trained and serialized models, model predictions, or model summaries
β”‚
β”œβ”€β”€ notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
β”‚                         the creator's initials, and a short `-` delimited description, e.g.
β”‚                         `1.0-jqp-initial-data-exploration`.
β”‚
β”œβ”€β”€ references         <- Data dictionaries, manuals, and all other explanatory materials.
β”‚
β”œβ”€β”€ reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
β”‚   └── figures        <- Generated graphics and figures to be used in reporting
β”‚
β”œβ”€β”€ requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
β”‚                         generated with `pip freeze > requirements.txt`
β”‚
β”œβ”€β”€ setup.py           <- Make this project pip installable with `pip install -e`
β”œβ”€β”€ src                <- Source code for use in this project.
β”‚   β”œβ”€β”€ __init__.py    <- Makes src a Python module
β”‚   β”‚
β”‚   β”œβ”€β”€ data           <- Scripts to download or generate data
β”‚   β”‚   └── make_dataset.py
β”‚   β”‚
β”‚   β”œβ”€β”€ features       <- Scripts to turn raw data into features for modeling
β”‚   β”‚   └── build_features.py
β”‚   β”‚
β”‚   β”œβ”€β”€ models         <- Scripts to train models and then use trained models to make
β”‚   β”‚   β”‚                 predictions
β”‚   β”‚   β”œβ”€β”€ predict_model.py
β”‚   β”‚   └── train_model.py
β”‚   β”‚
β”‚   └── visualization  <- Scripts to create exploratory and results oriented visualizations
β”‚       └── visualize.py
β”‚
└── tox.ini            <- tox file with settings for running tox; see tox.testrun.org

```

---

Several things can be improved right away:
1. we added a `make docs` command for automatic generation of Sphinx documentation from the docstrings of the whole `src` module;
2. we added a convenient file logger (and a `logs` folder to go with it);
3. we added a coordinator entity for easy navigation around the project, removing the need to write `os.path.join`, `os.path.abspath`, or `os.path.dirname` every time (a sketch of the idea follows this list).
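
The coordinator ships with Ocean itself; the snippet below is only a minimal sketch of the idea, and the names used here (`Coordinator`, `data_raw`, `models`) are hypothetical rather than Ocean's actual API:

```python
from pathlib import Path


class Coordinator:
    """Hypothetical path coordinator: one object that knows the project layout."""

    def __init__(self, root: str) -> None:
        self.root = Path(root).resolve()

    def data_raw(self, *parts: str) -> Path:
        # e.g. coord.data_raw("dataset.csv") -> <root>/data/raw/dataset.csv
        return self.root.joinpath("data", "raw", *parts)

    def models(self, *parts: str) -> Path:
        return self.root.joinpath("models", *parts)


coord = Coordinator(".")
train_path = coord.data_raw("dataset.csv")    # no os.path.join chains in user code
model_path = coord.models("baseline.joblib")
```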

But what problems are there?

* The `data` folder can grow significantly, yet which script or notebook generated each file remains a mystery. The sheer number of files stored there can be confusing, and it is unclear whether any of them are useful when implementing a new feature, because there is no place for descriptions and explanations.
* The `data` folder lacks a `features` subfolder, which would be a good addition: it could store computed statistics, embeddings, and other features. There is [a nice write-up](https://www.logicalclocks.com/feature-store/) about this that I strongly recommend.
* The `src` folder is another problem. It mixes functionality that is relevant project-wide (like the `src.data` submodule) with functionality relevant only to concrete, often small sub-tasks (like `src.models`).
* The `references` folder exists, but it is an open question who should put records there, when, and how. And there is a lot to explain during development: which experiments have been done, what the results were, and what we are doing next.

To solve the listed problems, I introduce the _experiment_ entity.

### Experiments

So, an _experiment_ is a place that contains all the data relevant to checking a single hypothesis.

Including:
* What data was used
* What data (or artefacts) was produced
* Code version
* Timestamps of the beginning and end of the experiment
* Source file
* Parameters
* Metrics
* Logs

Many of these things can be logged with tracking utilities like [mlflow](https://mlflow.org/docs/latest/tracking.html), but that alone is not enough; we can improve the workflow further.
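
For instance, a typical tracking snippet inside a training script might look like the following. This is a minimal sketch: the experiment name, parameter values, metric values, and the artifact path are all made up for illustration.

```python
import mlflow

mlflow.set_experiment("exp-001-Tree-models")  # hypothetical experiment name

with mlflow.start_run():
    # parameters of this run
    mlflow.log_param("max_depth", 8)
    mlflow.log_param("n_estimators", 200)

    # ... train the model here ...

    # resulting metrics and artifacts
    mlflow.log_metric("val_accuracy", 0.87)
    mlflow.log_artifact("models/model.pkl")  # assumes the dumped model exists at this path
```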

This is what an example experiment looks like:

```

└── experiments
    β”œβ”€β”€ exp-001-Tree-models
    β”‚   β”œβ”€β”€ config           <- yaml files with grid search parameters or just model parameters
    β”‚   β”œβ”€β”€ models           <- dumped models
    β”‚   β”œβ”€β”€ notebooks        <- notebooks for research
    β”‚   β”œβ”€β”€ scripts          <- scripts like train.py or predict.py
    β”‚   β”œβ”€β”€ Makefile         <- for handling the experiment with just a few commands in the console
    β”‚   β”œβ”€β”€ requirements.txt <- dependent libraries
    β”‚   └── log.md           <- log of how the experiment is going
    β”‚
    β”œβ”€β”€ exp-002-Gradient-boosting
    ...
```

Let's take a look at the workflow for one experiment.
1. Notebooks are created in which the data is prepared for a model and the model's structure is worked out.
2. Once the code is ready, it is moved to `train.py`.
    - You might track model parameters from there (for instance, with `mlflow`).
    - Create a corresponding `config` file for the training configuration.
    - The code should be callable from the console.
    - It could take paths to the data, the `config` file, and the directory to dump the model to (see the sketch after this list).
3. Then the Makefile is modified so that training can be started from the console with a command like `make train`.
4. Many models are trained, and all the metrics and parameters are sent to `mlflow`. One can use `mlflow ui` to check the results.
5. Finally, all results are recorded in `log.md`. This has some [impact analysis](https://en.wikipedia.org/wiki/Change_impact_analysis) elements: the developer needs to point out which data was used and which data was generated. This information can then be used to automatically generate a readme file for the `data` folder, clarifying where each file is used.
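
As an illustration of step 2, a minimal `train.py` skeleton might look like the code below. This is only a sketch under assumptions of my own: the argument names, the yaml config format, the `target` column, and the scikit-learn/joblib choices are not prescribed by Ocean.

```python
# scripts/train.py -- sketch of a console-runnable training script.
import argparse
from pathlib import Path

import joblib
import pandas as pd
import yaml
from sklearn.ensemble import RandomForestClassifier  # example model


def main() -> None:
    parser = argparse.ArgumentParser(description="Train a model for one experiment.")
    parser.add_argument("--data", required=True, help="path to the training data (csv)")
    parser.add_argument("--config", required=True, help="path to the yaml config file")
    parser.add_argument("--model-dir", required=True, help="directory to dump the model to")
    args = parser.parse_args()

    # model parameters come from the experiment's config file
    with open(args.config) as f:
        params = yaml.safe_load(f)

    df = pd.read_csv(args.data)
    X, y = df.drop(columns=["target"]), df["target"]  # "target" column is an assumption

    model = RandomForestClassifier(**params)
    model.fit(X, y)

    out_dir = Path(args.model_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    joblib.dump(model, out_dir / "model.joblib")


if __name__ == "__main__":
    main()
```

A `train` target in the experiment's Makefile (step 3) would then simply invoke this script with the right paths.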