https://github.com/surfstudio/ocean
A workflow managing tool for Machine Learning and Data Science projects
- Host: GitHub
- URL: https://github.com/surfstudio/ocean
- Owner: surfstudio
- License: MIT
- Created: 2019-01-23T12:00:23.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2019-10-27T11:23:22.000Z (almost 6 years ago)
- Last Synced: 2025-01-25T22:58:03.585Z (9 months ago)
- Language: HTML
- Size: 1.93 MB
- Stars: 17
- Watchers: 6
- Forks: 3
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Ocean
A template creation tool for Machine Learning and Data Science projects.
🇷🇺 A Russian-language version of this README is available [here](README_ru.md).
## Table of contents
* [tldr](#tldr)
* [Installation](#installation)
* [Usage](#usage)
* [History and main features](#history-and-main-features)
* [Cookiecutter-data-science](#cookiecutter-data-science)
* [Experiments](#experiments)

## tldr
### Installation
1) Install Sphinx for automatic documentation support.
Follow [this link](http://www.sphinx-doc.org/en/1.4/install.html) for the installation instructions. The preferred way to install it is via pip3: `pip3 install -U sphinx`.
2) Execute commands in Terminal:
```
sudo -i
git clone https://github.com/EnlightenedCSF/Ocean.git
cd Ocean  # enter the cloned repository
pip install --upgrade .
```

### Usage
Creating a new project:
```
ocean project new -n "" \ # ! must be provided !
-a "" \ # default is `Surf`
-v "" \ # default is `0.0.1`
-d "" \ # default is ``
-l "" \ # default is `MIT`
-p "" # default is `.`
```

Install the project code as a package:
```
make -B package
```

Creating a new experiment in the project:
```
ocean exp new -n "" # ! must be provided !
-a "" # ! must be provided !
```

## History and main features
### Cookiecutter-data-science
The project is based on the [cookiecutter-data-science](https://drivendata.github.io/cookiecutter-data-science/) template, but modifies it. Before reading further, I highly recommend following the link and taking a look, because many of the key points listed there are important.
---
Let's see how the original cookiecutter is structured:
```
├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- Make this project pip installable with `pip install -e`
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.testrun.org
```
---
We upgraded it right away:
1. we added a `make docs` command that automatically generates Sphinx documentation from the docstrings of the whole `src` module;
2. we added a convenient file logger (and, accordingly, a `logs` folder);
3. we added a coordinator entity for easy navigation throughout the project, removing the need to write `os.path.join`, `os.path.abspath`, or `os.path.dirname` every time (see the sketch below).
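
To illustrate item 3, here is a minimal sketch of what such a coordinator might look like. The `Coordinator` name and its `get_path` method are illustrative assumptions, not necessarily Ocean's actual API:

```python
from pathlib import Path

class Coordinator:
    """Hypothetical coordinator: resolves every path from a single project root,
    so notebooks and scripts never call os.path.join / os.path.abspath themselves."""

    def __init__(self, root: str = "."):
        self.root = Path(root).resolve()

    def get_path(self, *parts: str) -> Path:
        # e.g. get_path("data", "raw") -> <project>/data/raw
        return self.root.joinpath(*parts)

coord = Coordinator(".")
raw_data_dir = coord.get_path("data", "raw")
model_dump_path = coord.get_path("models", "model.pkl")
```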
But what problems remained?
* The `data` folder can grow significantly, and which script or notebook generated each file is a mystery. The number of different files stored there can be misleading, and it is not clear whether any of them is useful for implementing a new feature, because there is no place for descriptions and explanations.
* The `data` folder lacks a `features` submodule, which could be put to good use: one can store calculated statistics, embeddings, and other features there. There is [a nice write-up](https://www.logicalclocks.com/feature-store/) about this which I strongly recommend.
* The `src` folder is another problem. It contains both functionality that is relevant project-wide (like the `src.data` submodule) and functionality relevant to concrete, often small sub-tasks (like `src.models`).
* The `references` folder exists, but it remains an open question who should put records there, and when and how. And there is a lot to explain during development: which experiments have been done, what the results were, and what we are doing next.

To solve the listed problems, I introduce the _experiment_ entity.
### Experiments
So, an _experiment_ is a place that contains all the data relevant to checking a single hypothesis.
Including:
* What data was used
* What data (or artefacts) were produced
* Code version
* Timestamps of the beginning and end of the experiment
* Source file
* Parameters
* Metrics
* Logs

Many things can be logged via tracker utilities like [mlflow](https://mlflow.org/docs/latest/tracking.html), but it is not enough. We can improve our workflow.
This is what an example experiment looks like:
```
└── experiments
    ├── exp-001-Tree-models
    │   ├── config            <- yaml files with grid search parameters or just model parameters
    │   ├── models            <- dumped models
    │   ├── notebooks         <- notebooks for research
    │   ├── scripts           <- scripts like train.py or predict.py
    │   ├── Makefile          <- for running the experiment with just a few words typed in the console
    │   ├── requirements.txt  <- dependent libraries
    │   └── log.md            <- a log of how the experiment is going
    │
    ├── exp-002-Gradient-boosting
    ...
```

Let's take a look at the workflow for one experiment.
1. Notebooks are created in which the data is prepared for the model and the model's structure is introduced.
2. Once the code is ready, it is moved to `train.py` (see the sketch after this list).
   - You might track model parameters from there (for instance, with `mlflow`).
   - Create a relevant `config` file for the training configuration.
   - The code should be callable from the console.
   - It could take paths to the data, to the `config` file, and to the directory to dump the model to.
3. Then the Makefile is modified to start the training process from the console. Provide a command like `make train`.
4. Many models are trained, and all the metrics and parameters are sent to `mlflow`. One can use `mlflow ui` to check the results.
5. Finally, all results are recorded in `log.md`. It has some [impact analysis](https://en.wikipedia.org/wiki/Change_impact_analysis) elements: the developer needs to point out what data was used and what data was generated. This clarification can be used to automatically generate a README for the `data` folder, clarifying where each file is used.
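
As promised in step 2, here is a minimal sketch of what such a `train.py` could look like. It is only an illustration under assumptions: the `target` column, the model choice, and the file layout are hypothetical, and Ocean does not generate this exact file:

```python
# scripts/train.py (illustrative): callable from the console, reads a yaml config,
# logs parameters and metrics to mlflow, and dumps the trained model to a directory.
import argparse
import pickle
from pathlib import Path

import mlflow
import pandas as pd
import yaml
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", required=True, help="path to the training data (csv)")
    parser.add_argument("--config", required=True, help="path to the yaml config")
    parser.add_argument("--model-dir", required=True, help="directory to dump the model to")
    args = parser.parse_args()

    with open(args.config) as f:
        params = yaml.safe_load(f)  # e.g. {"n_estimators": 100, "max_depth": 5}

    df = pd.read_csv(args.data)
    X, y = df.drop(columns=["target"]), df["target"]  # "target" is an assumed column name
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

    with mlflow.start_run():
        mlflow.log_params(params)
        model = RandomForestClassifier(**params)
        model.fit(X_train, y_train)
        mlflow.log_metric("val_accuracy", model.score(X_val, y_val))

        model_path = Path(args.model_dir) / "model.pkl"
        model_path.parent.mkdir(parents=True, exist_ok=True)
        with open(model_path, "wb") as f:
            pickle.dump(model, f)


if __name__ == "__main__":
    main()
```

A `make train` target in the experiment's Makefile could then wrap a call like `python scripts/train.py --data <path> --config config/params.yml --model-dir models` (paths illustrative).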