{"id":21748347,"url":"https://github.com/surfstudio/ocean","last_synced_at":"2025-03-21T02:25:20.609Z","repository":{"id":71210894,"uuid":"167175927","full_name":"surfstudio/ocean","owner":"surfstudio","description":"A workflow managing tool for Machine Learning and Data Science projects","archived":false,"fork":false,"pushed_at":"2019-10-27T11:23:22.000Z","size":2023,"stargazers_count":17,"open_issues_count":2,"forks_count":3,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-01-25T22:58:03.585Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/surfstudio.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-01-23T12:00:23.000Z","updated_at":"2023-02-18T19:20:56.000Z","dependencies_parsed_at":"2023-09-01T13:30:54.406Z","dependency_job_id":null,"html_url":"https://github.com/surfstudio/ocean","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/surfstudio%2Focean","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/surfstudio%2Focean/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/surfstudio%2Focean/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/surfstudio%2Focean/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/surfstudio","download_url":"https://codeload.github.com/surfstudio/ocean/tar.gz/refs/he
ads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244723796,"owners_count":20499332,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-26T08:13:09.089Z","updated_at":"2025-03-21T02:25:20.603Z","avatar_url":"https://github.com/surfstudio.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Ocean\n\nA template creation tool for Machine Learning and Data Science projects.\n\n🇷🇺 The Russian version of this README is available [here](README_ru.md).\n\n## Table of contents\n\n* [tldr](#tldr)\n    * [Installation](#Installation)\n    * [Usage](#Usage)\n* [History and main features](#History-and-main-features)\n    * [Cookiecutter-data-science](#Cookiecutter-data-science)\n    * [Experiments](#Experiments)\n\n## tldr\n\n### Installation\n\n1) Install Sphinx for automatic documentation support.\n\nFollow [this link](http://www.sphinx-doc.org/en/1.4/install.html) for the installation instructions. The preferred way to install it is via pip3: `pip3 install -U sphinx`.\n\n2) Execute the following commands in the terminal:\n```\nsudo -i\ngit clone https://github.com/EnlightenedCSF/Ocean.git\ncd \u003ccloned repo\u003e\npip install --upgrade .\n```\n\n### Usage\nCreating a new project:\n```\nocean project new -n \"\u003cproject_name\u003e\" \\    # ! 
must be provided !\n                  -a \"\u003cauthor\u003e\" \\          # default is `Surf`\n                  -v \"\u003cversion\u003e\" \\         # default is `0.0.1`\n                  -d \"\u003cdescription\u003e\" \\     # default is ``\n                  -l \"\u003clicence\u003e\" \\         # default is `MIT`\n                  -p \"\u003cpath\u003e\"              # default is `.`\n```\n\nInstall the project code as a package:\n```\nmake -B package\n```\n\nCreating a new experiment in the project:\n```\nocean exp new -n \"\u003cexp_name\u003e\"   # ! must be provided !\n              -a \"\u003cauthor\u003e\"     # ! must be provided !\n```\n\n## History and main features\n\n### Cookiecutter-data-science\n\nThe project is based on the [cookiecutter-data-science](https://drivendata.github.io/cookiecutter-data-science/) template, with modifications. Before reading on, I highly recommend following the link and taking a look, because many of the key points listed there are important.\n\n---\n\n\n\u003cdetails\u003e\n    \u003csummary\u003eLet's see how the original cookiecutter is structured:\u003c/summary\u003e\n\n```\n├── LICENSE\n├── Makefile           \u003c- Makefile with commands like `make data` or `make train`\n├── README.md          \u003c- The top-level README for developers using this project.\n├── data\n│   ├── external       \u003c- Data from third party sources.\n│   ├── interim        \u003c- Intermediate data that has been transformed.\n│   ├── processed      \u003c- The final, canonical data sets for modeling.\n│   └── raw            \u003c- The original, immutable data dump.\n│\n├── docs               \u003c- A default Sphinx project; see sphinx-doc.org for details\n│\n├── models             \u003c- Trained and serialized models, model predictions, or model summaries\n│\n├── notebooks          \u003c- Jupyter notebooks. 
Naming convention is a number (for ordering),\n│                         the creator's initials, and a short `-` delimited description, e.g.\n│                         `1.0-jqp-initial-data-exploration`.\n│\n├── references         \u003c- Data dictionaries, manuals, and all other explanatory materials.\n│\n├── reports            \u003c- Generated analysis as HTML, PDF, LaTeX, etc.\n│   └── figures        \u003c- Generated graphics and figures to be used in reporting\n│\n├── requirements.txt   \u003c- The requirements file for reproducing the analysis environment, e.g.\n│                         generated with `pip freeze \u003e requirements.txt`\n│\n├── setup.py           \u003c- Make this project pip installable with `pip install -e`\n├── src                \u003c- Source code for use in this project.\n│   ├── __init__.py    \u003c- Makes src a Python module\n│   │\n│   ├── data           \u003c- Scripts to download or generate data\n│   │   └── make_dataset.py\n│   │\n│   ├── features       \u003c- Scripts to turn raw data into features for modeling\n│   │   └── build_features.py\n│   │\n│   ├── models         \u003c- Scripts to train models and then use trained models to make\n│   │   │                 predictions\n│   │   ├── predict_model.py\n│   │   └── train_model.py\n│   │\n│   └── visualization  \u003c- Scripts to create exploratory and results oriented visualizations\n│       └── visualize.py\n│\n└── tox.ini            \u003c- tox file with settings for running tox; see tox.testrun.org\n\n```\n\u003c/details\u003e\n\n---\n\nSome things can be improved right away:\n1. we added a `make docs` command that automatically generates Sphinx documentation from the docstrings of the whole `src` module;\n2. we added a convenient file logger (and, accordingly, a `logs` folder);\n3. 
we added a coordinator entity for easy navigation throughout the project, removing the need to write `os.path.join`, `os.path.abspath` or `os.path.dirname` every time.\n\nBut what problems remain?\n\n* The `data` folder can grow significantly, yet which script or notebook generated each file remains a mystery. The sheer number of different files stored there can be confusing. It is also unclear whether any of them is useful for implementing a new feature, because there is no place to keep descriptions and explanations.\n* The `data` folder lacks a `features` subfolder, which would be useful: one could store calculated statistics, embeddings and other features there. There is [a nice write-up](https://www.logicalclocks.com/feature-store/) on this topic, which I strongly recommend.\n* The `src` folder is another problem. It contains both functionality that is relevant project-wide (like the `src.data` submodule) and functionality relevant to specific and often small sub-tasks (like `src.models`).\n* The `references` folder exists, but it is an open question who should add records there, when, and how. And there is a lot to explain during the development process: which experiments have been done, what the results were, and what we are doing next.\n\nTo solve the listed problems, I introduce the _experiment_ entity.\n\n\n### Experiments\n\nAn _experiment_ is a place that contains all the data relevant to testing some hypothesis.\n\nThis includes:\n* What data was used\n* What data (or artefacts) was produced\n* Code version\n* Timestamps of the beginning and end of the experiment\n* Source file\n* Parameters\n* Metrics\n* Logs\n\nMany of these can be logged via tracking utilities like [mlflow](https://mlflow.org/docs/latest/tracking.html), but that is not enough. 
We can improve our workflow.\n\nThis is what an example experiment looks like:\n\n```\n\u003cproject_root\u003e\n    └── experiments\n        ├── exp-001-Tree-models\n        │   ├── config            \u003c- YAML files with grid search parameters or just model parameters\n        │   ├── models            \u003c- dumped models\n        │   ├── notebooks         \u003c- notebooks for research\n        │   ├── scripts           \u003c- scripts like train.py or predict.py\n        │   ├── Makefile          \u003c- for running the experiment with just a few words typed in the console\n        │   ├── requirements.txt  \u003c- required libraries\n        │   └── log.md            \u003c- logs of how the experiment is going\n        │\n        ├── exp-002-Gradient-boosting\n       ...\n```\n\nLet's take a look at the workflow for one experiment.\n1. Notebooks are created in which the data is prepared for a model and the model's structure is introduced.\n2. Once the code is ready, it is moved to `train.py`.\n    - You might track model parameters from there (for instance, with `mlflow`).\n    - Create a relevant `config` file for the training configuration.\n    - The code should be callable from the console.\n    - It could take paths to the data, the `config` file, and the directory to dump the model to.\n3. Then, the Makefile is modified to start the training process from the console. Provide a command like `make train`.\n4. Many models are trained, and all the metrics and parameters are sent to `mlflow`. You can use `mlflow ui` to check the results.\n5. Finally, all results are recorded in `log.md`. It has some [impact analysis](https://en.wikipedia.org/wiki/Change_impact_analysis) elements: the developer needs to point out what data was used and what data was generated. 
This information can be used to automatically generate a readme file for the `data` folder and clarify which file is used where.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsurfstudio%2Focean","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsurfstudio%2Focean","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsurfstudio%2Focean/lists"}