https://github.com/ml-tooling/ml-project-template

ML project template facilitating both research and production phases.
docker machine-learning reproducibility

Last synced: 9 months ago

Host: GitHub
URL: https://github.com/ml-tooling/ml-project-template
Owner: ml-tooling
Created: 2019-07-18T18:41:21.000Z (almost 7 years ago)
Default Branch: develop
Last Pushed: 2019-08-05T15:50:39.000Z (almost 7 years ago)
Last Synced: 2025-09-12T03:12:47.378Z (10 months ago)
Topics: docker, machine-learning, reproducibility
Language: Python
Homepage:
Size: 68.4 KB
Stars: 111
Watchers: 4
Forks: 30
Open Issues: 0
Metadata Files:
- Readme: README.md


# ML Project Template

This repository contains a template project that can be easily adapted for all kinds of Machine Learning tasks.
Typically, solving such a task entails two main phases, _research_ and _production_, with very different focuses. The template intends to facilitate work on ML projects by guiding practitioners to adopt some best practices.

[`research`](./research): exploratory data analyses, model prototyping, and experiments are collected here in a structured way.

[`production`](./production): the distilled utils library, training job, and inference service are implemented here.

We recommend cloning this repo and customizing it for the specific use case at hand.

---

## Repository Structure

- **[research](./research)**: Scripts and notebooks for experimentation.
  - **[develop](./research/develop)** (Python): Experimental code to try out new ideas and experiments. Use Jupyter notebooks wherever you can. Naming convention: `YYYY-MM-DD_userid_short-description`. If you cannot use a notebook and have multiple scripts/files for an experiment, create a folder with the same naming convention. Each file should be handled by one person only.
  - **[deliver](./research/deliver)** (Python): Refactored notebooks that contain valuable insights or results (e.g. visualizations, training runs). Notebooks should be refactored, documented, contain outputs, and use the following naming schema: `YYYY-MM-DD_short-description`. Notebooks in deliver should not be changed or rerun. If you want to rerun a deliver notebook, duplicate it into the develop folder.
  - **[templates](./research/templates)** (Python): Refactored notebooks that are reusable for a specific task (e.g. model training, data exploration). Notebooks should be refactored, documented, not contain any output, and use the following naming schema: `short-description`. If you want to use a template notebook, duplicate it into the develop folder.
- **[production](./production)**: The production-ready solution(s) composed of libraries, services, and jobs.
  - **[python-utils-lib](./production/python-utils-lib)** (Python): Utility functions that are distilled from the research phase and used across multiple scripts. Should only contain refactored and tested Python scripts/modules. Installable via pip.
  - **[training-job](./production/training-job)** (Python/Docker): Combines required data exports, preprocessing and training scripts into a Docker container. This makes results reproducible and the production model retrainable in _any_ environment.
  - **[inference-service](./production/inference-service)** (Python/Docker): Docker container that provides the final model prediction capabilities via a REST API.
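
To illustrate what "prediction capabilities via a REST API" could look like, here is a minimal sketch built only on the Python standard library. It is not the code shipped in `inference-service`: the `/predict` route, the `text` request field, port 8080, and the `predict` stand-in are all assumptions made for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Any, Dict, Tuple


def predict(text: str) -> str:
    """Hypothetical stand-in for a trained model; swap in real inference here."""
    return "long" if len(text.split()) > 5 else "short"


def handle_predict(raw_body: bytes) -> Tuple[int, Dict[str, Any]]:
    """Map a raw JSON request body to an (HTTP status, JSON payload) pair."""
    try:
        text = json.loads(raw_body)["text"]
    except (ValueError, KeyError, TypeError):
        return 400, {"error": "expected a JSON object with a string field 'text'"}
    if not isinstance(text, str):
        return 400, {"error": "expected a JSON object with a string field 'text'"}
    return 200, {"prediction": predict(text)}


class PredictionHandler(BaseHTTPRequestHandler):
    """Serves POST /predict (assumed route) and answers with JSON."""

    def do_POST(self) -> None:
        if self.path != "/predict":
            self._send(404, {"error": "not found"})
            return
        length = int(self.headers.get("Content-Length", 0))
        status, body = handle_predict(self.rfile.read(length))
        self._send(status, body)

    def _send(self, status: int, body: Dict[str, Any]) -> None:
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(host: str = "0.0.0.0", port: int = 8080) -> None:
    """Start the blocking HTTP server (assumed host and port)."""
    HTTPServer((host, port), PredictionHandler).serve_forever()
```

Calling `serve()` as the container entry point would expose the endpoint. Keeping the request logic in `handle_predict` means it can be unit-tested without starting a server.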

## Naming Conventions

### Code Artifacts

- develop notebooks/scripts: `YYYY-MM-DD_userid_short-description`
- deliver notebooks/scripts: `YYYY-MM-DD_short-description`
- template notebooks/scripts: `short-description`
- services: `-service` suffix
- jobs: `-job` suffix
- libraries: `-lib` suffix
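
The notebook and script conventions above can be checked automatically. The following is a minimal sketch, not part of the template. It assumes that `userid` and `short-description` use only lowercase letters, digits, and hyphens, and that names are checked without their file extension. The helper names `develop_name` and `classify` are hypothetical.

```python
import re
from datetime import date, datetime
from typing import Optional

# Assumption: identifiers consist of lowercase letters, digits, and hyphens.
_SLUG = r"[a-z0-9]+(?:-[a-z0-9]+)*"
DEVELOP_RE = re.compile(rf"^(\d{{4}}-\d{{2}}-\d{{2}})_({_SLUG})_({_SLUG})$")
DELIVER_RE = re.compile(rf"^(\d{{4}}-\d{{2}}-\d{{2}})_({_SLUG})$")


def _valid_date(text: str) -> bool:
    """Return True if text is a real calendar date in YYYY-MM-DD form."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def develop_name(userid: str, description: str, day: Optional[date] = None) -> str:
    """Build a develop artifact name: YYYY-MM-DD_userid_short-description."""
    day = day or date.today()
    return f"{day.isoformat()}_{userid}_{description}"


def classify(stem: str) -> str:
    """Classify a name (without extension) as develop, deliver, template, or invalid."""
    match = DEVELOP_RE.match(stem)
    if match and _valid_date(match.group(1)):
        return "develop"
    match = DELIVER_RE.match(stem)
    if match and _valid_date(match.group(1)):
        return "deliver"
    if re.fullmatch(_SLUG, stem):
        return "template"
    return "invalid"
```

For example, `classify("2019-07-18_jdoe_data-exploration")` returns `"develop"`, while a name with an impossible date such as `2019-13-40_jdoe_x` is reported as `"invalid"`.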

### Files

`<dataset>_<preprocessing>_<training>.<filetype>`

#### Examples:

- `blogs-metadata.csv`
- `blogs-metadata_cl-rs_ft-vec.vectors`
- `categories2blogs_cl-rs-lm_tfidf-lsvm.model.zip`
- `categories2blogs-questions_cl-rs-lm_tfidf-lsvm.model.zip`
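
The examples suggest that a file name consists of a dataset part, an optional preprocessing part, and an optional training part, joined by underscores and followed by the file type. The sketch below splits a name into those parts. It is an illustration rather than part of the template. Because `ft-vec` itself contains a hyphen, the preprocessing and training parts are tokenized against a known vocabulary, here limited to the identifiers described in this README. The names `ArtifactName` and `parse_artifact_name` are hypothetical.

```python
from typing import List, NamedTuple, Sequence

# Vocabulary taken from the identifier descriptions in this README.
PREPROCESSING_IDS = ("cl", "rs", "lm")
TRAINING_IDS = ("ft-vec", "tfidf", "lsvm")


class ArtifactName(NamedTuple):
    dataset: str
    preprocessing: List[str]
    training: List[str]
    filetype: str


def _split_known(segment: str, known: Sequence[str]) -> List[str]:
    """Split a hyphen-joined segment into known identifiers, longest first."""
    tokens: List[str] = []
    rest = segment
    while rest:
        for ident in sorted(known, key=len, reverse=True):
            if rest == ident or rest.startswith(ident + "-"):
                tokens.append(ident)
                rest = rest[len(ident) + 1:]
                break
        else:
            raise ValueError(f"unknown identifier in {segment!r}")
    return tokens


def parse_artifact_name(filename: str) -> ArtifactName:
    """Parse a data or model file name into its naming-convention parts."""
    # Everything after the first dot counts as the file type, e.g. "model.zip".
    stem, _, filetype = filename.partition(".")
    parts = stem.split("_")
    if len(parts) > 3 or not parts[0]:
        raise ValueError(f"unexpected file name: {filename!r}")
    # Assumption: a name with only two parts carries preprocessing but no training.
    preprocessing = _split_known(parts[1], PREPROCESSING_IDS) if len(parts) > 1 else []
    training = _split_known(parts[2], TRAINING_IDS) if len(parts) > 2 else []
    return ArtifactName(parts[0], preprocessing, training, filetype)
```

With this sketch, `parse_artifact_name("blogs-metadata_cl-rs_ft-vec.vectors")` yields the dataset `blogs-metadata`, the preprocessing steps `cl` and `rs`, the training identifier `ft-vec`, and the file type `vectors`.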

#### Name Identifier Descriptions:

| Name | Description |
| --- | --- |
| **Dataset Identifiers:** | |
| `categories2blogs` | Dataset containing blogs with the text content, blogs item URI, and connected primary tags. |
| `blogs-metadata` | Dataset containing all blogs and related metadata (properties). |
| **Preprocessing Identifiers:** | |
| `cl` | Default text cleaning (lowercasing, regex cleaning). |
| `rs` | Remove Stopwords. |
| `lm` | Text lemmatization. |
| **Training Identifiers:** | |
| `ft-vec` | Text vectorizer using Fasttext. |
| `tfidf` | Text vectorizer using TFIDF. |
| `lsvm` | Classifier using linear SVM. |
| **Filetype Identifiers:** | |
| `.model` | Model file. |
| `.vectors` | Binary vectors file. |
