https://github.com/ml-tooling/ml-project-template
ML project template facilitating both research and production phases.
docker machine-learning reproducibility
- Host: GitHub
- URL: https://github.com/ml-tooling/ml-project-template
- Owner: ml-tooling
- Created: 2019-07-18T18:41:21.000Z (almost 7 years ago)
- Default Branch: develop
- Last Pushed: 2019-08-05T15:50:39.000Z (almost 7 years ago)
- Last Synced: 2025-09-12T03:12:47.378Z (9 months ago)
- Topics: docker, machine-learning, reproducibility
- Language: Python
- Homepage:
- Size: 68.4 KB
- Stars: 111
- Watchers: 4
- Forks: 30
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# ML Project Template
This repository contains a template project that can be easily adapted for all kinds of Machine Learning tasks.
Typically, solving such a task entails two main phases, _research_ and _production_, with very different focuses. The template is intended to facilitate work on ML projects by guiding practitioners to adopt some best practices.
- [`research`](./research): exploratory data analyses, model prototyping and experiments are dumped here in a structured way
- [`production`](./production): distilled utils lib, training job and inference service are implemented here
It is recommended to simply clone this repo and customize it to the specific use-case at hand.
---
## Repository Structure
- **[research](./research)**: Scripts and Notebooks for experimentation.
  - **[develop](./research/develop)** (Python): Experimental code to try out new ideas and experiments. Use Jupyter notebooks wherever you can. Naming convention: `YYYY-MM-DD_userid_short-description`. If you cannot use a notebook and have multiple scripts/files for an experiment, create a folder with the same naming convention. Each file should be handled by one person only.
  - **[deliver](./research/deliver)** (Python): Refactored notebooks that contain valuable insights or results (e.g. visualizations, training runs). Notebooks should be refactored, documented, contain outputs, and use the following naming schema: `YYYY-MM-DD_short-description`. Notebooks in deliver should not be changed or rerun. If you want to rerun a deliver notebook, duplicate it into the develop folder.
  - **[templates](./research/templates)** (Python): Refactored notebooks that are reusable for a specific task (e.g. model training, data exploration). Notebooks should be refactored, documented, not contain any output, and use the following naming schema: `short-description`. If you want to use a template notebook, duplicate it into the develop folder.
- **[production](./production)**: The production-ready solution(s) composed of libraries, services, and jobs.
  - **[python-utils-lib](./production/python-utils-lib)** (Python): Utility functions that are distilled from the research phase and used across multiple scripts. It should only contain refactored and tested Python scripts/modules. Installable via pip.
  - **[training-job](./production/training-job)** (Python/Docker): Combines required data exports, preprocessing and training scripts into a Docker container. This makes results reproducible and the production model retrainable in _any_ environment.
  - **[inference-service](./production/inference-service)** (Python/Docker): Docker container that provides the final model prediction capabilities via a REST API.
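
The rule that deliver and template notebooks are never edited in place, only duplicated into `develop`, can be sketched as a small helper. This is a minimal sketch and not part of the template: the function name `duplicate_into_develop`, the `userid` argument, and the choice to replace an existing date prefix with the current date are all assumptions made for illustration.

```python
import re
import shutil
from datetime import date
from pathlib import Path
from typing import Optional

# Deliver notebooks start with "YYYY-MM-DD_"; template notebooks have no prefix.
DATE_PREFIX = re.compile(r"^\d{4}-\d{2}-\d{2}_")


def duplicate_into_develop(
    source: Path,
    develop_dir: Path,
    userid: str,
    today: Optional[date] = None,
) -> Path:
    """Copy a deliver/template notebook into develop as YYYY-MM-DD_userid_short-description."""
    day = today or date.today()
    # Drop the old date prefix (if any) so only the short description remains.
    description = DATE_PREFIX.sub("", source.stem)
    target = develop_dir / f"{day.isoformat()}_{userid}_{description}{source.suffix}"
    if target.exists():
        # Hypothetical policy: never overwrite someone's existing experiment.
        raise FileExistsError(f"{target} already exists")
    develop_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, target)  # the original notebook is left untouched
    return target
```

For example, calling it on `research/deliver/2019-07-18_data-exploration.ipynb` with `userid="jdoe"` on 2019-08-05 would create `research/develop/2019-08-05_jdoe_data-exploration.ipynb`.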
## Naming Conventions
### Code Artifacts
- develop notebooks/scripts: `YYYY-MM-DD_userid_short-description`
- deliver notebooks/scripts: `YYYY-MM-DD_short-description`
- template notebooks/scripts: `short-description`
- services: `-service` suffix
- jobs: `-job` suffix
- libraries: `-lib` suffix
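
The conventions above lend themselves to an automated check, for example in a pre-commit hook. The sketch below is illustrative only: `follows_convention` is a hypothetical helper, names are checked without their file extension, and the assumption that user IDs and descriptions consist of lowercase letters, digits, and hyphens is not stated by the template.

```python
import re

DATE = r"\d{4}-\d{2}-\d{2}"
# Assumption: user IDs and descriptions are lowercase, hyphen-separated words.
SLUG = r"[a-z0-9]+(?:-[a-z0-9]+)*"

NOTEBOOK_PATTERNS = {
    "develop": re.compile(rf"{DATE}_{SLUG}_{SLUG}"),  # YYYY-MM-DD_userid_short-description
    "deliver": re.compile(rf"{DATE}_{SLUG}"),         # YYYY-MM-DD_short-description
    "template": re.compile(SLUG),                     # short-description
}

COMPONENT_SUFFIXES = {
    "service": "-service",
    "job": "-job",
    "library": "-lib",
}


def follows_convention(kind: str, name: str) -> bool:
    """Return True if `name` (without extension) matches the convention for `kind`."""
    if kind in NOTEBOOK_PATTERNS:
        return NOTEBOOK_PATTERNS[kind].fullmatch(name) is not None
    if kind in COMPONENT_SUFFIXES:
        suffix = COMPONENT_SUFFIXES[kind]
        # Require a non-empty, well-formed name in front of the suffix.
        prefix = name[: -len(suffix)]
        return name.endswith(suffix) and re.fullmatch(SLUG, prefix) is not None
    raise ValueError(f"unknown artifact kind: {kind!r}")
```

Because underscores only ever separate the date, user ID, and description, a develop name such as `2019-07-18_jdoe_data-exploration` is rejected when checked as a deliver notebook, which catches notebooks that were moved without being renamed.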
### Files
`<dataset-identifier>_<preprocessing-identifiers>_<training-identifiers>.<filetype-identifier>`
#### Examples:
- `blogs-metadata.csv`
- `blogs-metadata_cl-rs_ft-vec.vectors`
- `categories2blogs_cl-rs-lm_tfidf-lsvm.model.zip`
- `categories2blogs-questions_cl-rs-lm_tfidf-lsvm.model.zip`
#### Name Identifier Descriptions:
| Name | Description |
| --- | --- |
| **Dataset Identifiers:** | |
| `categories2blogs` | Dataset containing blogs with the text content, blogs item URI, and connected primary tags. |
| `blogs-metadata` | Dataset containing all blogs and related metadata (properties). |
| **Preprocessing Identifiers:** | |
| `cl` | Default text cleaning (lowercasing, regex cleaning). |
| `rs` | Remove Stopwords. |
| `lm` | Text lemmatization. |
| **Training Identifiers:** | |
| `ft-vec` | Text vectorizer using Fasttext. |
| `tfidf` | Text vectorizer using TFIDF. |
| `lsvm` | Classifier using linear SVM. |
| **Filetype Identifiers:** | |
| `.model` | Model file. |
| `.vectors` | Binary vectors file. |
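
To illustrate how the example file names decompose, here is a minimal parsing sketch. It rests on assumptions inferred from the examples rather than stated by the template: underscores separate the dataset, preprocessing, and training segments in that order; identifiers within a segment are joined by hyphens; trailing segments may be omitted; and everything after the first dot is the file type. `ArtifactName` and `parse_artifact_name` are hypothetical names, and the identifier vocabularies contain only the entries listed above.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Tuple

# Vocabularies taken from the identifier descriptions above.
PREPROCESSING_IDS = ("cl", "rs", "lm")
TRAINING_IDS = ("ft-vec", "tfidf", "lsvm")


@dataclass(frozen=True)
class ArtifactName:
    dataset: str
    preprocessing: Tuple[str, ...]
    training: Tuple[str, ...]
    filetype: str


def _split_ids(segment: str, known: Iterable[str]) -> Tuple[str, ...]:
    """Split a hyphen-joined segment into known identifiers."""
    # Longest alternatives first, so "ft-vec" is not broken at its own hyphen.
    alt = "|".join(re.escape(k) for k in sorted(known, key=len, reverse=True))
    if not re.fullmatch(rf"(?:{alt})(?:-(?:{alt}))*", segment):
        raise ValueError(f"unknown identifier in segment: {segment!r}")
    return tuple(re.findall(alt, segment))


def parse_artifact_name(filename: str) -> ArtifactName:
    """Parse a name such as 'blogs-metadata_cl-rs_ft-vec.vectors'."""
    stem, dot, filetype = filename.partition(".")
    if not dot or not filetype:
        raise ValueError(f"missing file type: {filename!r}")
    segments = stem.split("_")
    if not segments[0] or len(segments) > 3:
        raise ValueError(f"unexpected number of segments: {filename!r}")
    preprocessing = _split_ids(segments[1], PREPROCESSING_IDS) if len(segments) > 1 else ()
    training = _split_ids(segments[2], TRAINING_IDS) if len(segments) > 2 else ()
    return ArtifactName(segments[0], preprocessing, training, filetype)
```

A known-identifier vocabulary is used instead of a plain split on hyphens because `ft-vec` itself contains a hyphen; a new identifier would need to be added to the tuples before files using it could be parsed.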