**If you're taking this tutorial at a conference, please pull the repository 24 hours before the tutorial begins to make sure you have the most recent version!**

**If you're at SciPy and want to access the content in your Metaflow sandbox, click [here](https://account.outerbounds.dev/account/?workspace=/home/workspace/workspaces/scipy-full-stack-ml/workspace.code-workspace).**

# Full stack ML with Metaflow tutorial

One of the key questions in modern data science and machine learning, for businesses and practitioners alike, is how to move machine learning projects from prototype and experiment to production as a repeatable process. In this workshop, we present an introduction to the landscape of production-grade tools, techniques, and workflows that bridge the gap between laptop data science and production ML workflows.

Also, note that when we say we're teaching "Full Stack Machine Learning", we are not advocating for the existence of "Full Stack Data Scientists"! Rather, our goal is to teach which layers of the modern data stack data scientists need to focus on, while having (relatively) easy access to key infrastructural layers:

![flow0](img/data-triangle.jpg)

## Prerequisites

This workshop assumes familiarity with:

* programming fundamentals and the basics of the Python programming language (e.g., variables, for loops);
* a bit about the PyData stack: `numpy`, `pandas`, `scikit-learn`, for example;
* a bit about Jupyter Notebooks and Jupyter Lab;
* your way around the terminal/shell.

**However, we have always found that the most important and beneficial prerequisite is a will to learn new things, so if you have this quality, you'll definitely get something out of this workshop.**

## Getting set up on your Metaflow sandbox

If you want to get going as soon as possible, you can use your [Metaflow sandbox](https://account.outerbounds.dev/account/) for free!
The workspace and its dependencies are already installed in your sandbox, so there is no installation required.

## Getting set up on your own infrastructure

The easiest way to get started on your own infrastructure is to follow this [CloudFormation template](https://github.com/outerbounds/metaflow-tools/blob/master/aws/cloudformation/metaflow-cfn-template.yml). You can find [instructions here](https://github.com/outerbounds/metaflow-tools/tree/master/aws/cloudformation#how-to-deploy-from-the-aws-console).

> _Note_: The CloudFormation template uses AWS Batch to provide compute resources. Some of the code in this repository uses the `@kubernetes` decorator, so if your Metaflow deployment uses Batch instead of Kubernetes, you can replace `@kubernetes` with `@batch` as needed.
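
For illustration, here is a minimal sketch of that swap (the flow and step names here are hypothetical, not taken from this repository):

```
from metaflow import FlowSpec, step, batch

class HelloBatchFlow(FlowSpec):
    """Hypothetical flow showing @batch where the repo code uses @kubernetes."""

    @step
    def start(self):
        self.next(self.train)

    # On a Batch-backed deployment, @batch replaces @kubernetes;
    # resource arguments such as cpu and memory (in MB) are analogous.
    @batch(cpu=2, memory=4096)
    @step
    def train(self):
        print("This step runs on AWS Batch.")
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    HelloBatchFlow()
```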

### 1. Deploy Metaflow

To run on your own infrastructure stack, please review the [Metaflow engineering guides](https://outerbounds.com/engineering/welcome/). There you will find information on how to configure and operate Metaflow on AWS, Azure, or GCP.

### 2. Clone the repository

To get set up for this live coding session, clone this repository. You can do so by executing the following in your terminal:

```
git clone https://github.com/outerbounds/full-stack-ML-metaflow-tutorial
```

Alternatively, you can download a zip file of the repository from the top of its main page. If you prefer not to use git or don't have experience with it, this is a good option.

### 3. Download Anaconda (if you haven't already)

If you do not already have the [Anaconda distribution](https://www.anaconda.com/download/) of Python 3, go get it.

### 4. Create your conda environment for this session

Navigate to the relevant directory `scipy-full-stack-ml` and install the required packages in a new conda environment:

```
conda env create -f env.yml
```

This will create a new environment called `scipy-full-stack-ml`. To activate the environment on OSX/Linux, execute

```
source activate scipy-full-stack-ml
```

(With recent versions of conda, `conda activate scipy-full-stack-ml` works on all platforms.)

If you're using Windows, please follow the instructions under Metaflow Windows Support [here](https://docs.metaflow.org/v/r/getting-started/install#windows-support): Metaflow doesn't currently offer native support for Windows, but if you are using Windows 10 you can use WSL (Windows Subsystem for Linux) to install Metaflow.

### 5. Open Jupyter Lab, VSCode notebook, etc.

In the terminal, execute `jupyter lab`.

Then open the notebook `1-Laptop-ML.ipynb` and we're ready to get coding. Enjoy.

## Session Outline

- Lesson 1: Laptop Machine Learning (the refresher)

This lesson will be a refresher on laptop machine learning, that is, when you’re using local compute resources, not working on the cloud: using the PyData stack (packages such as NumPy, pandas, and scikit-learn) to do basic forms of prediction and inference locally. We will also cover common pitfalls and gotchas, which motivate the next lessons.
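
To give a flavor of "laptop ML", here is a minimal, self-contained sketch (illustrative only; the lesson notebook may use different data and models):

```
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load a toy dataset and hold out a test split.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit a model locally and evaluate it.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.3f}")
```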

- Lesson 2: Machine learning workflows and DAGs

This lesson will focus on building local machine learning workflows using Metaflow, although the high-level concepts taught will be applicable to any workflow orchestrator. Attendees will get a feel for writing flows and DAGs to define the steps in their workflows. We'll also use DAG cards to visualize our ML workflows. This lesson uses local computation; in the next lesson, we'll burst to the cloud.

We'll introduce the framework Metaflow, which allows data scientists to focus on the top layers of the ML stack, while having access to the infrastructural layers.
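
For a taste of what a flow looks like, here is a minimal branching DAG (a hypothetical example, not code from the lessons); saving it as `branch_flow.py` and running `python branch_flow.py run` executes it locally:

```
from metaflow import FlowSpec, step

class BranchFlow(FlowSpec):
    """A toy DAG: start fans out into two parallel steps, then joins."""

    @step
    def start(self):
        self.next(self.train_a, self.train_b)  # branch into two steps

    @step
    def train_a(self):
        self.score = 0.8  # placeholder metric
        self.next(self.join)

    @step
    def train_b(self):
        self.score = 0.9  # placeholder metric
        self.next(self.join)

    @step
    def join(self, inputs):
        self.best = max(i.score for i in inputs)  # merge branch results
        self.next(self.end)

    @step
    def end(self):
        print(f"best score: {self.best}")

if __name__ == "__main__":
    BranchFlow()
```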

- Lesson 3: Bursting to the Cloud

In this lesson, we’ll see how we can move ML steps or entire workflows to the cloud from the comfort of our own IDE. In this case, we’ll be using AWS Batch compute resources, but the techniques are generalizable.
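
One convenient pattern here is Metaflow's ability to attach a decorator to every step at run time, without editing the flow code. A sketch, assuming a flow saved as `flow.py` and a Batch-backed deployment:

```
python flow.py run --with batch
```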

- Lesson 4 (optional and time permitting): Integrating other tools into your ML pipelines

We'll also see how to begin integrating other tools into our pipelines, such as dbt for data transformation, Great Expectations for data validation, Weights & Biases for experiment tracking, and Amazon SageMaker for model deployment. Once again, the intention is not to tie us to any of these tools, but to use them to illustrate various aspects of the ML stack and to develop a workflow in which they can easily be switched out for other tools, depending on where you work and who you're collaborating with.
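
To make this concrete, here is a minimal sketch of one such integration: logging a metric to Weights & Biases from inside a Metaflow step (assumes `wandb` is installed and you are logged in; the project name and metric are placeholders):

```
from metaflow import FlowSpec, step

class TrackedFlow(FlowSpec):

    @step
    def start(self):
        import wandb
        run = wandb.init(project="full-stack-ml-tutorial")  # hypothetical project name
        run.log({"accuracy": 0.9})  # placeholder metric
        run.finish()
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    TrackedFlow()
```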

To be clear, Lessons 1-3 above will get you far! As your projects mature, the more advanced topics in Lesson 4 become relevant.