Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/outerbounds/metaflow-trainium
- Host: GitHub
- URL: https://github.com/outerbounds/metaflow-trainium
- Owner: outerbounds
- Created: 2024-01-23T20:56:21.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2024-04-11T13:57:00.000Z (7 months ago)
- Last Synced: 2024-04-12T12:12:21.224Z (7 months ago)
- Language: Python
- Size: 1.2 MB
- Stars: 1
- Watchers: 4
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
README
# Metaflow-Trainium Examples
This repository contains examples that demonstrate how to use [Metaflow](https://metaflow.org/) to define and run machine learning training jobs with [AWS Trainium](https://aws.amazon.com/machine-learning/trainium/). The training jobs are executed as batch jobs running on AWS EC2 trn1 instances in [AWS Batch](https://aws.amazon.com/batch/).
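In Metaflow terms, a step of a flow is shipped to AWS Batch by decorating it. The minimal sketch below is not from this repository; the job queue name is a placeholder for a queue whose compute environment is backed by trn1 instances:

```python
# Minimal Metaflow flow whose training step runs as an AWS Batch job.
# The queue name below is a placeholder, not this repo's configuration.
from metaflow import FlowSpec, batch, step


class TrainiumHelloFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.train)

    # @batch ships this step to AWS Batch instead of running it locally.
    @batch(cpu=8, memory=32000, queue="my-trn1-job-queue")
    @step
    def train(self):
        print("running inside an AWS Batch container")
        self.next(self.end)

    @step
    def end(self):
        print("done")


if __name__ == "__main__":
    TrainiumHelloFlow()
```

Running `python flow.py run` then executes `train` remotely on Batch, while the undecorated `start` and `end` steps run locally.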
To run these examples, you first need to provision AWS resources for Metaflow and AWS Batch. Please refer to the [installation guide](./install_metaflow_and_batch.md) for instructions on how to deploy the required resources using CloudFormation and finalize your Metaflow setup.
## Step 1: Deploy infrastructure
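The walkthrough below deploys the stack with CloudFormation. As a rough illustration of what that amounts to (the stack name and template URL here are placeholders; follow the installation guide for the actual template and parameters):

```python
# Hypothetical boto3 equivalent of deploying the CloudFormation stack.
# StackName and TemplateURL are placeholders, not the guide's values.
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-2")
cfn.create_stack(
    StackName="metaflow-trainium",
    TemplateURL="https://example.com/metaflow-batch-template.yml",
    Capabilities=["CAPABILITY_IAM"],  # the stack creates IAM roles
)
cfn.get_waiter("stack_create_complete").wait(StackName="metaflow-trainium")
```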
https://github.com/outerbounds/metaflow-trainium/assets/40632488/850e474e-098c-44eb-81bb-d3a379eb1fab

## Step 2: Configure Metaflow
https://github.com/outerbounds/metaflow-trainium/assets/40632488/c89f8600-1038-4353-978a-2a347c3a2c49
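Finalizing the setup amounts to pointing Metaflow at the resources the stack created. A hedged sketch of what the configuration contains (every value below is a placeholder; in practice `metaflow configure aws` writes this file interactively):

```python
# Sketch: write ~/.metaflowconfig/config.json with the stack's outputs.
# All values are placeholders that show the shape of the configuration.
import json
import os

config = {
    "METAFLOW_DEFAULT_DATASTORE": "s3",
    "METAFLOW_DATASTORE_SYSROOT_S3": "s3://my-metaflow-bucket/metaflow",
    "METAFLOW_BATCH_JOB_QUEUE": "my-trn1-job-queue",
    "METAFLOW_ECS_S3_ACCESS_IAM_ROLE": "arn:aws:iam::123456789012:role/metaflow-batch",
}

path = os.path.expanduser("~/.metaflowconfig/config.json")
os.makedirs(os.path.dirname(path), exist_ok=True)
with open(path, "w") as f:
    json.dump(config, f, indent=2)
```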
## Step 3: Run experiments

Once the required resources have been created and configured, try running the included [allreduce example](./allreduce-trn) as a basic test of the Metaflow/Trainium/Batch setup. When the allreduce example runs successfully, you can proceed to the more realistic workflows such as [Llama2-7b pretraining](./llama2-7b-pretrain-trn).

AWS Trainium is currently supported in us-east-1, us-east-2, and us-west-2. Please make sure that you are working in one of these supported regions.
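For intuition, an allreduce smoke test on Trainium generally boils down to the pattern below (a generic torch/XLA sketch, not the repository's code; it assumes the Neuron SDK's `torch-neuronx`/`torch-xla` packages are installed):

```python
# Generic allreduce smoke test: every worker (one per NeuronCore) sums a
# tensor across all workers through torch.distributed's XLA backend.
import torch
import torch.distributed as dist
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_backend  # registers the "xla" backend

dist.init_process_group("xla")
device = xm.xla_device()
t = torch.ones(2, 2, device=device) * (dist.get_rank() + 1)
dist.all_reduce(t)  # in-place sum across all ranks
xm.mark_step()      # flush the lazily built XLA graph
print(t)            # every rank prints the same summed tensor
```

A script like this is typically launched with `torchrun`, e.g. `torchrun --nproc_per_node=2 allreduce.py` on a `trn1.2xlarge` (two NeuronCores).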
## Example registry
We have included the following examples and are happy to take requests to expand the list. Note that some of the Trainium sub-directories have counterpart implementations for running comparisons on GPUs. This is not intended to be a benchmarking repository, but running a comparison against GPUs you have access to is useful for understanding general performance characteristics relative to other hardware architectures.
### [Llama2 pre-training](./llama2-7b-pretrain-trn/)
Pre-train Llama2 using ≥4 nodes of `trn1.32xlarge` instances.
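Multi-node steps like this are typically gang-scheduled in Metaflow with `num_parallel`. The hypothetical sketch below shows the shape of such a flow (queue and resource numbers are placeholders, and the real pre-training flow is substantially more involved):

```python
# Hypothetical sketch: launch one step as a 4-node AWS Batch multi-node job.
from metaflow import FlowSpec, batch, current, step


class MultiNodeSketchFlow(FlowSpec):

    @step
    def start(self):
        # num_parallel launches four copies of `train` as one gang.
        self.next(self.train, num_parallel=4)

    @batch(cpu=128, memory=480000, queue="my-trn1-job-queue")  # placeholders
    @step
    def train(self):
        # current.parallel exposes this worker's rank within the gang.
        print(f"node {current.parallel.node_index} of {current.parallel.num_nodes}")
        self.next(self.join)

    @step
    def join(self, inputs):
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    MultiNodeSketchFlow()
```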
### [Llama2 fine-tuning on Trainium](./llama2-7b-finetune-trn/)
Fine-tune Llama2 on a single `trn1.32xlarge` instance using the [`optimum-neuron`](https://huggingface.co/docs/optimum-neuron/en/index) library from Hugging Face. For a minimal-code-change GPU implementation, see [here](./llama2-7b-finetune-gpu-single-node/).
Note: We found A100 GPUs to have the most comparable characteristics, but it is far from an apples-to-apples comparison.
### [BERT fine-tuning on Trainium](./bert-finetune-trn/)
Fine-tune BERT on a single `trn1.2xlarge` instance using the `optimum-neuron` library from Hugging Face. For a minimal-code-change GPU implementation, see [here](./bert-finetune-gpu/).
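Both fine-tuning examples build on the same `optimum-neuron` pattern, in which `NeuronTrainer` is a drop-in replacement for `transformers.Trainer` that handles compilation for NeuronCores. A hedged sketch using BERT (the model, dataset, and hyperparameters here are illustrative, not the examples' actual settings):

```python
# Illustrative optimum-neuron fine-tuning loop; values are placeholders.
from datasets import load_dataset
from optimum.neuron import NeuronTrainer, NeuronTrainingArguments
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Small slice of IMDB as a stand-in training set for a quick smoke test.
ds = load_dataset("imdb", split="train[:1000]")
ds = ds.map(
    lambda batch: tokenizer(
        batch["text"], truncation=True, padding="max_length", max_length=128
    ),
    batched=True,
)

trainer = NeuronTrainer(
    model=model,
    args=NeuronTrainingArguments(
        output_dir="bert-finetuned",
        per_device_train_batch_size=8,
        num_train_epochs=1,
        bf16=True,  # Trainium favors bf16
    ),
    train_dataset=ds,
)
trainer.train()
```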