# Metaflow `@checkpoint`/`@model`/`@huggingface_hub` Examples

Examples for the Metaflow Checkpoint Extension: https://github.com/outerbounds/metaflow-checkpoint-examples (Python)

Long-running data processing and machine learning jobs often present several challenges:

1. **Failure Recovery**: Recovering from failures can be painful and time-consuming.
   - *Example*: Suppose you're training a deep learning model that takes 12 hours to complete. If the process crashes at the 10-hour mark due to a transient error, without checkpoints you'd have to restart the entire training from scratch.
   - *Example*: During data preprocessing, you generate intermediate datasets such as tokenized text or transformed images. Losing these intermediates means re-running expensive computations, which is especially problematic if they took hours to create.

2. **External Dependencies**: Jobs may require large external data (e.g., pre-trained models) that are cumbersome to manage.
   - *Example*: Loading a pre-trained transformer model from Hugging Face Hub can take a significant amount of time and bandwidth.
     If this model isn't cached, every run or worker node (in a distributed training context) would need to download it separately, leading to inefficiencies.

3. **Version Control in Multi-User Environments**: Managing checkpoints and models in a multi-user setting requires proper version control to prevent overwriting and to ensure the correct checkpoint is loaded during failure recovery.
   - *Example*: If multiple data scientists train models and save checkpoints to shared storage, one user's checkpoint might accidentally overwrite another's, leading to confusion and lost work. Moreover, when a job resumes after a failure, it must load the checkpoint corresponding to that specific run and user.

To address these challenges, Metaflow introduces the `@checkpoint`/`@model`/`@huggingface_hub` decorators, which simplify saving and loading checkpoints and models within your flows. These decorators ensure that your long-running jobs can resume seamlessly after a failure, manage external dependencies efficiently, and maintain proper version control in collaborative environments.

This repository contains a gallery of examples demonstrating how to leverage `@checkpoint`/`@model`/`@huggingface_hub` to overcome these challenges. By exploring them, you'll learn practical ways to integrate checkpointing and model management into your workflows, improving robustness, efficiency, and collaboration.

---

## Starter Examples

**Basic Checkpointing with `@checkpoint`:**

- [MNIST Training with Vanilla PyTorch](./mnist_torch_vanilla)
- [MNIST Training with Keras](./mnist_keras)
- [MNIST Training with PyTorch Lightning](./mnist_ptl)
- [MNIST Training with Hugging Face Transformers](./mnist_huggingface)
- [Saving XGBoost Models as Part of the Model Registry](./xgboost/)

These starter examples introduce the fundamentals of checkpointing and model saving.
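The save-and-resume pattern these examples rely on can be sketched in plain Python. The toy loop below (illustrative only, not the Metaflow API; the fake loss and simulated crash are invented for the demo) persists its state every epoch and picks up from the newest checkpoint on retry, which is essentially what `@checkpoint` automates for you:

```python
import json
import os
import tempfile

def train(total_epochs, checkpoint_path, crash_at=None):
    """Toy training loop that saves its state after every epoch and
    resumes from the latest checkpoint instead of restarting."""
    state = {"epoch": 0, "loss": None}
    if os.path.exists(checkpoint_path):          # resume rather than restart
        with open(checkpoint_path) as f:
            state = json.load(f)
    for epoch in range(state["epoch"], total_epochs):
        if crash_at is not None and epoch == crash_at:
            raise RuntimeError("simulated mid-training failure")
        state = {"epoch": epoch + 1, "loss": 1.0 / (epoch + 1)}  # fake work
        tmp = checkpoint_path + ".tmp"
        with open(tmp, "w") as f:                # write-then-rename keeps the
            json.dump(state, f)                  # checkpoint file consistent
        os.replace(tmp, checkpoint_path)
    return state

ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
try:
    train(10, ckpt, crash_at=7)                  # first attempt dies at epoch 7
except RuntimeError:
    pass
final = train(10, ckpt)                          # retry resumes from epoch 7
print(final["epoch"])                            # → 10
```

The retry only runs epochs 7 through 9, not all ten; the examples in this repository apply the same idea with real frameworks, while `@checkpoint` handles the storage, versioning, and reload logic.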
They show how to implement `@checkpoint` in simple training workflows, ensuring that you can recover from failures without losing progress. You'll also see how `@model` makes saving and loading models and checkpoints effortless.

---

## Intermediate Examples

**Checkpointing with Large Models and Managing External Dependencies:**

- [Training LoRA Models with Hugging Face](./lora_huggingface/)
- [Training LoRA Models on NVIDIA GPU Cloud with `@nvidia`](./nim_lora/)
- [Generating Videos from Text Using Stable Diffusion XL and Stable Diffusion Video](./stable-diff/)

These intermediate examples dive into more complex scenarios where managing large external models becomes crucial. You'll learn how to use `@checkpoint`/`@model` alongside external resources such as Hugging Face Hub (with `@huggingface_hub`).

---

## Advanced Examples

**Checkpointing and Failure Recovery in Distributed Training Environments:**

- [Multi-node Distributed Training of CIFAR-10 with PyTorch DDP](./cifar_distributed/)

The advanced examples focus on distributed training environments, where the complexity of failure recovery and model management increases. You'll explore how `@checkpoint` facilitates seamless recovery across multiple nodes.
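The external-dependency problem from the intermediate examples boils down to download-once-reuse-everywhere caching. The sketch below illustrates that idea in plain Python under stated assumptions: `fetch_model`, `fake_download`, and the cache layout are invented for this demo and are not the `@huggingface_hub` API, which manages an equivalent cache for you keyed by repo id:

```python
import hashlib
import os
import tempfile

CACHE_ROOT = tempfile.mkdtemp()                  # demo cache; a real cache persists

def fetch_model(repo_id, download):
    """Download `repo_id` once and return the cached path on every
    later call, mimicking the reuse that @huggingface_hub provides."""
    key = hashlib.sha256(repo_id.encode()).hexdigest()[:16]
    path = os.path.join(CACHE_ROOT, key)
    if not os.path.exists(path):                 # cache miss: do the real work
        data = download(repo_id)
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            f.write(data)
        os.replace(tmp, path)                    # publish the entry atomically
    return path

calls = []
def fake_download(repo_id):                      # stands in for a hub download
    calls.append(repo_id)
    return b"weights for " + repo_id.encode()

p1 = fetch_model("org/some-model", fake_download)
p2 = fetch_model("org/some-model", fake_download)
print(p1 == p2, len(calls))                      # → True 1
```

Two fetches trigger only one download; in a distributed run, every worker resolving the same reference avoids re-downloading the model, which is the inefficiency the intermediate examples address.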