{"id":25309901,"url":"https://github.com/kubeflow/trainer","last_synced_at":"2025-12-29T18:25:47.507Z","repository":{"id":37097085,"uuid":"95700338","full_name":"kubeflow/trainer","owner":"kubeflow","description":"Distributed ML Training and Fine-Tuning on Kubernetes","archived":false,"fork":false,"pushed_at":"2025-03-30T22:35:30.000Z","size":104926,"stargazers_count":1747,"open_issues_count":142,"forks_count":767,"subscribers_count":79,"default_branch":"master","last_synced_at":"2025-04-05T19:08:50.104Z","etag":null,"topics":["ai","distributed","fine-tuning","gpu","huggingface","jax","kubeflow","kubernetes","llm","machine-learning","mlops","python","pytorch","tensorflow","xgboost"],"latest_commit_sha":null,"homepage":"https://www.kubeflow.org/docs/components/training","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kubeflow.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":"ROADMAP.md","authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-06-28T18:38:14.000Z","updated_at":"2025-04-04T14:01:17.000Z","dependencies_parsed_at":"2023-10-10T17:22:23.560Z","dependency_job_id":"db19eb09-eae9-4186-9c44-89366ab1b643","html_url":"https://github.com/kubeflow/trainer","commit_stats":{"total_commits":1000,"total_committers":200,"mean_commits":5.0,"dds":0.902,"last_synced_commit":"f3792b08bdcc08c7b394336d2a2c0cd3356bb5dd"},"previous_names":["kubeflow/tf-operator","tensorflow/k8s","kubeflow/trainer"],"tags_count":43,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kubeflow%2Ftrainer
","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kubeflow%2Ftrainer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kubeflow%2Ftrainer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kubeflow%2Ftrainer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kubeflow","download_url":"https://codeload.github.com/kubeflow/trainer/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248312130,"owners_count":21082638,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","distributed","fine-tuning","gpu","huggingface","jax","kubeflow","kubernetes","llm","machine-learning","mlops","python","pytorch","tensorflow","xgboost"],"created_at":"2025-02-13T13:06:59.039Z","updated_at":"2025-12-29T18:25:47.502Z","avatar_url":"https://github.com/kubeflow.png","language":"Python","readme":"# Kubeflow Trainer\n\n[![Join Slack](https://img.shields.io/badge/Join_Slack-blue?logo=slack)](https://www.kubeflow.org/docs/about/community/#kubeflow-slack-channels)\n[![Coverage Status](https://coveralls.io/repos/github/kubeflow/trainer/badge.svg?branch=master)](https://coveralls.io/github/kubeflow/trainer?branch=master)\n[![Go Report Card](https://goreportcard.com/badge/github.com/kubeflow/trainer)](https://goreportcard.com/report/github.com/kubeflow/trainer)\n[![OpenSSF Best Practices](https://www.bestpractices.dev/projects/10435/badge)](https://www.bestpractices.dev/projects/10435)\n[![Ask 
DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/kubeflow/trainer)\n[![FOSSA Status](https://app.fossa.com/api/projects/git%2Bgithub.com%2Fkubeflow%2Ftrainer.svg?type=shield)](https://app.fossa.com/projects/git%2Bgithub.com%2Fkubeflow%2Ftrainer?ref=badge_shield)\n\n\u003ch1 align=\"center\"\u003e\n    \u003cimg src=\"./docs/images/trainer-logo.svg\" alt=\"logo\" width=\"200\"\u003e\n  \u003cbr\u003e\n\u003c/h1\u003e\n\nLatest News 🔥\n\n- [2025/09] Kubeflow SDK v0.1 is officially released with support for CustomTrainer,\n  BuiltinTrainer, and local PyTorch execution. Check out\n  [the GitHub release notes](https://github.com/kubeflow/sdk/releases/tag/0.1.0).\n- [2025/07] PyTorch on Kubernetes: Kubeflow Trainer Joins the PyTorch Ecosystem. Find the\n  announcement in [the PyTorch blog post](https://pytorch.org/blog/pytorch-on-kubernetes-kubeflow-trainer-joins-the-pytorch-ecosystem/).\n- [2025/07] Kubeflow Trainer v2.0 has been officially released. Check out\n  [the blog post announcement](https://blog.kubeflow.org/trainer/intro/) and [the\n  release notes](https://github.com/kubeflow/trainer/releases/tag/v2.0.0).\n- [2025/04] From High Performance Computing To AI Workloads on Kubernetes: MPI Runtime in\n  Kubeflow TrainJob. 
See the [KubeCon + CloudNativeCon London talk](https://youtu.be/Fnb1a5Kaxgo)\n\n## Overview\n\nKubeflow Trainer is a Kubernetes-native project designed for fine-tuning large language models (LLMs)\nand for enabling scalable, distributed training of machine learning (ML) models across\nvarious frameworks, including PyTorch, JAX, TensorFlow, and others.\n\nYou can integrate other ML libraries such as [HuggingFace](https://huggingface.co),\n[DeepSpeed](https://github.com/microsoft/DeepSpeed), or [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)\nwith Kubeflow Trainer to run them on Kubernetes.\n\nKubeflow Trainer enables you to effortlessly develop your LLMs with the\n[Kubeflow Python SDK](https://github.com/kubeflow/sdk/), and build Kubernetes-native Training\nRuntimes using Kubernetes Custom Resource APIs.\n\n\u003ch1 align=\"center\"\u003e\n    \u003cimg src=\"./docs/images/trainer-tech-stack.drawio.svg\" alt=\"logo\" width=\"500\"\u003e\n  \u003cbr\u003e\n\u003c/h1\u003e\n\n## Kubeflow Trainer Introduction\n\nThe following KubeCon + CloudNativeCon 2024 talk provides an overview of Kubeflow Trainer capabilities:\n\n[![Kubeflow Trainer](https://img.youtube.com/vi/Lgy4ir1AhYw/0.jpg)](https://www.youtube.com/watch?v=Lgy4ir1AhYw)\n\n## Getting Started\n\nPlease check [the official Kubeflow Trainer documentation](https://www.kubeflow.org/docs/components/trainer/getting-started)\nto install and get started with Kubeflow Trainer.\n\n## Community\n\nThe following links provide information on how to get involved in the community:\n\n- Join our [`#kubeflow-trainer` Slack channel](https://www.kubeflow.org/docs/about/community/#kubeflow-slack).\n- Attend [the bi-weekly AutoML and Training Working Group](https://bit.ly/2PWVCkV) community meeting.\n- Check out [who is using Kubeflow Trainer](ADOPTERS.md).\n\n## Contributing\n\nPlease refer to the [CONTRIBUTING guide](CONTRIBUTING.md).\n\n## Changelog\n\nPlease refer to the [CHANGELOG](CHANGELOG.md).\n\n## Kubeflow Training Operator V1\n\nThe Kubeflow Trainer project is currently in \u003cstrong\u003ealpha\u003c/strong\u003e status, and its APIs may change.\nIf you are using Kubeflow Training Operator V1, please refer to [this migration document](https://www.kubeflow.org/docs/components/trainer/operator-guides/migration/).\n\nThe Kubeflow Community will maintain the Training Operator V1 source code in\n[the `release-1.9` branch](https://github.com/kubeflow/trainer/tree/release-1.9).\n\nYou can find the documentation for Kubeflow Training Operator V1 in [these guides](https://www.kubeflow.org/docs/components/trainer/legacy-v1).\n\n## Acknowledgement\n\nThis project was originally started as a distributed training operator for TensorFlow, and we later\nmerged efforts from other Kubeflow Training Operators to provide a unified and simplified experience\nfor both users and developers. We are very grateful to all who filed issues or helped resolve them,\nasked and answered questions, and were part of inspiring discussions.\nWe'd also like to thank everyone who's contributed to and maintained the original operators.\n\n- PyTorch Operator: [list of contributors](https://github.com/kubeflow/pytorch-operator/graphs/contributors)\n  and [maintainers](https://github.com/kubeflow/pytorch-operator/blob/master/OWNERS).\n- MPI Operator: [list of contributors](https://github.com/kubeflow/mpi-operator/graphs/contributors)\n  and [maintainers](https://github.com/kubeflow/mpi-operator/blob/master/OWNERS).\n- XGBoost Operator: [list of contributors](https://github.com/kubeflow/xgboost-operator/graphs/contributors)\n  and [maintainers](https://github.com/kubeflow/xgboost-operator/blob/master/OWNERS).\n- Common library: [list of contributors](https://github.com/kubeflow/common/graphs/contributors) and\n  
[maintainers](https://github.com/kubeflow/common/blob/master/OWNERS).\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkubeflow%2Ftrainer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkubeflow%2Ftrainer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkubeflow%2Ftrainer/lists"}