{"id":19888237,"url":"https://github.com/project-codeflare/mlbatch","last_synced_at":"2026-03-05T11:33:36.084Z","repository":{"id":246396309,"uuid":"820618204","full_name":"project-codeflare/mlbatch","owner":"project-codeflare","description":"Queuing and quota management for AI/ML batch jobs on Kubernetes","archived":false,"fork":false,"pushed_at":"2025-07-16T17:01:34.000Z","size":716,"stargazers_count":14,"open_issues_count":2,"forks_count":8,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-10-17T01:32:52.574Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/project-codeflare.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-06-26T20:50:01.000Z","updated_at":"2025-07-16T17:01:07.000Z","dependencies_parsed_at":"2024-07-31T22:45:42.783Z","dependency_job_id":"623721ae-f1c7-4752-9007-594f71f531c8","html_url":"https://github.com/project-codeflare/mlbatch","commit_stats":null,"previous_names":["project-codeflare/mlbatch"],"tags_count":18,"template":false,"template_full_name":null,"purl":"pkg:github/project-codeflare/mlbatch","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/project-codeflare%2Fmlbatch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/project-codeflare%2Fmlbatch/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/project-codeflar
e%2Fmlbatch/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/project-codeflare%2Fmlbatch/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/project-codeflare","download_url":"https://codeload.github.com/project-codeflare/mlbatch/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/project-codeflare%2Fmlbatch/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30122191,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-05T11:11:57.947Z","status":"ssl_error","status_checked_at":"2026-03-05T11:11:29.001Z","response_time":93,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-12T18:06:42.757Z","updated_at":"2026-03-05T11:33:36.038Z","avatar_url":"https://github.com/project-codeflare.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# MLBatch\n\nThis repository describes the [setup](SETUP.md) and [use](USAGE.md) of the\nMLBatch queuing and quota management system on OpenShift and Kubernetes clusters. 
MLBatch\nleverages [Kueue](https://kueue.sigs.k8s.io), the [Kubeflow Training\nOperator](https://www.kubeflow.org/docs/components/training/),\n[KubeRay](https://docs.ray.io/en/latest/cluster/kubernetes/index.html), and the\n[Codeflare Operator](https://github.com/project-codeflare/codeflare-operator)\nfrom [Red Hat OpenShift\nAI](https://www.redhat.com/en/technologies/cloud-computing/openshift/openshift-ai).\nMLBatch enables [AppWrappers](https://project-codeflare.github.io/appwrapper/)\nand adds\n[Coscheduler](https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/pkg/coscheduling/README.md).\nMLBatch includes a number of configuration steps to help these components work\nin harmony and support large workloads on large clusters.\n\nMLBatch handles the queuing and dispatching of batch workloads on OpenShift and Kubernetes\nclusters. It enforces team quotas at the namespace level. It automates the\nborrowing and reclamation of unused quotas across teams. Teams can use\npriorities within their namespaces without impact on other teams. Using\nAppWrappers to submit workloads activates a number of fault detection and\nrecovery capabilities, including automatically detecting failed pods and\nautomatically retrying failed workloads. 
Coscheduler supports gang scheduling\nand minimizes fragmentation by preferentially packing jobs requiring less than a\nfull node's worth of GPUs together.\n\n## Cluster Setup\n\nTo learn how to set up MLBatch on a cluster and onboard teams, see\n[SETUP.md](SETUP.md).\n\n*Quota maintenance* is a key aspect of smoothly administering an MLBatch cluster.\nCluster admins should carefully read [QUOTA_MAINTENANCE.md](QUOTA_MAINTENANCE.md).\n\n## Running Workloads\n\nTo learn how to run workloads on an MLBatch cluster, see [USAGE.md](USAGE.md), or\n[CODEFLARE.md](CODEFLARE.md) if you are already familiar with the\n[CodeFlare](https://github.com/project-codeflare) stack for managing AI/ML\nworkloads on Kubernetes.\n\n### PyTorchJobs via the MLBatch Helm Chart\n\nProperly configuring a distributed `PyTorchJob` to make effective use of the\nMLBatch system and hardware accelerators (GPUs, RoCE GDR) can be tedious. To\nautomate this process, we provide a Helm chart that captures best practices and\ncommon configuration options. Using this Helm chart helps eliminate common\nmistakes. Please see [pytorchjob-generator](tools/pytorchjob-generator) for\ndetailed usage instructions.\n\n## Development Setup\n\nIf you will be contributing to the development of the MLBatch project, you must\nset up pre-commit hooks for your local clone of the repository. 
Do the following\nonce, immediately after cloning this repo:\n```shell\nhelm plugin install https://github.com/helm-unittest/helm-unittest.git\npre-commit install\n```\n\n## License\n\nCopyright 2024 IBM Corporation.\n\nLicensed under the Apache License, Version 2.0 (the \"License\");\nyou may not use this file except in compliance with the License.\nYou may obtain a copy of the License at\n\n    http://www.apache.org/licenses/LICENSE-2.0\n\nUnless required by applicable law or agreed to in writing, software\ndistributed under the License is distributed on an \"AS IS\" BASIS,\nWITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\nSee the License for the specific language governing permissions and\nlimitations under the License.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fproject-codeflare%2Fmlbatch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fproject-codeflare%2Fmlbatch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fproject-codeflare%2Fmlbatch/lists"}