{"id":19621593,"url":"https://github.com/deeprec-ai/extension","last_synced_at":"2025-04-23T21:22:19.005Z","repository":{"id":240194515,"uuid":"731040742","full_name":"DeepRec-AI/extension","owner":"DeepRec-AI","description":"DeepRec Extension is an easy-to-use, stable and efficient large-scale distributed training system based on DeepRec.","archived":false,"fork":false,"pushed_at":"2024-05-17T06:46:26.000Z","size":1664,"stargazers_count":10,"open_issues_count":0,"forks_count":1,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-03-30T03:41:12.116Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DeepRec-AI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-12-13T08:25:54.000Z","updated_at":"2024-07-05T02:26:29.000Z","dependencies_parsed_at":"2024-05-17T08:49:57.535Z","dependency_job_id":"39c7ceb6-b3dc-4f23-b8d7-1c07426d4407","html_url":"https://github.com/DeepRec-AI/extension","commit_stats":null,"previous_names":["deeprec-ai/extension"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DeepRec-AI%2Fextension","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DeepRec-AI%2Fextension/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DeepRec-AI%2Fextension/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DeepRec-AI%2Fextension/manifests","owner_url":"https:/
/repos.ecosyste.ms/api/v1/hosts/GitHub/owners/DeepRec-AI","download_url":"https://codeload.github.com/DeepRec-AI/extension/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250515326,"owners_count":21443371,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-11T11:23:39.412Z","updated_at":"2025-04-23T21:22:18.975Z","avatar_url":"https://github.com/DeepRec-AI.png","language":"C++","readme":"# DeepRec Extension\n\n## Introduction\n\nDeepRec Extension is an easy-to-use, stable and efficient large-scale distributed training system based on [DeepRec](https://github.com/DeepRec-AI/DeepRec).\n\n## Features\n\n![](docs/image/extension_framework.svg)\n\n### Auto-scaling\n\nLarge-scale distributed training tasks contain many roles, such as chief, ps, and worker. Native interfaces for distributed training tasks require users to specify the count and resource allocation for each role, which is a significant challenge: it is difficult to configure these hyperparameters so that training tasks achieve high resource utilization. Values that are too small lead to Out-of-Memory (OOM) errors in the training task, while values that are too large waste resources through over-allocation.\n\nSome solutions achieve elastic training by stopping the job and restarting it from a checkpoint. 
This approach is intrusive to users: halting and resuming training consumes resources and increases overall training time. The overhead is particularly significant when a training task needs frequent elasticity adjustments. Restoring the model from the latest checkpoint also rolls back the model and the training samples, wasting compute resources.\n\nDynamic Embedding Server (DES) scales PS nodes up and down without restarting the job. It redistributes parameters and adds or removes servers automatically.\n\n### Gazer\n\nGazer is a metrics system for DeepRec/TensorFlow. It collects runtime machine load status and graph execution information and reports them to the Master node, which uses them to make elastic scaling decisions for the task; the metrics can also be presented to users via the TensorBoard interface.\n\n### Fast-fault-tolerance\n\nWith existing checkpoint mechanisms, an unexpected PS node failure requires restarting the entire task and reverting the model to the previous checkpoint. This discards all training progress made between the previous checkpoint and the node failure. Moreover, inconsistencies between the rollback of the distributed training model and the rollback of the samples cause the additional problem of sample loss.\n\nFirst, we keep the samples and the model consistent by means of an extra checkpoint. Second, when a PS node crashes, we restart only that PS node instead of restarting the job. Lastly, we back up the model parameters to enable rapid recovery in the event of PS node failures.\n\n### Master Controller\n\nTensorFlow training tasks lack a task-level master node to manage the state of all the functionality described above. Taking into account the cloud-native Kubernetes (K8S) resource scheduling ecosystem, we have extended tfjob in Kubeflow by adding a CRD with master capabilities.\n\n## How to build\n\n1. 
clone extension source code \u0026 init submodule\n\n```shell\ngit clone git@github.com:DeepRec-AI/extension.git /workspace/extension \u0026\u0026 cd /workspace/extension\ngit submodule update --init --recursive\n```\n\n2. start container\n\n```shell\ndocker run -ti --name deeprec-extension-dev --net=host -v /workspace:/workspace alideeprec/extension-dev:cpu-py36-ubuntu18.04 bash\n```\n\n3. build all python wheel modules\n\n```shell\ncd /workspace/extension\nmake gazer des master tft -j32\n```\n\n## How to deploy\n\n### Prerequisites\n\n1. `golang` version\u003e=1.20.12\n\n2. `kubectl` client [install kubectl on linux](https://kubernetes.io/docs/tasks/tools/install-kubectl-linux/#install-kubectl-binary-with-curl-on-linux)\n\n### Installation\n\n1. install \u0026 configure kubectl client (by default kubectl reads the cluster configuration from the file below)\n\n```shell\n$HOME/.kube/config\n```\n\n2. deploy kubeflow-operator\n\n```shell\n# clone kubeflow source code\ngit clone git@github.com:kubeflow/training-operator.git /workspace/training-operator \u0026\u0026 cd /workspace/training-operator\n# install kubeflow CRD\nmake install\n# deploy kubeflow image with v1.7.0\nmake deploy IMG=kubeflow/training-operator:v1-5525468\n```\n\n3. build \u0026 deploy aimaster-operator image\n\n```shell\ncd /workspace/extension/aimaster_operator/\n# install aimaster-operator CRD\nmake install\n# build aimaster-operator image\n# {image} is aimaster-operator image name\nmake docker-build IMG={image}\n# push image to your container registry\n# such as: Alibaba Container Registry, ACR, https://cr.console.aliyun.com\nmake docker-push IMG={image}\n# deploy aimaster-operator image to k8s\nmake deploy IMG={image}\n```\n\n4. build aimaster image\n\n```shell\ncd /workspace/extension\n# build aimaster image\nmake master-build-docker\n# push image to your container registry\n# such as: Alibaba Container Registry, ACR, https://cr.console.aliyun.com\n# docker push takes a single image name, so tag the local {image} first\ndocker tag {image} {namespace/image:tag}\ndocker push {namespace/image:tag}\n```\n\n5. 
build deeprec-extension image\n\n```shell\ncd /workspace/extension\n# build deeprec-extension image\nbash tools/examples/build_docker.sh\n# push image to your container registry\n# such as: Alibaba Container Registry, ACR, https://cr.console.aliyun.com\n# docker push takes a single image name, so tag the local {image} first\ndocker tag {image} {namespace/image:tag}\ndocker push {namespace/image:tag}\n```\n\n## How to use\n\n```shell\nkubectl apply -f tools/examples/extension_test.yaml\n```\n\n## Latest Images\n\n### aimaster-operator\n\n```shell\nalideeprec/extension-operator-release:latest\n```\n\n### aimaster\n\n```shell\nalideeprec/extension-aimaster-release:latest\n```\n\n### extension\n\ndeeprec + estimator + gazer + des + tf-fault-tolerance + deeprec-master\n\n```shell\nalideeprec/extension-release:latest\n```\n\n## Example on [ACK](https://www.aliyun.com/product/kubernetes)\n\n1. create an ACK cluster\n\n2. deploy kubeflow-operator\n\n3. deploy aimaster-operator\n\n4. execute training job\n\n```shell\nkubectl apply -f tools/examples/extension_test.yaml\n```\n\n### Estimator sample code\n\n[train.py](tools/examples/train.py)\n![](docs/image/code_diff.png)\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeeprec-ai%2Fextension","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdeeprec-ai%2Fextension","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeeprec-ai%2Fextension/lists"}