{"id":20859789,"url":"https://github.com/stefanofioravanzo/distributed-deeplearning-kubernetes","last_synced_at":"2026-04-19T21:35:04.246Z","repository":{"id":77066988,"uuid":"136899344","full_name":"StefanoFioravanzo/distributed-deeplearning-kubernetes","owner":"StefanoFioravanzo","description":"Collection of resources for automatic deployment of distributed deep learning jobs on a Kubernetes cluster","archived":false,"fork":false,"pushed_at":"2018-09-18T07:48:14.000Z","size":126,"stargazers_count":2,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-01-19T07:31:28.507Z","etag":null,"topics":["azure-kubernetes-service","distributed-deep-learning","kubernetes-operator","mxnet","tensorflow"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/StefanoFioravanzo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-06-11T08:49:42.000Z","updated_at":"2024-05-27T02:22:07.000Z","dependencies_parsed_at":null,"dependency_job_id":"68e954d4-aceb-48e4-af71-c6ff101226fb","html_url":"https://github.com/StefanoFioravanzo/distributed-deeplearning-kubernetes","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/StefanoFioravanzo%2Fdistributed-deeplearning-kubernetes","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/StefanoFioravanzo%2Fdistributed-deeplearning-kubernetes/tags","releases_url":"https://re
pos.ecosyste.ms/api/v1/hosts/GitHub/repositories/StefanoFioravanzo%2Fdistributed-deeplearning-kubernetes/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/StefanoFioravanzo%2Fdistributed-deeplearning-kubernetes/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/StefanoFioravanzo","download_url":"https://codeload.github.com/StefanoFioravanzo/distributed-deeplearning-kubernetes/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243232029,"owners_count":20258023,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["azure-kubernetes-service","distributed-deep-learning","kubernetes-operator","mxnet","tensorflow"],"created_at":"2024-11-18T04:53:09.459Z","updated_at":"2025-12-25T22:17:49.123Z","avatar_url":"https://github.com/StefanoFioravanzo.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Distributed Deep Learning Using Kubernetes and CustomOperators\n\nThis repository is a collection of resources and scripts for the deployment and automation of resource allocation for Kubernetes operators. 
You can find a version of my MXNet Go operator at [https://github.com/StefanoFioravanzo/mx-operator](https://github.com/StefanoFioravanzo/mx-operator) and the more general Python DLOperator at [https://github.com/StefanoFioravanzo/dl-operator](https://github.com/StefanoFioravanzo/dl-operator).\n\n\n```\n.\n├── README.md\n├── azure_cli\n├── clean_up.sh\n├── config.sh\n├── deploy_job.sh\n├── init.sh\n├── mxnet_distributed\n│   ├── docker\n│   │   ├── cifar10\n│   │   └── linear_model\n│   └── local_distributed_kvstore\n├── operator_deployments\n├── remote_volume_storage\n└── tf_distributed\n    ├── image_painting_docker\n    ├── mnist_docker\n    └── template_cpu_distributed\n```\n\n##### `azure_cli/`\n\nCollection of scripts to automate the setup of Azure cloud resources, the management of a cluster of nodes, and the setup of remote storage.\n\n- Resource Groups \u0026 Storage Accounts: An Azure resource group is a logical container into which Azure resources are deployed and managed, useful for separation of concerns and resource management. A storage account contains all the Azure Storage data objects: blobs, files, queues, ...\n- AKS: Azure provides its own Kubernetes container orchestration service to manage and deploy Kubernetes clusters.\n- File Shares: Azure Files offers fully managed file shares in the cloud that are accessible via the industry standard SMB protocol. Azure Files is well suited to loading and storing training data in the cloud and to reporting results and logs.\n- Blob Storage: Blob storage is optimized for storing massive amounts of unstructured data, such as text or binary data.\n\n##### `mxnet_distributed/`\n\nExamples and Docker files to test a distributed MXNet environment.\n\n- `docker/`: Docker is convenient for testing applications that involve multiple processes that would otherwise need to run on multiple machines. The provided setup will run either a linear model or a cifar10 architecture with 1 master, 1 scheduler and 1 worker. 
To test the models, just `cd` into either the `cifar10` or `linear_model` folder and run `docker-compose up`.\n- `local_distributed_kvstore/`: Test and run a linear model with the distributed MXNet setup. Explore the KVStore APIs with the `kvstore` notebook.\n\n##### `operator_deployments`\n\nSome examples for the creation of a custom resource definition, the deployment of a new training job to the cluster, and the use of a parametrized Helm chart for a hyperparameter sweep. Helm makes it easy to create dynamic Kubernetes deployments by iterating over the values provided in `values.yaml`.\n\n##### `tf_distributed`\n\nSome example architectures to test out distributed TensorFlow training, using Docker.\n\n##### `remote_volume_storage`\n\nKubernetes tries to abstract the cloud provider as much as possible when creating, deleting and managing cluster resources. This also holds for storage resources. Kubernetes offers a series of objects that can be deployed through any Kubernetes client (e.g. 
`kubectl`) to create a storage account, (persistent) volume claims and file share secrets.\n\nThese two scripts show two different paths (the Azure `az` tool or Kubernetes `kubectl`) to achieve the same thing, mounting an Azure File Share:\n\n- `setup_storage_static_az.sh`: Example script to set up an Azure File Share starting from the creation of a Storage Account, using the `az` tool.\n- `setup_storage_static_kube.sh`: Example script that uses the Kubernetes APIs to set up a new Azure File Share, starting from the creation of a new Storage Account.\n\n##### `config.sh`\n\nSettings file for the automation scripts.\n\n##### `init.sh`\n\nComplete automation script for the deployment of a new AKS cluster, the setup of an Azure storage account and file share, the upload of a new dataset and the launch of a new distributed job.\n\n##### `deploy_job.sh`\n\nOne single script to create a new Azure AKS cluster, set up the nodes, deploy a custom operator (packaged with Helm) and start a new distributed training job.\n\n##### `clean_up.sh`\n\nClean up cloud resources.\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstefanofioravanzo%2Fdistributed-deeplearning-kubernetes","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fstefanofioravanzo%2Fdistributed-deeplearning-kubernetes","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstefanofioravanzo%2Fdistributed-deeplearning-kubernetes/lists"}