Projects in Awesome Lists tagged with distributed-training
A curated list of projects in awesome lists tagged with distributed-training.
https://github.com/gokumohandas/made-with-ml
Learn how to design, develop, deploy and iterate on production-grade ML applications.
data-engineering data-quality data-science deep-learning distributed-ml distributed-training llms machine-learning mlops natural-language-processing python pytorch ray
Last synced: 05 Mar 2026
https://github.com/GokuMohandas/MadeWithML
Learn how to design, develop, deploy and iterate on production-grade ML applications.
data-engineering data-quality data-science deep-learning distributed-ml distributed-training llms machine-learning mlops natural-language-processing python pytorch ray
Last synced: 03 Mar 2025
https://github.com/GokuMohandas/Made-With-ML
Learn how to design, develop, deploy and iterate on production-grade ML applications.
data-engineering data-quality data-science deep-learning distributed-ml distributed-training llms machine-learning mlops natural-language-processing python pytorch ray
Last synced: 15 Mar 2025
https://github.com/huggingface/pytorch-image-models
The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more
augmix convnext distributed-training efficientnet image-classification imagenet maxvit mixnet mobile-deep-learning mobilenet-v2 mobilenetv3 nfnets normalization-free-training optimizer pretrained-models pretrained-weights pytorch randaugment resnet vision-transformer-models
Last synced: 11 Dec 2025
https://github.com/paddlepaddle/paddle
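PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (飞桨 PaddlePaddle core framework: high-performance single-machine and distributed training, and cross-platform deployment, for deep learning and machine learning)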
deep-learning distributed-training efficiency machine-learning neural-network paddlepaddle python scalability
Last synced: 12 Jan 2026
https://github.com/PaddlePaddle/Paddle
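PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (飞桨 PaddlePaddle core framework: high-performance single-machine and distributed training, and cross-platform deployment, for deep learning and machine learning)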
deep-learning distributed-training efficiency machine-learning neural-network paddlepaddle python scalability
Last synced: 16 Mar 2025
https://github.com/paddlepaddle/paddlenlp
Easy-to-use and powerful LLM and SLM library with awesome model zoo.
bert compression distributed-training document-intelligence embedding ernie information-extraction llama llm neural-search nlp paddlenlp pretrained-models question-answering search-engine semantic-analysis sentiment-analysis transformers uie
Last synced: 09 Sep 2025
https://github.com/PaddlePaddle/PaddleNLP
👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis etc.
bert compression distributed-training document-intelligence embedding ernie information-extraction llama llm neural-search nlp paddlenlp pretrained-models question-answering search-engine semantic-analysis sentiment-analysis transformers uie
Last synced: 18 Mar 2025
https://github.com/skypilot-org/skypilot
SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 16+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
cloud-computing cloud-management cost-management cost-optimization data-science deep-learning distributed-training finops gpu hyperparameter-tuning job-queue job-scheduler llm-serving llm-training machine-learning ml-infrastructure ml-platform multicloud spot-instances tpu
Last synced: 02 Apr 2026
https://github.com/FedML-AI/FedML
FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI jobs on any GPU cloud or on-premise cluster. Built on this library, TensorOpera AI (https://TensorOpera.ai) is your generative AI platform at scale.
ai-agent deep-learning distributed-training edge-ai federated-learning inference-engine machine-learning mlops model-deployment model-serving on-device-training
Last synced: 04 Apr 2025
https://github.com/idea-ccnl/fengshenbang-lm
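Fengshenbang-LM (封神榜大模型) is an open-source large-model ecosystem led by the Cognitive Computing and Natural Language Research Center of the IDEA Research Institute, intended as infrastructure for Chinese AIGC and cognitive intelligence.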
aigc chinese-nlp distributed-training multimodal pretrained-models pytorch transformers
Last synced: 14 May 2025
https://github.com/IDEA-CCNL/Fengshenbang-LM
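Fengshenbang-LM (封神榜大模型) is an open-source large-model ecosystem led by the Cognitive Computing and Natural Language Research Center of the IDEA Research Institute, intended as infrastructure for Chinese AIGC and cognitive intelligence.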
aigc chinese-nlp distributed-training multimodal pretrained-models pytorch transformers
Last synced: 26 Mar 2025
https://github.com/fedml-ai/fedml
FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI jobs on any GPU cloud or on-premise cluster. Built on this library, TensorOpera AI (https://TensorOpera.ai) is your generative AI platform at scale.
ai-agent deep-learning distributed-training edge-ai federated-learning inference-engine machine-learning mlops model-deployment model-serving on-device-training
Last synced: 08 May 2025
https://github.com/bytedance/byteps
A high performance and generic framework for distributed DNN training
deep-learning distributed-training keras machine-learning mxnet pytorch tensorflow
Last synced: 14 May 2025
https://github.com/tensorflow/adanet
Fast and flexible AutoML with learning guarantees.
automl deep-learning distributed-training ensemble gpu learning-theory machine-learning neural-architecture-search python tensorflow tpu
Last synced: 10 Apr 2025
https://github.com/determined-ai/determined
Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.
data-science deep-learning distributed-training hyperparameter-optimization hyperparameter-search hyperparameter-tuning keras kubernetes machine-learning ml-infrastructure ml-platform mlops pytorch tensorflow
Last synced: 14 May 2025
https://github.com/alpa-projects/alpa
Training and serving large-scale neural networks with auto parallelization.
alpa auto-parallelization compiler deep-learning distributed-computing distributed-training high-performance-computing jax llm machine-learning
Last synced: 05 Jun 2026
https://github.com/learning-at-home/hivemind
Decentralized deep learning in PyTorch. Built to train models on thousands of volunteers across the world.
asynchronous-programming asyncio deep-learning dht distributed-systems distributed-training hivemind machine-learning mixture-of-experts neural-networks pytorch volunteer-computing
Last synced: 13 May 2025
https://github.com/intelligent-machine-learning/dlrover
DLRover: An Automatic Distributed Deep Learning System
distributed-training hacktoberfest k8s llm-training
Last synced: 29 Dec 2025
https://github.com/pytorch/gloo
Collective communications library with various primitives for multi-machine training.
collectives distributed-training pytorch
Last synced: 11 Dec 2025
https://github.com/tensorlayer/hyperpose
Library for Fast and Flexible Human Pose Estimation
computer-vision distributed-training mobilenet neural-networks openpose pose-estimation tensorflow tensorlayer tensorrt
Last synced: 12 Jan 2026
https://github.com/tensorlayer/HyperPose
Library for Fast and Flexible Human Pose Estimation
computer-vision distributed-training mobilenet neural-networks openpose pose-estimation tensorflow tensorlayer tensorrt
Last synced: 13 Apr 2025
https://github.com/deeprec-ai/deeprec
DeepRec is a high-performance recommendation deep learning framework based on TensorFlow. It is hosted in incubation in LF AI & Data Foundation.
advertising deep-learning distributed-training machine-learning python recommendation-engine scalability search-engine
Last synced: 14 May 2025
https://github.com/DeepRec-AI/DeepRec
DeepRec is a high-performance recommendation deep learning framework based on TensorFlow. It is hosted in incubation in LF AI & Data Foundation.
advertising deep-learning distributed-training machine-learning python recommendation-engine scalability search-engine
Last synced: 30 Mar 2025
https://github.com/mryab/efficient-dl-systems
Efficient Deep Learning Systems course materials (HSE, YSDA)
cuda deep-learning distributed-training efficient-deep-learning machine-learning ml-infrastructure mlops pytorch
Last synced: 15 May 2025
https://github.com/alibaba/Megatron-LLaMA
Best practice for training LLaMA models in Megatron-LM
deepspeed distributed-training llama llm megatron-lm pretraining pytorch
Last synced: 27 Mar 2025
https://github.com/Guitaricet/relora
Official code for ReLoRA from the paper Stack More Layers Differently: High-Rank Training Through Low-Rank Updates
deep-learning distributed-training llama nlp peft transformer
Last synced: 22 Jul 2025
https://github.com/petuum/adaptdl
Resource-adaptive cluster scheduler for deep learning training.
aws cloud deep-learning distributed-systems distributed-training kubernetes machine-learning pytorch
Last synced: 04 Oct 2025
https://github.com/lambdalabsml/distributed-training-guide
Best practices & guides on how to write distributed pytorch training code
cluster cuda deepspeed distributed-training fsdp gpu gpu-cluster kuberentes lambdalabs mpi nccl pytorch sharding slurm
Last synced: 16 May 2025
https://github.com/Oneflow-Inc/libai
LiBai(李白): A Toolbox for Large-Scale Distributed Parallel Training
data-parallelism deep-learning distributed-training large-scale model-parallelism nlp oneflow pipeline-parallelism self-supervised-learning transformer vision-transformer
Last synced: 09 May 2025
https://github.com/oneflow-inc/libai
LiBai(李白): A Toolbox for Large-Scale Distributed Parallel Training
data-parallelism deep-learning distributed-training large-scale model-parallelism nlp oneflow pipeline-parallelism self-supervised-learning transformer vision-transformer
Last synced: 08 Apr 2025
https://github.com/LambdaLabsML/distributed-training-guide
Best practices & guides on how to write distributed pytorch training code
cluster cuda deepspeed distributed-training fsdp gpu gpu-cluster kuberentes lambdalabs mpi nccl pytorch sharding slurm
Last synced: 08 Mar 2025
https://github.com/datacanvasio/hypergbm
A full pipeline AutoML tool for tabular data
adversarial-validation automl catboost dask dask-distributed datacleaning distributed-training ensemble-learning fullpipeline gbm gpu-acceleration lightgbm preprocessing pseudo-labeling rapidsai semi-supervised-learning sklearn tabular-data xgboost
Last synced: 15 May 2025
https://github.com/sail-sg/oat
🌾 OAT: A research-friendly framework for LLM online alignment, including preference learning, reinforcement learning, etc.
alignment distributed-rl distributed-training dpo dueling-bandits grpo llm llm-aligment llm-exploration online-alignment online-rl ppo r1-zero reasoning rlhf thompson-sampling
Last synced: 08 May 2025
https://github.com/DataCanvasIO/HyperGBM
A full pipeline AutoML tool for tabular data
adversarial-validation automl catboost dask dask-distributed datacleaning distributed-training ensemble-learning fullpipeline gbm gpu-acceleration lightgbm preprocessing pseudo-labeling rapidsai semi-supervised-learning sklearn tabular-data xgboost
Last synced: 09 May 2025
https://github.com/maudzung/yolo3d-yolov4-pytorch
YOLO3D: End-to-end real-time 3D Oriented Object Bounding Box Detection from LiDAR Point Cloud (ECCV 2018)
3d-object-detection darknet distributed-training object-detection point-cloud real-time rotated-boxes-iou yolo3d yolov4
Last synced: 10 Apr 2025
https://github.com/maudzung/YOLO3D-YOLOv4-PyTorch
YOLO3D: End-to-end real-time 3D Oriented Object Bounding Box Detection from LiDAR Point Cloud (ECCV 2018)
3d-object-detection darknet distributed-training object-detection point-cloud real-time rotated-boxes-iou yolo3d yolov4
Last synced: 20 Mar 2025
https://github.com/PKU-DAIR/Hetu
A high-performance distributed deep learning system targeting large-scale and automated distributed training.
artificial-intelligence autograd data-science deep-learning deep-neural-networks distributed-systems distributed-training embeddings gpu high-dimensional machine-learning python state-of-the-art
Last synced: 20 Mar 2025
https://github.com/lsds/kungfu
Fast and Adaptive Distributed Machine Learning for TensorFlow, PyTorch and MindSpore.
distributed-systems distributed-training keras tensorflow
Last synced: 09 Apr 2025
https://github.com/DeNA/HandyRL
HandyRL is a handy and simple framework based on Python and PyTorch for distributed reinforcement learning that is applicable to your own environments.
deep-learning distributed-training games machine-learning policy-gradient pytorch reinforcement-learning
Last synced: 03 Apr 2025
https://github.com/dena/handyrl
HandyRL is a handy and simple framework based on Python and PyTorch for distributed reinforcement learning that is applicable to your own environments.
deep-learning distributed-training games machine-learning policy-gradient pytorch reinforcement-learning
Last synced: 16 May 2025
https://github.com/hmunachi/nanodl
A Jax-based library for designing and training transformer models from scratch.
attention attention-mechanism deep-learning distributed-training flax gpt jax llama machine-learning mistral nlp transformer
Last synced: 05 Apr 2025
https://github.com/alibaba/easyparallellibrary
Easy Parallel Library (EPL) is a general and efficient deep learning framework for distributed model training.
data-parallelism deep-learning distributed-training gpu memory-efficient model-parallelism pipeline-parallelism
Last synced: 14 Oct 2025
https://github.com/alibaba/EasyParallelLibrary
Easy Parallel Library (EPL) is a general and efficient deep learning framework for distributed model training.
data-parallelism deep-learning distributed-training gpu memory-efficient model-parallelism pipeline-parallelism
Last synced: 04 Apr 2025
https://github.com/chairc/integrated-design-diffusion-model
IDDM (Industrial, landscape, animate, spectrogram...), supports DDPM, DDIM, PLMS, webui and distributed training. PyTorch implementation of diffusion models, generative models, and distributed training.
aigc ddim ddpm diffusion-models distributed-training plms pytorch
Last synced: 16 May 2025
https://github.com/zh320/realtime-semantic-segmentation-pytorch
PyTorch implementation of over 30 realtime semantic segmentation models, e.g. BiSeNetv1, BiSeNetv2, CGNet, ContextNet, DABNet, DDRNet, EDANet, ENet, ERFNet, ESPNet, ESPNetv2, FastSCNN, ICNet, LEDNet, LinkNet, PP-LiteSeg, SegNet, ShelfNet, STDC, SwiftNet, with support for knowledge distillation, distributed training, Optuna etc.
cityscapes distributed-training enet knowledge-distillation optuna pytorch real-time semantic-segmentation
Last synced: 04 Apr 2025
https://github.com/huggingface/chug
Minimal sharded dataset loaders, decoders, and utils for multi-modal document, image, and text datasets.
computer-vision dataloading datasets distributed-training document-understanding multi-modal-learning pdf-document webdataset
Last synced: 14 Oct 2025
https://github.com/paddlepaddle/plsc
Paddle Large Scale Classification Tools, supports ArcFace, CosFace, PartialFC, Data Parallel + Model Parallel. Model includes ResNet, ViT, Swin, DeiT, CaiT, FaceViT, MoCo, MAE, ConvMAE, CAE.
arcface cait convmae cosface data-parallel deit distributed-training face-recognition facevit hight-speed large-scale mae moco-v3 model-parallel paddle paddlepaddle partial-fc resnet swin-transformer vit
Last synced: 05 Mar 2026
https://github.com/aws/sagemaker-xgboost-container
This is the Docker container based on the open source framework XGBoost (https://xgboost.readthedocs.io/en/latest/) to allow customers to use their own XGBoost scripts in SageMaker.
aws distributed-training gbm inference machine-learning python sagemaker training xgboost
Last synced: 12 Jan 2026
https://github.com/microsoft/nnscaler
nnScaler: Compiling DNN models for Parallel Training
compiler deep-learning distributed-training llm machine-learning parallel-computing
Last synced: 05 Apr 2025
https://github.com/alibaba/tepdist
TePDist (TEnsor Program DISTributed) is an HLO-level automatic distributed system for DL models.
auto-parallelization compiler deep-learning disthlo distributed-computing distributed-systems distributed-training high-performance-computing machine-learning rhino
Last synced: 14 Oct 2025
https://github.com/alibaba/TePDist
TePDist (TEnsor Program DISTributed) is an HLO-level automatic distributed system for DL models.
auto-parallelization compiler deep-learning disthlo distributed-computing distributed-systems distributed-training high-performance-computing machine-learning rhino
Last synced: 04 Apr 2025
https://github.com/ai-hypercomputer/gpu-recipes
Recipes for reproducing training and serving benchmarks for large machine learning models using GPUs on Google Cloud.
benchmarks distributed-training google-cloud-platform gpu serving
Last synced: 25 Jun 2025
https://github.com/omerbsezer/fast-kubeflow
This repo covers Kubeflow Environment with LABs: Kubeflow GUI, Jupyter Notebooks on pods, Kubeflow Pipelines, Experiments, KALE, KATIB (AutoML: Hyperparameter Tuning), KFServe (Model Serving), Training Operators (Distributed Training), Projects, etc.
automl distributed-training jupyter-notebooks kale katib kubeflow kubeflow-component kubeflow-demo kubeflow-pipeline kubernetes training-operators
Last synced: 28 Apr 2025
https://github.com/bryanyzhu/video-tutorial-cvpr2020
A Comprehensive Tutorial on Video Modeling
distributed-training gluoncv human-action-recognition mxnet video-classification
Last synced: 22 Mar 2025
https://github.com/hkproj/pytorch-transformer-distributed
Distributed training (multi-node) of a Transformer model
collective-communication data-parallelism deep-learning distributed-data-parallel distributed-training gradient-accumulation machine-learning model-parallelism pytorch tutorial
Last synced: 06 May 2025
https://github.com/tanyuqian/redco
NAACL '24 (Best Demo Paper RunnerUp) / MlSys @ NeurIPS '23 - RedCoast: A Lightweight Tool to Automate Distributed Training and Inference
differential-privacy diffusion-models distributed-training fedavg federated-learning flan-t5-xxl gemma image-captioning jax large-language-models llama maml meta-learning mixed-precision mlsys model-parallelism ppo reinforcement-learning seq2seq stable-diffusion
Last synced: 06 Apr 2025
https://github.com/adrianbzg/llm-distributed-finetune
Efficiently tune any LLM from HuggingFace using distributed training (multiple GPUs) and DeepSpeed. Uses Ray AIR to orchestrate the training on multiple AWS GPU instances.
aws deep-learning distributed-training falcon fine-tuning huggingface large-language-models natural-language-processing transformers
Last synced: 07 Oct 2025
https://github.com/notedance/note
Machine learning library, Distributed training, Deep learning, Reinforcement learning, Models, TensorFlow, PyTorch
artificial-intelligence deep-learning deep-reinforcement-learning deeplearning deepreinforcementlearning distributed-training dl drl machine-learning machine-learning-library machinelearning ml neural-network neuralnetwork parallel-training pytorch reinforcement-learning reinforcementlearning rl tensorflow
Last synced: 20 Aug 2025
https://github.com/pinpoint-apm/pinpoint-node-agent
Pinpoint Node.js agent
agent apm distributed-training monitoring node performance pinpoint
Last synced: 05 Apr 2025
https://github.com/aws-samples/aws-do-eks
Create, List, Update, Delete Amazon EKS clusters. Deploy and manage software on EKS. Run distributed model training and inference examples.
deployment distributed-training do-framework docker eks eksctl inference observability terraform
Last synced: 16 May 2025
https://github.com/andreped/gradientaccumulator
🎯 Accumulated Gradients for TensorFlow 2
accumulated-batch-normalization accumulated-gradients adaptive-gradient-clipping batch-size deep-learning distributed-training float16 gpu gradient-accumulation hacktoberfest huggingface keras memory-constraints mixed-precision multi-gpu tensorflow tensorflow2 tf2 tpu
Last synced: 13 Apr 2025
https://github.com/AdrianBZG/LLM-distributed-finetune
Efficiently tune any LLM from HuggingFace using distributed training (multiple GPUs) and DeepSpeed. Uses Ray AIR to orchestrate the training on multiple AWS GPU instances.
aws deep-learning distributed-training falcon fine-tuning huggingface large-language-models natural-language-processing transformers
Last synced: 08 May 2025
https://github.com/saareliad/FTPipe
FTPipe and related pipeline model parallelism research.
deep-neural-networks distributed-training fine-tuning nlp pipeline-parallelism t5
Last synced: 13 Apr 2025
https://github.com/uw-mad-dash/shockwave
Code for "Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning" [NSDI '23]
cloud-computing cluster-scheduler deep-learning distributed-systems distributed-training machine-learning pytorch
Last synced: 27 Apr 2025
https://github.com/saforem2/ezpz
Write once, run anywhere; ezpz 🍋
ai-tools deepspeed distributed-training fsdp launcher machine-learning mpi mpi4py parallelism python pytorch rich slurm torch training
Last synced: 26 Jun 2026
https://github.com/aws-samples/TensorFlow-in-SageMaker-workshop
Running your TensorFlow models in Amazon SageMaker
amazon-sagemaker distributed-training pipemode tensorflow
Last synced: 12 Apr 2025
https://github.com/4paradigm/openembedding
OpenEmbedding is an open source framework for Tensorflow distributed training acceleration.
distributed-training embedding-layers model-parallel parameter-server tensorflow tensorflow-training
Last synced: 27 Mar 2026
https://github.com/determined-ai/determined-examples
Example ML projects that use the Determined library.
deep-learning distributed-training hyperparameter-tuning keras machine-learning ml-infrastructure pytorch tensorflow
Last synced: 30 Apr 2025
https://github.com/AshishKumar4/FlaxDiff
A simple, easy-to-understand library for diffusion models using Flax and Jax. Includes detailed notebooks on DDPM, DDIM, and EDM with simplified mathematical explanations. Made as part of my journey for learning and experimenting with generative AI.
ai-research attention ddim ddpm deep-learning diffusion diffusion-models distributed-training edm flax generative-ai image-generation image2image jax karras machine-learning score-based-generative-modeling stable-diffusion tensorflow unet
Last synced: 06 Sep 2025
https://github.com/raywan-110/adaqp
Adaptive Message Quantization and Parallelization for Distributed Full-graph GNN Training
distributed-training graph-neural-networks quantization
Last synced: 21 Jun 2025
https://github.com/graykode/horovod-ansible
Create Horovod cluster easily using Ansible
ansible deeplearning distributed-training horovod openmpi pytorch tensorflow terraform
Last synced: 05 May 2025
https://github.com/sayakpaul/distributed-training-in-tensorflow-2-with-ai-platform
Contains code to demonstrate distributed training in TensorFlow 2 with AI Platform and custom Docker containers.
ai-platform distributed-training docker gcp gcr keras tensorflow2
Last synced: 07 Jul 2025
https://github.com/taishan1994/pytorch-distributed-nlp
PyTorch distributed training
bert distributed-training pytorch text-classification
Last synced: 04 Apr 2025
https://github.com/SLAMPAI/large-scale-pretraining-transfer
Code for reproducing the experiments on large-scale pre-training and transfer learning for the paper "Effect of large-scale pre-training on full and few-shot transfer learning for natural and medical images" (https://arxiv.org/abs/2106.00116)
big-transfer chest-x-ray14 chest-xray-images chexpert-dataset covidx-dataset deep-learning distributed-training few-shot-learning fine-tuning imagenet large-scale-learning medical-imaging mimic-cxr padchest-dataset pre-trained-model pre-training pytorch scaling-laws supercomputing transfer-learning
Last synced: 08 May 2025
https://github.com/senli1073/seist
[TGRS] SeisT: A Foundational Deep-Learning Model for Earthquake Monitoring Tasks
baz-azimuth deep-learning distributed-training earthquake-detection epicentral-distance first-motion-polarity foundation magnitude phase-picking pytorch seismogram seismology transformer
Last synced: 14 Apr 2025
https://github.com/18520339/ml-distributed-training
Reduce the training time of CNNs by leveraging the power of multiple GPUs with 2 approaches, Multi-Worker and Parameter Server Training, using TensorFlow 2
distributed distributed-tensorflow distributed-training multi-gpu multi-workers parameter-server tensorflow
Last synced: 27 Feb 2026
https://github.com/daekeun-ml/sm-distributed-training-step-by-step
This repository provides hands-on labs on PyTorch-based Distributed Training and SageMaker Distributed Training. It is written to make it easy for beginners to get started, and guides you through step-by-step modifications to the code based on the most basic BERT use cases.
data-parallelism distributed-training pytorch-ddp sagemaker
Last synced: 12 Oct 2025
https://github.com/zh320/medical-segmentation-pytorch
PyTorch implementation of medical semantic segmentation models, e.g. UNet, UNet++, DUCKNet, ResUNet, ResUNet++, with support for knowledge distillation, distributed training, Optuna etc.
distributed-training knowledge-distillation medical-image-segmentation optuna polyp-segmentation pytorch resunet unet unetplusplus
Last synced: 12 Apr 2025
https://github.com/hunterdii/tensorflow-advanced-techniques-solution
Tensorflow Advanced Technique Specialization - Solution
computer-vision coursera coursera-specialization custom-model custom-training deep-learning deeplearning-ai distributed-training generative-ai image-detection image-segmentation-tensorflow machine-learning object-detection object-detection-model semantic-segmentation specialization tensorflow tensorflow-tutorials visualization
Last synced: 24 Oct 2025
https://github.com/bryanlimy/tf2-cyclegan
TensorFlow 2 implementation of CycleGAN with multi-GPU training.
cyclegan distributed-training gan mirroredstrategy multi-gpus tensorflow tensorflow2 tf2
Last synced: 08 May 2025
https://github.com/shenggan/atp
Adaptive Tensor Parallelism for Foundation Models
attention distributed-training gpt large-model model-parallelism pytorch transformer
Last synced: 19 Aug 2025
https://github.com/rosinality/meshfn
Framework for Human Alignment Learning
alignment distributed-training large-language-models
Last synced: 28 Apr 2025
https://github.com/saforem2/mmm
Multi-Modal Modeling
distributed-training llm multi-modal pytorch
Last synced: 13 Aug 2025
https://github.com/nwangfw/nerf_ddp
distributed-data-parallel distributed-training nerf neural-radiance-fields pytorch
Last synced: 22 Apr 2026
https://github.com/asprenger/distributed-training-patterns
Experiments with low level communication patterns that are useful for distributed training.
distributed-training horovod mpi mpi4py nccl tensorflow
Last synced: 15 Jul 2025
https://github.com/zerfoo/zerfoo
Pure Go machine learning framework. Train, run, and serve ML models with go build. Zero CGo.
autodiff deep-learning distributed-training float16 float8 fp16 fp8 go golang graph-ml machine-learning ml-framework neural-network onnx transformer
Last synced: 13 Jun 2026
https://github.com/alex-snd/trecover
📜 A python library for distributed training of a Transformer neural network across the Internet to solve the Running Key Cipher, widely known in the field of Cryptography.
celery cryptography deep-learning distributed-systems distributed-training fastapi hivemind keyless-reading llm machine-learning mkdocs neural-network nlp python pytorch pytorch-lightning streamlit text-recovery transformers volunteer-computing
Last synced: 08 Oct 2025
https://github.com/ler0ever/hpgo
Development of Project HPGO | Hybrid Parallelism Global Orchestration
data-parallelism distributed-training gpipe machine-learning model-parallelism pipedream pipeline-parallelism pytorch rust tensorflow
Last synced: 15 Jul 2025
https://github.com/tlatkowski/u-net-tpu
Tensorflow implementation of U-Net model with TPU Estimator support.
cnn convolutional-neural-networks deep-learning distributed-training encoder-decoder google-cloud-platform image-classification image-processing image-recognition image-segmentation tensorflow tensorflow-models tpu u-net unet unet-image-segmentation unet-model unet-tensorflow vision
Last synced: 02 May 2026
https://github.com/amanpriyanshu/fl-interactive-game
FL-Interactive-Game: Interactive web game that teaches basic components of Federated Learning
computer-vision decentralized-learning distributed-training federated-learning federated-learning-framework fl game interactive-visualizations machine-learning mnist neural-network neural-networks privacy privacy-enhancing-technologies privacy-protection tensorflow tensorflowjs
Last synced: 29 Jan 2026
https://github.com/denpalrius/bft-federated-learning
Federated Learning with Byzantine Fault Tolerance
artificial-intelligence bft bft-protocols cifar-10 distributed-training fault-tolerance federated-learning federated-learning-algorithm flower grpc machine-learning-algorithms pytorch
Last synced: 10 Feb 2026
https://github.com/valaydave/metaflow-kube-demo
Metaflow On Kubernetes
deep-learning distributed-training experiments-analytics kubernetes machine-learning-productivity metaflow
Last synced: 11 May 2026
https://github.com/tolgatasci/ai-farm
AI-Farm is a distributed deep learning training framework that enables efficient model training across multiple machines. It provides a scalable infrastructure with real-time monitoring through a web admin panel, adaptive task distribution, and support for both CPU and GPU training.
deep-learning distributed-training federated-learning gpu-training machine-learning python pytorch websockets
Last synced: 13 Apr 2026
https://github.com/ingero-io/ingero-fleet
GPU cluster straggler detection - custom OTEL Collector distribution
anomaly-detection distributed-training gpu gpu-observability kubernetes llm-inference machine-learning observability opentelemetry opentelemetry-collector otlp sre straggler-detection
Last synced: 02 May 2026
https://github.com/bjornmelin/deep-learning-evolution
🧠 Deep-Learning Evolution: Unified collection of TensorFlow & PyTorch projects, featuring custom CUDA kernels, distributed training, memory‑efficient methods, and production‑ready pipelines. Showcases advanced GPU optimizations, from foundational models to cutting‑edge architectures. 🚀
ai-research cuda data-science deep-learning distributed-training gan gpu-acceleration machine-learning model-optimization neural-networks python pytorch tensorflow training-pipeline transformers
Last synced: 09 May 2026
https://github.com/satvikpraveen/lightningmasterpro
Comprehensive PyTorch Lightning framework featuring 20+ educational notebooks, advanced ML patterns, and production-ready workflows. Covers vision, NLP, tabular, and time series domains with distributed training, mixed precision, custom loops, and deployment pipelines. Complete with synthetic data generators and testing.
artificial-intelligence computer-vision data-science deep-learning distributed-training gradient-accumulation machine-learning mixed-precision mlops model-deployment model-training natural-language-processing neural-networks onnx-export python pytorch pytorch-lightning tabular-data time-series torchscript
Last synced: 06 May 2026
https://github.com/jman4162/sizing-ai-training-by-cost-per-memory-bandwidth
A practical model (with math + Python) to tell if you’re compute-, memory-, or network-bound—and what to buy next
ai ai-infrastructure aws aws-ec2 cost-optimization distributed-systems distributed-training hbm llm llm-training machine-learning memory-bandwidth ml nccl pytorch roofline-model systems-performance transformer
Last synced: 13 Apr 2026
https://github.com/cmontemuino/amd-mi300x-ml-benchmarks
Comprehensive machine learning benchmarking framework for AMD MI300X GPUs on Dell PowerEdge XE9680 hardware. Supports both inference (vLLM) and training workloads with containerized test suites, hardware monitoring, and analysis tools for performance, power efficiency, and scalability research across the complete ML pipeline.
ai-infrastructure amd-gpu amd-mi300x containerized-testing deep-learning dell-poweredge distributed-training gpu-computing gpu-monitoring gpu-parallelism gpu-performance hardware-monitoring llm-benchmarking machine-learning performance-analysis power-efficiency pytorch rocm scalability-testing vllm
Last synced: 05 Feb 2026
https://github.com/stefanofioravanzo/dl-operator
General purpose Kubernetes operator for DL frameworks written in Python
deep-learning distributed-training kubernetes kubernetes-python-client operator
Last synced: 18 May 2026