An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with distributed-training

A curated list of projects in awesome lists tagged with distributed-training.

https://github.com/huggingface/pytorch-image-models

The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more

augmix convnext distributed-training efficientnet image-classification imagenet maxvit mixnet mobile-deep-learning mobilenet-v2 mobilenetv3 nfnets normalization-free-training optimizer pretrained-models pretrained-weights pytorch randaugment resnet vision-transformer-models

Last synced: 11 Dec 2025

https://github.com/paddlepaddle/paddle

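PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the 『飞桨』 core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)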

deep-learning distributed-training efficiency machine-learning neural-network paddlepaddle python scalability

Last synced: 12 Jan 2026

https://github.com/PaddlePaddle/Paddle

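PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the 『飞桨』 core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)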

deep-learning distributed-training efficiency machine-learning neural-network paddlepaddle python scalability

Last synced: 16 Mar 2025

https://github.com/PaddlePaddle/PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting a wide range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis, etc.

bert compression distributed-training document-intelligence embedding ernie information-extraction llama llm neural-search nlp paddlenlp pretrained-models question-answering search-engine semantic-analysis sentiment-analysis transformers uie

Last synced: 18 Mar 2025

https://github.com/skypilot-org/skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 16+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.

cloud-computing cloud-management cost-management cost-optimization data-science deep-learning distributed-training finops gpu hyperparameter-tuning job-queue job-scheduler llm-serving llm-training machine-learning ml-infrastructure ml-platform multicloud spot-instances tpu

Last synced: 02 Apr 2026

https://github.com/FedML-AI/FedML

FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI jobs on any GPU cloud or on-premise cluster. Built on this library, TensorOpera AI (https://TensorOpera.ai) is your generative AI platform at scale.

ai-agent deep-learning distributed-training edge-ai federated-learning inference-engine machine-learning mlops model-deployment model-serving on-device-training

Last synced: 04 Apr 2025

https://github.com/idea-ccnl/fengshenbang-lm

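Fengshenbang-LM (封神榜大模型) is an open-source large-model ecosystem led by the Cognitive Computing and Natural Language Research Center of the IDEA Research Institute, serving as infrastructure for Chinese AIGC and cognitive intelligence.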

aigc chinese-nlp distributed-training multimodal pretrained-models pytorch transformers

Last synced: 14 May 2025

https://github.com/IDEA-CCNL/Fengshenbang-LM

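Fengshenbang-LM (封神榜大模型) is an open-source large-model ecosystem led by the Cognitive Computing and Natural Language Research Center of the IDEA Research Institute, serving as infrastructure for Chinese AIGC and cognitive intelligence.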

aigc chinese-nlp distributed-training multimodal pretrained-models pytorch transformers

Last synced: 26 Mar 2025

https://github.com/fedml-ai/fedml

FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI jobs on any GPU cloud or on-premise cluster. Built on this library, TensorOpera AI (https://TensorOpera.ai) is your generative AI platform at scale.

ai-agent deep-learning distributed-training edge-ai federated-learning inference-engine machine-learning mlops model-deployment model-serving on-device-training

Last synced: 08 May 2025

https://github.com/bytedance/byteps

A high performance and generic framework for distributed DNN training

deep-learning distributed-training keras machine-learning mxnet pytorch tensorflow

Last synced: 14 May 2025

https://github.com/determined-ai/determined

Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.

data-science deep-learning distributed-training hyperparameter-optimization hyperparameter-search hyperparameter-tuning keras kubernetes machine-learning ml-infrastructure ml-platform mlops pytorch tensorflow

Last synced: 14 May 2025

https://github.com/learning-at-home/hivemind

Decentralized deep learning in PyTorch. Built to train models on thousands of volunteers across the world.

asynchronous-programming asyncio deep-learning dht distributed-systems distributed-training hivemind machine-learning mixture-of-experts neural-networks pytorch volunteer-computing

Last synced: 13 May 2025

https://github.com/intelligent-machine-learning/dlrover

DLRover: An Automatic Distributed Deep Learning System

distributed-training hacktoberfest k8s llm-training

Last synced: 29 Dec 2025

https://github.com/pytorch/gloo

Collective communications library with various primitives for multi-machine training.

collectives distributed-training pytorch

Last synced: 11 Dec 2025

https://github.com/deeprec-ai/deeprec

DeepRec is a high-performance recommendation deep learning framework based on TensorFlow. It is hosted as an incubation project at the LF AI & Data Foundation.

advertising deep-learning distributed-training machine-learning python recommendation-engine scalability search-engine

Last synced: 14 May 2025

https://github.com/DeepRec-AI/DeepRec

DeepRec is a high-performance recommendation deep learning framework based on TensorFlow. It is hosted as an incubation project at the LF AI & Data Foundation.

advertising deep-learning distributed-training machine-learning python recommendation-engine scalability search-engine

Last synced: 30 Mar 2025

https://github.com/alibaba/Megatron-LLaMA

Best practice for training LLaMA models in Megatron-LM

deepspeed distributed-training llama llm megatron-lm pretraining pytorch

Last synced: 27 Mar 2025

https://github.com/Guitaricet/relora

Official code for ReLoRA from the paper Stack More Layers Differently: High-Rank Training Through Low-Rank Updates

deep-learning distributed-training llama nlp peft transformer

Last synced: 22 Jul 2025

https://github.com/petuum/adaptdl

Resource-adaptive cluster scheduler for deep learning training.

aws cloud deep-learning distributed-systems distributed-training kubernetes machine-learning pytorch

Last synced: 04 Oct 2025

https://github.com/lambdalabsml/distributed-training-guide

Best practices & guides on how to write distributed PyTorch training code

cluster cuda deepspeed distributed-training fsdp gpu gpu-cluster kuberentes lambdalabs mpi nccl pytorch sharding slurm

Last synced: 16 May 2025

https://github.com/LambdaLabsML/distributed-training-guide

Best practices & guides on how to write distributed PyTorch training code

cluster cuda deepspeed distributed-training fsdp gpu gpu-cluster kuberentes lambdalabs mpi nccl pytorch sharding slurm

Last synced: 08 Mar 2025

https://github.com/sail-sg/oat

🌾 OAT: A research-friendly framework for LLM online alignment, including preference learning, reinforcement learning, etc.

alignment distributed-rl distributed-training dpo dueling-bandits grpo llm llm-aligment llm-exploration online-alignment online-rl ppo r1-zero reasoning rlhf thompson-sampling

Last synced: 08 May 2025

https://github.com/maudzung/yolo3d-yolov4-pytorch

YOLO3D: End-to-end real-time 3D Oriented Object Bounding Box Detection from LiDAR Point Cloud (ECCV 2018)

3d-object-detection darknet distributed-training object-detection point-cloud real-time rotated-boxes-iou yolo3d yolov4

Last synced: 10 Apr 2025

https://github.com/maudzung/YOLO3D-YOLOv4-PyTorch

YOLO3D: End-to-end real-time 3D Oriented Object Bounding Box Detection from LiDAR Point Cloud (ECCV 2018)

3d-object-detection darknet distributed-training object-detection point-cloud real-time rotated-boxes-iou yolo3d yolov4

Last synced: 20 Mar 2025

https://github.com/PKU-DAIR/Hetu

A high-performance distributed deep learning system targeting large-scale and automated distributed training.

artificial-intelligence autograd data-science deep-learning deep-neural-networks distributed-systems distributed-training embeddings gpu high-dimensional machine-learning python state-of-the-art

Last synced: 20 Mar 2025

https://github.com/lsds/kungfu

Fast and Adaptive Distributed Machine Learning for TensorFlow, PyTorch and MindSpore.

distributed-systems distributed-training keras tensorflow

Last synced: 09 Apr 2025

https://github.com/DeNA/HandyRL

HandyRL is a handy and simple framework based on Python and PyTorch for distributed reinforcement learning that is applicable to your own environments.

deep-learning distributed-training games machine-learning policy-gradient pytorch reinforcement-learning

Last synced: 03 Apr 2025

https://github.com/dena/handyrl

HandyRL is a handy and simple framework based on Python and PyTorch for distributed reinforcement learning that is applicable to your own environments.

deep-learning distributed-training games machine-learning policy-gradient pytorch reinforcement-learning

Last synced: 16 May 2025

https://github.com/hmunachi/nanodl

A Jax-based library for designing and training transformer models from scratch.

attention attention-mechanism deep-learning distributed-training flax gpt jax llama machine-learning mistral nlp transformer

Last synced: 05 Apr 2025

https://github.com/alibaba/easyparallellibrary

Easy Parallel Library (EPL) is a general and efficient deep learning framework for distributed model training.

data-parallelism deep-learning distributed-training gpu memory-efficient model-parallelism pipeline-parallelism

Last synced: 14 Oct 2025

https://github.com/alibaba/EasyParallelLibrary

Easy Parallel Library (EPL) is a general and efficient deep learning framework for distributed model training.

data-parallelism deep-learning distributed-training gpu memory-efficient model-parallelism pipeline-parallelism

Last synced: 04 Apr 2025

https://github.com/chairc/integrated-design-diffusion-model

IDDM (Industrial, landscape, animate, spectrogram...), supports DDPM, DDIM, PLMS, webui and distributed training. A PyTorch implementation of diffusion models, generative models, and distributed training.

aigc ddim ddpm diffusion-models distributed-training plms pytorch

Last synced: 16 May 2025

https://github.com/zh320/realtime-semantic-segmentation-pytorch

PyTorch implementation of over 30 real-time semantic segmentation models, e.g. BiSeNetv1, BiSeNetv2, CGNet, ContextNet, DABNet, DDRNet, EDANet, ENet, ERFNet, ESPNet, ESPNetv2, FastSCNN, ICNet, LEDNet, LinkNet, PP-LiteSeg, SegNet, ShelfNet, STDC, SwiftNet, with support for knowledge distillation, distributed training, Optuna, etc.

cityscapes distributed-training enet knowledge-distillation optuna pytorch real-time semantic-segmentation

Last synced: 04 Apr 2025

https://github.com/huggingface/chug

Minimal sharded dataset loaders, decoders, and utils for multi-modal document, image, and text datasets.

computer-vision dataloading datasets distributed-training document-understanding multi-modal-learning pdf-document webdataset

Last synced: 14 Oct 2025

https://github.com/paddlepaddle/plsc

Paddle Large Scale Classification Tools, supports ArcFace, CosFace, PartialFC, Data Parallel + Model Parallel. Models include ResNet, ViT, Swin, DeiT, CaiT, FaceViT, MoCo, MAE, ConvMAE, CAE.

arcface cait convmae cosface data-parallel deit distributed-training face-recognition facevit hight-speed large-scale mae moco-v3 model-parallel paddle paddlepaddle partial-fc resnet swin-transformer vit

Last synced: 05 Mar 2026

https://github.com/aws/sagemaker-xgboost-container

This is the Docker container based on the open source framework XGBoost (https://xgboost.readthedocs.io/en/latest/) that allows customers to use their own XGBoost scripts in SageMaker.

aws distributed-training gbm inference machine-learning python sagemaker training xgboost

Last synced: 12 Jan 2026

https://github.com/microsoft/nnscaler

nnScaler: Compiling DNN models for Parallel Training

compiler deep-learning distributed-training llm machine-learning parallel-computing

Last synced: 05 Apr 2025

https://github.com/alibaba/tepdist

TePDist (TEnsor Program DISTributed) is an HLO-level automatic distributed system for DL models.

auto-parallelization compiler deep-learning disthlo distributed-computing distributed-systems distributed-training high-performance-computing machine-learning rhino

Last synced: 14 Oct 2025

https://github.com/alibaba/TePDist

TePDist (TEnsor Program DISTributed) is an HLO-level automatic distributed system for DL models.

auto-parallelization compiler deep-learning disthlo distributed-computing distributed-systems distributed-training high-performance-computing machine-learning rhino

Last synced: 04 Apr 2025

https://github.com/ai-hypercomputer/gpu-recipes

Recipes for reproducing training and serving benchmarks for large machine learning models using GPUs on Google Cloud.

benchmarks distributed-training google-cloud-platform gpu serving

Last synced: 25 Jun 2025

https://github.com/omerbsezer/fast-kubeflow

This repo covers Kubeflow Environment with LABs: Kubeflow GUI, Jupyter Notebooks on pods, Kubeflow Pipelines, Experiments, KALE, KATIB (AutoML: Hyperparameter Tuning), KFServe (Model Serving), Training Operators (Distributed Training), Projects, etc.

automl distributed-training jupyter-notebooks kale katib kubeflow kubeflow-component kubeflow-demo kubeflow-pipeline kubernetes training-operators

Last synced: 28 Apr 2025

https://github.com/adrianbzg/llm-distributed-finetune

Efficiently tune any LLM from HuggingFace using distributed training (multiple GPUs) and DeepSpeed. Uses Ray AIR to orchestrate the training on multiple AWS GPU instances.

aws deep-learning distributed-training falcon fine-tuning huggingface large-language-models natural-language-processing transformers

Last synced: 07 Oct 2025

https://github.com/aws-samples/aws-do-eks

Create, List, Update, Delete Amazon EKS clusters. Deploy and manage software on EKS. Run distributed model training and inference examples.

deployment distributed-training do-framework docker eks eksctl inference observability terraform

Last synced: 16 May 2025

https://github.com/AdrianBZG/LLM-distributed-finetune

Efficiently tune any LLM from HuggingFace using distributed training (multiple GPUs) and DeepSpeed. Uses Ray AIR to orchestrate the training on multiple AWS GPU instances.

aws deep-learning distributed-training falcon fine-tuning huggingface large-language-models natural-language-processing transformers

Last synced: 08 May 2025

https://github.com/saareliad/FTPipe

FTPipe and related pipeline model parallelism research.

deep-neural-networks distributed-training fine-tuning nlp pipeline-parallelism t5

Last synced: 13 Apr 2025

https://github.com/uw-mad-dash/shockwave

Code for "Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning" [NSDI '23]

cloud-computing cluster-scheduler deep-learning distributed-systems distributed-training machine-learning pytorch

Last synced: 27 Apr 2025

https://github.com/aws-samples/TensorFlow-in-SageMaker-workshop

Running your TensorFlow models in Amazon SageMaker

amazon-sagemaker distributed-training pipemode tensorflow

Last synced: 12 Apr 2025

https://github.com/4paradigm/openembedding

OpenEmbedding is an open source framework for Tensorflow distributed training acceleration.

distributed-training embedding-layers model-parallel parameter-server tensorflow tensorflow-training

Last synced: 27 Mar 2026

https://github.com/AshishKumar4/FlaxDiff

A simple, easy-to-understand library for diffusion models using Flax and Jax. Includes detailed notebooks on DDPM, DDIM, and EDM with simplified mathematical explanations. Made as part of my journey for learning and experimenting with generative AI.

ai-research attention ddim ddpm deep-learning diffusion diffusion-models distributed-training edm flax generative-ai image-generation image2image jax karras machine-learning score-based-generative-modeling stable-diffusion tensorflow unet

Last synced: 06 Sep 2025

https://github.com/raywan-110/adaqp

Adaptive Message Quantization and Parallelization for Distributed Full-graph GNN Training

distributed-training graph-neural-networks quantization

Last synced: 21 Jun 2025

https://github.com/sayakpaul/distributed-training-in-tensorflow-2-with-ai-platform

Contains code to demonstrate distributed training in TensorFlow 2 with AI Platform and custom Docker containers.

ai-platform distributed-training docker gcp gcr keras tensorflow2

Last synced: 07 Jul 2025

https://github.com/SLAMPAI/large-scale-pretraining-transfer

Code for reproducing the experiments on large-scale pre-training and transfer learning for the paper "Effect of large-scale pre-training on full and few-shot transfer learning for natural and medical images" (https://arxiv.org/abs/2106.00116)

big-transfer chest-x-ray14 chest-xray-images chexpert-dataset covidx-dataset deep-learning distributed-training few-shot-learning fine-tuning imagenet large-scale-learning medical-imaging mimic-cxr padchest-dataset pre-trained-model pre-training pytorch scaling-laws supercomputing transfer-learning

Last synced: 08 May 2025

https://github.com/18520339/ml-distributed-training

Reduce the training time of CNNs by leveraging the power of multiple GPUs with 2 approaches, Multi-Worker and Parameter Server Training, using TensorFlow 2

distributed distributed-tensorflow distributed-training multi-gpu multi-workers parameter-server tensorflow

Last synced: 27 Feb 2026

https://github.com/daekeun-ml/sm-distributed-training-step-by-step

This repository provides hands-on labs on PyTorch-based Distributed Training and SageMaker Distributed Training. It is written to make it easy for beginners to get started, and guides you through step-by-step modifications to the code based on the most basic BERT use cases.

data-parallelism distributed-training pytorch-ddp sagemaker

Last synced: 12 Oct 2025

https://github.com/zh320/medical-segmentation-pytorch

PyTorch implementation of medical semantic segmentation models, e.g. UNet, UNet++, DUCKNet, ResUNet, ResUNet++, with support for knowledge distillation, distributed training, Optuna, etc.

distributed-training knowledge-distillation medical-image-segmentation optuna polyp-segmentation pytorch resunet unet unetplusplus

Last synced: 12 Apr 2025

https://github.com/bryanlimy/tf2-cyclegan

TensorFlow 2 implementation of CycleGAN with multi-GPU training.

cyclegan distributed-training gan mirroredstrategy multi-gpus tensorflow tensorflow2 tf2

Last synced: 08 May 2025

https://github.com/shenggan/atp

Adaptive Tensor Parallelism for Foundation Models

attention distributed-training gpt large-model model-parallelism pytorch transformer

Last synced: 19 Aug 2025

https://github.com/rosinality/meshfn

Framework for Human Alignment Learning

alignment distributed-training large-language-models

Last synced: 28 Apr 2025

https://github.com/saforem2/mmm

Multi-Modal Modeling

distributed-training llm multi-modal pytorch

Last synced: 13 Aug 2025

https://github.com/asprenger/distributed-training-patterns

Experiments with low level communication patterns that are useful for distributed training.

distributed-training horovod mpi mpi4py nccl tensorflow

Last synced: 15 Jul 2025

https://github.com/zerfoo/zerfoo

Pure Go machine learning framework. Train, run, and serve ML models with go build. Zero CGo.

autodiff deep-learning distributed-training float16 float8 fp16 fp8 go golang graph-ml machine-learning ml-framework neural-network onnx transformer

Last synced: 13 Jun 2026

https://github.com/alex-snd/trecover

📜 A python library for distributed training of a Transformer neural network across the Internet to solve the Running Key Cipher, widely known in the field of Cryptography.

celery cryptography deep-learning distributed-systems distributed-training fastapi hivemind keyless-reading llm machine-learning mkdocs neural-network nlp python pytorch pytorch-lightning streamlit text-recovery transformers volunteer-computing

Last synced: 08 Oct 2025

https://github.com/ler0ever/hpgo

Development of Project HPGO | Hybrid Parallelism Global Orchestration

data-parallelism distributed-training gpipe machine-learning model-parallelism pipedream pipeline-parallelism pytorch rust tensorflow

Last synced: 15 Jul 2025

https://github.com/tolgatasci/ai-farm

AI-Farm is a distributed deep learning training framework that enables efficient model training across multiple machines. It provides a scalable infrastructure with real-time monitoring through a web admin panel, adaptive task distribution, and support for both CPU and GPU training.

deep-learning distributed-training federated-learning gpu-training machine-learning python pytorch websockets

Last synced: 13 Apr 2026

https://github.com/bjornmelin/deep-learning-evolution

🧠 Deep-Learning Evolution: Unified collection of TensorFlow & PyTorch projects, featuring custom CUDA kernels, distributed training, memory‑efficient methods, and production‑ready pipelines. Showcases advanced GPU optimizations, from foundational models to cutting‑edge architectures. 🚀

ai-research cuda data-science deep-learning distributed-training gan gpu-acceleration machine-learning model-optimization neural-networks python pytorch tensorflow training-pipeline transformers

Last synced: 09 May 2026

https://github.com/satvikpraveen/lightningmasterpro

Comprehensive PyTorch Lightning framework featuring 20+ educational notebooks, advanced ML patterns, and production-ready workflows. Covers vision, NLP, tabular, and time series domains with distributed training, mixed precision, custom loops, and deployment pipelines. Complete with synthetic data generators and testing.

artificial-intelligence computer-vision data-science deep-learning distributed-training gradient-accumulation machine-learning mixed-precision mlops model-deployment model-training natural-language-processing neural-networks onnx-export python pytorch pytorch-lightning tabular-data time-series torchscript

Last synced: 06 May 2026

https://github.com/cmontemuino/amd-mi300x-ml-benchmarks

Comprehensive machine learning benchmarking framework for AMD MI300X GPUs on Dell PowerEdge XE9680 hardware. Supports both inference (vLLM) and training workloads with containerized test suites, hardware monitoring, and analysis tools for performance, power efficiency, and scalability research across the complete ML pipeline.

ai-infrastructure amd-gpu amd-mi300x containerized-testing deep-learning dell-poweredge distributed-training gpu-computing gpu-monitoring gpu-parallelism gpu-performance hardware-monitoring llm-benchmarking machine-learning performance-analysis power-efficiency pytorch rocm scalability-testing vllm

Last synced: 05 Feb 2026

https://github.com/stefanofioravanzo/dl-operator

General purpose Kubernetes operator for DL frameworks written in Python

deep-learning distributed-training kubernetes kubernetes-python-client operator

Last synced: 18 May 2026