Projects in Awesome Lists tagged with distributed-training

https://github.com/gokumohandas/made-with-ml

Learn how to design, develop, deploy and iterate on production-grade ML applications.

data-engineering data-quality data-science deep-learning distributed-ml distributed-training llms machine-learning mlops natural-language-processing python pytorch ray

Last synced: 30 Dec 2024

https://github.com/huggingface/pytorch-image-models

The largest collection of PyTorch image encoders / backbones, including train, eval, inference, and export scripts and pretrained weights: ResNet, ResNeXt, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more

augmix convnext distributed-training dual-path-networks efficientnet image-classification imagenet maxvit mixnet mobile-deep-learning mobilenet-v2 mobilenetv3 nfnets normalization-free-training pretrained-models pretrained-weights pytorch randaugment resnet vision-transformer-models

Last synced: 30 Dec 2024

https://github.com/paddlepaddle/paddle

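PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (core framework of PaddlePaddle 『飞桨』: high-performance single-machine and distributed training, plus cross-platform deployment, for deep learning and machine learning)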

deep-learning distributed-training efficiency machine-learning neural-network paddlepaddle python scalability

Last synced: 30 Dec 2024

https://github.com/paddlepaddle/paddlenlp

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting a wide range of NLP tasks from research to industrial applications, including 🗂 Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis, etc.

bert compression distributed-training document-intelligence embedding ernie information-extraction llama llm neural-search nlp paddlenlp pretrained-models question-answering search-engine semantic-analysis sentiment-analysis transformers uie

Last synced: 30 Dec 2024

https://github.com/skypilot-org/skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.

cloud-computing cloud-management cost-management cost-optimization data-science deep-learning distributed-training finops gpu hyperparameter-tuning job-queue job-scheduler llm-serving llm-training machine-learning ml-infrastructure ml-platform multicloud spot-instances tpu

Last synced: 30 Dec 2024

https://github.com/FedML-AI/FedML

FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI jobs on any GPU cloud or on-premise cluster. Built on this library, TensorOpera AI (https://TensorOpera.ai) is your generative AI platform at scale.

ai-agent deep-learning distributed-training edge-ai federated-learning inference-engine machine-learning mlops model-deployment model-serving on-device-training

Last synced: 05 Nov 2024

https://github.com/idea-ccnl/fengshenbang-lm

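Fengshenbang-LM (封神榜大模型) is an open-source system of large models led by the Cognitive Computing and Natural Language Research Center of the IDEA Research Institute, intended to serve as infrastructure for Chinese AIGC and cognitive intelligence.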

aigc chinese-nlp distributed-training multimodal pretrained-models pytorch transformers

Last synced: 24 Dec 2024

https://github.com/bytedance/byteps

A high performance and generic framework for distributed DNN training

deep-learning distributed-training keras machine-learning mxnet pytorch tensorflow

Last synced: 26 Dec 2024

https://github.com/tensorflow/adanet

Fast and flexible AutoML with learning guarantees.

automl deep-learning distributed-training ensemble gpu learning-theory machine-learning neural-architecture-search python tensorflow tpu

Last synced: 25 Dec 2024

https://github.com/determined-ai/determined

Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.

data-science deep-learning distributed-training hyperparameter-optimization hyperparameter-search hyperparameter-tuning keras kubernetes machine-learning ml-infrastructure ml-platform mlops pytorch tensorflow

Last synced: 30 Dec 2024

https://github.com/alpa-projects/alpa

Training and serving large-scale neural networks with auto parallelization.

alpa auto-parallelization compiler deep-learning distributed-computing distributed-training high-performance-computing jax llm machine-learning

Last synced: 14 Oct 2024

https://github.com/learning-at-home/hivemind

Decentralized deep learning in PyTorch. Built to train models on thousands of volunteers across the world.

asynchronous-programming asyncio deep-learning dht distributed-systems distributed-training hivemind machine-learning mixture-of-experts neural-networks pytorch volunteer-computing

Last synced: 24 Dec 2024

https://github.com/intelligent-machine-learning/dlrover

DLRover: An Automatic Distributed Deep Learning System

distributed-training hacktoberfest k8s llm-training

Last synced: 25 Dec 2024

https://github.com/tensorlayer/hyperpose

Library for Fast and Flexible Human Pose Estimation

computer-vision distributed-training mobilenet neural-networks openpose pose-estimation tensorflow tensorlayer tensorrt

Last synced: 27 Dec 2024

https://github.com/deeprec-ai/deeprec

DeepRec is a high-performance recommendation deep learning framework based on TensorFlow. It is hosted in incubation in LF AI & Data Foundation.

advertising deep-learning distributed-training machine-learning python recommendation-engine scalability search-engine

Last synced: 27 Dec 2024

https://github.com/mryab/efficient-dl-systems

Efficient Deep Learning Systems course materials (HSE, YSDA)

cuda deep-learning distributed-training efficient-deep-learning machine-learning ml-infrastructure mlops pytorch

Last synced: 29 Dec 2024

https://github.com/alibaba/Megatron-LLaMA

Best practice for training LLaMA models in Megatron-LM

deepspeed distributed-training llama llm megatron-lm pretraining pytorch

Last synced: 30 Oct 2024

https://github.com/Guitaricet/relora

Official code for ReLoRA from the paper Stack More Layers Differently: High-Rank Training Through Low-Rank Updates

deep-learning distributed-training llama nlp peft transformer

Last synced: 29 Nov 2024

https://github.com/petuum/adaptdl

Resource-adaptive cluster scheduler for deep learning training.

aws cloud deep-learning distributed-systems distributed-training kubernetes machine-learning pytorch

Last synced: 16 Nov 2024

https://github.com/oneflow-inc/libai

LiBai(李白): A Toolbox for Large-Scale Distributed Parallel Training

data-parallelism deep-learning distributed-training large-scale model-parallelism nlp oneflow pipeline-parallelism self-supervised-learning transformer vision-transformer

Last synced: 29 Dec 2024

https://github.com/datacanvasio/hypergbm

A full pipeline AutoML tool for tabular data

adversarial-validation automl catboost dask dask-distributed datacleaning distributed-training ensemble-learning fullpipeline gbm gpu-acceleration lightgbm preprocessing pseudo-labeling rapidsai semi-supervised-learning sklearn tabular-data xgboost

Last synced: 27 Dec 2024

https://github.com/pytorch/torchx

TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and support for E2E production ML pipelines when you're ready.

airflow aws-batch components deep-learning distributed-training kubernetes machine-learning pipelines python pytorch ray slurm

Last synced: 26 Dec 2024

https://github.com/lambdalabsml/distributed-training-guide

Best practices & guides on how to write distributed PyTorch training code

cluster cuda deepspeed distributed-training fsdp gpu gpu-cluster kuberentes lambdalabs mpi nccl pytorch sharding slurm

Last synced: 30 Dec 2024

https://github.com/maudzung/yolo3d-yolov4-pytorch

YOLO3D: End-to-end real-time 3D Oriented Object Bounding Box Detection from LiDAR Point Cloud (ECCV 2018)

3d-object-detection darknet distributed-training object-detection point-cloud real-time rotated-boxes-iou yolo3d yolov4

Last synced: 25 Dec 2024

https://github.com/dena/handyrl

HandyRL is a handy and simple framework based on Python and PyTorch for distributed reinforcement learning that is applicable to your own environments.

deep-learning distributed-training games machine-learning policy-gradient pytorch reinforcement-learning

Last synced: 30 Dec 2024

https://github.com/hmunachi/nanodl

A Jax-based library for designing and training transformer models from scratch.

attention attention-mechanism deep-learning distributed-training flax gpt jax llama machine-learning mistral nlp transformer

Last synced: 28 Dec 2024

https://github.com/alibaba/EasyParallelLibrary

Easy Parallel Library (EPL) is a general and efficient deep learning framework for distributed model training.

data-parallelism deep-learning distributed-training gpu memory-efficient model-parallelism pipeline-parallelism

Last synced: 05 Nov 2024

https://github.com/PKU-DAIR/Hetu

A high-performance distributed deep learning system targeting large-scale and automated distributed training.

artificial-intelligence autograd data-science deep-learning deep-neural-networks distributed-systems distributed-training embeddings gpu high-dimensional machine-learning python state-of-the-art

Last synced: 28 Oct 2024

https://github.com/paddlepaddle/plsc

Paddle Large Scale Classification Tools, supports ArcFace, CosFace, PartialFC, Data Parallel + Model Parallel. Models include ResNet, ViT, Swin, DeiT, CaiT, FaceViT, MoCo, MAE, ConvMAE, CAE.

arcface cait convmae cosface data-parallel deit distributed-training face-recognition facevit hight-speed large-scale mae moco-v3 model-parallel paddle paddlepaddle partial-fc resnet swin-transformer vit

Last synced: 24 Dec 2024

https://github.com/zh320/realtime-semantic-segmentation-pytorch

PyTorch implementation of over 30 real-time semantic segmentation models, e.g. BiSeNetv1, BiSeNetv2, CGNet, ContextNet, DABNet, DDRNet, EDANet, ENet, ERFNet, ESPNet, ESPNetv2, FastSCNN, ICNet, LEDNet, LinkNet, PP-LiteSeg, SegNet, ShelfNet, STDC, SwiftNet, with support for knowledge distillation, distributed training, Optuna, etc.

cityscapes distributed-training enet knowledge-distillation optuna pytorch real-time semantic-segmentation

Last synced: 25 Dec 2024

https://github.com/aws/sagemaker-xgboost-container

This is the Docker container based on the open source framework XGBoost (https://xgboost.readthedocs.io/en/latest/) that allows customers to use their own XGBoost scripts in SageMaker.

aws distributed-training gbm inference machine-learning python sagemaker training xgboost

Last synced: 28 Dec 2024

https://github.com/alibaba/TePDist

TePDist (TEnsor Program DISTributed) is an HLO-level automatic distributed system for DL models.

auto-parallelization compiler deep-learning disthlo distributed-computing distributed-systems distributed-training high-performance-computing machine-learning rhino

Last synced: 05 Nov 2024

https://github.com/microsoft/nnscaler

nnScaler: Compiling DNN models for Parallel Training

compiler deep-learning distributed-training llm machine-learning parallel-computing

Last synced: 24 Dec 2024

https://github.com/omerbsezer/fast-kubeflow

This repo covers Kubeflow Environment with LABs: Kubeflow GUI, Jupyter Notebooks on pods, Kubeflow Pipelines, Experiments, KALE, KATIB (AutoML: Hyperparameter Tuning), KFServe (Model Serving), Training Operators (Distributed Training), Projects, etc.

automl distributed-training jupyter-notebooks kale katib kubeflow kubeflow-component kubeflow-demo kubeflow-pipeline kubernetes training-operators

Last synced: 11 Nov 2024

https://github.com/bryanyzhu/video-tutorial-cvpr2020

A Comprehensive Tutorial on Video Modeling

distributed-training gluoncv human-action-recognition mxnet video-classification

Last synced: 28 Oct 2024

https://github.com/tanyuqian/redco

NAACL '24 (Best Demo Paper Runner-Up) / MLSys @ NeurIPS '23 - RedCoast: A Lightweight Tool to Automate Distributed Training and Inference

differential-privacy diffusion-models distributed-training fedavg federated-learning flan-t5-xxl gemma image-captioning jax large-language-models llama maml meta-learning mixed-precision mlsys model-parallelism ppo reinforcement-learning seq2seq stable-diffusion

Last synced: 30 Dec 2024

https://github.com/notedance/note

Machine learning library, Distributed training, Deep learning, Reinforcement learning, Models, TensorFlow, PyTorch

artificial-intelligence deep-learning deep-reinforcement-learning deeplearning deepreinforcementlearning distributed-training dl drl machine-learning machine-learning-library machinelearning ml neural-network neuralnetwork parallel-training pytorch reinforcement-learning reinforcementlearning rl tensorflow

Last synced: 19 Dec 2024

https://github.com/pinpoint-apm/pinpoint-node-agent

Pinpoint Node.js agent

agent apm distributed-training monitoring node performance pinpoint

Last synced: 29 Dec 2024

https://github.com/andreped/gradientaccumulator

🎯 Accumulated Gradients for TensorFlow 2

accumulated-batch-normalization accumulated-gradients adaptive-gradient-clipping batch-size deep-learning distributed-training float16 gpu gradient-accumulation hacktoberfest huggingface keras memory-constraints mixed-precision multi-gpu tensorflow tensorflow2 tf2 tpu

Last synced: 07 Nov 2024

https://github.com/adrianbzg/llm-distributed-finetune

Efficiently tune any LLM from HuggingFace using distributed training (multiple GPUs) and DeepSpeed. Uses Ray AIR to orchestrate the training on multiple AWS GPU instances.

aws deep-learning distributed-training falcon fine-tuning huggingface large-language-models natural-language-processing transformers

Last synced: 05 Nov 2024

https://github.com/saareliad/FTPipe

FTPipe and related pipeline model parallelism research.

deep-neural-networks distributed-training fine-tuning nlp pipeline-parallelism t5

Last synced: 07 Nov 2024

https://github.com/uw-mad-dash/shockwave

Code for "Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning" [NSDI '23]

cloud-computing cluster-scheduler deep-learning distributed-systems distributed-training machine-learning pytorch

Last synced: 11 Nov 2024

https://github.com/aws-samples/TensorFlow-in-SageMaker-workshop

Running your TensorFlow models in Amazon SageMaker

amazon-sagemaker distributed-training pipemode tensorflow

Last synced: 07 Nov 2024

https://github.com/4paradigm/openembedding

OpenEmbedding is an open source framework for accelerating TensorFlow distributed training.

distributed-training embedding-layers model-parallel parameter-server tensorflow tensorflow-training

Last synced: 20 Oct 2024

https://github.com/graykode/horovod-ansible

Create Horovod cluster easily using Ansible

ansible deeplearning distributed-training horovod openmpi pytorch tensorflow terraform

Last synced: 23 Oct 2024

https://github.com/hkproj/pytorch-transformer-distributed

Distributed training (multi-node) of a Transformer model

collective-communication data-parallelism deep-learning distributed-data-parallel distributed-training gradient-accumulation machine-learning model-parallelism pytorch tutorial

Last synced: 17 Nov 2024

https://github.com/taishan1994/pytorch-distributed-nlp

PyTorch distributed training

bert distributed-training pytorch text-classification

Last synced: 05 Nov 2024

https://github.com/sayakpaul/distributed-training-in-tensorflow-2-with-ai-platform

Contains code to demonstrate distributed training in TensorFlow 2 with AI Platform and custom Docker containers.

ai-platform distributed-training docker gcp gcr keras tensorflow2

Last synced: 28 Dec 2024

https://github.com/SLAMPAI/large-scale-pretraining-transfer

Code for reproducing the experiments on large-scale pre-training and transfer learning for the paper "Effect of large-scale pre-training on full and few-shot transfer learning for natural and medical images" (https://arxiv.org/abs/2106.00116)

big-transfer chest-x-ray14 chest-xray-images chexpert-dataset covidx-dataset deep-learning distributed-training few-shot-learning fine-tuning imagenet large-scale-learning medical-imaging mimic-cxr padchest-dataset pre-trained-model pre-training pytorch scaling-laws supercomputing transfer-learning

Last synced: 15 Nov 2024

https://github.com/determined-ai/determined-examples

Example ML projects that use the Determined library.

deep-learning distributed-training hyperparameter-tuning keras machine-learning ml-infrastructure pytorch tensorflow

Last synced: 17 Nov 2024

https://github.com/saforem2/ezpz

Train across all your devices, ezpz 🍋

distributed-training launcher machine-learning python rich

Last synced: 29 Nov 2024

https://github.com/18520339/ml-distributed-training

Distributed training with Multi-worker & Parameter Server in TensorFlow 2

distributed distributed-tensorflow distributed-training multi-gpu multi-workers parameter-server tensorflow

Last synced: 11 Dec 2024

https://github.com/ai-hypercomputer/gpu-recipes

Recipes for reproducing training and serving benchmarks for large machine learning models using GPUs on Google Cloud.

benchmarks distributed-training google-cloud-platform gpu serving

Last synced: 08 Nov 2024

https://github.com/bryanlimy/tf2-cyclegan

TensorFlow 2 implementation of CycleGAN with multi-GPU training.

cyclegan distributed-training gan mirroredstrategy multi-gpus tensorflow tensorflow2 tf2

Last synced: 23 Oct 2024

https://github.com/shenggan/atp

Adaptive Tensor Parallelism for Foundation Models

attention distributed-training gpt large-model model-parallelism pytorch transformer

Last synced: 18 Dec 2024

https://github.com/asprenger/distributed-training-patterns

Experiments with low level communication patterns that are useful for distributed training.

distributed-training horovod mpi mpi4py nccl tensorflow

Last synced: 23 Nov 2024

https://github.com/hunterdii/tensorflow-advanced-techniques-solution

TensorFlow Advanced Techniques Specialization - Solution

computer-vision coursera coursera-specialization custom-model custom-training deep-learning deeplearning-ai distributed-training generative-ai image-detection image-segmentation-tensorflow machine-learning object-detection object-detection-model semantic-segmentation specialization tensorflow tensorflow-tutorials visualization

Last synced: 10 Oct 2024

https://github.com/alex-snd/trecover

📜 A python library for distributed training of a Transformer neural network across the Internet to solve the Running Key Cipher, widely known in the field of Cryptography.

celery cryptography deep-learning distributed-systems distributed-training fastapi hivemind keyless-reading llm machine-learning mkdocs neural-network nlp python pytorch pytorch-lightning streamlit text-recovery transformers volunteer-computing

Last synced: 10 Oct 2024

https://github.com/ler0ever/hpgo

Development of Project HPGO | Hybrid Parallelism Global Orchestration

data-parallelism distributed-training gpipe machine-learning model-parallelism pipedream pipeline-parallelism pytorch rust tensorflow

Last synced: 29 Oct 2024

https://github.com/zh320/medical-segmentation-pytorch

PyTorch implementation of medical semantic segmentation models, e.g. UNet, DUCKNet, with support for knowledge distillation, distributed training, Optuna, etc.

distributed-training knowledge-distillation medical-image-segmentation optuna polyp-segmentation pytorch unet