Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Projects in Awesome Lists tagged with distributed-training
A curated list of projects in awesome lists tagged with distributed-training.
https://github.com/gokumohandas/made-with-ml
Learn how to design, develop, deploy and iterate on production-grade ML applications.
data-engineering data-quality data-science deep-learning distributed-ml distributed-training llms machine-learning mlops natural-language-processing python pytorch ray
Last synced: 29 Sep 2024
https://github.com/GokuMohandas/Made-With-ML
Learn how to design, develop, deploy and iterate on production-grade ML applications.
data-engineering data-quality data-science deep-learning distributed-ml distributed-training llms machine-learning mlops natural-language-processing python pytorch ray
Last synced: 31 Jul 2024
https://github.com/huggingface/pytorch-image-models
The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more
augmix convnext distributed-training dual-path-networks efficientnet image-classification imagenet maxvit mixnet mobile-deep-learning mobilenet-v2 mobilenetv3 nfnets normalization-free-training pretrained-models pretrained-weights pytorch randaugment resnet vision-transformer-models
Last synced: 29 Sep 2024
https://github.com/rwightman/pytorch-image-models
The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more
augmix convnext distributed-training dual-path-networks efficientnet image-classification imagenet maxvit mixnet mobile-deep-learning mobilenet-v2 mobilenetv3 nfnets normalization-free-training pretrained-models pretrained-weights pytorch randaugment resnet vision-transformer-models
Last synced: 05 Sep 2024
https://github.com/paddlepaddle/paddle
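PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (PaddlePaddle core framework: high-performance single-machine and distributed training, and cross-platform deployment, for deep learning & machine learning)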
deep-learning distributed-training efficiency machine-learning neural-network paddlepaddle python scalability
Last synced: 29 Sep 2024
https://github.com/PaddlePaddle/Paddle
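PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (PaddlePaddle core framework: high-performance single-machine and distributed training, and cross-platform deployment, for deep learning & machine learning)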
deep-learning distributed-training efficiency machine-learning neural-network paddlepaddle python scalability
Last synced: 31 Jul 2024
https://github.com/paddlepaddle/paddlenlp
👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting a wide range of NLP tasks from research to industrial applications, including 🗂 Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis, etc.
bert compression distributed-training document-intelligence embedding ernie information-extraction llama llm neural-search nlp paddlenlp pretrained-models question-answering search-engine semantic-analysis sentiment-analysis transformers uie
Last synced: 29 Sep 2024
https://github.com/PaddlePaddle/PaddleNLP
👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting a wide range of NLP tasks from research to industrial applications, including 🗂 Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis, etc.
bert compression distributed-training document-intelligence embedding ernie information-extraction llama llm neural-search nlp paddlenlp pretrained-models question-answering search-engine semantic-analysis sentiment-analysis transformers uie
Last synced: 31 Jul 2024
https://github.com/skypilot-org/skypilot
SkyPilot: Run LLMs, AI, and Batch jobs on any cloud. Get maximum savings, highest GPU availability, and managed execution—all with a simple interface.
cloud-computing cloud-management cost-management cost-optimization data-science deep-learning distributed-training finops gpu hyperparameter-tuning job-queue job-scheduler llm-serving llm-training machine-learning ml-infrastructure ml-platform multicloud spot-instances tpu
Last synced: 29 Sep 2024
https://github.com/FedML-AI/FedML
FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI jobs on any GPU cloud or on-premise cluster. Built on this library, FEDML Nexus AI (https://fedml.ai) is your generative AI platform at scale.
ai-agent deep-learning distributed-training edge-ai federated-learning inference-engine machine-learning mlops model-deployment model-serving on-device-training
Last synced: 01 Aug 2024
https://github.com/fedml-ai/fedml
FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI jobs on any GPU cloud or on-premise cluster. Built on this library, FEDML Nexus AI (https://fedml.ai) is your generative AI platform at scale.
ai-agent deep-learning distributed-training edge-ai federated-learning inference-engine machine-learning mlops model-deployment model-serving on-device-training
Last synced: 30 Sep 2024
https://github.com/idea-ccnl/fengshenbang-lm
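Fengshenbang-LM (封神榜大模型) is an open-source large-model ecosystem led by the Cognitive Computing and Natural Language Research Center of the IDEA Research Institute, intended to serve as infrastructure for Chinese AIGC and cognitive intelligence.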
aigc chinese-nlp distributed-training multimodal pretrained-models pytorch transformers
Last synced: 30 Sep 2024
https://github.com/IDEA-CCNL/Fengshenbang-LM
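Fengshenbang-LM (封神榜大模型) is an open-source large-model ecosystem led by the Cognitive Computing and Natural Language Research Center of the IDEA Research Institute, intended to serve as infrastructure for Chinese AIGC and cognitive intelligence.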
aigc chinese-nlp distributed-training multimodal pretrained-models pytorch transformers
Last synced: 31 Jul 2024
https://github.com/bytedance/byteps
A high performance and generic framework for distributed DNN training
deep-learning distributed-training keras machine-learning mxnet pytorch tensorflow
Last synced: 29 Sep 2024
https://github.com/tensorflow/adanet
Fast and flexible AutoML with learning guarantees.
automl deep-learning distributed-training ensemble gpu learning-theory machine-learning neural-architecture-search python tensorflow tpu
Last synced: 29 Sep 2024
https://github.com/alpa-projects/alpa
Training and serving large-scale neural networks with auto parallelization.
alpa auto-parallelization compiler deep-learning distributed-computing distributed-training high-performance-computing jax llm machine-learning
Last synced: 30 Sep 2024
https://github.com/determined-ai/determined
Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.
data-science deep-learning distributed-training hyperparameter-optimization hyperparameter-search hyperparameter-tuning keras kubernetes machine-learning ml-infrastructure ml-platform mlops pytorch tensorflow
Last synced: 29 Sep 2024
https://github.com/learning-at-home/hivemind
Decentralized deep learning in PyTorch. Built to train models on thousands of volunteers across the world.
asynchronous-programming asyncio deep-learning dht distributed-systems distributed-training hivemind machine-learning mixture-of-experts neural-networks pytorch volunteer-computing
Last synced: 30 Sep 2024
https://github.com/tensorlayer/HyperPose
Library for Fast and Flexible Human Pose Estimation
computer-vision distributed-training mobilenet neural-networks openpose pose-estimation tensorflow tensorlayer tensorrt
Last synced: 01 Aug 2024
https://github.com/tensorlayer/hyperpose
Library for Fast and Flexible Human Pose Estimation
computer-vision distributed-training mobilenet neural-networks openpose pose-estimation tensorflow tensorlayer tensorrt
Last synced: 30 Sep 2024
https://github.com/intelligent-machine-learning/dlrover
DLRover: An Automatic Distributed Deep Learning System
distributed-training k8s llm-training
Last synced: 01 Oct 2024
https://github.com/deeprec-ai/deeprec
DeepRec is a high-performance recommendation deep learning framework based on TensorFlow. It is hosted in incubation in LF AI & Data Foundation.
advertising deep-learning distributed-training machine-learning python recommendation-engine scalability search-engine
Last synced: 30 Sep 2024
https://github.com/DeepRec-AI/DeepRec
DeepRec is a high-performance recommendation deep learning framework based on TensorFlow. It is hosted in incubation in LF AI & Data Foundation.
advertising deep-learning distributed-training machine-learning python recommendation-engine scalability search-engine
Last synced: 01 Aug 2024
https://github.com/mryab/efficient-dl-systems
Efficient Deep Learning Systems course materials (HSE, YSDA)
cuda deep-learning distributed-training efficient-deep-learning machine-learning ml-infrastructure mlops pytorch
Last synced: 03 Oct 2024
https://github.com/alibaba/Megatron-LLaMA
Best practice for training LLaMA models in Megatron-LM
deepspeed distributed-training llama llm megatron-lm pretraining pytorch
Last synced: 31 Jul 2024
https://github.com/petuum/adaptdl
Resource-adaptive cluster scheduler for deep learning training.
aws cloud deep-learning distributed-systems distributed-training kubernetes machine-learning pytorch
Last synced: 03 Aug 2024
https://github.com/Oneflow-Inc/libai
LiBai(李白): A Toolbox for Large-Scale Distributed Parallel Training
data-parallelism deep-learning distributed-training large-scale model-parallelism nlp oneflow pipeline-parallelism self-supervised-learning transformer vision-transformer
Last synced: 03 Aug 2024
https://github.com/pytorch/torchx
TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and support for E2E production ML pipelines when you're ready.
airflow aws-batch components deep-learning distributed-training kubernetes machine-learning pipelines python pytorch ray slurm
Last synced: 29 Sep 2024
https://github.com/DataCanvasIO/HyperGBM
A full pipeline AutoML tool for tabular data
adversarial-validation automl catboost dask dask-distributed datacleaning distributed-training ensemble-learning fullpipeline gbm gpu-acceleration lightgbm preprocessing pseudo-labeling rapidsai semi-supervised-learning sklearn tabular-data xgboost
Last synced: 03 Aug 2024
https://github.com/datacanvasio/hypergbm
A full pipeline AutoML tool for tabular data
adversarial-validation automl catboost dask dask-distributed datacleaning distributed-training ensemble-learning fullpipeline gbm gpu-acceleration lightgbm preprocessing pseudo-labeling rapidsai semi-supervised-learning sklearn tabular-data xgboost
Last synced: 26 Sep 2024
https://github.com/maudzung/YOLO3D-YOLOv4-PyTorch
YOLO3D: End-to-end real-time 3D Oriented Object Bounding Box Detection from LiDAR Point Cloud (ECCV 2018)
3d-object-detection darknet distributed-training object-detection point-cloud real-time rotated-boxes-iou yolo3d yolov4
Last synced: 31 Jul 2024
https://github.com/DeNA/HandyRL
HandyRL is a handy and simple framework based on Python and PyTorch for distributed reinforcement learning that is applicable to your own environments.
deep-learning distributed-training games machine-learning policy-gradient pytorch reinforcement-learning
Last synced: 01 Aug 2024
https://github.com/hmunachi/nanodl
A Jax-based library for designing and training transformer models from scratch.
attention attention-mechanism deep-learning distributed-training flax gpt jax llama machine-learning mistral nlp transformer
Last synced: 27 Sep 2024
https://github.com/alibaba/EasyParallelLibrary
Easy Parallel Library (EPL) is a general and efficient deep learning framework for distributed model training.
data-parallelism deep-learning distributed-training gpu memory-efficient model-parallelism pipeline-parallelism
Last synced: 01 Aug 2024
https://github.com/PKU-DAIR/Hetu
A high-performance distributed deep learning system targeting large-scale and automated distributed training.
artificial-intelligence autograd data-science deep-learning deep-neural-networks distributed-systems distributed-training embeddings gpu high-dimensional machine-learning python state-of-the-art
Last synced: 31 Jul 2024
https://github.com/alibaba/TePDist
TePDist (TEnsor Program DISTributed) is an HLO-level automatic distributed system for DL models.
auto-parallelization compiler deep-learning disthlo distributed-computing distributed-systems distributed-training high-performance-computing machine-learning rhino
Last synced: 01 Aug 2024
https://github.com/andreped/gradientaccumulator
🎯 Accumulated Gradients for TensorFlow 2
accumulated-batch-normalization accumulated-gradients adaptive-gradient-clipping batch-size deep-learning distributed-training float16 gpu gradient-accumulation hacktoberfest huggingface keras memory-constraints mixed-precision multi-gpu tensorflow tensorflow2 tf2 tpu
Last synced: 30 Sep 2024
https://github.com/AdrianBZG/LLM-distributed-finetune
Efficiently fine-tune any LLM from HuggingFace using distributed training (multiple GPUs) and DeepSpeed. Uses Ray AIR to orchestrate the training on multiple AWS GPU instances.
aws deep-learning distributed-training falcon fine-tuning huggingface large-language-models natural-language-processing transformers
Last synced: 03 Aug 2024
https://github.com/saareliad/FTPipe
FTPipe and related pipeline model parallelism research.
deep-neural-networks distributed-training fine-tuning nlp pipeline-parallelism t5
Last synced: 01 Aug 2024
https://github.com/uw-mad-dash/shockwave
Code for "Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning" [NSDI '23]
cloud-computing cluster-scheduler deep-learning distributed-systems distributed-training machine-learning pytorch
Last synced: 02 Aug 2024
https://github.com/aws-samples/TensorFlow-in-SageMaker-workshop
Running your TensorFlow models in Amazon SageMaker
amazon-sagemaker distributed-training pipemode tensorflow
Last synced: 01 Aug 2024
https://github.com/4paradigm/openembedding
OpenEmbedding is an open-source framework for TensorFlow distributed training acceleration.
distributed-training embedding-layers model-parallel parameter-server tensorflow tensorflow-training
Last synced: 02 Oct 2024
https://github.com/SLAMPAI/large-scale-pretraining-transfer
Code for reproducing the experiments on large-scale pre-training and transfer learning for the paper "Effect of large-scale pre-training on full and few-shot transfer learning for natural and medical images" (https://arxiv.org/abs/2106.00116)
big-transfer chest-x-ray14 chest-xray-images chexpert-dataset covidx-dataset deep-learning distributed-training few-shot-learning fine-tuning imagenet large-scale-learning medical-imaging mimic-cxr padchest-dataset pre-trained-model pre-training pytorch scaling-laws supercomputing transfer-learning
Last synced: 03 Aug 2024
https://github.com/asprenger/distributed-training-patterns
Experiments with low-level communication patterns that are useful for distributed training.
distributed-training horovod mpi mpi4py nccl tensorflow
Last synced: 05 Aug 2024
https://github.com/alex-snd/trecover
📜 A Python library for distributed training of a Transformer neural network across the Internet to solve the Running Key Cipher, widely known in the field of cryptography.
celery cryptography deep-learning distributed-systems distributed-training fastapi hivemind keyless-reading llm machine-learning mkdocs neural-network nlp python pytorch pytorch-lightning streamlit text-recovery transformers volunteer-computing
Last synced: 27 Sep 2024
https://github.com/hunterdii/tensorflow-advanced-techniques-solution
TensorFlow Advanced Techniques Specialization - Solution
computer-vision coursera coursera-specialization custom-model custom-training deep-learning deeplearning-ai distributed-training generative-ai image-detection image-segmentation-tensorflow machine-learning object-detection object-detection-model semantic-segmentation specialization tensorflow tensorflow-tutorials visualization
Last synced: 26 Sep 2024