Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with distributed-training

A curated list of projects in awesome lists tagged with distributed-training.

https://github.com/huggingface/pytorch-image-models

The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more

augmix convnext distributed-training dual-path-networks efficientnet image-classification imagenet maxvit mixnet mobile-deep-learning mobilenet-v2 mobilenetv3 nfnets normalization-free-training pretrained-models pretrained-weights pytorch randaugment resnet vision-transformer-models

Last synced: 29 Sep 2024

https://github.com/rwightman/pytorch-image-models

The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more

augmix convnext distributed-training dual-path-networks efficientnet image-classification imagenet maxvit mixnet mobile-deep-learning mobilenet-v2 mobilenetv3 nfnets normalization-free-training pretrained-models pretrained-weights pytorch randaugment resnet vision-transformer-models

Last synced: 05 Sep 2024

https://github.com/paddlepaddle/paddle

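PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (core framework of PaddlePaddle (飞桨): high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)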

deep-learning distributed-training efficiency machine-learning neural-network paddlepaddle python scalability

Last synced: 29 Sep 2024

https://github.com/PaddlePaddle/Paddle

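PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (core framework of PaddlePaddle (飞桨): high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)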

deep-learning distributed-training efficiency machine-learning neural-network paddlepaddle python scalability

Last synced: 31 Jul 2024

https://github.com/paddlepaddle/paddlenlp

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting a wide range of NLP tasks from research to industrial applications, including 🗂 Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis, etc.

bert compression distributed-training document-intelligence embedding ernie information-extraction llama llm neural-search nlp paddlenlp pretrained-models question-answering search-engine semantic-analysis sentiment-analysis transformers uie

Last synced: 29 Sep 2024

https://github.com/PaddlePaddle/PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting a wide range of NLP tasks from research to industrial applications, including 🗂 Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis, etc.

bert compression distributed-training document-intelligence embedding ernie information-extraction llama llm neural-search nlp paddlenlp pretrained-models question-answering search-engine semantic-analysis sentiment-analysis transformers uie

Last synced: 31 Jul 2024

https://github.com/skypilot-org/skypilot

SkyPilot: Run LLMs, AI, and Batch jobs on any cloud. Get maximum savings, highest GPU availability, and managed execution—all with a simple interface.

cloud-computing cloud-management cost-management cost-optimization data-science deep-learning distributed-training finops gpu hyperparameter-tuning job-queue job-scheduler llm-serving llm-training machine-learning ml-infrastructure ml-platform multicloud spot-instances tpu

Last synced: 29 Sep 2024

https://github.com/FedML-AI/FedML

FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI jobs on any GPU cloud or on-premise cluster. Built on this library, FEDML Nexus AI (https://fedml.ai) is your generative AI platform at scale.

ai-agent deep-learning distributed-training edge-ai federated-learning inference-engine machine-learning mlops model-deployment model-serving on-device-training

Last synced: 01 Aug 2024

https://github.com/fedml-ai/fedml

FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI jobs on any GPU cloud or on-premise cluster. Built on this library, FEDML Nexus AI (https://fedml.ai) is your generative AI platform at scale.

ai-agent deep-learning distributed-training edge-ai federated-learning inference-engine machine-learning mlops model-deployment model-serving on-device-training

Last synced: 30 Sep 2024

https://github.com/idea-ccnl/fengshenbang-lm

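Fengshenbang-LM (封神榜大模型) is an open-source large-model ecosystem led by the Center for Cognitive Computing and Natural Language (CCNL) at the IDEA Research Institute, intended to serve as infrastructure for Chinese AIGC and cognitive intelligence.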

aigc chinese-nlp distributed-training multimodal pretrained-models pytorch transformers

Last synced: 30 Sep 2024

https://github.com/IDEA-CCNL/Fengshenbang-LM

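Fengshenbang-LM (封神榜大模型) is an open-source large-model ecosystem led by the Center for Cognitive Computing and Natural Language (CCNL) at the IDEA Research Institute, intended to serve as infrastructure for Chinese AIGC and cognitive intelligence.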

aigc chinese-nlp distributed-training multimodal pretrained-models pytorch transformers

Last synced: 31 Jul 2024

https://github.com/bytedance/byteps

A high performance and generic framework for distributed DNN training

deep-learning distributed-training keras machine-learning mxnet pytorch tensorflow

Last synced: 29 Sep 2024

https://github.com/determined-ai/determined

Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.

data-science deep-learning distributed-training hyperparameter-optimization hyperparameter-search hyperparameter-tuning keras kubernetes machine-learning ml-infrastructure ml-platform mlops pytorch tensorflow

Last synced: 29 Sep 2024

https://github.com/learning-at-home/hivemind

Decentralized deep learning in PyTorch. Built to train models on thousands of volunteers across the world.

asynchronous-programming asyncio deep-learning dht distributed-systems distributed-training hivemind machine-learning mixture-of-experts neural-networks pytorch volunteer-computing

Last synced: 30 Sep 2024

https://github.com/intelligent-machine-learning/dlrover

DLRover: An Automatic Distributed Deep Learning System

distributed-training k8s llm-training

Last synced: 01 Oct 2024

https://github.com/deeprec-ai/deeprec

DeepRec is a high-performance recommendation deep learning framework based on TensorFlow. It is hosted as an incubation project in the LF AI & Data Foundation.

advertising deep-learning distributed-training machine-learning python recommendation-engine scalability search-engine

Last synced: 30 Sep 2024

https://github.com/DeepRec-AI/DeepRec

DeepRec is a high-performance recommendation deep learning framework based on TensorFlow. It is hosted as an incubation project in the LF AI & Data Foundation.

advertising deep-learning distributed-training machine-learning python recommendation-engine scalability search-engine

Last synced: 01 Aug 2024

https://github.com/alibaba/Megatron-LLaMA

Best practice for training LLaMA models in Megatron-LM

deepspeed distributed-training llama llm megatron-lm pretraining pytorch

Last synced: 31 Jul 2024

https://github.com/petuum/adaptdl

Resource-adaptive cluster scheduler for deep learning training.

aws cloud deep-learning distributed-systems distributed-training kubernetes machine-learning pytorch

Last synced: 03 Aug 2024

https://github.com/pytorch/torchx

TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and support for E2E production ML pipelines when you're ready.

airflow aws-batch components deep-learning distributed-training kubernetes machine-learning pipelines python pytorch ray slurm

Last synced: 29 Sep 2024

https://github.com/maudzung/YOLO3D-YOLOv4-PyTorch

YOLO3D: End-to-end real-time 3D Oriented Object Bounding Box Detection from LiDAR Point Cloud (ECCV 2018)

3d-object-detection darknet distributed-training object-detection point-cloud real-time rotated-boxes-iou yolo3d yolov4

Last synced: 31 Jul 2024

https://github.com/DeNA/HandyRL

HandyRL is a handy and simple framework based on Python and PyTorch for distributed reinforcement learning that is applicable to your own environments.

deep-learning distributed-training games machine-learning policy-gradient pytorch reinforcement-learning

Last synced: 01 Aug 2024

https://github.com/hmunachi/nanodl

A JAX-based library for designing and training transformer models from scratch.

attention attention-mechanism deep-learning distributed-training flax gpt jax llama machine-learning mistral nlp transformer

Last synced: 27 Sep 2024

https://github.com/alibaba/EasyParallelLibrary

Easy Parallel Library (EPL) is a general and efficient deep learning framework for distributed model training.

data-parallelism deep-learning distributed-training gpu memory-efficient model-parallelism pipeline-parallelism

Last synced: 01 Aug 2024

https://github.com/PKU-DAIR/Hetu

A high-performance distributed deep learning system targeting large-scale and automated distributed training.

artificial-intelligence autograd data-science deep-learning deep-neural-networks distributed-systems distributed-training embeddings gpu high-dimensional machine-learning python state-of-the-art

Last synced: 31 Jul 2024

https://github.com/alibaba/TePDist

TePDist (TEnsor Program DISTributed) is an HLO-level automatic distributed system for DL models.

auto-parallelization compiler deep-learning disthlo distributed-computing distributed-systems distributed-training high-performance-computing machine-learning rhino

Last synced: 01 Aug 2024

https://github.com/AdrianBZG/LLM-distributed-finetune

Efficiently fine-tune any LLM from HuggingFace using distributed training (multiple GPUs) and DeepSpeed. Uses Ray AIR to orchestrate the training on multiple AWS GPU instances.

aws deep-learning distributed-training falcon fine-tuning huggingface large-language-models natural-language-processing transformers

Last synced: 03 Aug 2024

https://github.com/saareliad/FTPipe

FTPipe and related pipeline model parallelism research.

deep-neural-networks distributed-training fine-tuning nlp pipeline-parallelism t5

Last synced: 01 Aug 2024

https://github.com/uw-mad-dash/shockwave

Code for "Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning" [NSDI '23]

cloud-computing cluster-scheduler deep-learning distributed-systems distributed-training machine-learning pytorch

Last synced: 02 Aug 2024

https://github.com/aws-samples/TensorFlow-in-SageMaker-workshop

Running your TensorFlow models in Amazon SageMaker

amazon-sagemaker distributed-training pipemode tensorflow

Last synced: 01 Aug 2024

https://github.com/4paradigm/openembedding

OpenEmbedding is an open-source framework for TensorFlow distributed training acceleration.

distributed-training embedding-layers model-parallel parameter-server tensorflow tensorflow-training

Last synced: 02 Oct 2024

https://github.com/SLAMPAI/large-scale-pretraining-transfer

Code for reproducing the experiments on large-scale pre-training and transfer learning for the paper "Effect of large-scale pre-training on full and few-shot transfer learning for natural and medical images" (https://arxiv.org/abs/2106.00116)

big-transfer chest-x-ray14 chest-xray-images chexpert-dataset covidx-dataset deep-learning distributed-training few-shot-learning fine-tuning imagenet large-scale-learning medical-imaging mimic-cxr padchest-dataset pre-trained-model pre-training pytorch scaling-laws supercomputing transfer-learning

Last synced: 03 Aug 2024

https://github.com/asprenger/distributed-training-patterns

Experiments with low-level communication patterns that are useful for distributed training.

distributed-training horovod mpi mpi4py nccl tensorflow

Last synced: 05 Aug 2024

https://github.com/alex-snd/trecover

📜 A Python library for distributed training of a Transformer neural network across the Internet to solve the Running Key Cipher, widely known in the field of cryptography.

celery cryptography deep-learning distributed-systems distributed-training fastapi hivemind keyless-reading llm machine-learning mkdocs neural-network nlp python pytorch pytorch-lightning streamlit text-recovery transformers volunteer-computing

Last synced: 27 Sep 2024