Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/shroukmansour/distributed-model-training-ray
- Host: GitHub
- URL: https://github.com/shroukmansour/distributed-model-training-ray
- Owner: ShroukMansour
- Created: 2024-09-15T23:55:06.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2024-09-16T00:26:01.000Z (2 months ago)
- Last Synced: 2024-09-17T01:45:13.307Z (about 2 months ago)
- Language: Python
- Size: 3.91 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# distributed-model-training-ray
Description: This project demonstrates how to train a PyTorch neural network in a distributed manner using Ray. It implements a training pipeline that scales across multiple workers, with communication and synchronization handled by Ray's TorchTrainer. This setup enables efficient training of large models across multiple machines or GPUs, using all-reduce for gradient synchronization.
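The all-reduce step mentioned above can be illustrated in plain Python (a toy sketch only; Ray and PyTorch perform this with optimized collective operations such as NCCL): each worker contributes its local gradients, and every worker receives the element-wise average.

```python
# Toy sketch of all-reduce gradient averaging (illustrative only;
# real distributed training uses optimized collective ops).
def all_reduce_mean(worker_grads):
    """Average per-worker gradient vectors element-wise.

    worker_grads: list of equal-length lists, one per worker.
    Returns the averaged gradient that every worker would receive.
    """
    n_workers = len(worker_grads)
    return [sum(g) / n_workers for g in zip(*worker_grads)]

# Two workers with local gradients for a 3-parameter model:
grads_w0 = [0.5, -0.5, 1.0]
grads_w1 = [0.5, 0.5, -1.0]
print(all_reduce_mean([grads_w0, grads_w1]))  # [0.5, 0.0, 0.0]
```

After this step all workers hold identical averaged gradients, so their model replicas stay in sync after each optimizer step.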
Key Features:
- Distributed training using Ray's TorchTrainer with multiple workers.
- Neural network training on the FashionMNIST dataset using PyTorch.
- Data parallelism, worker synchronization, and gradient aggregation using the all-reduce algorithm.
- Configurable to run on both CPU and GPU environments.
- Demonstrates integration of data loaders and model preparation using ray.train.torch.
Technologies Used:
- Ray for distributed training
- PyTorch for model building and training
- TorchTrainer for scaling and synchronization
- FashionMNIST dataset for image classification
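Data parallelism here means each worker trains on a disjoint shard of FashionMNIST. The round-robin sharding idea behind PyTorch's DistributedSampler can be sketched in plain Python:

```python
def shard_indices(num_samples, num_workers, rank):
    """Round-robin assignment of dataset indices to one worker.

    Worker `rank` gets every num_workers-th index, so shards are
    disjoint and together cover the whole dataset.
    """
    return list(range(rank, num_samples, num_workers))

# 10 samples split across 2 workers:
print(shard_indices(10, 2, 0))  # [0, 2, 4, 6, 8]
print(shard_indices(10, 2, 1))  # [1, 3, 5, 7, 9]
```

Because each worker processes a different shard per epoch, the effective batch size scales with the number of workers, which is why the averaged all-reduce gradients keep the replicas consistent.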