Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/shroukmansour/distributed-model-training-ray
- Host: GitHub
- URL: https://github.com/shroukmansour/distributed-model-training-ray
- Owner: ShroukMansour
- Created: 2024-09-15T23:55:06.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2024-09-16T00:26:01.000Z (2 months ago)
- Last Synced: 2024-09-17T01:45:13.307Z (about 2 months ago)
- Language: Python
- Size: 3.91 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# distributed-model-training-ray
Description: This project demonstrates how to train a PyTorch neural network in a distributed manner using Ray. It implements a training pipeline that scales across multiple workers, with communication and synchronization handled by Ray's TorchTrainer. This setup enables efficient training of large models across multiple machines or GPUs, using all-reduce for gradient synchronization.
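The all-reduce step mentioned above can be illustrated in plain Python (a toy sketch only; Ray and PyTorch perform this with optimized collective operations such as NCCL): each worker contributes its local gradients, and every worker receives the element-wise average.

```python
# Toy sketch of all-reduce gradient averaging (illustrative only;
# real distributed training uses optimized collective ops).
def all_reduce_mean(worker_grads):
    """Average per-worker gradient vectors element-wise.

    worker_grads: list of equal-length lists, one per worker.
    Returns the averaged gradient that every worker would receive.
    """
    n_workers = len(worker_grads)
    return [sum(g) / n_workers for g in zip(*worker_grads)]

# Two workers with local gradients for a 3-parameter model:
grads_w0 = [0.5, -0.5, 1.0]
grads_w1 = [0.5, 0.5, -1.0]
print(all_reduce_mean([grads_w0, grads_w1]))  # [0.5, 0.0, 0.0]
```

After this step all workers hold identical averaged gradients, so their model replicas stay in sync after each optimizer step.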
Key Features:
- Distributed training using Ray's TorchTrainer with multiple workers.
- Neural network training on the FashionMNIST dataset using PyTorch.
- Data parallelism, worker synchronization, and gradient aggregation using the all-reduce algorithm.
- Configurable to run on both CPU and GPU environments.
- Demonstrates integration of data loaders and model preparation using ray.train.torch.
Technologies Used:
- Ray for distributed training
- PyTorch for model building and training
- TorchTrainer for scaling and synchronization
- FashionMNIST dataset for image classification
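Data parallelism here means each worker trains on a disjoint shard of FashionMNIST. The round-robin sharding idea behind PyTorch's DistributedSampler can be sketched in plain Python:

```python
def shard_indices(num_samples, num_workers, rank):
    """Round-robin assignment of dataset indices to one worker.

    Worker `rank` gets every num_workers-th index, so shards are
    disjoint and together cover the whole dataset.
    """
    return list(range(rank, num_samples, num_workers))

# 10 samples split across 2 workers:
print(shard_indices(10, 2, 0))  # [0, 2, 4, 6, 8]
print(shard_indices(10, 2, 1))  # [1, 3, 5, 7, 9]
```

Because each worker processes a different shard per epoch, the effective batch size scales with the number of workers, which is why the averaged all-reduce gradients keep the replicas consistent.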