Framework, Model & Kernel Optimizations for Distributed Deep Learning - Data Hack Summit
https://github.com/abhilash1910/framework-optimization
- Host: GitHub
- URL: https://github.com/abhilash1910/framework-optimization
- Owner: abhilash1910
- Created: 2023-07-02T10:39:48.000Z (almost 2 years ago)
- Default Branch: master
- Last Pushed: 2023-08-01T18:34:59.000Z (almost 2 years ago)
- Last Synced: 2025-01-22T12:13:12.799Z (5 months ago)
- Topics: codegen, ddp, deepspeed, fsdp, inductor, pipelineparallel, pytorch, tensorparallel, triton
- Language: Python
- Homepage:
- Size: 41.7 MB
- Stars: 1
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: readme.md
README
## Distributed Framework Optimization - Data Hack Summit 2023

Deep learning frameworks form the baseline over which millions of models (LLMs, multimodal and autoregressive models) are compiled and built.
Many of these frameworks require sophisticated optimization to make models train and infer faster on constrained hardware. The intrinsic kernels that form part of these frameworks (such as PyTorch) leverage deeply adaptive features that help break performance benchmarks in supercomputing and federated deep learning. This is a glimpse of the different sub-kernel, intermediate framework, and higher-level model optimization techniques that help people run large models such as GPTs in constrained environments and clusters.
### PyTorch
Most of the session revolves around different model optimization strategies and how the PyTorch framework can make training and fine-tuning efficient. This involves features such as ATen graph capture, lowering, and composite graph compilation (by Inductor), followed by device-specific IR that the device compiler can further optimize for model performance.
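
A minimal sketch of this pipeline, assuming PyTorch 2.x where `torch.compile` routes the captured ATen graph through the Inductor backend (the toy model and shapes below are illustrative only):

```python
import torch
import torch.nn as nn

# A small toy model; the layer sizes are illustrative only.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

# torch.compile captures the ATen graph (via TorchDynamo), lowers it, and hands
# the composite graph to the Inductor backend, which emits device-specific code
# (e.g. Triton kernels on GPU, C++/OpenMP on CPU).
compiled_model = torch.compile(model, backend="inductor")

x = torch.randn(32, 128)
out = compiled_model(x)  # first call triggers compilation; later calls reuse it
print(out.shape)
```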

### Distributed PyTorch
To extend the different parallelisms over a dedicated set of hardware combinations (CPU-GPU, GPU-GPU, multi-XPU, multi-TPU, MPS), the distributed backend of PyTorch comes into the picture. It enables scaling sharded models, data, and parameters up and out, so that gradients, checkpoints, and activations are distributed efficiently across devices.
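
A minimal sketch of initialising the distributed backend and running a collective, assuming the script is launched with `torchrun` (the backend choice and tensor contents are illustrative):

```python
import torch
import torch.distributed as dist

def main():
    # Assumes launch via torchrun, which sets RANK, WORLD_SIZE and MASTER_ADDR/PORT.
    dist.init_process_group(backend="gloo")  # "nccl" on GPU clusters
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # A simple collective: sum a tensor across all processes.
    t = torch.ones(1) * rank
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {rank} of {world_size}: all-reduce result = {t.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched, for example, with `torchrun --nproc_per_node=2 script.py`.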

### Data, Model and Pipeline Parallelism
In data-parallel training, the dataset is split into several shards and each shard is allocated to a device. This is equivalent to parallelizing the training process along the batch dimension.
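
A hedged sketch of data parallelism with `DistributedDataParallel`, assuming a `torchrun` launch; the model, data, and learning rate here are placeholders:

```python
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="gloo")   # "nccl" on GPU clusters
    rank = dist.get_rank()

    model = nn.Linear(32, 4)                  # each rank holds a full replica
    ddp_model = DDP(model)                    # gradients are all-reduced during backward
    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

    # Each rank trains on its own data shard (a random batch stands in here).
    x, y = torch.randn(16, 32), torch.randn(16, 4)
    loss = nn.functional.mse_loss(ddp_model(x), y)
    opt.zero_grad()
    loss.backward()                           # DDP synchronises gradients here
    opt.step()

    if rank == 0:
        print("step done; replicas stay in sync")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```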

Model parallelism involves sharding whole model blocks (rather than individual tensor lists) across devices in a uniform manner.
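
A conceptual sketch of block-wise model parallelism; both devices are kept as `cpu` so the snippet runs anywhere, but in practice they would be distinct accelerators (e.g. `cuda:0`/`cuda:1`):

```python
import torch
import torch.nn as nn

# Hypothetical two-device split; "cpu" is used only so the sketch runs anywhere.
DEVICE_0 = torch.device("cpu")
DEVICE_1 = torch.device("cpu")

class TwoStageModel(nn.Module):
    """First block lives on DEVICE_0, second block on DEVICE_1."""
    def __init__(self):
        super().__init__()
        self.block0 = nn.Sequential(nn.Linear(64, 128), nn.ReLU()).to(DEVICE_0)
        self.block1 = nn.Linear(128, 10).to(DEVICE_1)

    def forward(self, x):
        h = self.block0(x.to(DEVICE_0))
        # The activation is moved across devices between the sharded blocks.
        return self.block1(h.to(DEVICE_1))

model = TwoStageModel()
print(model(torch.randn(8, 64)).shape)
```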

Pipeline parallelism splits the model layer by layer into several chunks, and each chunk is given to a device. The caveat is that a single optimizer.step() forces forward passes (moving up the pipeline stages) and backward passes (moving back down) to run in an interleaved manner.
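
A simplified, GPipe-style sketch of that schedule on a single process (stage split, micro-batch count, and sizes are illustrative; a real pipeline engine overlaps the stages across devices):

```python
import torch
import torch.nn as nn

# Two pipeline stages (kept on CPU so the sketch runs anywhere; in practice
# each stage would sit on its own accelerator).
stage0 = nn.Sequential(nn.Linear(64, 128), nn.ReLU())
stage1 = nn.Linear(128, 10)
params = list(stage0.parameters()) + list(stage1.parameters())
opt = torch.optim.SGD(params, lr=0.1)
loss_fn = nn.MSELoss()

batch_x, batch_y = torch.randn(32, 64), torch.randn(32, 10)
micro_batches = zip(batch_x.chunk(4), batch_y.chunk(4))

# Micro-batch schedule: forward/backward per micro-batch, gradients accumulate
# across micro-batches, then a single optimizer.step() covers the whole batch.
opt.zero_grad()
for mx, my in micro_batches:
    out = stage1(stage0(mx))        # activation flows stage0 -> stage1
    loss = loss_fn(out, my) / 4     # scale so gradients match the full-batch loss
    loss.backward()
opt.step()
```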

### DeepSpeed ZeRO
ZeRO leverages the aggregate computation and memory resources of data parallelism to reduce the memory and compute requirements of each device (GPU) used for model training. It reduces the memory consumption of each GPU by partitioning the various model training states (weights, gradients, and optimizer states) across the available devices (GPUs and CPUs) in the distributed training hardware.
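
A representative ZeRO stage-2 configuration expressed as a Python dict (the keys follow the DeepSpeed config schema; the particular batch sizes, precision, and offload choices are illustrative):

```python
# A representative ZeRO stage-2 configuration as a Python dict.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 4,
    "zero_optimization": {
        "stage": 2,                              # partition optimizer states + gradients
        "offload_optimizer": {"device": "cpu"},  # push optimizer states to host RAM
    },
    "fp16": {"enabled": True},
}

# This dict is typically passed to deepspeed.initialize(...) together with the
# model; the returned engine's backward()/step() then handle the partitioned
# training states transparently.
```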

### Triton Compiler
Triton is a deep learning compiler created specifically to abstract IR code and optimize kernels that would otherwise be difficult to optimize in CUDA.
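
A minimal Triton kernel (element-wise vector addition, following the standard Triton programming model); the block size is illustrative and a Triton-supported GPU is assumed:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                     # which block this program handles
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                     # guard against out-of-range lanes
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)                  # one program per 1024-element block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

if __name__ == "__main__":
    a = torch.randn(4096, device="cuda")
    b = torch.randn(4096, device="cuda")
    torch.testing.assert_close(add(a, b), a + b)
```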
