https://github.com/argonne-lcf/mlprof

Profiling tools for performance studies of competing ML frameworks on HPC systems
Host: GitHub
URL: https://github.com/argonne-lcf/mlprof
Owner: argonne-lcf
Created: 2022-12-05T20:42:42.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2023-07-31T17:57:03.000Z (over 1 year ago)
Last Synced: 2024-08-01T16:53:11.633Z (3 months ago)
Language: Python
Size: 272 KB
Stars: 2
Watchers: 7
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
README

# mlprof

[](https://wandb.ai/l2hmc-qcd/mlprof?workspace=user-saforem2)   

Profiling tools for performance studies of competing ML frameworks on HPC systems



## TODO

### 06/05/2023

- [ ] Add check to determine if running on Intel GPUs, if so: load `intel_extension_for_{pytorch,deepspeed}`

  - [ ] Modify implementation to add support for Intel GPUs, test on ALCF systems

- [ ] Add support for additional (transformer based) model architectures in [`src/mlprof/network/pytorch/*`](./src/mlprof/network/pytorch/)

  - [ ] _ideally_, support for pulling in arbitrary models from HuggingFace, `torchvision`, etc.
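The conditional loading described in the first item could be sketched as follows. This is a minimal illustration, not the project's implementation: it uses "is the extension installed and importable?" as a stand-in for "are we running on Intel GPUs?", and the helper name `load_optional_extensions` is hypothetical. The module names come from the TODO item itself.

```python
import importlib
import importlib.util


def load_optional_extensions(names):
    """Import each named module if available; return {name: module or None}."""
    loaded = {}
    for name in names:
        if importlib.util.find_spec(name) is None:
            loaded[name] = None  # not installed in this environment
            continue
        try:
            loaded[name] = importlib.import_module(name)
        except ImportError:
            # Installed, but one of its own dependencies failed to import.
            loaded[name] = None
    return loaded


# Module names taken from the TODO item; whether they are installed
# depends on the system.
INTEL_EXTENSIONS = [
    "intel_extension_for_pytorch",
    "intel_extension_for_deepspeed",
]

if __name__ == "__main__":
    for name, module in load_optional_extensions(INTEL_EXTENSIONS).items():
        status = "loaded" if module is not None else "not available"
        print(f"{name}: {status}")
```

A real check would also need to confirm that an Intel device is actually visible to the framework, which this sketch leaves out.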

	

### 04/17/2023

- [ ] Work on repeating MPI profile experiments with larger batch size / network size using `module load conda/2023-01-10-unstable` on Polaris

- [ ] Try with single + multiple nodes to measure performance impact

### Older

- [x] Write DeepSpeed Trainer that wraps [`src/mlprof/network/pytorch/network.py`](./src/mlprof/network/pytorch/network.py)

    - Reference: [DeepSpeed -- Getting Started](https://www.deepspeed.ai/getting-started/)

- [ ] MPI Profiling to get all collective comm. ops with same model in DeepSpeed, DDP, and Horovod

   - Reference: [Profiling](https://github.com/argonne-lcf/mlprof#profiling) using `libmpitrace.so` on Polaris

- [ ] Start with 2 nodes first and next scale w/ increasing number of nodes

- [ ] Get profiles for DeepSpeed Zero 1, 2, 3 and Mixture of experts (MoE)

- [ ] Identify what parameters can impact performance such as NCCL environment variables and framework-specific parameters

- [ ] Do the analysis for standard models and large language models (LLMs)

- [ ] Develop auto-tuning methods to set these parameters for optimal performance

    

#### 2023-02-20

- [ ] Associate `mpiprofile`'s with backend + attach logs to keep everything together

- [ ] Scale up message sizes in mpiprofiles

- [ ] Aggregate into table, grouped by backend 

- [ ] Test `fp16` support w/ all backends

- [ ] Ensure all GPUs being utilized 

	- w/ `conda/2022-09-08-hvd-nccl` all processes get mapped to GPU0 for some reason     
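One common way to avoid every process landing on GPU0 is to pick the device from the process's local (per-node) rank. The sketch below is an assumption-laden illustration, not the fix used in this repository: the list of environment variables depends on the launcher, and `gpus_per_node=4` is assumed from the four ranks per host in the profile output further down.

```python
import os

# Environment variables that commonly carry the per-node (local) rank.
# Which one is set depends on the launcher; this list is an assumption.
LOCAL_RANK_VARS = ("LOCAL_RANK", "PMI_LOCAL_RANK", "OMPI_COMM_WORLD_LOCAL_RANK")


def local_rank(environ=os.environ):
    """Return the first local rank found in the environment, else 0."""
    for var in LOCAL_RANK_VARS:
        value = environ.get(var)
        if value is not None:
            return int(value)
    return 0


def device_index(environ=os.environ, gpus_per_node=4):
    """Map the local rank onto a GPU index, wrapping if ranks exceed GPUs."""
    return local_rank(environ) % gpus_per_node


if __name__ == "__main__":
    # With PyTorch, the result would typically be passed to
    # torch.cuda.set_device(...) before any tensors are created.
    print(f"local rank {local_rank()} -> GPU {device_index()}")
```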



## Setup

> **Note**
>
> These instructions assume that your active environment already has
> the required ML libraries installed.
>
> This allows us to perform an isolated editable installation _inside_ our
> existing environment, and allows it to access previously installed libraries.

To install:

```bash
# for ALCF systems, first:
module load conda ; conda activate base

# otherwise, start here:
python3 -m venv venv --system-site-packages
source venv/bin/activate
python3 -m pip install --upgrade pip setuptools wheel
python3 -m pip install -e .
```

## Running Experiments

We support distributed training using the following backends:

- [microsoft/DeepSpeed](https://github.com/microsoft/deepspeed) (`backend=deepspeed`)

- [horovod/horovod](https://github.com/horovod/horovod) (`backend=horovod`)

- [pytorch DDP](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) (`backend=DDP`)

which we specify via `backend=BACKEND` as an argument to the [src/mlprof/train.sh](./src/mlprof/train.sh) script:

```bash
cd src/mlprof
# run in the background, sending stdout and stderr to train.log
./train.sh backend=BACKEND > train.log 2>&1 &
```

and view the resulting output:

```bash

tail -f train.log $(tail -1 logs/latest)

```

### Configuration

Configuration options are specified in [`src/mlprof/conf/config.yaml`](src/mlprof/conf/config.yaml) and can be overridden on the command line, e.g.

```bash

./train.sh backend=DDP data.batch_size=256 network.hidden_size=64 > train.log 2>&1 &

```
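To make the `key=value` override syntax concrete, the sketch below applies dotted overrides to a nested dictionary. It only illustrates the semantics and is not how `train.sh` or the project parses its arguments; the helper names and the starting values in the example config are hypothetical.

```python
import ast


def parse_value(text):
    """Interpret numbers, booleans, etc.; fall back to the raw string."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return text


def apply_overrides(config, overrides):
    """Apply 'a.b.c=value' style overrides to a nested dict in place."""
    for item in overrides:
        key, _, raw = item.partition("=")
        *parents, leaf = key.split(".")
        node = config
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = parse_value(raw)
    return config


if __name__ == "__main__":
    # Hypothetical defaults, standing in for the contents of config.yaml.
    config = {
        "backend": "deepspeed",
        "data": {"batch_size": 128},
        "network": {"hidden_size": 32},
    }
    apply_overrides(
        config,
        ["backend=DDP", "data.batch_size=256", "network.hidden_size=64"],
    )
    print(config)
```

Each dotted key walks one level deeper into the configuration, so `data.batch_size=256` changes only the `batch_size` entry under `data`.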

### Running on Polaris

To run on Polaris, request an interactive job, then set up the environment and launch training from inside it:

```bash
# Request an interactive job (replace <PROJECT> with your allocation name):
qsub \
  -A <PROJECT> \
  -q debug-scaling \
  -l select=2 \
  -l walltime=12:00:00,filesystem=eagle:home:grand \
  -I

# Once the interactive session starts:
module load conda/2023-01-10-unstable
conda activate base
git clone https://www.github.com/argonne-lcf/mlprof
cd mlprof
mkdir -p venvs/polaris/2023-01-10
python3 -m venv venvs/polaris/2023-01-10 --system-site-packages
source venvs/polaris/2023-01-10/bin/activate
python3 -m pip install --upgrade pip setuptools wheel
python3 -m pip install -e .
cd src/mlprof

# TO TRAIN:
./train.sh backend=deepspeed > train.log 2>&1 &

# TO VIEW OUTPUT:
tail -f train.log $(tail -1 logs/latest)
```



> **Warning**
>
> _Running with DeepSpeed_
>
> If you're using DeepSpeed directly to launch the multi-node training, you will need to ensure the following environment variables are defined in your `.deepspeed_env` file.
>
> The contents of this file should be one environment variable per line, formatted as `KEY=VALUE`.
> Each of these environment variables will be explicitly set on every worker node using DeepSpeed.
>
> ```bash
> # -------------------------------------------------------------
> # the following are necessary when using the DeepSpeed backend
> export CFLAGS="-I${CONDA_PREFIX}/include/"
> export LDFLAGS="-L${CONDA_PREFIX}/lib/"
> echo "PATH=${PATH}" > .deepspeed_env
> echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH}" >> .deepspeed_env
> echo "https_proxy=${https_proxy}" >> .deepspeed_env
> echo "http_proxy=${http_proxy}" >> .deepspeed_env
> echo "CFLAGS=${CFLAGS}" >> .deepspeed_env
> echo "LDFLAGS=${LDFLAGS}" >> .deepspeed_env
> # -------------------------------------------------------------
> ```
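
The same `KEY=VALUE` file could also be produced from Python, for instance from a launch script. This is a sketch under assumptions: the function name `write_deepspeed_env` is hypothetical, and unlike the shell snippet above it skips variables that are not set rather than writing an empty value.

```python
import os
from pathlib import Path

# Variable names taken from the shell snippet above.
KEYS = ("PATH", "LD_LIBRARY_PATH", "https_proxy", "http_proxy", "CFLAGS", "LDFLAGS")


def write_deepspeed_env(path=".deepspeed_env", keys=KEYS, environ=os.environ):
    """Write one KEY=VALUE line per variable that is set; return the lines."""
    lines = [f"{key}={environ[key]}" for key in keys if key in environ]
    Path(path).write_text("\n".join(lines) + "\n")
    return lines


if __name__ == "__main__":
    for line in write_deepspeed_env():
        print(line)
```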

### Profiling

To run an experiment with `mpitrace` enabled on Polaris, we can explicitly set the `LD_PRELOAD` environment variable, e.g.

```bash

LD_PRELOAD=/soft/perftools/mpitrace/lib/libmpitrace.so ./train.sh > train.log 2>&1 &

```

which will write MPI profiling information to an `mpi_profile.XXXXXX.Y` file containing the following information:



```bash
Data for MPI rank 0 of 8:
Times from MPI_Init() to MPI_Finalize().
-----------------------------------------------------------------------
MPI Routine                        #calls     avg. bytes      time(sec)
-----------------------------------------------------------------------
MPI_Comm_rank                           3            0.0          0.000
MPI_Comm_size                           1            0.0          0.000
MPI_Bcast                               2           16.5          0.000
-----------------------------------------------------------------------
total communication time = 0.000 seconds.
total elapsed time       = 232.130 seconds.
user cpu time            = 122.013 seconds.
system time              = 96.950 seconds.
max resident set size    = 4064.422 MiB.
-----------------------------------------------------------------
Message size distributions:
MPI_Bcast                 #calls    avg. bytes      time(sec)
                               1           4.0          0.000
                               1          29.0          0.000
-----------------------------------------------------------------
Summary for all tasks:
  Rank 0 reported the largest memory utilization : 4064.42 MiB
  Rank 0 reported the largest elapsed time : 232.13 sec
  minimum communication time = 0.000 sec for task 6
  median  communication time = 0.000 sec for task 5
  maximum communication time = 0.000 sec for task 4
MPI timing summary for all ranks:
taskid             host    cpu    comm(s)  elapsed(s)     user(s)   system(s)   size(MiB)    switches
     0   x3210c0s37b1n0      0       0.00      232.13      122.01       96.95     4064.42   240460957
     1   x3210c0s37b1n0      1       0.00      227.60      126.06       95.88     4001.15   231353798
     2   x3210c0s37b1n0      2       0.00      227.63      135.59       85.93     3965.89   230507191
     3   x3210c0s37b1n0      3       0.00      227.63      126.33       95.75     4003.07   230342296
     4    x3210c0s7b0n0      0       0.00      227.66      137.07       83.80     4039.70   209534784
     5    x3210c0s7b0n0      1       0.00      227.64      125.65       96.13     4004.05   230622703
     6    x3210c0s7b0n0      2       0.00      227.64      134.53       87.16     3968.59   229010244
     7    x3210c0s7b0n0      3       0.00      227.67      125.24       96.90     4004.26   233186459
```