Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/facebookresearch/HolisticTraceAnalysis

A library to analyze PyTorch traces.
https://github.com/facebookresearch/HolisticTraceAnalysis

Last synced: 9 days ago
JSON representation

A library to analyze PyTorch traces.

Host: GitHub
URL: https://github.com/facebookresearch/HolisticTraceAnalysis
Owner: facebookresearch
License: mit
Created: 2022-11-29T20:55:25.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2024-07-07T04:10:35.000Z (4 months ago)
Last Synced: 2024-07-08T09:59:23.947Z (4 months ago)
Language: Python
Homepage: http://hta.readthedocs.io
Size: 56.7 MB
Stars: 251
Watchers: 17
Forks: 33
Open Issues: 21
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md

Awesome Lists containing this project

README

[![CircleCI](https://circleci.com/gh/facebookresearch/HolisticTraceAnalysis.svg?style=shield)](https://app.circleci.com/pipelines/github/facebookresearch/HolisticTraceAnalysis)
[![codecov](https://codecov.io/github/facebookresearch/holistictraceanalysis/branch/main/graph/badge.svg?token=R44P6M3RJN)](https://codecov.io/github/facebookresearch/holistictraceanalysis)
[![Docs](https://readthedocs.org/projects/hta/badge/?version=latest)](https://hta.readthedocs.io/en/latest/?badge=latest)
[![License](https://img.shields.io/badge/License-MIT-blue.svg)](https://github.com/facebookresearch/HolisticTraceAnalysis/blob/main/LICENSE)
[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](https://github.com/facebookresearch/HolisticTraceAnalysis/blob/main/CONTRIBUTING.md)

# Holistic Trace Analysis

Holistic Trace Analysis (HTA), is a performance analysis tool to identify performance bottlenecks in
distributed training workloads. HTA achieves this by analyzing traces collected through the [PyTorch
Profiler](https://github.com/pytorch/kineto) a.k.a. Kineto.

## Features

HTA provides the following features:

1. __Temporal Breakdown__ - Breakdown of time taken by the GPUs in terms of time spent in
computation, communication, memory events, and idle time across all ranks.
1. __Kernel Breakdown__ - Finds kernels with the longest duration on each rank.
1. __Kernel Duration Distribution__ - Distribution of average time taken by longest kernels across
different ranks.
1. __Idle Time Breakdown__ - Breakdown of GPU idle time into waiting for the host, waiting for
another kernel or attribution to an unknown cause.
1. __Communication Computation Overlap__ - Calculate the percentage of time when communication
overlaps computation.
1. __Frequent CUDA Kernel Patterns__ - Find the CUDA kernels most frequently launched by any given
PyTorch or user defined operator.
1. __CUDA Kernel Launch Statistics__ - Distributions of GPU kernels with very small duration, large
duration, and excessive launch time.
1. __Augmented Counters (Queue length, Memory bandwidth)__ - Augmented trace files which provide
insights into memory bandwidth utilized and number of outstanding operations on each CUDA stream.
1. __Trace Comparison__ - A trace comparison tool to identify and visualize the differences between
traces.
1. __CUPTI Counter Analysis__ - An experimental API to get GPU performance counters. By attributing
performance measurements from kernels to PyTorch operators roofline analysis can be performed and
kernels can be optimized.

## Installation

HTA runs on Linux and Mac with Python >= 3.8.

### Setup a Conda environment (optional)

See [here](https://docs.conda.io/en/latest/miniconda.html) to install Miniconda.

Create the environment `env_name`
``` bash
conda create -n env_name
```

Activate the environment
``` bash
conda activate env_name
```

Deactivate the environment
``` bash
conda deactivate
```

### Install using PyPI (stable)

```
pip install HolisticTraceAnalysis
```

### Install from source

```
git clone https://github.com/facebookresearch/HolisticTraceAnalysis.git
cd HolisticTraceAnalysis
git submodule update --init
pip install -r requirements.txt
pip install -e .
```

## Documentation

Learn more about the features and the API from our [documentation](https://hta.readthedocs.io/en/latest/index.html).

## Usage

### Data Preparation
All traces collected from a job must reside in a unique folder.

### Analysis in a Jupyter notebook

Activate the Conda environment and launch a Jupyter notebook.
```
conda activate env_name
jupyter notebook
```

Import HTA, and create a `TraceAnalysis` object
``` python
from hta.trace_analysis import TraceAnalysis
analyzer = TraceAnalysis(trace_dir = "/path/to/folder/containing/the/traces")
```

#### Basic Usage

``` python
# Temporal breakdown
temporal_breakdown_df = analyzer.get_temporal_breakdown()

# Kernel breakdown
kernel_breakdown_df = analyzer.get_gpu_kernel_breakdown()

# Idle time breakdown
idle_time_df = analyzer.get_idle_time_breakdown()

# Communication computation overlap
comm_comp_overlap_df = analyzer.get_comm_comp_overlap()

# Frequent CUDA kernel patterns
frequent_patterns_df = analyzer.get_frequent_cuda_kernel_patterns(operator_name="aten::linear", output_dir="/new/trace/path")

# CUDA kernel launch statistics
cuda_launch_kernel_stats = analyzer.get_cuda_kernel_launch_stats()

# Memory bandwidth time series
memory_bw_series = analyzer.get_memory_bw_time_series()

# Memory bandwidth summary
memory_bw_summary = analyzer.get_memory_bw_summary()

# Queue length time series
ql_series = analyzer.get_queue_length_time_series()

# Queue length summary
ql_summary = analyzer.get_queue_length_summary()
```

For a detailed demo run the `trace_analysis_demo` and `trace_diff_demo` notebooks in the examples folder.

#### Advanced Usage

__Logging Level__

Logging level is set through a configuration file in HTA. The default logging level is set in
`hta/configs/logging.config` and can be changed in the `[logger_hta]` section of the file.
If needed, a different logging file can be configured to use by modifying
`hta/configs/trace_analyzer.json`.

#### Repo Map

```
├── examples # folder containing demo notebooks
│ ├── ...
├── hta
│ ├── analyzers # core logic for each analysis
│ │ ├── ...
│ ├── common # code common to multiple analysis
│ │ ├── ...
│ ├── configs # config files
│ │ ├── ...
│ ├── trace_analysis.py # entrypoint for TraceAnalysis API
│ ├── trace_diff.py # entrypoint for TraceDiff API
│ └── utils # utility files
│ └── ...
├── scripts # generic tools for traces
│ └── ...
│── tests # unittests
│ └── ...
```

## Contributing
We welcome new contributions. If you plan to contribute new features or extensions, please first
open an [issue](https://github.com/facebookresearch/HolisticTraceAnalysis/issues) and discuss the feature with
us. To learn more about how to contribute, see our [contributing guidelines](https://github.com/facebookresearch/HolisticTraceAnalysis/blob/main/CONTRIBUTING.md).

Please let us know if you encounter a bug by filing an [issue](https://github.com/facebookresearch/HolisticTraceAnalysis/issues).

## The Team
HTA is currently maintained by: [Anupam Bhatnagar](https://github.com/anupambhatnagar), [Brian Coutinho](https://github.com/briancoutinho),
[Xizhou Feng](https://github.com/fengxizhou), [Yifan Liu](https://github.com/yifanliu112), [Sung-Han Lin](https://github.com/sunghlin) and
[Louis Feng](https://github.com/louisfeng). Past contributors include [Michael Acar](https://github.com/mjacar) and [Yuzhen Huang](https://github.com/Yuzhen11).

## License
Holistic Trace Analysis is licensed under the [MIT License](https://github.com/facebookresearch/HolisticTraceAnalysis/blob/main/LICENSE).