https://github.com/benediktalkin/kappaprofiler

lightweight simple profiling for python/pytorch

Host: GitHub
URL: https://github.com/benediktalkin/kappaprofiler
Owner: BenediktAlkin
License: mit
Created: 2022-06-26T14:02:56.000Z (almost 4 years ago)
Default Branch: main
Last Pushed: 2023-03-19T08:37:39.000Z (over 3 years ago)
Last Synced: 2025-05-12T15:54:41.085Z (about 1 year ago)
Topics: cuda, profiler, python, pytorch
Language: Python
Homepage:
Size: 44.9 KB
Stars: 6
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# KappaProfiler

[![publish](https://github.com/BenediktAlkin/KappaProfiler/actions/workflows/publish.yaml/badge.svg)](https://github.com/BenediktAlkin/KappaProfiler/actions/workflows/publish.yaml)

Lightweight profiling utilities for identifying bottlenecks and timing parts of your Python application.

Also supports [async profiling for cuda](https://github.com/BenediktAlkin/KappaProfiler#time-async-operations).

# Setup

- new install: `pip install kappaprofiler`

- upgrade to a new version: `pip install kappaprofiler --upgrade`

# Usage

## Time your whole application

### With decorators

```
import kappaprofiler as kp
import time

@kp.profile
def main():
  time.sleep(0.3)  # simulate some operation
  some_method()

@kp.profile
def some_method():
  time.sleep(0.5)  # simulate some operation

if __name__ == "__main__":
  main()
  print(kp.profiler.to_string())
```

The result will be similar to the following (`time.sleep` is not perfectly accurate, so the timings vary slightly):

```
0.82 main
0.51 main.some_method
```

### With contextmanagers

```
import kappaprofiler as kp
import time

def main():
  with kp.named_profile("main"):
    time.sleep(0.3)  # simulate some operation
    with kp.named_profile("method"):
      some_method()
  with kp.named_profile("main2"):
    time.sleep(0.2)  # simulate some operation

def some_method():
  time.sleep(0.5)  # simulate some operation

if __name__ == "__main__":
  main()
  print(kp.profiler.to_string())
```

The result will be similar to the following (`time.sleep` is not perfectly accurate, so the timings vary slightly):

```
0.82 main
0.51 main.method
0.20 main2
```

## Query nodes

Each profiling entry is represented by a node from which detailed information can be retrieved. The snippet below continues the decorator example above:

```
query = "main.some_method"
node = kp.profiler.get_node(query)
print(f"{query} was called {node.count} time and took {node.to_string()} seconds in total")
```

`main.some_method was called 1 time and took 0.51 seconds in total`

## Time only a part of your program

```
import kappaprofiler as kp

with kp.Stopwatch() as sw:
    # some operation
    ...

print(f"operation took {sw.elapsed_milliseconds} milliseconds")
print(f"operation took {sw.elapsed_seconds} seconds")
```

### Time subparts

```
import kappaprofiler as kp
import time

sw1 = kp.Stopwatch()
sw2 = kp.Stopwatch()

for i in range(1, 3):
    with sw1:
        # operation1
        time.sleep(0.1 * i)
    with sw2:
        # operation2
        time.sleep(0.2 * i)

print(f"operation1 took {sw1.elapsed_seconds:.2f} seconds (average {sw1.average_lap_time:.2f})")
print(f"operation2 took {sw2.elapsed_seconds:.2f} seconds (average {sw2.average_lap_time:.2f})")
```

```
operation1 took 0.32 seconds (average 0.16)
operation2 took 0.61 seconds (average 0.30)
```
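The numbers above are consistent with each `with` block counting as one lap: the total is the sum of all laps and the average is the total divided by the number of laps. The sketch below shows that relationship with a hypothetical `LapStopwatch` class built on `time.perf_counter`. It is an assumption about the semantics for illustration, not the source of `kp.Stopwatch`.

```python
import time


class LapStopwatch:
    """Hypothetical stand-in (not kp.Stopwatch): records one lap per `with` block."""

    def __init__(self):
        self.lap_times = []  # duration of every completed lap in seconds
        self._start = None

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.lap_times.append(time.perf_counter() - self._start)
        self._start = None
        return False  # do not swallow exceptions raised inside the block

    @property
    def elapsed_seconds(self):
        # total time over all laps
        return sum(self.lap_times)

    @property
    def average_lap_time(self):
        # mean lap duration, 0.0 if no lap was recorded yet
        if not self.lap_times:
            return 0.0
        return self.elapsed_seconds / len(self.lap_times)


sw = LapStopwatch()
for i in range(1, 3):
    with sw:
        time.sleep(0.01 * i)  # simulate an operation that gets slower

print(f"{len(sw.lap_times)} laps, total {sw.elapsed_seconds:.3f}s, average {sw.average_lap_time:.3f}s")
```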

## Time async operations

Showcase: timing [cuda](https://developer.nvidia.com/cuda-toolkit) operations in [pytorch](https://github.com/pytorch/pytorch).

Asynchronous operations can only be timed properly when the asynchronous call is awaited or a synchronization point is created at the point where the timing should end. Natively in pytorch this would look something like this:

```
import torch

# submit a start event to the event stream
start_event = torch.cuda.Event(enable_timing=True)
start_event.record()
# submit an async operation to the event stream
...
# submit an end event to the event stream
end_event = torch.cuda.Event(enable_timing=True)
end_event.record()
# synchronize so that both events have been processed
torch.cuda.synchronize()
# elapsed_time returns milliseconds
print(start_event.elapsed_time(end_event))
```

which is quite a lot of boilerplate for timing one operation.

With kappaprofiler it looks like this:

```
import kappaprofiler as kp
import torch

def main():
    device = torch.device("cuda")
    x = torch.randn(15000, 15000, device=device)
    with kp.named_profile("matmul_wrong"):
        # matrix multiplication (@) is asynchronous
        _ = x @ x
    # the timing for "matmul_wrong" is only the time it took to
    # submit the x @ x operation to the cuda event stream
    # not the actual time the x @ x operation took
    with kp.named_profile_async("matmul_right"):
        _ = x @ x
    matmul_method(x)

@kp.profile_async
def matmul_method(x):
    _ = x @ x

def start_async():
    start_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    return start_event

def end_async(start_event):
    end_event = torch.cuda.Event(enable_timing=True)
    end_event.record()
    torch.cuda.synchronize()
    # torch.cuda.Event.elapsed_time returns milliseconds but kappaprofiler expects seconds
    return start_event.elapsed_time(end_event) / 1000

if __name__ == "__main__":
    kp.setup_async(start_async, end_async)
    main()
    print(kp.profiler.to_string())
```

```
0.56 matmul_wrong
4.69 matmul_right
4.72 matmul_method
```
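The gap between `matmul_wrong` and `matmul_right` can be reproduced without a GPU. The following sketch uses a thread pool from the standard library as a stand-in for the asynchronous cuda stream (an analogy only, `slow_operation` is a hypothetical placeholder): timing just the submission measures almost nothing, while timing up to a synchronization point (`future.result()`) measures the real duration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_operation():
    """Hypothetical stand-in for an asynchronous operation such as x @ x."""
    time.sleep(0.1)
    return "done"


with ThreadPoolExecutor(max_workers=1) as pool:
    # "wrong": stop the clock right after submitting the work
    start = time.perf_counter()
    future = pool.submit(slow_operation)
    submit_only = time.perf_counter() - start
    future.result()  # wait here so the next measurement starts with an idle worker

    # "right": stop the clock only after a synchronization point
    start = time.perf_counter()
    future = pool.submit(slow_operation)
    future.result()  # blocks until the operation has actually finished
    with_sync = time.perf_counter() - start

print(f"submit only: {submit_only:.4f}s")
print(f"with sync:   {with_sync:.4f}s")
```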

NOTE: Synchronization points slow down overall program execution, so they should only be used for investigating bottlenecks/runtimes.

To remove all synchronization points in your program either:

- remove the `kp.setup_async` call -> `kp.named_profile_async`/`kp.profile_async` will default to a no-op (NOTE: this removes the node completely, so it's also not possible to query it)

- replace the `kp.setup_async` call with `kp.setup_async_as_sync` to make the asynchronous calls behave just like the synchronous calls. This will make the async times wrong (like `matmul_wrong` above) but still creates a node for the operation (e.g. for querying how often it was called).
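The two options above can be pictured as pluggable start/end hooks. The sketch below is a hypothetical standard-library model of that behaviour, not kappaprofiler's internals: `setup_hooks`, `setup_hooks_as_sync`, `timed_async` and `records` are invented names. With no hooks installed the context manager does nothing and creates no record. With the "as sync" hooks it records plain wall-clock time without any synchronization.

```python
import time
from contextlib import contextmanager

# Hypothetical sketch; names and structure are assumptions, not kappaprofiler internals.
_hooks = None   # (start_fn, end_fn), or None while async timing is not set up
records = {}    # name -> list of measured durations in seconds


def setup_hooks(start_fn, end_fn):
    """Install a start hook and an end hook that returns the elapsed seconds."""
    global _hooks
    _hooks = (start_fn, end_fn)


def setup_hooks_as_sync():
    """Measure wall-clock time only, without creating a synchronization point."""
    setup_hooks(time.perf_counter, lambda start: time.perf_counter() - start)


@contextmanager
def timed_async(name):
    if _hooks is None:
        yield  # no-op: nothing is measured and no record is created
        return
    start_fn, end_fn = _hooks
    token = start_fn()
    try:
        yield
    finally:
        records.setdefault(name, []).append(end_fn(token))


with timed_async("op"):
    pass
records_without_setup = dict(records)  # still empty, "op" cannot be queried

setup_hooks_as_sync()
with timed_async("op"):
    pass

print(records_without_setup)
print(len(records["op"]))
```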

### Multi-process pytorch profiling

Only synchronizing cuda operations is not sufficient when multiple processes are used (e.g. for multi-gpu training).

In addition to cuda synchronization, the processes have to be synced up.

```
import torch
import torch.distributed as dist

def end_async(start_event):
    if dist.is_available() and dist.is_initialized():
        # finish the local cuda work, then wait for all other processes
        torch.cuda.synchronize()
        dist.barrier()
    end_event = torch.cuda.Event(enable_timing=True)
    end_event.record()
    torch.cuda.synchronize()
    # torch.cuda.Event.elapsed_time returns milliseconds but kappaprofiler expects seconds
    return start_event.elapsed_time(end_event) / 1000
```
