An open API service indexing awesome lists of open source software.

https://github.com/benediktalkin/kappaprofiler

lightweight simple profiling for python/pytorch
https://github.com/benediktalkin/kappaprofiler

cuda profiler python pytorch

Last synced: 11 months ago
JSON representation

lightweight simple profiling for python/pytorch

Awesome Lists containing this project

README

          

# KappaProfiler
[![publish](https://github.com/BenediktAlkin/KappaProfiler/actions/workflows/publish.yaml/badge.svg)](https://github.com/BenediktAlkin/KappaProfiler/actions/workflows/publish.yaml)

Lightweight profiling utilities for identifying bottlenecks and timing program parts in your python application.

Also supports [async profiling for cuda](https://github.com/BenediktAlkin/KappaProfiler#time-async-operations).

# Setup
- new install: `pip install kappaprofiler`
- uprade to new version: `pip install kappaprofiler --upgrade`

# Usage
## Time your whole application
### With decorators
```
import kappaprofiler as kp
import time

@kp.profile
def main():
time.sleep(0.3) # simulate some operation
some_method()

@kp.profile
def some_method():
time.sleep(0.5) # simulate some operation

if __name__ == "__main__":
main()
print(kp.profiler.to_string())
```
The result will be (time.sleep calls are not 100% accurate)
```
0.82 main
0.51 main.some_method
```
### With contextmanagers
```
import kappaprofiler as kp
import time

def main():
with kp.named_profile("main"):
time.sleep(0.3) # simulate some operation
with kp.named_profile("method"):
some_method()
with kp.named_profile("main2"):
time.sleep(0.2) # simulate some operation

def some_method():
time.sleep(0.5) # simulate some operation

if __name__ == "__main__":
main()
print(kp.profiler.to_string())
```
The result will be (time.sleep calls are not 100% accurate)
```
0.82 main
0.51 main.method
0.20 main2
```

## Query nodes
Each profiling entry is represented by a node from which detailed information can be retrieved
```
query = "main.some_method"
node = kp.profiler.get_node(query)
print(f"{query} was called {node.count} time and took {node.to_string()} seconds in total")
```
`main.some_method was called 1 time and took 0.51 seconds in total`

## Time only a part of your program
```
import kappaprofiler as kp
with kp.Stopwatch() as sw:
# some operation
...
print(f"operation took {sw.elapsed_milliseconds} milliseconds")
print(f"operation took {sw.elapsed_seconds} seconds")
```

#### Time subparts
```
import kappaprofiler as kp
import time

sw1 = kp.Stopwatch()
sw2 = kp.Stopwatch()

for i in range(1, 3):
with sw1:
# operation1
time.sleep(0.1 * i)
with sw2:
# operation2
time.sleep(0.2 * i)

print(f"operation1 took {sw1.elapsed_seconds:.2f} seconds (average {sw1.average_lap_time:.2f})")
print(f"operation2 took {sw2.elapsed_seconds:.2f} seconds (average {sw2.average_lap_time:.2f})")
```
```
operation1 took 0.32 seconds (average 0.16)
operation2 took 0.61 seconds (average 0.30)
```

## Time async operations
Showcase: timing [cuda](https://developer.nvidia.com/cuda-toolkit) operations in
[pytorch](https://github.com/pytorch/pytorch)

Asynchronous operations can only be timed properly when the asynchronous call is awaited or a synchronization point is
created after the timing should end. Natively in pytorch this would look something like this:
```
# submit a start event to the event stream
start_event = torch.cuda.Event(enable_timing=True)
start_event.record()

# submit a async operation to the event stream
...

# submit a end event to the event stream
end_event = torch.cuda.Event(enable_timing=True)
end_event.record()

# synchronize
torch.cuda.synchronize()

print(start_event.elapsed_time(end_event))
```
which is quite a lot of boilerplate for timing one operation.

With kappaprofiler it looks like this:
```
import kappaprofiler as kp
import torch

def main():
device = torch.device("cuda")
x = torch.randn(15000, 15000, device=device)
with kp.named_profile("matmul_wrong"):
# matrix multiplication (@) is asynchronous
_ = x @ x
# the timing for "matmul_wrong" is only the time it took to
# submit the x @ x operation to the cuda event stream
# not the actual time the x @ x operation took

with kp.named_profile_async("matmul_right"):
_ = x @ x
matmul_method(x)

@kp.profile_async
def matmul_method(x):
_ = x @ x

def start_async():
start_event = torch.cuda.Event(enable_timing=True)
start_event.record()
return start_event

def end_async(start_event):
end_event = torch.cuda.Event(enable_timing=True)
end_event.record()
torch.cuda.synchronize()
# torch.cuda.Event.elapsed_time returns milliseconds but kappaprofiler expects seconds
return start_event.elapsed_time(end_event) / 1000

if __name__ == "__main__":
kp.setup_async(start_async, end_async)
main()
print(kp.profiler.to_string())
```
```
0.56 matmul_wrong
4.69 matmul_right
4.72 matmul_method
```

NOTE: Synchronization points slow down overall program execution, so they should only be used for investigating
bottlenecks/runtimes

To remove all synchronization points in your program either:
- remove the `kp.setup_async` call -> `kp.named_profile_async`/`kp.profile_async` will default to a noop (NOTE: this
removes the node completely, so it's also not possible to query it)
- replace the `kp.setup_async` call with `kp.setup_async_as_sync` to make the asynchronous calls behave just like the
synchronous calls. This will make the async times wrong (like `matmul_wrong` above) but still creates a node for the
operation (e.g. for querying how often it was called).

### Multi-process pytorch profiling
Only synchronizing cuda operations is not sufficient when multiple processes are used (e.g. for multi-gpu training).
In addition to cuda synchronization, the processes have to be synced up.
```
import torch.distributed as dist
def end_async(start_event):
if dist.is_available() and dist.is_initialized():
torch.cuda.synchronize()
dist.barrier()
end_event = torch.cuda.Event(enable_timing=True)
end_event.record()
torch.cuda.synchronize()
# torch.cuda.Event.elapsed_time returns milliseconds but kappaprofiler expects seconds
return start_event.elapsed_time(end_event) / 1000
```