https://github.com/intelav/gpu-agent-opt

AI Agent Framework for GPU Kernel Autotuning & Optimization. Automate CUDA kernel exploration, profiling, and tuning with AI-driven agents for deep learning, geospatial AI, and HPC workloads.

ai-agents autotuning cuda deep-l edge-ai geospatial gpu hpc nvidia optimization performance pytorch


# 🧠 **gpu-agent-opt**

**Unified AI Agent Framework for GPU Kernel Profiling, Scientific Computing, and CUDA Exploration**

`gpu-agent-opt` is a Python package designed to orchestrate **agentic workflows** for **Triton, CUDA, CuPy, cuDF**, and advanced GPU programming patterns. It combines **kernel discovery**, **profiling**, and **analysis** in a knowledge-driven loop:

👉 **Sense → Think → Act → Learn → Reflect**

The current focus is to build a **one-stop GPU research & profiling layer** that integrates:
- Deep learning graph compilers (PyTorch Inductor / XLA)
- Scientific computing (CuPy / cuDF)
- Low-level CUDA primitives (e.g., coalesced memory, warp shuffle, tensor cores)

---

## ✨ **Core Capabilities**

### 🧠 Agentic Kernel Profiler
- Discovers active GPU kernels during script execution using **Nsight Systems**.
- Selects top kernels for detailed **Nsight Compute** profiling.
- Generates structured summary reports (JSON) with SM and DRAM efficiency metrics.
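The "select top kernels" step can be pictured as a ranking over discovered kernel records. The following is a minimal sketch, not the package's actual implementation: the function `select_top_kernels` and the record fields `name` and `total_time_ns` are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def select_top_kernels(records, k=3):
    """Return the k kernel names with the largest accumulated GPU time.

    `records` is an iterable of dicts with the hypothetical keys
    'name' and 'total_time_ns' (one dict per discovered kernel row).
    """
    totals = defaultdict(int)
    for rec in records:
        totals[rec["name"]] += int(rec["total_time_ns"])
    # Sort by accumulated time (descending), then by name for a stable order.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:k]]


# Illustrative input; these numbers are made up, not measured.
records = [
    {"name": "gemm_kernel", "total_time_ns": 900},
    {"name": "softmax_kernel", "total_time_ns": 300},
    {"name": "gemm_kernel", "total_time_ns": 600},
    {"name": "copy_kernel", "total_time_ns": 100},
]
top = select_top_kernels(records, k=2)
print(top)
```

Ranking by accumulated time rather than per-launch time favors kernels that are launched often, which is usually where detailed profiling pays off first.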

### 🧪 Multi-Backend Context
- ✅ **Triton kernels** (via PyTorch Inductor or custom)
- ✅ **Raw CUDA kernels** (NVRTC / PyCUDA / C++ extensions)
- ✅ **CuPy & cuDF** scientific kernels
- 🚧 **Planned:** CUDA Graphs, Cooperative Groups, Tensor Cores, async copies, MIG.

### 🔬 Profiler Integration
- Nsight Systems → Kernel discovery
- Nsight Compute → Per-kernel profiling (SM & DRAM metrics)
- Exports both per-kernel CSV and aggregated `summary.json`.
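The two-tool flow above can be illustrated by the command lines such a wrapper might assemble. This is a sketch under stated assumptions: the helper names `build_nsys_command` and `build_ncu_command`, the file names, and the choice of the two throughput metrics are hypothetical, and the flags should be checked against the installed Nsight versions. The code only builds the commands; it does not run them.

```python
import shlex

# Assumed metric names for SM and DRAM utilization in Nsight Compute;
# which metrics gpu-agent-opt actually collects is not stated in this README.
METRICS = [
    "sm__throughput.avg.pct_of_peak_sustained_elapsed",
    "dram__throughput.avg.pct_of_peak_sustained_elapsed",
]


def build_nsys_command(script, report_base):
    """Nsight Systems run that records a timeline for kernel discovery."""
    return [
        "nsys", "profile",
        "--output", report_base,
        "--force-overwrite", "true",
        "python", script,
    ]


def build_ncu_command(script, kernel_name, csv_log):
    """Nsight Compute run restricted to one kernel, with CSV output."""
    return [
        "ncu",
        "--kernel-name", kernel_name,
        "--metrics", ",".join(METRICS),
        "--csv",
        "--log-file", csv_log,
        "python", script,
    ]


nsys_cmd = build_nsys_command("pipeline.py", "discovery")
ncu_cmd = build_ncu_command("pipeline.py", "gemm_kernel", "gemm_kernel.csv")
print(shlex.join(nsys_cmd))
print(shlex.join(ncu_cmd))
```

Keeping the commands as argument lists makes them safe to pass to `subprocess.run` without shell quoting issues.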

### 📚 Knowledge Base / Reflection
- `reflect_history.json` stores efficiency trends across runs.
- Helps identify consistently low-performing kernels over time.
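One way to picture the reflection step is a JSON file that grows by one entry per run and is then scanned for kernels that stay below an efficiency threshold. The sketch below is illustrative only: the real schema of `reflect_history.json` is not documented here, so the entry layout, the helpers `append_run` and `consistently_low`, and the 40% threshold are all assumptions.

```python
import json
import tempfile
from pathlib import Path


def append_run(history_path, run_id, kernel_efficiency):
    """Append one run's {kernel_name: efficiency_pct} mapping to the history file."""
    path = Path(history_path)
    history = json.loads(path.read_text()) if path.exists() else []
    history.append({"run": run_id, "kernels": kernel_efficiency})
    path.write_text(json.dumps(history, indent=2))
    return history


def consistently_low(history, threshold=40.0, min_runs=2):
    """Kernels seen in at least `min_runs` runs and below `threshold` in all of them."""
    seen = {}
    for entry in history:
        for name, efficiency in entry["kernels"].items():
            seen.setdefault(name, []).append(efficiency)
    return sorted(
        name
        for name, values in seen.items()
        if len(values) >= min_runs and all(v < threshold for v in values)
    )


# Demo in a temporary directory with made-up efficiency values.
with tempfile.TemporaryDirectory() as tmp:
    history_file = Path(tmp) / "reflect_history.json"
    append_run(history_file, "run-1", {"transpose": 35.0, "gemm": 82.0})
    history = append_run(history_file, "run-2", {"transpose": 30.0, "gemm": 38.0})
    low = consistently_low(history)
    print(low)
```

Requiring a minimum number of runs keeps a single noisy measurement from flagging a kernel.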

---

## 🛰 **Target Use Cases**
- Geospatial AI auto-annotation pipelines (DINOv2, SAM2, YOLO, NDWI/LBP preprocessing)
- Deep learning inference/training profiling through PyTorch + Nsight
- Scientific/HPC workloads (FFT, FDTD3D, conjugate gradient, Monte Carlo, etc.)
- CUDA educational benchmarking (transpose, reduction, memory hierarchy, etc.)
- Embedded GPU pipelines (Jetson Orin / RB5)

---

## 📊 **Agentic Profiling Snapshot**

The framework executes a **five-stage loop** to profile real GPU workloads:

| Stage | Description |
|---------|----------------------------------|
| Sense | Discover kernels |
| Think | Select top kernels |
| Act | Profile with Nsight Compute |
| Learn | Analyze & classify bottlenecks |
| Reflect | Track efficiency trends |
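The five stages in the table can be read as a simple pipeline. The following sketch wires them together with stub callables in place of the real profilers. Everything here is hypothetical: the names `run_loop` and `classify`, the 60% "busy" threshold, and the three bottleneck labels are illustrative heuristics, not the framework's documented behavior.

```python
def classify(sm_pct, dram_pct, busy=60.0):
    """Rough bottleneck label from SM and DRAM throughput percentages."""
    if dram_pct >= busy and dram_pct >= sm_pct:
        return "memory-bound"
    if sm_pct >= busy:
        return "compute-bound"
    return "underutilized"


def run_loop(sense, act, history, top_k=2):
    """One pass of Sense -> Think -> Act -> Learn -> Reflect."""
    kernels = sense()                                   # Sense: {name: total_time_ns}
    selected = sorted(kernels, key=kernels.get, reverse=True)[:top_k]  # Think
    metrics = {name: act(name) for name in selected}    # Act: (sm_pct, dram_pct)
    labels = {name: classify(*m) for name, m in metrics.items()}       # Learn
    history.append(labels)                              # Reflect
    return labels


# Stubs standing in for Nsight Systems / Nsight Compute; values are made up.
def fake_sense():
    return {"gemm": 1500, "transpose": 800, "copy": 100}


def fake_act(name):
    return {"gemm": (85.0, 30.0), "transpose": (20.0, 75.0)}[name]


history = []
labels = run_loop(fake_sense, fake_act, history)
print(labels)
```

Passing the stages in as callables keeps the loop testable without a GPU, since the profiler-backed implementations can be swapped in later.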

### 📸 Example output from profiling a geospatial annotation pipeline

Below is a snapshot from a real profiling run on DINOv2 + SAM2:

![Profiling Snapshot](assets/snapshot2.png)

The results are stored in:

- `runs/profile_logs/.../summary.json` → per-run aggregated metrics
- `reflect_history.json` → longitudinal trend tracking

These form the basis for future **agentic actions**, such as:
- Replacing inefficient PyTorch kernels with custom CUDA/Triton implementations
- Adjusting launch configurations or fusing operators
- Triggering code-generation agents

---

## 🔥 **CUDA Samples Integration**

The agent provides a Pythonic layer over classic CUDA patterns (via official samples):

- **Memory & Data Movement**
`bandwidthTest`, `transpose`, `globalToShmemAsyncCopy`, `UnifiedMemoryStreams`

- **Computation Kernels**
`reduction`, `scan`, GEMM tensor core examples

- **Advanced Features**
CUDA Graphs, Cooperative Groups, Async API

- **Linear Algebra & Solvers**
cuBLAS, cuSOLVER

- **Signal & Image Processing**
cuFFT, DCT, NPP routines

- **Miscellaneous / Educational**
`deviceQuery`, `inlinePTX`, `cudaOpenMP`, NVRTC runtime compilation

---

## 🧪 **Scientific + DL Interoperability**

- CuPy / cuDF kernels can be profiled alongside Triton / CUDA kernels.
- PyTorch Inductor graphs can be analyzed to identify subgraphs for replacement.
- Goal: Combine **high-level DL graphs** with **low-level profiling data**.

---

## 📦 **Installation**

**TestPyPI**:
👉 [https://test.pypi.org/project/gpu-agent-opt/](https://test.pypi.org/project/gpu-agent-opt/)

```bash
pip install gpu-agent-opt