https://github.com/intelav/gpu-agent-opt
AI Agent Framework for GPU Kernel Autotuning & Optimization. Automate CUDA kernel exploration, profiling, and tuning with AI-driven agents for deep learning, geospatial AI, and HPC workloads.
- Host: GitHub
- URL: https://github.com/intelav/gpu-agent-opt
- Owner: intelav
- License: other
- Created: 2025-10-02T12:02:58.000Z (8 days ago)
- Default Branch: main
- Last Pushed: 2025-10-02T12:08:38.000Z (8 days ago)
- Last Synced: 2025-10-02T14:18:20.137Z (8 days ago)
- Topics: ai-agents, autotuning, cuda, deep-l, edge-ai, geospatial, gpu, hpc, nvidia, optimization, performance, pytorch
- Language: Python
- Homepage: https://aifusion.in
- Size: 9.77 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# **gpu-agent-opt**
**Unified AI Agent Framework for GPU Kernel Profiling, Scientific Computing, and CUDA Exploration**
`gpu-agent-opt` is a Python package designed to orchestrate **agentic workflows** for **Triton, CUDA, CuPy, cuDF**, and advanced GPU programming patterns, combining **kernel discovery**, **profiling**, and **analysis** with a knowledge-driven loop:
**Sense → Think → Act → Learn → Reflect**
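As an illustration only (the class and method names below are hypothetical, not the package's actual API, and the kernel data is stubbed), the five-stage loop could be skeletonized as:

```python
from dataclasses import dataclass, field

@dataclass
class AgentLoop:
    """Hypothetical Sense -> Think -> Act -> Learn -> Reflect loop."""
    history: list = field(default_factory=list)

    def sense(self):
        # Discover candidate kernels (stubbed with static data here).
        return [{"name": "gemm_kernel", "time_ms": 4.2},
                {"name": "copy_kernel", "time_ms": 0.3}]

    def think(self, kernels, top_k=1):
        # Select the most expensive kernels for detailed profiling.
        return sorted(kernels, key=lambda k: k["time_ms"], reverse=True)[:top_k]

    def act(self, selected):
        # Profile each selected kernel (stub: attach fake efficiency metrics).
        return [{**k, "sm_eff": 0.35, "dram_eff": 0.80} for k in selected]

    def learn(self, profiles):
        # Classify bottlenecks: high DRAM relative to SM suggests memory-bound.
        for p in profiles:
            p["bottleneck"] = "memory" if p["dram_eff"] > p["sm_eff"] else "compute"
        return profiles

    def reflect(self, profiles):
        # Record results so trends can be tracked across runs.
        self.history.append(profiles)
        return profiles

    def run(self):
        return self.reflect(self.learn(self.act(self.think(self.sense()))))
```

In a real run, `sense` and `act` would shell out to Nsight Systems and Nsight Compute rather than return canned data.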
The current focus is to build a **one-stop GPU research & profiling layer** that integrates:
- Deep learning graph compilers (PyTorch Inductor / XLA)
- Scientific computing (CuPy / cuDF)
- Low-level CUDA primitives (e.g., coalesced memory, warp shuffle, tensor cores)

---
## **Core Capabilities**
### Agentic Kernel Profiler
- Discovers active GPU kernels during script execution using **Nsight Systems**.
- Selects top kernels for detailed **Nsight Compute** profiling.
- Generates structured summary reports (JSON) with SM and DRAM efficiency metrics.

### Multi-Backend Context
- **Triton kernels** (via PyTorch Inductor or custom)
- **Raw CUDA kernels** (NVRTC / PyCUDA / C++ extensions)
- **CuPy & cuDF** scientific kernels
- **Planned:** CUDA Graphs, Cooperative Groups, Tensor Cores, async copies, MIG.

### Profiler Integration
- Nsight Systems → kernel discovery
- Nsight Compute → per-kernel profiling (SM & DRAM metrics)
- Exports both per-kernel CSV and aggregated `summary.json`.

### Knowledge Base / Reflection
- `reflect_history.json` stores efficiency trends across runs.
- Helps identify consistently low-performing kernels over time.

---
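A sketch of how such a history file could be mined for persistently inefficient kernels. The JSON schema assumed here (a list of runs, each a list of kernel records with `name` and `sm_eff` fields) is an illustration, not the package's documented format:

```python
import json
from collections import defaultdict

def low_performers(history_path, sm_threshold=0.4):
    """Flag kernels whose mean SM efficiency across all runs stays below a threshold."""
    with open(history_path) as f:
        runs = json.load(f)  # assumed: [[{"name": ..., "sm_eff": ...}, ...], ...]
    by_kernel = defaultdict(list)
    for run in runs:
        for rec in run:
            by_kernel[rec["name"]].append(rec["sm_eff"])
    # Return only kernels whose average efficiency is consistently poor.
    return {name: sum(effs) / len(effs)
            for name, effs in by_kernel.items()
            if sum(effs) / len(effs) < sm_threshold}
```

Averaging across runs filters out one-off noise and surfaces kernels that are structurally inefficient, which is what makes them candidates for agentic rewriting.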
## **Target Use Cases**
- Geospatial AI auto-annotation pipelines (DINOv2, SAM2, YOLO, NDWI/LBP preprocessing)
- Deep learning inference/training profiling through PyTorch + Nsight
- Scientific/HPC workloads (FFT, FDTD3D, conjugate gradient, Monte Carlo, etc.)
- CUDA educational benchmarking (transpose, reduction, memory hierarchy, etc.)
- Embedded GPU pipelines (Jetson Orin / RB5)

---
## **Agentic Profiling Snapshot**
The framework executes a **five-stage loop** to profile real GPU workloads:
| Stage | Description |
|---------|----------------------------------|
| Sense | Discover kernels |
| Think | Select top kernels |
| Act | Profile with Nsight Compute |
| Learn | Analyze & classify bottlenecks |
| Reflect | Track efficiency trends |

### Example output from profiling a geospatial annotation pipeline
Below is a snapshot from a real profiling run on DINOv2 + SAM2:

The results are stored in:
- `runs/profile_logs/.../summary.json` → per-run aggregated metrics
- `reflect_history.json` → longitudinal trend tracking

These form the basis for future **agentic actions**, such as:
- Replacing inefficient PyTorch kernels with custom CUDA/Triton implementations
- Adjusting launch configurations or fusing operators
- Triggering code-generation agents

---
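One way such actions could be derived from a per-run `summary.json`. The schema and the rule thresholds below are illustrative assumptions, not the package's actual logic:

```python
def suggest_actions(summary):
    """Map per-kernel efficiency metrics to candidate agentic actions."""
    actions = []
    # Assumed schema: {"kernels": [{"name": ..., "sm_eff": ..., "dram_eff": ...}]}
    for k in summary["kernels"]:
        if k["sm_eff"] < 0.3 and k["dram_eff"] < 0.3:
            # Neither compute nor memory pipes are busy: a structural rewrite candidate.
            actions.append((k["name"], "rewrite as custom CUDA/Triton kernel"))
        elif k["dram_eff"] > k["sm_eff"]:
            actions.append((k["name"], "memory-bound: try operator fusion or better access patterns"))
        else:
            actions.append((k["name"], "compute-bound: tune launch configuration"))
    return actions
```

A code-generation agent could then consume this action list to decide which kernels to regenerate or re-tune.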
## **CUDA Samples Integration**
The agent provides a Pythonic layer over classic CUDA patterns (via official samples):
- **Memory & Data Movement**: `bandwidthTest`, `transpose`, `globalToShmemAsyncCopy`, `UnifiedMemoryStreams`
- **Computation Kernels**: `reduction`, `scan`, GEMM tensor core examples
- **Advanced Features**: CUDA Graphs, Cooperative Groups, Async API
- **Linear Algebra & Solvers**: cuBLAS, cuSolver
- **Signal & Image Processing**: CUFFT, DCT, NPP routines
- **Miscellaneous / Educational**: `deviceQuery`, `inlinePTX`, `cudaOpenMP`, NVRTC runtime compilation

---
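Such a Pythonic layer could start as a simple registry mapping categories to sample binaries. The sample names below follow the list above (they come from the official NVIDIA CUDA samples); the runner and its `build_dir` layout are a sketch, not the package's implementation:

```python
import subprocess

# Registry of classic CUDA samples grouped by category.
CUDA_SAMPLES = {
    "memory": ["bandwidthTest", "transpose", "globalToShmemAsyncCopy", "UnifiedMemoryStreams"],
    "compute": ["reduction", "scan"],
    "misc": ["deviceQuery", "inlinePTX", "cudaOpenMP"],
}

def samples_for(category):
    """Return sample names registered under a category."""
    return CUDA_SAMPLES.get(category, [])

def run_sample(name, build_dir="./bin", dry_run=False):
    """Invoke a pre-built sample binary; with dry_run, return the command instead."""
    cmd = [f"{build_dir}/{name}"]
    if dry_run:
        return cmd
    return subprocess.run(cmd, capture_output=True, text=True)
```

The `dry_run` flag lets the agent plan and log invocations before committing GPU time to them.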
## **Scientific + DL Interoperability**
- CuPy / cuDF kernels can be profiled alongside Triton / CUDA kernels.
- PyTorch Inductor graphs can be analyzed to identify subgraphs for replacement.
- Goal: Combine **high-level DL graphs** with **low-level profiling data**.

---
## **Installation**

**TestPyPI**:
[https://test.pypi.org/project/gpu-agent-opt/](https://test.pypi.org/project/gpu-agent-opt/)

```bash
pip install -i https://test.pypi.org/simple/ gpu-agent-opt
```