Awesome-GPU

Awesome resources for GPUs
https://github.com/Jokeren/Awesome-GPU

Reducing Energy in GPGPUs through Approximate Trivial Bypassing
Locality-Aware CTA Clustering for Modern GPUs
Dynamic Resource Management for Efficient Utilization of Multitasking GPUs
Dynamic GPGPU Power Management Using Adaptive Model Predictive Control
Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems
Accelerate GPU Concurrent Kernel Execution by Mitigating Memory Pipeline Stalls
Controlled Kernel Launch for Dynamic Parallelism in GPUs
COOPERATIVE GROUPS
LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs
Virtual Thread: Maximizing Thread-Level Parallelism beyond GPU Scheduling Limit
Understanding Latency Hiding on GPUs
APRES: Improving Cache Efficiency by Exploiting Load Characteristics on GPUs
Adaptive and Transparent Cache Bypassing for GPUs
Improving Inter-kernel Data Reuse With CTA-Page Coordination in GPGPU
In-Depth Analyses of Unified Virtual Memory System for GPU Accelerated Computing
Umpire: Application-Focused Management and Coordination of Complex Hierarchical Memory
Reducing GPU Offload Latency via Fine-Grained CPU-GPU Synchronization
NVIDIA H100 Tensor Core GPU Architecture
NVIDIA A100 Tensor Core GPU Architecture
NVIDIA TURING GPU ARCHITECTURE
NVIDIA TESLA V100
NVIDIA TESLA P100
NVIDIA’s Next Generation CUDA Compute Architecture: Kepler
NVIDIA’s Next Generation CUDA Compute Architecture: Fermi
INTRODUCING AMD CDNA 2 ARCHITECTURE
INTRODUCING AMD CDNA ARCHITECTURE
DEVELOPING CUDA KERNELS TO PUSH TENSOR CORES TO THE ABSOLUTE LIMIT ON NVIDIA A100
Demystifying Tensor Cores to Optimize Half-Precision Matrix Multiply
A Coordinated Tiling and Batching Framework for Efficient GEMM on GPU
CUTLASS: CUDA TEMPLATE LIBRARY FOR DENSE LINEAR ALGEBRA AT ALL LEVELS AND SCALES
AN5D: Automated Stencil Framework for High-Degree Temporal Blocking on GPUs
On Optimizing Complex Stencils on GPUs
Register Optimizations for Stencils on GPUs
Single-pass Parallel Prefix Scan with Decoupled Look-back
Understanding and bridging the gaps in current GNN performance optimizations
E.T.: re-thinking self-attention for transformer models on GPUs
GNNAdvisor: An Adaptive and Efficient Runtime System for GNN Acceleration on GPUs
Sparse GPU Kernels for Deep Learning
SuperNeurons: Dynamic GPU Memory Management for Training Deep Neural Networks
Towards Pervasive and User Satisfactory CNN across GPU Microarchitectures
Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking
Demystifying GPU Microarchitecture through Microbenchmarking
Instruction Roofline: An insightful visual performance model for GPUs
Performance Tuning of Scientific Codes with the Roofline Model
VOLTA Architecture and performance optimization
Performance Analysis and Tuning for General Purpose Graphics Processing Units (GPGPU)
Fundamental Optimizations
Visualizing Complex Dynamics in Many-Core Accelerator Architectures
Analyzing CUDA Workloads Using a Detailed GPU Simulator
GPU Code Optimization using Abstract Kernel Emulation and Sensitivity Analysis
CUDAAdvisor: LLVM-based runtime profiling for modern GPUs
Exposing Hidden Performance Opportunities in High Performance GPU Applications
Monitoring Heterogeneous Applications with the OpenMP Tools Interface
Identifying Optimization Opportunities Within Kernel Execution in GPU Codes
Effective sampling-driven performance tools for GPU-accelerated supercomputers
Lynx: A dynamic instrumentation system for data-parallel applications on GPGPU architectures
Parallel Performance Measurement of Heterogeneous Parallel Systems with GPUs
**Vampir|Score-P**
**TAU**
**PAPI**
**Allinea MAP**
**Open|SpeedShop**
**HPCToolkit**
**NVIDIA Nsight Systems**
**NVIDIA Nsight Compute**
**SASSI**
**NVBit**
CASE: A Compiler-Assisted SchEduling Framework for Multi-GPU Systems
cCUDA: Effective Co-Scheduling of Concurrent Kernels on GPUs
Generating GPU Compiler Heuristics using Reinforcement Learning
Domain-Specific Multi-Level IR Rewriting for GPU: The Open Earth Compiler for GPU-accelerated Climate Simulation
Implementing implicit OpenMP data sharing on GPUs
gpucc: An Open-Source GPGPU Compiler
Offloading Support for OpenMP in Clang and LLVM
Performance Analysis of OpenMP on a GPU using a CORAL Proxy Application
Integrating GPU Support for OpenMP Offloading Directives into Clang
Coordinating GPU Threads for OpenMP 4.0 in LLVM
C-for-metal: high performance SIMD programming on Intel GPUs
Novel Methodologies for Predictable CPU-To-GPU Command Offloading
Paraprox: Pattern-Based Approximation for Data Parallel Applications
Cooperative Profile Guided Optimizations
Kernel Specialization for Improved Adaptability and Performance on Graphics Processing Units (GPUs)
Decoding CUDA binary
Flexible software profiling of GPU architectures
