Awesome-GPU
Awesome resources for GPUs
https://github.com/Jokeren/Awesome-GPU
- Reducing Energy in GPGPUs through Approximate Trivial Bypassing
- Locality-Aware CTA Clustering for Modern GPUs
- Dynamic Resource Management for Efficient Utilization of Multitasking GPUs
- Dynamic GPGPU Power Management Using Adaptive Model Predictive Control
- Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems
- Accelerate GPU Concurrent Kernel Execution by Mitigating Memory Pipeline Stalls
- Controlled Kernel Launch for Dynamic Parallelism in GPUs
- COOPERATIVE GROUPS
- LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs
- Virtual Thread: Maximizing Thread-Level Parallelism beyond GPU Scheduling Limit
- Understanding Latency Hiding on GPUs
- APRES: Improving Cache Efficiency by Exploiting Load Characteristics on GPUs
- Adaptive and Transparent Cache Bypassing for GPUs
- Improving Inter-kernel Data Reuse With CTA-Page Coordination in GPGPU
- In-Depth Analyses of Unified Virtual Memory System for GPU Accelerated Computing
- Umpire: Application-Focused Management and Coordination of Complex Hierarchical Memory
- Reducing GPU Offload Latency via Fine-Grained CPU-GPU Synchronization
- NVIDIA H100 Tensor Core GPU Architecture
- NVIDIA A100 Tensor Core GPU Architecture
- NVIDIA TURING GPU ARCHITECTURE
- NVIDIA TESLA V100
- NVIDIA TESLA P100
- NVIDIA’s Next Generation CUDA Compute Architecture: Kepler
- NVIDIA’s Next Generation CUDA Compute Architecture: Fermi
- INTRODUCING AMD CDNA 2 ARCHITECTURE
- INTRODUCING AMD CDNA ARCHITECTURE
- DEVELOPING CUDA KERNELS TO PUSH TENSOR CORES TO THE ABSOLUTE LIMIT ON NVIDIA A100
- Demystifying Tensor Cores to Optimize Half-Precision Matrix Multiply
- A Coordinated Tiling and Batching Framework for Efficient GEMM on GPU
- CUTLASS: CUDA TEMPLATE LIBRARY FOR DENSE LINEAR ALGEBRA AT ALL LEVELS AND SCALES
- AN5D: Automated Stencil Framework for High-Degree Temporal Blocking on GPUs
- On Optimizing Complex Stencils on GPUs
- Register Optimizations for Stencils on GPUs
- Single-pass Parallel Prefix Scan with Decoupled Look-back
- Understanding and bridging the gaps in current GNN performance optimizations
- E.T.: re-thinking self-attention for transformer models on GPUs
- GNNAdvisor: An Adaptive and Efficient Runtime System for GNN Acceleration on GPUs
- Sparse GPU Kernels for Deep Learning
- SuperNeurons: Dynamic GPU Memory Management for Training Deep Neural Networks
- Towards Pervasive and User Satisfactory CNN across GPU Microarchitectures
- Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking
- Demystifying GPU Microarchitecture through Microbenchmarking
- Instruction Roofline: An insightful visual performance model for GPUs
- Performance Tuning of Scientific Codes with the Roofline Model
- VOLTA Architecture and performance optimization
- Performance Analysis and Tuning for General Purpose Graphics Processing Units (GPGPU)
- Fundamental Optimizations
- Visualizing Complex Dynamics in Many-Core Accelerator Architectures
- Analyzing CUDA Workloads Using a Detailed GPU Simulator
- GPU Code Optimization using Abstract Kernel Emulation and Sensitivity Analysis
- CUDAAdvisor: LLVM-based runtime profiling for modern GPUs
- Exposing Hidden Performance Opportunities in High Performance GPU Applications
- Monitoring Heterogeneous Applications with the OpenMP Tools Interface
- Identifying Optimization Opportunities Within Kernel Execution in GPU Codes
- Effective sampling-driven performance tools for GPU-accelerated supercomputers
- Lynx: A dynamic instrumentation system for data-parallel applications on GPGPU architectures
- Parallel Performance Measurement of Heterogeneous Parallel Systems with GPUs
- **Vampir|Score-P**
- **TAU**
- **PAPI**
- **Allinea MAP**
- **Open|SpeedShop**
- **HPCToolkit**
- **NVIDIA Nsight Systems**
- **NVIDIA Nsight Compute**
- **SASSI**
- **NVBit**
- CASE: A Compiler-Assisted SchEduling Framework for Multi-GPU Systems
- cCUDA: Effective Co-Scheduling of Concurrent Kernels on GPUs
- Generating GPU Compiler Heuristics using Reinforcement Learning
- Domain-Specific Multi-Level IR Rewriting for GPU: The Open Earth Compiler for GPU-accelerated Climate Simulation
- Implementing implicit OpenMP data sharing on GPUs
- gpucc: An Open-Source GPGPU Compiler
- Offloading Support for OpenMP in Clang and LLVM
- Performance Analysis of OpenMP on a GPU using a CORAL Proxy Application
- Integrating GPU Support for OpenMP Offloading Directives into Clang
- Coordinating GPU Threads for OpenMP 4.0 in LLVM
- C-for-metal: high performance SIMD programming on Intel GPUs
- Novel Methodologies for Predictable CPU-To-GPU Command Offloading
- Paraprox: Pattern-Based Approximation for Data Parallel Applications
- Cooperative Profile Guided Optimizations
- Kernel Specialization for Improved Adaptability and Performance on Graphics Processing Units (GPUs)
- Decoding CUDA binary
- Flexible software profiling of GPU architectures