# Awesome HPC

A collection of interesting HPC resources. It's not primarily papers: lots of good
info lives in blog posts and documentation, but it's impossible to remember all of
the different links, so maintaining a list like this is necessary.

This is a mishmash of CPU / GPU / fun hardware / distributed computing.

## Aggregators

- [GPUMode](https://github.com/gpu-mode), a community of GPU people (mostly ML).
- [HGPU](https://hgpu.org), an arXiv-like aggregator of GPU-related papers.

## Conference Proceedings

- [ASPLOS 2025](https://dl.acm.org/doi/proceedings/10.1145/3669940)
- [ASPLOS 2024](https://dl.acm.org/doi/proceedings/10.1145/3620665)
- [ICPP 2024](https://icpp2024.org/index.php?option=com_content&view=article&id=6&Itemid=114)
- [MLSys 2024](https://proceedings.mlsys.org/paper_files/paper/2024)
- [MLSys 2023](https://mlsys.org/virtual/2023/papers.html?filter=titles)
- [PPoPP 2025](https://ppopp25.sigplan.org/program/program-PPoPP-2025/)
- [SC 2024](https://dl.acm.org/doi/proceedings/10.5555/3703596)

## Blogs

- [Colfax Research](https://research.colfax-intl.com)
- [Lei Mao Blog](https://leimao.github.io)

## Lectures

- [PMPP Recordings](https://www.youtube.com/@pmpp-book)

## Matrix Multiplication and Linear Algebra

References and tools regarding efficient implementations of GEMM and other BLAS-style
kernels on CPUs and GPUs. A minimal tiling sketch follows the list.

- [CUTLASS Efficient GEMM](https://github.com/NVIDIA/cutlass/blob/main/media/docs/efficient_gemm.md)
- [BLIS](https://github.com/flame/blis)
- [Reverse Engineering cuBLAS](https://fabianschuetze.github.io/category/articles.html)
- [DeepSeek DeepGEMM](https://github.com/deepseek-ai/DeepGEMM)
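
Below is a minimal NumPy sketch of the loop-tiling idea these references build on:
iterate over small tiles so each panel of `A` and `B` is reused while it sits in fast
memory. The name `blocked_gemm` and the fixed `block` size are illustrative, not any
library's API; real implementations like BLIS and CUTLASS add packing, vectorized
microkernels, and several levels of blocking tuned to cache and register-file sizes.

```python
import numpy as np

def blocked_gemm(A, B, block=64):
    """Cache-blocked matrix multiply C = A @ B.

    Loops over (block x block) tiles so each tile of A and B is
    reused from fast memory; this is the core idea behind BLIS- and
    CUTLASS-style GEMM, minus packing and hand-tuned microkernels.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, block):          # tile rows of C
        for j in range(0, N, block):      # tile columns of C
            for k in range(0, K, block):  # accumulate over the K dimension
                C[i:i+block, j:j+block] += (
                    A[i:i+block, k:k+block] @ B[k:k+block, j:j+block]
                )
    return C

A = np.random.rand(256, 300).astype(np.float32)
B = np.random.rand(300, 192).astype(np.float32)
assert np.allclose(blocked_gemm(A, B), A @ B, atol=1e-3)
```

The same three-loop tiling carries over to GPUs, where the tiles map to shared memory
and register blocks; the CUTLASS efficient-GEMM doc linked above walks through that
hierarchy.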

## ML Performance

### Frameworks

- [PyTorch 2](https://pytorch.org/assets/pytorch2-2.pdf)
- [ASPLOS Pytorch 2](https://github.com/pytorch/workshops/blob/master/ASPLOS_2024/README.md)
- [CUTLASS](https://github.com/NVIDIA/cutlass/tree/main)
- [ThunderKittens](https://github.com/HazyResearch/ThunderKittens/tree/e5cb89f29e1abb9498ebf8bc878015f9699ee846)
- [triton](https://github.com/triton-lang/triton)

### Efficient Implementations

#### Flash Attention

The premise of FlashAttention is fairly simple. This section collects the main
papers and some references for the "backbone" ideas, e.g., IO complexity and
online softmax normalization. A small sketch of the online softmax trick follows
the list.

- [Data movement is all you need](https://arxiv.org/abs/2007.00072)
- [The Hardware Lottery](https://arxiv.org/abs/2009.06489)
- [IO Complexity of sorting and related problems](https://dl.acm.org/doi/10.1145/48529.48535)
- [Online Softmax Normalizer](https://arxiv.org/abs/1805.02867)
- [Self-attention does not need $O(n^2)$ memory](https://arxiv.org/abs/2112.05682)
- [FlashAttention 1](https://arxiv.org/abs/2205.14135)
- [FlashAttention 2](https://arxiv.org/abs/2307.08691)
- [FlashAttention 3, for H100; uses asynchrony and low precision](https://arxiv.org/abs/2407.08608)
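
As a concrete illustration of the online softmax trick from the list above, here is
a minimal NumPy sketch that computes the softmax normalizer in one streaming pass
over chunks, keeping a running max `m` and running sum `d` and rescaling `d` whenever
the max grows. The `chunk` parameter and function name are illustrative; FlashAttention
fuses this same rescaling into its tiled accumulation of the attention output so the
full row of scores is never materialized.

```python
import numpy as np

def online_softmax(x, chunk=4):
    """Softmax of x with the normalizer computed in one streaming pass
    (Milakov & Gimelshein, 2018), the trick FlashAttention tiles on."""
    m = -np.inf  # running max, for numerical stability
    d = 0.0      # running sum of exp(x_i - m)
    for i in range(0, len(x), chunk):
        xc = x[i:i + chunk]
        m_new = max(m, xc.max())
        # Rescale the old normalizer to the new max, then fold in the chunk.
        d = d * np.exp(m - m_new) + np.exp(xc - m_new).sum()
        m = m_new
    return np.exp(x - m) / d  # d was accumulated without a separate max pass

x = np.random.randn(10)
ref = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
assert np.allclose(online_softmax(x), ref)
```

The rescaling step `d * exp(m - m_new)` is what makes tiling safe: each chunk only
needs the running statistics `(m, d)`, never the whole row of scores, which is also
the sense in which self-attention does not need $O(n^2)$ memory.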

## NVIDIA

### Performance

- [Better Performance at Lower Occupancy (GTC 2010)](https://www.nvidia.com/content/gtc-2010/pdfs/2238_gtc2010.pdf), a classic GTC talk on getting higher performance with lower occupancy: compute-bound applications (like GEMM) do not need high occupancy to hit peak performance.
- [NSight Compute GPUMode](https://www.youtube.com/watch?v=F_BazucyCMw)

### Architecture

### CUDA

### PTX

- [PTX Docs](https://docs.nvidia.com/cuda/parallel-thread-execution/)