awesome-tensor-compilers
A list of awesome compiler projects and papers for tensor computation and deep learning.
https://github.com/merrymercy/awesome-tensor-compilers
Open Source Projects
- XLA: Optimizing Compiler for Machine Learning
- TVM: An End to End Machine Learning Compiler Framework (see the compile-and-run sketch after this list)
- Halide: A Language for Fast, Portable Computation on Images and Tensors
- Speedster: Automatically apply SOTA optimization techniques to achieve the maximum inference speed-up on your hardware
- MLIR: Multi-Level Intermediate Representation
- Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code
- TACO: The Tensor Algebra Compiler
- NN-512: A Compiler That Generates C99 Code for Neural Net Inference
- Glow: Compiler for Neural Network Hardware Accelerators
- PlaidML: A Platform for Making Deep Learning Work Everywhere
- BladeDISC: An End-to-End DynamIc Shape Compiler for Machine Learning Workloads
- AITemplate: A Python framework which renders neural networks into high-performance CUDA/HIP C++ code
- nnfusion: A Flexible and Efficient Deep Neural Network Compiler
- Hummingbird: Compiling Trained ML Models into Tensor Computation
- Hidet: A Compilation-based Deep Learning Framework
- TensorComprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions
- Nebulgym: Easy-to-use Library to Accelerate AI Training
- DaCeML: A Data-Centric Compiler for Machine Learning
- Mirage: A Multi-level Superoptimizer for Tensor Algebra
- Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations
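To make the scope of these projects concrete, here is a minimal sketch of the compile-and-run flow in TVM's tensor-expression (`te`) API, referenced above. It assumes `tvm` is installed; the API follows the classic TVM tutorials and may differ across versions.

```python
# Declare, schedule, compile, and run a vector add with TVM's te API.
import numpy as np
import tvm
from tvm import te

n = 1024
A = te.placeholder((n,), name="A")
B = te.placeholder((n,), name="B")
C = te.compute((n,), lambda i: A[i] + B[i], name="C")

s = te.create_schedule(C.op)                 # default schedule: one loop
fadd = tvm.build(s, [A, B, C], target="llvm")

dev = tvm.cpu(0)
a = tvm.nd.array(np.random.rand(n).astype("float32"), dev)
b = tvm.nd.array(np.random.rand(n).astype("float32"), dev)
c = tvm.nd.array(np.zeros(n, dtype="float32"), dev)
fadd(a, b, c)
np.testing.assert_allclose(c.numpy(), a.numpy() + b.numpy(), rtol=1e-5)
```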
Papers
Survey
Distributed Computing
- Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning (both axes are illustrated in the sketch after this list)
- SpDISTAL: Compiling Distributed Sparse Tensor Computations
- Unity: Accelerating DNN Training Through Joint Optimization of Algebraic Transformations and Parallelization
- Synthesizing Optimal Parallelism Placement and Reduction Strategies on Hierarchical Systems for Deep Learning
- DISTAL: The Distributed Tensor Algebra Compiler
- GSPMD: General and Scalable Parallelization for ML Computation Graphs
- Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads
- OneFlow: Redesign the Distributed Deep Learning Framework from Scratch
- Beyond Data and Model Parallelism for Deep Neural Networks
- Supporting Very Large Models using Automatic Dataflow Graph Partitioning
- Distributed Halide
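A toy illustration of the two parallelism axes that Alpa and GSPMD automate, with "devices" simulated by plain function calls. Purely a sketch; a real system partitions the compute graph and inserts collectives (all-gather, all-reduce).

```python
import numpy as np

x = np.random.rand(8, 64).astype(np.float32)
W1 = np.random.rand(64, 128).astype(np.float32)
W2 = np.random.rand(128, 32).astype(np.float32)

# Intra-operator parallelism: shard one matmul column-wise across 2 "devices",
# then concatenate the partial results (an all-gather in a real system).
parts = [x @ w for w in np.split(W1, 2, axis=1)]   # each part on one device
h = np.concatenate(parts, axis=1)
np.testing.assert_allclose(h, x @ W1, rtol=1e-5)

# Inter-operator parallelism: place whole operators on different devices and
# pipeline micro-batches through them.
def stage1(xb): return np.maximum(xb @ W1, 0)      # "device 0"
def stage2(hb): return hb @ W2                     # "device 1"
out = np.concatenate([stage2(stage1(mb)) for mb in np.split(x, 2, axis=0)])
assert out.shape == (8, 32)
```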
Compiler and IR Design
- Hidet: Task-Mapping Programming Paradigm for Deep Learning Tensor Programs
- TensorIR: An Abstraction for Automatic Tensorized Program Optimization
- DaCeML: A Data-Centric Compiler for Machine Learning
- Roller: Fast and Efficient Tensor Compilation for Deep Learning
- PET: Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections
- MLIR: Scaling Compiler Infrastructure for Domain Specific Computation
- A Tensor Compiler for Unified Machine Learning Prediction Serving
- Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks
- Stateful Dataflow Multigraphs: A Data-Centric Model for Performance Portability on Heterogeneous Architectures - Ben-Nun et al., SC 2019
- Tiramisu: A polyhedral compiler for expressing fast and portable code
- Relay: A High-Level Compiler for Deep Learning
- TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
- Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions
- Intel nGraph: An Intermediate Representation, Compiler, and Executor for Deep Learning
- Glow: Graph Lowering Compiler Techniques for Neural Networks
- DLVM: A modern compiler infrastructure for deep learning systems
- BladeDISC: Optimizing Dynamic Shape Machine Learning Workloads via Compiler Approach
- AStitch: Enabling a New Multi-dimensional Optimization Space for Memory-Intensive ML Training and Inference on Modern SIMT Architectures
- TASO: The Tensor Algebra SuperOptimizer for Deep Learning
- (De/Re)-Composition of Data-Parallel Computations via Multi-Dimensional Homomorphisms
- Exocompilation for Productive Programming of Hardware Accelerators
- FreeTensor: A Free-Form DSL with Holistic Optimizations for Irregular Tensor Programs
- Triton: an intermediate language and compiler for tiled neural network computations
- Diesel: DSL for linear algebra and neural net computations on GPUs
- Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines - Ragan-Kelley et al., PLDI 2013
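A common thread through many of these designs is the separation of algorithm from schedule, pioneered by Halide and adopted by TVM, Tiramisu, and others. A minimal sketch with TVM's `te` API (assuming `tvm` is installed; the API varies by version): the same compute definition lowered under two different schedules.

```python
import tvm
from tvm import te

M, N = 512, 512
A = te.placeholder((M, N), name="A")
B = te.compute((M, N), lambda i, j: A[i, j] * 2.0, name="B")

# Schedule 1: default row-major loops.
s1 = te.create_schedule(B.op)

# Schedule 2: tile the loop nest and vectorize the innermost axis --
# same semantics, different generated code.
s2 = te.create_schedule(B.op)
io, jo, ii, ji = s2[B].tile(B.op.axis[0], B.op.axis[1], x_factor=32, y_factor=32)
s2[B].vectorize(ji)

print(tvm.lower(s1, [A, B], simple_mode=True))
print(tvm.lower(s2, [A, B], simple_mode=True))
```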
Auto-tuning and Auto-scheduling
- Tensor Program Optimization with Probabilistic Programs
- Bolt: Bridging the Gap between Auto-tuners and Hardware-native Performance
- Value Learning for Throughput Optimization of Deep Neural Networks
- A Flexible Approach to Autotuning Multi-Pass Machine Learning Compilers
- Ansor: Generating High-Performance Tensor Programs for Deep Learning (see the auto-scheduling sketch after this list)
- ProTuner: Tuning Programs with Monte Carlo Tree Search - Haj Ali et al., arXiv 2020
- AdaTune: Adaptive tensor program compilation made efficient
- Optimizing the Memory Hierarchy by Compositing Automatic Transformations on Computations and Data
- Chameleon: Adaptive Code Optimization for Expedited Deep Neural Network Compilation
- A Sparse Iteration Space Transformation Framework for Sparse Tensor Algebra
- Learning to Optimize Halide with Tree Search and Random Programs
- Learning to Optimize Tensor Programs
- Automatically Scheduling Halide Image Processing Pipelines
- Enabling Tensor Language Model to Assist in Generating High-Performance Tensor Programs for Deep Learning
- Autoscheduling for sparse tensor algebra with an asymptotic cost model
- FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System
- The Droplet Search Algorithm for Kernel Scheduling
- One-shot tuner for deep learning compilers
- Accelerated Auto-Tuning of GPU Kernels for Tensor Computations
- A Full-Stack Search Technique for Domain Optimized Deep Learning Accelerators
- Efficient Automatic Scheduling of Imaging and Vision Pipelines for the GPU
- Lorien: Efficient Deep Learning Workloads Delivery
- Schedule Synthesis for Halide Pipelines on GPUs
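A minimal auto-scheduling sketch in the style of Ansor, using TVM's `auto_scheduler` module. Names follow the public TVM tutorials and may differ across versions; the search measures candidate schedules on real hardware, guided by a learned cost model.

```python
import tvm
from tvm import te, auto_scheduler

@auto_scheduler.register_workload
def matmul(M, N, K):
    A = te.placeholder((M, K), name="A")
    B = te.placeholder((K, N), name="B")
    k = te.reduce_axis((0, K), name="k")
    C = te.compute((M, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
    return [A, B, C]

target = tvm.target.Target("llvm")
task = auto_scheduler.SearchTask(func=matmul, args=(512, 512, 512), target=target)

# Explore the schedule space and log the best candidates found.
log_file = "matmul.json"
task.tune(auto_scheduler.TuningOptions(
    num_measure_trials=64,
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
))
sch, args = task.apply_best(log_file)
func = tvm.build(sch, args, target)
```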
Cost Model
- TLP: A Deep Learning-based Cost Model for Tensor Program Tuning
- An Asymptotic Cost Model for Autoscheduling Sparse Tensor Programs
- TenSet: A Large-scale Program Performance Dataset for Learned Tensor Compilers
- A Deep Learning Based Cost Model for Automatic Code Optimization
- A Learned Performance Model for the Tensor Processing Unit
- DYNATUNE: Dynamic Tensor Program Optimization in Deep Neural Network Compilation
- MetaTune: Meta-Learning Based Cost Model for Fast and Efficient Auto-tuning Frameworks
- Expedited Tensor Program Compilation Based on LightGBM
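A hedged sketch of the idea shared by these papers: fit a regressor from schedule features to measured latency, then rank candidate schedules by prediction so only the most promising are measured on hardware. The features and data below are synthetic placeholders, and LightGBM stands in for whichever model a given paper uses.

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
# Hypothetical per-schedule features: tile sizes, unroll factor, vector
# width, estimated memory traffic, ...
X = rng.uniform(size=(2000, 6)).astype(np.float32)
latency = X[:, 0] * 2.0 + X[:, 3] ** 2 + 0.05 * rng.standard_normal(2000)

model = lgb.LGBMRegressor(n_estimators=200, learning_rate=0.05)
model.fit(X[:1600], latency[:1600])

# At tuning time, rank candidates by predicted latency and only measure
# the top few on hardware.
candidates = rng.uniform(size=(64, 6)).astype(np.float32)
best = candidates[np.argsort(model.predict(candidates))[:8]]
```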
CPU and GPU Optimization
- DeepCuts: A deep learning optimization framework for versatile GPU workloads
- UNIT: Unifying Tensorized Instruction Compilation
- PolyDL: Polyhedral Optimizations for Creation of High Performance DL Primitives
- Automatic Kernel Generation for Volta Tensor Cores
- Optimizing CNN Model Inference on CPUs
- Swizzle Inventor: Data Movement Synthesis for GPU Kernels
- Analytical characterization and design space exploration for optimization of CNNs
- Fireiron: A Data-Movement-Aware Scheduling Language for GPUs
- Analytical cache modeling and tilesize optimization for tensor contractions
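A core transformation behind several of the CPU papers above is cache tiling. The sketch below computes a matmul block by block so each working set fits in cache; numpy stands in for the generated inner kernel. Illustrative only.

```python
import numpy as np

def tiled_matmul(A, B, T=64):
    # Compute C = A @ B one T x T block at a time for cache locality.
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, T):
        for j in range(0, N, T):
            for k in range(0, K, T):
                C[i:i+T, j:j+T] += A[i:i+T, k:k+T] @ B[k:k+T, j:j+T]
    return C

A = np.random.rand(256, 256).astype(np.float32)
B = np.random.rand(256, 256).astype(np.float32)
np.testing.assert_allclose(tiled_matmul(A, B), A @ B, rtol=1e-4)
```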
NPU Optimization
- AMOS: Enabling Automatic Mapping for Tensor Computations On Spatial Accelerators with Hardware Abstraction
- Towards the Co-design of Neural Networks and Accelerators
- AKG: Automatic Kernel Generation for Neural Processing Units using Polyhedral Transformations
- Heron: Automatically Constrained High-Performance Library Generation for Deep Learning Accelerators
Graph-level Optimization
- POET: Training Neural Networks on Tiny Devices with Integrated Rematerialization and Paging
- Collage: Seamless Integration of Deep Learning Backends with Automatic Placement
- Apollo: Automatic Partition-based Operator Fusion through Layer by Layer Optimization
- IOS: An Inter-Operator Scheduler for CNN Acceleration
- Transferable Graph Optimizers for ML Compilers
- FusionStitching: Boosting Memory Intensive Computations for Deep Learning Workloads
- Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning
- Optimizing DNN Computation Graph using Graph Substitutions
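A toy sketch of graph substitution in the spirit of TASO and the fusion papers above: scan an operator sequence and fuse matmul-add-relu chains into a single node, cutting intermediate memory traffic. Entirely illustrative; real compilers match DAG patterns, not flat lists.

```python
def fuse(ops):
    # Replace each (matmul, add, relu) run with one fused operator.
    fused, i = [], 0
    while i < len(ops):
        if ops[i:i+3] == ["matmul", "add", "relu"]:
            fused.append("fused_matmul_bias_relu")
            i += 3
        else:
            fused.append(ops[i])
            i += 1
    return fused

graph = ["matmul", "add", "relu", "matmul", "add", "relu", "softmax"]
print(fuse(graph))  # ['fused_matmul_bias_relu', 'fused_matmul_bias_relu', 'softmax']
```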
Program Rewriting
Dynamic Model
- Axon: A Language for Dynamic Shapes in Deep Learning Graphs
- DietCode: Automatic Optimization for Dynamic Tensor Programs
- The CoRa Tensor Compiler: Compilation for Ragged Tensors with Minimal Padding
- Nimble: Efficiently Compiling Dynamic Neural Networks for Model Inference
- DISC: A Dynamic Shape Compiler for Machine Learning Workloads
- Cortex: A Compiler for Recursive Deep Learning Models
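One common strategy behind these dynamic-shape systems is shape bucketing: pre-compile kernels for a few shape buckets, pad inputs up to the nearest bucket, and dispatch at runtime. A purely illustrative sketch, with `np.tanh` standing in for the compiled kernels:

```python
import numpy as np

BUCKETS = [32, 64, 128, 256]
kernels = {b: np.tanh for b in BUCKETS}  # stand-in "compiled" kernel per bucket

def run_dynamic(x):
    n = x.shape[0]
    b = next(b for b in BUCKETS if b >= n)          # smallest bucket that fits
    padded = np.zeros((b,) + x.shape[1:], x.dtype)  # pad up to the bucket size
    padded[:n] = x
    return kernels[b](padded)[:n]                   # slice the padding back off

out = run_dynamic(np.random.rand(100).astype(np.float32))
assert out.shape == (100,)
```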
Graph Neural Networks
Quantization
Sparse
- The Sparse Abstract Machine
- SparseTIR: Composable Abstractions for Sparse Compilation in Deep Learning
- Looplets: A Language For Structured Coiteration
- Code Synthesis for Sparse Tensor Format Conversion and Optimization
- Stardust: Compiling Sparse Tensor Algebra to a Reconfigurable Dataflow Architecture
- SparTA: Deep-Learning Model Sparsity via Tensor-with-Sparsity-Attribute
- Compiler Support for Sparse Tensor Computations in MLIR
- A High Performance Sparse Tensor Algebra Compiler in MLIR - HPC 2021
- Dynamic Sparse Tensor Algebra Compilation
- TIRAMISU: A Polyhedral Compiler for Dense and Sparse Deep Learning
- The Sparse Polyhedral Framework: Composing Compiler-Generated Inspector-Executor Code
- ParSy: Inspection and Transformation of Sparse Matrix Computations for Parallelism
- A Framework for Sparse Matrix Code Synthesis from High-level Specifications
- Automatic Nonzero Structure Analysis
- SIPR: A New Framework for Generating Efficient Code for Sparse Matrix Computations
- Automatic Data Structure Selection and Transformation for Sparse Matrix Computations
- The Tensor Algebra Compiler
- SparseLNR: Accelerating Sparse Tensor Computations Using Loop Nest Restructuring
- Taichi: A Language for High-Performance Computation on Spatially Sparse Data Structures
- Next-generation Generic Programming and its Application to Sparse Matrix Computations
- WACO: Learning Workload-Aware Co-optimization of the Format and Schedule of a Sparse Tensor Program
- Unified Compilation for Lossless Compression and Sparse Computing
- Compilation of Sparse Array Programming Models
- Automatic Generation of Efficient Sparse Tensor Format Conversion Routines
- Tensor Algebra Compilation with Workspaces
- Sparse Computation Data Dependence Simplification for Efficient Compiler-Generated Inspectors
- Format Abstraction for Sparse Tensor Algebra Compilers
- Sympiler: Transforming Sparse Matrix Codes by Decoupling Symbolic Analysis
- Compilation Techniques for Sparse Matrix Computations
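To ground what these compilers generate, here is hand-written code of the kind a compiler such as TACO (The Tensor Algebra Compiler, above) emits for y = A @ x with A stored in CSR: iterate the compressed row structure instead of looping over zeros.

```python
import numpy as np

def spmv_csr(indptr, indices, data, x):
    y = np.zeros(len(indptr) - 1, dtype=data.dtype)
    for i in range(len(y)):                      # for each row
        for p in range(indptr[i], indptr[i+1]):  # only the stored nonzeros
            y[i] += data[p] * x[indices[p]]
    return y

# 3x3 example: [[1, 0, 2], [0, 0, 3], [4, 5, 0]]
indptr  = np.array([0, 2, 3, 5])
indices = np.array([0, 2, 2, 0, 1])
data    = np.array([1., 2., 3., 4., 5.])
print(spmv_csr(indptr, indices, data, np.array([1., 1., 1.])))  # [3. 3. 9.]
```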
Verification and Testing
- End-to-End Translation Validation for the Halide Language
- NNSmith: Generating Diverse and Valid Test Cases for Deep Learning Compilers
- Coverage-guided tensor compiler fuzzing with joint IR-pass mutation
- A comprehensive study of deep learning compiler bugs
- Verifying and Improving Halide’s Term Rewriting System with Program Synthesis
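A minimal sketch of differential testing, the core loop behind the fuzzing papers above: run randomly generated inputs through a compiled path and a reference path and flag numeric divergence. `compiled_fn` is a hypothetical stand-in for the compiler under test.

```python
import numpy as np

def reference_fn(x, w):
    return np.maximum(x @ w, 0.0)

def compiled_fn(x, w):
    # Stand-in: a real harness would invoke the compiler under test here.
    return np.maximum(x @ w, 0.0)

rng = np.random.default_rng(42)
for trial in range(100):
    x = rng.standard_normal((8, 16)).astype(np.float32)
    w = rng.standard_normal((16, 4)).astype(np.float32)
    ref, got = reference_fn(x, w), compiled_fn(x, w)
    if not np.allclose(ref, got, rtol=1e-4, atol=1e-5):
        raise AssertionError(f"miscompilation on trial {trial}")
```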
Tutorials