{"id":26650692,"url":"https://github.com/coderonion/awesome-cuda-and-hpc","last_synced_at":"2025-03-25T02:02:29.410Z","repository":{"id":129484054,"uuid":"605602834","full_name":"coderonion/awesome-cuda-and-hpc","owner":"coderonion","description":"🚀🚀🚀 This repository lists some awesome public CUDA, cuda-python, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR, PTX and High Performance Computing (HPC) projects.","archived":false,"fork":false,"pushed_at":"2025-03-22T04:02:05.000Z","size":53,"stargazers_count":223,"open_issues_count":0,"forks_count":27,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-03-24T11:03:12.028Z","etag":null,"topics":["awesome","blas","cublas","cuda","cudnn","cutlass","deepseek","gemm","gpu","hpc","llama","llm","mlir","openblas","ptx","pytorch","tensorrt","tensorrt-llm","triton","tvm"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/coderonion.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-02-23T14:07:23.000Z","updated_at":"2025-03-23T16:41:36.000Z","dependencies_parsed_at":"2023-11-14T15:30:17.926Z","dependency_job_id":"c120ec1f-ec21-4008-afcc-c4d73c8e91a6","html_url":"https://github.com/coderonion/awesome-cuda-and-hpc","commit_stats":{"total_commits":11,"total_committers":1,"mean_commits":11.0,"dds":0.0,"last_synced_commit":"0f6566a45a4e4b5012096c6b3a29884a48093c26"},"previous_names":["codingonion/awesome-fpga-list","codingonion/awesome-hpc-cuda-fpga","codingonion/awesome-hpc-and-cuda","codingonion/awesome-cuda-tensor
rt-fpga","coderonion/awesome-cuda-and-hpc","coderonion/awesome-cuda-triton-hpc","coderonion/awesome-cuda-triton-mlir-hpc"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/coderonion%2Fawesome-cuda-and-hpc","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/coderonion%2Fawesome-cuda-and-hpc/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/coderonion%2Fawesome-cuda-and-hpc/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/coderonion%2Fawesome-cuda-and-hpc/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/coderonion","download_url":"https://codeload.github.com/coderonion/awesome-cuda-and-hpc/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245383037,"owners_count":20606265,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["awesome","blas","cublas","cuda","cudnn","cutlass","deepseek","gemm","gpu","hpc","llama","llm","mlir","openblas","ptx","pytorch","tensorrt","tensorrt-llm","triton","tvm"],"created_at":"2025-03-25T02:02:28.364Z","updated_at":"2025-03-25T02:02:29.390Z","avatar_url":"https://github.com/coderonion.png","language":null,"funding_links":[],"categories":["Other Lists","⭐ Acknowledgements"],"sub_categories":["TeX Lists","Learning Tools"],"readme":"# 
Awesome-CUDA-and-HPC\r\n[![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)\r\n\r\n🚀🚀🚀 This repository lists some awesome public [CUDA](https://developer.nvidia.com/cuda-zone), [cuda-python](https://github.com/NVIDIA/cuda-python), [cuBLAS](https://developer.nvidia.com/cublas), [cuDNN](https://developer.nvidia.com/cudnn), [CUTLASS](https://github.com/NVIDIA/cutlass), [TensorRT](https://developer.nvidia.com/tensorrt), [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), [Triton](https://github.com/triton-lang/triton), [TVM](https://tvm.apache.org/), [MLIR](https://mlir.llvm.org/), [PTX](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html) and High Performance Computing (HPC) projects.\r\n\r\n## Contents\r\n- [Awesome-CUDA-and-HPC](#awesome-cuda-and-hpc)\r\n  - [Official Version](#official-version)\r\n  - [Awesome List](#awesome-list)\r\n  - [Learning Resources](#learning-resources)\r\n    - [CUDA Learning](#cuda-learning)\r\n    - [TensorRT Learning](#tensorrt-learning)\r\n    - [Triton Learning](#triton-learning)\r\n    - [TVM Learning](#tvm-learning)\r\n    - [MLIR Learning](#mlir-learning)\r\n    - [HPC Learning](#hpc-learning)\r\n  - [Frameworks](#frameworks)\r\n    - [CUDA Frameworks](#cuda-frameworks)\r\n        - [GPU Interface](#gpu-interface)\r\n            - [CPP Version](#cpp-version)\r\n            - [Python version](#python-version)\r\n            - [Rust Version](#rust-version)\r\n            - [Julia Version](#julia-version)\r\n        - [Performance Benchmark](#performance-benchmark)\r\n        - [Scientific Computing Framework](#scientific-computing-framework)\r\n        - [Attention and Transformer Framework](#attention-and-transformer-framework)\r\n        - [Machine Learning Framework](#machine-learning-framework)\r\n        - [AI Inference Framework](#ai-inference-framework)\r\n            - [LLM Inference and Serving 
Engine](#llm-inference-and-serving-engine)\r\n            - [High Performance Kernel Library](#high-performance-kernel-library)\r\n            - [C Implementation](#c-implementation)\r\n            - [CPP Implementation](#cpp-implementation)\r\n            - [Mojo Implementation](#mojo-implementation)\r\n            - [Rust Implementation](#rust-implementation)\r\n            - [zig Implementation](#zig-implementation)\r\n            - [Go Implementation](#go-implementation)\r\n        - [Distributed and Multi-GPU Framework](#distributed-and-multi-gpu-framework)\r\n        - [Robotics Framework](#robotics-framework)\r\n        - [ZKP and Web3 Framework](#zkp-and-web3-framework)\r\n    - [Triton Frameworks](#triton-frameworks)\r\n        - [Triton Machine Learning Framework](#triton-machine-learning-framework)\r\n        - [Triton High Performance Kernel Library](#triton-high-performance-kernel-library)\r\n    - [MLIR Frameworks](#mlir-frameworks)\r\n        - [MLIR GPU Programming](#mlir-gpu-programming)\r\n        - [MLIR FFI Bindings](#mlir-ffi-bindings)\r\n        - [MLIR Machine learning Framework](#mlir-machine-learning-framework)\r\n    - [HPC Frameworks](#hpc-frameworks)\r\n  - [Applications](#applications)\r\n    - [CUDA Applications](#cuda-applications)\r\n        - [Image Preprocess](#image-preprocess)\r\n        - [Object Detection](#object-detection)\r\n  - [Blogs](#blogs)\r\n    - [CUDA and TensorRT Blogs](#cuda-and-tensorrt-blogs)\r\n    - [Triton Blogs](#triton-blogs)\r\n    - [TVM Blogs](#tvm-blogs)\r\n    - [MLIR Blogs](#mlir-blogs)\r\n    - [HPC Blogs](#hpc-blogs)\r\n  - [Videos](#videos)\r\n  - [Interview](#interview)\r\n\r\n## Official Version\r\n\r\n  - [CUDA](https://developer.nvidia.com/cuda-zone) : CUDA is a parallel computing platform and programming model developed by NVIDIA for general computing on graphical processing units (GPUs).\r\n\r\n  - [NVIDIA/cuda-python](https://github.com/NVIDIA/cuda-python) \u003cimg 
src=\"https://img.shields.io/github/stars/NVIDIA/cuda-python?style=social\"/\u003e : CUDA Python: Performance meets Productivity. [nvidia.github.io/cuda-python/](https://nvidia.github.io/cuda-python/)\r\n\r\n  - [cuBLAS](https://developer.nvidia.com/cublas) : Basic Linear Algebra on NVIDIA GPUs. NVIDIA cuBLAS is a GPU-accelerated library for accelerating AI and HPC applications. It includes several API extensions for providing drop-in industry standard BLAS APIs and GEMM APIs with support for fusions that are highly optimized for NVIDIA GPUs. The cuBLAS library also contains extensions for batched operations, execution across multiple GPUs, and mixed- and low-precision execution with additional tuning for the best performance.\r\n\r\n  - [cuDNN](https://developer.nvidia.com/cudnn) : The NVIDIA CUDA Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks. cuDNN provides highly tuned implementations for standard routines such as forward and backward convolution, attention, matmul, pooling, and normalization.\r\n\r\n  - [CUTLASS](https://github.com/NVIDIA/cutlass) \u003cimg src=\"https://img.shields.io/github/stars/NVIDIA/cutlass?style=social\"/\u003e : CUDA Templates for Linear Algebra Subroutines. CUTLASS is a collection of CUDA C++ template abstractions for implementing high-performance matrix-matrix multiplication (GEMM) and related computations at all levels and scales within CUDA. It incorporates strategies for hierarchical decomposition and data movement similar to those used to implement [cuBLAS](https://developer.nvidia.com/cublas) and [cuDNN](https://developer.nvidia.com/cudnn).\r\n\r\n  - [TensorRT](https://github.com/NVIDIA/TensorRT) \u003cimg src=\"https://img.shields.io/github/stars/NVIDIA/TensorRT?style=social\"/\u003e : NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT. 
[developer.nvidia.com/tensorrt](https://developer.nvidia.com/tensorrt)\r\n\r\n  - [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) \u003cimg src=\"https://img.shields.io/github/stars/NVIDIA/TensorRT-LLM?style=social\"/\u003e : TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines. [nvidia.github.io/TensorRT-LLM](https://nvidia.github.io/TensorRT-LLM)\r\n\r\n  - [Triton](https://github.com/triton-lang/triton) \u003cimg src=\"https://img.shields.io/github/stars/triton-lang/triton?style=social\"/\u003e : Triton is a language and compiler for parallel programming. It aims to provide a Python-based programming environment for productively writing custom DNN compute kernels capable of running at maximal throughput on modern GPU hardware. [triton-lang.org/](https://triton-lang.org/)\r\n\r\n  - [TVM](https://github.com/apache/tvm) \u003cimg src=\"https://img.shields.io/github/stars/apache/tvm?style=social\"/\u003e : Open deep learning compiler stack for cpu, gpu and specialized accelerators. [tvm.apache.org/](https://tvm.apache.org/)\r\n\r\n  - [TileLang](https://github.com/tile-ai/tilelang) \u003cimg src=\"https://img.shields.io/github/stars/tile-ai/tilelang?style=social\"/\u003e : Domain-specific language designed to streamline the development of high-performance GPU/CPU kernels.\r\n\r\n  - [MLIR](https://mlir.llvm.org/) : Multi-Level Intermediate Representation Compiler Framework. The MLIR project is a novel approach to building reusable and extensible compiler infrastructure. 
MLIR aims to address software fragmentation, improve compilation for heterogeneous hardware, significantly reduce the cost of building domain specific compilers, and aid in connecting existing compilers together.\r\n\r\n  - [PTX](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html) : PTX, a low-level parallel thread execution virtual machine and instruction set architecture (ISA).\r\n\r\n\r\n\r\n## Awesome List\r\n\r\n  - [awesome-cuda-and-hpc](https://github.com/coderonion/awesome-cuda-and-hpc) \u003cimg src=\"https://img.shields.io/github/stars/coderonion/awesome-cuda-and-hpc?style=social\"/\u003e : some awesome public [CUDA](https://developer.nvidia.com/cuda-zone), [cuda-python](https://github.com/NVIDIA/cuda-python), [cuBLAS](https://developer.nvidia.com/cublas), [cuDNN](https://developer.nvidia.com/cudnn), [CUTLASS](https://github.com/NVIDIA/cutlass), [TensorRT](https://developer.nvidia.com/tensorrt), [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), [Triton](https://github.com/triton-lang/triton), [TVM](https://tvm.apache.org/), [MLIR](https://mlir.llvm.org/), [PTX](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html) and High Performance Computing (HPC) projects.\r\n\r\n  - [Erkaman/Awesome-CUDA](https://github.com/Erkaman/Awesome-CUDA) \u003cimg src=\"https://img.shields.io/github/stars/Erkaman/Awesome-CUDA?style=social\"/\u003e : This is a list of useful libraries and resources for CUDA development.\r\n\r\n  - [jslee02/awesome-gpgpu](https://github.com/jslee02/awesome-gpgpu) \u003cimg src=\"https://img.shields.io/github/stars/jslee02/awesome-gpgpu?style=social\"/\u003e : 😎 A curated list of awesome GPGPU (CUDA/OpenCL/Vulkan) resources.\r\n\r\n  - [mikeroyal/CUDA-Guide](https://github.com/mikeroyal/CUDA-Guide) \u003cimg src=\"https://img.shields.io/github/stars/mikeroyal/CUDA-Guide?style=social\"/\u003e : A guide covering CUDA including the applications and tools that will make you a better and more efficient CUDA 
developer.\r\n\r\n  - [rkinas/triton-resources](https://github.com/rkinas/triton-resources) \u003cimg src=\"https://img.shields.io/github/stars/rkinas/triton-resources?style=social\"/\u003e : A curated list of resources for learning and exploring Triton, OpenAI's programming language for writing efficient GPU code.\r\n\r\n\r\n\r\n\r\n## Learning Resources\r\n\r\n\r\n  - [chenzomi12/AISystem](https://github.com/chenzomi12/AISystem) \u003cimg src=\"https://img.shields.io/github/stars/chenzomi12/AISystem?style=social\"/\u003e : AISystem covers AI systems: the full stack of low-level AI technologies, including AI chips, AI compilers, and AI inference and training frameworks.\r\n\r\n  - [chenzomi12/AIFoundation](https://github.com/chenzomi12/AIFoundation) \u003cimg src=\"https://img.shields.io/github/stars/chenzomi12/AIFoundation?style=social\"/\u003e : AIFoundation covers what happens when AI systems meet large models: the full-stack core technologies for supporting large-model training and inference at the system level, from the bottom layer to the top.\r\n\r\n\r\n\r\n\r\n  - ### CUDA Learning\r\n\r\n    - [NVIDIA CUDA Toolkit Documentation](https://docs.nvidia.com/cuda/) : CUDA Toolkit Documentation.\r\n\r\n    - [NVIDIA CUDA C++ Programming Guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html) : CUDA C++ Programming Guide.\r\n\r\n    - [NVIDIA CUDA C++ Best Practices Guide](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html) : CUDA C++ Best Practices Guide.\r\n\r\n    - [NVIDIA PTX (Parallel Thread Execution) Programming Guide](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html) : NVIDIA PTX (Parallel Thread Execution) Programming Guide.\r\n\r\n    - [NVIDIA/cuda-samples](https://github.com/NVIDIA/cuda-samples) \u003cimg src=\"https://img.shields.io/github/stars/NVIDIA/cuda-samples?style=social\"/\u003e : Samples for CUDA developers that demonstrate features in the CUDA Toolkit.\r\n\r\n    - [NVIDIA/CUDALibrarySamples](https://github.com/NVIDIA/CUDALibrarySamples) \u003cimg src=\"https://img.shields.io/github/stars/NVIDIA/CUDALibrarySamples?style=social\"/\u003e : CUDA Library Samples.\r\n\r\n    - [NVIDIA/cuda-python](https://github.com/NVIDIA/cuda-python) 
\u003cimg src=\"https://img.shields.io/github/stars/NVIDIA/cuda-python?style=social\"/\u003e : CUDA Python: Performance meets Productivity. [nvidia.github.io/cuda-python/](https://nvidia.github.io/cuda-python/)\r\n\r\n    - [CuPy](https://github.com/cupy/cupy) \u003cimg src=\"https://img.shields.io/github/stars/cupy/cupy?style=social\"/\u003e : CuPy : NumPy \u0026 SciPy for GPU. [cupy.dev](https://cupy.dev/). [CuPy User Guide](https://docs.cupy.dev/en/stable/user_guide/)\r\n\r\n    - [NVIDIA-developer-blog/code-samples](https://github.com/NVIDIA-developer-blog/code-samples) \u003cimg src=\"https://img.shields.io/github/stars/NVIDIA-developer-blog/code-samples?style=social\"/\u003e : Source code examples from the [Parallel Forall Blog](http://developer.nvidia.com/parallel-forall).\r\n\r\n    - [HeKun-NVIDIA/CUDA-Programming-Guide-in-Chinese](https://github.com/HeKun-NVIDIA/CUDA-Programming-Guide-in-Chinese) \u003cimg src=\"https://img.shields.io/github/stars/HeKun-NVIDIA/CUDA-Programming-Guide-in-Chinese?style=social\"/\u003e : This is a Chinese translation of the CUDA programming guide. 
\r\n\r\n    - [brucefan1983/CUDA-Programming](https://github.com/brucefan1983/CUDA-Programming) \u003cimg src=\"https://img.shields.io/github/stars/brucefan1983/CUDA-Programming?style=social\"/\u003e : Sample codes for my CUDA programming book.\r\n\r\n    - [YouQixiaowu/CUDA-Programming-with-Python](https://github.com/YouQixiaowu/CUDA-Programming-with-Python) \u003cimg src=\"https://img.shields.io/github/stars/YouQixiaowu/CUDA-Programming-with-Python?style=social\"/\u003e : Python versions of the sample code from the book CUDA Programming, written with the pycuda module.\r\n\r\n    - [QINZHAOYU/CudaSteps](https://github.com/QINZHAOYU/CudaSteps) \u003cimg src=\"https://img.shields.io/github/stars/QINZHAOYU/CudaSteps?style=social\"/\u003e : A CUDA learning path based on the book *CUDA Programming: Basics and Practice* (《cuda编程-基础与实践》, by Zheyong Fan).\r\n\r\n    - [MAhaitao999/CUDA_Programming](https://github.com/MAhaitao999/CUDA_Programming) \u003cimg src=\"https://img.shields.io/github/stars/MAhaitao999/CUDA_Programming?style=social\"/\u003e : Code for the book *CUDA Programming: Basics and Practice* (《CUDA编程基础与实践》).\r\n\r\n    - [DefTruth/CUDA-Learn-Notes](https://github.com/DefTruth/CUDA-Learn-Notes) \u003cimg src=\"https://img.shields.io/github/stars/DefTruth/CUDA-Learn-Notes?style=social\"/\u003e : 📚200+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).\r\n\r\n    - [BBuf/how-to-optim-algorithm-in-cuda](https://github.com/BBuf/how-to-optim-algorithm-in-cuda) \u003cimg src=\"https://img.shields.io/github/stars/BBuf/how-to-optim-algorithm-in-cuda?style=social\"/\u003e : How to optimize some algorithms in CUDA.\r\n\r\n    - [RussWong/CUDATutorial](https://github.com/RussWong/CUDATutorial) \u003cimg src=\"https://img.shields.io/github/stars/RussWong/CUDATutorial?style=social\"/\u003e : A CUDA tutorial for learning CUDA programming from scratch.\r\n\r\n    - [PaddleJitLab/CUDATutorial](https://github.com/PaddleJitLab/CUDATutorial) \u003cimg src=\"https://img.shields.io/github/stars/PaddleJitLab/CUDATutorial?style=social\"/\u003e : A self-learning 
tutorial for CUDA High Performance Programming. Learn CUDA high-performance programming from scratch.\r\n\r\n    - [bertmaher/simplegemm](https://github.com/bertmaher/simplegemm) \u003cimg src=\"https://img.shields.io/github/stars/bertmaher/simplegemm?style=social\"/\u003e : Pingpong GEMM from scratch. I've been really excited to learn the lowest-level details of GPU matrix multiplication recently, so I was really inspired to read Pranjal Shankhdhar's fantastic blog post [Outperforming cuBLAS on H100](https://cudaforfun.substack.com/p/outperforming-cublas-on-h100-a-worklog), which implements a fast gemm from first principles in CUDA, and actually outperforms cuBLAS. In a similar vein, I wanted to understand the [pingpong](https://github.com/NVIDIA/cutlass/blob/main/media/docs/efficient_gemm.md#hopper-warp-specialization) gemm algorithm in detail. So, I used [https://github.com/pranjalssh/fast.cu](https://github.com/pranjalssh/fast.cu) as a starting point, and wrote this kernel to see if I could match CUTLASS's pingpong implementation myself, using hand-written CUDA.\r\n\r\n    - [pranjalssh/fast.cu](https://github.com/pranjalssh/fast.cu) \u003cimg src=\"https://img.shields.io/github/stars/pranjalssh/fast.cu?style=social\"/\u003e : Fastest GPU kernels, written from scratch. Matrix multiplication of square bf16 matrices, accumulated in fp32. Explanation in [https://cudaforfun.substack.com/p/outperforming-cublas-on-h100-a-worklog](https://cudaforfun.substack.com/p/outperforming-cublas-on-h100-a-worklog)\r\n\r\n    - [gpu-mode/lectures](https://github.com/gpu-mode/lectures) \u003cimg src=\"https://img.shields.io/github/stars/gpu-mode/lectures?style=social\"/\u003e : Material for gpu-mode lectures. [www.youtube.com/@GPUMODE](https://www.youtube.com/@GPUMODE)\r\n\r\n    - [gpu-mode/resource-stream](https://github.com/gpu-mode/resource-stream) \u003cimg src=\"https://img.shields.io/github/stars/cuda-mode/resource-stream?style=social\"/\u003e : GPU programming related news and material links. 
[discord.gg/gpumode](https://discord.gg/gpumode)\r\n\r\n    - [ifromeast/cuda_learning](https://github.com/ifromeast/cuda_learning) \u003cimg src=\"https://img.shields.io/github/stars/ifromeast/cuda_learning?style=social\"/\u003e : Learning how CUDA works.\r\n\r\n    - [a-hamdi/cuda](https://github.com/a-hamdi/cuda) \u003cimg src=\"https://img.shields.io/github/stars/a-hamdi/cuda?style=social\"/\u003e : 100 days of building CUDA kernels! This document serves as a log of the progress and knowledge I gained while working on CUDA programming and studying the PMPP (Programming Massively Parallel Processors) book. Mentor: [https://github.com/hkproj/](https://github.com/hkproj/). Bro in the 100 days challenge: [https://github.com/1y33/100Days](https://github.com/1y33/100Days).\r\n\r\n    - [SwekeR-463/100kernels](https://github.com/SwekeR-463/100kernels) \u003cimg src=\"https://img.shields.io/github/stars/SwekeR-463/100kernels?style=social\"/\u003e : 100 days of learning \u0026 making kernels in cuda / triton.\r\n\r\n    - [Tongkaio/CUDA_Kernel_Samples](https://github.com/Tongkaio/CUDA_Kernel_Samples) \u003cimg src=\"https://img.shields.io/github/stars/Tongkaio/CUDA_Kernel_Samples?style=social\"/\u003e : A guide to hand-writing CUDA operators and preparing for interviews.\r\n\r\n    - [leimao/CUDA-GEMM-Optimization](https://github.com/leimao/CUDA-GEMM-Optimization) \u003cimg src=\"https://img.shields.io/github/stars/leimao/CUDA-GEMM-Optimization?style=social\"/\u003e : [CUDA Matrix Multiplication Optimization](https://leimao.github.io/article/CUDA-Matrix-Multiplication-Optimization/). 
This repository contains the CUDA kernels for general matrix-matrix multiplication (GEMM) and the corresponding performance analysis.\r\n\r\n    - [interestingLSY/CUDA-From-Correctness-To-Performance-Code](https://github.com/interestingLSY/CUDA-From-Correctness-To-Performance-Code) \u003cimg src=\"https://img.shields.io/github/stars/interestingLSY/CUDA-From-Correctness-To-Performance-Code?style=social\"/\u003e : Codes \u0026 examples for \"CUDA - From Correctness to Performance\". The lecture can be found at [https://wiki.lcpu.dev/zh/hpc/from-scratch/cuda](https://wiki.lcpu.dev/zh/hpc/from-scratch/cuda).\r\n\r\n    - [Liu-xiandong/How_to_optimize_in_GPU](https://github.com/Liu-xiandong/How_to_optimize_in_GPU) \u003cimg src=\"https://img.shields.io/github/stars/Liu-xiandong/How_to_optimize_in_GPU?style=social\"/\u003e : This is a series of GPU optimization topics. Here we will introduce how to optimize the CUDA kernel in detail. I will introduce several basic kernel optimizations, including: elementwise, reduce, sgemv, sgemm, etc. The performance of these kernels is basically at or near the theoretical limit.\r\n\r\n    - [tpoisonooo/how-to-optimize-gemm](https://github.com/tpoisonooo/how-to-optimize-gemm) \u003cimg src=\"https://img.shields.io/github/stars/tpoisonooo/how-to-optimize-gemm?style=social\"/\u003e : row-major matmul optimization. 
[zhuanlan.zhihu.com/p/65436463](https://zhuanlan.zhihu.com/p/65436463).\r\n\r\n    - [Bruce-Lee-LY/matrix_multiply](https://github.com/Bruce-Lee-LY/matrix_multiply) \u003cimg src=\"https://img.shields.io/github/stars/Bruce-Lee-LY/matrix_multiply?style=social\"/\u003e : Several common methods of matrix multiplication are implemented on CPU and Nvidia GPU using C++11 and CUDA.\r\n\r\n    - [Bruce-Lee-LY/cuda_hgemm](https://github.com/Bruce-Lee-LY/cuda_hgemm) \u003cimg src=\"https://img.shields.io/github/stars/Bruce-Lee-LY/cuda_hgemm?style=social\"/\u003e : Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruction.\r\n\r\n    - [Bruce-Lee-LY/cuda_hgemv](https://github.com/Bruce-Lee-LY/cuda_hgemv) \u003cimg src=\"https://img.shields.io/github/stars/Bruce-Lee-LY/cuda_hgemv?style=social\"/\u003e : Several optimization methods of half-precision general matrix vector multiplication (HGEMV) using CUDA core.\r\n\r\n    - [enp1s0/ozIMMU](https://github.com/enp1s0/ozIMMU) \u003cimg src=\"https://img.shields.io/github/stars/enp1s0/ozIMMU?style=social\"/\u003e : FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme. 
[arxiv.org/abs/2306.11975](https://arxiv.org/abs/2306.11975)\r\n\r\n    - [Cjkkkk/CUDA_gemm](https://github.com/Cjkkkk/CUDA_gemm) \u003cimg src=\"https://img.shields.io/github/stars/Cjkkkk/CUDA_gemm?style=social\"/\u003e : A simple high performance CUDA GEMM implementation.\r\n\r\n    - [AyakaGEMM/Hands-on-GEMM](https://github.com/AyakaGEMM/Hands-on-GEMM) \u003cimg src=\"https://img.shields.io/github/stars/AyakaGEMM/Hands-on-GEMM?style=social\"/\u003e : A GEMM tutorial.\r\n\r\n    - [zpzim/MSplitGEMM](https://github.com/zpzim/MSplitGEMM) \u003cimg src=\"https://img.shields.io/github/stars/zpzim/MSplitGEMM?style=social\"/\u003e : Large matrix multiplication in CUDA.\r\n\r\n    - [jundaf2/CUDA-INT8-GEMM](https://github.com/jundaf2/CUDA-INT8-GEMM) \u003cimg src=\"https://img.shields.io/github/stars/jundaf2/CUDA-INT8-GEMM?style=social\"/\u003e : CUDA 8-bit Tensor Core Matrix Multiplication based on m16n16k16 WMMA API.\r\n\r\n    - [chanzhennan/cuda_gemm_benchmark](https://github.com/chanzhennan/cuda_gemm_benchmark) \u003cimg src=\"https://img.shields.io/github/stars/chanzhennan/cuda_gemm_benchmark?style=social\"/\u003e : Based on gtest/benchmark; refers to [https://github.com/Liu-xiandong/How_to_optimize_in_GPU](https://github.com/Liu-xiandong/How_to_optimize_in_GPU).\r\n\r\n    - [YuxueYang1204/CudaDemo](https://github.com/YuxueYang1204/CudaDemo) \u003cimg src=\"https://img.shields.io/github/stars/YuxueYang1204/CudaDemo?style=social\"/\u003e : Implement custom operators in PyTorch with cuda/c++.\r\n\r\n    - [CoffeeBeforeArch/cuda_programming](https://github.com/CoffeeBeforeArch/cuda_programming) \u003cimg src=\"https://img.shields.io/github/stars/CoffeeBeforeArch/cuda_programming?style=social\"/\u003e : Code from the \"CUDA Crash Course\" YouTube series by CoffeeBeforeArch.\r\n\r\n    - [rbaygildin/learn-gpgpu](https://github.com/rbaygildin/learn-gpgpu) \u003cimg src=\"https://img.shields.io/github/stars/rbaygildin/learn-gpgpu?style=social\"/\u003e : Algorithms 
implemented in CUDA + resources about GPGPU.\r\n\r\n    - [godweiyang/NN-CUDA-Example](https://github.com/godweiyang/NN-CUDA-Example) \u003cimg src=\"https://img.shields.io/github/stars/godweiyang/NN-CUDA-Example?style=social\"/\u003e : Several simple examples for popular neural network toolkits calling custom CUDA operators.\r\n\r\n    - [yhwang-hub/Matrix_Multiplication_Performance_Optimization](https://github.com/yhwang-hub/Matrix_Multiplication_Performance_Optimization) \u003cimg src=\"https://img.shields.io/github/stars/yhwang-hub/Matrix_Multiplication_Performance_Optimization?style=social\"/\u003e : Matrix Multiplication Performance Optimization.\r\n\r\n    - [caiwanxianhust/ClusteringByCUDA](https://github.com/caiwanxianhust/ClusteringByCUDA) \u003cimg src=\"https://img.shields.io/github/stars/caiwanxianhust/ClusteringByCUDA?style=social\"/\u003e : A collection of clustering algorithms implemented in CUDA C++.\r\n\r\n    - [ulrichstern/cuda-convnet](https://github.com/ulrichstern/cuda-convnet) \u003cimg src=\"https://img.shields.io/github/stars/ulrichstern/cuda-convnet?style=social\"/\u003e : Alex Krizhevsky's original code from Google Code. 
\"From the WeChat official account 人工智能大讲堂 (AI Lecture Hall): [Found AlexNet's original source code, written from scratch in CUDA/C++ with no framework](https://mp.weixin.qq.com/s/plxXG8y5QlxSionyjyPXqw)\".\r\n\r\n    - [PacktPublishing/Learn-CUDA-Programming](https://github.com/PacktPublishing/Learn-CUDA-Programming) \u003cimg src=\"https://img.shields.io/github/stars/PacktPublishing/Learn-CUDA-Programming?style=social\"/\u003e : Learn CUDA Programming, published by Packt.\r\n\r\n    - [PacktPublishing/Hands-On-GPU-Programming-with-Python-and-CUDA](https://github.com/PacktPublishing/Hands-On-GPU-Programming-with-Python-and-CUDA) \u003cimg src=\"https://img.shields.io/github/stars/PacktPublishing/Hands-On-GPU-Programming-with-Python-and-CUDA?style=social\"/\u003e : Hands-On GPU Programming with Python and CUDA, published by Packt.\r\n\r\n    - [PacktPublishing/Hands-On-GPU-Accelerated-Computer-Vision-with-OpenCV-and-CUDA](https://github.com/PacktPublishing/Hands-On-GPU-Accelerated-Computer-Vision-with-OpenCV-and-CUDA) \u003cimg src=\"https://img.shields.io/github/stars/PacktPublishing/Hands-On-GPU-Accelerated-Computer-Vision-with-OpenCV-and-CUDA?style=social\"/\u003e : Hands-On GPU Accelerated Computer Vision with OpenCV and CUDA, published by Packt.\r\n\r\n    - [BobMcDear/neural-network-cuda](https://github.com/BobMcDear/neural-network-cuda) \u003cimg src=\"https://img.shields.io/github/stars/BobMcDear/neural-network-cuda?style=social\"/\u003e : Neural network from scratch in CUDA/C++.\r\n\r\n    - [zjhellofss/KuiperLLama](https://github.com/zjhellofss/KuiperLLama) \u003cimg src=\"https://img.shields.io/github/stars/zjhellofss/KuiperLLama?style=social\"/\u003e : Build Your Own Large-Model Inference Framework (《动手自制大模型推理框架》). KuiperLLama is a hands-on, build-it-yourself large-model inference framework supporting LLama2/3 and Qwen2.5. A good project for campus recruitment (autumn and spring hiring) and internships: it walks you through implementing, from scratch, a large-model inference framework that supports LLama2/3 and Qwen2.5.\r\n\r\n    - [zjhellofss/KuiperInfer](https://github.com/zjhellofss/KuiperInfer) \u003cimg src=\"https://img.shields.io/github/stars/zjhellofss/KuiperInfer?style=social\"/\u003e : A good project for campus recruitment (autumn and spring hiring) and internships! It walks you through building a high-performance deep learning inference library from scratch, with inference support for models such as llama2, Unet, Yolov5, and Resnet. Implement a high-performance deep 
learning inference library step by step.\r\n\r\n    - [zjhellofss/kuiperdatawhale](https://github.com/zjhellofss/kuiperdatawhale) \u003cimg src=\"https://img.shields.io/github/stars/zjhellofss/kuiperdatawhale?style=social\"/\u003e : Build a deep learning inference framework from scratch.\r\n\r\n    - [MarioSieg/magnetron](https://github.com/MarioSieg/magnetron) \u003cimg src=\"https://img.shields.io/github/stars/MarioSieg/magnetron?style=social\"/\u003e : (WIP) A small but powerful, homemade PyTorch from scratch. Minimalistic homemade PyTorch alternative, written in C99 and Python.\r\n\r\n    - [lucasdelimanogueira/PyNorch](https://github.com/lucasdelimanogueira/PyNorch) \u003cimg src=\"https://img.shields.io/github/stars/lucasdelimanogueira/PyNorch?style=social\"/\u003e : Recreating PyTorch from scratch (C/C++, CUDA, NCCL and Python, with multi-GPU support and automatic differentiation!)\r\n\r\n    - [xgqdut2016/cuda_code](https://github.com/xgqdut2016/cuda_code) \u003cimg src=\"https://img.shields.io/github/stars/xgqdut2016/cuda_code?style=social\"/\u003e : Easy CUDA code: a simple introduction to CUDA programming.\r\n\r\n    - [xgqdut2016/hpc_project](https://github.com/xgqdut2016/hpc_project) \u003cimg src=\"https://img.shields.io/github/stars/xgqdut2016/hpc_project?style=social\"/\u003e : Some HPC projects for learning.\r\n\r\n    - [xgqdut2016/hpc2torch](https://github.com/xgqdut2016/hpc2torch) \u003cimg src=\"https://img.shields.io/github/stars/xgqdut2016/hpc2torch?style=social\"/\u003e : This repository aims to build a test framework for a high-performance low-level library. It will implement high-performance kernels for ONNX operators as a complement to PyTorch, and compare the performance and accuracy of the hand-written kernels against PyTorch library functions from the Python side.\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n  - ### TensorRT Learning\r\n\r\n    - [NVIDIA TensorRT Docs](https://docs.nvidia.com/deeplearning/tensorrt/) : NVIDIA Deep Learning TensorRT Documentation.\r\n\r\n    - [TensorRT](https://github.com/NVIDIA/TensorRT) \u003cimg src=\"https://img.shields.io/github/stars/NVIDIA/TensorRT?style=social\"/\u003e : NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. 
This repository contains the open source components of TensorRT. [developer.nvidia.com/tensorrt](https://developer.nvidia.com/tensorrt)\r\n\r\n    - [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) \u003cimg src=\"https://img.shields.io/github/stars/NVIDIA/TensorRT-LLM?style=social\"/\u003e : TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines. [nvidia.github.io/TensorRT-LLM](https://nvidia.github.io/TensorRT-LLM)\r\n\r\n    - [HeKun-NVIDIA/TensorRT-Developer_Guide_in_Chinese](https://github.com/HeKun-NVIDIA/TensorRT-Developer_Guide_in_Chinese) \u003cimg src=\"https://img.shields.io/github/stars/HeKun-NVIDIA/TensorRT-Developer_Guide_in_Chinese?style=social\"/\u003e : This project is a Chinese version of the NVIDIA TensorRT developer guide, translated by an individual who has added their own interpretation.\r\n\r\n    - [kalfazed/tensorrt_starter](https://github.com/kalfazed/tensorrt_starter) \u003cimg src=\"https://img.shields.io/github/stars/kalfazed/tensorrt_starter?style=social\"/\u003e : This repository gives a guideline for learning CUDA and TensorRT from the beginning.\r\n\r\n    - [LitLeo/TensorRT_Tutorial](https://github.com/LitLeo/TensorRT_Tutorial) \u003cimg src=\"https://img.shields.io/github/stars/LitLeo/TensorRT_Tutorial?style=social\"/\u003e : TensorRT_Tutorial.\r\n\r\n\r\n\r\n\r\n\r\n  - ### Triton Learning\r\n\r\n    - [Triton](https://github.com/triton-lang/triton) \u003cimg src=\"https://img.shields.io/github/stars/triton-lang/triton?style=social\"/\u003e : Development repository for the Triton language and compiler. 
[triton-lang.org/](https://triton-lang.org/)\r\n\r\n    - [Triton Docs](https://triton-lang.org/main/index.html) : Triton Documentation.\r\n\r\n    - [hyperai/triton-cn](https://github.com/hyperai/triton-cn) \u003cimg src=\"https://img.shields.io/github/stars/hyperai/triton-cn?style=social\"/\u003e : Triton Documentation in Simplified Chinese. [triton.hyper.ai](https://triton.hyper.ai/)\r\n\r\n\r\n  - ### TVM Learning\r\n\r\n    - [Apache TVM Chinese Site](https://tvm.hyper.ai/) : Apache TVM documentation in Chinese.\r\n\r\n\r\n\r\n  - ### MLIR Learning\r\n\r\n    - [LLVM Docs](https://llvm.org/docs/) : LLVM Documentation.\r\n\r\n    - [MLIR Docs](https://mlir.llvm.org/docs/) : MLIR Code Documentation.\r\n\r\n    - [BBuf/tvm_mlir_learn](https://github.com/BBuf/tvm_mlir_learn) \u003cimg src=\"https://img.shields.io/github/stars/BBuf/tvm_mlir_learn?style=social\"/\u003e : A collection of compiler learning resources.\r\n\r\n    - [j2kun/mlir-tutorial](https://github.com/j2kun/mlir-tutorial) \u003cimg src=\"https://img.shields.io/github/stars/j2kun/mlir-tutorial?style=social\"/\u003e : This is the code repository for a series of articles on the [MLIR framework](https://mlir.llvm.org/) for building compilers.\r\n\r\n    - [KEKE046/mlir-tutorial](https://github.com/KEKE046/mlir-tutorial) \u003cimg src=\"https://img.shields.io/github/stars/KEKE046/mlir-tutorial?style=social\"/\u003e : Hands-On Practical MLIR Tutorial.\r\n\r\n    - [AyakaGEMM/Hands-on-MLIR](https://github.com/AyakaGEMM/Hands-on-MLIR) \u003cimg src=\"https://img.shields.io/github/stars/AyakaGEMM/Hands-on-MLIR?style=social\"/\u003e : Hands-on-MLIR.\r\n\r\n    - [yao-jiashu/KernelCodeGen](https://github.com/yao-jiashu/KernelCodeGen) \u003cimg src=\"https://img.shields.io/github/stars/yao-jiashu/KernelCodeGen?style=social\"/\u003e : GEMM/Conv2d CUDA/HIP kernel code generation using MLIR.\r\n\r\n\r\n\r\n  - ### HPC Learning\r\n\r\n    - [LAFF-On-PfHP](https://www.cs.utexas.edu/~flame/laff/pfhp/LAFF-On-PfHP.html) : LAFF-On Programming 
for High Performance.\r\n\r\n    - [flame/how-to-optimize-gemm](https://github.com/flame/how-to-optimize-gemm) \u003cimg src=\"https://img.shields.io/github/stars/flame/how-to-optimize-gemm?style=social\"/\u003e : How To Optimize Gemm wiki pages. [https://github.com/flame/how-to-optimize-gemm/wiki](https://github.com/flame/how-to-optimize-gemm/wiki)\r\n\r\n    - [flame/blislab](https://github.com/flame/blislab) \u003cimg src=\"https://img.shields.io/github/stars/flame/blislab?style=social\"/\u003e : BLISlab: A Sandbox for Optimizing GEMM. Check the [tutorial](https://github.com/flame/blislab/blob/master/tutorial.pdf) for more details.\r\n\r\n    - [tpoisonooo/how-to-optimize-gemm](https://github.com/tpoisonooo/how-to-optimize-gemm) \u003cimg src=\"https://img.shields.io/github/stars/tpoisonooo/how-to-optimize-gemm?style=social\"/\u003e : row-major matmul optimization. [zhuanlan.zhihu.com/p/65436463](https://zhuanlan.zhihu.com/p/65436463).\r\n\r\n    - [YichengDWu/matmul.mojo](https://github.com/YichengDWu/matmul.mojo) \u003cimg src=\"https://img.shields.io/github/stars/YichengDWu/matmul.mojo?style=social\"/\u003e : High Performance Matrix Multiplication in Pure Mojo 🔥\r\n\r\n\r\n\r\n\r\n\r\n## Frameworks\r\n\r\n  - ### CUDA Frameworks\r\n\r\n    - #### GPU Interface\r\n      ##### GPU接口\r\n\r\n        - ##### CPP Version\r\n\r\n            - [CCCL](https://github.com/NVIDIA/cccl) \u003cimg src=\"https://img.shields.io/github/stars/NVIDIA/cccl?style=social\"/\u003e : CUDA C++ Core Libraries. The concept for the CUDA C++ Core Libraries (CCCL) grew organically out of the Thrust, CUB, and libcudacxx projects that were developed independently over the years with a similar goal: to provide high-quality, high-performance, and easy-to-use C++ abstractions for CUDA developers.\r\n\r\n            - [HIP](https://github.com/ROCm/HIP) \u003cimg src=\"https://img.shields.io/github/stars/ROCm/HIP?style=social\"/\u003e : HIP: C++ Heterogeneous-Compute Interface for Portability. 
HIP is a C++ Runtime API and Kernel Language that allows developers to create portable applications for AMD and NVIDIA GPUs from single source code. [rocmdocs.amd.com/projects/HIP/](https://rocmdocs.amd.com/projects/HIP/)\r\n\r\n\r\n        - ##### Python Version\r\n\r\n            - [NVIDIA/cuda-python](https://github.com/NVIDIA/cuda-python) \u003cimg src=\"https://img.shields.io/github/stars/NVIDIA/cuda-python?style=social\"/\u003e : CUDA Python is the home for accessing NVIDIA’s CUDA platform from Python. CUDA Python Low-level Bindings. [nvidia.github.io/cuda-python/](https://nvidia.github.io/cuda-python/latest/)\r\n\r\n            - [CuPy](https://github.com/cupy/cupy) \u003cimg src=\"https://img.shields.io/github/stars/cupy/cupy?style=social\"/\u003e : CuPy : NumPy \u0026 SciPy for GPU. [cupy.dev](https://cupy.dev/)\r\n\r\n            - [PyCUDA](https://github.com/inducer/pycuda) \u003cimg src=\"https://img.shields.io/github/stars/inducer/pycuda?style=social\"/\u003e : PyCUDA: Pythonic Access to CUDA, with Arrays and Algorithms. 
[mathema.tician.de/software/pycuda](http://mathema.tician.de/software/pycuda)\r\n\r\n\r\n\r\n        - ##### Rust Version\r\n\r\n            - [jessfraz/advent-of-cuda](https://github.com/jessfraz/advent-of-cuda) \u003cimg src=\"https://img.shields.io/github/stars/jessfraz/advent-of-cuda?style=social\"/\u003e : Doing Advent of Code with CUDA and Rust.\r\n\r\n            - [Bend](https://github.com/HigherOrderCO/Bend) \u003cimg src=\"https://img.shields.io/github/stars/HigherOrderCO/Bend?style=social\"/\u003e : A massively parallel, high-level programming language. [higherorderco.com](https://higherorderco.com/)\r\n\r\n            - [HVM](https://github.com/HigherOrderCO/HVM) \u003cimg src=\"https://img.shields.io/github/stars/HigherOrderCO/HVM?style=social\"/\u003e : A massively parallel, optimal functional runtime in Rust. [higherorderco.com](https://higherorderco.com/)\r\n\r\n            - [ZLUDA](https://github.com/vosen/ZLUDA) \u003cimg src=\"https://img.shields.io/github/stars/vosen/ZLUDA?style=social\"/\u003e : CUDA on AMD GPUs.\r\n\r\n            - [Rust-CUDA](https://github.com/Rust-GPU/Rust-CUDA) \u003cimg src=\"https://img.shields.io/github/stars/Rust-GPU/Rust-CUDA?style=social\"/\u003e : Ecosystem of libraries and tools for writing and executing fast GPU code fully in Rust.\r\n\r\n            - [cudarc](https://github.com/coreylowman/cudarc) \u003cimg src=\"https://img.shields.io/github/stars/coreylowman/cudarc?style=social\"/\u003e : cudarc: minimal and safe API over the CUDA toolkit.\r\n\r\n            - [bindgen_cuda](https://github.com/Narsil/bindgen_cuda) \u003cimg src=\"https://img.shields.io/github/stars/Narsil/bindgen_cuda?style=social\"/\u003e : A crate similar to [bindgen](https://github.com/rust-lang/rust-bindgen) in philosophy. 
It helps generate bindings to CUDA kernel source files automatically and makes them easier to use directly from Rust.\r\n\r\n            - [cuda-driver](https://github.com/YdrMaster/cuda-driver) \u003cimg src=\"https://img.shields.io/github/stars/YdrMaster/cuda-driver?style=social\"/\u003e : A CUDA runtime environment based on the CUDA Driver API.\r\n\r\n            - [async-cuda](https://github.com/oddity-ai/async-cuda) \u003cimg src=\"https://img.shields.io/github/stars/oddity-ai/async-cuda?style=social\"/\u003e : Asynchronous CUDA for Rust.\r\n\r\n            - [async-tensorrt](https://github.com/oddity-ai/async-tensorrt) \u003cimg src=\"https://img.shields.io/github/stars/oddity-ai/async-tensorrt?style=social\"/\u003e : Asynchronous TensorRT for Rust.\r\n\r\n            - [krnl](https://github.com/charles-r-earp/krnl) \u003cimg src=\"https://img.shields.io/github/stars/charles-r-earp/krnl?style=social\"/\u003e : Safe, portable, high performance compute (GPGPU) kernels.\r\n\r\n            - [custos](https://github.com/elftausend/custos) \u003cimg src=\"https://img.shields.io/github/stars/elftausend/custos?style=social\"/\u003e : A minimal OpenCL, CUDA, WGPU and host CPU array manipulation engine / framework.\r\n\r\n            - [spinorml/nvlib](https://github.com/spinorml/nvlib) \u003cimg src=\"https://img.shields.io/github/stars/spinorml/nvlib?style=social\"/\u003e : Rust interoperability with NVIDIA CUDA NVRTC and Driver.\r\n\r\n            - [DoeringChristian/cuda-rs](https://github.com/DoeringChristian/cuda-rs) \u003cimg src=\"https://img.shields.io/github/stars/DoeringChristian/cuda-rs?style=social\"/\u003e : CUDA bindings for Rust generated with bindgen-cli (similar to cust_raw).\r\n\r\n            - [romankoblov/rust-nvrtc](https://github.com/romankoblov/rust-nvrtc) \u003cimg src=\"https://img.shields.io/github/stars/romankoblov/rust-nvrtc?style=social\"/\u003e : NVRTC bindings for Rust.\r\n\r\n            - [solkitten/astro-cuda](https://github.com/solkitten/astro-cuda) \u003cimg 
src=\"https://img.shields.io/github/stars/solkitten/astro-cuda?style=social\"/\u003e : CUDA Driver API bindings for Rust.\r\n\r\n            - [bokutotu/curs](https://github.com/bokutotu/curs) \u003cimg src=\"https://img.shields.io/github/stars/bokutotu/curs?style=social\"/\u003e : cuda\u0026cublas\u0026cudnn wrapper for Rust.\r\n\r\n            - [rust-cuda/cuda-sys](https://github.com/rust-cuda/cuda-sys) \u003cimg src=\"https://img.shields.io/github/stars/rust-cuda/cuda-sys?style=social\"/\u003e : Rust binding to CUDA APIs.\r\n\r\n            - [bheisler/RustaCUDA](https://github.com/bheisler/RustaCUDA) \u003cimg src=\"https://img.shields.io/github/stars/bheisler/RustaCUDA?style=social\"/\u003e : Rusty wrapper for the CUDA Driver API.\r\n\r\n            - [tmrob2/cuda2rust_sandpit](https://github.com/tmrob2/cuda2rust_sandpit) \u003cimg src=\"https://img.shields.io/github/stars/tmrob2/cuda2rust_sandpit?style=social\"/\u003e : Minimal examples to get CUDA linear algebra programs working with Rust using CC \u0026 FFI.\r\n\r\n            - [PhDP/rust-cuda-template](https://github.com/PhDP/rust-cuda-template) \u003cimg src=\"https://img.shields.io/github/stars/PhDP/rust-cuda-template?style=social\"/\u003e : Simple template for Rust + CUDA.\r\n\r\n            - [neka-nat/cuimage](https://github.com/neka-nat/cuimage) \u003cimg src=\"https://img.shields.io/github/stars/neka-nat/cuimage?style=social\"/\u003e : Rust implementation of image processing library with CUDA.\r\n\r\n            - [yanghaku/cuda-driver-sys](https://github.com/yanghaku/cuda-driver-sys) \u003cimg src=\"https://img.shields.io/github/stars/yanghaku/cuda-driver-sys?style=social\"/\u003e : Rust binding to CUDA Driver APIs.\r\n\r\n            - [Canyon-ml/canyon-sys](https://github.com/Canyon-ml/canyon-sys) \u003cimg src=\"https://img.shields.io/github/stars/Canyon-ml/canyon-sys?style=social\"/\u003e : Rust Bindings for Cuda, CuDNN.\r\n\r\n            - [cea-hpc/HARP](https://github.com/cea-hpc/HARP) 
\u003cimg src=\"https://img.shields.io/github/stars/cea-hpc/HARP?style=social\"/\u003e : Small tool for profiling the performance of hardware-accelerated Rust code using OpenCL and CUDA.\r\n\r\n            - [Conqueror712/CUDA-Simulator](https://github.com/Conqueror712/CUDA-Simulator) \u003cimg src=\"https://img.shields.io/github/stars/Conqueror712/CUDA-Simulator?style=social\"/\u003e : A self-developed version of the user-mode CUDA emulator project and a learning repository for Rust.\r\n\r\n            - [cszach/rust-cuda-template](https://github.com/cszach/rust-cuda-template) \u003cimg src=\"https://img.shields.io/github/stars/cszach/rust-cuda-template?style=social\"/\u003e : A Rust CUDA template with detailed instructions.\r\n\r\n            - [exor2008/fluid-simulator](https://github.com/exor2008/fluid-simulator) \u003cimg src=\"https://img.shields.io/github/stars/exor2008/fluid-simulator?style=social\"/\u003e : Rust CUDA fluid simulator.\r\n\r\n            - [chichieinstein/rustycuda](https://github.com/chichieinstein/rustycuda) \u003cimg src=\"https://img.shields.io/github/stars/chichieinstein/rustycuda?style=social\"/\u003e : Convenience functions for generic handling of CUDA resources on the Rust side.\r\n\r\n            - [Jafagervik/cruda](https://github.com/Jafagervik/cruda) \u003cimg src=\"https://img.shields.io/github/stars/Jafagervik/cruda?style=social\"/\u003e : CRUDA - Writing rust with cuda.\r\n\r\n            - [lennyerik/cutransform](https://github.com/lennyerik/cutransform) \u003cimg src=\"https://img.shields.io/github/stars/lennyerik/cutransform?style=social\"/\u003e : CUDA kernels in any language supported by LLVM.\r\n\r\n           - [cjordan/hip-sys](https://github.com/cjordan/hip-sys) \u003cimg src=\"https://img.shields.io/github/stars/cjordan/hip-sys?style=social\"/\u003e : Rust bindings for HIP.\r\n\r\n            - [rust-gpu](https://github.com/EmbarkStudios/rust-gpu) \u003cimg 
src=\"https://img.shields.io/github/stars/EmbarkStudios/rust-gpu?style=social\"/\u003e : 🐉 Making Rust a first-class language and ecosystem for GPU shaders 🚧 [shader.rs](https://shader.rs/)\r\n\r\n            - [wgpu](https://github.com/gfx-rs/wgpu) \u003cimg src=\"https://img.shields.io/github/stars/gfx-rs/wgpu?style=social\"/\u003e : Safe and portable GPU abstraction in Rust, implementing WebGPU API. [wgpu.rs](https://wgpu.rs/)\r\n\r\n            - [Vulkano](https://github.com/vulkano-rs/vulkano) \u003cimg src=\"https://img.shields.io/github/stars/vulkano-rs/vulkano?style=social\"/\u003e : Safe and rich Rust wrapper around the Vulkan API. Vulkano is a Rust wrapper around [the Vulkan graphics API](https://www.vulkan.org/). It follows the Rust philosophy, which is that as long as you don't use unsafe code you shouldn't be able to trigger any undefined behavior. In the case of Vulkan, this means that non-unsafe code should always conform to valid API usage.\r\n\r\n            - [Ash](https://github.com/ash-rs/ash) \u003cimg src=\"https://img.shields.io/github/stars/ash-rs/ash?style=social\"/\u003e : Vulkan bindings for Rust.\r\n\r\n            - [ocl](https://github.com/cogciprocate/ocl) \u003cimg src=\"https://img.shields.io/github/stars/cogciprocate/ocl?style=social\"/\u003e : OpenCL for Rust.\r\n\r\n            - [opencl3](https://github.com/kenba/opencl3) \u003cimg src=\"https://img.shields.io/github/stars/kenba/opencl3?style=social\"/\u003e : A Rust implementation of the Khronos [OpenCL 3.0](https://registry.khronos.org/OpenCL/) API.\r\n\r\n\r\n\r\n\r\n\r\n        - ##### Julia Version\r\n\r\n            - [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl) \u003cimg src=\"https://img.shields.io/github/stars/JuliaGPU/CUDA.jl?style=social\"/\u003e : CUDA programming in Julia. 
[juliagpu.org/](https://juliagpu.org/)\r\n\r\n            - [AMDGPU.jl](https://github.com/JuliaGPU/AMDGPU.jl) \u003cimg src=\"https://img.shields.io/github/stars/JuliaGPU/AMDGPU.jl?style=social\"/\u003e : AMD GPU (ROCm) programming in Julia.\r\n\r\n\r\n    - #### Performance Benchmark\r\n\r\n        - [FlagPerf](https://github.com/FlagOpen/FlagPerf) \u003cimg src=\"https://img.shields.io/github/stars/FlagOpen/FlagPerf?style=social\"/\u003e : FlagPerf is an open-source software platform for benchmarking AI chips. It is an integrated AI hardware evaluation engine built jointly by BAAI (Beijing Academy of Artificial Intelligence) and AI hardware vendors, aiming to establish an industry-practice-oriented metric system for evaluating the real-world capability of AI hardware across software stack combinations (model + framework + compiler).\r\n\r\n        - [te42kyfo/gpu-benches](https://github.com/te42kyfo/gpu-benches) \u003cimg src=\"https://img.shields.io/github/stars/te42kyfo/gpu-benches?style=social\"/\u003e : Collection of benchmarks to measure basic GPU capabilities.\r\n\r\n\r\n\r\n    - #### Scientific Computing Framework\r\n      ##### 科学计算框架\r\n\r\n        - [cuBLAS](https://developer.nvidia.com/cublas) : Basic Linear Algebra on NVIDIA GPUs. NVIDIA cuBLAS is a GPU-accelerated library for accelerating AI and HPC applications. It includes several API extensions for providing drop-in industry standard BLAS APIs and GEMM APIs with support for fusions that are highly optimized for NVIDIA GPUs. The cuBLAS library also contains extensions for batched operations, execution across multiple GPUs, and mixed- and low-precision execution with additional tuning for the best performance.\r\n\r\n        - [CUTLASS](https://github.com/NVIDIA/cutlass) \u003cimg src=\"https://img.shields.io/github/stars/NVIDIA/cutlass?style=social\"/\u003e : CUDA Templates for Linear Algebra Subroutines.\r\n\r\n        - [MatX](https://github.com/NVIDIA/MatX) \u003cimg src=\"https://img.shields.io/github/stars/NVIDIA/MatX?style=social\"/\u003e : MatX - GPU-Accelerated Numerical Computing in Modern C++. An efficient C++17 GPU numerical computing library with Python-like syntax. 
[nvidia.github.io/MatX](https://nvidia.github.io/MatX)\r\n\r\n        - [DeepGEMM](https://github.com/deepseek-ai/DeepGEMM) \u003cimg src=\"https://img.shields.io/github/stars/deepseek-ai/DeepGEMM?style=social\"/\u003e : DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling.\r\n\r\n        - [MUTLASS](https://github.com/MooreThreads/mutlass) \u003cimg src=\"https://img.shields.io/github/stars/MooreThreads/mutlass?style=social\"/\u003e : MUSA Templates for Linear Algebra Subroutines.\r\n\r\n        - [CuPy](https://github.com/cupy/cupy) \u003cimg src=\"https://img.shields.io/github/stars/cupy/cupy?style=social\"/\u003e : CuPy : NumPy \u0026 SciPy for GPU. [cupy.dev](https://cupy.dev/)\r\n\r\n        - [GenericLinearAlgebra.jl](https://github.com/JuliaLinearAlgebra/GenericLinearAlgebra.jl) \u003cimg src=\"https://img.shields.io/github/stars/JuliaLinearAlgebra/GenericLinearAlgebra.jl?style=social\"/\u003e : Generic numerical linear algebra in Julia.\r\n\r\n        - [custos-math](https://github.com/elftausend/custos-math) \u003cimg src=\"https://img.shields.io/github/stars/elftausend/custos-math?style=social\"/\u003e : This crate provides CUDA, OpenCL, CPU (and Stack) based matrix operations using [custos](https://github.com/elftausend/custos).\r\n\r\n\r\n    - #### Attention and Transformer Framework\r\n\r\n        - [FlashAttention](https://github.com/Dao-AILab/flash-attention) \u003cimg src=\"https://img.shields.io/github/stars/Dao-AILab/flash-attention?style=social\"/\u003e : Fast and memory-efficient exact attention. \"FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness\". 
(**[arXiv 2022](https://arxiv.org/abs/2205.14135)**).\r\n\r\n        - [fla-org/flash-linear-attention](https://github.com/fla-org/flash-linear-attention) \u003cimg src=\"https://img.shields.io/github/stars/fla-org/flash-linear-attention?style=social\"/\u003e : 🚀 Efficient implementations of state-of-the-art linear attention models in PyTorch and Triton.\r\n\r\n        - [66RING/tiny-flash-attention](https://github.com/66RING/tiny-flash-attention) \u003cimg src=\"https://img.shields.io/github/stars/66RING/tiny-flash-attention?style=social\"/\u003e : [flash attention](https://github.com/Dao-AILab/flash-attention) tutorial written in Python, Triton, CUDA, and CUTLASS.\r\n\r\n        - [weishengying/tiny-flash-attention](https://github.com/weishengying/tiny-flash-attention) \u003cimg src=\"https://img.shields.io/github/stars/weishengying/tiny-flash-attention?style=social\"/\u003e : A simplified flash-attention implementation using CUTLASS, intended for teaching.\r\n\r\n        - [jepeake/tiny-flash-attention](https://github.com/jepeake/tiny-flash-attention) \u003cimg src=\"https://img.shields.io/github/stars/jepeake/tiny-flash-attention?style=social\"/\u003e : flash attention in ~20 lines.\r\n\r\n\r\n\r\n\r\n    - #### Machine Learning Framework\r\n\r\n        - [cuDNN](https://developer.nvidia.com/cudnn) : The NVIDIA CUDA® Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for [deep neural networks](https://developer.nvidia.com/deep-learning). cuDNN provides highly tuned implementations for standard routines such as forward and backward convolution, attention, matmul, pooling, and normalization.\r\n\r\n        - [PyTorch](https://github.com/pytorch/pytorch) \u003cimg src=\"https://img.shields.io/github/stars/pytorch/pytorch?style=social\"/\u003e : Tensors and Dynamic neural networks in Python with strong GPU acceleration. 
[pytorch.org](https://pytorch.org/)\r\n\r\n        - [MooreThreads/torch_musa](https://github.com/MooreThreads/torch_musa) \u003cimg src=\"https://img.shields.io/github/stars/MooreThreads/torch_musa?style=social\"/\u003e : torch_musa is an open source repository based on PyTorch, which can make full use of the super computing power of MooreThreads graphics cards.\r\n\r\n        - [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) \u003cimg src=\"https://img.shields.io/github/stars/PaddlePaddle/Paddle?style=social\"/\u003e : PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (PaddlePaddle core framework: high-performance single-machine and distributed training, and cross-platform deployment, for deep learning \u0026 machine learning). [www.paddlepaddle.org/](http://www.paddlepaddle.org/)\r\n\r\n        - [flashlight/flashlight](https://github.com/flashlight/flashlight) \u003cimg src=\"https://img.shields.io/github/stars/flashlight/flashlight?style=social\"/\u003e : A C++ standalone library for machine learning. [fl.readthedocs.io/en/latest/](https://fl.readthedocs.io/en/latest/)\r\n\r\n        - [yhwang-hub/dl_model_infer](https://github.com/yhwang-hub/dl_model_infer) \u003cimg src=\"https://img.shields.io/github/stars/yhwang-hub/dl_model_infer?style=social\"/\u003e : This is a C++ AI inference library. Currently, it only supports inference with TensorRT models. Support for C++ inference with frameworks such as OpenVINO, NCNN, and MNN is planned. Pre- and post-processing come in two versions, a C++ version and a CUDA version. 
The CUDA version is recommended. This repository provides accelerated deployment examples for popular deep learning CV models, and the CUDA C code supports dynamic-batch image preprocessing, inference, decoding, and NMS.\r\n\r\n        - [NVlabs/tiny-cuda-nn](https://github.com/NVlabs/tiny-cuda-nn) \u003cimg src=\"https://img.shields.io/github/stars/NVlabs/tiny-cuda-nn?style=social\"/\u003e : Lightning fast C++/CUDA neural network framework.\r\n\r\n        - [zjhellofss/KuiperLLama](https://github.com/zjhellofss/KuiperLLama) \u003cimg src=\"https://img.shields.io/github/stars/zjhellofss/KuiperLLama?style=social\"/\u003e : Build Your Own LLM Inference Framework. KuiperLLama is a hands-on, build-it-yourself LLM inference framework supporting LLama2/3 and Qwen2.5. A good project for campus recruitment (autumn and spring hiring) and internships, it walks you through implementing an LLM inference framework with LLama2/3 and Qwen2.5 support from scratch.\r\n\r\n        - [zjhellofss/KuiperInfer](https://github.com/zjhellofss/KuiperInfer) \u003cimg src=\"https://img.shields.io/github/stars/zjhellofss/KuiperInfer?style=social\"/\u003e :  A good project for campus recruitment and internships. Implement a high-performance deep learning inference library step by step, with support for inference of llama2, Unet, Yolov5, Resnet, and other models.\r\n\r\n        - [zjhellofss/kuiperdatawhale](https://github.com/zjhellofss/kuiperdatawhale) \u003cimg src=\"https://img.shields.io/github/stars/zjhellofss/kuiperdatawhale?style=social\"/\u003e :  Build a deep learning inference framework from scratch.\r\n\r\n        - [MarioSieg/magnetron](https://github.com/MarioSieg/magnetron) \u003cimg src=\"https://img.shields.io/github/stars/MarioSieg/magnetron?style=social\"/\u003e :  (WIP) A small but powerful, homemade PyTorch from scratch. 
Minimalistic homemade PyTorch alternative, written in C99 and Python.\r\n\r\n        - [lucasdelimanogueira/PyNorch](https://github.com/lucasdelimanogueira/PyNorch) \u003cimg src=\"https://img.shields.io/github/stars/lucasdelimanogueira/PyNorch?style=social\"/\u003e :  Recreating PyTorch from scratch (C/C++, CUDA, NCCL and Python, with multi-GPU support and automatic differentiation!)\r\n\r\n\r\n    - #### AI Inference Framework\r\n      ##### AI推理框架\r\n\r\n\r\n\r\n\r\n\r\n        - ##### LLM Inference and Serving Engine\r\n\r\n            - [TensorRT](https://github.com/NVIDIA/TensorRT) \u003cimg src=\"https://img.shields.io/github/stars/NVIDIA/TensorRT?style=social\"/\u003e : NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT. [developer.nvidia.com/tensorrt](https://developer.nvidia.com/tensorrt)\r\n\r\n            - [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) \u003cimg src=\"https://img.shields.io/github/stars/NVIDIA/TensorRT-LLM?style=social\"/\u003e : TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines. [nvidia.github.io/TensorRT-LLM](https://nvidia.github.io/TensorRT-LLM)\r\n\r\n            - [NVIDIA/TensorRT-Model-Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) \u003cimg src=\"https://img.shields.io/github/stars/NVIDIA/TensorRT-Model-Optimizer?style=social\"/\u003e : TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization, pruning, distillation, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs. 
[nvidia.github.io/TensorRT-Model-Optimizer](https://nvidia.github.io/TensorRT-Model-Optimizer/)\r\n\r\n            - [Ollama](https://github.com/ollama/ollama) \u003cimg src=\"https://img.shields.io/github/stars/ollama/ollama?style=social\"/\u003e : Get up and running with Llama 3.3, DeepSeek-R1, Phi-4, Gemma 2, and other large language models. [ollama.com](https://ollama.com/)\r\n\r\n            - [vLLM](https://github.com/vllm-project/vllm) \u003cimg src=\"https://img.shields.io/github/stars/vllm-project/vllm?style=social\"/\u003e : A high-throughput and memory-efficient inference and serving engine for LLMs. [docs.vllm.ai](https://docs.vllm.ai/)\r\n\r\n            - [SGLang](https://github.com/sgl-project/sglang) \u003cimg src=\"https://img.shields.io/github/stars/sgl-project/sglang?style=social\"/\u003e : SGLang is a fast serving framework for large language models and vision language models. [docs.sglang.ai/](https://docs.sglang.ai/)\r\n\r\n            - [MLC LLM](https://github.com/mlc-ai/mlc-llm) \u003cimg src=\"https://img.shields.io/github/stars/mlc-ai/mlc-llm?style=social\"/\u003e : Universal LLM Deployment Engine with ML Compilation. [llm.mlc.ai/](https://llm.mlc.ai/)\r\n\r\n            - [KTransformers](https://github.com/kvcache-ai/ktransformers) \u003cimg src=\"https://img.shields.io/github/stars/kvcache-ai/ktransformers?style=social\"/\u003e : A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations. 
[kvcache-ai.github.io/ktransformers/](https://kvcache-ai.github.io/ktransformers/)\r\n\r\n            - [Chitu（赤兔）](https://github.com/thu-pacman/chitu) \u003cimg src=\"https://img.shields.io/github/stars/thu-pacman/chitu?style=social\"/\u003e : High-performance inference framework for large language models, focusing on efficiency, flexibility, and availability.\r\n\r\n            - [GPUStack](https://github.com/gpustack/gpustack) \u003cimg src=\"https://img.shields.io/github/stars/gpustack/gpustack?style=social\"/\u003e : GPUStack is an open-source GPU cluster manager for running AI models. [gpustack.ai](https://gpustack.ai/)\r\n\r\n            - [Lamini](https://github.com/lamini-ai/lamini) \u003cimg src=\"https://img.shields.io/github/stars/lamini-ai/lamini?style=social\"/\u003e : The Official Python Client for Lamini's API. [lamini.ai/](https://lamini.ai/)\r\n\r\n            - [datawhalechina/self-llm](https://github.com/datawhalechina/self-llm) \u003cimg src=\"https://img.shields.io/github/stars/datawhalechina/self-llm?style=social\"/\u003e :  A Guide to Using Open-Source LLMs: a tutorial for quickly deploying open-source large models on Linux, tailored to users in China.\r\n\r\n            - [ninehills/llm-inference-benchmark](https://github.com/ninehills/llm-inference-benchmark) \u003cimg src=\"https://img.shields.io/github/stars/ninehills/llm-inference-benchmark?style=social\"/\u003e : LLM Inference benchmark.\r\n\r\n            - [csbench/csbench](https://github.com/csbench/csbench) \u003cimg src=\"https://img.shields.io/github/stars/csbench/csbench?style=social\"/\u003e : \"CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery\". (**[arXiv 2024](https://arxiv.org/abs/2406.08587)**).\r\n\r\n            - [MooreThreads/vllm_musa](https://github.com/MooreThreads/vllm_musa) \u003cimg src=\"https://img.shields.io/github/stars/MooreThreads/vllm_musa?style=social\"/\u003e : A high-throughput and memory-efficient inference and serving engine for LLMs. 
[docs.vllm.ai](https://docs.vllm.ai/)\r\n\r\n\r\n\r\n        - ##### High Performance Kernel Library\r\n\r\n            - [DeepGEMM](https://github.com/deepseek-ai/DeepGEMM) \u003cimg src=\"https://img.shields.io/github/stars/deepseek-ai/DeepGEMM?style=social\"/\u003e : DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling.\r\n\r\n            - [FlashInfer](https://github.com/flashinfer-ai/flashinfer) \u003cimg src=\"https://img.shields.io/github/stars/flashinfer-ai/flashinfer?style=social\"/\u003e : FlashInfer: Kernel Library for LLM Serving. [flashinfer.ai](https://flashinfer.ai)\r\n\r\n            - [FlashMLA](https://github.com/deepseek-ai/FlashMLA) \u003cimg src=\"https://img.shields.io/github/stars/deepseek-ai/FlashMLA?style=social\"/\u003e : FlashMLA: Efficient MLA Decoding Kernel for Hopper GPUs.\r\n\r\n            - [DeepEP](https://github.com/deepseek-ai/DeepEP) \u003cimg src=\"https://img.shields.io/github/stars/deepseek-ai/DeepEP?style=social\"/\u003e : DeepEP: an efficient expert-parallel communication library.\r\n\r\n\r\n\r\n\r\n        - ##### C Implementation\r\n\r\n            - [llm.c](https://github.com/karpathy/llm.c) \u003cimg src=\"https://img.shields.io/github/stars/karpathy/llm.c?style=social\"/\u003e : LLM training in simple, pure C/CUDA. There is no need for 245MB of PyTorch or 107MB of cPython. For example, training GPT-2 (CPU, fp32) is ~1,000 lines of clean code in a single file. It compiles and runs instantly, and exactly matches the PyTorch reference implementation.\r\n\r\n            - [llama2.c](https://github.com/karpathy/llama2.c) \u003cimg src=\"https://img.shields.io/github/stars/karpathy/llama2.c?style=social\"/\u003e : Inference Llama 2 in one file of pure C. 
Train the Llama 2 LLM architecture in PyTorch then inference it with one simple 700-line C file (run.c).\r\n\r\n\r\n        - ##### CPP Implementation\r\n\r\n            - [gemma.cpp](https://github.com/google/gemma.cpp) \u003cimg src=\"https://img.shields.io/github/stars/google/gemma.cpp?style=social\"/\u003e :  gemma.cpp is a lightweight, standalone C++ inference engine for the Gemma foundation models from Google.\r\n\r\n            - [llama.cpp](https://github.com/ggerganov/llama.cpp) \u003cimg src=\"https://img.shields.io/github/stars/ggerganov/llama.cpp?style=social\"/\u003e : Inference of [LLaMA](https://github.com/facebookresearch/llama) model in pure C/C++.\r\n\r\n            - [whisper.cpp](https://github.com/ggerganov/whisper.cpp) \u003cimg src=\"https://img.shields.io/github/stars/ggerganov/whisper.cpp?style=social\"/\u003e : High-performance inference of [OpenAI's Whisper](https://github.com/openai/whisper) automatic speech recognition (ASR) model.\r\n\r\n            - [ChatGLM.cpp](https://github.com/li-plus/chatglm.cpp) \u003cimg src=\"https://img.shields.io/github/stars/li-plus/chatglm.cpp?style=social\"/\u003e : C++ implementation of [ChatGLM-6B](https://github.com/THUDM/ChatGLM-6B) and [ChatGLM2-6B](https://github.com/THUDM/ChatGLM2-6B).\r\n\r\n            - [MegEngine/InferLLM](https://github.com/MegEngine/InferLLM) \u003cimg src=\"https://img.shields.io/github/stars/MegEngine/InferLLM?style=social\"/\u003e : InferLLM is a lightweight LLM model inference framework that mainly references and borrows from the llama.cpp project.\r\n\r\n            - [DeployAI/nndeploy](https://github.com/DeployAI/nndeploy) \u003cimg src=\"https://img.shields.io/github/stars/DeployAI/nndeploy?style=social\"/\u003e : nndeploy is an end-to-end model deployment framework. Built around multi-platform inference and directed-acyclic-graph-based model deployment, it aims to give users a cross-platform, easy-to-use, high-performance model deployment experience. [nndeploy-zh.readthedocs.io/zh/latest/](https://nndeploy-zh.readthedocs.io/zh/latest/)\r\n\r\n            - [zjhellofss/KuiperInfer 
(a self-made deep learning inference framework)](https://github.com/zjhellofss/KuiperInfer) \u003cimg src=\"https://img.shields.io/github/stars/zjhellofss/KuiperInfer?style=social\"/\u003e : Implement a high-performance deep learning inference library from scratch, step by step, with support for inference of models such as Llama, UNet, YOLOv5 and ResNet.\r\n\r\n            - [skeskinen/llama-lite](https://github.com/skeskinen/llama-lite) \u003cimg src=\"https://img.shields.io/github/stars/skeskinen/llama-lite?style=social\"/\u003e : Embeddings focused small version of Llama NLP model.\r\n\r\n            - [Const-me/Whisper](https://github.com/Const-me/Whisper) \u003cimg src=\"https://img.shields.io/github/stars/Const-me/Whisper?style=social\"/\u003e : High-performance GPGPU inference of OpenAI's Whisper automatic speech recognition (ASR) model.\r\n\r\n            - [wangzhaode/ChatGLM-MNN](https://github.com/wangzhaode/ChatGLM-MNN) \u003cimg src=\"https://img.shields.io/github/stars/wangzhaode/ChatGLM-MNN?style=social\"/\u003e : Pure C++, Easy Deploy ChatGLM-6B.\r\n\r\n            - [ztxz16/fastllm](https://github.com/ztxz16/fastllm) \u003cimg src=\"https://img.shields.io/github/stars/ztxz16/fastllm?style=social\"/\u003e : A pure C++ large-model library with no third-party dependencies and CUDA acceleration. It currently supports the Chinese LLMs ChatGLM-6B and MOSS, and can run ChatGLM-6B smoothly on Android devices.\r\n\r\n            - [davidar/eigenGPT](https://github.com/davidar/eigenGPT) \u003cimg src=\"https://img.shields.io/github/stars/davidar/eigenGPT?style=social\"/\u003e : Minimal C++ implementation of GPT2.\r\n\r\n            - [Tlntin/Qwen-TensorRT-LLM](https://github.com/Tlntin/Qwen-TensorRT-LLM) \u003cimg src=\"https://img.shields.io/github/stars/Tlntin/Qwen-TensorRT-LLM?style=social\"/\u003e : Inference acceleration for Qwen-7B-Chat using TRT-LLM.\r\n\r\n            - [FeiGeChuanShu/trt2023](https://github.com/FeiGeChuanShu/trt2023) \u003cimg src=\"https://img.shields.io/github/stars/FeiGeChuanShu/trt2023?style=social\"/\u003e : NVIDIA TensorRT Hackathon 2023 second-round topic: building and optimizing Tongyi Qianwen Qwen-7B with TensorRT-LLM.\r\n\r\n            - 
[TRT2022/trtllm-llama](https://github.com/TRT2022/trtllm-llama) \u003cimg src=\"https://img.shields.io/github/stars/TRT2022/trtllm-llama?style=social\"/\u003e : ☢️ TensorRT Hackathon 2023 second round: Llama model inference acceleration and optimization based on TensorRT-LLM.\r\n\r\n\r\n\r\n        - ##### Mojo Implementation\r\n\r\n            - [llama2.mojo](https://github.com/tairov/llama2.mojo) \u003cimg src=\"https://img.shields.io/github/stars/tairov/llama2.mojo?style=social\"/\u003e : Inference Llama 2 in one file of pure 🔥\r\n\r\n            - [dorjeduck/llm.mojo](https://github.com/dorjeduck/llm.mojo) \u003cimg src=\"https://img.shields.io/github/stars/dorjeduck/llm.mojo?style=social\"/\u003e : Port of Andrej Karpathy's llm.c to Mojo.\r\n\r\n\r\n        - ##### Rust Implementation\r\n\r\n            - [Candle](https://github.com/huggingface/candle) \u003cimg src=\"https://img.shields.io/github/stars/huggingface/candle?style=social\"/\u003e : Minimalist ML framework for Rust.\r\n\r\n            - [Safetensors](https://github.com/huggingface/safetensors) \u003cimg src=\"https://img.shields.io/github/stars/huggingface/safetensors?style=social\"/\u003e : Simple, safe way to store and distribute tensors. [huggingface.co/docs/safetensors](https://huggingface.co/docs/safetensors/index)\r\n\r\n            - [Tokenizers](https://github.com/huggingface/tokenizers) \u003cimg src=\"https://img.shields.io/github/stars/huggingface/tokenizers?style=social\"/\u003e : 💥 Fast State-of-the-Art Tokenizers optimized for Research and Production. [huggingface.co/docs/tokenizers](https://huggingface.co/docs/tokenizers/index)\r\n\r\n            - [Burn](https://github.com/burn-rs/burn) \u003cimg src=\"https://img.shields.io/github/stars/burn-rs/burn?style=social\"/\u003e : Burn - A Flexible and Comprehensive Deep Learning Framework in Rust. 
[burn-rs.github.io/](https://burn-rs.github.io/)\r\n\r\n            - [dfdx](https://github.com/coreylowman/dfdx) \u003cimg src=\"https://img.shields.io/github/stars/coreylowman/dfdx?style=social\"/\u003e : Deep learning in Rust, with shape checked tensors and neural networks.\r\n\r\n            - [luminal](https://github.com/jafioti/luminal) \u003cimg src=\"https://img.shields.io/github/stars/jafioti/luminal?style=social\"/\u003e : Deep learning at the speed of light. [www.luminalai.com/](https://www.luminalai.com/)\r\n\r\n            - [crabml](https://github.com/crabml/crabml) \u003cimg src=\"https://img.shields.io/github/stars/crabml/crabml?style=social\"/\u003e : crabml is focusing on the reimplementation of GGML using the Rust programming language.\r\n\r\n            - [TensorFlow Rust](https://github.com/tensorflow/rust) \u003cimg src=\"https://img.shields.io/github/stars/tensorflow/rust?style=social\"/\u003e : Rust language bindings for TensorFlow.\r\n\r\n            - [tch-rs](https://github.com/LaurentMazare/tch-rs) \u003cimg src=\"https://img.shields.io/github/stars/LaurentMazare/tch-rs?style=social\"/\u003e : Rust bindings for the C++ api of PyTorch.\r\n\r\n            - [rustai-solutions/candle_demo_openchat_35](https://github.com/rustai-solutions/candle_demo_openchat_35) \u003cimg src=\"https://img.shields.io/github/stars/rustai-solutions/candle_demo_openchat_35?style=social\"/\u003e : candle_demo_openchat_35.\r\n\r\n            - [llama2.rs](https://github.com/srush/llama2.rs) \u003cimg src=\"https://img.shields.io/github/stars/srush/llama2.rs?style=social\"/\u003e : A fast llama2 decoder in pure Rust.\r\n\r\n            - [Llama2-burn](https://github.com/Gadersd/llama2-burn) \u003cimg src=\"https://img.shields.io/github/stars/Gadersd/llama2-burn?style=social\"/\u003e : Llama2 LLM ported to Rust burn.\r\n\r\n            - [gaxler/llama2.rs](https://github.com/gaxler/llama2.rs) \u003cimg 
src=\"https://img.shields.io/github/stars/gaxler/llama2.rs?style=social\"/\u003e : Inference Llama 2 in one file of pure Rust 🦀\r\n\r\n            - [whisper-burn](https://github.com/Gadersd/whisper-burn) \u003cimg src=\"https://img.shields.io/github/stars/Gadersd/whisper-burn?style=social\"/\u003e : A Rust implementation of OpenAI's Whisper model using the burn framework.\r\n\r\n            - [stable-diffusion-burn](https://github.com/Gadersd/stable-diffusion-burn) \u003cimg src=\"https://img.shields.io/github/stars/Gadersd/stable-diffusion-burn?style=social\"/\u003e : Stable Diffusion v1.4 ported to Rust's burn framework.\r\n\r\n            - [coreylowman/llama-dfdx](https://github.com/coreylowman/llama-dfdx) \u003cimg src=\"https://img.shields.io/github/stars/coreylowman/llama-dfdx?style=social\"/\u003e : [LLaMa 7b](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/) with CUDA acceleration implemented in rust. Minimal GPU memory needed!\r\n\r\n            - [tazz4843/whisper-rs](https://github.com/tazz4843/whisper-rs) \u003cimg src=\"https://img.shields.io/github/stars/tazz4843/whisper-rs?style=social\"/\u003e : Rust bindings to [whisper.cpp](https://github.com/ggerganov/whisper.cpp).\r\n\r\n            - [rustformers/llm](https://github.com/rustformers/llm) \u003cimg src=\"https://img.shields.io/github/stars/rustformers/llm?style=social\"/\u003e : Run inference for Large Language Models on CPU, with Rust 🦀🚀🦙.\r\n\r\n            - [Chidori](https://github.com/ThousandBirdsInc/chidori) \u003cimg src=\"https://img.shields.io/github/stars/ThousandBirdsInc/chidori?style=social\"/\u003e : A reactive runtime for building durable AI agents. 
[docs.thousandbirds.ai](https://docs.thousandbirds.ai/).\r\n\r\n            - [llm-chain](https://github.com/sobelio/llm-chain) \u003cimg src=\"https://img.shields.io/github/stars/sobelio/llm-chain?style=social\"/\u003e : llm-chain is a collection of Rust crates designed to help you work with Large Language Models (LLMs) more effectively. [llm-chain.xyz](https://llm-chain.xyz/)\r\n\r\n            - [Atome-FE/llama-node](https://github.com/Atome-FE/llama-node) \u003cimg src=\"https://img.shields.io/github/stars/Atome-FE/llama-node?style=social\"/\u003e : Believe in AI democratization. llama for nodejs backed by llama-rs and llama.cpp, work locally on your laptop CPU. support llama/alpaca/gpt4all/vicuna model. [www.npmjs.com/package/llama-node](https://www.npmjs.com/package/llama-node)\r\n\r\n            - [Noeda/rllama](https://github.com/Noeda/rllama) \u003cimg src=\"https://img.shields.io/github/stars/Noeda/rllama?style=social\"/\u003e : Rust+OpenCL+AVX2 implementation of LLaMA inference code.\r\n\r\n            - [lencx/ChatGPT](https://github.com/lencx/ChatGPT) \u003cimg src=\"https://img.shields.io/github/stars/lencx/ChatGPT?style=social\"/\u003e : 🔮 ChatGPT Desktop Application (Mac, Windows and Linux). [NoFWL](https://app.nofwl.com/).\r\n\r\n            - [Synaptrix/ChatGPT-Desktop](https://github.com/Synaptrix/ChatGPT-Desktop) \u003cimg src=\"https://img.shields.io/github/stars/Synaptrix/ChatGPT-Desktop?style=social\"/\u003e : Fuel your productivity with ChatGPT-Desktop - Blazingly fast and supercharged!\r\n\r\n            - [Poordeveloper/chatgpt-app](https://github.com/Poordeveloper/chatgpt-app) \u003cimg src=\"https://img.shields.io/github/stars/Poordeveloper/chatgpt-app?style=social\"/\u003e : A ChatGPT App for all platforms. 
Built with Rust + Tauri + Vue + Axum.\r\n\r\n            - [mxismean/chatgpt-app](https://github.com/mxismean/chatgpt-app) \u003cimg src=\"https://img.shields.io/github/stars/mxismean/chatgpt-app?style=social\"/\u003e : Tauri project: ChatGPT App.\r\n\r\n            - [sonnylazuardi/chat-ai-desktop](https://github.com/sonnylazuardi/chat-ai-desktop) \u003cimg src=\"https://img.shields.io/github/stars/sonnylazuardi/chat-ai-desktop?style=social\"/\u003e : Chat AI Desktop App. Unofficial ChatGPT desktop app for Mac \u0026 Windows menubar using Tauri \u0026 Rust.\r\n\r\n            - [yetone/openai-translator](https://github.com/yetone/openai-translator) \u003cimg src=\"https://img.shields.io/github/stars/yetone/openai-translator?style=social\"/\u003e : The translator that does more than just translation - powered by OpenAI.\r\n\r\n            - [m1guelpf/browser-agent](https://github.com/m1guelpf/browser-agent) \u003cimg src=\"https://img.shields.io/github/stars/m1guelpf/browser-agent?style=social\"/\u003e : A browser AI agent, using GPT-4. [docs.rs/browser-agent](https://docs.rs/browser-agent/latest/browser_agent/)\r\n\r\n            - [sigoden/aichat](https://github.com/sigoden/aichat) \u003cimg src=\"https://img.shields.io/github/stars/sigoden/aichat?style=social\"/\u003e : Using ChatGPT/GPT-3.5/GPT-4 in the terminal.\r\n\r\n            - [uiuifree/rust-openai-chatgpt-api](https://github.com/uiuifree/rust-openai-chatgpt-api) \u003cimg src=\"https://img.shields.io/github/stars/uiuifree/rust-openai-chatgpt-api?style=social\"/\u003e : \"rust-openai-chatgpt-api\" is a Rust library for accessing the ChatGPT API, a powerful NLP platform by OpenAI. The library provides a simple and efficient interface for sending requests and receiving responses, including chat. 
It uses reqwest and serde for HTTP requests and JSON serialization.\r\n\r\n            - [1595901624/gpt-aggregated-edition](https://github.com/1595901624/gpt-aggregated-edition) \u003cimg src=\"https://img.shields.io/github/stars/1595901624/gpt-aggregated-edition?style=social\"/\u003e : Aggregates multiple platforms, including the official ChatGPT, free ChatGPT, ERNIE Bot, Poe and chatchat, with support for importing custom platforms.\r\n\r\n            - [Cormanz/smartgpt](https://github.com/Cormanz/smartgpt) \u003cimg src=\"https://img.shields.io/github/stars/Cormanz/smartgpt?style=social\"/\u003e : A program that provides LLMs with the ability to complete complex tasks using plugins.\r\n\r\n            - [femtoGPT](https://github.com/keyvank/femtoGPT) \u003cimg src=\"https://img.shields.io/github/stars/keyvank/femtoGPT?style=social\"/\u003e : femtoGPT is a pure Rust implementation of a minimal Generative Pretrained Transformer. [discord.gg/wTJFaDVn45](https://discord.gg/wTJFaDVn45)\r\n\r\n            - [shafishlabs/llmchain-rs](https://github.com/shafishlabs/llmchain-rs) \u003cimg src=\"https://img.shields.io/github/stars/shafishlabs/llmchain-rs?style=social\"/\u003e : 🦀Rust + Large Language Models - Make AI Services Freely and Easily. Inspired by LangChain.\r\n\r\n            - [flaneur2020/llama2.rs](https://github.com/flaneur2020/llama2.rs) \u003cimg src=\"https://img.shields.io/github/stars/flaneur2020/llama2.rs?style=social\"/\u003e : A Rust reimplementation of [https://github.com/karpathy/llama2.c](https://github.com/karpathy/llama2.c).\r\n\r\n            - [Heng30/chatbox](https://github.com/Heng30/chatbox) \u003cimg src=\"https://img.shields.io/github/stars/Heng30/chatbox?style=social\"/\u003e : A Chatbot for OpenAI ChatGPT. 
Based on Slint-ui and Rust.\r\n\r\n            - [fairjm/dioxus-openai-qa-gui](https://github.com/fairjm/dioxus-openai-qa-gui) \u003cimg src=\"https://img.shields.io/github/stars/fairjm/dioxus-openai-qa-gui?style=social\"/\u003e : A simple OpenAI QA desktop app built with Dioxus.\r\n\r\n            - [purton-tech/bionicgpt](https://github.com/purton-tech/bionicgpt) \u003cimg src=\"https://img.shields.io/github/stars/purton-tech/bionicgpt?style=social\"/\u003e : Accelerate LLM adoption in your organisation. Chat with your confidential data safely and securely. [bionic-gpt.com](https://bionic-gpt.com/)\r\n\r\n\r\n\r\n\r\n        - ##### Zig Implementation\r\n\r\n            - [llama2.zig](https://github.com/cgbur/llama2.zig) \u003cimg src=\"https://img.shields.io/github/stars/cgbur/llama2.zig?style=social\"/\u003e : Inference Llama 2 in one file of pure Zig.\r\n\r\n            - [renerocksai/gpt4all.zig](https://github.com/renerocksai/gpt4all.zig) \u003cimg src=\"https://img.shields.io/github/stars/renerocksai/gpt4all.zig?style=social\"/\u003e : ZIG build for a terminal-based chat client for an assistant-style large language model with ~800k GPT-3.5-Turbo Generations based on LLaMa.\r\n\r\n            - [EugenHotaj/zig_inference](https://github.com/EugenHotaj/zig_inference) \u003cimg src=\"https://img.shields.io/github/stars/EugenHotaj/zig_inference?style=social\"/\u003e : Neural Network Inference Engine in Zig.\r\n\r\n\r\n        - ##### Go Implementation\r\n\r\n            - [Ollama](https://github.com/ollama/ollama/) \u003cimg src=\"https://img.shields.io/github/stars/ollama/ollama?style=social\"/\u003e : Get up and running with Llama 2, Mistral, Gemma, and other large language models. [ollama.com](https://ollama.com/)\r\n\r\n\r\n            - [go-skynet/LocalAI](https://github.com/go-skynet/LocalAI) \u003cimg src=\"https://img.shields.io/github/stars/go-skynet/LocalAI?style=social\"/\u003e : 🤖 Self-hosted, community-driven, local OpenAI-compatible API. 
Drop-in replacement for OpenAI running LLMs on consumer-grade hardware. Free Open Source OpenAI alternative. No GPU required. LocalAI is an API to run ggml compatible models: llama, gpt4all, rwkv, whisper, vicuna, koala, gpt4all-j, cerebras, falcon, dolly, starcoder, and many others. [localai.io](https://localai.io/)\r\n\r\n\r\n\r\n    - #### Distributed and Multi-GPU Framework\r\n      ##### 分布式以及多GPU框架\r\n\r\n        - [NVIDIA/nccl](https://github.com/NVIDIA/nccl) \u003cimg src=\"https://img.shields.io/github/stars/NVIDIA/nccl?style=social\"/\u003e : Optimized primitives for collective multi-GPU communication.\r\n\r\n        - [NVIDIA/multi-gpu-programming-models](https://github.com/NVIDIA/multi-gpu-programming-models) \u003cimg src=\"https://img.shields.io/github/stars/NVIDIA/multi-gpu-programming-models?style=social\"/\u003e : Examples demonstrating available options to program multiple GPUs in a single node or a cluster.\r\n\r\n        - [wilicc/gpu-burn](https://github.com/wilicc/gpu-burn) \u003cimg src=\"https://img.shields.io/github/stars/wilicc/gpu-burn?style=social\"/\u003e : Multi-GPU CUDA stress test.\r\n\r\n        - [SCUDA](https://github.com/kevmo314/scuda) \u003cimg src=\"https://img.shields.io/github/stars/kevmo314/scuda?style=social\"/\u003e : SCUDA: GPU-over-IP. 
SCUDA is a GPU over IP bridge allowing GPUs on remote machines to be attached to CPU-only machines.\r\n\r\n\r\n\r\n\r\n    - #### Robotics Framework\r\n      ##### 机器人框架\r\n\r\n\r\n        - [Cupoch](https://github.com/neka-nat/cupoch) \u003cimg src=\"https://img.shields.io/github/stars/neka-nat/cupoch?style=social\"/\u003e : Robotics with GPU computing.\r\n\r\n\r\n\r\n    - #### ZKP and Web3 Framework\r\n      ##### 零知识证明和Web3框架\r\n\r\n        - [Tachyon](https://github.com/kroma-network/tachyon) \u003cimg src=\"https://img.shields.io/github/stars/kroma-network/tachyon?style=social\"/\u003e : Modular ZK(Zero Knowledge) backend accelerated by GPU.\r\n\r\n        - [Blitzar](https://github.com/spaceandtimelabs/blitzar) \u003cimg src=\"https://img.shields.io/github/stars/spaceandtimelabs/blitzar?style=social\"/\u003e : Zero-knowledge proof acceleration with GPUs for C++ and Rust. [www.spaceandtime.io/](https://www.spaceandtime.io/)\r\n\r\n        - [blitzar-rs](https://github.com/spaceandtimelabs/blitzar-rs) \u003cimg src=\"https://img.shields.io/github/stars/spaceandtimelabs/blitzar-rs?style=social\"/\u003e : High-Level Rust wrapper for the blitzar-sys crate. 
[www.spaceandtime.io/](https://www.spaceandtime.io/)\r\n\r\n        - [ICICLE](https://github.com/ingonyama-zk/icicle) \u003cimg src=\"https://img.shields.io/github/stars/ingonyama-zk/icicle?style=social\"/\u003e : ICICLE is a library for ZK acceleration using CUDA-enabled GPUs.\r\n\r\n\r\n\r\n\r\n  - ### Triton Frameworks\r\n\r\n    - #### Triton Machine Learning Framework\r\n\r\n        - [BobMcDear/attorch](https://github.com/BobMcDear/attorch) \u003cimg src=\"https://img.shields.io/github/stars/BobMcDear/attorch?style=social\"/\u003e : A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton.\r\n\r\n\r\n\r\n    - #### Triton High Performance Kernel Library\r\n\r\n        - [Liger-Kernel](https://github.com/linkedin/Liger-Kernel) \u003cimg src=\"https://img.shields.io/github/stars/linkedin/Liger-Kernel?style=social\"/\u003e : Efficient Triton Kernels for LLM Training. [arxiv.org/pdf/2410.10989](https://arxiv.org/pdf/2410.10989)\r\n\r\n        - [FlagGems](https://github.com/FlagOpen/FlagGems) \u003cimg src=\"https://img.shields.io/github/stars/FlagOpen/FlagGems?style=social\"/\u003e : FlagGems is a high-performance general operator library implemented in [OpenAI Triton](https://github.com/openai/triton). It aims to provide a suite of kernel functions to accelerate LLM training and inference.\r\n\r\n        - [linxihui/dkernel](https://github.com/linxihui/dkernel) \u003cimg src=\"https://img.shields.io/github/stars/linxihui/dkernel?style=social\"/\u003e : This repo contains customized CUDA kernels written in OpenAI Triton. As of now, it contains the sparse attention kernel used in [phi-3-small models](https://huggingface.co/microsoft/Phi-3-small-8k-instruct). 
The sparse attention is also supported in vLLM for efficient inference.\r\n\r\n\r\n\r\n    - #### Triton Inference Framework\r\n\r\n        - [harleyszhang/lite_llama](https://github.com/harleyszhang/lite_llama) \u003cimg src=\"https://img.shields.io/github/stars/harleyszhang/lite_llama?style=social\"/\u003e : The llama model inference lite framework by triton.\r\n\r\n\r\n  - ### MLIR Frameworks\r\n\r\n    - #### MLIR GPU Programming\r\n\r\n        - ['gpu' Dialect](https://mlir.llvm.org/docs/Dialects/GPU/) : This dialect provides middle-level abstractions for launching GPU kernels following a programming model similar to that of CUDA or OpenCL.\r\n\r\n        - ['amdgpu' Dialect](https://mlir.llvm.org/docs/Dialects/AMDGPU/) : The AMDGPU dialect provides wrappers around AMD-specific functionality and LLVM intrinsics.\r\n\r\n\r\n\r\n    - #### MLIR FFI Bindings\r\n\r\n        - [pyMLIR](https://github.com/spcl/pymlir) \u003cimg src=\"https://img.shields.io/github/stars/spcl/pymlir?style=social\"/\u003e : Python interface for MLIR - the Multi-Level Intermediate Representation. pyMLIR is a full Python interface to parse, process, and output [MLIR](https://mlir.llvm.org/) files according to the syntax described in the [MLIR documentation](https://github.com/llvm/llvm-project/tree/master/mlir/docs). 
pyMLIR supports the basic dialects and can be extended with other dialects.\r\n\r\n\r\n    - #### MLIR Machine Learning Framework\r\n\r\n        - [Torch-MLIR](https://github.com/llvm/torch-mlir) \u003cimg src=\"https://img.shields.io/github/stars/llvm/torch-mlir?style=social\"/\u003e : The Torch-MLIR project aims to provide first class support from the PyTorch ecosystem to the MLIR ecosystem.\r\n\r\n        - [ONNX-MLIR](https://github.com/onnx/onnx-mlir) \u003cimg src=\"https://img.shields.io/github/stars/onnx/onnx-mlir?style=social\"/\u003e : Representation and Reference Lowering of ONNX Models in MLIR Compiler Infrastructure.\r\n\r\n        - [TPU-MLIR](https://github.com/sophgo/tpu-mlir) \u003cimg src=\"https://img.shields.io/github/stars/sophgo/tpu-mlir?style=social\"/\u003e : Machine learning compiler based on MLIR for Sophgo TPU. TPU-MLIR is an open-source machine-learning compiler based on MLIR for TPU. This project provides a complete toolchain, which can convert pre-trained neural networks from different frameworks into binary files bmodel that can be efficiently operated on TPUs.\r\n\r\n        - [IREE](https://github.com/iree-org/iree) \u003cimg src=\"https://img.shields.io/github/stars/iree-org/iree?style=social\"/\u003e : IREE: Intermediate Representation Execution Environment. A retargetable MLIR-based machine learning compiler and runtime toolkit. [iree.dev/](http://iree.dev/)\r\n\r\n        - [ByteIR](https://github.com/bytedance/byteir) \u003cimg src=\"https://img.shields.io/github/stars/bytedance/byteir?style=social\"/\u003e : The ByteIR Project is a ByteDance model compilation solution. ByteIR includes compiler, runtime, and frontends, and provides an end-to-end model compilation solution. [byteir.ai](https://byteir.ai/)\r\n\r\n        - [Xilinx/mlir-aie](https://github.com/Xilinx/mlir-aie) \u003cimg src=\"https://img.shields.io/github/stars/Xilinx/mlir-aie?style=social\"/\u003e : An MLIR-based toolchain for AMD AI Engine-enabled devices. 
This repository contains an MLIR-based toolchain for AI Engine-enabled devices, such as [AMD Ryzen™ AI](https://www.amd.com/en/products/processors/consumer/ryzen-ai.html) and [Versal™](https://www.xilinx.com/products/technology/ai-engine.html).\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n  - ### HPC Frameworks\r\n\r\n    - [BLAS](https://www.netlib.org/blas/) : BLAS (Basic Linear Algebra Subprograms). The BLAS (Basic Linear Algebra Subprograms) are routines that provide standard building blocks for performing basic vector and matrix operations. The Level 1 BLAS perform scalar, vector and vector-vector operations, the Level 2 BLAS perform matrix-vector operations, and the Level 3 BLAS perform matrix-matrix operations.\r\n\r\n    - [LAPACK](https://github.com/Reference-LAPACK/lapack) \u003cimg src=\"https://img.shields.io/github/stars/Reference-LAPACK/lapack?style=social\"/\u003e : LAPACK development repository. [LAPACK](https://www.netlib.org/lapack/) — Linear Algebra PACKage. LAPACK is written in Fortran 90 and provides routines for solving systems of simultaneous linear equations, least-squares solutions of linear systems of equations, eigenvalue problems, and singular value problems. The associated matrix factorizations (LU, Cholesky, QR, SVD, Schur, generalized Schur) are also provided, as are related computations such as reordering of the Schur factorizations and estimating condition numbers. Dense and banded matrices are handled, but not general sparse matrices. In all areas, similar functionality is provided for real and complex matrices, in both single and double precision.\r\n\r\n    - [OpenBLAS](https://github.com/OpenMathLib/OpenBLAS) \u003cimg src=\"https://img.shields.io/github/stars/OpenMathLib/OpenBLAS?style=social\"/\u003e : OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version. 
[www.openblas.net](http://www.openblas.net/)\r\n\r\n    - [BLIS](https://github.com/flame/blis) \u003cimg src=\"https://img.shields.io/github/stars/flame/blis?style=social\"/\u003e : BLAS-like Library Instantiation Software Framework.\r\n\r\n    - [NumPy](https://github.com/numpy/numpy) \u003cimg src=\"https://img.shields.io/github/stars/numpy/numpy?style=social\"/\u003e : The fundamental package for scientific computing with Python. [numpy.org](https://numpy.org/)\r\n\r\n    - [SciPy](https://github.com/scipy/scipy) \u003cimg src=\"https://img.shields.io/github/stars/scipy/scipy?style=social\"/\u003e : SciPy library main repository. SciPy (pronounced \"Sigh Pie\") is an open-source software for mathematics, science, and engineering. It includes modules for statistics, optimization, integration, linear algebra, Fourier transforms, signal and image processing, ODE solvers, and more. [scipy.org](https://scipy.org/)\r\n\r\n    - [Gonum](https://github.com/gonum/gonum) \u003cimg src=\"https://img.shields.io/github/stars/gonum/gonum?style=social\"/\u003e : Gonum is a set of numeric libraries for the Go programming language. It contains libraries for matrices, statistics, optimization, and more. [www.gonum.org/](https://www.gonum.org/)\r\n\r\n    - [YichengDWu/matmul.mojo](https://github.com/YichengDWu/matmul.mojo) \u003cimg src=\"https://img.shields.io/github/stars/YichengDWu/matmul.mojo?style=social\"/\u003e : High Performance Matrix Multiplication in Pure Mojo 🔥. 
Matmul.🔥 is a high performance multi-threaded implementation of the [BLIS](https://en.wikipedia.org/wiki/BLIS_(software)) algorithm in pure Mojo 🔥.\r\n\r\n\r\n\r\n\r\n## Applications\r\n\r\n  - ### CUDA Applications\r\n\r\n\r\n    - #### Image Preprocess\r\n\r\n        - [emptysoal/cuda-image-preprocess](https://github.com/emptysoal/cuda-image-preprocess) \u003cimg src=\"https://img.shields.io/github/stars/emptysoal/cuda-image-preprocess?style=social\"/\u003e : Speed up image preprocessing with CUDA when handling images or running TensorRT inference.\r\n\r\n\r\n\r\n    - #### Object Detection\r\n\r\n        - [laugh12321/TensorRT-YOLO](https://github.com/laugh12321/TensorRT-YOLO) \u003cimg src=\"https://img.shields.io/github/stars/laugh12321/TensorRT-YOLO?style=social\"/\u003e : 🚀 TensorRT-YOLO: an inference acceleration project supporting YOLOv3, YOLOv5, YOLOv6, YOLOv7, YOLOv8, YOLOv9, YOLOv10, PP-YOLOE and PP-YOLOE+, optimized with NVIDIA TensorRT. It integrates the EfficientNMS TensorRT plugin to improve post-processing and uses CUDA kernels to accelerate preprocessing. TensorRT-YOLO supports both C++ and Python inference and aims to provide a fast, optimized object detection solution.\r\n\r\n        - [l-sf/Linfer](https://github.com/l-sf/Linfer) \u003cimg src=\"https://img.shields.io/github/stars/l-sf/Linfer?style=social\"/\u003e : A high-performance C++ inference library based on TensorRT, supporting YOLOv10, YOLOPv2, YOLOv5/7/X/8, RT-DETR, and the single-object trackers OSTrack and LightTrack.\r\n\r\n        - [Melody-Zhou/tensorRT_Pro-YOLOv8](https://github.com/Melody-Zhou/tensorRT_Pro-YOLOv8) \u003cimg src=\"https://img.shields.io/github/stars/Melody-Zhou/tensorRT_Pro-YOLOv8?style=social\"/\u003e : This repository is based on [shouxieai/tensorRT_Pro](https://github.com/shouxieai/tensorRT_Pro), with adjustments to support YOLOv8. 
It currently supports high-performance inference for YOLOv8, YOLOv8-Cls, YOLOv8-Seg, YOLOv8-OBB, YOLOv8-Pose, RT-DETR, ByteTrack, YOLOv9, YOLOv10 and RTMO! 🚀🚀🚀\r\n\r\n        - [shouxieai/tensorRT_Pro](https://github.com/shouxieai/tensorRT_Pro) \u003cimg src=\"https://img.shields.io/github/stars/shouxieai/tensorRT_Pro?style=social\"/\u003e : C++ library based on TensorRT integration.\r\n\r\n        - [shouxieai/infer](https://github.com/shouxieai/infer) \u003cimg src=\"https://img.shields.io/github/stars/shouxieai/infer?style=social\"/\u003e : A new TensorRT integration that makes it easy to integrate many tasks.\r\n\r\n        - [kalfazed/tensorrt_starter](https://github.com/kalfazed/tensorrt_starter) \u003cimg src=\"https://img.shields.io/github/stars/kalfazed/tensorrt_starter?style=social\"/\u003e : This repository gives a guideline for learning CUDA and TensorRT from the beginning.\r\n\r\n        - [hamdiboukamcha/yolov10-tensorrt](https://github.com/hamdiboukamcha/yolov10-tensorrt) \u003cimg src=\"https://img.shields.io/github/stars/hamdiboukamcha/yolov10-tensorrt?style=social\"/\u003e : YOLOv10 C++ TensorRT : Real-Time End-to-End Object Detection.\r\n\r\n        - [triple-Mu/YOLOv8-TensorRT](https://github.com/triple-Mu/YOLOv8-TensorRT) \u003cimg src=\"https://img.shields.io/github/stars/triple-Mu/YOLOv8-TensorRT?style=social\"/\u003e : YOLOv8 accelerated with TensorRT!\r\n\r\n        - [FeiYull/TensorRT-Alpha](https://github.com/FeiYull/TensorRT-Alpha) \u003cimg src=\"https://img.shields.io/github/stars/FeiYull/TensorRT-Alpha?style=social\"/\u003e : 🔥🔥🔥TensorRT for YOLOv8, YOLOv8-Pose, YOLOv8-Seg, YOLOv8-Cls, YOLOv7, YOLOv6, YOLOv5, YOLONAS......🚀🚀🚀CUDA IS ALL YOU NEED.🍎🍎🍎\r\n\r\n        - [cyrusbehr/YOLOv8-TensorRT-CPP](https://github.com/cyrusbehr/YOLOv8-TensorRT-CPP) \u003cimg src=\"https://img.shields.io/github/stars/cyrusbehr/YOLOv8-TensorRT-CPP?style=social\"/\u003e : YOLOv8 TensorRT C++ Implementation. 
A C++ implementation of YOLOv8 using TensorRT. Supports object detection, semantic segmentation, and body pose estimation.\r\n\r\n        - [NVIDIA-AI-IOT/torch2trt](https://github.com/NVIDIA-AI-IOT/torch2trt) \u003cimg src=\"https://img.shields.io/github/stars/NVIDIA-AI-IOT/torch2trt?style=social\"/\u003e : An easy to use PyTorch to TensorRT converter.\r\n\r\n        - [zhiqwang/yolort](https://github.com/zhiqwang/yolort) \u003cimg src=\"https://img.shields.io/github/stars/zhiqwang/yolort?style=social\"/\u003e : yolort is a runtime stack for yolov5 on specialized accelerators such as tensorrt, libtorch, onnxruntime, tvm and ncnn. [zhiqwang.com/yolort](https://zhiqwang.com/yolort/)\r\n\r\n        - [Linaom1214/TensorRT-For-YOLO-Series](https://github.com/Linaom1214/TensorRT-For-YOLO-Series) \u003cimg src=\"https://img.shields.io/github/stars/Linaom1214/TensorRT-For-YOLO-Series?style=social\"/\u003e : YOLO Series TensorRT Python/C++. tensorrt for yolo series (YOLOv8, YOLOv7, YOLOv6....), nms plugin support.\r\n\r\n        - [wang-xinyu/tensorrtx](https://github.com/wang-xinyu/tensorrtx) \u003cimg src=\"https://img.shields.io/github/stars/wang-xinyu/tensorrtx?style=social\"/\u003e : TensorRTx aims to implement popular deep learning networks with tensorrt network definition APIs.\r\n\r\n\r\n        - [DefTruth/lite.ai.toolkit](https://github.com/DefTruth/lite.ai.toolkit) \u003cimg src=\"https://img.shields.io/github/stars/DefTruth/lite.ai.toolkit?style=social\"/\u003e : 🛠 A lite C++ toolkit of awesome AI models with ONNXRuntime, NCNN, MNN and TNN. YOLOX, YOLOP, YOLOv6, YOLOR, MODNet, YOLOv7, YOLOv5. MNN, NCNN, TNN, ONNXRuntime. 
“🛠Lite.Ai.ToolKit: a lightweight C++ AI model toolkit that is user-friendly (reasonably so) and works out of the box. It already includes 100+ popular open-source models. This C++ toolkit was put together out of personal interest and covers object detection, face detection, face recognition, semantic segmentation, matting and other areas.”\r\n\r\n        - [PaddlePaddle/FastDeploy](https://github.com/PaddlePaddle/FastDeploy) \u003cimg src=\"https://img.shields.io/github/stars/PaddlePaddle/FastDeploy?style=social\"/\u003e : ⚡️An Easy-to-use and Fast Deep Learning Model Deployment Toolkit for ☁️Cloud 📱Mobile and 📹Edge. Including Image, Video, Text and Audio 20+ main stream scenarios and 150+ SOTA models with end-to-end optimization, multi-platform and multi-framework support.\r\n\r\n        - [enazoe/yolo-tensorrt](https://github.com/enazoe/yolo-tensorrt) \u003cimg src=\"https://img.shields.io/github/stars/enazoe/yolo-tensorrt?style=social\"/\u003e : TensorRT 8. Supports YOLOv5 n, s, m, l, x. darknet -\u003e tensorrt. YOLOv4 and YOLOv3 use raw darknet *.weights and *.cfg files. If the wrapper is useful to you, please star it.\r\n\r\n        - [guojianyang/cv-detect-robot](https://github.com/guojianyang/cv-detect-robot) \u003cimg src=\"https://img.shields.io/github/stars/guojianyang/cv-detect-robot?style=social\"/\u003e : 🔥🔥🔥🔥🔥🔥Docker NVIDIA Docker2 YOLOV5 YOLOX YOLO Deepsort TensorRT ROS Deepstream Jetson Nano TX2 NX for high-performance deployment.\r\n\r\n        - [BlueMirrors/Yolov5-TensorRT](https://github.com/BlueMirrors/Yolov5-TensorRT) \u003cimg src=\"https://img.shields.io/github/stars/BlueMirrors/Yolov5-TensorRT?style=social\"/\u003e : Yolov5 TensorRT Implementations.\r\n\r\n        - [lewes6369/TensorRT-Yolov3](https://github.com/lewes6369/TensorRT-Yolov3) \u003cimg src=\"https://img.shields.io/github/stars/lewes6369/TensorRT-Yolov3?style=social\"/\u003e : TensorRT for Yolov3.\r\n\r\n        - [CaoWGG/TensorRT-YOLOv4](https://github.com/CaoWGG/TensorRT-YOLOv4) \u003cimg src=\"https://img.shields.io/github/stars/CaoWGG/TensorRT-YOLOv4?style=social\"/\u003e : tensorrt5, yolov4, yolov3, yolov3-tiny, yolov3-tiny-prn.\r\n\r\n        - 
[isarsoft/yolov4-triton-tensorrt](https://github.com/isarsoft/yolov4-triton-tensorrt) \u003cimg src=\"https://img.shields.io/github/stars/isarsoft/yolov4-triton-tensorrt?style=social\"/\u003e : YOLOv4 on Triton Inference Server with TensorRT.\r\n\r\n        - [TrojanXu/yolov5-tensorrt](https://github.com/TrojanXu/yolov5-tensorrt) \u003cimg src=\"https://img.shields.io/github/stars/TrojanXu/yolov5-tensorrt?style=social\"/\u003e : A tensorrt implementation of yolov5.\r\n\r\n        - [tjuskyzhang/Scaled-YOLOv4-TensorRT](https://github.com/tjuskyzhang/Scaled-YOLOv4-TensorRT) \u003cimg src=\"https://img.shields.io/github/stars/tjuskyzhang/Scaled-YOLOv4-TensorRT?style=social\"/\u003e : Implement yolov4-tiny-tensorrt, yolov4-csp-tensorrt, yolov4-large-tensorrt(p5, p6, p7) layer by layer using TensorRT API.\r\n\r\n        - [Syencil/tensorRT](https://github.com/Syencil/tensorRT) \u003cimg src=\"https://img.shields.io/github/stars/Syencil/tensorRT?style=social\"/\u003e : TensorRT-7 Network Lib, covering common object detection, keypoint detection, face detection, OCR, and more; can be trained on your own data.\r\n\r\n        - [SeanAvery/yolov5-tensorrt](https://github.com/SeanAvery/yolov5-tensorrt) \u003cimg src=\"https://img.shields.io/github/stars/SeanAvery/yolov5-tensorrt?style=social\"/\u003e : YOLOv5 in TensorRT.\r\n\r\n        - [Monday-Leo/YOLOv7_Tensorrt](https://github.com/Monday-Leo/YOLOv7_Tensorrt) \u003cimg src=\"https://img.shields.io/github/stars/Monday-Leo/YOLOv7_Tensorrt?style=social\"/\u003e : A simple implementation of Tensorrt YOLOv7.\r\n\r\n        - [ibaiGorordo/ONNX-YOLOv6-Object-Detection](https://github.com/ibaiGorordo/ONNX-YOLOv6-Object-Detection) \u003cimg src=\"https://img.shields.io/github/stars/ibaiGorordo/ONNX-YOLOv6-Object-Detection?style=social\"/\u003e : Python scripts performing object detection using the YOLOv6 model in ONNX.\r\n\r\n        - [ibaiGorordo/ONNX-YOLOv7-Object-Detection](https://github.com/ibaiGorordo/ONNX-YOLOv7-Object-Detection) \u003cimg 
src=\"https://img.shields.io/github/stars/ibaiGorordo/ONNX-YOLOv7-Object-Detection?style=social\"/\u003e : Python scripts performing object detection using the YOLOv7 model in ONNX.\r\n\r\n        - [triple-Mu/yolov7](https://github.com/triple-Mu/yolov7) \u003cimg src=\"https://img.shields.io/github/stars/triple-Mu/yolov7?style=social\"/\u003e : End2end TensorRT YOLOv7.\r\n\r\n        - [hewen0901/yolov7_trt](https://github.com/hewen0901/yolov7_trt) \u003cimg src=\"https://img.shields.io/github/stars/hewen0901/yolov7_trt?style=social\"/\u003e : C++ TensorRT deployment code for the yolov7 object detection algorithm.\r\n\r\n        - [tsutof/tiny_yolov2_onnx_cam](https://github.com/tsutof/tiny_yolov2_onnx_cam) \u003cimg src=\"https://img.shields.io/github/stars/tsutof/tiny_yolov2_onnx_cam?style=social\"/\u003e : Tiny YOLO v2 Inference Application with NVIDIA TensorRT.\r\n\r\n        - [Monday-Leo/Yolov5_Tensorrt_Win10](https://github.com/Monday-Leo/Yolov5_Tensorrt_Win10) \u003cimg src=\"https://img.shields.io/github/stars/Monday-Leo/Yolov5_Tensorrt_Win10?style=social\"/\u003e : A simple implementation of tensorrt yolov5 python/c++🔥\r\n\r\n        - [Wulingtian/yolov5_tensorrt_int8](https://github.com/Wulingtian/yolov5_tensorrt_int8) \u003cimg src=\"https://img.shields.io/github/stars/Wulingtian/yolov5_tensorrt_int8?style=social\"/\u003e : TensorRT int8 quantized deployment of the yolov5s model, measured at 3.3 ms per frame!\r\n\r\n        - [Wulingtian/yolov5_tensorrt_int8_tools](https://github.com/Wulingtian/yolov5_tensorrt_int8_tools) \u003cimg src=\"https://img.shields.io/github/stars/Wulingtian/yolov5_tensorrt_int8_tools?style=social\"/\u003e : tensorrt int8 quantization of yolov5 onnx models.\r\n\r\n        - [MadaoFY/yolov5_TensorRT_inference](https://github.com/MadaoFY/yolov5_TensorRT_inference) \u003cimg src=\"https://img.shields.io/github/stars/MadaoFY/yolov5_TensorRT_inference?style=social\"/\u003e : Code for TensorRT quantization and inference of yolov5, tested to run on the Jetson platform.\r\n\r\n        - [ibaiGorordo/ONNX-YOLOv8-Object-Detection](https://github.com/ibaiGorordo/ONNX-YOLOv8-Object-Detection) 
\u003cimg src=\"https://img.shields.io/github/stars/ibaiGorordo/ONNX-YOLOv8-Object-Detection?style=social\"/\u003e : Python scripts performing object detection using the YOLOv8 model in ONNX.\r\n\r\n        - [we0091234/yolov8-tensorrt](https://github.com/we0091234/yolov8-tensorrt) \u003cimg src=\"https://img.shields.io/github/stars/we0091234/yolov8-tensorrt?style=social\"/\u003e : yolov8 tensorrt acceleration.\r\n\r\n        - [FeiYull/yolov8-tensorrt](https://github.com/FeiYull/yolov8-tensorrt) \u003cimg src=\"https://img.shields.io/github/stars/FeiYull/yolov8-tensorrt?style=social\"/\u003e : TensorRT+CUDA accelerated deployment of YOLOv8; the code runs on Windows and Linux.\r\n\r\n        - [cvdong/YOLO_TRT_SIM](https://github.com/cvdong/YOLO_TRT_SIM) \u003cimg src=\"https://img.shields.io/github/stars/cvdong/YOLO_TRT_SIM?style=social\"/\u003e : 🐇 A single codebase supporting TRT inference for YOLO X, V5, V6, V7, V8 ™️ 🔝, with pre- and post-processing both implemented as CUDA kernels. CPP/CUDA🚀\r\n\r\n        - [cvdong/YOLO_TRT_PY](https://github.com/cvdong/YOLO_TRT_PY) \u003cimg src=\"https://img.shields.io/github/stars/cvdong/YOLO_TRT_PY?style=social\"/\u003e : 🐰 A single codebase supporting TRT inference for YOLOV5, V6, V7, V8 ™️ PYTHON ✈️\r\n\r\n        - [Psynosaur/Jetson-SecVision](https://github.com/Psynosaur/Jetson-SecVision) \u003cimg src=\"https://img.shields.io/github/stars/Psynosaur/Jetson-SecVision?style=social\"/\u003e : Person detection for Hikvision DVR with AlarmIO ports, uses TensorRT and yolov4.\r\n\r\n        - [tatsuya-fukuoka/yolov7-onnx-infer](https://github.com/tatsuya-fukuoka/yolov7-onnx-infer) \u003cimg src=\"https://img.shields.io/github/stars/tatsuya-fukuoka/yolov7-onnx-infer?style=social\"/\u003e : Inference with yolov7's onnx model.\r\n\r\n        - [ervgan/yolov5_tensorrt_inference](https://github.com/ervgan/yolov5_tensorrt_inference) 
\u003cimg src=\"https://img.shields.io/github/stars/ervgan/yolov5_tensorrt_inference?style=social\"/\u003e : TensorRT cpp inference for Yolov5 model. Supports yolov5 v1.0, v2.0, v3.0, v3.1, v4.0, v5.0, v6.0, v6.2, v7.0.\r\n\r\n        - [AlbinZhu/easy-trt](https://github.com/AlbinZhu/easy-trt) \u003cimg src=\"https://img.shields.io/github/stars/AlbinZhu/easy-trt?style=social\"/\u003e : TensorRT for YOLOv10 with CUDA.\r\n\r\n\r\n\r\n\r\n\r\n## Blogs\r\n\r\n  - ### CUDA and TensorRT Blogs\r\n\r\n    - WeChat Official Account「NVIDIA英伟达」\r\n        - [2023-10-27, Now publicly available! Optimize large language model inference with NVIDIA TensorRT-LLM](https://mp.weixin.qq.com/s/QaSbvyAmI6XXtr0y6W4LNQ)\r\n        - [2023-11-24, Deploying large language models at the edge with the NVIDIA IGX Orin Developer Kit](https://mp.weixin.qq.com/s/TOTVc5ntQJfH-DJ4_8uNTQ)\r\n        - [2024-06-03, COMPUTEX 2024 | “Accelerate Everything”: NVIDIA CEO Jensen Huang delivers a keynote ahead of the COMPUTEX opening](https://mp.weixin.qq.com/s/usHo79-ssQiX0Rt5dvJ-sQ)\r\n        - [2024-06-19, NVIDIA CEO Jensen Huang's message to graduates: “Keep faith in the unconventional and the unexplored”](https://mp.weixin.qq.com/s/L8Lv6pz9BIgzLdm6qZm6dQ)\r\n    - WeChat Official Account「NVIDIA英伟达企业解决方案」\r\n        - [2024-04-24, FP8 training and inference on NVIDIA GPU architectures](https://mp.weixin.qq.com/s/KV4XC9WT-8mfpmEzflIuvw)\r\n        - [2024-06-14, Startup Acceleration Program | Built on the NVIDIA Jetson platform, 国讯芯微 achieves end-to-end coordinated control of the robot's “cerebrum” and “cerebellum”](https://mp.weixin.qq.com/s/R7U5JUgUCMK4rvtIpgStKQ)\r\n        - [2024-06-20, NVIDIA Isaac Sim 4.0 and NVIDIA Isaac Lab give robotics workflows and simulation a powerful boost](https://mp.weixin.qq.com/s/BYqLDexhHnPMVsQMPLWpOA)\r\n        - [2024-06-21, Closing the gap between simulation and reality: training Spot quadruped robot locomotion with NVIDIA Isaac Lab](https://mp.weixin.qq.com/s/Nb4oMxijBofiidSAHkafag)\r\n        - [2024-07-01, NVIDIA end-to-end solutions help Li Auto build intelligent driving experiences and personalized in-cabin spaces](https://mp.weixin.qq.com/s/gmkYFj5BcJZHO4GJ_b8pyQ)\r\n        - [2024-11-27, The NVIDIA TensorRT-LLM Roadmap is now publicly available on GitHub!](https://mp.weixin.qq.com/s/zqAkxmWinwNMbcIBVA1hnA)\r\n    - 
WeChat Official Account「AI不止","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcoderonion%2Fawesome-cuda-and-hpc","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcoderonion%2Fawesome-cuda-and-hpc","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcoderonion%2Fawesome-cuda-and-hpc/lists"}