Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
awesome-cuda-triton-hpc
🔥🔥🔥 A collection of some awesome public CUDA, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR and High Performance Computing (HPC) projects.
https://github.com/coderonion/awesome-cuda-triton-hpc
-
Blogs
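- 2024-05-21,Parallel computing: the magician behind the super brain
- 2024-10-09,An in-depth look at the key techniques of TensorRT-LLM (to be continued)
- 2024-11-27,The NVIDIA TensorRT-LLM roadmap is now public on GitHub!
- 2024-04-21,Understanding NVIDIA GPU performance metrics, an easily confused concept: Utilization vs Saturation
- 2024-08-06,How to push PyTorch GPU utilization to 100%?
- 2024-08-13,A first look at TensorRT-LLM (3): deployment best practices
- 2024-07-19,CUDA-MODE Lecture 1 follow-up practice (part 2)
- 2024-11-29,Running LLMs with TensorRT-LLM on the Nvidia Jetson AGX Orin
- 2024-03-24,General matrix multiplication in CUDA: from beginner to proficient!
- 2024-04-09,GPT-2 handwritten in pure C: new project from former OpenAI and Tesla executive takes off
- 2023-09-06,Low-level GPU optimization: how to make Transformers run faster on GPUs?
- 2024-08-06,[CUDA programming] Setting matrix multiplication parameters in the cuBLAS library
- 2024-04-13,CUDA model deployment in practice: how fast can a hand-written CUDA matrix multiplication get?
- 2024-04-19,AI inference: the rise of the CPU
- 2024-09-09,Performance tuning of large models with Nsight profiling tools
- 2024-07-11,FP8 low-precision training: a brief analysis of Transformer Engine
- 2024-07-19,An introduction to developing a softmax operator
- 2024-07-27,CUDA programming for flash attention
- 2024-07-30,Parallel strategies for reduction in CUDA
- 2024-03-19,The most powerful chip ever: NVIDIA launches the next-generation Blackwell GPU
- 2024-07-31,Allspark 2 with NVIDIA Jetson Orin debuts, with up to 100 TOPS of compute!
- 2024-08-09,How to push PyTorch GPU utilization to 100%?
- 2024-03-19,Nvidia launches the Blackwell B200 GPU, currently the most powerful AI chip
- 2024-03-19,Only NVIDIA can surpass NVIDIA
- 2024-03-20,NVIDIA rewrites Moore's Law with Blackwell
- 2024-03-15,An end-to-end walkthrough of deploying NVIDIA large language models
- 2024-07-09,BAAI builds a Triton-based operator library for large models to support the AI chip hardware/software ecosystem
- 2024-09-06,BAAI builds a Triton-based operator library for large models to support the AI chip hardware/software ecosystem
- 2024-11-20,Triton event | Triton China community contributors meetup
- 2024-09-18,Triton conference in Silicon Valley: chip makers and AI giants show their support
- 2024-12-04,Triton China community contributors meetup wraps up successfully
- 2024-12-10,Getting started with Triton | Operator performance optimization: the art of autotuning
- 2024-07-18,Moore Threads × BAAI | Adaptation of the Triton-based large-model operator library completed
- 2024-11-12,Open-source MUTLASS | Moore Threads accelerates operator development and algorithm innovation on Chinese-made GPUs
- 2024-10-14,The first complete Chinese Triton documentation is online! A new era of GPU inference acceleration
- 2024-08-22,An introduction to OpenAI Triton (1)
- 2024-11-05,Open-source vLLM-MUSA | Moore Threads continues to accelerate large AI model inference development on Chinese-made GPUs
- 2024-10-24,An introduction to OpenAI Triton (2)
- 2023-05-22,A step-by-step tutorial on the Triton model inference serving framework (1): quick start
- 2023-06-02,A step-by-step tutorial on the Triton model inference serving framework (2): architecture analysis
- 2023-06-03,A step-by-step tutorial on the Triton model inference serving framework (3): development practice
- 2024-01-22,[BBuf's CUDA notes] 13: OpenAI Triton introductory notes, part 1
- 2024-10-08,[Translation] [PyTorch tricks] FlexAttention: maximally flexible attention built on Triton
- 2024-09-06,PyTorch announces: goodbye CUDA, GPU inference enters a new era of Triton acceleration
- 2024-09-08,PyTorch announces: goodbye CUDA, GPU inference enters a new era of Triton acceleration
- 2024-09-10,Large-model inference without CUDA is now a reality
- 2024-07-28,CUDA-MODE course notes, Lecture 7: Quantization, CUDA vs Triton
- 2024-08-01,Quantization GEMM in TRT-LLM (Ampere Mixed GEMM): CUTLASS 2.x course study notes
- 2024-08-05,CUDA-MODE course notes, Lecture 8: CUDA performance checklist
- 2024-10-08,The NVIDIA Jetson platform helps Instacart deliver a seamless smart supermarket shopping experience
- 2024-11-28,TensorRT-LLM: a new chapter for large language model inference on the Jetson platform
- 2024-01-26,Large-model deployment with TensorRT-LLM (crash-course notes)
- 2024-05-14,Getting started with OpenAI Triton
- 2025-01-06,Inside the Triton source code: unveiling the AI acceleration engine!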
- Zhihu「Soaring」
- 2024-05-21,Parallel computing: the magician behind the super brain
- Modular Blog
- 2023-03-23,AI’s compute fragmentation: what matrix multiplication teaches us
- 2023-04-20,The world's fastest unified matrix multiplication
- 2023-05-02,A unified, extensible platform to superpower your AI
- 2023-08-18,How Mojo🔥 gets a 35,000x speedup over Python – Part 1
- 2023-08-28,How Mojo🔥 gets a 35,000x speedup over Python – Part 2
- 2023-09-06,Mojo🔥 - A journey to 68,000x speedup over Python - Part 3
- 2024-02-12,Mojo vs. Rust: is Mojo 🔥 faster than Rust 🦀 ?
- 2024-04-10,Row-major vs. column-major matrices: a performance analysis in Mojo and NumPy
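- 2021-03-23,Dr. Zhang Xianyi: the OpenBLAS project and matrix multiplication optimization
- 2023-11-11,Zhu Yi: a high-performance matrix multiplication lab report for HPC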
- 2023-06-16,SIMD instruction sets and data-parallel programs
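- 2024-06-29,An introduction to BLAS: the Fortran-based foundational library for high-performance matrix computation
- 2024-07-08,An introduction to LAPACK: the Fortran-based high-performance linear algebra toolbox
- 2024-07-12,Optimizing binary search trees with SIMD
- 2024-06-21,Comparing YOLOv10 inference in PyTorch and OpenVINO
- Zhihu「白牛」
- 2023-05-04,OpenBLAS gemm from scratch
- Zhihu「庄碧晨」
- 2021-01-22,Notes on a multithreaded GEMM paper
- Zhihu「OeuFcoque」
- 2020-04-12,An introduction to high-performance computing (1): preliminary analysis and an overview of BLAS and BLIS
- Zhihu「赵小明12138」
- 2022-10-26,Parallel computing, Cannon's algorithm: matrix multiplication
- Zhihu「zero」
- 2021-12-18,Dense matrix multiplication 003 (gemm): blocking strategies in OpenBLAS and BLIS
- Zhihu「严忻恺」
- 2022-03-31,Stanford CS217 (3): accelerating GEMM computation
- 黎明灰烬's blog
- 2019-06-12,Optimization algorithms for general matrix multiplication (GEMM)
- 2023-10-27,Now publicly available! Optimize large language model inference with NVIDIA TensorRT-LLM
- 2023-11-24,Deploying large language models at the edge with the NVIDIA IGX Orin Developer Kit
- 2024-06-03,COMPUTEX 2024 | "Accelerate everything": NVIDIA CEO Jensen Huang delivers a keynote ahead of the COMPUTEX opening
- 2024-06-19,NVIDIA CEO Jensen Huang to graduates: "Keep faith in the unconventional and the unexplored"
- 2024-04-24,FP8 training and inference on NVIDIA GPU architectures
- 2024-06-14,Startup acceleration program | Built on the NVIDIA Jetson platform, 国讯芯微 achieves end-to-end coordinated "cerebrum and cerebellum" control
- 2024-06-20,NVIDIA Isaac Sim 4.0 and NVIDIA Isaac Lab power robotics workflows and simulation
- 2024-06-21,Closing the gap between simulation and reality: training Spot quadruped locomotion with NVIDIA Isaac Lab
- 2024-07-01,NVIDIA end-to-end solutions help Li Auto build intelligent driving experiences and personalized in-car spaces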
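- 2024-03-20,More C++ template deduction tricks: unifying kernel invocation and dispatch across AI devices
- 2024-04-09,The first article online to examine Mixtral-8x7b inference acceleration from the perspective of the TensorRT-LLM MoE CUDA kernels, with an outlook
- 2024-05-10,A thorough investigation: can CUDA cores and Tensor cores inside a GPU SM compute at the same time? (part 1)
- 2024-05-16,A thorough investigation: can CUDA cores and Tensor cores inside a GPU SM compute at the same time? (part 2)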
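- 2022-10-18,GPU optimization made accessible series: reduce optimization
- 2022-10-31,GPU optimization made accessible series: spmv optimization
- 2023-05-24,GPU optimization made accessible series: gemv optimization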
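- 2023-05-24,GPU optimization made accessible series: GEMM optimization (1)
- 2023-06-02,GPU optimization made accessible series: GEMM optimization (2)
- 2023-06-16,GPU optimization made accessible series: GEMM optimization (3)
- 2024-05-14,Boosting performance quickly: how to make better use of GPUs (part 2)
- 2024-05-22,Large-model precision (FP16, FP32, BF16) explained, with practice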
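- 2023-06-26,GPU optimization made accessible series: elementwise optimization and an introduction to the CUDA toolchain
- 2023-06-27,Musings on high-performance computing and performance optimization: memory access
- 2024-07-04,PerfXLab (澎峰科技) releases the technical white paper for PerfIPP, its high-performance computing primitives library (with download)
- 2024-03-11,Illustrated: Mixtral 8 * 7b inference optimization principles and source code implementation
- 2024-07-24,Simple CUDA performance optimization (1): background knowledge
- 2024-01-09,An in-depth analysis of the LLM inference library TensorRT-LLM
- 2024-03-29,Illustrated large-model compute acceleration series: the principles of PagedAttention, vLLM's core technique
- 2024-04-06,Illustrated large-model compute acceleration series: vLLM source code analysis 1, overall architecture
- 2024-04-12,Illustrated large-model compute acceleration series: vLLM source code analysis 2, scheduler strategy (Scheduler)
- 2024-04-19,From knowing nothing to CUDA GEMM optimization
- 2024-03-19,An end-to-end walkthrough of deploying NVIDIA large language models
- 2024-03-20,A first look at TensorRT-LLM (2): a brief look at the structure, for clearer usage
- 2024-03-21,Design and implementation of a high-performance LLM inference framework
- 2024-04-15,[In-depth analysis of CUTLASS series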
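- 2024-04-22,Boosting performance quickly: how to make better use of GPUs (part 1)
- 2024-04-10,Getting started with Tensor Core instruction-level programming in one article
- 2024-04-23,Large language model quantization
- 2024-04-25,Implementing a mixed-precision matrix multiplication CUDA kernel by hand
- 2024-04-26,CUDA matrix multiplication programming in one article
- 2024-04-20,An introduction to using Tensor Cores
- 2024-05-27,[Parallel training
- 2024-06-20,FP8 quantization explained: the best 8-bit scheme? (1)
- 2024-07-01,CUDA-MODE course notes, Lecture 1: how to profile CUDA kernels in PyTorch
- 2024-07-04,CUDA-MODE Lecture 1 follow-up practice (part 1)
- 2024-07-06,CUDA-MODE course notes, Lecture 2: a quick pass through chapters 1-3 of the PMPP book
- 2024-07-13,CUDA-MODE course notes, Lecture 4: notes on chapters 4-5 of the PMPP book
- 2024-07-18,CUDA-MODE course notes, Lecture 6: how to optimize optimizers in PyTorch
- 2024-07-23,CUTLASS 2.x & CUTLASS 3.x intro study notes
- 2017-12-07,[Recommended] CUTLASS: a high-performance CUDA C++ linear algebra library
- 2024-02-28,After several all-nighters, I wrote some beginner CUDA starter code
- 2024-05-13,Shared memory! Maxing out CUDA data copy speed
- 2024-03-29,A survey of hardware accelerators for large language models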
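- 2024-04-07,Llama sped up by 500%! A female Google programmer hand-writes matrix multiplication kernels
- 2024-04-09,GPT-2 in 1,000 lines of C! AI guru Karpathy's new project racks up 2.5k stars right after launch
- 2024-04-11,llm.c: a simple, pure C/CUDA implementation of large language model (LLM) training, with no need for PyTorch or cPython
- 2024-04-17,NVIDIA wants more programming languages to support CUDA
- 2024-04-10,[Insane] GPT-2 training in 1,000 lines of pure C: Andrej Karpathy reshapes the LLM training landscape
- 2024-04-14,[Backed by hackers worldwide] Karpathy's 1,000-line pure C large-model training now matches PyTorch's speed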
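- 2022-10-16,A comprehensive repository of TensorRT/CUDA code and materials
- 2024-04-09,"Real men program in C"! A large model handwritten in 1,000 lines of C that runs on a Mac: the former Tesla AI director's viral LLM explainer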
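- 2024-05-20,The first high-level GPU language: massive parallelism that feels like writing Python, already at 8,500 stars
- 2023-09-10,H100 inference up 8x! NVIDIA announces open-source TensorRT-LLM with support for 10+ models
- 2024-04-11,Meituan Waimai's GPU-based vector retrieval system in practice
- 2024-04-20,NVIDIA's open-source AI algebra library: CUDA templates for linear algebra subroutines
- 2024-03-18,100x LLM inference acceleration: quantization
- 2024-03-22,LLM inference: choosing GPU resources and inference frameworks
- 2024-03-27,A roundup of LLM inference acceleration methods
- 2024-04-26,LLM inference quantization: FP8 vs INT8
- 2024-04-28,Nvidia GPU pooling: remote GPUs
- 2024-05-01,A first look at Nvidia Tensor Cores
- 2024-05-24,PyTorch GPU memory management and methods for analyzing memory usage
- 2024-06-02,[LLM inference optimization
- 2022-08-08,[Machine learning] Principles of the K-means clustering algorithm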
- 2022-08-11,[CUDA programming] A simple CUDA implementation of the K-means algorithm
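- 2024-02-25,NVIDIA's extraordinary rise: a brief history of Jensen Huang's new global AI chip empire
- 2024-01-24,[CUDA programming] An advanced CUDA implementation of the K-means algorithm (2)
- 2024-04-08,[CUDA programming] CUDA unified memory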
- 2024-04-12,An accessible breakdown of the principles of int8 quantization for PyTorch models
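- 2024-04-22,A detailed breakdown of Tensor Cores in CUDA programming
- 2024-06-22,FP8 quantization explained: the best 8-bit deployment scheme?
- 2024-06-26,CUDA programming in practice: my first CUDA code
- 2024-03-25,Understanding NVIDIA GPU performance: utilization and saturation
- 2024-04-30,Accelerating matrix computation: the most complete analysis of NVIDIA Tensor Core architecture evolution and principles
- 2024-05-15,Tensor Core internals revealed: how they make AI computation leap ahead
- 2024-05-27,A brief analysis of GPU distributed communication technologies: PCIe, NVLink, NVSwitch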
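- 2023-07-21,AI model deployment | A Python implementation of INT8 quantization for TensorRT models
- 2023-10-30,Running local LLMs with the power of NVIDIA Jetson Orin
- 2024-05-07,Building an AI chat digital human with NVIDIA Jetson AGX Orin and Audio2Face
- 2024-05-14,CUDA and OpenCL: conflict and the future of the parallel computing revolution
- 2024-05-11,I found the original AlexNet source code: no framework, handwritten from scratch in CUDA/C++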
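- 2024-04-10,Efficient CUDA optimizations for large language models can double performance
- 2024-06-21,Demystifying high-performance computing: improving GPU communication efficiency with stream and kernel triggering
- 2024-04-19,What exactly is CUDA, which NVIDIA has stuck with for 16 years?
- 2024-04-09,Reshaping deep learning in 100 lines of C: minimalist LLM training built in pure C/CUDA
- 2024-03-29,The image preprocessing library CV-CUDA is now open source, breaking the preprocessing bottleneck and raising inference throughput more than 20x
- 2024-02-26,GPU (1): an introduction to GPUs
- 2024-04-03,[CUDA] Streams and concurrency explained in one article, or I'll explain it again
- 2024-05-02,[CUDA] Shared memory and constant memory explained in one article
- 2022-09-19,Zero-knowledge proofs: FPGA vs. GPU
- 2022-06-16,Stop reinventing the wheel: the productivity-freeing "small matrix feature" is here!
- 2024-06-03,Jensen Huang: NVIDIA will launch a brand-new chip every year; without NVIDIA there would be none of today's AI (full text of the latest speech included)
- 2024-06-01,Does CUDA acceleration offer a significant advantage for traditional SLAM?
- 2024-06-01,Jensen Huang: I don't like layoffs, I'd rather "torture" them | 中企 recommended reading
- 2024-06-04,Deploying YOLOv10 with GPU acceleration via the TensorRT C++ API at 500 FPS inference speed: blazing fast!!
- 2023-06-12,Choosing appropriate grid_size and block_size for a CUDA kernel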
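- 2024-06-20,Performance evaluation metrics for large-model quantization
- 2024-06-24,FP8 quantization basics - NVIDIA
- 2024-07-05,A discussion of disaggregated inference for large models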
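- 2024-06-17,Jensen Huang to graduates: dare to enter zero-billion-dollar markets, and may you find your own GPU
- 2024-03-26,Performance optimization metrics for GEMM on GPUs
- 2023-07-06,[Lessons from others] CUDA SGEMM matrix multiplication optimization notes: from beginner to cuBLAS
- 2024-07-06,The Thrust library: a leap forward for C++ parallel computing
- 2024-07-09,How did Li Auto deploy a large vision-language model on Orin-X?
- 2024-07-08,Hands-on | A tutorial on accelerating YOLOv8 inference with TensorRT (steps + code)
- 2024-07-10,A collection of resources on CUDA acceleration in OpenCV (PDF + video + source code)
- 2024-07-24,Parallel strategies for matmul in CUDA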
- Zhihu「是聪明貂吖」
- 2024-02-18,Table of contents for course notes on "High-Performance Parallel Programming and Optimization"
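- 2024-05-01,ops(4): a CUDA implementation of the AdamW optimizer
- 2024-05-02,ops(5): CUDA implementations of activation functions and residual connections
- 2024-05-03,ops(6): CUDA implementations of the embedding and LM head layers
- 2024-05-06,ops(7): CUDA implementation and optimization of self-attention (part 1)
- 2024-05-08,ops(8): CUDA implementation and optimization of self-attention (part 2)
- 2024-05-14,CUDA (4): implementing the Transformer architecture in CUDA
- Zhihu「紫气东来」
- 2023-09-02,CUDA (1): CUDA programming basics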
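- 2023-09-09,CUDA (2): the GPU memory hierarchy and an optimization guide
- 2023-09-29,CUDA (3): general matrix multiplication, from beginner to proficient
- 2024-04-29,ops(1): CUDA implementation and optimization of the LayerNorm operator
- 2024-04-30,ops(2): a CUDA implementation of the SoftMax operator
- 2024-05-01,ops(3): a CUDA implementation of Cross Entropy
- 2024-12-19,BAAI's general-purpose large-model operator library FlagGems gains four capability upgrades, bringing new vitality to the open-source AI systems ecosystem
- 2023-05-04,Hands-on | Optimizing PyTorch models with TVM
- 2022-05-23,Compiler optimization of GEMM with MLIR: Chinese/English video, first and middle parts
- 2023-06-25,MLIR: writing a bufferization pass for a custom IR dialect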
- 2025-01-03,Engineering AI projects: a roundup of CUDA development lessons!
- 2025-01-04,Exploring the secrets of Triton programming: a guide to syntax and practice
-
Frameworks
- yetone/openai-translator: The translator that does more than just translation - powered by OpenAI.
- go-skynet/LocalAI: 🤖 Self-hosted, community-driven, local OpenAI-compatible API. Drop-in replacement for OpenAI running LLMs on consumer-grade hardware. Free Open Source OpenAI alternative. No GPU required. LocalAI is an API to run ggml compatible models: llama, gpt4all, rwkv, whisper, vicuna, koala, gpt4all-j, cerebras, falcon, dolly, starcoder, and many others. [localai.io](https://localai.io/)
- wangzhaode/ChatGLM-MNN: Pure C++, easy-to-deploy ChatGLM-6B.
- MUTLASS
- MooreThreads/torch_musa
- FlagGems: a high-performance general operator library implemented in [OpenAI Triton](https://github.com/openai/triton). It aims to provide a suite of kernel functions to accelerate LLM training and inference.
- FlagPerf: an open-source software platform for benchmarking AI chips. FlagPerf is an integrated AI hardware evaluation engine built jointly by BAAI (智源研究院) and AI hardware vendors. It aims to establish a metric system oriented toward industry practice and to assess the real-world capability of AI hardware under different software stack combinations (model + framework + compiler).
- SCUDA: GPU-over-IP. SCUDA is a GPU over IP bridge allowing GPUs on remote machines to be attached to CPU-only machines.
- Burn: A Flexible and Comprehensive Deep Learning Framework in Rust. [burn-rs.github.io/](https://burn-rs.github.io/)
- BLAS: the Level 1 BLAS perform scalar, vector and vector-vector operations, the Level 2 BLAS perform matrix-vector operations, and the Level 3 BLAS perform matrix-matrix operations.
- LAPACK: LAPACK development repository. [LAPACK](https://www.netlib.org/lapack/) (Linear Algebra PACKage) is written in Fortran 90 and provides routines for solving systems of simultaneous linear equations, least-squares solutions of linear systems of equations, eigenvalue problems, and singular value problems. The associated matrix factorizations (LU, Cholesky, QR, SVD, Schur, generalized Schur) are also provided, as are related computations such as reordering of the Schur factorizations and estimating condition numbers. Dense and banded matrices are handled, but not general sparse matrices. In all areas, similar functionality is provided for real and complex matrices, in both single and double precision.
- OpenBLAS
- BLIS: BLAS-like Library Instantiation Software Framework.
- NumPy
- SciPy: open-source software for mathematics, science, and engineering. It includes modules for statistics, optimization, integration, linear algebra, Fourier transforms, signal and image processing, ODE solvers, and more. [scipy.org](https://scipy.org/)
- Gonum
- CCCL: high-quality, high-performance, and easy-to-use C++ abstractions for CUDA developers.
- HIP: Heterogeneous-Compute Interface for Portability. HIP is a C++ Runtime API and Kernel Language that allows developers to create portable applications for AMD and NVIDIA GPUs from single source code. [rocmdocs.amd.com/projects/HIP/](https://rocmdocs.amd.com/projects/HIP/)
- PyCUDA
- jessfraz/advent-of-cuda: Doing Advent of Code with CUDA and Rust.
- Bend: a high-level programming language. [higherorderco.com](https://higherorderco.com/)
- HVM
- ZLUDA
- Rust-CUDA: Ecosystem of libraries and tools for writing and executing fast GPU code fully in Rust.
- cudarc
- bindgen_cuda: similar to [rust-bindgen](https://github.com/rust-lang/rust-bindgen) in philosophy. It helps generate bindings to CUDA kernel source files automatically and makes them easier to use directly from Rust.
- cuda-driver: a CUDA runtime environment based on the CUDA Driver API.
- async-cuda: Asynchronous CUDA for Rust.
- async-tensorrt: Asynchronous TensorRT for Rust.
- krnl: Safe, portable, high performance compute (GPGPU) kernels.
- custos
- spinorml/nvlib
- DoeringChristian/cuda-rs: CUDA bindings for Rust generated with bindgen-cli (similar to cust_raw).
- romankoblov/rust-nvrtc: NVRTC bindings for Rust.
- solkitten/astro-cuda: CUDA Driver API bindings for Rust.
- bokutotu/curs
- rust-cuda/cuda-sys: Rust binding to CUDA APIs.
- bheisler/RustaCUDA
- tmrob2/cuda2rust_sandpit
- PhDP/rust-cuda-template: Simple template for Rust + CUDA.
- neka-nat/cuimage: Rust implementation of image processing library with CUDA.
- yanghaku/cuda-driver-sys: Rust binding to CUDA Driver APIs.
- Canyon-ml/canyon-sys: Rust bindings for CUDA, cuDNN.
- cea-hpc/HARP: Small tool for profiling the performance of hardware-accelerated Rust code using OpenCL and CUDA.
- Conqueror712/CUDA-Simulator: A self-developed version of the user-mode CUDA emulator project and a learning repository for Rust.
- cszach/rust-cuda-template: A Rust CUDA template with detailed instructions.
- exor2008/fluid-simulator: Rust CUDA fluid simulator.
- chichieinstein/rustycuda
- Jafagervik/cruda: Writing Rust with CUDA.
- lennyerik/cutransform
- cjordan/hip-sys: Rust bindings for HIP.
- rust-gpu: 🐉 Making Rust a first-class language and ecosystem for GPU shaders 🚧 [shader.rs](https://shader.rs/)
- wgpu: Safe and portable GPU abstraction in Rust, implementing WebGPU API. [wgpu.rs](https://wgpu.rs/)
- Vulkano: Safe and rich Rust wrapper around the Vulkan API. Vulkano is a Rust wrapper around [the Vulkan graphics API](https://www.vulkan.org/). It follows the Rust philosophy, which is that as long as you don't use unsafe code you shouldn't be able to trigger any undefined behavior. In the case of Vulkan, this means that non-unsafe code should always conform to valid API usage.
- Ash: Vulkan bindings for Rust.
- ocl
- opencl3
- CUDA.jl
- AMDGPU.jl
- te42kyfo/gpu-benches: Collection of benchmarks to measure basic GPU capabilities.
- PyTorch
- PaddlePaddle
- CUTLASS
- MatX - GPU-Accelerated Numerical Computing in Modern C++. An efficient C++17 GPU numerical computing library with Python-like syntax. [nvidia.github.io/MatX](https://nvidia.github.io/MatX)
- GenericLinearAlgebra.jl
- flashlight/flashlight
- NVlabs/tiny-cuda-nn: Lightning fast C++/CUDA neural network framework.
- yhwang-hub/dl_model_infer: A C++ AI inference library. It currently supports inference of TensorRT models only; C++ inference with frameworks such as OpenVINO, NCNN, and MNN is planned. Pre- and post-processing come in two versions, C++ and CUDA, and the CUDA version is recommended. The repository provides accelerated deployment examples for popular deep learning CV models, and its CUDA C code supports dynamic-batch image processing, inference, decoding, and NMS.
- llm.c: training GPT-2 (CPU, fp32) is ~1,000 lines of clean code in a single file. It compiles and runs instantly, and exactly matches the PyTorch reference implementation.
- llama2.c: Llama 2 inference in a single C file (run.c).
- gemma.cpp
- llama.cpp
- custos-math: This crate provides CUDA, OpenCL, CPU (and Stack) based matrix operations using [custos](https://github.com/elftausend/custos).
- whisper.cpp: High-performance inference of [OpenAI's Whisper](https://github.com/openai/whisper) automatic speech recognition (ASR) model.
- ChatGLM.cpp: C++ implementation of [ChatGLM-6B](https://github.com/THUDM/ChatGLM-6B) and [ChatGLM2-6B](https://github.com/THUDM/ChatGLM2-6B).
- MegEngine/InferLLM
- skeskinen/llama-lite: Embeddings focused small version of Llama NLP model.
- Const-me/Whisper: High-performance GPGPU inference of OpenAI's Whisper automatic speech recognition (ASR) model.
- ztxz16/fastllm: currently supports ChatGLM-6B and MOSS; can run ChatGLM-6B smoothly on Android devices.
- davidar/eigenGPT
- zjhellofss/KuiperInfer (a self-made deep learning inference framework): a high-performance deep learning inference library, built step by step.
- Tlntin/Qwen-TensorRT-LLM: Accelerating Qwen-7B-Chat inference with TRT-LLM.
- FeiGeChuanShu/trt2023: building and optimizing a 7B model with TensorRT-LLM.
- TRT2022/trtllm-llama: ☢️ TensorRT 2023 second round: inference acceleration and optimization of the Llama model with TensorRT-LLM.
- llama2.mojo
- dorjeduck/llm.mojo
- Candle
- Safetensors
- tch-rs: Rust bindings for the C++ API of PyTorch.
- Tokenizers : Fast State-of-the-Art Tokenizers optimized for Research and Production. [huggingface.co/docs/tokenizers](https://huggingface.co/docs/tokenizers/index)
- dfdx
- luminal
- crabml
- rustai-solutions/candle_demo_openchat_35 : candle_demo_openchat_35.
- llama2.rs
- Llama2-burn : Llama2 LLM ported to Rust burn.
- gaxler/llama2.rs
- whisper-burn : A Rust implementation of OpenAI's Whisper model using the burn framework.
- stable-diffusion-burn : Stable Diffusion v1.4 ported to Rust's burn framework.
- coreylowman/llama-dfdx : [LLaMa 7b](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/) with CUDA acceleration implemented in Rust. Minimal GPU memory needed!
- rustformers/llm
- Chidori
- TensorFlow Rust
- tazz4843/whisper-rs : Rust bindings to [whisper.cpp](https://github.com/ggerganov/whisper.cpp).
- llm-chain : llm-chain is a collection of Rust crates designed to help you work with Large Language Models (LLMs) more effectively. [llm-chain.xyz](https://llm-chain.xyz/)
- Atome-FE/llama-node : Believe in AI democratization. Llama for Node.js, backed by llama-rs and llama.cpp; works locally on your laptop CPU. Supports llama/alpaca/gpt4all/vicuna models. [www.npmjs.com/package/llama-node](https://www.npmjs.com/package/llama-node)
- Noeda/rllama
- lencx/ChatGPT
- Synaptrix/ChatGPT-Desktop : Fuel your productivity with ChatGPT-Desktop - Blazingly fast and supercharged!
- Poordeveloper/chatgpt-app : A ChatGPT App for all platforms. Built with Rust + Tauri + Vue + Axum.
- mxismean/chatgpt-app : Tauri project: ChatGPT App.
- sonnylazuardi/chat-ai-desktop : Chat AI Desktop App. Unofficial ChatGPT desktop app for the Mac & Windows menubar using Tauri & Rust.
- m1guelpf/browser-agent : A browser AI agent, using GPT-4. [docs.rs/browser-agent](https://docs.rs/browser-agent/latest/browser_agent/)
- sigoden/aichat : …3.5/GPT-4 in the terminal.
- uiuifree/rust-openai-chatgpt-api : "rust-openai-chatgpt-api" is a Rust library for accessing the ChatGPT API, a powerful NLP platform by OpenAI. The library provides a simple and efficient interface for sending requests and receiving responses, including chat. It uses reqwest and serde for HTTP requests and JSON serialization.
- 1595901624/gpt-aggregated-edition : Aggregates the official ChatGPT, free ChatGPT, ERNIE Bot (文心一言), Poe, chatchat, and other platforms; supports importing custom platforms.
- Cormanz/smartgpt
- femtoGPT
- shafishlabs/llmchain-rs : 🦀Rust + Large Language Models - Make AI Services Freely and Easily. Inspired by LangChain.
- flaneur2020/llama2.rs
- Heng30/chatbox : …ui and Rust.
- fairjm/dioxus-openai-qa-gui : A simple OpenAI QA desktop app built with Dioxus.
- llama2.zig
- renerocksai/gpt4all.zig : …based chat client for an assistant-style large language model with ~800k GPT-3.5-Turbo Generations based on LLaMa.
- vLLM : A high-throughput and memory-efficient inference and serving engine for LLMs. [docs.vllm.ai](https://docs.vllm.ai/)
- MLC LLM : Enable everyone to develop, optimize and deploy AI models natively on everyone's devices. [mlc.ai/mlc-llm](https://mlc.ai/mlc-llm/)
- Lamini : The LLM engine for rapidly customizing models 🦙.
- datawhalechina/self-llm : "A Practical Guide to Open-Source LLMs" (《开源大模型食用指南》): tutorials for quickly deploying open-source large models on Linux, tailored to users in China.
- ninehills/llm-inference-benchmark : LLM Inference benchmark.
- NVIDIA/nccl : Optimized primitives for inter-GPU communication.
- NVIDIA/multi-gpu-programming-models : Examples demonstrating available options to program multiple GPUs in a single node or a cluster.
- wilicc/gpu-burn : Multi-GPU CUDA stress test.
- Cupoch : Robotics with GPU computing.
- Tachyon : Modular ZK (Zero Knowledge) backend accelerated by GPU.
- Blitzar : Zero-knowledge proof acceleration with GPUs for C++ and Rust. [www.spaceandtime.io/](https://www.spaceandtime.io/)
- blitzar-rs : High-level Rust wrapper for the blitzar-sys crate. [www.spaceandtime.io/](https://www.spaceandtime.io/)
- ICICLE : ICICLE is a library for ZK acceleration using CUDA-enabled GPUs.
- YichengDWu/matmul.mojo : Multi-threaded implementation of the [BLIS](https://en.wikipedia.org/wiki/BLIS_(software)) algorithm in pure Mojo 🔥.
- NVIDIA/cuda-python : CUDA Python is the home for accessing NVIDIA’s CUDA platform from Python. CUDA Python Low-level Bindings. [nvidia.github.io/cuda-python/](https://nvidia.github.io/cuda-python/latest/)
- CuPy
- TensorRT : An SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT. [developer.nvidia.com/tensorrt](https://developer.nvidia.com/tensorrt)
- TensorRT-LLM : TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines. [nvidia.github.io/TensorRT-LLM](https://nvidia.github.io/TensorRT-LLM)
- linxihui/dkernel : Sparse attention kernel used in [Phi-3-small models](https://huggingface.co/microsoft/Phi-3-small-8k-instruct). The sparse attention is also supported in vLLM for efficient inference.
- 'gpu' Dialect : Middle-level abstractions for launching GPU kernels following a programming model similar to that of CUDA or OpenCL.
- ONNX-MLIR : Representation and Reference Lowering of ONNX Models in MLIR Compiler Infrastructure.
- TPU-MLIR : An open-source machine-learning compiler based on MLIR for Sophgo TPU. This project provides a complete toolchain that converts pre-trained neural networks from different frameworks into bmodel binary files that run efficiently on TPUs.
- IREE : Intermediate Representation Execution Environment. A retargetable MLIR-based machine learning compiler and runtime toolkit. [iree.dev/](http://iree.dev/)
- 'amdgpu' Dialect : Wrappers around AMD-specific functionality and LLVM intrinsics.
- pyMLIR : Python interface for the Multi-Level Intermediate Representation. pyMLIR is a full Python interface to parse, process, and output [MLIR](https://mlir.llvm.org/) files according to the syntax described in the [MLIR documentation](https://github.com/llvm/llvm-project/tree/master/mlir/docs). pyMLIR supports the basic dialects and can be extended with other dialects.
- Torch-MLIR : The Torch-MLIR project aims to provide first class support from the PyTorch ecosystem to the MLIR ecosystem.
- ByteIR : End-to-end model compilation solution. [byteir.ai](https://byteir.ai/)
- Xilinx/mlir-aie : An MLIR-based toolchain for AMD AI Engine-enabled devices, such as [AMD Ryzen™ AI](https://www.amd.com/en/products/processors/consumer/ryzen-ai.html) and [Versal™](https://www.xilinx.com/products/technology/ai-engine.html).
- cuBLAS : GPU-accelerated library for accelerating AI and HPC applications. It includes several API extensions for providing drop-in industry standard BLAS APIs and GEMM APIs with support for fusions that are highly optimized for NVIDIA GPUs. The cuBLAS library also contains extensions for batched operations, execution across multiple GPUs, and mixed- and low-precision execution with additional tuning for the best performance.
- cuDNN : GPU-accelerated library of primitives for [deep neural networks](https://developer.nvidia.com/deep-learning). cuDNN provides highly tuned implementations for standard routines such as forward and backward convolution, attention, matmul, pooling, and normalization.
- DeployAI/nndeploy : [nndeploy-zh.readthedocs.io/zh/latest/](https://nndeploy-zh.readthedocs.io/zh/latest/)
- purton-tech/bionicgpt : Accelerate LLM adoption in your organisation. Chat with your confidential data safely and securely. [bionic-gpt.com](https://bionic-gpt.com/)
- EugenHotaj/zig_inference
- Ollama
- harleyszhang/lite_llama
- Liger-Kernel : Efficient Triton Kernels for LLM Training. [arxiv.org/pdf/2410.10989](https://arxiv.org/pdf/2410.10989)
- BobMcDear/attorch
- zjhellofss/KuiperLLama
- zjhellofss/kuiperdatawhale
- MarioSieg/magnetron
- lucasdelimanogueira/PyNorch : …GPU support and automatic differentiation!
- FlashAttention : Fast and memory-efficient exact attention. "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness". (**[arXiv 2022](https://arxiv.org/abs/2205.14135)**).
- 66RING/tiny-flash-attention : [flash attention](https://github.com/Dao-AILab/flash-attention) tutorial written in Python, Triton, CUDA, and CUTLASS.
- weishengying/tiny-flash-attention : A simplified flash-attention implemented with CUTLASS, intended for teaching.
- jepeake/tiny-flash-attention : flash attention in ~20 lines.
-
Official Version
- CUDA
- MLIR : Multi-Level Intermediate Representation Compiler Framework. The MLIR project is a novel approach to building reusable and extensible compiler infrastructure. MLIR aims to address software fragmentation, improve compilation for heterogeneous hardware, significantly reduce the cost of building domain specific compilers, and aid in connecting existing compilers together.
- TVM
- CUTLASS : CUDA C++ template abstractions for implementing high-performance matrix-matrix multiplication (GEMM) and related computations at all levels and scales within CUDA. It incorporates strategies for hierarchical decomposition and data movement similar to those used to implement [cuBLAS](https://developer.nvidia.com/cublas) and [cuDNN](https://developer.nvidia.com/cudnn).
- cuBLAS : GPU-accelerated library for accelerating AI and HPC applications. It includes several API extensions for providing drop-in industry standard BLAS APIs and GEMM APIs with support for fusions that are highly optimized for NVIDIA GPUs. The cuBLAS library also contains extensions for batched operations, execution across multiple GPUs, and mixed- and low-precision execution with additional tuning for the best performance.
- cuDNN : GPU-accelerated library of primitives for deep neural networks. cuDNN provides highly tuned implementations for standard routines such as forward and backward convolution, attention, matmul, pooling, and normalization.
-
Learning Resources
- xgqdut2016/cuda_code
- hyperai/triton-cn : Triton documentation in Simplified Chinese. [triton.hyper.ai](https://triton.hyper.ai/)
- Triton Docs
- xgqdut2016/hpc_project
- xgqdut2016/hpc2torch
- LAFF-On-PfHP : LAFF-On Programming for High Performance.
- flame/how-to-optimize-gemm : How To Optimize Gemm wiki pages. [https://github.com/flame/how-to-optimize-gemm/wiki](https://github.com/flame/how-to-optimize-gemm/wiki)
- flame/blislab
- tpoisonooo/how-to-optimize-gemm : Row-major matmul optimization. [zhuanlan.zhihu.com/p/65436463](https://zhuanlan.zhihu.com/p/65436463).
- NVIDIA CUDA Toolkit Documentation
- NVIDIA CUDA C++ Programming Guide
- NVIDIA CUDA C++ Best Practices Guide
- CuPy User Guide
- NVIDIA/cuda-samples : Samples for CUDA developers that demonstrate features in the CUDA Toolkit.
- NVIDIA/CUDALibrarySamples
- NVIDIA-developer-blog/code-samples : Source code examples from the [Parallel Forall Blog](http://developer.nvidia.com/parallel-forall).
- HeKun-NVIDIA/CUDA-Programming-Guide-in-Chinese : A Chinese translation of the CUDA C Programming Guide.
- MAhaitao999/CUDA_Programming
- sangyc10/CUDA-code : Companion code for the bilibili video series "CUDA Programming Basics for Beginners (continuously updated)".
- cuda-mode/resource-stream : CUDA related news and material links.
- brucefan1983/CUDA-Programming : Sample codes for my CUDA programming book.
- YouQixiaowu/CUDA-Programming-with-Python : Python versions, using the pycuda module, of the example code from the book CUDA Programming.
- RussWong/CUDATutorial
- QINZHAOYU/CudaSteps : A CUDA learning path based on a book by Zheyong Fan (樊哲勇), 《…基础与实践》 ("…Basics and Practice").
- DefTruth/CUDA-Learn-Notes : 🎉 CUDA/C++ notes / hand-written CUDA for large models / tech blog, updated occasionally: flash_attn, sgemm, sgemv, warp reduce, block reduce, dot product, elementwise, softmax, layernorm, rmsnorm, hist, etc.
- BBuf/how-to-optim-algorithm-in-cuda : How to optimize some algorithms in CUDA.
- PaddleJitLab/CUDATutorial : A tutorial for learning CUDA high-performance programming from scratch.
- leimao/CUDA-GEMM-Optimization : [CUDA Matrix Multiplication Optimization](https://leimao.github.io/article/CUDA-Matrix-Multiplication-Optimization/). This repository contains the CUDA kernels for general matrix-matrix multiplication (GEMM) and the corresponding performance analysis.
- interestingLSY/CUDA-From-Correctness-To-Performance-Code : Code and examples for "CUDA - From Correctness to Performance". The lecture can be found at [https://wiki.lcpu.dev/zh/hpc/from-scratch/cuda](https://wiki.lcpu.dev/zh/hpc/from-scratch/cuda).
- Liu-xiandong/How_to_optimize_in_GPU : A series of GPU optimization topics that explains in detail how to optimize CUDA kernels. It covers several basic kernel optimizations, including elementwise, reduce, sgemv, and sgemm. The performance of these kernels is at or near the theoretical limit.
- Bruce-Lee-LY/matrix_multiply : Several common methods of matrix multiplication implemented on CPU and NVIDIA GPU using C++11 and CUDA.
- Bruce-Lee-LY/cuda_hgemm : Several optimization methods of half-precision general matrix multiplication (HGEMM) using Tensor Cores with the WMMA API and MMA PTX instructions.
- Bruce-Lee-LY/cuda_hgemv : Several optimization methods of half-precision general matrix vector multiplication (HGEMV) using CUDA cores.
- enp1s0/ozIMMU
- AyakaGEMM/Hands-on-GEMM : A GEMM tutorial.
- Cjkkkk/CUDA_gemm
- AyakaGEMM/Hands-on-MLIR : Hands-on-MLIR.
- zpzim/MSplitGEMM
- jundaf2/CUDA-INT8-GEMM : CUDA 8-bit Tensor Core Matrix Multiplication based on m16n16k16 WMMA API.
- chanzhennan/cuda_gemm_benchmark : …[Liu-xiandong/How_to_optimize_in_GPU](https://github.com/Liu-xiandong/How_to_optimize_in_GPU).
- YuxueYang1204/CudaDemo
- CoffeeBeforeArch/cuda_programming
- rbaygildin/learn-gpgpu : Algorithms implemented in CUDA + resources about GPGPU.
- godweiyang/NN-CUDA-Example : Several simple examples for popular neural network toolkits calling custom CUDA operators.
- yhwang-hub/Matrix_Multiplication_Performance_Optimization : Matrix Multiplication Performance Optimization.
- yao-jiashu/KernelCodeGen : GEMM/Conv2d CUDA/HIP kernel code generation using MLIR.
- caiwanxianhust/ClusteringByCUDA
- ulrichstern/cuda-convnet : Alex Krizhevsky's original code from Google Code. WeChat official account 「人工智能大讲堂」: "[Found AlexNet's original source code: no framework, hand-written from scratch in CUDA/C++](https://mp.weixin.qq.com/s/plxXG8y5QlxSionyjyPXqw)".
- PacktPublishing/Learn-CUDA-Programming : Learn CUDA Programming, published by Packt.
- PacktPublishing/Hands-On-GPU-Programming-with-Python-and-CUDA : Hands-On GPU Programming with Python and CUDA, published by Packt.
- PacktPublishing/Hands-On-GPU-Accelerated-Computer-Vision-with-OpenCV-and-CUDA : Hands-On GPU Accelerated Computer Vision with OpenCV and CUDA, published by Packt.
- codingonion/cuda-beginner-course-cpp-version : Companion code for the bilibili video course "CUDA 12.x Parallel Programming for Beginners (C++ version)".
- codingonion/cuda-beginner-course-python-version : Companion code for the bilibili video course "CUDA 12.x Parallel Programming for Beginners (Python version)".
- codingonion/cuda-beginner-course-rust-version : Companion code for the bilibili video course "CUDA 12.x Parallel Programming for Beginners (Rust version)".
- NVIDIA TensorRT Docs
- HeKun-NVIDIA/TensorRT-Developer_Guide_in_Chinese : A Chinese-language developer guide for NVIDIA TensorRT, translated by an individual with added commentary.
- LitLeo/TensorRT_Tutorial
- ifromeast/cuda_learning
- LLVM Docs
- MLIR Docs
- BBuf/tvm_mlir_learn
- j2kun/mlir-tutorial : This is the code repository for a series of articles on the [MLIR framework](https://mlir.llvm.org/) for building compilers.
- KEKE046/mlir-tutorial : Hands-On Practical MLIR Tutorial.
- Triton : Development repository for the Triton language and compiler. [triton-lang.org/](https://triton-lang.org/)
- chenzomi12/AISystem
- chenzomi12/AIFoundation
- CuPy
- BobMcDear/neural-network-cuda - network-cuda?style=social"/> : Neural network from scratch in CUDA/C++.
- Apache TVM Chinese site (中文站)
-
Awesome List
- Erkaman/Awesome-CUDA : This is a list of useful libraries and resources for CUDA development.
- jslee02/awesome-gpgpu : 😎 A curated list of awesome GPGPU (CUDA/OpenCL/Vulkan) resources.
- mikeroyal/CUDA-Guide : A guide covering CUDA including the applications and tools that will make you a better and more efficient CUDA developer.
- rkinas/triton-resources : A curated list of resources for learning and exploring Triton, OpenAI's programming language for writing efficient GPU code.
-
Jobs and Interview
- 2023-12-21, NVIDIA internal referrals
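- 2024-04-16, How much an NVIDIA offer actually pays in a year
- 2024-04-22, NVIDIA is building out its intelligent driving center and expanding hiring; referrals welcome
- 2024-03-21, Meituan autonomous delivery vehicles, spring 2024 recruitment | experienced-hire session
- 2024-04-21, Interview question bank for inference deployment engineers
- 2024-05-14, Foreign semiconductor company | NVIDIA is hiring! 13-month salary, 20-80k per month, non-technical roles included, custom company gifts, 22 weeks of fully paid maternity leave
- 2024-06-01, NVIDIA algorithm role interview: the questions get really detailed!
- Zhihu 「Tim在路上」
- 2024-01-18, Frequently asked GPU/CUDA interview questions at major Chinese tech companies (with some answers)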
-
Applications
- emptysoal/cuda-image-preprocess : Speed up image preprocessing with CUDA for image handling or TensorRT inference.
- Melody-Zhou/tensorRT_Pro-YOLOv8 : This repository is based on [shouxieai/tensorRT_Pro](https://github.com/shouxieai/tensorRT_Pro), with adjustments to support YOLOv8. It currently supports high-performance inference for YOLOv8, YOLOv8-Cls, YOLOv8-Seg, YOLOv8-OBB, YOLOv8-Pose, RT-DETR, ByteTrack, YOLOv9, YOLOv10, and RTMO! 🚀🚀🚀
- shouxieai/tensorRT_Pro
- shouxieai/infer
- hamdiboukamcha/yolov10-tensorrt : YOLOv10 C++ TensorRT: Real-Time End-to-End Object Detection.
- triple-Mu/YOLOv8-TensorRT : YOLOv8 accelerated with TensorRT!
- laugh12321/TensorRT-YOLO: 🚀 TensorRT-YOLO is an inference acceleration project supporting YOLOv3, YOLOv5, YOLOv6, YOLOv7, YOLOv8, YOLOv9, YOLOv10, PP-YOLOE, and PP-YOLOE+, optimized with NVIDIA TensorRT. It integrates the EfficientNMS TensorRT plugin for post-processing, uses CUDA kernels to accelerate preprocessing, and supports both C++ and Python inference, aiming to provide a fast, optimized object detection solution.
- l-sf/Linfer: A high-performance C++ inference library based on TensorRT, covering Yolov10, YoloPv2, Yolov5/7/X/8, RT-DETR, and the single-object trackers OSTrack and LightTrack.
- FeiYull/TensorRT-Alpha: 🔥🔥🔥 TensorRT for YOLOv8, YOLOv8-Pose, YOLOv8-Seg, YOLOv8-Cls, YOLOv7, YOLOv6, YOLOv5, YOLONAS... 🚀🚀🚀 CUDA IS ALL YOU NEED. 🍎🍎🍎
- cyrusbehr/YOLOv8-TensorRT-CPP: A C++ implementation of YOLOv8 using TensorRT. Supports object detection, semantic segmentation, and body pose estimation.
- NVIDIA-AI-IOT/torch2trt: An easy-to-use PyTorch to TensorRT converter.
- zhiqwang/yolort
- Linaom1214/TensorRT-For-YOLO-Series: TensorRT for the YOLO series (YOLOv8, YOLOv7, YOLOv6, ...) in Python/C++, with NMS plugin support.
- wang-xinyu/tensorrtx: TensorRTx aims to implement popular deep learning networks with the TensorRT network definition APIs.
- DefTruth/lite.ai.toolkit
- PaddlePaddle/FastDeploy: An easy-to-use and fast deep learning model deployment toolkit for ☁️ cloud, 📱 mobile, and 📹 edge. Covers 20+ mainstream image, video, text, and audio scenarios and 150+ SOTA models, with end-to-end optimization and multi-platform, multi-framework support.
- enazoe/yolo-tensorrt: TensorRT 8. Supports Yolov5 n, s, m, l, x. darknet -> tensorrt: Yolov4 and Yolov3 use raw darknet *.weights and *.cfg files. If the wrapper is useful to you, please star it.
- guojianyang/cv-detect-robot: 🔥🔥🔥🔥🔥🔥 Docker, NVIDIA Docker2, YOLOV5, YOLOX, YOLO, Deepsort, TensorRT, ROS, Deepstream, Jetson Nano, TX2, NX for high-performance deployment.
- BlueMirrors/Yolov5-TensorRT: Yolov5 TensorRT implementations.
- lewes6369/TensorRT-Yolov3: TensorRT for Yolov3.
- CaoWGG/TensorRT-YOLOv4: tensorrt5, yolov4, yolov3, yolov3-tiny, yolov3-tiny-prn.
- isarsoft/yolov4-triton-tensorrt: YOLOv4 on Triton Inference Server with TensorRT.
- TrojanXu/yolov5-tensorrt: A TensorRT implementation of yolov5.
- tjuskyzhang/Scaled-YOLOv4-TensorRT: Implements yolov4-tiny-tensorrt, yolov4-csp-tensorrt, and yolov4-large-tensorrt (p5, p6, p7) layer by layer using the TensorRT API.
- Syencil/tensorRT: TensorRT-7 network lib covering common object detection, keypoint detection, face detection, OCR, and more; can be trained on your own data.
- SeanAvery/yolov5-tensorrt: YOLOv5 in TensorRT.
- Monday-Leo/YOLOv7_Tensorrt: A simple implementation of TensorRT YOLOv7.
- ibaiGorordo/ONNX-YOLOv6-Object-Detection: Python scripts performing object detection using the YOLOv6 model in ONNX.
- ibaiGorordo/ONNX-YOLOv7-Object-Detection: Python scripts performing object detection using the YOLOv7 model in ONNX.
- triple-Mu/yolov7: End2end TensorRT YOLOv7.
- hewen0901/yolov7_trt
- tsutof/tiny_yolov2_onnx_cam
- Monday-Leo/Yolov5_Tensorrt_Win10: A simple implementation of TensorRT yolov5 in Python/C++ 🔥
- Wulingtian/yolov5_tensorrt_int8
- Wulingtian/yolov5_tensorrt_int8_tools
- MadaoFY/yolov5_TensorRT_inference
- tatsuya-fukuoka/yolov7-onnx-infer: Inference with yolov7's ONNX model.
- ervgan/yolov5_tensorrt_inference
- ibaiGorordo/ONNX-YOLOv8-Object-Detection: Python scripts performing object detection using the YOLOv8 model in ONNX.
- we0091234/yolov8-tensorrt: YOLOv8 TensorRT acceleration.
- FeiYull/yolov8-tensorrt: TensorRT + CUDA accelerated deployment of YOLOv8; the code runs on Windows and Linux.
- cvdong/YOLO_TRT_SIM
- cvdong/YOLO_TRT_PY
- Psynosaur/Jetson-SecVision: Person detection for Hikvision DVR with AlarmIO ports, using TensorRT and yolov4.
- AlbinZhu/easy-trt: TensorRT for YOLOv10 with CUDA.
- kalfazed/tensorrt_starter
Keywords
- cuda: 99
- rust: 67
- tensorrt: 58
- gpu: 50
- deep-learning: 42
- python: 34
- yolov5: 32
- nvidia: 30
- llm: 28
- pytorch: 27
- machine-learning: 23
- cuda-programming: 22
- yolo: 22
- yolov8: 22
- cublas: 21
- neural-network: 21
- onnx: 20
- cpp: 19
- inference: 18
- ai: 18
- llama: 17
- gpgpu: 16
- object-detection: 16
- openai: 15
- vulkan: 14
- yolov3: 14
- cuda-kernels: 14
- yolov7: 14
- cudnn: 13
- gemm: 12
- gpu-programming: 12
- gpu-computing: 12
- chatgpt: 12
- hpc: 12
- gpt: 12
- opencl: 11
- yolov4: 10
- onnxruntime: 10
- gpu-acceleration: 10
- yolox: 10
- tensor: 9
- tauri: 8
- computer-vision: 8
- jetson: 8
- autograd: 8
- detection: 8
- yolov9: 8
- yolov10: 8
- yolov6: 8
- scientific-computing: 8