An open API service indexing awesome lists of open source software.

CUDA

CUDA® is a parallel computing platform and programming model developed by NVIDIA for general computing on graphics processing units (GPUs). With CUDA, developers can dramatically speed up computing applications by harnessing the power of GPUs.

https://github.com/mihaibujanca/dynamicfusion

Implementation of the DynamicFusion paper by Newcombe et al. (CVPR 2015)

3d-reconstruction computer-vision cuda non-rigid opencv

Last synced: 30 Jan 2026

https://github.com/IBM/aihwkit

IBM Analog Hardware Acceleration Kit

ai analog-devices cuda neural-networks pytorch

Last synced: 11 May 2025

https://github.com/JuliaGPU/CUDAnative.jl

Julia support for native CUDA programming

cuda cuda-toolkit julia julia-library

Last synced: 22 Jul 2025

https://github.com/nosferalatu/simplegpuhashtable

A simple GPU hash table implemented in CUDA using lock-free techniques

cuda cuda-programming data-structures gpu gpu-cuda-programs

Last synced: 27 Dec 2025

https://github.com/dfm/extending-jax

Extending JAX with custom C++ and CUDA code

cuda jax xla

Last synced: 05 Apr 2025

https://github.com/engcang/vins-application

VINS-Fusion, VINS-Fisheye, OpenVINS, EnVIO, ROVIO, S-MSCKF, ORB-SLAM2, and NVIDIA Elbrus applied to different sets of cameras and IMUs on different boards, including desktop and Jetson boards

cuda nvidia ros ros2 slam vio visual-slam

Last synced: 11 Oct 2025

https://github.com/fixstars/cuda-bundle-adjustment

A CUDA implementation of Bundle Adjustment

bundle-adjustment cuda g2o slam structure-from-motion visual-slam

Last synced: 05 Apr 2025

https://github.com/NVIDIA/cuQuantum

Home for cuQuantum Python & NVIDIA cuQuantum SDK C++ samples

cuda cuquantum custatevec cutensornet nvidia quantum-computing

Last synced: 02 Apr 2025

https://github.com/nosferalatu/SimpleGPUHashTable

A simple GPU hash table implemented in CUDA using lock-free techniques

cuda cuda-programming data-structures gpu gpu-cuda-programs

Last synced: 06 May 2025

https://github.com/Bruce-Lee-LY/cuda_hgemm

Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores with the WMMA API and MMA PTX instructions.

cublas cuda gemm gpu hgemm matrix-multiply nvidia tensor-core

Last synced: 14 May 2025

https://github.com/alpaka-group/alpaka

Abstraction Library for Parallel Kernel Acceleration

cpp cpp17 cuda gpu header-only heterogeneous-parallel-programming hip hpc openacc openmp rocm tbb

Last synced: 15 May 2025

https://github.com/osai-ai/tensor-stream

A library for real-time video stream decoding to CUDA memory

c-plus-plus cuda python pytorch video video-processing

Last synced: 07 May 2025

https://github.com/luoyetx/mini-caffe

Minimal runtime core of Caffe: forward only, with GPU support and memory efficiency.

android caffe cuda cudnn forward-only linux mini-caffe openblas windows

Last synced: 15 Mar 2025

https://github.com/bruce-lee-ly/cuda_hgemm

Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores with the WMMA API and MMA PTX instructions.

cublas cuda gemm gpu hgemm matrix-multiply nvidia tensor-core

Last synced: 05 Apr 2025

https://github.com/ekondis/mixbench

A GPU benchmark tool for evaluating GPUs and CPUs on mixed operational intensity kernels (CUDA, OpenCL, HIP, SYCL, OpenMP)

benchmark cuda gpu hip opencl openmp sycl

Last synced: 04 Apr 2025

https://github.com/nersc/timemory

Modular C++ Toolkit for Performance Analysis and Logging. Profiling API and Tools for C, C++, CUDA, Fortran, and Python. The C++ template API is essentially a framework for creating tools: it is designed to provide a unifying interface for recording various performance measurements alongside data logging and interfaces to other tools.

analysis c cplusplus cpp cross-language cross-platform cuda cupti gotcha hardware-counters instrumentation-api memory-measurements modular-design mpi papi performance performance-measurement python roofline

Last synced: 04 Oct 2025

https://github.com/LambdaLabsML/distributed-training-guide

Best practices and guides on how to write distributed PyTorch training code

cluster cuda deepspeed distributed-training fsdp gpu gpu-cluster kuberentes lambdalabs mpi nccl pytorch sharding slurm

Last synced: 08 Mar 2025

https://github.com/NERSC/timemory

Modular C++ Toolkit for Performance Analysis and Logging. Profiling API and Tools for C, C++, CUDA, Fortran, and Python. The C++ template API is essentially a framework for creating tools: it is designed to provide a unifying interface for recording various performance measurements alongside data logging and interfaces to other tools.

analysis c cplusplus cpp cross-language cross-platform cuda cupti gotcha hardware-counters instrumentation-api memory-measurements modular-design mpi papi performance performance-measurement python roofline

Last synced: 06 May 2025

https://github.com/uob-hpc/babelstream

STREAM, for lots of devices written in many programming models

benchmark cuda gpgpu gpu hpc kokkos memory-bandwidth openacc opencl openmp parallel-processing raja sycl

Last synced: 21 Oct 2025

https://github.com/zjhellofss/KuiperLLama

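A good project for campus recruiting (autumn and spring hiring rounds) and internships: build, hands-on and from scratch, a large-model inference framework that supports LLama2/3 and Qwen2.5.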

cpp cuda inference-engine llama2 llama3 llm llm-inference qwen qwen2

Last synced: 08 Sep 2025

https://github.com/harrism/hemi

Simple utilities to enable code reuse and portability between CUDA C/C++ and standard C/C++.

c-plus-plus cuda cuda-device cuda-kernels gpu hemi

Last synced: 06 Apr 2025

https://github.com/zjhellofss/kuiperllama

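A good project for campus recruiting (autumn and spring hiring rounds) and internships: build, hands-on and from scratch, a large-model inference framework that supports LLama2/3 and Qwen2.5.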

cpp cuda inference-engine llama2 llama3 llm llm-inference qwen qwen2

Last synced: 16 May 2025

https://github.com/omlins/parallelstencil.jl

Package for writing high-level code for parallel high-performance stencil computations that can be deployed on both GPUs and CPUs

cuda gpu julia multi-gpu multi-xpu parallel staggered-grids stencil stencil-codes xpu

Last synced: 28 Jan 2026

https://github.com/omlins/ParallelStencil.jl

Package for writing high-level code for parallel high-performance stencil computations that can be deployed on both GPUs and CPUs

cuda gpu julia multi-gpu multi-xpu parallel staggered-grids stencil stencil-codes xpu

Last synced: 27 Mar 2025

https://github.com/nvidia/cuda-checkpoint

CUDA checkpoint and restore utility

checkpoint cuda

Last synced: 16 May 2025

https://github.com/fzj-jsc/tutorial-multi-gpu

Efficient Distributed GPU Programming for Exascale, an SC/ISC Tutorial

cuda exascale-computing gpu hpc isc22 isc23 isc24 isc25 mpi multi-gpu nccl nvshmem sc21 sc22 sc23 sc24 sc25 supercomputing

Last synced: 07 Feb 2026

https://github.com/UoB-HPC/BabelStream

STREAM, for lots of devices written in many programming models

benchmark cuda gpgpu gpu hpc kokkos memory-bandwidth openacc opencl openmp parallel-processing raja sycl

Last synced: 21 Apr 2025

https://github.com/mrshaw01/software-engineer

A curated learning repository focused on High-Performance Computing (HPC) — covering fundamentals to advanced topics in CUDA, MPI, C++, and Python-C++ interoperability.

cpp cuda high-performance-computing hip python

Last synced: 16 Jul 2025

https://github.com/a2flo/floor

A C++ Compute/Graphics Library and Toolchain enabling same-source CUDA/Host/Metal/OpenCL/Vulkan C++ programming and execution.

c-plus-plus compiler compute cuda graphics ios linux macos metal opencl openxr rendering spir spir-v virtual-reality vulkan windows

Last synced: 16 May 2025

https://github.com/nvidia/tilus

Tilus is a tile-level kernel programming language with explicit control over shared memory and registers.

cuda kernel programming tile

Last synced: 06 Sep 2025

https://github.com/knightcrawler25/optix-pathtracer

Simple physically based path tracer based on NVIDIA's OptiX Ray Tracing Engine

brdf cuda disney gpu optix pathtracing raytracing

Last synced: 07 Apr 2025

https://github.com/QMCPACK/qmcpack

Main repository for QMCPACK, an open-source production level many-body ab initio Quantum Monte Carlo code for computing the electronic structure of atoms, molecules, and solids with full performance portable GPU support

c-plus-plus cuda electronic-structure gpu high-performance-computing hpc mpi oneapi quantum-chemistry quantum-monte-carlo rocm

Last synced: 26 Mar 2025

https://github.com/rkinas/triton-resources

A curated list of resources for learning and exploring Triton, OpenAI's programming language for writing efficient GPU code.

cuda triton

Last synced: 16 May 2025

https://github.com/charles-r-earp/autograph

A machine learning library for Rust.

cuda machine-learning neural-networks rust

Last synced: 16 May 2025

https://github.com/lattice/quda

QUDA is a library for performing calculations in lattice QCD on GPUs.

c c-plus-plus cuda gpu mpi multi-gpu qcd

Last synced: 15 May 2025

https://github.com/gezp/docker-ubuntu-desktop

Docker image for Ubuntu Desktop that supports hardware GPU-accelerated GUI apps. You can access the container with SSH or remote desktop, just like a cloud VM.

cuda docker kasmvnc nomachine nvidia-gpu opengl remote-desktop ubuntu virtualgl

Last synced: 13 Apr 2025

https://github.com/nvidia-genomics-research/genomeworks

SDK for GPU accelerated genome assembly and analysis

alignment cuda genomics gpu mapping nvidia partial-order-alignment poa python-api

Last synced: 06 Apr 2026

https://github.com/sekwiatkowski/Komputation

Komputation is a neural network framework for the Java Virtual Machine written in Kotlin and CUDA C.

artificial-intelligence convolutional-neural-networks cuda framework gpu jvm kotlin machine-learning neural-networks nlp nvidia recurrent-neural-networks seq2seq

Last synced: 01 Apr 2025

https://github.com/NVIDIA-Genomics-Research/GenomeWorks

SDK for GPU accelerated genome assembly and analysis

alignment cuda genomics gpu mapping nvidia partial-order-alignment poa python-api

Last synced: 09 May 2025

https://github.com/pcb9382/FaceAlgorithm

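Face detection and face recognition, including face detection (retinaface, yolov5face, yolov7face, yolov8face), face detection tracking (ByteTracker), face angle calculation (Face_Angle), face alignment correction (Face_Aligner), face recognition (Arcface), mask detection (MaskRecognitiion), age and gender detection (Gender_age), silent liveness detection (Silent_Face_Anti_Spoofing), and FaceAlignment (106keypoints)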

cuda face-alignment face-detection face-recognition tensorrt yolov5face yolov7face yolov8face

Last synced: 18 Mar 2025

https://github.com/GoodAI/BrainSimulator

Brain Simulator is a platform for visual prototyping of artificial intelligence architectures.

ai brain-simulator cuda machine-learning

Last synced: 08 Jul 2025

https://github.com/andrewkchan/yalm

Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O

cpp cuda inference-engine llama llamacpp llm llm-inference machine-learning mistral

Last synced: 12 Apr 2025

https://github.com/JuliaGPU/CuArrays.jl

A Curious Cumulation of CUDA Cuisine

cuda gpu-programming julia

Last synced: 22 Jul 2025

https://github.com/rentainhe/pytorch-distributed-training

Simple tutorials on PyTorch DDP training

apex cuda ddp-training deep-learning pytorch

Last synced: 06 Mar 2026

https://github.com/LLNL/blt

A streamlined CMake build system foundation for developing HPC software

blt build-system build-tools cmake cpp cuda hpc radiuss testing

Last synced: 21 Apr 2025

https://github.com/marian-nmt/marian-dev

Fast Neural Machine Translation in C++ - development repository

cpp11 cuda fast gpu-acceleration neural-machine-translation

Last synced: 15 May 2025

https://github.com/llnl/blt

A streamlined CMake build system foundation for developing HPC software

blt build-system build-tools cmake cpp cuda hpc radiuss testing

Last synced: 15 May 2025

https://github.com/koide3/gtsam_points

A collection of GTSAM factors and optimizers for point cloud SLAM

bundle-adjustment continuous-time cuda factor-graph gpu gtsam kdtree localization mapping point-cloud registration slam voxelmap

Last synced: 12 Apr 2025

https://github.com/trinkle23897/fast-poisson-image-editing

A fast Poisson image editing implementation that can use a multi-core CPU or a GPU to handle high-resolution image input.

cpp cuda high-performance-computing image-processing jacobi-iteration jacobi-method mpi numpy openmp parallel-computing poisson-image-editing pybind11 python

Last synced: 05 Apr 2025

https://github.com/Trinkle23897/Fast-Poisson-Image-Editing

A fast Poisson image editing implementation that can use a multi-core CPU or a GPU to handle high-resolution image input.

cpp cuda high-performance-computing image-processing jacobi-iteration jacobi-method mpi numpy openmp parallel-computing poisson-image-editing pybind11 python

Last synced: 02 Apr 2025

https://github.com/CHIP-SPV/chipStar

chipStar is a tool for compiling and running HIP/CUDA on SPIR-V via OpenCL or Level Zero APIs.

cuda hip hpc level0 llvm opencl spir-v

Last synced: 04 Apr 2025

https://github.com/slicer/light-the-torch

Install PyTorch distributions with computation backend auto-detection

cuda install pip pytorch

Last synced: 10 May 2026

https://github.com/openucx/ucc

Unified Collective Communication Library

collectives cuda deep-learning hpc infiniband mpi openshmem pgas pytorch roce sharp

Last synced: 16 May 2025

https://github.com/jcuda/jcuda

JCuda - Java bindings for CUDA

cuda gpu java

Last synced: 14 Apr 2025

https://github.com/pmeier/light-the-torch

Install PyTorch distributions with computation backend auto-detection

cuda install pip pytorch

Last synced: 24 Mar 2025

https://github.com/modelscope/dash-infer

DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including CUDA, x86 and ARMv9.

cpu cuda guided-decoding llm llm-inference native-engine

Last synced: 12 Apr 2025

https://github.com/AmusementClub/vs-mlrt

Efficient CPU/GPU/Vulkan ML Runtimes for VapourSynth (with built-in support for waifu2x, DPIR, RealESRGANv2/v3, Real-CUGAN, RIFE, SCUNet and more!)

artificial-intelligence cuda deep-learning directml dpir gpu migraphx ncnn neural-network onnx onnxruntime openvino real-cugan real-esrgan rife tensorrt vapoursynth vulkan waifu2x

Last synced: 24 Mar 2025

https://github.com/ritchieng/dlami

A Deep Learning Amazon Web Services (AWS) AMI that is open, free and works. Run in less than 5 minutes. TensorFlow, Keras, PyTorch, Theano, MXNet, CNTK, Caffe and all dependencies.

ami aws cuda cudnn5 keras python tensorflow ubuntu

Last synced: 08 May 2025

https://github.com/shapelets/khiva

An open-source library of algorithms to analyse time series on GPU and CPU.

clustering cpp cuda data-series discords distances gpu khiva kshape matrix-profile motifs multicore opencl shapelets snippets time-series timeseries

Last synced: 08 May 2025

https://github.com/opendilab/di-hpc

OpenDILab RL HPC OP Lib, including CUDA and Triton kernels

cuda hpc lstm pytorch reinforcement-learning triton

Last synced: 09 Apr 2025

https://github.com/ceed/libceed

CEED Library: Code for Efficient Extensible Discretizations

api ceed cuda ecp exascale-computing gpu high-order high-performance-computing hpc julia linear-algebra

Last synced: 15 May 2025

https://github.com/marnovo/macos-egpu-cuda-guide

Set up CUDA for machine learning (and gaming) on macOS using an NVIDIA eGPU

apple cuda deep-learning egpu gaming gpu guide hacktoberfest mac machine-learning macos nvidia

Last synced: 17 Aug 2025

https://github.com/CEED/libCEED

CEED Library: Code for Efficient Extensible Discretizations

api ceed cuda ecp exascale-computing gpu high-order high-performance-computing hpc julia linear-algebra

Last synced: 07 May 2025

https://github.com/marnovo/macOS-eGPU-CUDA-guide

Set up CUDA for machine learning (and gaming) on macOS using an NVIDIA eGPU

apple cuda deep-learning egpu gaming gpu guide hacktoberfest mac machine-learning macos nvidia

Last synced: 14 Jul 2025

https://github.com/wangzyon/NVIDIA_SGEMM_PRACTICE

Step-by-step optimization of CUDA SGEMM

cuda sgemm

Last synced: 04 Apr 2025

https://github.com/Hellisotherpeople/CX_DB8

A contextual, biasable, word-or-sentence-or-paragraph extractive summarizer powered by the latest in text embeddings (BERT, Universal Sentence Encoder, Flair)

contextual-summarization cuda debate-evidence embeddings extractive-summarization flair python semantic-search semantic-summarization summarization summarizer token-level-summarization universal-sentence-encoder

Last synced: 13 Jul 2025

https://github.com/bh107/bohrium

Automatic parallelization of Python/NumPy, C, and C++ codes on Linux and MacOSX

cuda gpu gpu-acceleration multi-core numpy opencl parallel-computing

Last synced: 21 Oct 2025

https://github.com/bytedance/abq-llm

An acceleration library that supports arbitrary bit-width combinatorial quantization operations

cuda llm-inference mlsys quantized-networks research

Last synced: 04 Apr 2025

https://github.com/ifl-camp/supra

SUPRA: Software Defined Ultrasound Processing for Real-Time Applications - An Open Source 2D and 3D Pipeline from Beamforming to B-Mode

2d 3d cuda openigtlink pipeline real-time software-defined supra tum ultrasound ultrasound-imaging ultrasound-pipeline

Last synced: 05 Jul 2025

https://github.com/leimao/cuda-gemm-optimization

CUDA Matrix Multiplication Optimization

cuda

Last synced: 20 Feb 2026

https://github.com/turlucode/ros-docker-gui

ROS Docker Containers with X11 (GUI) support [Linux]

cuda docker gui nvidia ros ros2

Last synced: 08 Apr 2025

https://github.com/DeMoriarty/TorchPQ

Approximate nearest neighbor search with product quantization on GPU in PyTorch and CUDA

cuda nearest-neighbor-search pytorch

Last synced: 01 Apr 2025

https://github.com/demoriarty/torchpq

Approximate nearest neighbor search with product quantization on GPU in PyTorch and CUDA

cuda nearest-neighbor-search pytorch

Last synced: 05 Apr 2025

https://github.com/alpine-dav/ascent

A flyweight in situ visualization and analysis runtime for multi-physics HPC simulations

analysis cuda data-viz hpc mpi parallel-computing radiuss rendering scientific-computing

Last synced: 25 Jun 2025

https://github.com/helmut-hoffer-von-ankershoffen/jetson

Helmut Hoffer von Ankershoffen experimenting with arm64 based NVIDIA Jetson (Nano and AGX Xavier) edge devices running Kubernetes (K8s) for machine learning (ML) including Jupyter Notebooks, TensorFlow Training and TensorFlow Serving using CUDA for smart IoT.

ansible archiconda cuda docker edge-devices hoffer-von-ankershoffen jupyter k8s kubeflow kubernetes kustomize machine-learning ml nvidia-jetson-nano nvidia-jetson-xavier skaffold smart-iot software-engineering tensorflow-serving virtualbox

Last synced: 14 Apr 2025

https://github.com/nvidia/dl4agx

Deep Learning tools and applications for NVIDIA AGX platforms.

autonomous-driving computer-vision cuda deep-learning drive-agx embedded

Last synced: 12 Apr 2025