CUDA
CUDA® is a parallel computing platform and programming model developed by NVIDIA for general-purpose computing on graphics processing units (GPUs). With CUDA, developers can dramatically speed up computing applications by harnessing the power of GPUs.
- GitHub: https://github.com/topics/cuda
- Wikipedia: https://en.wikipedia.org/wiki/CUDA
- Created by: Nvidia
- Released: June 23, 2007
- Related Topics: nvcc
- Last updated: 2026-06-16 00:07:13 UTC
https://github.com/goldbattle/libelas-gpu
Implementation of LIBELAS in CUDA.
cpu cuda depth-maps gpu libelas libelas-gpu
Last synced: 10 Apr 2025
https://github.com/Dr-Noob/peakperf
Achieve peak performance on x86 CPUs and NVIDIA GPUs
assembly avx cpu cpu-frequency cpu-microarchitecture cuda gflop gpu intrinsics microarchitecture microbenchmark nvidia performance
Last synced: 21 Apr 2025
https://github.com/ecrc/kblas-gpu
Subset of BLAS routines optimized for NVIDIA GPUs
Last synced: 01 Mar 2026
https://github.com/dr-noob/peakperf
Achieve peak performance on x86 CPUs and NVIDIA GPUs
assembly avx cpu cpu-frequency cpu-microarchitecture cuda gflop gpu intrinsics microarchitecture microbenchmark nvidia performance
Last synced: 09 Apr 2025
https://github.com/harrism/sublimetext-cuda-cpp
CUDA C++ package for Sublime Text 2 & 3
cuda snippets sublime-text tmlanguage
Last synced: 05 Aug 2025
https://github.com/iconben/z-image-studio
A CLI, a web UI, and an MCP server for the Z-Image-Turbo text-to-image generation model (Tongyi-MAI/Z-Image-Turbo base model as well as quantized models)
ai apple apple-silicon cuda diffusers localllm lora mcp-server mps python text-to-image text2image webui z-image z-image-turbo
Last synced: 15 Jan 2026
https://github.com/nickkarpowicz/lightwaveexplorer
An efficient, user-friendly solver for nonlinear light-matter interaction
c-plus-plus cuda nonlinear-optics oneapi optics-simulation simulation sycl
Last synced: 07 Feb 2026
https://github.com/larc/gproshan
geometry processing and shape analysis framework
computational-geometry cpp cuda dictionary-learning geometry-processing opengl shape-analysis sparse-coding
Last synced: 16 Apr 2025
https://github.com/NickKarpowicz/LightwaveExplorer
An efficient, user-friendly solver for nonlinear light-matter interaction
c-plus-plus cuda nonlinear-optics oneapi optics-simulation simulation sycl
Last synced: 04 Apr 2025
https://github.com/ztxtech/Time-Evidence-Fusion-Network
Official implementation of "Time Evidence Fusion Network: Multi-source View in Long-Term Time Series Forecasting" (https://arxiv.org/abs/2405.06419)
cuda deep-learning machine-learning macos neural-network neural-networks pytorch time-series time-series-analysis time-series-forecasting time-series-prediction uestc
Last synced: 01 Apr 2025
https://github.com/sh1ng/arboretum
Gradient Boosting powered by GPU(NVIDIA CUDA)
arboretum cuda gpu gradient-boosting gradient-boosting-machine machine-learning python
Last synced: 07 Nov 2025
https://github.com/xmartlabs/cuda-calculator
Online CUDA Occupancy Calculator
cuda gpgpu gpu gpu-computing gpu-kernels gpu-programming kernel nvidia occupancy
Last synced: 10 Mar 2025
https://github.com/fahimfba/cuda-wsl2-ubuntu
Install CUDA on Windows 11 using WSL2
cuda cuda-programming cuda-support cuda-toolkit cuda-wsl deep-learning deep-reinforcement-learning deeplearning deeplearning-ai machine-learning machinelearning machinelearning-python wsl wsl-environment wsl-ubuntu wsl2
Last synced: 14 Apr 2025
https://github.com/gunrock/loops
🎃 GPU load-balancing library for regular and irregular computations.
cuda gpu gpu-computing hpc load-balancing parallel
Last synced: 09 May 2026
https://github.com/tomrunia/pytorchsteerablepyramid
PyTorch implementation of the Complex Steerable Pyramid
batch computer-vision cuda image-processing mkl pyramid pytorch
Last synced: 04 May 2025
https://github.com/bruce-lee-ly/cuda_hgemv
Several optimization methods of half-precision general matrix vector multiplication (HGEMV) using CUDA core.
cublas cuda cuda-core gemm gemv gpu hgemm hgemv matrix-multiply nvidia tensor-core
Last synced: 17 Jun 2025
https://github.com/open-atmos/PySDM
Pythonic particle-based (super-droplet) warm-rain/aqueous-chemistry cloud microphysics package with box, parcel & 1D/2D prescribed-flow examples in Python, Julia and Matlab
atmospheric-modelling atmospheric-physics cuda gpu gpu-computing monte-carlo-simulation numba nvrtc particle-system physics-simulation pint pypi-package python research simulation thrust
Last synced: 04 Apr 2025
https://github.com/jpuigcerver/pytorch-baidu-ctc
PyTorch bindings for Baidu's Warp-CTC
Last synced: 11 Apr 2025
https://github.com/rapidsai/nx-cugraph
GPU Accelerated Backend for NetworkX
Last synced: 17 Mar 2026
https://github.com/bokutotu/zenu
A Deep Learning framework with very few dependencies, Written in Rust
ai autograd blas cublas cuda cudnn deep-learning deep-neural-networks gpu-computing hpc rust
Last synced: 05 Apr 2025
https://github.com/Bruce-Lee-LY/cuda_hgemv
Several optimization methods of half-precision general matrix vector multiplication (HGEMV) using CUDA core.
cublas cuda cuda-core gemm gemv gpu hgemm hgemv matrix-multiply nvidia tensor-core
Last synced: 14 May 2025
https://github.com/dakenf/stable-diffusion-nodejs
GPU-accelerated JavaScript runtime for Stable Diffusion. Uses a modified ONNX Runtime to support CUDA and DirectML.
cuda directml nodejs stable-diffusion typescript
Last synced: 26 Oct 2025
https://github.com/fynv/thrustrtc
CUDA tool set for non-C++ languages that provides functionality similar to Thrust, with NVRTC at its core.
Last synced: 07 Apr 2025
https://github.com/wizyoung/optical-flow-gpu-docker
Compute dense optical flow using TV-L1 algorithm with NVIDIA GPU acceleration.
Last synced: 07 May 2025
https://github.com/moinfra/sylvan
🌳 An educational modern C++ deep learning framework supporting CUDA
autograd cuda deep-learning-framework dnn machine-learning transformer
Last synced: 14 Jun 2026
https://github.com/saddam213/llamastack
ASP.NET Core Web, WebApi & WPF implementations for LLama.cpp & LLamaSharp
alpaca chatgpt cuda huggingface llama llama2 llamacpp llamasharp llm
Last synced: 30 Sep 2025
https://github.com/enp1s0/ozimmu
FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme
cuda gemm mixed-precision tensorcore tensorcores
Last synced: 09 Apr 2025
https://github.com/kibae/pg_onnx
pg_onnx: ONNX Runtime integrated with PostgreSQL. Perform ML inference with data in your database.
ai contributions-welcome cuda deep-learning inference machine-learning onnx onnxruntime postgresql postgresql-extension
Last synced: 28 Apr 2026
https://github.com/goldsborough/k-means
Code accompanying my blog post on k-means in Python, C++ and CUDA
cpp cuda k-means machine-learning parallel python
Last synced: 12 Oct 2025
https://github.com/pkestene/ramsesgpu
Astrophysics MHD simulation code optimized for large GPU clusters
astrophysics cea cfd conservation-law cuda euler-equations finite-volume gpu gpu-computing hdf5 hpc kelvin-helmholtz-instability magnetohydrodynamics mhd muscl-hancock parallel-computing pnetcdf rayleigh-taylor shearing-box turbulence
Last synced: 25 Feb 2026
https://github.com/denzp/rust-ptx-builder
Convenient `build.rs` helper for NVPTX crates
Last synced: 16 Mar 2025
https://github.com/brickray/gpu-pathtracer
Physically based path tracer on the GPU
cuda gpu pathtracing raytracing tracing
Last synced: 08 May 2025
https://github.com/xiaosong9905/cuda-optimization-guide
Xiao's CUDA Optimization Guide [Actively Adding New Content]
cuda gpu hpc nvidia-gpu optimization parallel-computing
Last synced: 15 May 2025
https://github.com/jeng1220/openacc_fortran_examples
Simple OpenACC Fortran Examples
Last synced: 22 Mar 2025
https://github.com/enp1s0/ozIMMU
FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme
cuda gemm mixed-precision tensorcore tensorcores
Last synced: 04 Apr 2025
https://github.com/pikselkroken/pixlstash
PixlStash is a Python-based image management, tagging and editing web app leveraging AI tools for tagging, similarity checks, grouping and quality assessment. It has a REST-API and a VUE-based web frontend.
captioning-images comfyui comfyui-workflow cross-platform cuda docker-image image-classification image-database image-editing image-management image-manager image-tagging locally-hosted machine-learning picture-management pictures python self-hosted stable-diffusion vue
Last synced: 07 Jun 2026
https://github.com/adamtiger/tinyGPUlang
Tutorial on building a gpu compiler backend in LLVM
Last synced: 31 Mar 2026
https://github.com/stereolabs/zed-docker
Docker images for the ZED SDK
cuda docker nvidia-docker zed-camera
Last synced: 22 Feb 2026
https://github.com/rokibulislaam/colab-ffmpeg-cuda
FFmpeg build with CUDA support for Linux (especially for Google Colab)
colab-notebook cuda ffmpeg ffmpeg-installer h264 h265 hevc-encoder nvenc ubuntu1804
Last synced: 14 Apr 2025
https://github.com/khrylx/dsgpuraytracing
A GPU-based ray tracer using CUDA
Last synced: 11 Jul 2025
https://github.com/adityashrm21/book-recommender-system-rbm
A book recommender system created using simple Restricted Boltzmann Machines in TensorFlow
book-recommender books cuda geoffrey-hinton hopfield-network neural-networks python3 rbm recommender-system restricted-boltzmann-machines tensorflow
Last synced: 29 Apr 2025
https://github.com/Natsu-Akatsuki/RangeNet-TensorRT
RangeNet++ with newer TensorRT versions (e.g. 8 to 10), libtorch, and CUDA programming.
cuda libtorch semantic-segmentation tensorrt
Last synced: 31 Jul 2025
https://github.com/loeeeee/immich-in-lxc
Install Immich in LXC with optional CUDA support
bare-metal cuda guide immich install-script lxc machine-learning proxmox-ve ubuntu
Last synced: 01 Oct 2025
https://github.com/juliafolds/foldscuda.jl
Data-parallelism on CUDA using Transducers.jl and for loops (FLoops.jl)
cuda gpu high-performance iterators julia map-reduce parallel transducers
Last synced: 13 Apr 2025
https://github.com/trixi-gpu/trixicuda.jl
CUDA acceleration for Trixi.jl
acceleration cuda gpu high-performance-computing julia numerical-simulations parallel-programming pde scientific-computing
Last synced: 28 Jun 2025
https://github.com/DefTruth/ffpa-attn-mma
📚[WIP] FFPA: Yet another Faster Flash Prefill Attention with O(1)⚡️GPU SRAM complexity for headdim > 256, 1.8x~3x↑🎉faster vs SDPA EA.
attention cuda flash-attention mlsys sdpa tensor-cores
Last synced: 09 Oct 2025
https://github.com/forceflow/cuda2glcore
Implementation of CUDA to OpenGL rendering
cuda graphics opengl rendering
Last synced: 21 Jan 2026
https://github.com/Par4All/par4all
Par4All is an automatic parallelizing and optimizing compiler (workbench) for C and Fortran sequential programs
abstract-interpretation automatic-parallelization c99 cuda fortran interprocedural opencl parallelization polyhedral-model
Last synced: 22 Apr 2025
https://github.com/par4all/par4all
Par4All is an automatic parallelizing and optimizing compiler (workbench) for C and Fortran sequential programs
abstract-interpretation automatic-parallelization c99 cuda fortran interprocedural opencl parallelization polyhedral-model
Last synced: 10 Apr 2025
https://github.com/emptysoal/cuda-image-preprocess
Speed up image preprocessing with CUDA when handling images or running TensorRT inference
cnn cuda cuda-demo cuda-kernels cuda-programming deep-learning image-processing tensorrt
Last synced: 01 Aug 2025
https://github.com/rbaygildin/learn-gpgpu
Algorithms implemented in CUDA + resources about GPGPU
cublas cuda curand gpgpu gpu gpu-computing image-processing nvidia opencl parallel-computing pycuda
Last synced: 14 May 2025
https://github.com/3dlg-hcvc/m3dref-clip
[ICCV 2023] Multi3DRefer: Grounding Text Description to Multiple 3D Objects
3d clip computer-vision cuda deep-learning localization pytorch pytorch-lightning transformer visual-grounding
Last synced: 04 Aug 2025
https://github.com/ingonyama-zk/fast-danksharding
Danksharding Builder with GPU acceleration
Last synced: 10 Apr 2025
https://github.com/ctuning/ctuning-programs
Collective Knowledge extension with unified and customizable benchmarks (with extensible JSON meta information) to be easily integrated with customizable and portable Collective Knowledge workflows. You can easily compile and run these benchmarks using different compilers, environments, hardware and OS (Linux, MacOS, Windows, Android). More info:
c collaborative-benchmarking collaborative-optimization collective-knowledge common-benchmarks cpp crowd-benchmarking crowd-tuning cuda customizable-benchmarking fortran json-api json-metadata open-benchmarks opencl reproducible-research reproducible-workflows
Last synced: 10 Jan 2026
https://github.com/q-minh/physicsbasedanimationtoolkit
Cross-platform C++ library of algorithms and data structures commonly used in computer graphics research on physically-based simulation with Python bindings.
animation cmake cpp cuda gpu graphics physics python simulation
Last synced: 13 May 2025
https://github.com/stellar-group/octotiger
Astrophysics program simulating the evolution of star systems based on the fast multipole method on adaptive Octrees
astrophysics cuda cuda-kernels hpx kokkos simd stellar-mergers sycl
Last synced: 04 Jul 2025
https://github.com/yehengchen/ubuntu-deep-learning-environment-setup
Guide to installing TensorFlow with an NVIDIA GPU and setting up a deep learning environment - NVIDIA drivers/CUDA/cuDNN/tensorflow-gpu/Chinese documentation
cuda cudnn deep-learning nvidia-gpu tensorflow tensorflow-gpu ubuntu
Last synced: 05 May 2025
https://github.com/projectphysx/ptxprofiler
A simple profiler to count Nvidia PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis.
cuda gpu gpu-acceleration gpu-computing gpu-programming hpc nvidia nvidia-cuda nvidia-gpu opencl profiler ptx ptx-utils roofline-model sycl
Last synced: 10 Sep 2025
https://github.com/shredengineer/magneticalc
MagnetiCalc calculates the magnetic field of arbitrary coils.
coil cuda current education engineering field-calculation flux-density gui inductance interactive jit linux magnetic-field magnetostatics metric python simulation-modeling vector-potential visualization wire
Last synced: 27 Jul 2025
https://github.com/ProjectPhysX/PTXprofiler
A simple profiler to count Nvidia PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis.
cuda gpu gpu-acceleration gpu-computing gpu-programming hpc nvidia nvidia-cuda nvidia-gpu opencl profiler ptx ptx-utils roofline-model sycl
Last synced: 04 Apr 2025
https://github.com/jefflarkin/openacc-interoperability
Interoperability examples for OpenACC.
Last synced: 31 Jul 2025
https://github.com/1ytic/warp-rna
Recurrent Neural Aligner
cuda forward-backward rna rnn-transducer
Last synced: 14 Aug 2025
https://github.com/andi611/apriori-and-eclat-frequent-itemset-mining
Python implementation of the Apriori and Eclat algorithms, two of the best-known basic algorithms for mining frequent item sets in a set of transactions.
apriori apriori-algorithm cuda data-mining data-mining-algorithms eclat eclat-algorithm frequent-itemset-mining frequent-itemsets frequent-pattern-mining gcc gpu gpu-acceleration gpu-programming plot pycuda python transaction transactions
Last synced: 13 Apr 2025
https://github.com/kevinzakka/learn-cuda
Learning some parallel programming with CUDA
Last synced: 24 Mar 2025
https://github.com/eth-cscs/spfft
Sparse 3D FFT library with MPI, OpenMP, CUDA and ROCm support
cuda fft fft-library gpu-acceleration hpc mpi rocm
Last synced: 17 Jun 2025
https://github.com/abraham-ai/eden
Eden converts your Python function into a hosted endpoint with minimal changes to your existing code 🧙‍♂️
celery cuda fastapi python redis-client task-queue
Last synced: 23 Oct 2025
https://github.com/abhisheknair10/llama3.cu
Lightweight Llama 3 8B Inference Engine in CUDA C
Last synced: 14 Apr 2025
https://github.com/STEllAR-GROUP/octotiger
Astrophysics program simulating the evolution of star systems based on the fast multipole method on adaptive Octrees
astrophysics cuda cuda-kernels hpx kokkos simd stellar-mergers sycl
Last synced: 04 Apr 2025
https://github.com/passionlab/openequivariance
OpenEquivariance: a fast, open-source GPU JIT kernel generator for the Clebsch-Gordan Tensor Product.
cuda equivariance geometric-deep-learning graph-neural-networks sparse-tensors
Last synced: 16 Jan 2026
https://quokka-astro.github.io/quokka/
Two-moment AMR radiation hydrodynamics (with self-gravity, particles, and chemistry) on CPUs/GPUs for astrophysics
adaptive-mesh-refinement astrochemistry astrophysics cuda gpu hip hydrodynamics particles rocm self-gravity
Last synced: 09 Mar 2025
https://github.com/govertb/GPUGraphLayout
An experimental GPU accelerated implementation of ForceAtlas2
cuda forceatlas2 gephi graph-algorithms graph-layout social-network-analysis visualization
Last synced: 04 Apr 2025
https://github.com/js1010/cusim
Superfast CUDA implementation of Word2Vec and Latent Dirichlet Allocation (LDA)
cuda gensim gpu lda topic-modeling w2v word-embedding
Last synced: 30 Apr 2025
https://github.com/flatironinstitute/jaxmg
JAXMg: A multi-GPU linear solver in JAX
cuda distributed-computing jax
Last synced: 17 Mar 2026
https://github.com/AstroAccelerateOrg/astro-accelerate
AstroAccelerate is a many-core accelerated software package for processing time-domain radio-astronomy data.
Last synced: 31 Mar 2025
https://github.com/luisagroup/luisa-compute-rs
Rust frontend to LuisaCompute and more!
computer-graphics cuda differentiable-programming differentiable-rendering directx dsl dx gpu gpu-programming graphics raytracing rendering rust shading-language vulkan
Last synced: 20 Aug 2025
https://github.com/safeailab/zkdl
zkDL, an open source toolkit for zero-knowledge proofs of deep learning powered by CUDA
cuda deep-neural-networks gpu-acceleration privacy-enhancing-technologies zero-knowledge-proof
Last synced: 17 Jan 2026
https://github.com/lucidrains/autoregressive-linear-attention-cuda
CUDA implementation of autoregressive linear attention, with all the latest research findings
artificial-intelligence attention-mechanisms cuda deep-learning linear-attention
Last synced: 09 Oct 2025
https://github.com/star-hengxing/cs149-xmake
CS149 xmake version
cuda hpc ispc parrallel-computing xmake
Last synced: 06 Feb 2026
https://github.com/chiehpower/Setup-deeplearning-tools
Set up CI in DL/ cuda/ cudnn/ TensorRT/ onnx2trt/ onnxruntime/ onnxsim/ Pytorch/ Triton-Inference-Server/ Bazel/ Tesseract/ PaddleOCR/ NVIDIA-docker/ minIO/ Supervisord on AGX or PC from scratch.
agx ci cuda cudnn deep-learning docker installation minio nvidia onnx-simplifier onnx2trt onnxruntime paddleocr pytorch supervisord tensorrt tensorrt-inference-server tesseract-ocr triton-inference-server triton-server
Last synced: 20 Mar 2025
https://github.com/kokkos/kokkos-remote-spaces
Distributed View Extension for Kokkos
cuda distributed-computing gpu high-performance-computing hpc mpi parallel-computing pgas
Last synced: 03 Oct 2025
https://github.com/neur1n/x.h
Cross platform C/C++ utilities.
c cpp cross-platform cublas cuda logger logging
Last synced: 14 Jan 2026
https://github.com/cair/pytsetlinmachinecuda
Massively Parallel and Asynchronous Architecture for Logic-based AI
classification convolution cuda gpu learning-automata logic-based-artificial-intelligence regression tsetlin-machine
Last synced: 28 Oct 2025
https://github.com/autodesk/neon
Multi-GPU Framework for Voxel Grid Computations
cuda gpu gpu-acceleration grid hpc lbm parallel parallel-computing
Last synced: 21 Aug 2025
https://github.com/weft/warp
Continuous-energy Monte Carlo neutron transport in general geometries on GPUs
carlo cuda gpu monte monte-carlo neutron transport
Last synced: 04 Apr 2025
https://github.com/sskorol/vosk-api-gpu
Vosk ASR Docker images with GPU for Jetson boards, PCs, M1 laptops and GCP
asr cuda docker gcp gpu jetson jetson-nano jetson-xavier-nx m1 nvidia nvidia-docker vosk vosk-api
Last synced: 23 Mar 2025
https://github.com/dansarie/sboxgates
Program for finding low gate count implementations of S-boxes.
cryptanalysis cuda logic-circuit mpi
Last synced: 21 Feb 2026
https://github.com/andravin/spio
Memory-Efficient CUDA kernels for training ConvNets with PyTorch.
convolutional-neural-networks cuda pytorch
Last synced: 14 Jul 2025
https://github.com/lwYeo/SoliditySHA3Miner
All-in-one mixed multi-GPU (NVIDIA, AMD, Intel) & CPU miner that solves proof of work to mine supported EIP918 tokens in a single instance (with API).
0xbitcoin amdminer cpuminer cuda ethos gpu-miner gpu-mining gpumining hiveos igpu linux miner nvidia-miner opencl solo-mining windows-10
Last synced: 06 May 2025
https://github.com/bruce-lee-ly/decoding_attention
Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA core for the decoding stage of LLM inference.
cuda cuda-core decoding-attention flash-attention flashinfer flashmla gpu gqa inference large-language-model llm mha mla mqa multi-head-attention nvidia
Last synced: 19 Aug 2025