CUDA
CUDA® is a parallel computing platform and programming model developed by NVIDIA for general computing on graphics processing units (GPUs). With CUDA, developers can dramatically speed up computing applications by harnessing the power of GPUs.
- GitHub: https://github.com/topics/cuda
- Wikipedia: https://en.wikipedia.org/wiki/CUDA
- Created by: Nvidia
- Released: June 23, 2007
- Related Topics: nvcc
- Last updated: 2026-06-23 00:07:15 UTC
https://github.com/galaxies99/inception-cuda
CUDA Implementation of Inception
Last synced: 12 Apr 2025
https://github.com/xlite-dev/HGEMM
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA PTX and CuTe API. 🎉🎉
Last synced: 30 Jul 2025
https://github.com/prithivsakthiur/vlm-parsing
VLM-Parsing is a Gradio-based web application for parsing documents and images into structured HTML and Markdown formats using advanced Vision Language Models (VLMs).
cuda gradio html huggingface-models huggingface-spaces huggingface-transformers logics markdown ocr-recognition pytorch qwen2-5-vl spaces vlm
Last synced: 05 Apr 2026
https://github.com/pothosware/pothosgpu
Pothos toolkit for ArrayFire API support
arrayfire cuda dataflow dataflow-programming gpu opencl pothos
Last synced: 19 Apr 2026
https://github.com/maliknaik16/parallel-computing
CUDA programming in C++ for high-performance computing using Nvidia GPUs, optimized for tasks like machine learning or image processing
cores cpp cuda gpu makefile matrix nvcc optimization
Last synced: 10 Jun 2025
https://github.com/markdtw/parallel-programming
Basic Pthread, OpenMP, CUDA examples
cuda openmp parallel-programming pthreads
Last synced: 20 Apr 2026
https://github.com/debowin/gpu-parallel-recommender-system
GPGPU Parallel User-User Collaborative Filtering System in CUDA C
collaborative-filtering cuda gpu-programming movielens-dataset recommender-system
Last synced: 24 Apr 2026
https://github.com/kpetridis24/four-russians-algorithm
Boolean matrix multiplication accelerated by the four-Russians algorithm
c cuda gpu high-performance matrix-multiplication preprocess
Last synced: 29 May 2026
https://github.com/teodutu/asc
Computer Systems Architecture - UPB 2020
cache-optimization cuda parallel-programming profiling python-threading
Last synced: 24 Apr 2026
https://github.com/csvancea/gpu-hashtable
GPU-backed linear-probing hash table implemented in CUDA. Supports batch operations such as insert and retrieval.
Last synced: 24 Apr 2026
https://github.com/tiw302/mandelbrot-c
A simple Mandelbrot set explorer written in C. Crafted with SDL2 and multithreaded rendering for a smooth experience. ‹(•_•)›
c cuda fractal graphics mandelbrot multithreading sdl2 web webassembly
Last synced: 26 Apr 2026
https://github.com/neoblizz/cupti-plus-plus
CUPTI++ is a C++ interface to the CUDA Profiling Tools Interface (CUPTI).
cpp cuda cuda-profiler cupti profiler
Last synced: 26 Apr 2026
https://github.com/navdeep-g/dimreduce4gpu
Dimensionality reduction ("dimreduce") on GPUs ("4gpu")
cplusplus cuda dimensionality-reduction gpu linear-algebra pca python svd unsupervised-learning
Last synced: 14 Apr 2025
https://github.com/alpha74/cuda_basics
Nvidia NVCC CUDA programs for beginners.
c cpp cuda cuda-programs nvcc nvidia parallel-computing parallel-programming
Last synced: 08 May 2026
https://github.com/stdogpkg/cukuramoto
A Python/CUDA package that numerically solves the Kuramoto model using Heun's method
complex-networks cuda kuramoto-model
Last synced: 28 Jan 2026
https://github.com/neoblizz/spmv
Efficient Sparse Matrix-Vector Multiplication (SpMV) using ModernGPU (MTX + CSR formats).
csr cuda gpgpu load-balancing mtx spmv
Last synced: 28 Apr 2026
https://github.com/grakshith/parallel-k-means
K-Means clustering for Image Colour Quantization and Image Compression
cuda image-color-quantization image-compression k-means mpi opencv openmp
Last synced: 28 Apr 2026
https://github.com/mu7annad0/100gpu
100 Days of CUDA: Optimizing My Life, One Kernel at a Time. 🔄🔥
Last synced: 08 Mar 2026
https://github.com/ginkgo-project/cudaarchitectureselector
A CMake module simplifying the specification of CUDA architectures
Last synced: 05 Nov 2025
https://github.com/aiday-mar/mpi-cuda-project
Using MPI and CUDA in order to accelerate the conjugate gradient algorithm execution in C++
c-plus-plus cuda gpu mpi university-project
Last synced: 02 May 2026
https://github.com/xmas7/cudampi
A large hybrid CPU/GPU sorting network using CUDA and MPI. The sorting network uses a standard Quicksort for CPUs and a custom Bitonic Sort for GPUs. These two algorithms were the fastest in a number of prior benchmarks.
cpu cuda gpu hybrid mpi network
Last synced: 29 Apr 2026
https://github.com/pelayo-felgueroso/tensorflow-gpu-setup
Step-by-step guide to installing TensorFlow with GPU support on Conda.
artificial-intelligence cuda deep-learning gpu machine-learning nvidia nvidia-gpu setup-guide tensorflow
Last synced: 17 Feb 2026
https://github.com/l1cacheDell/CUDA_Code
Code for learning CUDA. Implementations of multiple kernels.
Last synced: 10 Mar 2025
https://github.com/tvanfossen/entropic
Local-first agentic inference engine in C/C++. Multi-tier model routing, grammar-constrained output, MCP tool servers. Embeddable via C ABI.
agentic-ai agentic-framework cpp cpp20 cuda edge-ai embedded-ai gbnf gguf grammar-constrained-decoding inference-engine llama-cpp llm local-llm mcp on-device-ai privacy-first tool-calling
Last synced: 30 May 2026
https://github.com/isazi/aoflagger
AOFlagger Radio Frequency Interference mitigation algorithm.
Last synced: 30 Apr 2026
https://github.com/headless-start/data-augmentation-impact
This repository examines the effect of data augmentation of the training set during model training.
augmented-images cuda data gpu keras matplotlib mnist opencv-python python3 tensorflow training-data
Last synced: 05 Apr 2026
https://github.com/dqbd/cuda-btree
Implementation of B-Trees on NVIDIA CUDA
Last synced: 30 Apr 2026
https://github.com/szymon423/tsp-cpu-vs-gpu
Simple brute force approach to solve travelling salesman problem with CPU and GPU
Last synced: 11 Mar 2025
https://github.com/nixos-cuda/cuda-legacy
Select CUDA package sets which have aged out of Nixpkgs. [maintainers=@ConnorBaker, @SomeoneSerge]
Last synced: 15 May 2026
https://github.com/kar-dim/watermarking-gpu
Code for my Diploma thesis at Information and Communication Systems Engineering (University of the Aegean, School of Engineering) with title "Efficient implementation of watermark and watermark detection algorithms for image and video using the graphics processing unit". Part 2 / GPU
arrayfire cpp cuda ffmpeg gpu image-processing opencl parallel-computing video-processing watermark-image watermarking
Last synced: 09 Apr 2025
https://github.com/podgorskiy/deeplearningserversetup
My notes on setting up a server for Deep-Learning
cuda deep-learning driver ethernet ipmi neural-network nfs notes nvidia nvidia-driver nvidia-gpu server sshfs ubuntu
Last synced: 22 Aug 2025
https://github.com/dhruvsrikanth/cudann
A distributed implementation of a deep learning framework in CUDA.
cpp cuda deep-learning deep-learning-framework gpu-programming high-performance-computing hpc parallel-programming
Last synced: 01 May 2026
https://github.com/bogdanminko/laperf
La Perf is a framework for AI performance benchmarking — covering LLMs, VLMs, embeddings, with power-metrics collection.
ai-benchmark ai-performance apple-silicon cuda lmstudio ml-benchmark mlx mps nvidia-gpu ollama open-source-benchmark
Last synced: 15 May 2026
https://github.com/superlinear-ai/scipy-notebook-gpu
jupyter/scipy-notebook with CUDA Toolkit, cuDNN, NCCL, and TensorRT
cuda cudnn docker nccl scipy-notebook tensorflow tensorrt
Last synced: 01 May 2026
https://github.com/nellogan/distributed_compy
Distributed_compy is a distributed computing library that offers multi-threading, heterogeneous (CPU + multi-GPU), and multi-node support
cluster cuda heterogeneous-parallel-programming multi-threading multigpu openmp openmpi
Last synced: 16 Aug 2025
https://github.com/dito97/gol
High-performance Computing (90535) final project at UniGe
Last synced: 02 May 2026
https://github.com/cklxx/arle
Rust-native inference runtime for Qwen3 / Qwen3.5 — OpenAI-compatible serving + integrated agent, train, and self-evolution workflows. CUDA + Metal, no PyTorch on the hot path.
agent cuda flashinfer gspo inference infra kv-cache llm metal mlx openai-compatible qwen3 qwen35 rl rust
Last synced: 02 May 2026
https://github.com/kim-hwiwon/t-espresso
A CUDA Library for Low-overhead Host-to-Device Transmission of Patterned Profile Data
Last synced: 04 May 2026
https://github.com/B1-663R/docker-mining
Dockerfiles to build docker images to start mining with an NVIDIA Docker architecture
cryptocurrency cuda docker-image docker-nvidia mining
Last synced: 28 Mar 2025
https://github.com/tank3-tk3/pi-calculation-cpu-gpu
PI calculation with CPU and GPU
c cpp cuda parallel-computing pi
Last synced: 13 Apr 2026
https://github.com/mulx10/firefly
Enhancing object detection using thermal imaging for thin cross-section, unidentifiable objects (e.g., cyclists, pedestrians).
autonomous-cars autonomous-navigation autonomous-vehicles c cuda object-detection thermal-camera yolov3
Last synced: 03 Sep 2025
https://github.com/programmer-rd-ai/detectx
A Pythonic approach to object detection using Detectron2, a clean, modular framework for training and deploying computer vision models. DetectX simplifies the complexity of object detection while maintaining high performance and extensibility.
coco-dataset computer-vision computer-vision-library cuda deep-learning detectron2 faster-rcnn gpu-accelerated machine-learning ml-framework object-detection object-recognition python3 pytorch retinanet
Last synced: 10 Jun 2025
https://github.com/avitase/fast_frechet
Comparison of different (fast) discrete Fréchet distance implementations in C++ and CUDA.
benchmark cpp cuda frechet-distance simd
Last synced: 18 May 2026
https://github.com/tyler-hilbert/cuda-kmeans
K-Means in CUDA
cuda kmeans-clustering machine-learning nsight
Last synced: 30 Mar 2025
https://github.com/true-real-michael/python-plane-ransac
Parallel RANSAC for plane detection for multiple point clouds using Python and CUDA
cuda numba plane-detection python ransac
Last synced: 14 Mar 2025
https://github.com/LKohlhepp/Ito-Monte-Carlo
MC-Simulation of the Ito-SDE (Krülls 1994)
astronomy astrophysics cuda gpu-acceleration monte-carlo physics-simulation simulation stochastic-differential-equations
Last synced: 10 Mar 2025
https://github.com/pd2871/high-performance-computing
This repo contains the logs of the High Performance Computing module's final assignment
blurred-images c cuda gaussian-blur matrix-multiplication multi-threading parallel-computing pthreads pthreads-api
Last synced: 10 May 2026
https://github.com/tank3-tk3/parallel-processing-cuda
Parallel processing with CUDA C / C++
c cpp cuda parallel-computing parallel-programming
Last synced: 09 May 2026
https://github.com/tky823/bitlinear158compression
Compare compression models for inference by BitLinear158
Last synced: 12 Jun 2026
https://github.com/mrglaster/cuda-acfcalc
Calculation of the smallest ACF for signals of length N using CUDA technology.
acf c calculations cpp cuda google-colaboratory google-colaboratory-notebooks isu
Last synced: 06 May 2026
https://github.com/nachovizzo/saxpy_openacc_cpp
My way of thinking about OpenACC, C++, and Parallel computing in general
Last synced: 23 Jun 2026
https://github.com/dereklstinson/nccl
Go wrapper for NCCL
cuda deep-learning go nccl parallel-computing
Last synced: 14 May 2026
https://github.com/poodarchu/vision-lab
Computer Vision Experiments in all.
computer-vision cuda object-detection
Last synced: 07 May 2026
https://github.com/daaboulex/unsloth-nix
Unsloth (git main) packaged for NixOS — CPU/CUDA/ROCm LoRA fine-tuning envs
cuda fine-tuning flake lora machine-learning nix nixos nixos-module pytorch rocm unsloth
Last synced: 10 Jun 2026
https://github.com/uefi-code/msra_thepracticespaceproject_pytorchcuda
My repo for attending the MSRA Practice Space Project 2022: CUDA implementation and optimization
Last synced: 06 May 2026
https://github.com/xebastex/sfw-python
Python package designed to provide the essential tools for off-the-grid inverse problems. This is the bedrock for a future GUI implementation.
blasso cuda frank-wolfe pytorch
Last synced: 09 May 2026
https://github.com/alextmjugador/rust-cuda-quickstart
Bring the Rust-CUDA project back to life under modern Linux environments.
cuda cuda-programming cuda-rust cuda-support docker rust
Last synced: 06 May 2026
https://github.com/speedcell4/torchdevice
Setup CUDA_VISIBLE_DEVICES
cuda deep-learning gpu machine-learning pytorch
Last synced: 07 May 2026
https://github.com/sun-zhenxing/fast-neural-style
Fast style transfer deployment
cuda cv2 fast-neural-style opencv
Last synced: 05 May 2026
https://github.com/igorcosta/deep-docker
Docker image for Deep Learning on AWS Cloud
cuda deep-learning docker docker-image tensorflow
Last synced: 05 May 2026
https://github.com/seieric/gst-dsobjectsmask
📀NVIDIA DeepStream integrated GStreamer Plugin. Mask objects with cuda cores on Jetson boards. Fast and smooth since everything is done on NVMM.🏎
cuda cuda-programming deepstream gpu gstreamer gstreamer-plugins instance-segmentation jetson-agx-orin jetson-agx-xavier jetson-tx1 jetson-tx2 jetson-xavier maskrcnn nvidia-jetson nvidia-jetson-nano opencv opencv4 resnet resnet50
Last synced: 06 May 2026
https://github.com/garciparedes/cuda-examples
CUDA examples that I develop to learn GPU-based HPC
c c-plus-plus cuda examples gpgpu gpu hpc
Last synced: 09 May 2026
https://github.com/brosnanyuen/raybnn_dataloader
Data Loader for RayBNN
arrayfire cpu csv csv-parser cuda data-structures gpu-computing oneapi opencl parallel parallel-computing rust
Last synced: 07 May 2026
https://github.com/seralexeev/rabbit0
Robot Rabbit
cuda jetson nvidia robotics ros2 zed-camera
Last synced: 15 Jun 2026
https://github.com/abdulfatir/subkmeans
Numpy and pyCUDA implementation of subKmeans
clustering cuda kdd kmeans numpy pycuda python subspace-clustering
Last synced: 09 May 2026
https://github.com/poyea/lollipop
🍭 Sweet GPU compute kernels in CUDA, wrapped via CuPy
cuda cuda-kernel cuda-kernels cuda-programming gpu-kernels gpu-programming python
Last synced: 17 Jun 2026
https://github.com/ezamagni/knapsack-simd
A genetic 01-Knapsack problem solver in CUDA
cuda knapsack-problem knapsack01
Last synced: 09 May 2026
https://github.com/manishklach/gpu-resident-inference-lab
Research lab for GPU-resident LLM inference loops: persistent kernels, sparse KV selection, tiered residency, speculative decode, and trace-driven scheduling.
cuda gpu-systems kv-cache llm-inference mega-kernel model-systems persistent-kernel runtime speculative-decoding
Last synced: 19 Jun 2026
https://github.com/jayemscript/llm-systems-from-scratch
A hands-on learning project for building the core systems behind Large Language Models using C++, Rust, and optional Python/JavaScript bindings. Includes tensor operations, autograd, neural networks, tokenization, and a minimal transformer pipeline.
ai-systems autograd c-language cpp cuda educational-project high-performance-computing inference-engine machine-learning neural-networks-from-scratch pybind11 tensor-library tokenization transformers wasm
Last synced: 19 Jun 2026
https://github.com/seongwon980/htop-gpu
Terminal dashboard for NVIDIA GPUs, system CPU/memory, and processes — clickable, with conda env / docker container / cwd info per process.
btop cli cuda dashboard gpu htop machine-learning monitor nvidia nvtop python sysadmin terminal tui
Last synced: 22 Jun 2026
https://github.com/daelsepara/hipslm
CPU and GPU (using HIP) implementations of phase pattern generators for use with spatial light modulators
computer-generated-holography cuda gpu hip hologram holography phase phase-pattern slm spatial-light-modulator
Last synced: 22 Jun 2026
https://github.com/abhans/archdev
Container that is built with Arch Linux with NVIDIA Driver & CUDA support, PyTorch and TensorFlow built in.
archlinux container cuda docker
Last synced: 07 May 2026
https://github.com/jblaschke/pynvtx
Thin pybind11 wrapper for NVTX wrappers -- with some bells and whistles attached.
Last synced: 23 Jun 2026
https://github.com/kibotu/llm-windows-server
Turn your Windows GPU into a private, low-latency LLM server. Docker-based, OpenAI-compatible API.
agentic cuda docker gguf llma-cpp local-llm nvidia-gpu openai-api opencode qwen self-hosted windows
Last synced: 10 Jun 2026
https://github.com/giorgiogamba/parallel_programming
Experimenting with parallel programming
cuda cuda-kernels cuda-programming cuda-toolkit parallel parallel-computing parallel-processing parallel-programming visual-studio
Last synced: 18 Feb 2026
https://github.com/matx64/rs-netbot
Old School Runescape bot with CNN for object identification
Last synced: 04 May 2026
https://github.com/microo8/micronn
Simple neural network library with backpropagation using CUDA
Last synced: 19 May 2026
https://github.com/umer-farooq-cs/canny-edge-detector
High-performance Canny edge detector with CPU and CUDA implementations. Loads PGM images, performs Gaussian smoothing, gradients, non-max suppression, and hysteresis. Benchmarks both paths, outputs edge maps, and reports speedup. Simple Makefile, sample images included.
c canny-edge-detection computer-vision cpp cuda gpu high-performance-computing image-processing nvcc pgm
Last synced: 18 Apr 2026
https://github.com/linux-alex/geep
GEEP (Genetic Evolutionary Engineering Platform) - a C++/Qt framework for genetic programming, optimized with CUDA acceleration. GEEP enables large-scale population-based optimization, ideal for solving high-dimensional problems using evolutionary algorithms and GPU computing.
cpp cuda framework genetic-programming
Last synced: 18 May 2026
https://github.com/bjornmelin/deep-learning-evolution
🧠 Deep-Learning Evolution: Unified collection of TensorFlow & PyTorch projects, featuring custom CUDA kernels, distributed training, memory‑efficient methods, and production‑ready pipelines. Showcases advanced GPU optimizations, from foundational models to cutting‑edge architectures. 🚀
ai-research cuda data-science deep-learning distributed-training gan gpu-acceleration machine-learning model-optimization neural-networks python pytorch tensorflow training-pipeline transformers
Last synced: 09 May 2026
https://github.com/alwaysai/jetpack-46-hacky-hour
NVIDIA’s Jetpack 4.6 capabilities and how to use them with EdgeIQ, alwaysAI Computer Vision framework.
alwaysai computer-vision cuda edge-computing jetpack tensorrt
Last synced: 01 May 2026
https://github.com/michaelfranzl/image_debian-gpgpu
Dockerfile for a Debian base image with AMD and Nvidia GPGPU support
amd container container-image cuda debian docker gpgpu nvidia opencl
Last synced: 10 May 2026
https://github.com/hyunjinno/multicore_computing
A repository of multicore programming in Java and C.
c cpp cuda java multithreading openmp thread thrust
Last synced: 18 Apr 2026
https://github.com/wallneradam/docker-ccminer
CCMiner (tpruvot version) Docker Builder
ccminer cuda docker gpu litecoin miner monero nvidia nvidia-docker
Last synced: 18 Apr 2026
https://github.com/pjueon/cuda_intellisense
A simple Python script to fix CUDA C++ IntelliSense for Visual Studio.
Last synced: 09 Apr 2026
https://github.com/dhruvsrikanth/monte-carlo-ray-tracing
In this repository, you will find a serial and distributed GPU-based implementation of the ray tracing simulation.
c cpp cuda gpu-computing gpu-programming high-performance-computing parallel-programming raytracing unified-memory-parallelism
Last synced: 01 May 2026