An open API service indexing awesome lists of open source software.

CUDA

CUDA® is a parallel computing platform and programming model developed by NVIDIA for general computing on graphics processing units (GPUs). With CUDA, developers can dramatically speed up computing applications by harnessing the power of GPUs.

https://github.com/galaxies99/inception-cuda

CUDA Implementation of Inception

cuda inception-v3

Last synced: 12 Apr 2025

https://github.com/xlite-dev/HGEMM

⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA PTX and CuTe API. 🎉🎉

cuda hgemm tensor-cores

Last synced: 30 Jul 2025

https://github.com/puzzlef/pagerank-cuda-dynamic

Design of CUDA-based Parallel Dynamic PageRank algorithm for measuring importance.

algorithm cuda gpu graph pagerank static temporal

Last synced: 21 Feb 2026

https://github.com/prithivsakthiur/vlm-parsing

VLM-Parsing is a Gradio-based web application for parsing documents and images into structured HTML and Markdown formats using advanced Vision Language Models (VLMs).

cuda gradio html huggingface-models huggingface-spaces huggingface-transformers logics markdown ocr-recognition pytorch qwen2-5-vl spaces vlm

Last synced: 05 Apr 2026

https://github.com/pothosware/pothosgpu

Pothos toolkit for ArrayFire API support

arrayfire cuda dataflow dataflow-programming gpu opencl pothos

Last synced: 19 Apr 2026

https://github.com/maliknaik16/parallel-computing

CUDA programming in C++ for high-performance computing on NVIDIA GPUs, optimized for tasks like machine learning or image processing

cores cpp cuda gpu makefile matrix nvcc optimization

Last synced: 10 Jun 2025

https://github.com/markdtw/parallel-programming

Basic Pthread, OpenMP, CUDA examples

cuda openmp parallel-programming pthreads

Last synced: 20 Apr 2026

https://github.com/debowin/gpu-parallel-recommender-system

GPGPU Parallel User-User Collaborative Filtering System in CUDA C

collaborative-filtering cuda gpu-programming movielens-dataset recommender-system

Last synced: 24 Apr 2026

https://github.com/kpetridis24/four-russians-algorithm

Boolean matrix multiplication accelerated by the four-Russians algorithm

c cuda gpu high-performance matrix-multiplication preprocess

Last synced: 29 May 2026

https://github.com/teodutu/asc

Computer Systems Architecture (Arhitectura Sistemelor de Calcul) - UPB 2020

cache-optimization cuda parallel-programming profiling python-threading

Last synced: 24 Apr 2026

https://github.com/csvancea/gpu-hashtable

GPU-backed linear-probing hash table implemented in CUDA. Supports batch operations such as insert and retrieval.

cuda hashtable

Last synced: 24 Apr 2026

https://github.com/tiw302/mandelbrot-c

A simple Mandelbrot set explorer written in C. Crafted with SDL2 and multithreaded rendering for a smooth experience. ‹(•_•)›

c cuda fractal graphics mandelbrot multithreading sdl2 web webassembly

Last synced: 26 Apr 2026

https://github.com/denzp/current

CUDA high-level Rust framework

cuda rust

Last synced: 26 Apr 2026

https://github.com/neoblizz/cupti-plus-plus

CUPTI++ is a C++ interface to the CUDA Profiling Tools Interface (CUPTI).

cpp cuda cuda-profiler cupti profiler

Last synced: 26 Apr 2026

https://github.com/hansalemaos/nvidiacheck

Monitors NVIDIA GPU information and logs the data into a pandas DataFrame - Windows only.

cuda log logging nvidia torch

Last synced: 27 Apr 2026

https://github.com/navdeep-g/dimreduce4gpu

Dimensionality reduction ("dimreduce") on GPUs ("4gpu")

cplusplus cuda dimensionality-reduction gpu linear-algebra pca python svd unsupervised-learning

Last synced: 14 Apr 2025

https://github.com/huwzpf/parallel-processing-cpu-and-gpu-env-and-lib-with-powercap

(2024/2025) A library and environment for parallel processing in a power-limited CPU+GPU cluster environment.

c cpu cuda gpu mpi openmp parallel powercap

Last synced: 11 Apr 2025

https://github.com/alpha74/cuda_basics

NVIDIA NVCC CUDA programs for beginners.

c cpp cuda cuda-programs nvcc nvidia parallel-computing parallel-programming

Last synced: 08 May 2026

https://github.com/stdogpkg/cukuramoto

A Python/CUDA package that numerically solves the Kuramoto model using Heun's method

complex-networks cuda kuramoto-model

Last synced: 28 Jan 2026

https://github.com/neoblizz/spmv

Efficient Sparse Matrix-Vector Multiplication (SpMV) using ModernGPU (MTX + CSR formats).

csr cuda gpgpu load-balancing mtx spmv

Last synced: 28 Apr 2026

https://github.com/grakshith/parallel-k-means

K-Means clustering for Image Colour Quantization and Image Compression

cuda image-color-quantization image-compression k-means mpi opencv openmp

Last synced: 28 Apr 2026

https://github.com/mu7annad0/100gpu

100 Days of CUDA: Optimizing My Life, One Kernel at a Time. 🔄🔥

cuda gpu

Last synced: 08 Mar 2026

https://github.com/ginkgo-project/cudaarchitectureselector

A CMake module simplifying the specification of CUDA architectures

cmake cmake-modules cuda

Last synced: 05 Nov 2025

https://github.com/aiday-mar/mpi-cuda-project

Using MPI and CUDA to accelerate the conjugate gradient algorithm in C++

c-plus-plus cuda gpu mpi university-project

Last synced: 02 May 2026

https://github.com/xmas7/cudampi

A large hybrid CPU/GPU sorting network using CUDA and MPI. The sorting network uses a standard Quicksort for CPUs and a custom Bitonic Sort for GPUs. These two algorithms were the fastest in a number of prior benchmarks.

cpu cuda gpu hybrid mpi network

Last synced: 29 Apr 2026

https://github.com/pelayo-felgueroso/tensorflow-gpu-setup

Step-by-step guide to installing TensorFlow with GPU support on Conda.

artificial-intelligence cuda deep-learning gpu machine-learning nvidia nvidia-gpu setup-guide tensorflow

Last synced: 17 Feb 2026

https://github.com/l1cacheDell/CUDA_Code

Code for learning CUDA. Implementations of multiple kernels.

cuda cuda-programming

Last synced: 10 Mar 2025

https://github.com/tvanfossen/entropic

Local-first agentic inference engine in C/C++. Multi-tier model routing, grammar-constrained output, MCP tool servers. Embeddable via C ABI.

agentic-ai agentic-framework cpp cpp20 cuda edge-ai embedded-ai gbnf gguf grammar-constrained-decoding inference-engine llama-cpp llm local-llm mcp on-device-ai privacy-first tool-calling

Last synced: 30 May 2026

https://github.com/isazi/aoflagger

AOFlagger Radio Frequency Interference mitigation algorithm.

cuda gpu many-core rfi

Last synced: 30 Apr 2026

https://github.com/headless-start/data-augmentation-impact

This repository examines the effect of training-set data augmentation during model training.

augmented-images cuda data gpu keras matplotlib mnist opencv-python python3 tensorflow training-data

Last synced: 05 Apr 2026

https://github.com/dqbd/cuda-btree

Implementation of B-Trees on NVIDIA CUDA

b-tree cuda nvidia

Last synced: 30 Apr 2026

https://github.com/steleman/pytorch-cuda-2.7.1

Clone of PyTorch: Tensors and Dynamic neural networks in Python and C++ with strong GPU acceleration.

cuda fedora macos pytorch sequoia

Last synced: 30 Apr 2026

https://github.com/szymon423/tsp-cpu-vs-gpu

Simple brute-force approach to solving the travelling salesman problem on CPU and GPU

cuda tsp

Last synced: 11 Mar 2025

https://github.com/nixos-cuda/cuda-legacy

Select CUDA package sets which have aged out of Nixpkgs. [maintainers=@ConnorBaker, @SomeoneSerge]

cuda nixpkgs nixpkgs-overlay

Last synced: 15 May 2026

https://github.com/kar-dim/watermarking-gpu

Code for my Diploma thesis at Information and Communication Systems Engineering (University of the Aegean, School of Engineering) with title "Efficient implementation of watermark and watermark detection algorithms for image and video using the graphics processing unit". Part 2 / GPU

arrayfire cpp cuda ffmpeg gpu image-processing opencl parallel-computing video-processing watermark-image watermarking

Last synced: 09 Apr 2025

https://github.com/dhruvsrikanth/cudann

A distributed implementation of a deep learning framework in CUDA.

cpp cuda deep-learning deep-learning-framework gpu-programming high-performance-computing hpc parallel-programming

Last synced: 01 May 2026

https://github.com/bogdanminko/laperf

La Perf is a framework for AI performance benchmarking — covering LLMs, VLMs, embeddings, with power-metrics collection.

ai-benchmark ai-performance apple-silicon cuda lmstudio ml-benchmark mlx mps nvidia-gpu ollama open-source-benchmark

Last synced: 15 May 2026

https://github.com/superlinear-ai/scipy-notebook-gpu

jupyter/scipy-notebook with CUDA Toolkit, cuDNN, NCCL, and TensorRT

cuda cudnn docker nccl scipy-notebook tensorflow tensorrt

Last synced: 01 May 2026

https://github.com/slesniew/parallel-processing-cpu-and-gpu-env-and-lib-with-powercap

(2024/2025) A library and environment for parallel processing in a power-limited CPU+GPU cluster environment.

c cpu cuda gpu mpi openmp parallel powercap

Last synced: 30 Mar 2025

https://github.com/nellogan/distributed_compy

Distributed_compy is a distributed computing library that offers multi-threading, heterogeneous (CPU + multi-GPU), and multi-node support

cluster cuda heterogeneous-parallel-programming multi-threading multigpu openmp openmpi

Last synced: 16 Aug 2025

https://github.com/dito97/gol

High-performance Computing (90535) final project at UniGe

cuda mpi openmp

Last synced: 02 May 2026

https://github.com/cklxx/arle

Rust-native inference runtime for Qwen3 / Qwen3.5 — OpenAI-compatible serving + integrated agent, train, and self-evolution workflows. CUDA + Metal, no PyTorch on the hot path.

agent cuda flashinfer gspo inference infra kv-cache llm metal mlx openai-compatible qwen3 qwen35 rl rust

Last synced: 02 May 2026

https://github.com/kim-hwiwon/t-espresso

A CUDA Library for Low-overhead Host-to-Device Transmission of Patterned Profile Data

cuda profiler

Last synced: 04 May 2026

https://github.com/B1-663R/docker-mining

Dockerfiles to build docker images to start mining with an NVIDIA Docker architecture

cryptocurrency cuda docker-image docker-nvidia mining

Last synced: 28 Mar 2025

https://github.com/tank3-tk3/pi-calculation-cpu-gpu

PI calculation with CPU and GPU

c cpp cuda parallel-computing pi

Last synced: 13 Apr 2026

https://github.com/mulx10/firefly

Enhancing object detection using thermal imaging for thin cross-section, hard-to-identify objects (e.g., cyclists, pedestrians).

autonomous-cars autonomous-navigation autonomous-vehicles c cuda object-detection thermal-camera yolov3

Last synced: 03 Sep 2025

https://github.com/programmer-rd-ai/detectx

A Pythonic approach to object detection using Detectron2, a clean, modular framework for training and deploying computer vision models. DetectX simplifies the complexity of object detection while maintaining high performance and extensibility.

coco-dataset computer-vision computer-vision-library cuda deep-learning detectron2 faster-rcnn gpu-accelerated machine-learning ml-framework object-detection object-recognition python3 pytorch retinanet

Last synced: 10 Jun 2025

https://github.com/avitase/fast_frechet

Comparison of different (fast) discrete Fréchet distance implementations in C++ and CUDA.

benchmark cpp cuda frechet-distance simd

Last synced: 18 May 2026

https://github.com/true-real-michael/python-plane-ransac

Parallel RANSAC for plane detection for multiple point clouds using Python and CUDA

cuda numba plane-detection python ransac

Last synced: 14 Mar 2025

https://github.com/pd2871/high-performance-computing

This repo contains the logs of the High Performance Computing module's final assignment

blurred-images c cuda gaussian-blur matrix-multiplication multi-threading parallel-computing pthreads pthreads-api

Last synced: 10 May 2026

https://github.com/tank3-tk3/parallel-processing-cuda

Parallel processing with CUDA C / C++

c cpp cuda parallel-computing parallel-programming

Last synced: 09 May 2026

https://github.com/tky823/bitlinear158compression

Compare compression models for inference by BitLinear158

cuda pytorch quantization

Last synced: 12 Jun 2026

https://github.com/mrglaster/cuda-acfcalc

Calculation of the smallest ACF for signals of length N using CUDA technology.

acf c calculations cpp cuda google-colaboratory google-colaboratory-notebooks isu

Last synced: 06 May 2026

https://github.com/nachovizzo/saxpy_openacc_cpp

My way of thinking about OpenACC, C++, and Parallel computing in general

cpp cuda gpu openacc

Last synced: 23 Jun 2026

https://github.com/dereklstinson/nccl

Go wrapper for NCCL

cuda deep-learning go nccl parallel-computing

Last synced: 14 May 2026

https://github.com/poodarchu/vision-lab

Computer Vision Experiments in all.

computer-vision cuda object-detection

Last synced: 07 May 2026

https://github.com/willigarneau/object-detection-cuda

🕺 Putting my knowledge of OpenCV and CUDA into practice to create an object detection system. 💻

camera cplusplus cuda detector filter opencv

Last synced: 08 May 2026

https://github.com/daaboulex/unsloth-nix

Unsloth (git main) packaged for NixOS — CPU/CUDA/ROCm LoRA fine-tuning envs

cuda fine-tuning flake lora machine-learning nix nixos nixos-module pytorch rocm unsloth

Last synced: 10 Jun 2026

https://github.com/pedro-avalos/cuda-samples-snap

Unofficial snap for CUDA Samples

cuda gpu gpu-test linux nvidia package snap snapcraft

Last synced: 08 May 2026

https://github.com/uefi-code/msra_thepracticespaceproject_pytorchcuda

My repo for the MSRA Practice Space Project 2022: CUDA implementation and optimization

ann cuda pytorch

Last synced: 06 May 2026

https://github.com/kayuii/ironfish-miner

docker nvidia/amd Gpu hpool-dev/ironfish-miner ironfish-miner

amdgpu cuda docker gpu nvidia rocm

Last synced: 07 May 2026

https://github.com/xebastex/sfw-python

Python package designed to provide the essential tools for off-the-grid inverse problems. This is the bedrock for a future GUI implementation.

blasso cuda frank-wolfe pytorch

Last synced: 09 May 2026

https://github.com/alextmjugador/rust-cuda-quickstart

Bring the Rust-CUDA project back to life under modern Linux environments.

cuda cuda-programming cuda-rust cuda-support docker rust

Last synced: 06 May 2026

https://github.com/speedcell4/torchdevice

Setup CUDA_VISIBLE_DEVICES

cuda deep-learning gpu machine-learning pytorch

Last synced: 07 May 2026

https://github.com/sun-zhenxing/fast-neural-style

Fast style transfer deployment

cuda cv2 fast-neural-style opencv

Last synced: 05 May 2026

https://github.com/igorcosta/deep-docker

Docker image for Deep Learning on AWS Cloud

cuda deep-learning docker docker-image tensorflow

Last synced: 05 May 2026

https://github.com/seieric/gst-dsobjectsmask

📀NVIDIA DeepStream integrated GStreamer Plugin. Mask objects with cuda cores on Jetson boards. Fast and smooth since everything is done on NVMM.🏎

cuda cuda-programming deepstream gpu gstreamer gstreamer-plugins instance-segmentation jetson-agx-orin jetson-agx-xavier jetson-tx1 jetson-tx2 jetson-xavier maskrcnn nvidia-jetson nvidia-jetson-nano opencv opencv4 resnet resnet50

Last synced: 06 May 2026

https://github.com/garciparedes/cuda-examples

CUDA examples that I developed to learn GPU-based HPC

c c-plus-plus cuda examples gpgpu gpu hpc

Last synced: 09 May 2026

https://github.com/gmfatcat/ai-photoviewer

AI helps you sort your old photos

ai cuda local-first photo

Last synced: 16 Jun 2026

https://github.com/abdulfatir/subkmeans

NumPy and PyCUDA implementation of subKmeans

clustering cuda kdd kmeans numpy pycuda python subspace-clustering

Last synced: 09 May 2026

https://github.com/poyea/lollipop

🍭 Sweet GPU compute kernels in CUDA, wrapped via CuPy

cuda cuda-kernel cuda-kernels cuda-programming gpu-kernels gpu-programming python

Last synced: 17 Jun 2026

https://github.com/ezamagni/knapsack-simd

A genetic 0-1 knapsack problem solver in CUDA

cuda knapsack-problem knapsack01

Last synced: 09 May 2026

https://github.com/manishklach/gpu-resident-inference-lab

Research lab for GPU-resident LLM inference loops: persistent kernels, sparse KV selection, tiered residency, speculative decode, and trace-driven scheduling.

cuda gpu-systems kv-cache llm-inference mega-kernel model-systems persistent-kernel runtime speculative-decoding

Last synced: 19 Jun 2026

https://github.com/jayemscript/llm-systems-from-scratch

A hands-on learning project for building the core systems behind Large Language Models using C++, Rust, and optional Python/JavaScript bindings. Includes tensor operations, autograd, neural networks, tokenization, and a minimal transformer pipeline.

ai-systems autograd c-language cpp cuda educational-project high-performance-computing inference-engine machine-learning neural-networks-from-scratch pybind11 tensor-library tokenization transformers wasm

Last synced: 19 Jun 2026

https://github.com/skillfulelectro/integral-solver

Simple integral solver

c cpp cuda math mathematics

Last synced: 08 May 2026

https://github.com/seongwon980/htop-gpu

Terminal dashboard for NVIDIA GPUs, system CPU/memory, and processes — clickable, with conda env / docker container / cwd info per process.

btop cli cuda dashboard gpu htop machine-learning monitor nvidia nvtop python sysadmin terminal tui

Last synced: 22 Jun 2026

https://github.com/daelsepara/hipslm

CPU and GPU (using HIP) implementations of phase pattern generators for use with spatial light modulators

computer-generated-holography cuda gpu hip hologram holography phase phase-pattern slm spatial-light-modulator

Last synced: 22 Jun 2026

https://github.com/abhans/archdev

Container that is built with Arch Linux with NVIDIA Driver & CUDA support, PyTorch and TensorFlow built in.

archlinux container cuda docker

Last synced: 07 May 2026

https://github.com/jblaschke/pynvtx

Thin pybind11 wrapper for NVTX wrappers -- with some bells and whistles attached.

cuda nvtx nvtx-markers

Last synced: 23 Jun 2026

https://github.com/kibotu/llm-windows-server

Turn your Windows GPU into a private, low-latency LLM server. Docker-based, OpenAI-compatible API.

agentic cuda docker gguf llma-cpp local-llm nvidia-gpu openai-api opencode qwen self-hosted windows

Last synced: 10 Jun 2026

https://github.com/steleman/llvm-21.1.8

LLVM 21.1.8 on Fedora 41/43 and some additions for Torch-MLIR, ONNX-MLIR and IREE

clang cuda fedora fedora-41 fedora-43 iree llvm mlir onnx-mlir torch-mlir

Last synced: 07 Apr 2026

https://github.com/matx64/rs-netbot

Old School RuneScape bot with a CNN for object identification

cuda numpy python pytorch

Last synced: 04 May 2026

https://github.com/microo8/micronn

Simple neural network library with backpropagation using CUDA

c cuda neural-network

Last synced: 19 May 2026

https://github.com/umer-farooq-cs/canny-edge-detector

High-performance Canny edge detector with CPU and CUDA implementations. Loads PGM images, performs Gaussian smoothing, gradients, non-max suppression, and hysteresis. Benchmarks both paths, outputs edge maps, and reports speedup. Simple Makefile, sample images included.

c canny-edge-detection computer-vision cpp cuda gpu high-performance-computing image-processing nvcc pgm

Last synced: 18 Apr 2026

https://github.com/linux-alex/geep

GEEP (Genetic Evolutionary Engineering Platform) - a C++/Qt framework for genetic programming, optimized with CUDA acceleration. GEEP enables large-scale population-based optimization, ideal for solving high-dimensional problems using evolutionary algorithms and GPU computing.

cpp cuda framework genetic-programming

Last synced: 18 May 2026

https://github.com/bjornmelin/deep-learning-evolution

🧠 Deep-Learning Evolution: Unified collection of TensorFlow & PyTorch projects, featuring custom CUDA kernels, distributed training, memory‑efficient methods, and production‑ready pipelines. Showcases advanced GPU optimizations, from foundational models to cutting‑edge architectures. 🚀

ai-research cuda data-science deep-learning distributed-training gan gpu-acceleration machine-learning model-optimization neural-networks python pytorch tensorflow training-pipeline transformers

Last synced: 09 May 2026

https://github.com/alwaysai/jetpack-46-hacky-hour

NVIDIA’s Jetpack 4.6 capabilities and how to use them with EdgeIQ, alwaysAI Computer Vision framework.

alwaysai computer-vision cuda edge-computing jetpack tensorrt

Last synced: 01 May 2026

https://github.com/michaelfranzl/image_debian-gpgpu

Dockerfile for a Debian base image with AMD and Nvidia GPGPU support

amd container container-image cuda debian docker gpgpu nvidia opencl

Last synced: 10 May 2026

https://github.com/hyunjinno/multicore_computing

A repository of multicore programming in Java and C.

c cpp cuda java multithreading openmp thread thrust

Last synced: 18 Apr 2026

https://github.com/wallneradam/docker-ccminer

CCMiner (tpruvot version) Docker Builder

ccminer cuda docker gpu litecoin miner monero nvidia nvidia-docker

Last synced: 18 Apr 2026

https://github.com/pjueon/cuda_intellisense

A simple Python script to fix CUDA C++ IntelliSense for Visual Studio.

cuda visual-studio

Last synced: 09 Apr 2026

https://github.com/dhruvsrikanth/monte-carlo-ray-tracing

In this repository, you will find serial and distributed GPU-based implementations of a ray tracing simulation.

c cpp cuda gpu-computing gpu-programming high-performance-computing parallel-programming raytracing unified-memory-parallelism

Last synced: 01 May 2026