An open API service indexing awesome lists of open source software.

CUDA

CUDA® is a parallel computing platform and programming model developed by NVIDIA for general computing on graphics processing units (GPUs). With CUDA, developers can dramatically speed up computing applications by harnessing the power of GPUs.

https://github.com/fabulani/360ip-with-cuda

360° Image Processing with CUDA and OpenCV.

360-image 360-video cpp cuda image-processing opencv

Last synced: 11 May 2026

https://github.com/apws25/accelmoe

CUDA kernel re-implementation of a CPU-based MoE model.

cpp cuda mixture-of-experts

Last synced: 11 May 2026

https://github.com/daniilvorontsov/fourier-option-pricing

MSc thesis project concerned with option pricing for Levy Jump models. Package includes pricing implementations for European Call and Put options for Carr-Madan, COS and Fourier Time Stepping.

carr-madan cuda fourier-transform monte-carlo option-pricing

Last synced: 11 May 2026

https://github.com/theogravity/dual-rtx-6000-blackwell-gemma-4-31b-it-nvfp4

Optimized vLLM setup for Gemma 4 31B NVFP4 with MTP on dual RTX PRO 6000 Blackwell using vllm and docker: native FP4 Tensor Cores, Multi-Token Prediction (96.5% acceptance rate), and prefix caching. Includes benchmark results and replication scripts.

am5 amd blackwell cuda docker fp4 gemma gemma4 llm-inference multi-token-prediction nvfp4 prefix-caching rtx-6000 speculative-decoding tensor-parallel vllm

Last synced: 11 May 2026

https://github.com/ironjr/minimal-cuda-pytorch

Repository-level snippet for minimal implementation of a PyTorch CUDA extension.

cuda minimal pytorch

Last synced: 04 May 2026

https://github.com/drbh/quemer

GPU accelerated k-mer counter

biology cuda gpu

Last synced: 07 May 2025

https://github.com/aeyage/intraday_prices

GPU-accelerated portfolio optimisation

cuda cupy nvidia-gpu

Last synced: 05 Apr 2025

https://github.com/ahmadrafidev/learn-cuda

A place where I learn about CUDA

cuda cuda-programming gpu os parallel-programming

Last synced: 13 Apr 2025

https://github.com/flosmume/cpp-cuda-deepvision-rtx-starter

CUDA C++ practice project for RTX 4070 SUPER — explore GPU concurrency, pinned memory, and Nsight profiling. Includes SAXPY and 2D blur kernels to train optimization, stream overlap, and timing analysis for NVIDIA Developer Technology Engineering skillset.

cpp cuda cuda-kernels cuda-streams deep-learning-inference gpu gpu-optimization gpu-profiling high-performance-computing nsight nvidia parrallel-computing pinned-memory

Last synced: 16 May 2026

https://github.com/hshindo/libcuda.jl

CUDA GPU array for Julia

cuda gpu julia

Last synced: 16 May 2026

https://github.com/ribin-baby/cuda_cudnn_installation_on_ubuntu20.04

Installation of CUDA 11.8 with cuDNN 8.7 on an Ubuntu 20.04 server with an A30 GPU, plus an ONNX GPU installation guide

cuda gpu linux onnxruntime server

Last synced: 16 May 2026

https://github.com/rkarahul/person-detector-faceverifier

Person-Detector-FaceVerifier is a sophisticated system for detecting and verifying faces in images. Ideal for applications like passport control and security, it combines advanced face detection with precise verification techniques.

bootstrap5 css3 cuda django html5 javascipt opencv-python os python pytorch yolov8

Last synced: 07 Apr 2026

https://github.com/kanchishimono/python-images

Ubuntu based Python container images, including CUDA images

container-image cuda docker dockerfile machine-learning python python3

Last synced: 30 Apr 2026

https://github.com/kar-dim/cas-2d

Implementation of the AMD FidelityFX CAS (Contrast Adaptive Sharpening) algorithm on CUDA/OpenCL, for sharpening static images.

cpp cuda dll fidelityfx gpu image-processing parallel-computing sharpen

Last synced: 22 Jun 2025

https://github.com/zyn10/cuda_code

CUDA practice

cuda cuda-programming

Last synced: 22 Jun 2025

https://github.com/kratugautam99/logiclink-project

LogicLink is a conversational AI chatbot developed by Kratu Gautam (AIML Engineer). Powered by the TinyLlama-1.1B-Chat-v1.0 model, it provides an interactive interface for engaging conversations, query resolution, and task assistance. Version 5 features streaming responses, conversation management, and a sleek GUI.

antd-design chatbot-application conversational-ai cuda gradio graphical-user-interface huggingface-spaces huggingface-transformers jupyter-notebooks keras large-language-models mlops model-service-controller modelscope-studio natural-language-generation natural-language-processing pytorch reasoning-agent tensorflow

Last synced: 07 Apr 2026

https://github.com/chensongpoixs/cmedia_transcode

Media service transcoding build with GPU (CUDA) support for H264 and H265 transcoding

cuda gpu h264 h265 media transcode-media

Last synced: 19 May 2026

https://github.com/phantom7knight/cuda-fusion

A project for learning CUDA to better understand how the GPU works.

cuda cuda-programming gpgpu gpu

Last synced: 17 May 2026

https://github.com/drilonaliu/bachelor-thesis

Parallel Programming Fractals

cuda fractals gpu parallel-programming

Last synced: 15 May 2026

https://github.com/programmergnome/kutyai

A Python graphical application for dog breed recognition, with 420 breeds and 42000 images.

cuda deep-learning image-classification python3 qt5-gui tensorflow transfer-learning

Last synced: 11 May 2026

https://github.com/morristai/kvik-rs

KvikIO Rust implementation

cuda cufile gds kvikio nvidia rust

Last synced: 02 Apr 2026

https://github.com/ubermorgott/morgottalk

Cross-platform desktop push-to-talk voice transcription. Single binary. GPU accelerated (CUDA/Vulkan/Metal/ROCm/OpenCL). Powered by whisper.cpp.

cuda desktop go gpu speech-to-text svelte transcription voice wails whisper

Last synced: 07 Apr 2026

https://github.com/ergus/algorithms

Set of multiple algorithms implemented in multiple paradigms

algorithms cmake concurrency cpp cuda gpgpu inter-language metaprogramming multithreading pthreads stl testing

Last synced: 17 May 2026

https://github.com/santiagoenriquega/gpu_projects

Various Python GPU accelerated computations and simulations.

cuda cupy numba opencl pyopencl python

Last synced: 17 May 2026

https://github.com/versi379/optimized-matrix-multiplication

This project utilizes CUDA and cuBLAS to optimize matrix multiplication, achieving up to a 5x speedup on large matrices by leveraging GPU acceleration. It also improves memory efficiency and reduces data transfer times between CPU and GPU.

cublas cuda cuda-programming hpc matrix-multiplication parallel-computing parallel-programming

Last synced: 17 May 2026

https://github.com/tianzonglin/cloud-control-gui

A tool to compute, visualize, analyse and drag points (high-dimensional data)

cuda interaction-design visualization

Last synced: 25 Apr 2026

https://github.com/miferreiro/cdap-cuda

CUDA exercises for the subject of "Computación Distribuída e de Altas Prestacións" in the Master Degree of Computer Engineering of the University of Vigo in 2020

c cuda scan

Last synced: 17 May 2026

https://github.com/reuben-sun/pybind-cuda-demo

An example of calling a CUDA C++ interface from Python using pybind11

cpp cuda pybind11 python pytorch

Last synced: 07 Apr 2026

https://github.com/puzzlef/vector-max-cuda

Performance of sequential vs CUDA-based vector element max.

basics cuda element experiment max vector

Last synced: 17 May 2026

https://github.com/tomosatop/docker-lammps

I wanted an easy way to use LAMMPS, so I built a service

cuda lammps wsl-ubuntu

Last synced: 28 Mar 2025

https://github.com/rushirg/cuda-matrix-multiplication

Matrix Multiplication on GPGPU in CUDA

cpu cuda gpu parallel-processing

Last synced: 17 May 2026

https://github.com/ivanbgd/cuda_quad_c

Calculates a definite integral by using three different rules. Compares sequential to parallel implementations.

cuda integrals parallel-implementations

Last synced: 28 Mar 2025

https://github.com/moshiba/fmindex

ultra fast parallel FM index generation for DNA reads

cpp cuda fmindex parallel

Last synced: 18 May 2026

https://github.com/obj-wtf/gan-architecture

App for training GAN models on architectural plans

architecture building cuda gan pix2pix-tensorflow plan

Last synced: 18 May 2026

https://github.com/demetriantitus/machine-vision---yolov8

This project provides a comprehensive guide to object detection in cluttered environments using YOLOv8. It demonstrates how to identify and classify objects in both still images and video streams.

computer-vision cuda dataset image-classification machine-learning nvidia-gpu object-detection surveillance traffic-monitoring video-analysis yolov8

Last synced: 18 May 2026

https://github.com/tfogal/gemm-db

For creating a cacheable GEMM cost model.

cuda rust

Last synced: 18 May 2026

https://github.com/cppshizoids/cuda

My basic CUDA lessons

cuda cuda-demo cuda-programming

Last synced: 15 Jul 2025

https://github.com/wiktor2718/matrix_flow

Matrix Flow is a simple machine learning library written in Rust and CUDA. It was created as a portfolio project to deepen my understanding of machine learning, GPU programming, and Rust. It provides an API for matrix manipulation and includes specially optimized neural networks.

adam-optimizer benchmarking cuda deep-learning gpu-computing machine-learning matrix-operations neural-networks portfolio-project rust

Last synced: 18 May 2026

https://github.com/akira4o4/cuda-yolo-processing

CUDA YOLO Processing

cuda yolo

Last synced: 12 Jul 2025

https://github.com/loveboyme/yolov5-tensorrt-accelerator

High-performance YOLOv5 inference framework accelerated by TensorRT with dynamic optimization

cuda dynamic-shapes-cuda-stream fp16 int8 pycuda tensorrt yolov5

Last synced: 29 Mar 2025

https://github.com/avarga1/vllm-hb

vLLM-compatible inference runtime in pure Rust. Zero Python. Zero libtorch. CUDA via candle.

candle cuda inference llm openai-api rust tokio vllm

Last synced: 07 Apr 2026

https://github.com/edisonslightbulbs/viewer

Exploring real-time 3D point cloud rendering using CUDA and OpenGL

cuda cxx11 opengl pangolin submodule

Last synced: 02 May 2026

https://github.com/ne0nwinds/gpupuzzles

My solutions to srush/GPU-Puzzles using CUDA

cpp cuda gpgpu

Last synced: 16 May 2026

https://github.com/aayes89/pyllm

Train your own LLM from scratch

cpu cuda llm llm-training pip python3

Last synced: 18 May 2026

https://github.com/edcalderin/huggingface_ragflow

This project implements a classic Retrieval-Augmented Generation (RAG) system using HuggingFace models with quantization techniques. The system processes PDF documents, extracts their content, and enables interactive question-answering through a Streamlit web application.

bitsandbytes cuda huggingface huggingface-embeddings langchain langchain-community large-language-models llm nf4 python qdrant quantization rag retrieval-augmented-generation ruff streamlit text-generation

Last synced: 15 Jul 2025

https://github.com/jiriklepl/bits-knn-jpdc2024

Replication package for the paper Towards Optimal GPU-accelerated K-Nearest Neighbors Search

bitonic-sort cuda gpu k-nearest-neighbors knn-search top-k

Last synced: 21 Mar 2025

https://github.com/amruthapatil/nyu-cudaconvolution

Implementing convolution operations on an image using CUDA, exploiting different methodologies - basic, tiled, and cuDNN

cuda high-performance

Last synced: 13 Mar 2025

https://github.com/rajshrestha86/kmeans-clusterize-cuda

Implementation of K-Means algorithm from scratch using CUDA.

c cuda kmeans-clustering

Last synced: 18 May 2026

https://github.com/brendanm12345/simple_renderer_cs149

Simple CUDA renderer implementation. 19th most efficient out of 150+ submissions

cpp cuda

Last synced: 18 May 2026

https://github.com/xstupi00/N-Body-CUDA

PCG - Parallel Computations on GPU - Project - N-Body-CUDA

cuda gpu-acceleration gpu-computing nbody-simulation optimization parallel-computing pcg vut vut-fit

Last synced: 11 Mar 2025

https://github.com/matteopolak/stock-predict

Stock prediction with LSTM using TensorFlow and TypeScript.

ai artificial-intelligence cuda lstm machine-learning stock tensorflow typescript

Last synced: 09 May 2026

https://github.com/debanjan06/spatial-streamio

An optimized, out-of-core asynchronous data streaming pipeline for high-throughput 3D point cloud training loops. Features low-level numpy.memmap zero-copy reads and multi-threaded ring prefetching to eliminate I/O bottlenecks, delivering a 33.33% throughput efficiency gain on PyTorch CUDA workloads.

asynchronous-programming cuda data-engineering deep-learning-pipelines io-optimization memory-mapping point-cloud pytorch

Last synced: 11 Jun 2026

https://github.com/amitkumarj441/deep-learning-on-your-finger

A rich collection of Dockerfiles for installing deep learning dependencies on your way :rocket:

cuda cudnn gcp

Last synced: 18 Apr 2026

https://github.com/lruizap/testcuda

Guide to installing and using CUDA for programming

cuda cudnn nvidia pytorch

Last synced: 12 May 2026

https://github.com/sangioai/sph

CUDA and OpenMP versions of the serial SPH (Smoothed Particle Hydrodynamics) algorithm.

cuda openmp

Last synced: 27 Apr 2026

https://github.com/kirubhakaranm/vision-pipeline-cuda

High-performance camera processing pipeline with CUDA GPU acceleration, CPU multithreading, and real-time TCP/IP telemetry monitoring (1,200+ FPS, <1ms latency)

computer-vision cpp17 cuda edge-detection gpu-acceleration image-processing multithreading networking opencv performance-optimization real-time robotics tcp-ip telemetry

Last synced: 12 Apr 2026

https://github.com/mxm-tr/docker-darknet-opencv

Accelerated object detection on streams and files, using a Dockerized darknet YOLO container

cuda docker docker-compose object-recognition opencv-python python3 yolo

Last synced: 10 Apr 2026

https://github.com/kar-dim/CAS-2D

Implementation of the AMD FidelityFX CAS (Contrast Adaptive Sharpening) algorithm on CUDA, for sharpening static images.

cpp cuda dll fidelityfx gpu image-processing parallel-computing sharpen

Last synced: 01 Nov 2025

https://github.com/chiragajain/gpu-optimization-roadmap

This repository is part of a structured curriculum designed to master GPU optimization, Triton, Deep Learning, and LLMs. This section focuses on GPU fundamentals, CUDA programming, and PyTorch optimizations.

cuda deeplearning gpu-acceleration learning python pytorch triton

Last synced: 18 Feb 2026

https://github.com/muneeb706/cuda

Sample programs implemented using CUDA (GPU)

cplusplus cuda gpu-programming

Last synced: 19 May 2026

https://github.com/patriciobcs/mini-aevol

Parallel implementation of a reduced version of the Aevol simulator

aevol cuda simulation

Last synced: 19 May 2026

https://github.com/drilonaliu/parallel-fractal-tree

GPU-accelerated fractal tree generation with CUDA and OpenGL interoperability.

cuda fractal-tree fractals gpu

Last synced: 19 May 2026

https://github.com/grindelfp/cuda-n-body-simulation

Simulation of N-Body movement using CUDA.

cuda n-body-simulation

Last synced: 06 Apr 2025

https://github.com/naetherm/derelictcurand

Dynamic bindings to the cuRAND library for the D Programming Language.

cuda curand d derelict dlang

Last synced: 27 Mar 2025

https://github.com/ivanfioravanti/tflops_mps

TFLOPs testing on MPS and CUDA

cuda mps tflops

Last synced: 19 May 2026

https://github.com/amypad/miutil

Basic functionality needed for AMYPAD

cuda matlab medical-imaging python

Last synced: 13 May 2025

https://github.com/storterald/neural-network

Simple neural network implementation in C++ and CUDA

asm asmx86 c-plus-plus cmake cpp cuda machine-learning neural-network

Last synced: 28 Mar 2025

https://github.com/ramyacp14/document-based-question-and-answers

Developed a document question answering system that utilizes Llama and LangChain for contextual and accurate answers. The system supports .txt documents, intelligent text splitting, and context-aware querying through an easy-to-use Streamlit interface.

chroma cuda hugging-face langchain llama python recursivecharactertextsplitter streamlit

Last synced: 07 Mar 2026

https://github.com/naetherm/derelictcublas

Dynamic bindings to the cuBLAS library for the D Programming Language.

cublas cuda d derelict dlang

Last synced: 25 Jun 2026

https://github.com/eastonman/tensorrt-pytorch-wrapper

A wrapper that lets a TensorRT engine accept PyTorch CUDA tensors.

cuda pytorch tensorrt

Last synced: 06 May 2026

https://github.com/mahdi-hasan-shuvo/ml-opensource-project

An open source repository focused on providing practical and educational machine learning resources. The project aims to make learning and applying machine learning more accessible through well-documented code, tutorials, and real-world examples.

cuda machine-learning machine-learning-algorithms ml-projects open-source python

Last synced: 19 May 2026

https://github.com/bd2720/accesspatterns

Comparing chunked vs. striped memory access patterns for CPU and GPU code using the CUDA toolkit in C.

c cache cuda cuda-toolkit performance-analysis performance-testing profiling

Last synced: 16 May 2026

https://github.com/uva-trasgo/controllers

Read-only mirror of the official repository: https://gitlab.com/trasgo-group-valladolid/controllers. Controllers is a library written in C11 that provides a simplified way to program applications that can exploit heterogeneous computational platforms including accelerators and/or multi-core CPUs.

cuda heterogeneous-computing heterogeneous-parallel-programming hip opencl openmp

Last synced: 12 May 2026

https://github.com/flosmume/cpp-cuda-streams-and-pinned-mem

A CUDA C++ demo showing how to overlap data transfer and kernel execution using multiple streams and pinned (page-locked) host memory. This project illustrates asynchronous memcpy, event timing, and performance benefits of concurrent GPU execution — essential for building high-throughput pipelines.

asynchronous-execution cuda cuda-streams gpu parallel-programming performance-optimization pinned-memory

Last synced: 13 May 2026

https://github.com/nabilshadman/cuda-4-dummies

Lecture slides and exercise files of the CUDA 4 Dummies course (2025)

cuda gpu-computing high-performance-computing nsight-systems nvidia-gpu parallel-computing

Last synced: 31 Oct 2025

https://github.com/juliankarrer/reyn

CUDA-based Implementation of Smoothed Particle Hydrodynamics for Fluid Simulation

cuda fluid lagrangian simulation sph

Last synced: 31 Oct 2025

https://github.com/myselfaryan/attention-mechanism

Accelerating Scaled Dot-Product Attention using OpenMP and CUDA

cuda openmp

Last synced: 27 Apr 2026

https://github.com/ludekcizinsky/fast-cg-solver

Implementation of Conjugate Gradient (CG) algorithm for solving sparse linear systems using MPI and CUDA.

conjugate-gradient cuda mpi

Last synced: 17 May 2026

https://github.com/rainlumostaipei/cuda-qnet-a2c

Qnet and A2C implementation in CUDA

a2c cuda qnet

Last synced: 26 Jun 2025

https://github.com/yash-1335/qwen600

🚀 Build a fast inference engine for the QWEN3-0.6B model using CUDA, optimizing performance with minimal dependencies for efficient learning and practice.

cuda cuda-programming gpu llamacpp llm llm-inference qwen qwen3 transformer

Last synced: 16 May 2026

https://github.com/nxoti1/points-reader-ocr

🖥️ Extract text from images easily with POINTS-Reader OCR, a high-accuracy application for seamless document conversion and processing.

cuda gradio huggingface-transformers ocr open-source points-reader reportlab spaces tencent vision-language-model vlm

Last synced: 20 May 2026

https://github.com/andreasholt/cuda-matmul-benchmarking

Implementing and benchmarking various matmul implementations in CUDA

cuda matrix-multiplication

Last synced: 01 Nov 2025