CUDA
CUDA® is a parallel computing platform and programming model developed by NVIDIA for general-purpose computing on graphics processing units (GPUs). With CUDA, developers can dramatically speed up computing applications by harnessing the power of GPUs.
- GitHub: https://github.com/topics/cuda
- Wikipedia: https://en.wikipedia.org/wiki/CUDA
- Created by: NVIDIA
- Released: June 23, 2007
- Related Topics: nvcc
- Last updated: 2026-07-01 00:07:09 UTC
https://github.com/fabulani/360ip-with-cuda
360° Image Processing with CUDA and OpenCV.
360-image 360-video cpp cuda image-processing opencv
Last synced: 11 May 2026
https://github.com/islamshahil/live-video-analysis
Live Video Analysis using PyTorch
cuda deeplearning neural-network opencv-python python pytorch video-processing webcam
Last synced: 11 May 2026
https://github.com/apws25/accelmoe
This repository contains a CUDA kernel re-implementation of a CPU-based MoE model.
Last synced: 11 May 2026
https://github.com/daniilvorontsov/fourier-option-pricing
MSc thesis project on option pricing for Lévy jump models. The package includes pricing implementations for European call and put options using the Carr-Madan, COS, and Fourier time-stepping methods.
carr-madan cuda fourier-transform monte-carlo option-pricing
Last synced: 11 May 2026
https://github.com/theogravity/dual-rtx-6000-blackwell-gemma-4-31b-it-nvfp4
Optimized vLLM setup for Gemma 4 31B NVFP4 with MTP on dual RTX PRO 6000 Blackwell using vllm and docker: native FP4 Tensor Cores, Multi-Token Prediction (96.5% acceptance rate), and prefix caching. Includes benchmark results and replication scripts.
am5 amd blackwell cuda docker fp4 gemma gemma4 llm-inference multi-token-prediction nvfp4 prefix-caching rtx-6000 speculative-decoding tensor-parallel vllm
Last synced: 11 May 2026
https://github.com/ironjr/minimal-cuda-pytorch
Repository-level snippet for minimal implementation of a PyTorch CUDA extension.
Last synced: 04 May 2026
https://github.com/aeyage/intraday_prices
GPU-accelerated portfolio optimisation
Last synced: 05 Apr 2025
https://github.com/ahmadrafidev/learn-cuda
A place where I learn about CUDA
cuda cuda-programming gpu os parallel-programming
Last synced: 13 Apr 2025
https://github.com/flosmume/cpp-cuda-deepvision-rtx-starter
CUDA C++ practice project for RTX 4070 SUPER — explore GPU concurrency, pinned memory, and Nsight profiling. Includes SAXPY and 2D blur kernels to train optimization, stream overlap, and timing analysis for NVIDIA Developer Technology Engineering skillset.
cpp cuda cuda-kernels cuda-streams deep-learning-inference gpu gpu-optimization gpu-profiling high-performance-computing nsight nvidia parrallel-computing pinned-memory
Last synced: 16 May 2026
https://github.com/ribin-baby/cuda_cudnn_installation_on_ubuntu20.04
Installation of CUDA 11.8 with cuDNN 8.7 on an Ubuntu 20.04 server with an A30 GPU, plus an ONNX GPU installation guide.
cuda gpu linux onnxruntime server
Last synced: 16 May 2026
https://github.com/rkarahul/person-detector-faceverifier
Person-Detector-FaceVerifier is a sophisticated system for detecting and verifying faces in images. Ideal for applications like passport control and security, it combines advanced face detection with precise verification techniques.
bootstrap5 css3 cuda django html5 javascipt opencv-python os python pytorch yolov8
Last synced: 07 Apr 2026
https://github.com/kanchishimono/python-images
Ubuntu based Python container images, including CUDA images
container-image cuda docker dockerfile machine-learning python python3
Last synced: 30 Apr 2026
https://github.com/kar-dim/cas-2d
Implementation of the AMD FidelityFX CAS (Contrast Adaptive Sharpening) algorithm on CUDA/OpenCL, for sharpening static images.
cpp cuda dll fidelityfx gpu image-processing parallel-computing sharpen
Last synced: 22 Jun 2025
https://github.com/kratugautam99/logiclink-project
LogicLink is a conversational AI chatbot developed by Kratu Gautam (AIML Engineer). Powered by the TinyLlama-1.1B-Chat-v1.0 model, it provides an interactive interface for engaging conversations, query resolution, and task assistance. Version 5 features streaming responses, conversation management, and a sleek GUI.
antd-design chatbot-application conversational-ai cuda gradio graphical-user-interface huggingface-spaces huggingface-transformers jupyter-notebooks keras large-language-models mlops model-service-controller modelscope-studio natural-language-generation natural-language-processing pytorch reasoning-agent tensorflow
Last synced: 07 Apr 2026
https://github.com/drilonaliu/parallel-image-edge-detection
cuda edge-detection gpu image-processing
Last synced: 17 May 2026
https://github.com/chensongpoixs/cmedia_transcode
GPU (CUDA) transcoding version of a media service, supporting H.264 and H.265 transcoding.
cuda gpu h264 h265 media transcode-media
Last synced: 19 May 2026
https://github.com/aaditya29/parallel-computing-and-cuda
Learning about Parallel Computing and GPU programming using CUDA.
c cpp cuda cuda-kernels cuda-programming nvidia-cuda openmp openmpi parallel-computing parallel-programming
Last synced: 18 Jul 2025
https://github.com/phantom7knight/cuda-fusion
This project is for learning CUDA to better understand how the GPU works.
cuda cuda-programming gpgpu gpu
Last synced: 17 May 2026
https://github.com/drilonaliu/parallel-permutation-cipher
cryptography cuda gpu parallel-programming permutation
Last synced: 19 Jul 2025
https://github.com/drilonaliu/bachelor-thesis
Parallel Programming Fractals
cuda fractals gpu parallel-programming
Last synced: 15 May 2026
https://github.com/drilonaliu/parallel-permuation-cipher-attack
attack cryptography cuda gpu parallel-computing
Last synced: 21 Mar 2025
https://github.com/programmergnome/kutyai
This is a Python graphical application for dog breed recognition, with 420 breeds and 42,000 images.
cuda deep-learning image-classification python3 qt5-gui tensorflow transfer-learning
Last synced: 11 May 2026
https://github.com/ubermorgott/morgottalk
Cross-platform desktop push-to-talk voice transcription. Single binary. GPU accelerated (CUDA/Vulkan/Metal/ROCm/OpenCL). Powered by whisper.cpp.
cuda desktop go gpu speech-to-text svelte transcription voice wails whisper
Last synced: 07 Apr 2026
https://github.com/ergus/algorithms
Set of multiple algorithms implemented in multiple paradigms
algorithms cmake concurrency cpp cuda gpgpu inter-language metaprogramming multithreading pthreads stl testing
Last synced: 17 May 2026
https://github.com/versi379/optimized-matrix-multiplication
This project utilizes CUDA and cuBLAS to optimize matrix multiplication, achieving up to a 5x speedup on large matrices by leveraging GPU acceleration. It also improves memory efficiency and reduces data transfer times between CPU and GPU.
cublas cuda cuda-programming hpc matrix-multiplication parallel-computing parallel-programming
Last synced: 17 May 2026
https://github.com/tianzonglin/cloud-control-gui
A tool to compute, visualize, analyse and drag points (high-dimensional data)
cuda interaction-design visualization
Last synced: 25 Apr 2026
https://github.com/drilonaliu/parallel-caesar-cipher
caesar-cipher cryptography cuda gpu parallel-programming
Last synced: 21 Mar 2025
https://github.com/drilonaliu/parallel-s_aes-ccm-xts
aes cryptography cuda gpu parallel-programming saes
Last synced: 21 Mar 2025
https://github.com/flagro/paralleltasks
CUDA/OpenMP parallel tasks
algorithms compression cpp cuda openmp parallel-computing unique-values
Last synced: 17 May 2026
https://github.com/miferreiro/cdap-cuda
CUDA exercises for the subject "Computación Distribuída e de Altas Prestacións" in the Master's Degree in Computer Engineering at the University of Vigo, 2020.
Last synced: 17 May 2026
https://github.com/puzzlef/vector-max-cuda
Performance of sequential vs CUDA-based vector element max.
basics cuda element experiment max vector
Last synced: 17 May 2026
https://github.com/rushirg/cuda-matrix-multiplication
Matrix Multiplication on GPGPU in CUDA
cpu cuda gpu parallel-processing
Last synced: 17 May 2026
https://github.com/ivanbgd/cuda_quad_c
Calculates a definite integral by using three different rules. Compares sequential to parallel implementations.
cuda integrals parallel-implementations
Last synced: 28 Mar 2025
https://github.com/ludgerpaehler/lulesh-enzyme
AD with Enzyme through Lulesh.
automatic-differentiation cuda cuda-programming gpu-computing high-performance-computing llvm-enzyme scientific-computing
Last synced: 15 Jun 2026
https://github.com/jadc/cuda-raytracer
A simple path tracer written in CUDA.
cpp cuda gpu-programming graphics parallel-programming path-tracing raytracing
Last synced: 16 May 2026
https://github.com/moshiba/fmindex
Ultra-fast parallel FM-index generation for DNA reads.
Last synced: 18 May 2026
https://github.com/obj-wtf/gan-architecture
App for training GAN models on architecture plans.
architecture building cuda gan pix2pix-tensorflow plan
Last synced: 18 May 2026
https://github.com/demetriantitus/machine-vision---yolov8
This project provides a comprehensive guide to object detection in cluttered environments using YOLOv8. It demonstrates how to identify and classify objects in both still images and video streams
computer-vision cuda dataset image-classification machine-learning nvidia-gpu object-detection surveillance traffic-monitoring video-analysis yolov8
Last synced: 18 May 2026
https://github.com/tfogal/gemm-db
For creating a cacheable GEMM cost model.
Last synced: 18 May 2026
https://github.com/toshikinakamura0412/dockerfiles
Development environment using Docker for some Linux distributions
alpine bash cuda debian devcontainer devcontainers docker docker-compose fedora opencv opensuse ros ros-humble ros-noetic ros2 ubuntu ubuntu2004 ubuntu2204 vscode zsh
Last synced: 10 Jul 2025
https://github.com/cppshizoids/cuda
These are my basic CUDA lessons.
cuda cuda-demo cuda-programming
Last synced: 15 Jul 2025
https://github.com/wiktor2718/matrix_flow
Matrix Flow is a simple machine learning library written in Rust and CUDA. It was created as a portfolio project to deepen my understanding of machine learning, GPU programming, and Rust. It provides an API for matrix manipulation and includes specially optimized neural networks.
adam-optimizer benchmarking cuda deep-learning gpu-computing machine-learning matrix-operations neural-networks portfolio-project rust
Last synced: 18 May 2026
https://github.com/loveboyme/yolov5-tensorrt-accelerator
High-performance YOLOv5 inference framework accelerated by TensorRT with dynamic optimization.
cuda dynamic-shapes-cuda-stream fp16 int8 pycuda tensorrt yolov5
Last synced: 29 Mar 2025
https://github.com/avarga1/vllm-hb
vLLM-compatible inference runtime in pure Rust. Zero Python. Zero libtorch. CUDA via candle.
candle cuda inference llm openai-api rust tokio vllm
Last synced: 07 Apr 2026
https://github.com/ne0nwinds/gpupuzzles
My solutions to srush/GPU-Puzzles using CUDA
Last synced: 16 May 2026
https://github.com/aayes89/pyllm
Train your own LLM from scratch.
cpu cuda llm llm-training pip python3
Last synced: 18 May 2026
https://github.com/edcalderin/huggingface_ragflow
This project implements a classic Retrieval-Augmented Generation (RAG) system using HuggingFace models with quantization techniques. The system processes PDF documents, extracts their content, and enables interactive question-answering through a Streamlit web application.
bitsandbytes cuda huggingface huggingface-embeddings langchain langchain-community large-language-models llm nf4 python qdrant quantization rag retrieval-augmented-generation ruff streamlit text-generation
Last synced: 15 Jul 2025
https://github.com/jiriklepl/bits-knn-jpdc2024
Replication package for the paper Towards Optimal GPU-accelerated K-Nearest Neighbors Search
bitonic-sort cuda gpu k-nearest-neighbors knn-search top-k
Last synced: 21 Mar 2025
https://github.com/amruthapatil/nyu-cudaconvolution
Implementing convolution operations on an image using CUDA, exploiting different methodologies - basic, tiled, and cuDNN
Last synced: 13 Mar 2025
https://github.com/rajshrestha86/kmeans-clusterize-cuda
Implementation of K-Means algorithm from scratch using CUDA.
Last synced: 18 May 2026
https://github.com/brendanm12345/simple_renderer_cs149
Simple CUDA renderer implementation. 19th most efficient out of 150+ submissions
Last synced: 18 May 2026
https://github.com/simonschoelly/poisson-solver
A solver for a modified Poisson equation using CUDA.
cpp cuda finite-difference gpgpu pgc poisson-equation preconditioned-conjugate-gradient thomas-algorithm
Last synced: 18 May 2026
https://github.com/xstupi00/N-Body-CUDA
PCG - Parallel Computations on GPU - Project - N-Body-CUDA
cuda gpu-acceleration gpu-computing nbody-simulation optimization parallel-computing pcg vut vut-fit
Last synced: 11 Mar 2025
https://github.com/matteopolak/stock-predict
Stock prediction with LSTM using TensorFlow and TypeScript.
ai artificial-intelligence cuda lstm machine-learning stock tensorflow typescript
Last synced: 09 May 2026
https://github.com/debanjan06/spatial-streamio
An optimized, out-of-core asynchronous data streaming pipeline for high-throughput 3D point cloud training loops. Features low-level numpy.memmap zero-copy reads and multi-threaded ring prefetching to eliminate I/O bottlenecks, delivering a 33.33% throughput efficiency gain on PyTorch CUDA workloads.
asynchronous-programming cuda data-engineering deep-learning-pipelines io-optimization memory-mapping point-cloud pytorch
Last synced: 11 Jun 2026
https://github.com/amitkumarj441/deep-learning-on-your-finger
A rich collection of dockerfiles for installing deep learning dependencies on your way 🚀
Last synced: 18 Apr 2026
https://github.com/lruizap/testcuda
Guide to installing and using CUDA for programming.
Last synced: 12 May 2026
https://github.com/sangioai/sph
CUDA and OpenMP versions of the serial SPH (Smoothed Particle Hydrodynamics) algorithm.
Last synced: 27 Apr 2026
https://github.com/kirubhakaranm/vision-pipeline-cuda
High-performance camera processing pipeline with CUDA GPU acceleration, CPU multithreading, and real-time TCP/IP telemetry monitoring (1,200+ FPS, <1ms latency)
computer-vision cpp17 cuda edge-detection gpu-acceleration image-processing multithreading networking opencv performance-optimization real-time robotics tcp-ip telemetry
Last synced: 12 Apr 2026
https://github.com/mxm-tr/docker-darknet-opencv
Accelerated object detection on streams and files, using a Docker darknet YOLO container.
cuda docker docker-compose object-recognition opencv-python python3 yolo
Last synced: 10 Apr 2026
https://github.com/kar-dim/CAS-2D
Implementation of the AMD FidelityFX CAS (Contrast Adaptive Sharpening) algorithm on CUDA, for sharpening static images.
cpp cuda dll fidelityfx gpu image-processing parallel-computing sharpen
Last synced: 01 Nov 2025
https://github.com/chiragajain/gpu-optimization-roadmap
This repository is part of a structured curriculum designed to master GPU optimization, Triton, Deep Learning, and LLMs. This section focuses on GPU fundamentals, CUDA programming, and PyTorch optimizations.
cuda deeplearning gpu-acceleration learning python pytorch triton
Last synced: 18 Feb 2026
https://github.com/hnthap/vietnamese-word-segment
Vietnamese word segmentation package.
cuda torch transformers vietnamese vietnamese-nlp vietnamese-tokenizer word-segmentation
Last synced: 19 May 2026
https://github.com/muneeb706/cuda
Sample programs implemented using CUDA (GPU).
cplusplus cuda gpu-programming
Last synced: 19 May 2026
https://github.com/patriciobcs/mini-aevol
Parallel implementation of a reduced version of the Aevol simulator
Last synced: 19 May 2026
https://github.com/drilonaliu/parallel-fractal-tree
GPU-accelerated fractal tree generation with CUDA and OpenGL interoperability.
cuda fractal-tree fractals gpu
Last synced: 19 May 2026
https://github.com/grindelfp/cuda-n-body-simulation
Simulation of N-Body movement using CUDA.
Last synced: 06 Apr 2025
https://github.com/ivanfioravanti/tflops_mps
TFLOPs testing on MPS and CUDA
Last synced: 19 May 2026
https://github.com/amypad/miutil
Basic functionality needed for AMYPAD
cuda matlab medical-imaging python
Last synced: 13 May 2025
https://github.com/storterald/neural-network
Simple neural network implementation in C++ and CUDA
asm asmx86 c-plus-plus cmake cpp cuda machine-learning neural-network
Last synced: 28 Mar 2025
https://github.com/ramyacp14/document-based-question-and-answers
Developed a document question answering system that utilizes Llama and LangChain for contextual and accurate answers. The system supports .txt documents, intelligent text splitting, and context-aware querying through an easy-to-use Streamlit interface.
chroma cuda hugging-face langchain llama python recursivecharactertextsplitter streamlit
Last synced: 07 Mar 2026
https://github.com/eastonman/tensorrt-pytorch-wrapper
A wrapper that lets a TensorRT engine accept PyTorch CUDA tensors.
Last synced: 06 May 2026
https://github.com/sneha-at-hub/bruteforce_passwordcracking_in-milliseconds
Last synced: 28 Apr 2026
https://github.com/mahdi-hasan-shuvo/ml-opensource-project
An open source repository focused on providing practical and educational machine learning resources. The project aims to make learning and applying machine learning more accessible through well-documented code, tutorials, and real-world examples.
cuda machine-learning machine-learning-algorithms ml-projects open-source python
Last synced: 19 May 2026
https://github.com/bd2720/accesspatterns
Comparing chunked vs. striped memory access patterns for CPU and GPU code using the CUDA toolkit in C.
c cache cuda cuda-toolkit performance-analysis performance-testing profiling
Last synced: 16 May 2026
https://github.com/uva-trasgo/controllers
Read-only mirror of the official repository: https://gitlab.com/trasgo-group-valladolid/controllers. Controllers is a library written in C11 that provides a simplified way to program applications that can exploit heterogeneous computational platforms including accelerators and/or multi-core CPUs.
cuda heterogeneous-computing heterogeneous-parallel-programming hip opencl openmp
Last synced: 12 May 2026
https://github.com/flosmume/cpp-cuda-streams-and-pinned-mem
A CUDA C++ demo showing how to overlap data transfer and kernel execution using multiple streams and pinned (page-locked) host memory. This project illustrates asynchronous memcpy, event timing, and performance benefits of concurrent GPU execution — essential for building high-throughput pipelines.
asynchronous-execution cuda cuda-streams gpu parallel-programming performance-optimization pinned-memory
Last synced: 13 May 2026
https://github.com/nabilshadman/cuda-4-dummies
Lecture slides and exercise files of the CUDA 4 Dummies course (2025)
cuda gpu-computing high-performance-computing nsight-systems nvidia-gpu parallel-computing
Last synced: 31 Oct 2025
https://github.com/lu-m-dev/cuda-molecular-simulation
CUDA accelerated molecular simulation of materials
cuda materials-science molecular-dynamics molecular-simulation monte-carlo
Last synced: 25 Jun 2026
https://github.com/juliankarrer/reyn
CUDA-based Implementation of Smoothed Particle Hydrodynamics for Fluid Simulation
cuda fluid lagrangian simulation sph
Last synced: 31 Oct 2025
https://github.com/myselfaryan/attention-mechanism
Accelerating Scaled Dot-Product Attention using OpenMP and CUDA
Last synced: 27 Apr 2026
https://github.com/ludekcizinsky/fast-cg-solver
Implementation of Conjugate Gradient (CG) algorithm for solving sparse linear systems using MPI and CUDA.
Last synced: 17 May 2026
https://github.com/yash-1335/qwen600
🚀 Build a fast inference engine for the QWEN3-0.6B model using CUDA, optimizing performance with minimal dependencies for efficient learning and practice.
cuda cuda-programming gpu llamacpp llm llm-inference qwen qwen3 transformer
Last synced: 16 May 2026
https://github.com/nxoti1/points-reader-ocr
🖥️ Extract text from images easily with POINTS-Reader OCR, a high-accuracy application for seamless document conversion and processing.
cuda gradio huggingface-transformers ocr open-source points-reader reportlab spaces tencent vision-language-model vlm
Last synced: 20 May 2026
https://github.com/andreasholt/cuda-matmul-benchmarking
Implementing and benchmarking various matmul implementations in CUDA
Last synced: 01 Nov 2025