CUDA

CUDA® is a parallel computing platform and programming model developed by NVIDIA for general computing on graphics processing units (GPUs). With CUDA, developers can dramatically speed up computing applications by harnessing the power of GPUs.
- GitHub: https://github.com/topics/cuda
- Wikipedia: https://en.wikipedia.org/wiki/CUDA
- Created by: Nvidia
- Released: June 23, 2007
- Related Topics: nvcc
- Last updated: 2025-04-25 00:07:07 UTC
https://github.com/ddemidov/vexcl
VexCL is a C++ vector expression template library for OpenCL/CUDA/OpenMP
c-plus-plus cpp11 cuda gpgpu opencl scientific-computing
Last synced: 14 Apr 2025
https://github.com/shibatch/sleef
SIMD Library for Evaluating Elementary Functions, vectorized libm and DFT
aarch64 avx2 avx512 cuda elementary-functions fft fourier-transform fourier-transform-library math-library powerpc quadruple-precision s390x simd sse2 vector-math vectorization vsx
Last synced: 13 Apr 2025
https://github.com/cresset-template/cresset
Template repository to build PyTorch projects from source on any version of PyTorch/CUDA/cuDNN.
build cuda deep-learning deep-learning-tutorial docker docker-compose machine-learning makefile mlops mlops-template python pytorch source source-python template template-repository wheel
Last synced: 04 Apr 2025
https://github.com/xtra-computing/thundergbm
ThunderGBM: Fast GBDTs and Random Forests on GPUs
cuda gbdt gpu machine-learning random-forest
Last synced: 11 Apr 2025
https://github.com/Xtra-Computing/thundergbm
ThunderGBM: Fast GBDTs and Random Forests on GPUs
cuda gbdt gpu machine-learning random-forest
Last synced: 12 Apr 2025
https://github.com/santosh-gupta/speedtorch
Library for faster pinned CPU <-> GPU transfer in PyTorch
cpu-gpu-transfer cpu-pinned-tensors cuda cuda-tensors cuda-variables cupy data-transfer embeddings embeddings-trained gpu gpu-transfer machine-learning natural-language-processing nlp pinned-cpu-tensors pytorch pytorch-tensors pytorch-variables sparse sparse-modeling
Last synced: 12 Apr 2025
https://github.com/thu-ml/SageAttention
Quantized Attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.
attention cuda inference-acceleration llm quantization triton video-generation
Last synced: 15 Dec 2024
https://github.com/Santosh-Gupta/SpeedTorch
Library for faster pinned CPU <-> GPU transfer in PyTorch
cpu-gpu-transfer cpu-pinned-tensors cuda cuda-tensors cuda-variables cupy data-transfer embeddings embeddings-trained gpu gpu-transfer machine-learning natural-language-processing nlp pinned-cpu-tensors pytorch pytorch-tensors pytorch-variables sparse sparse-modeling
Last synced: 15 Nov 2024
https://github.com/uxlfoundation/onemath
oneAPI Math Library (oneMath)
api blas cpu cuda dpcpp gpu hpc intel math-libraries oneapi onemkl parallel-computing parallel-programming performance rng
Last synced: 14 Apr 2025
https://github.com/maghoumi/pytorch-softdtw-cuda
Fast CUDA implementation of (differentiable) soft dynamic time warping for PyTorch
cuda deep-learning dynamic-time-warping pytorch soft-dtw
Last synced: 04 Apr 2025
https://github.com/cern/tigre
TIGRE: Tomographic Iterative GPU-based Reconstruction Toolbox
cuda gpus image-reconstruction matlab python tigre tomography toolbox x-ray
Last synced: 14 Apr 2025
https://github.com/hedronvision/bazel-compile-commands-extractor
Goal: Enable awesome tooling for Bazel users of the C language family.
bazel bazel-build c ccls clang clang-tidy clang-tooling clangd contributions-welcome cpp cross-platform cuda hacktoberfest objective-c objective-c-plus-plus tools
Last synced: 23 Mar 2025
https://github.com/insight-platform/Savant
Python Computer Vision & Video Analytics Framework With Batteries Included
computer-vision cuda deep-learning deepstream edge-computing inference-engine instance-segmentation machine-learning nvidia nvidia-deepstream-sdk object-detection opencv peoplenet tensorrt video yolo yolov5-face yolov8 yolov8-face
Last synced: 21 Apr 2025
https://github.com/oneapi-src/onemkl
oneAPI Math Kernel Library (oneMKL) Interfaces
api blas cpu cuda dpcpp gpu hpc intel math-libraries oneapi onemkl parallel-computing parallel-programming performance rng
Last synced: 01 Dec 2024
https://github.com/nvidia/nvbench
CUDA Kernel Benchmarking Library
benchmark cuda cuda-kernels gpu kernel-benchmark nvidia performance
Last synced: 14 Apr 2025
https://github.com/cudamat/cudamat
Python module for performing basic dense linear algebra computations on the GPU using CUDA.
Last synced: 14 Mar 2025
https://github.com/hpcaitech/fastfold
Optimizing AlphaFold Training and Inference on GPU Clusters
alphafold2 cuda evoformer gpu habana-gaudi parallelism protein-folding protein-structure pytorch
Last synced: 13 Apr 2025
https://github.com/inducer/loopy
A code generator for array-based code on CPUs and GPUs
array code-generation code-generator code-optimization code-transformation cuda ispc loop-optimization multidimensional-arrays opencl performance performance-analysis prefix-sum python reduction scan scientific-computing
Last synced: 11 Apr 2025
https://github.com/tpoisonooo/how-to-optimize-gemm
row-major matmul optimization
arm64 armv7 cuda cuda-kernel gemm-optimization int4 ptx vulkan
Last synced: 04 Apr 2025
https://github.com/codingonion/awesome-llm-and-aigc
🚀🚀🚀 A collection of awesome public projects about Large Language Models (LLM), Vision Language Models (VLM), Vision Language Action (VLA), and AI Generated Content (AIGC), along with the related datasets and applications.
aigc awesome-list chatgpt computer-vision cuda datasets deepseek deepseek-v3 gpt hugging-face langchain large-language-models llama llm openai sora triton vla vlm yolo
Last synced: 06 Feb 2025
https://github.com/hpcaitech/FastFold
Optimizing AlphaFold Training and Inference on GPU Clusters
alphafold2 cuda evoformer gpu habana-gaudi parallelism protein-folding protein-structure pytorch
Last synced: 12 Nov 2024
https://github.com/rapidsai/rmm
RAPIDS Memory Manager
cuda memory-allocation memory-management rapids
Last synced: 15 Apr 2025
https://github.com/gprmax/gprmax
gprMax is open source software that simulates electromagnetic wave propagation using the Finite-Difference Time-Domain (FDTD) method for numerical modelling of Ground Penetrating Radar (GPR)
antenna cuda electromagnetic fdtd gpr gpu modelling nvidia python simulation soil
Last synced: 07 Apr 2025
https://github.com/gprMax/gprMax
gprMax is open source software that simulates electromagnetic wave propagation using the Finite-Difference Time-Domain (FDTD) method for numerical modelling of Ground Penetrating Radar (GPR)
antenna cuda electromagnetic fdtd gpr gpu modelling nvidia python simulation soil
Last synced: 17 Nov 2024
https://github.com/sergio0694/neuralnetwork.net
A TensorFlow-inspired neural network library built from scratch in C# 7.3 for .NET Standard 2.0, with GPU support through cuDNN
ai backpropagation-algorithm classification-algorithims cnn convolutional-neural-networks csharp cuda gpu-acceleration gradient-descent machine-learning net-framework netstandard neural-network supervised-learning visual-studio
Last synced: 08 Apr 2025
https://github.com/Sergio0694/NeuralNetwork.NET
A TensorFlow-inspired neural network library built from scratch in C# 7.3 for .NET Standard 2.0, with GPU support through cuDNN
ai backpropagation-algorithm classification-algorithims cnn convolutional-neural-networks csharp cuda gpu-acceleration gradient-descent machine-learning net-framework netstandard neural-network supervised-learning visual-studio
Last synced: 02 Apr 2025
https://github.com/stochasticai/x-stable-diffusion
Real-time inference for Stable Diffusion - 0.88s latency. Covers AITemplate, nvFuser, TensorRT, FlashAttention. Join our Discord community: https://discord.com/invite/TgHXuSJEk6
aitemplate automl cuda docker inference notebook nvfuser onnx onnxruntime pytorch stable-diffusion tensorrt
Last synced: 04 Apr 2025
https://github.com/kwea123/gaussian_splatting_notes
A detailed explanation of the formulae behind Gaussian splatting
Last synced: 05 Apr 2025
https://github.com/laugh12321/TensorRT-YOLO
🚀 Your YOLO Deployment Powerhouse. With the synergy of TensorRT Plugins, CUDA Kernels, and CUDA Graphs, experience lightning-fast inference speeds.
cuda cuda-graph cuda-kernels cuda-programming detection onnx ppyoloe tensorrt yolov10 yolov3 yolov5 yolov6 yolov7 yolov8 yolov9
Last synced: 18 Mar 2025
https://github.com/laugh12321/tensorrt-yolo
🚀 Your YOLO Deployment Powerhouse. With the synergy of TensorRT Plugins, CUDA Kernels, and CUDA Graphs, experience lightning-fast inference speeds.
cuda cuda-graph cuda-kernels cuda-programming detection onnx ppyoloe tensorrt yolov10 yolov3 yolov5 yolov6 yolov7 yolov8 yolov9
Last synced: 10 Apr 2025
https://github.com/tencent/forward
A library for high performance deep learning inference on NVIDIA GPUs.
cuda deep-learning forward gpu inference inference-engine keras neural-network onnx pytorch tensorflow tensorrt
Last synced: 05 Apr 2025
https://github.com/Tencent/Forward
A library for high performance deep learning inference on NVIDIA GPUs.
cuda deep-learning forward gpu inference inference-engine keras neural-network onnx pytorch tensorflow tensorrt
Last synced: 18 Apr 2025
https://github.com/cvxgrp/pymde
Minimum-distortion embedding with PyTorch
cuda dimensionality-reduction embedding feature-vectors gpu graph-embedding machine-learning pytorch visualization
Last synced: 08 Apr 2025
https://github.com/mariosieg/magnetron
(WIP) A small but powerful, homemade PyTorch from scratch.
artificial-intelligence cpp cuda high-performance-computing machine-learning neuronal-network python pytorch research-project tensorflow tiny
Last synced: 08 Apr 2025
https://github.com/zhihu/cubert
Fast implementation of BERT inference directly on NVIDIA (CUDA, CUBLAS) and Intel MKL
bert cuda deep-learning inference mkl predict tensorflow transformer
Last synced: 05 Apr 2025
https://github.com/zeux/calm
CUDA/Metal accelerated language model inference
Last synced: 10 Apr 2025
https://github.com/nvidia/jitify
A single-header C++ library for simplifying the use of CUDA Runtime Compilation (NVRTC).
cpp cuda jit-compilation nvrtc runtime-compilation single-header
Last synced: 08 Apr 2025
https://github.com/nvidia/cucollections
cpp cpp17 cuda datastructures gpu hashmap hashset hashtable
Last synced: 14 Apr 2025
https://github.com/luisagroup/luisarender
High-Performance Cross-Platform Monte Carlo Renderer Based on LuisaCompute
cpp cuda gpu high-performance ispc metal optix path-tracing ray-tracing renderer rendering siggraph-asia-2022
Last synced: 12 Apr 2025
https://github.com/openhackathons-org/gpubootcamp
This repository consists of GPU bootcamp material for HPC and AI
ai4hpc cuda data-science deep-learning deepstream gpu hpc machine-learning mpi openacc openmp rapidsai
Last synced: 27 Mar 2025
https://github.com/spcl/dace
DaCe - Data Centric Parallel Programming
cuda fpga high-level-synthesis high-performance-computing programming-language vivado-hls
Last synced: 11 Apr 2025
https://github.com/zhihu/cuBERT
Fast implementation of BERT inference directly on NVIDIA (CUDA, CUBLAS) and Intel MKL
bert cuda deep-learning inference mkl predict tensorflow transformer
Last synced: 02 Apr 2025
https://github.com/NVIDIA/nvbench
CUDA Kernel Benchmarking Library
benchmark cuda cuda-kernels gpu kernel-benchmark nvidia performance
Last synced: 19 Nov 2024
https://github.com/gridhead/nvidia-auto-installer-for-fedora-linux
A CLI tool that lets you easily install proprietary NVIDIA drivers and much more on Fedora Linux (32 or above and Rawhide)
cuda fedora hacktoberfest nvidia optimus rpmfusion
Last synced: 07 Apr 2025
https://github.com/Kaixhin/dockerfiles
Compilation of Dockerfiles with automated builds enabled on the Docker Registry
cuda deep-learning docker dockerfiles machine-learning vnc
Last synced: 20 Mar 2025
https://github.com/rnd-team-dev/plotoptix
Data visualisation and ray tracing in Python based on OptiX 8.1 framework.
3d-graphics animation cuda generative-art gpu nvidia optix path-tracing pathtracing plot ray-tracing raytracer raytracing real-time rtx visualization
Last synced: 10 Apr 2025
https://github.com/ashvardanian/less_slow.cpp
Learning how to write "Less Slow" code in C++ 20, C 99, CUDA, PTX, & Assembly, from numerics & SIMD to coroutines, ranges, exception handling, networking and user-space IO
assembly assembly-language avx512 benchmark coroutines cpp cpp-programming cpp17 cpp20 cuda gcc google-benchmark hpc io-uring linux-kernel llvm ptx ranges tutorial tutorials
Last synced: 08 Apr 2025
https://github.com/gorgonia/cu
package cu provides an idiomatic interface to the CUDA Driver API.
cuda cuda-driver-api go golang
Last synced: 04 Apr 2025
https://zielon.github.io/insta/
INSTA - Instant Volumetric Head Avatars [CVPR2023]
3dmm avatars cuda flame instant-ngp nerf neural-network volumetric-rendering
Last synced: 26 Mar 2025
https://github.com/cloudcores/cuassembler
An unofficial CUDA assembler, for all generations of SASS, hopefully :)
Last synced: 05 Apr 2025
https://github.com/brucefan1983/GPUMD
Graphics Processing Units Molecular Dynamics
cuda gpu gpumd heat-transport high-performance-computing machine-learning machine-learning-potential molecular-dynamics molecular-dynamics-simulation natural-evolution-strategies neural-network neuroevolution phonon physics-simulation simulation
Last synced: 13 Nov 2024
https://github.com/salesforce/warp-drive
Extremely Fast End-to-End Deep Multi-Agent Reinforcement Learning Framework on a GPU (JMLR 2022)
cuda deep-learning gpu high-throughput multiagent-reinforcement-learning numba pytorch reinforcement-learning
Last synced: 09 Apr 2025
https://github.com/h2oai/h2o4gpu
H2Oai GPU Edition
c-plus-plus cpu cuda elastic-net glm gpu lasso machine-learning pca python r rstats svd
Last synced: 13 Apr 2025
https://github.com/cloudcores/CuAssembler
An unofficial CUDA assembler, for all generations of SASS, hopefully :)
Last synced: 20 Mar 2025
https://github.com/mumax/3
GPU-accelerated micromagnetic simulator
cuda finite-difference-time-domain go micromagnetics scientific-computing
Last synced: 28 Mar 2025
https://github.com/ccsb-scripps/autodock-gpu
AutoDock for GPUs and other accelerators
autodock4 cuda gpu-computing molecular-docking multicore-cpu opencl
Last synced: 07 Apr 2025
https://github.com/ginkgo-project/ginkgo
Numerical linear algebra software package
cuda dpcpp gpu-computing hip hpc krylov-methods linear-algebra oneapi openmp preconditioning sparse-linear-systems spmv
Last synced: 14 Apr 2025
https://github.com/sinkingsugar/nimtorch
PyTorch - Python + Nim
artificial-intelligence artificial-neural-networks cuda machine-learning nim pytorch wasm
Last synced: 07 Apr 2025
https://github.com/cnstark/pytorch-docker
Pure PyTorch Docker Images.
centos cuda deep-learning docker nvidia pytorch ubuntu
Last synced: 25 Jan 2025
https://github.com/huggingface/large_language_model_training_playbook
An open collection of implementation tips, tricks and resources for training large language models
cuda large-language-models llm nccl nlp performance python pytorch scalability troubleshooting
Last synced: 11 Nov 2024
https://github.com/termoshtt/accel
(Mirror of GitLab) GPGPU Framework for Rust
Last synced: 25 Jan 2025
https://github.com/petercunha/pine
🌲 Aimbot powered by real-time object detection with neural networks, GPU accelerated with Nvidia. Optimized for use with CS:GO.
aimbot csgo cuda darknet detection fortnite fps game-hacking hacking neural-network neural-networks nvidia object-detection opencl opencv overwatch pine python yolo yolov3
Last synced: 09 Apr 2025
https://github.com/petercunha/Pine
🌲 Aimbot powered by real-time object detection with neural networks, GPU accelerated with Nvidia. Optimized for use with CS:GO.
aimbot csgo cuda darknet detection fortnite fps game-hacking hacking neural-network neural-networks nvidia object-detection opencl opencv overwatch pine python yolo yolov3
Last synced: 17 Apr 2025
https://github.com/MegviiRobot/MegBA
MegBA: A GPU-Based Distributed Library for Large-Scale Bundle Adjustment
bundleadjustment cuda distributed gpu-acceleration graph-optimization high-performance
Last synced: 14 Nov 2024
https://github.com/patwie/tensorflow-cmake
TensorFlow examples in C, C++, Go and Python without bazel but with cmake and FindTensorFlow.cmake
c cmake cpp cuda deep-learning golang inference opencv tensorflow tensorflow-cc tensorflow-cmake tensorflow-examples tensorflow-gpu
Last synced: 06 Apr 2025
https://github.com/shi-labs/natten
Neighborhood Attention Extension. Bringing attention to a neighborhood near you!
cuda neighborhood-attention pytorch
Last synced: 13 Apr 2025
https://github.com/tlkh/ai-lab
All-in-one AI container for rapid prototyping
cuda data-science deep-learning docker jupyter nvidia pytorch tensorflow
Last synced: 05 Apr 2025
https://github.com/colin97/msn-point-cloud-completion
Morphing and Sampling Network for Dense Point Cloud Completion (AAAI2020)
3d-reconstruction auction-algorithm cuda earth-mover-distance earth-movers-distance minimum-spanning-tree point-cloud point-cloud-completion point-cloud-processing shape-completion
Last synced: 07 Apr 2025
https://github.com/toverainc/willow-inference-server
Open source, local, and self-hosted highly optimized language inference server supporting ASR/STT, TTS, and LLM across WebRTC, REST, and WS
cuda deep-learning llama llm privacy speech-recognition speech-to-text text-to-speech vicuna webrtc whisper willow
Last synced: 05 Apr 2025
https://github.com/DerryHub/BEVFormer_tensorrt
BEVFormer inference on TensorRT, including INT8 Quantization and Custom TensorRT Plugins (float/half/half2/int8).
bevformer cuda int8-inference pytorch quantization tensorrt-plugins
Last synced: 20 Mar 2025
https://github.com/vectorch-ai/scalellm
A high-performance inference system for large language models, designed for production environments.
cuda efficiency gpu inference llama llama3 llm llm-inference model performance production serving speculative transformer
Last synced: 14 Apr 2025
https://github.com/arrayfire/arrayfire-python
Python bindings for ArrayFire: A general purpose GPU library.
arrayfire cuda gpgpu gpu hpc opencl python python-bindings
Last synced: 02 Apr 2025
https://github.com/uncomplicate/deep-diamond
A fast Clojure Tensor & Deep Learning library
clojure cuda deep-learning deep-neural-networks dnnl gpu java nvidia
Last synced: 12 Apr 2025
https://github.com/alicevision/popsift
PopSift is an implementation of the SIFT algorithm in CUDA.
computer-vision cuda feature-extraction gpu image-processing sift
Last synced: 05 Apr 2025
https://github.com/serverlessllm/serverlessllm
Serverless LLM Serving for Everyone.
cuda huggingface-transformers large-language-models model-as-a-service model-serving pytorch serverless-inference
Last synced: 10 Apr 2025
https://github.com/xmrig/xmrig-cuda
NVIDIA CUDA plugin for XMRig miner
cryptonight cuda randomx xmrig
Last synced: 07 Apr 2025
https://github.com/ingonyama-zk/icicle
A hardware acceleration library for compute intensive cryptography 🧊
cpu cryptography cuda golang msm ntt rust zero-knowledge
Last synced: 13 Apr 2025
https://github.com/JuliaGPU/CUDAnative.jl
Julia support for native CUDA programming
cuda cuda-toolkit julia julia-library
Last synced: 29 Nov 2024
https://github.com/rapidsai/cucim
cuCIM - RAPIDS GPU-accelerated image processing library
computer-vision cuda digital-pathology gpu image-analysis image-data image-processing medical-imaging microscopy multidimensional-image-processing nvidia segmentation
Last synced: 11 Apr 2025
https://github.com/dfm/extending-jax
Extending JAX with custom C++ and CUDA code
Last synced: 05 Apr 2025
https://github.com/lambdalabsml/distributed-training-guide
Best practices & guides on how to write distributed PyTorch training code
cluster cuda deepspeed distributed-training fsdp gpu gpu-cluster kuberentes lambdalabs mpi nccl pytorch sharding slurm
Last synced: 08 Apr 2025
https://github.com/nvidia/cuquantum
Home for cuQuantum Python & NVIDIA cuQuantum SDK C++ samples
cuda cuquantum custatevec cutensornet nvidia quantum-computing
Last synced: 12 Apr 2025
https://github.com/vectorch-ai/ScaleLLM
A high-performance inference system for large language models, designed for production environments.
cuda efficiency gpu inference llama llama3 llm llm-inference model performance production serving speculative transformer
Last synced: 16 Nov 2024
https://github.com/NVIDIA/cuQuantum
Home for cuQuantum Python & NVIDIA cuQuantum SDK C++ samples
cuda cuquantum custatevec cutensornet nvidia quantum-computing
Last synced: 02 Apr 2025
https://github.com/fixstars/cuda-bundle-adjustment
A CUDA implementation of Bundle Adjustment
bundle-adjustment cuda g2o slam structure-from-motion visual-slam
Last synced: 05 Apr 2025
https://github.com/ibm/aihwkit
IBM Analog Hardware Acceleration Kit
ai analog-devices cuda neural-networks pytorch
Last synced: 08 Apr 2025
https://github.com/osai-ai/tensor-stream
A library for real-time video stream decoding to CUDA memory
c-plus-plus cuda python pytorch video video-processing
Last synced: 14 Nov 2024
https://github.com/bruce-lee-ly/cuda_hgemm
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores with the WMMA API and MMA PTX instructions.
cublas cuda gemm gpu hgemm matrix-multiply nvidia tensor-core
Last synced: 05 Apr 2025
https://github.com/luoyetx/mini-caffe
Minimal runtime core of Caffe, Forward only, GPU support and Memory efficiency.
android caffe cuda cudnn forward-only linux mini-caffe openblas windows
Last synced: 15 Mar 2025
https://github.com/nosferalatu/SimpleGPUHashTable
A simple GPU hash table implemented in CUDA using lock free techniques
cuda cuda-programming data-structures gpu gpu-cuda-programs
Last synced: 14 Nov 2024
https://github.com/glotzerlab/hoomd-blue
Molecular dynamics and Monte Carlo soft matter simulation on GPUs.
conda-forge cuda docker gpu hard-particle hoomd-blue molecular-dynamics monte-carlo-simulation particle-system python simulation singularity
Last synced: 14 Apr 2025
https://github.com/uncomplicate/bayadera
High-performance Bayesian Data Analysis on the GPU in Clojure
bayesian bayesian-data-analysis bayesian-inference clojure clojure-library cuda gpu gpu-acceleration gpu-computing high-performance-computing machine-learning markov-chain-monte-carlo mcmc opencl statistics
Last synced: 09 Apr 2025
https://github.com/rapidsai/cuvs
cuVS - a library for vector search and clustering on the GPU
anns clustering cuda distance gpu information-retrieval llm machine-learning nearest-neighbors neighborhood-methods similarity-search sparse statistics vector-search vector-similarity vector-store
Last synced: 13 Apr 2025
https://github.com/LambdaLabsML/distributed-training-guide
Best practices & guides on how to write distributed PyTorch training code
cluster cuda deepspeed distributed-training fsdp gpu gpu-cluster kuberentes lambdalabs mpi nccl pytorch sharding slurm
Last synced: 08 Mar 2025
https://github.com/nersc/timemory
Modular C++ Toolkit for Performance Analysis and Logging. Profiling API and Tools for C, C++, CUDA, Fortran, and Python. The C++ template API is essentially a framework for creating tools: it is designed to provide a unifying interface for recording various performance measurements alongside data logging and interfaces to other tools.
analysis c cplusplus cpp cross-language cross-platform cuda cupti gotcha hardware-counters instrumentation-api memory-measurements modular-design mpi papi performance performance-measurement python roofline
Last synced: 23 Jan 2025
https://github.com/alpaka-group/alpaka
Abstraction Library for Parallel Kernel Acceleration 🦙
cpp cpp17 cuda gpu header-only heterogeneous-parallel-programming hip hpc openacc openmp rocm tbb
Last synced: 07 Apr 2025
https://github.com/NERSC/timemory
Modular C++ Toolkit for Performance Analysis and Logging. Profiling API and Tools for C, C++, CUDA, Fortran, and Python. The C++ template API is essentially a framework for creating tools: it is designed to provide a unifying interface for recording various performance measurements alongside data logging and interfaces to other tools.
analysis c cplusplus cpp cross-language cross-platform cuda cupti gotcha hardware-counters instrumentation-api memory-measurements modular-design mpi papi performance performance-measurement python roofline
Last synced: 14 Nov 2024