CUDA | Ecosyste.ms: Awesome

https://github.com/ddemidov/vexcl

VexCL is a C++ vector expression template library for OpenCL/CUDA/OpenMP

c-plus-plus cpp11 cuda gpgpu opencl scientific-computing

Last synced: 14 Apr 2025

https://github.com/shibatch/sleef

SIMD Library for Evaluating Elementary Functions, vectorized libm and DFT

aarch64 avx2 avx512 cuda elementary-functions fft fourier-transform fourier-transform-library math-library powerpc quadruple-precision s390x simd sse2 vector-math vectorization vsx

Last synced: 13 Apr 2025

https://github.com/cresset-template/cresset

Template repository to build PyTorch projects from source on any version of PyTorch/CUDA/cuDNN.

build cuda deep-learning deep-learning-tutorial docker docker-compose machine-learning makefile mlops mlops-template python pytorch source source-python template template-repository wheel

Last synced: 04 Apr 2025

https://github.com/xtra-computing/thundergbm

ThunderGBM: Fast GBDTs and Random Forests on GPUs

cuda gbdt gpu machine-learning random-forest

Last synced: 11 Apr 2025

https://github.com/Xtra-Computing/thundergbm

ThunderGBM: Fast GBDTs and Random Forests on GPUs

cuda gbdt gpu machine-learning random-forest

Last synced: 12 Apr 2025

https://github.com/santosh-gupta/speedtorch

Library for faster pinned CPU <-> GPU transfer in Pytorch

cpu-gpu-transfer cpu-pinned-tensors cuda cuda-tensors cuda-variables cupy data-transfer embeddings embeddings-trained gpu gpu-transfer machine-learning natural-language-processing nlp pinned-cpu-tensors pytorch pytorch-tensors pytorch-variables sparse sparse-modeling

Last synced: 12 Apr 2025

https://github.com/thu-ml/SageAttention

Quantized Attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.

attention cuda inference-acceleration llm quantization triton video-generation

Last synced: 15 Dec 2024

https://github.com/Santosh-Gupta/SpeedTorch

Library for faster pinned CPU <-> GPU transfer in Pytorch

cpu-gpu-transfer cpu-pinned-tensors cuda cuda-tensors cuda-variables cupy data-transfer embeddings embeddings-trained gpu gpu-transfer machine-learning natural-language-processing nlp pinned-cpu-tensors pytorch pytorch-tensors pytorch-variables sparse sparse-modeling

Last synced: 15 Nov 2024

https://github.com/uxlfoundation/onemath

oneAPI Math Library (oneMath)

api blas cpu cuda dpcpp gpu hpc intel math-libraries oneapi onemkl parallel-computing parallel-programming performance rng

Last synced: 14 Apr 2025

https://github.com/maghoumi/pytorch-softdtw-cuda

Fast CUDA implementation of (differentiable) soft dynamic time warping for PyTorch

cuda deep-learning dynamic-time-warping pytorch soft-dtw

Last synced: 04 Apr 2025

https://github.com/cern/tigre

TIGRE: Tomographic Iterative GPU-based Reconstruction Toolbox

cuda gpus image-reconstruction matlab python tigre tomography toolbox x-ray

Last synced: 14 Apr 2025

https://github.com/hedronvision/bazel-compile-commands-extractor

Goal: Enable awesome tooling for Bazel users of the C language family.

bazel bazel-build c ccls clang clang-tidy clang-tooling clangd contributions-welcome cpp cross-platform cuda hacktoberfest objective-c objective-c-plus-plus tools

Last synced: 23 Mar 2025

https://github.com/insight-platform/Savant

Python Computer Vision & Video Analytics Framework With Batteries Included

computer-vision cuda deep-learning deepstream edge-computing inference-engine instance-segmentation machine-learning nvidia nvidia-deepstream-sdk object-detection opencv peoplenet tensorrt video yolo yolov5-face yolov8 yolov8-face

Last synced: 21 Apr 2025

https://github.com/oneapi-src/onemkl

oneAPI Math Kernel Library (oneMKL) Interfaces

api blas cpu cuda dpcpp gpu hpc intel math-libraries oneapi onemkl parallel-computing parallel-programming performance rng

Last synced: 01 Dec 2024

https://github.com/nvidia/nvbench

CUDA Kernel Benchmarking Library

benchmark cuda cuda-kernels gpu kernel-benchmark nvidia performance

Last synced: 14 Apr 2025

https://github.com/cudamat/cudamat

Python module for performing basic dense linear algebra computations on the GPU using CUDA.

cuda linear-algebra python

Last synced: 14 Mar 2025

https://github.com/hpcaitech/fastfold

Optimizing AlphaFold Training and Inference on GPU Clusters

alphafold2 cuda evoformer gpu habana-gaudi parallelism protein-folding protein-structure pytorch

Last synced: 13 Apr 2025

https://github.com/inducer/loopy

A code generator for array-based code on CPUs and GPUs

array code-generation code-generator code-optimization code-transformation cuda ispc loop-optimization multidimensional-arrays opencl performance performance-analysis prefix-sum python reduction scan scientific-computing

Last synced: 11 Apr 2025

https://github.com/tpoisonooo/how-to-optimize-gemm

row-major matmul optimization

arm64 armv7 cuda cuda-kernel gemm-optimization int4 ptx vulkan

Last synced: 04 Apr 2025

https://github.com/codingonion/awesome-llm-and-aigc

🚀🚀🚀 A collection of awesome public projects about Large Language Models (LLM), Vision Language Models (VLM), Vision Language Action (VLA), AI Generated Content (AIGC), and the related datasets and applications.

aigc awesome-list chatgpt computer-vision cuda datasets deepseek deepseek-v3 gpt hugging-face langchain large-language-models llama llm openai sora triton vla vlm yolo

Last synced: 06 Feb 2025

https://github.com/hpcaitech/FastFold

Optimizing AlphaFold Training and Inference on GPU Clusters

alphafold2 cuda evoformer gpu habana-gaudi parallelism protein-folding protein-structure pytorch

Last synced: 12 Nov 2024

https://github.com/rapidsai/rmm

RAPIDS Memory Manager

cuda memory-allocation memory-management rapids

Last synced: 15 Apr 2025

https://github.com/gprmax/gprmax

gprMax is open source software that simulates electromagnetic wave propagation using the Finite-Difference Time-Domain (FDTD) method for numerical modelling of Ground Penetrating Radar (GPR)

antenna cuda electromagnetic fdtd gpr gpu modelling nvidia python simulation soil

Last synced: 07 Apr 2025

https://github.com/gprMax/gprMax

gprMax is open source software that simulates electromagnetic wave propagation using the Finite-Difference Time-Domain (FDTD) method for numerical modelling of Ground Penetrating Radar (GPR)

antenna cuda electromagnetic fdtd gpr gpu modelling nvidia python simulation soil

Last synced: 17 Nov 2024

https://github.com/sergio0694/neuralnetwork.net

A TensorFlow-inspired neural network library built from scratch in C# 7.3 for .NET Standard 2.0, with GPU support through cuDNN

ai backpropagation-algorithm classification-algorithims cnn convolutional-neural-networks csharp cuda gpu-acceleration gradient-descent machine-learning net-framework netstandard neural-network supervised-learning visual-studio

Last synced: 08 Apr 2025

https://github.com/Sergio0694/NeuralNetwork.NET

A TensorFlow-inspired neural network library built from scratch in C# 7.3 for .NET Standard 2.0, with GPU support through cuDNN

ai backpropagation-algorithm classification-algorithims cnn convolutional-neural-networks csharp cuda gpu-acceleration gradient-descent machine-learning net-framework netstandard neural-network supervised-learning visual-studio

Last synced: 02 Apr 2025

https://github.com/stochasticai/x-stable-diffusion

Real-time inference for Stable Diffusion - 0.88s latency. Covers AITemplate, nvFuser, TensorRT, FlashAttention. Join our Discord community: https://discord.com/invite/TgHXuSJEk6

aitemplate automl cuda docker inference notebook nvfuser onnx onnxruntime pytorch stable-diffusion tensorrt

Last synced: 04 Apr 2025

https://github.com/kwea123/gaussian_splatting_notes

A detailed explanation of the formulae behind Gaussian splatting

cuda gaussian-splatting

Last synced: 05 Apr 2025

https://github.com/laugh12321/TensorRT-YOLO

🚀 Your YOLO Deployment Powerhouse. With the synergy of TensorRT Plugins, CUDA Kernels, and CUDA Graphs, experience lightning-fast inference speeds.

cuda cuda-graph cuda-kernels cuda-programming detection onnx ppyoloe tensorrt yolov10 yolov3 yolov5 yolov6 yolov7 yolov8 yolov9

Last synced: 18 Mar 2025

https://github.com/laugh12321/tensorrt-yolo

🚀 Your YOLO Deployment Powerhouse. With the synergy of TensorRT Plugins, CUDA Kernels, and CUDA Graphs, experience lightning-fast inference speeds.

cuda cuda-graph cuda-kernels cuda-programming detection onnx ppyoloe tensorrt yolov10 yolov3 yolov5 yolov6 yolov7 yolov8 yolov9

Last synced: 10 Apr 2025

https://github.com/tencent/forward

A library for high performance deep learning inference on NVIDIA GPUs.

cuda deep-learning forward gpu inference inference-engine keras neural-network onnx pytorch tensorflow tensorrt

Last synced: 05 Apr 2025

https://github.com/Tencent/Forward

A library for high performance deep learning inference on NVIDIA GPUs.

cuda deep-learning forward gpu inference inference-engine keras neural-network onnx pytorch tensorflow tensorrt

Last synced: 18 Apr 2025

https://github.com/cvxgrp/pymde

Minimum-distortion embedding with PyTorch

cuda dimensionality-reduction embedding feature-vectors gpu graph-embedding machine-learning pytorch visualization

Last synced: 08 Apr 2025

https://github.com/mariosieg/magnetron

(WIP) A small but powerful, homemade PyTorch from scratch.

artificial-intelligence cpp cuda high-performance-computing machine-learning neuronal-network python pytorch research-project tensorflow tiny

Last synced: 08 Apr 2025

https://github.com/zhihu/cubert

Fast implementation of BERT inference directly on NVIDIA (CUDA, CUBLAS) and Intel MKL

bert cuda deep-learning inference mkl predict tensorflow transformer

Last synced: 05 Apr 2025

https://github.com/zeux/calm

CUDA/Metal accelerated language model inference

cuda llm-inference ml

Last synced: 10 Apr 2025

https://github.com/nvidia/jitify

A single-header C++ library for simplifying the use of CUDA Runtime Compilation (NVRTC).

cpp cuda jit-compilation nvrtc runtime-compilation single-header

Last synced: 08 Apr 2025

https://github.com/nvidia/cucollections

cpp cpp17 cuda datastructures gpu hashmap hashset hashtable

Last synced: 14 Apr 2025

https://github.com/luisagroup/luisarender

High-Performance Cross-Platform Monte Carlo Renderer Based on LuisaCompute

cpp cuda gpu high-performance ispc metal optix path-tracing ray-tracing renderer rendering siggraph-asia-2022

Last synced: 12 Apr 2025

https://github.com/openhackathons-org/gpubootcamp

This repository consists of GPU bootcamp material for HPC and AI

ai4hpc cuda data-science deep-learning deepstream gpu hpc machine-learning mpi openacc openmp rapidsai

Last synced: 27 Mar 2025

https://github.com/spcl/dace

DaCe - Data Centric Parallel Programming

cuda fpga high-level-synthesis high-performance-computing programming-language vivado-hls

Last synced: 11 Apr 2025

https://github.com/zhihu/cuBERT

Fast implementation of BERT inference directly on NVIDIA (CUDA, CUBLAS) and Intel MKL

bert cuda deep-learning inference mkl predict tensorflow transformer

Last synced: 02 Apr 2025

https://github.com/NVIDIA/nvbench

CUDA Kernel Benchmarking Library

benchmark cuda cuda-kernels gpu kernel-benchmark nvidia performance

Last synced: 19 Nov 2024

https://github.com/gridhead/nvidia-auto-installer-for-fedora-linux

A CLI tool which lets you install proprietary NVIDIA drivers and much more easily on Fedora Linux (32 or above and Rawhide)

cuda fedora hacktoberfest nvidia optimus rpmfusion

Last synced: 07 Apr 2025

https://github.com/Kaixhin/dockerfiles

Compilation of Dockerfiles with automated builds enabled on the Docker Registry

cuda deep-learning docker dockerfiles machine-learning vnc

Last synced: 20 Mar 2025

https://github.com/rnd-team-dev/plotoptix

Data visualisation and ray tracing in Python based on OptiX 8.1 framework.

3d-graphics animation cuda generative-art gpu nvidia optix path-tracing pathtracing plot ray-tracing raytracer raytracing real-time rtx visualization

Last synced: 10 Apr 2025

https://github.com/ashvardanian/less_slow.cpp

Learning how to write "Less Slow" code in C++ 20, C 99, CUDA, PTX, & Assembly, from numerics & SIMD to coroutines, ranges, exception handling, networking and user-space IO

assembly assembly-language avx512 benchmark coroutines cpp cpp-programming cpp17 cpp20 cuda gcc google-benchmark hpc io-uring linux-kernel llvm ptx ranges tutorial tutorials

Last synced: 08 Apr 2025

https://github.com/gorgonia/cu

package cu provides an idiomatic interface to the CUDA Driver API.

cuda cuda-driver-api go golang

Last synced: 04 Apr 2025

https://github.com/cryinkfly/solidworks-for-linux

This is a project, where I give you a way to use SOLIDWORKS on Linux!

archlinux cuda fedora international linux linuxmint manjaro nvidia opengl opensuse ubuntu wine

Last synced: 08 Apr 2025

https://zielon.github.io/insta/

INSTA - Instant Volumetric Head Avatars [CVPR2023]

3dmm avatars cuda flame instant-ngp nerf neural-network volumetric-rendering

Last synced: 26 Mar 2025

https://github.com/cloudcores/cuassembler

An unofficial cuda assembler, for all generations of SASS, hopefully ：）

assembler cuda nvidia sass

Last synced: 05 Apr 2025

https://github.com/brucefan1983/GPUMD

Graphics Processing Units Molecular Dynamics

cuda gpu gpumd heat-transport high-performance-computing machine-learning machine-learning-potential molecular-dynamics molecular-dynamics-simulation natural-evolution-strategies neural-network neuroevolution phonon physics-simulation simulation

Last synced: 13 Nov 2024

https://github.com/salesforce/warp-drive

Extremely Fast End-to-End Deep Multi-Agent Reinforcement Learning Framework on a GPU (JMLR 2022)

cuda deep-learning gpu high-throughput multiagent-reinforcement-learning numba pytorch reinforcement-learning

Last synced: 09 Apr 2025

https://github.com/h2oai/h2o4gpu

H2Oai GPU Edition

c-plus-plus cpu cuda elastic-net glm gpu lasso machine-learning pca python r rstats svd

Last synced: 13 Apr 2025

https://github.com/cloudcores/CuAssembler

An unofficial cuda assembler, for all generations of SASS, hopefully ：）

assembler cuda nvidia sass

Last synced: 20 Mar 2025

https://github.com/mumax/3

GPU-accelerated micromagnetic simulator

cuda finite-difference-time-domain go micromagnetics scientific-computing

Last synced: 28 Mar 2025

https://github.com/ccsb-scripps/autodock-gpu

AutoDock for GPUs and other accelerators

autodock4 cuda gpu-computing molecular-docking multicore-cpu opencl

Last synced: 07 Apr 2025

https://github.com/ginkgo-project/ginkgo

Numerical linear algebra software package

cuda dpcpp gpu-computing hip hpc krylov-methods linear-algebra oneapi openmp preconditioning sparse-linear-systems spmv

Last synced: 14 Apr 2025

https://github.com/sinkingsugar/nimtorch

PyTorch - Python + Nim

artificial-intelligence artificial-neural-networks cuda machine-learning nim pytorch wasm

Last synced: 07 Apr 2025

https://github.com/cnstark/pytorch-docker

Pure Pytorch Docker Images.

centos cuda deep-learning docker nvidia pytorch ubuntu

Last synced: 25 Jan 2025

https://github.com/huggingface/large_language_model_training_playbook

An open collection of implementation tips, tricks and resources for training large language models

cuda large-language-models llm nccl nlp performance python pytorch scalability troubleshooting

Last synced: 11 Nov 2024

https://github.com/termoshtt/accel

(Mirror of GitLab) GPGPU Framework for Rust

cuda gpgpu rust-lang

Last synced: 25 Jan 2025

https://github.com/petercunha/pine

:evergreen_tree: Aimbot powered by real-time object detection with neural networks, GPU accelerated with Nvidia. Optimized for use with CS:GO.

aimbot csgo cuda darknet detection fortnite fps game-hacking hacking neural-network neural-networks nvidia object-detection opencl opencv overwatch pine python yolo yolov3

Last synced: 09 Apr 2025

https://github.com/petercunha/Pine

:evergreen_tree: Aimbot powered by real-time object detection with neural networks, GPU accelerated with Nvidia. Optimized for use with CS:GO.

aimbot csgo cuda darknet detection fortnite fps game-hacking hacking neural-network neural-networks nvidia object-detection opencl opencv overwatch pine python yolo yolov3

Last synced: 17 Apr 2025

https://github.com/MegviiRobot/MegBA

MegBA: A GPU-Based Distributed Library for Large-Scale Bundle Adjustment

bundleadjustment cuda distributed gpu-acceleration graph-optimization high-performance

Last synced: 14 Nov 2024

https://github.com/patwie/tensorflow-cmake

TensorFlow examples in C, C++, Go and Python without bazel but with cmake and FindTensorFlow.cmake

c cmake cpp cuda deep-learning golang inference opencv tensorflow tensorflow-cc tensorflow-cmake tensorflow-examples tensorflow-gpu

Last synced: 06 Apr 2025

https://github.com/shi-labs/natten

Neighborhood Attention Extension. Bringing attention to a neighborhood near you!

cuda neighborhood-attention pytorch

Last synced: 13 Apr 2025

https://github.com/tlkh/ai-lab

All-in-one AI container for rapid prototyping

cuda data-science deep-learning docker jupyter nvidia pytorch tensorflow

Last synced: 05 Apr 2025

https://github.com/colin97/msn-point-cloud-completion

Morphing and Sampling Network for Dense Point Cloud Completion (AAAI2020)

3d-reconstruction auction-algorithm cuda earth-mover-distance earth-movers-distance minimum-spanning-tree point-cloud point-cloud-completion point-cloud-processing shape-completion

Last synced: 07 Apr 2025

https://github.com/toverainc/willow-inference-server

Open source, local, and self-hosted highly optimized language inference server supporting ASR/STT, TTS, and LLM across WebRTC, REST, and WS

cuda deep-learning llama llm privacy speech-recognition speech-to-text text-to-speech vicuna webrtc whisper willow

Last synced: 05 Apr 2025

https://github.com/DerryHub/BEVFormer_tensorrt

BEVFormer inference on TensorRT, including INT8 Quantization and Custom TensorRT Plugins (float/half/half2/int8).

bevformer cuda int8-inference pytorch quantization tensorrt-plugins

Last synced: 20 Mar 2025

https://github.com/vectorch-ai/scalellm

A high-performance inference system for large language models, designed for production environments.

cuda efficiency gpu inference llama llama3 llm llm-inference model performance production serving speculative transformer

Last synced: 14 Apr 2025

https://github.com/arrayfire/arrayfire-python

Python bindings for ArrayFire: A general purpose GPU library.

arrayfire cuda gpgpu gpu hpc opencl python python-bindings

Last synced: 02 Apr 2025

https://github.com/libocca/occa

Portable and vendor neutral framework for parallel programming on heterogeneous platforms.

c cpp cuda dpcpp fortran gpgpu gpu hip hpc jit metal multithreading oneapi opencl openmp sycl

Last synced: 04 Apr 2025

https://github.com/uncomplicate/deep-diamond

A fast Clojure Tensor & Deep Learning library

clojure cuda deep-learning deep-neural-networks dnnl gpu java nvidia

Last synced: 12 Apr 2025

https://github.com/alicevision/popsift

PopSift is an implementation of the SIFT algorithm in CUDA.

computer-vision cuda feature-extraction gpu image-processing sift

Last synced: 05 Apr 2025

https://github.com/serverlessllm/serverlessllm

Serverless LLM Serving for Everyone.

cuda huggingface-transformers large-language-models model-as-a-service model-serving pytorch serverless-inference

Last synced: 10 Apr 2025

https://github.com/xmrig/xmrig-cuda

NVIDIA CUDA plugin for XMRig miner

cryptonight cuda randomx xmrig

Last synced: 07 Apr 2025

https://github.com/ingonyama-zk/icicle

A hardware acceleration library for compute intensive cryptography :ice_cube:

cpu cryptography cuda golang msm ntt rust zero-knowledge

Last synced: 13 Apr 2025

https://github.com/JuliaGPU/CUDAnative.jl

Julia support for native CUDA programming

cuda cuda-toolkit julia julia-library

Last synced: 29 Nov 2024

https://github.com/rapidsai/cucim

cuCIM - RAPIDS GPU-accelerated image processing library

computer-vision cuda digital-pathology gpu image-analysis image-data image-processing medical-imaging microscopy multidimensional-image-processing nvidia segmentation

Last synced: 11 Apr 2025

https://github.com/dfm/extending-jax

Extending JAX with custom C++ and CUDA code

cuda jax xla

Last synced: 05 Apr 2025

https://github.com/lambdalabsml/distributed-training-guide

Best practices & guides on how to write distributed pytorch training code

cluster cuda deepspeed distributed-training fsdp gpu gpu-cluster kuberentes lambdalabs mpi nccl pytorch sharding slurm

Last synced: 08 Apr 2025

https://github.com/nvidia/cuquantum

Home for cuQuantum Python & NVIDIA cuQuantum SDK C++ samples

cuda cuquantum custatevec cutensornet nvidia quantum-computing

Last synced: 12 Apr 2025

https://github.com/vectorch-ai/ScaleLLM

A high-performance inference system for large language models, designed for production environments.

cuda efficiency gpu inference llama llama3 llm llm-inference model performance production serving speculative transformer

Last synced: 16 Nov 2024

https://github.com/NVIDIA/cuQuantum

Home for cuQuantum Python & NVIDIA cuQuantum SDK C++ samples

cuda cuquantum custatevec cutensornet nvidia quantum-computing

Last synced: 02 Apr 2025

https://github.com/fixstars/cuda-bundle-adjustment

A CUDA implementation of Bundle Adjustment

bundle-adjustment cuda g2o slam structure-from-motion visual-slam

Last synced: 05 Apr 2025

https://github.com/ibm/aihwkit

IBM Analog Hardware Acceleration Kit

ai analog-devices cuda neural-networks pytorch

Last synced: 08 Apr 2025

https://github.com/osai-ai/tensor-stream

A library for real-time video stream decoding to CUDA memory

c-plus-plus cuda python pytorch video video-processing

Last synced: 14 Nov 2024

https://github.com/bruce-lee-ly/cuda_hgemm

Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruction.

cublas cuda gemm gpu hgemm matrix-multiply nvidia tensor-core

Last synced: 05 Apr 2025

https://github.com/luoyetx/mini-caffe

Minimal runtime core of Caffe, Forward only, GPU support and Memory efficiency.

android caffe cuda cudnn forward-only linux mini-caffe openblas windows

Last synced: 15 Mar 2025

https://github.com/ekondis/mixbench

A GPU benchmark tool for evaluating GPUs and CPUs on mixed operational intensity kernels (CUDA, OpenCL, HIP, SYCL, OpenMP)

benchmark cuda gpu hip opencl openmp sycl

Last synced: 04 Apr 2025

https://github.com/nosferalatu/SimpleGPUHashTable

A simple GPU hash table implemented in CUDA using lock free techniques

cuda cuda-programming data-structures gpu gpu-cuda-programs

Last synced: 14 Nov 2024

https://github.com/glotzerlab/hoomd-blue

Molecular dynamics and Monte Carlo soft matter simulation on GPUs.

conda-forge cuda docker gpu hard-particle hoomd-blue molecular-dynamics monte-carlo-simulation particle-system python simulation singularity

Last synced: 14 Apr 2025

https://github.com/uncomplicate/bayadera

High-performance Bayesian Data Analysis on the GPU in Clojure

bayesian bayesian-data-analysis bayesian-inference clojure clojure-library cuda gpu gpu-acceleration gpu-computing high-performance-computing machine-learning markov-chain-monte-carlo mcmc opencl statistics

Last synced: 09 Apr 2025

https://github.com/rapidsai/cuvs

cuVS - a library for vector search and clustering on the GPU

anns clustering cuda distance gpu information-retrieval llm machine-learning nearest-neighbors neighborhood-methods similarity-search sparse statistics vector-search vector-similarity vector-store

Last synced: 13 Apr 2025

https://github.com/LambdaLabsML/distributed-training-guide

Best practices & guides on how to write distributed pytorch training code

cluster cuda deepspeed distributed-training fsdp gpu gpu-cluster kuberentes lambdalabs mpi nccl pytorch sharding slurm

Last synced: 08 Mar 2025

https://github.com/nersc/timemory

Modular C++ Toolkit for Performance Analysis and Logging. Profiling API and Tools for C, C++, CUDA, Fortran, and Python. The C++ template API is essentially a framework for creating tools: it is designed to provide a unifying interface for recording various performance measurements alongside data logging and interfaces to other tools.

analysis c cplusplus cpp cross-language cross-platform cuda cupti gotcha hardware-counters instrumentation-api memory-measurements modular-design mpi papi performance performance-measurement python roofline

Last synced: 23 Jan 2025

https://github.com/alpaka-group/alpaka

Abstraction Library for Parallel Kernel Acceleration :llama:

cpp cpp17 cuda gpu header-only heterogeneous-parallel-programming hip hpc openacc openmp rocm tbb

Last synced: 07 Apr 2025

https://github.com/NERSC/timemory

Modular C++ Toolkit for Performance Analysis and Logging. Profiling API and Tools for C, C++, CUDA, Fortran, and Python. The C++ template API is essentially a framework for creating tools: it is designed to provide a unifying interface for recording various performance measurements alongside data logging and interfaces to other tools.

analysis c cplusplus cpp cross-language cross-platform cuda cupti gotcha hardware-counters instrumentation-api memory-measurements modular-design mpi papi performance performance-measurement python roofline

Last synced: 14 Nov 2024