CUDA | Ecosyste.ms: Awesome

https://github.com/OpenPPL/ppq

PPL Quantization Tool (PPQ) is a powerful offline neural network quantization tool.

caffe cuda deep-learning neural-network onnx open-source pytorch quantization

Last synced: 28 Oct 2024

https://github.com/sniklaus/3d-ken-burns

An implementation of 3D Ken Burns Effect from a Single Image using PyTorch

cuda cupy deep-learning python pytorch

Last synced: 25 Jan 2025

https://github.com/openppl-public/ppq

PPL Quantization Tool (PPQ) is a powerful offline neural network quantization tool.

caffe cuda deep-learning neural-network onnx open-source pytorch quantization

Last synced: 05 Oct 2024

https://github.com/kevmo314/scuda

SCUDA is a GPU over IP bridge allowing GPUs on remote machines to be attached to CPU-only machines.

cublas cuda cudnn gpu mlops networking nvml remote-access

Last synced: 25 Jan 2025

https://github.com/wangzhaode/mnn-llm

LLM deployment project based on MNN.

baichuan2-7b chatglm-6b chatglm2-6b codegeex2-6b cpp cuda mnn opencl qwen-7b

Last synced: 23 Jan 2025

https://github.com/pytorch/ao

PyTorch native quantization and sparsity for training and inference

brrr cuda dtypes float8 inference llama mx offloading optimizer pytorch quantization sparsity training transformer

Last synced: 23 Jan 2025

https://github.com/m4rs-mt/ilgpu

ILGPU JIT Compiler for high-performance .NET GPU programs

amd cil compiler cpu cuda dotnet gpgpu gpgpu-computing gpu ilgpu intel jit kernels msil nvidia opencl parallel ptx

Last synced: 23 Jan 2025

https://github.com/nvidia/cccl

CUDA Core Compute Libraries

accelerated-computing cpp cpp-programming cuda cuda-cpp cuda-kernels cuda-library cuda-programming gpu gpu-acceleration gpu-computing gpu-programming hpc modern-cpp nvidia nvidia-gpu parallel-algorithm parallel-computing parallel-programming

Last synced: 24 Jan 2025

https://github.com/godweiyang/nn-cuda-example

Several simple examples for popular neural network toolkits calling custom CUDA operators.

cpp cuda neural-network python pytorch tensorflow

Last synced: 27 Jan 2025

https://github.com/m4rs-mt/ILGPU

ILGPU JIT Compiler for high-performance .NET GPU programs

amd cil compiler cpu cuda dotnet gpgpu gpgpu-computing gpu ilgpu intel jit kernels msil nvidia opencl parallel ptx

Last synced: 11 Nov 2024

https://github.com/AlexiaJM/Deep-learning-with-cats

Deep learning with cats (^._.^)

cat cuda deep-learning gan picture

Last synced: 27 Nov 2024

https://github.com/alexiajm/deep-learning-with-cats

Deep learning with cats (^._.^)

cat cuda deep-learning gan picture

Last synced: 27 Jan 2025

https://github.com/flashinfer-ai/flashinfer

FlashInfer: Kernel Library for LLM Serving

cuda flash-attention gpu jit large-large-models llm-inference pytorch

Last synced: 23 Jan 2025

https://github.com/koide3/fast_gicp

A collection of GICP-based fast point cloud registration algorithms

cpp cuda gicp gpu icp multithreading pcl point-cloud python registration scan-matching vgicp

Last synced: 23 Jan 2025

https://github.com/FeiYull/TensorRT-Alpha

🔥🔥🔥TensorRT for YOLOv8, YOLOv8-Pose, YOLOv8-Seg, YOLOv8-Cls, YOLOv7, YOLOv6, YOLOv5, YOLONAS......🚀🚀🚀CUDA IS ALL YOU NEED.🍎🍎🍎

cuda efficientdet libfacedetection rt-detr tensorrt u2net yolonas yolor yolov3 yolov4 yolov5 yolov6 yolov7 yolov8 yolov8-pose yolov8-seg yolox

Last synced: 09 Nov 2024

https://github.com/DefTruth/CUDA-Learn-Notes

🎉 Modern CUDA Learn Notes with PyTorch: fp32/tf32, fp16/bf16, fp8/int8, flash_attn, rope, sgemm, sgemv, warp/block reduce, dot, elementwise, softmax, layernorm, rmsnorm.

block-reduce cuda cuda-programming elementwise flash-attention flash-attention-2 flash-attention-3 gemm gemv layernorm pytorch rmsnorm softmax triton warp-reduce

Last synced: 27 Oct 2024

https://github.com/godweiyang/NN-CUDA-Example

Several simple examples for popular neural network toolkits calling custom CUDA operators.

cpp cuda neural-network python pytorch tensorflow

Last synced: 28 Oct 2024

https://github.com/feiyull/tensorrt-alpha

🔥🔥🔥TensorRT for YOLOv8, YOLOv8-Pose, YOLOv8-Seg, YOLOv8-Cls, YOLOv7, YOLOv6, YOLOv5, YOLONAS......🚀🚀🚀CUDA IS ALL YOU NEED.🍎🍎🍎

cuda efficientdet libfacedetection rt-detr tensorrt u2net yolonas yolor yolov3 yolov4 yolov5 yolov6 yolov7 yolov8 yolov8-pose yolov8-seg yolox

Last synced: 24 Jan 2025

https://github.com/kwea123/ngp_pl

Instant-ngp in pytorch+cuda trained with pytorch-lightning (high quality with high speed, with only a few lines of legible code)

3d-reconstruction cuda instant-ngp nerf novel-view-synthesis pytorch pytorch-lightning

Last synced: 26 Jan 2025

https://github.com/andyzeng/tsdf-fusion-python

Python code to fuse multiple RGB-D images into a TSDF voxel volume.

3d 3d-deep-learning 3d-reconstruction artificial-intelligence cuda depth-camera kinect-fusion rgbd tsdf vision volumetric-data

Last synced: 26 Jan 2025

https://github.com/NVIDIA/cccl

CUDA Core Compute Libraries

accelerated-computing cpp cpp-programming cuda cuda-cpp cuda-kernels cuda-library cuda-programming gpu gpu-acceleration gpu-computing gpu-programming hpc modern-cpp nvidia nvidia-gpu parallel-algorithm parallel-computing parallel-programming

Last synced: 19 Nov 2024

https://github.com/marian-nmt/marian

Fast Neural Machine Translation in C++

cuda fast gpu neural-machine-translation

Last synced: 25 Jan 2025

https://github.com/deepgraphlearning/graphvite

GraphVite: A General and High-performance Graph Embedding System

cuda data-visualization gpu knowledge-graph machine-learning network-embedding representation-learning

Last synced: 24 Jan 2025

https://github.com/aphrodite-engine/aphrodite-engine

Large-scale LLM inference engine

api-rest cuda inference-engine inferentia intel lora machine-learning rocm speculative-decoding tpu

Last synced: 22 Jan 2025

https://github.com/juliagpu/cuda.jl

CUDA programming in Julia.

cuda gpu hacktoberfest julia

Last synced: 22 Jan 2025

https://github.com/DeepGraphLearning/graphvite

GraphVite: A General and High-performance Graph Embedding System

cuda data-visualization gpu knowledge-graph machine-learning network-embedding representation-learning

Last synced: 07 Nov 2024

https://github.com/JuliaGPU/CUDA.jl

CUDA programming in Julia.

cuda gpu hacktoberfest julia

Last synced: 19 Nov 2024

https://github.com/chengzeyi/stable-fast

Best inference performance optimization framework for HuggingFace Diffusers on NVIDIA GPUs.

cuda deeplearnng diffusers inference-engines openai-triton performance-optimizations pytorch stable-diffusion stable-video-diffusion torch

Last synced: 23 Jan 2025

https://github.com/beehive-lab/tornadovm

TornadoVM: A practical and efficient heterogeneous programming framework for managed languages

ai cuda gpu-acceleration gpu-computing gpus graalvm java levelzero multi-core opencl parallel-computing parallel-programming spirv

Last synced: 23 Jan 2025

https://github.com/nvidia/matx

An efficient C++17 GPU numerical computing library with Python-like syntax

cuda gpgpu gpu gpu-computing hpc

Last synced: 24 Jan 2025

https://github.com/pygmalionai/aphrodite-engine

Large-scale LLM inference engine

api-rest cuda inference-engine inferentia intel lora machine-learning rocm speculative-decoding tpu

Last synced: 02 Jan 2025

https://github.com/NVIDIA/MatX

An efficient C++17 GPU numerical computing library with Python-like syntax

cuda gpgpu gpu gpu-computing hpc

Last synced: 30 Oct 2024

https://github.com/beehive-lab/TornadoVM

TornadoVM: A practical and efficient heterogeneous programming framework for managed languages

ai cuda gpu-acceleration gpu-computing gpus graalvm java levelzero multi-core opencl spirv

Last synced: 05 Nov 2024

https://mratsim.github.io/Arraymancer/

A fast, ergonomic and portable tensor library in Nim with a deep learning focus for CPU, GPU and embedded devices via OpenMP, CUDA and OpenCL backends

autograd automatic-differentiation cuda cudnn deep-learning gpgpu gpu-computing high-performance-computing iot linear-algebra machine-learning matrix-library multidimensional-arrays ndarray neural-networks nim opencl openmp parallel-computing tensor

Last synced: 14 Nov 2024

https://github.com/mratsim/arraymancer

A fast, ergonomic and portable tensor library in Nim with a deep learning focus for CPU, GPU and embedded devices via OpenMP, CUDA and OpenCL backends

autograd automatic-differentiation cuda cudnn deep-learning gpgpu gpu-computing high-performance-computing iot linear-algebra machine-learning matrix-library multidimensional-arrays ndarray neural-networks nim opencl openmp parallel-computing tensor

Last synced: 25 Jan 2025

https://github.com/stotko/stdgpu

stdgpu: Efficient STL-like Data Structures on the GPU

cpp cpp17 cpp20 cuda data-structures gpgpu gpu gpu-acceleration gpu-computing hip modern-cpp openmp rocm stl stl-containers stl-like

Last synced: 24 Jan 2025

https://github.com/mratsim/Arraymancer

A fast, ergonomic and portable tensor library in Nim with a deep learning focus for CPU, GPU and embedded devices via OpenMP, CUDA and OpenCL backends

autograd automatic-differentiation cuda cudnn deep-learning gpgpu gpu-computing high-performance-computing iot linear-algebra machine-learning matrix-library multidimensional-arrays ndarray neural-networks nim opencl openmp parallel-computing tensor

Last synced: 08 Nov 2024

https://github.com/luxcorerender/luxcore

LuxCore source repository

3d-graphics bidirectional-path-tracing cuda gpu-computing luxcorerender luxrender opencl optix path-tracing pathtracer ray ray-tracer ray-tracing raytracer raytracing rtx visualization

Last synced: 23 Jan 2025

https://github.com/withcatai/node-llama-cpp

Run AI models locally on your machine with node.js bindings for llama.cpp. Enforce a JSON schema on the model output on the generation level

ai bindings catai cmake cmake-js cuda embedding function-calling gguf gpu grammar json-schema llama llama-cpp llm metal nodejs prebuilt-binaries self-hosted vulkan

Last synced: 22 Jan 2025

https://github.com/fff-rs/juice

The Hacker's Machine Learning Engine

agnostic coaster cuda extinsible framework hacktoberfest juice machine-learning opencl rust

Last synced: 23 Jan 2025

https://github.com/PygmalionAI/aphrodite-engine

Large-scale LLM inference engine

api-rest cuda inference-engine inferentia intel lora machine-learning rocm speculative-decoding tpu

Last synced: 03 Nov 2024

https://github.com/uncomplicate/neanderthal

Fast Clojure Matrix Library

api clojure clojure-library cuda gpgpu gpu gpu-computing high-performance-computing java matrix matrix-calculations matrix-factorization matrix-functions matrix-multiplication opencl vectorization

Last synced: 22 Jan 2025

https://github.com/inducer/pyopencl

OpenCL integration for Python, plus shiny features

amd array cuda gpu heterogeneous-parallel-programming multidimensional-arrays nvidia opencl opengl parallel-algorithm parallel-computing performance prefix-sum pyopencl python reduction scientific-computing shared-memory sorting

Last synced: 21 Jan 2025

https://github.com/markus-perl/ffmpeg-build-script

The FFmpeg build script provides an easy way to build a static FFmpeg on OSX and Linux with non-free codecs included.

apple-m1-silicon av1 cuda debian fdk-aac ffmpeg ffmpeg-installer ffmpeg-linux ffmpeg-mac h264 h265 mp3 mp3-to-pcm ogg osx theora webm webm-conversion x264 x265

Last synced: 24 Jan 2025

https://github.com/BBuf/how-to-optim-algorithm-in-cuda

How to optimize some algorithms in CUDA.

cuda llm

Last synced: 27 Oct 2024

https://github.com/sniklaus/sepconv-slomo

An implementation of Video Frame Interpolation via Adaptive Separable Convolution using PyTorch

cuda cupy deep-learning python pytorch

Last synced: 26 Jan 2025

https://github.com/gunrock/gunrock

Programmable CUDA/C++ GPU Graph Analytics

algorithm algorithms cpp cuda cxx essentials gnn gpu graph graph-algorithms graph-analytics graph-engine graph-neural-networks graph-primitives graph-processing gunrock hpc parallel-computing sparse-matrix

Last synced: 24 Jan 2025

https://github.com/lebedov/scikit-cuda

Python interface to GPU-powered libraries

blas cublas cuda cufft cusolver gpu lapack numerical pycuda python

Last synced: 24 Jan 2025

https://github.com/oneapi-src/oneapi-samples

Samples for Intel® oneAPI Toolkits

ai cpu cuda fortran fpga gpu jupyter oneapi oneccl onedal onednn onedpl onemkl onetbb python pytorch rendering scikit-learn sycl tensorflow

Last synced: 23 Jan 2025

https://github.com/anibali/docker-pytorch

A Docker image for PyTorch

cuda docker docker-image pytorch

Last synced: 26 Jan 2025

https://github.com/mrnerf/gaussian-splatting-cuda

3D Gaussian Splatting, reimagined: Unleashing unmatched speed with C++ and CUDA from the ground up!

computer-graphics computer-vision cuda gaussian-splatting nerf optimization

Last synced: 24 Jan 2025

https://github.com/neka-nat/cupoch

Robotics with GPU computing

collision-detection cuda distance-transform gpgpu gpu jetson occupancy-grid-map odometry pathfinding point-cloud pybind11 python registration robotics ros triangle-mesh visual-odometry voxel

Last synced: 23 Jan 2025

https://github.com/mp3guy/kintinuous

Real-time large scale dense visual SLAM system

cuda reconstruction slam

Last synced: 27 Jan 2025

https://github.com/mp3guy/Kintinuous

Real-time large scale dense visual SLAM system

cuda reconstruction slam

Last synced: 07 Nov 2024

https://github.com/acceleratehs/accelerate

Embedded language for high-performance array computations

accelerate cuda gpu gpu-computing hacktoberfest haskell llvm parallel-computing

Last synced: 22 Jan 2025

https://github.com/MrNeRF/gaussian-splatting-cuda

3D Gaussian Splatting, reimagined: Unleashing unmatched speed with C++ and CUDA from the ground up!

computer-graphics computer-vision cuda gaussian-splatting nerf optimization

Last synced: 07 Nov 2024

https://github.com/oneapi-src/oneAPI-samples

Samples for Intel® oneAPI Toolkits

ai cpu cuda fortran fpga gpu jupyter oneapi oneccl onedal onednn onedpl onemkl onetbb python pytorch rendering scikit-learn sycl tensorflow

Last synced: 27 Oct 2024

https://github.com/AccelerateHS/accelerate

Embedded language for high-performance array computations

accelerate cuda gpu gpu-computing hacktoberfest haskell llvm parallel-computing

Last synced: 18 Nov 2024

https://github.com/mind/wheels

Performance-optimized wheels for TensorFlow (SSE, AVX, FMA, XLA, MPI)

ai avx avx2 cuda fma gpu machine-learning ml optimization sse41 sse42 tensorflow wheel

Last synced: 24 Jan 2025

https://github.com/jgbit/vuda

VUDA is a header-only library based on Vulkan that provides a CUDA Runtime API interface for writing GPU-accelerated applications.

cuda vuda vulkan

Last synced: 27 Jan 2025

https://github.com/cyclenerd/ethereum_nvidia_miner

💰 USB flash drive ISO image for Ethereum, Zcash and Monero mining with NVIDIA graphics cards and Ubuntu GNU/Linux (headless)

cuda ethereum ethereum-mining ethminer graphics-card iso iso-image linux mining monero monero-mining nvidia nvidia-card nvidia-gpu nvidia-gpus nvidia-smi ubuntu ubuntu1604 zcash zcash-mining

Last synced: 19 Jan 2025

https://github.com/babitmf/bmf

Cross-platform, customizable multimedia/video processing framework. With strong GPU acceleration, a heterogeneous design, multi-language support, ease of use, multi-framework compatibility, and high performance, the framework is ideal for transcoding, AI inference, algorithm integration, live video streaming, and more.

ai arm bmf bytedance cpp cross-platform cuda ffmpeg gpu heterogeneous live-video mediacodec multimedia numpy nvidia opencv python tensorrt transcode x86-64

Last synced: 23 Jan 2025

https://github.com/Cyclenerd/ethereum_nvidia_miner

💰 USB flash drive ISO image for Ethereum, Zcash and Monero mining with NVIDIA graphics cards and Ubuntu GNU/Linux (headless)

cuda ethereum ethereum-mining ethminer graphics-card iso iso-image linux mining monero monero-mining nvidia nvidia-card nvidia-gpu nvidia-gpus nvidia-smi ubuntu ubuntu1604 zcash zcash-mining

Last synced: 19 Nov 2024

https://github.com/thu-ml/sageattention

Quantized Attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.

attention cuda inference-acceleration llm quantization triton video-generation

Last synced: 24 Jan 2025

https://github.com/e-ago/bitcracker

BitCracker is the first open source password cracking tool for memory units encrypted with BitLocker

attack bitcracker bitlocker cracking cryptography cuda decryption-algorithm gpgpu gpu hash john-the-ripper microsoft opencl password-cracker passwords windows

Last synced: 27 Jan 2025

https://github.com/Celebrandil/CudaSift

A CUDA implementation of SIFT for NVidia GPUs (1.2 ms on a GTX 1060)

cuda gpu nvidia sift vision

Last synced: 13 Nov 2024

https://github.com/tracel-ai/cubecl

Multi-platform high-performance compute language extension for Rust.

cuda gpgpu gpu jit linalg rust webgpu

Last synced: 25 Jan 2025

https://github.com/zhihu/zhilight

A highly optimized LLM inference acceleration engine for Llama and its variants.

cpm cuda gpt inference-engine llama llm llm-serving minicpm pytorch qwen

Last synced: 25 Jan 2025

https://github.com/arrayfire/arrayfire-rust

Rust wrapper for ArrayFire

arrayfire cuda gpgpu gpu hpc opencl rust rust-bindings

Last synced: 22 Jan 2025

https://github.com/src-d/kmcuda

Large scale K-means and K-nn implementation on NVIDIA GPU / CUDA

afk-mc2 cuda hacktoberfest kmeans knn-search machine-learning python yinyang

Last synced: 25 Jan 2025

https://github.com/luisagroup/luisacompute

High-Performance Rendering Framework on Stream Architectures

cpu cross-platform cuda directx dsl dxr gpu graphics high-performance ispc llvm metal optix raytracing rendering rtx siggraph-asia-2022

Last synced: 23 Jan 2025

https://github.com/BabitMF/bmf

Cross-platform, customizable multimedia/video processing framework. With strong GPU acceleration, a heterogeneous design, multi-language support, ease of use, multi-framework compatibility, and high performance, the framework is ideal for transcoding, AI inference, algorithm integration, live video streaming, and more.

ai arm bmf bytedance cpp cross-platform cuda ffmpeg gpu heterogeneous live-video mediacodec multimedia numpy nvidia opencv python tensorrt transcode x86-64

Last synced: 07 Nov 2024

https://github.com/deadsix27/waifu2x-converter-cpp

Improved fork of Waifu2X C++ using OpenCL and OpenCV

2x amd cpp cuda cv intel nvidia opencl opencv upscale upscaler w2x waifu waifu2x waifu2x-converter-cpp

Last synced: 18 Jan 2025

https://github.com/DeadSix27/waifu2x-converter-cpp

Improved fork of Waifu2X C++ using OpenCL and OpenCV

2x amd cpp cuda cv intel nvidia opencl opencv upscale upscaler w2x waifu waifu2x waifu2x-converter-cpp

Last synced: 28 Oct 2024

https://github.com/eyalroz/cuda-api-wrappers

Thin, unified, C++-flavored wrappers for the CUDA APIs

api-wrapper cuda cuda-api-wrappers cuda-device cuda-driver cuda-driver-api cuda-programming cuda-runtime-api cuda-toolkit gpgpu gpgpu-computing gpu gpu-computing gpu-memory modern-cpp

Last synced: 09 Nov 2024

https://github.com/jasmcaus/caer

High-performance Vision library in Python. Scale your research, not boilerplate.

ai artificial-intelligence augmentation caer computer-vision cuda data-science deep-learning gpu image-classification image-processing image-segmentation machine-learning neural-network opencv python segmentation type-checking video-processing vision

Last synced: 25 Jan 2025

https://github.com/bheisler/rustacuda

Rusty wrapper for the CUDA Driver API

cuda cuda-api gpu rust

Last synced: 24 Jan 2025

https://github.com/bheisler/RustaCUDA

Rusty wrapper for the CUDA Driver API

cuda cuda-api gpu rust

Last synced: 05 Nov 2024

https://github.com/qengineering/jetson-nano-ubuntu-20-image

Jetson Nano with Ubuntu 20.04 image

cuda deep-learning jetson-nano opencv pytorch sd-card-image team-viewer tensorflow tensorrt torch torchvision ubuntu2004

Last synced: 24 Jan 2025

https://github.com/ddemidov/amgcl

C++ library for solving large sparse linear systems with algebraic multigrid method

amg c-plus-plus cpp cuda gpgpu linear-solvers mpi multigrid opencl openmp scientific-computing sparse-linear-systems

Last synced: 25 Jan 2025

https://github.com/rapidsai/raft

RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing high performance applications.

anns building-blocks clustering cuda distance gpu information-retrieval linear-algebra llm machine-learning nearest-neighbors neighborhood-methods primitives random-sampling solvers sparse statistics vector-search vector-similarity vector-store

Last synced: 23 Jan 2025

https://github.com/andyzeng/tsdf-fusion

Fuse multiple depth frames into a TSDF voxel volume.

3d 3d-deep-learning 3d-reconstruction artificial-intelligence cuda depth-camera kinect-fusion rgbd tsdf vision volumetric-data

Last synced: 26 Jan 2025

https://github.com/QPT-Family/QPT

[In closed beta] QPT - a Python-to-EXE tool (Python packaging) dedicated to helping open-source projects better reach the internet world.

cuda deep-learning dml gpu noavx paddlepaddle pypi python qpt

Last synced: 18 Dec 2024

https://github.com/qpt-family/qpt

[In closed beta] QPT - a Python-to-EXE tool (Python packaging) dedicated to helping open-source projects better reach the internet world.

cuda deep-learning dml gpu noavx paddlepaddle pypi python qpt

Last synced: 27 Jan 2025

https://github.com/mryab/efficient-dl-systems

Efficient Deep Learning Systems course materials (HSE, YSDA)

cuda deep-learning distributed-training efficient-deep-learning machine-learning ml-infrastructure mlops pytorch

Last synced: 27 Jan 2025

https://github.com/LuisaGroup/LuisaCompute

High-Performance Rendering Framework on Stream Architectures

cpu cross-platform cuda directx dsl dxr gpu graphics high-performance ispc llvm metal optix raytracing rendering rtx siggraph-asia-2022

Last synced: 20 Nov 2024

https://github.com/efeslab/nanoflow

A throughput-oriented high-performance serving framework for LLMs

cuda inference llama2 llm llm-serving model-serving

Last synced: 25 Jan 2025

https://github.com/xmrig/xmrig-nvidia

Monero (XMR) NVIDIA miner

aeon cryptonight cuda electroneum gpu-mining monero nvidia-miner sumokoin xmr xmrig

Last synced: 22 Jan 2025

https://github.com/ddemidov/vexcl

VexCL is a C++ vector expression template library for OpenCL/CUDA/OpenMP

c-plus-plus cpp11 cuda gpgpu opencl scientific-computing

Last synced: 24 Jan 2025

https://github.com/cresset-template/cresset

Template repository to build PyTorch projects from source on any version of PyTorch/CUDA/cuDNN.

build cuda deep-learning deep-learning-tutorial docker docker-compose machine-learning makefile mlops mlops-template python pytorch source source-python template template-repository wheel

Last synced: 05 Nov 2024

https://github.com/xtra-computing/thundergbm

ThunderGBM: Fast GBDTs and Random Forests on GPUs

cuda gbdt gpu machine-learning random-forest

Last synced: 22 Jan 2025

https://github.com/mp3guy/icpcuda

Super fast implementation of ICP in CUDA for devices with compute capability 3.5 or higher

cuda icp

Last synced: 25 Jan 2025

https://github.com/Xtra-Computing/thundergbm

ThunderGBM: Fast GBDTs and Random Forests on GPUs

cuda gbdt gpu machine-learning random-forest

Last synced: 07 Nov 2024

https://github.com/santosh-gupta/speedtorch

Library for faster pinned CPU <-> GPU transfer in PyTorch

cpu-gpu-transfer cpu-pinned-tensors cuda cuda-tensors cuda-variables cupy data-transfer embeddings embeddings-trained gpu gpu-transfer machine-learning natural-language-processing nlp pinned-cpu-tensors pytorch pytorch-tensors pytorch-variables sparse sparse-modeling

Last synced: 22 Jan 2025

https://github.com/coreylowman/cudarc

Safe Rust wrapper around the CUDA toolkit

cublas cuda cuda-kernels cuda-programming cuda-toolkit cudnn curand gpu gpu-acceleration nccl nvrtc rust

Last synced: 23 Jan 2025

https://github.com/shibatch/sleef

SIMD Library for Evaluating Elementary Functions, vectorized libm and DFT

aarch64 android arm avx avx512 cuda elementary-functions fft ios math-library neon powerpc quadruple-precision s390x simd sse2 sve vector-math vectorization vsx

Last synced: 24 Jan 2025

https://github.com/thu-ml/SageAttention

Quantized Attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.

attention cuda inference-acceleration llm quantization triton video-generation