CUDA | Ecosyste.ms: Awesome

https://github.com/uxlfoundation/onemath

oneAPI Math Library (oneMath)

api blas cpu cuda dpcpp gpu hpc intel math-libraries oneapi onemkl parallel-computing parallel-programming performance rng

Last synced: 26 Jan 2025

https://github.com/oneapi-src/onemkl

oneAPI Math Kernel Library (oneMKL) Interfaces

api blas cpu cuda dpcpp gpu hpc intel math-libraries oneapi onemkl parallel-computing parallel-programming performance rng

Last synced: 01 Dec 2024

https://github.com/oneapi-src/oneMKL

oneAPI Math Kernel Library (oneMKL) Interfaces

api blas cpu cuda dpcpp gpu hpc intel math-libraries oneapi onemkl parallel-computing parallel-programming performance rng

Last synced: 30 Oct 2024

https://github.com/cudamat/cudamat

Python module for performing basic dense linear algebra computations on the GPU using CUDA.

cuda linear-algebra python

Last synced: 25 Oct 2024

https://github.com/beam-cloud/beta9

Run serverless GPU workloads with fast cold starts on bare-metal servers, anywhere in the world

autoscaler cloudrun cuda developer-productivity distributed-computing faas fine-tuning functions-as-a-service generative-ai gpu lambda large-language-models llm llm-inference ml-platform paas self-hosted serverless serverless-containers

Last synced: 26 Jan 2025

https://github.com/inducer/loopy

A code generator for array-based code on CPUs and GPUs

array code-generation code-generator code-optimization code-transformation cuda ispc loop-optimization multidimensional-arrays opencl performance performance-analysis prefix-sum python reduction scan scientific-computing

Last synced: 23 Jan 2025

https://github.com/tpoisonooo/how-to-optimize-gemm

row-major matmul optimization

arm64 armv7 cuda cuda-kernel gemm-optimization int4 ptx vulkan

Last synced: 25 Jan 2025

https://github.com/hpcaitech/fastfold

Optimizing AlphaFold Training and Inference on GPU Clusters

alphafold2 cuda evoformer gpu habana-gaudi parallelism protein-folding protein-structure pytorch

Last synced: 26 Jan 2025

https://github.com/cern/tigre

TIGRE: Tomographic Iterative GPU-based Reconstruction Toolbox

cuda gpus image-reconstruction matlab python tigre tomography toolbox x-ray

Last synced: 17 Jan 2025

https://github.com/hpcaitech/FastFold

Optimizing AlphaFold Training and Inference on GPU Clusters

alphafold2 cuda evoformer gpu habana-gaudi parallelism protein-folding protein-structure pytorch

Last synced: 12 Nov 2024

https://github.com/gprmax/gprmax

gprMax is open source software that simulates electromagnetic wave propagation using the Finite-Difference Time-Domain (FDTD) method for numerical modelling of Ground Penetrating Radar (GPR)

antenna cuda electromagnetic fdtd gpr gpu modelling nvidia python simulation soil

Last synced: 25 Jan 2025

https://github.com/gprMax/gprMax

gprMax is open source software that simulates electromagnetic wave propagation using the Finite-Difference Time-Domain (FDTD) method for numerical modelling of Ground Penetrating Radar (GPR)

antenna cuda electromagnetic fdtd gpr gpu modelling nvidia python simulation soil

Last synced: 17 Nov 2024

https://github.com/insight-platform/Savant

Python Computer Vision & Video Analytics Framework With Batteries Included

computer-vision cuda deep-learning deepstream edge-computing inference-engine instance-segmentation machine-learning nvidia nvidia-deepstream-sdk object-detection opencv peoplenet tensorrt video yolo yolov5-face yolov8 yolov8-face

Last synced: 09 Nov 2024

https://github.com/laugh12321/TensorRT-YOLO

🚀 Your YOLO Deployment Powerhouse. With the synergy of TensorRT Plugins, CUDA Kernels, and CUDA Graphs, experience lightning-fast inference speeds.

cuda cuda-graph cuda-kernels cuda-programming detection onnx ppyoloe tensorrt yolov10 yolov3 yolov5 yolov6 yolov7 yolov8 yolov9

Last synced: 27 Oct 2024

https://github.com/sergio0694/neuralnetwork.net

A TensorFlow-inspired neural network library built from scratch in C# 7.3 for .NET Standard 2.0, with GPU support through cuDNN

ai backpropagation-algorithm classification-algorithims cnn convolutional-neural-networks csharp cuda gpu-acceleration gradient-descent machine-learning net-framework netstandard neural-network supervised-learning visual-studio

Last synced: 25 Jan 2025

https://github.com/stochasticai/x-stable-diffusion

Real-time inference for Stable Diffusion - 0.88s latency. Covers AITemplate, nvFuser, TensorRT, FlashAttention. Join our Discord community: https://discord.com/invite/TgHXuSJEk6

aitemplate automl cuda docker inference notebook nvfuser onnx onnxruntime pytorch stable-diffusion tensorrt

Last synced: 26 Jan 2025

https://github.com/tencent/forward

A library for high performance deep learning inference on NVIDIA GPUs.

cuda deep-learning forward gpu inference inference-engine keras neural-network onnx pytorch tensorflow tensorrt

Last synced: 26 Jan 2025

https://github.com/nvidia/nvbench

CUDA Kernel Benchmarking Library

benchmark cuda cuda-kernels gpu kernel-benchmark nvidia performance

Last synced: 25 Jan 2025

https://github.com/Sergio0694/NeuralNetwork.NET

A TensorFlow-inspired neural network library built from scratch in C# 7.3 for .NET Standard 2.0, with GPU support through cuDNN

ai backpropagation-algorithm classification-algorithims cnn convolutional-neural-networks csharp cuda gpu-acceleration gradient-descent machine-learning net-framework netstandard neural-network supervised-learning visual-studio

Last synced: 02 Nov 2024

https://github.com/Tencent/Forward

A library for high performance deep learning inference on NVIDIA GPUs.

cuda deep-learning forward gpu inference inference-engine keras neural-network onnx pytorch tensorflow tensorrt

Last synced: 09 Nov 2024

https://github.com/zhihu/cubert

Fast implementation of BERT inference directly on NVIDIA (CUDA, CUBLAS) and Intel MKL

bert cuda deep-learning inference mkl predict tensorflow transformer

Last synced: 27 Jan 2025

https://github.com/cvxgrp/pymde

Minimum-distortion embedding with PyTorch

cuda dimensionality-reduction embedding feature-vectors gpu graph-embedding machine-learning pytorch visualization

Last synced: 25 Jan 2025

https://github.com/kwea123/gaussian_splatting_notes

A detailed explanation of the formulae behind Gaussian splatting

cuda gaussian-splatting

Last synced: 26 Jan 2025

https://github.com/nvidia/jitify

A single-header C++ library for simplifying the use of CUDA Runtime Compilation (NVRTC).

cpp cuda jit-compilation nvrtc runtime-compilation single-header

Last synced: 25 Jan 2025

https://github.com/zhihu/cuBERT

Fast implementation of BERT inference directly on NVIDIA (CUDA, CUBLAS) and Intel MKL

bert cuda deep-learning inference mkl predict tensorflow transformer

Last synced: 02 Nov 2024

https://github.com/NVIDIA/nvbench

CUDA Kernel Benchmarking Library

benchmark cuda cuda-kernels gpu kernel-benchmark nvidia performance

Last synced: 19 Nov 2024

https://github.com/gridhead/nvidia-auto-installer-for-fedora-linux

A CLI tool that lets you easily install proprietary NVIDIA drivers and much more on Fedora Linux (32 or above and Rawhide)

cuda fedora hacktoberfest nvidia optimus rpmfusion

Last synced: 25 Jan 2025

https://github.com/nvidia/cucollections

cpp cpp17 cuda datastructures gpu hashmap hashset hashtable

Last synced: 25 Jan 2025

https://github.com/openhackathons-org/gpubootcamp

This repository consists of GPU bootcamp material for HPC and AI

ai4hpc cuda data-science deep-learning deepstream gpu hpc machine-learning mpi openacc openmp rapidsai

Last synced: 30 Oct 2024

https://github.com/Kaixhin/dockerfiles

Compilation of Dockerfiles with automated builds enabled on the Docker Registry

cuda deep-learning docker dockerfiles machine-learning vnc

Last synced: 28 Oct 2024

https://github.com/spcl/dace

DaCe - Data Centric Parallel Programming

cuda fpga high-level-synthesis high-performance-computing programming-language vivado-hls

Last synced: 23 Jan 2025

https://github.com/mariosieg/magnetron

(WIP) A small but powerful, homemade PyTorch from scratch.

artificial-intelligence cpp cuda high-performance-computing machine-learning neuronal-network python pytorch research-project tensorflow tiny

Last synced: 20 Jan 2025

https://github.com/gorgonia/cu

package cu provides an idiomatic interface to the CUDA Driver API.

cuda cuda-driver-api go golang

Last synced: 26 Jan 2025

https://github.com/brucefan1983/GPUMD

Graphics Processing Units Molecular Dynamics

cuda gpu gpumd heat-transport high-performance-computing machine-learning machine-learning-potential molecular-dynamics molecular-dynamics-simulation natural-evolution-strategies neural-network neuroevolution phonon physics-simulation simulation

Last synced: 13 Nov 2024

https://github.com/salesforce/warp-drive

Extremely Fast End-to-End Deep Multi-Agent Reinforcement Learning Framework on a GPU (JMLR 2022)

cuda deep-learning gpu high-throughput multiagent-reinforcement-learning numba pytorch reinforcement-learning

Last synced: 24 Jan 2025

https://github.com/h2oai/h2o4gpu

H2Oai GPU Edition

c-plus-plus cpu cuda elastic-net glm gpu lasso machine-learning pca python r rstats svd

Last synced: 24 Jan 2025

https://github.com/mumax/3

GPU-accelerated micromagnetic simulator

cuda finite-difference-time-domain go micromagnetics scientific-computing

Last synced: 31 Oct 2024

https://github.com/sinkingsugar/nimtorch

PyTorch - Python + Nim

artificial-intelligence artificial-neural-networks cuda machine-learning nim pytorch wasm

Last synced: 30 Oct 2024

https://github.com/cnstark/pytorch-docker

Pure Pytorch Docker Images.

centos cuda deep-learning docker nvidia pytorch ubuntu

Last synced: 25 Jan 2025

https://github.com/huggingface/large_language_model_training_playbook

An open collection of implementation tips, tricks and resources for training large language models

cuda large-language-models llm nccl nlp performance python pytorch scalability troubleshooting

Last synced: 11 Nov 2024

https://github.com/termoshtt/accel

(Mirror of GitLab) GPGPU Framework for Rust

cuda gpgpu rust-lang

Last synced: 25 Jan 2025

https://github.com/petercunha/Pine

:evergreen_tree: Aimbot powered by real-time object detection with neural networks, GPU accelerated with Nvidia. Optimized for use with CS:GO.

aimbot csgo cuda darknet detection fortnite fps game-hacking hacking neural-network neural-networks nvidia object-detection opencl opencv overwatch pine python yolo yolov3

Last synced: 08 Nov 2024

https://github.com/cryinkfly/solidworks-for-linux

This is a project where I give you a way to use SOLIDWORKS on Linux!

archlinux cuda fedora international linux linuxmint manjaro nvidia opengl opensuse ubuntu wine

Last synced: 26 Jan 2025

https://github.com/MegviiRobot/MegBA

MegBA: A GPU-Based Distributed Library for Large-Scale Bundle Adjustment

bundleadjustment cuda distributed gpu-acceleration graph-optimization high-performance

Last synced: 14 Nov 2024

https://github.com/petercunha/pine

:evergreen_tree: Aimbot powered by real-time object detection with neural networks, GPU accelerated with Nvidia. Optimized for use with CS:GO.

aimbot csgo cuda darknet detection fortnite fps game-hacking hacking neural-network neural-networks nvidia object-detection opencl opencv overwatch pine python yolo yolov3

Last synced: 06 Nov 2024

https://github.com/patwie/tensorflow-cmake

TensorFlow examples in C, C++, Go and Python without bazel but with cmake and FindTensorFlow.cmake

c cmake cpp cuda deep-learning golang inference opencv tensorflow tensorflow-cc tensorflow-cmake tensorflow-examples tensorflow-gpu

Last synced: 20 Jan 2025

https://github.com/tlkh/ai-lab

All-in-one AI container for rapid prototyping

cuda data-science deep-learning docker jupyter nvidia pytorch tensorflow

Last synced: 26 Jan 2025

https://github.com/ccsb-scripps/autodock-gpu

AutoDock for GPUs and other accelerators

autodock4 cuda gpu-computing molecular-docking multicore-cpu opencl

Last synced: 25 Jan 2025

https://zielon.github.io/insta/

INSTA - Instant Volumetric Head Avatars [CVPR2023]

3dmm avatars cuda flame instant-ngp nerf neural-network volumetric-rendering

Last synced: 30 Oct 2024

https://github.com/ginkgo-project/ginkgo

Numerical linear algebra software package

cuda dpcpp gpu-computing hip hpc krylov-methods linear-algebra oneapi openmp preconditioning sparse-linear-systems spmv

Last synced: 24 Jan 2025

https://github.com/rapidsai/rmm

RAPIDS Memory Manager

cuda memory-allocation memory-management rapids

Last synced: 25 Jan 2025

https://github.com/cloudcores/cuassembler

An unofficial cuda assembler, for all generations of SASS, hopefully ：）

assembler cuda nvidia sass

Last synced: 20 Jan 2025

https://github.com/arrayfire/arrayfire-python

Python bindings for ArrayFire: A general purpose GPU library.

arrayfire cuda gpgpu gpu hpc opencl python python-bindings

Last synced: 03 Nov 2024

https://github.com/uncomplicate/deep-diamond

A fast Clojure Tensor & Deep Learning library

clojure cuda deep-learning deep-neural-networks dnnl gpu java nvidia

Last synced: 27 Jan 2025

https://github.com/alicevision/popsift

PopSift is an implementation of the SIFT algorithm in CUDA.

computer-vision cuda feature-extraction gpu image-processing sift

Last synced: 27 Jan 2025

https://github.com/toverainc/willow-inference-server

Open source, local, and self-hosted highly optimized language inference server supporting ASR/STT, TTS, and LLM across WebRTC, REST, and WS

cuda deep-learning llama llm privacy speech-recognition speech-to-text text-to-speech vicuna webrtc whisper willow

Last synced: 25 Jan 2025

https://github.com/cloudcores/CuAssembler

An unofficial cuda assembler, for all generations of SASS, hopefully ：）

assembler cuda nvidia sass

Last synced: 28 Oct 2024

https://github.com/shi-labs/natten

Neighborhood Attention Extension. Bringing attention to a neighborhood near you!

cuda neighborhood-attention pytorch

Last synced: 24 Jan 2025

https://github.com/DerryHub/BEVFormer_tensorrt

BEVFormer inference on TensorRT, including INT8 Quantization and Custom TensorRT Plugins (float/half/half2/int8).

bevformer cuda int8-inference pytorch quantization tensorrt-plugins

Last synced: 28 Oct 2024

https://github.com/JuliaGPU/CUDAnative.jl

Julia support for native CUDA programming

cuda cuda-toolkit julia julia-library

Last synced: 29 Nov 2024

https://github.com/xmrig/xmrig-cuda

NVIDIA CUDA plugin for XMRig miner

cryptonight cuda randomx xmrig

Last synced: 26 Jan 2025

https://github.com/libocca/occa

Portable and vendor neutral framework for parallel programming on heterogeneous platforms.

c cpp cuda dpcpp fortran gpgpu gpu hip hpc jit metal multithreading oneapi opencl openmp sycl

Last synced: 05 Nov 2024

https://github.com/vectorch-ai/ScaleLLM

A high-performance inference system for large language models, designed for production environments.

cuda efficiency gpu inference llama llama3 llm llm-inference model performance production serving speculative transformer

Last synced: 16 Nov 2024

https://github.com/vectorch-ai/scalellm

A high-performance inference system for large language models, designed for production environments.

cuda efficiency gpu inference llama llama3 llm llm-inference model performance production serving speculative transformer

Last synced: 24 Jan 2025

https://github.com/dfm/extending-jax

Extending JAX with custom C++ and CUDA code

cuda jax xla

Last synced: 26 Jan 2025

https://github.com/osai-ai/tensor-stream

A library for real-time video stream decoding to CUDA memory

c-plus-plus cuda python pytorch video video-processing

Last synced: 14 Nov 2024

https://github.com/luoyetx/mini-caffe

Minimal runtime core of Caffe, Forward only, GPU support and Memory efficiency.

android caffe cuda cudnn forward-only linux mini-caffe openblas windows

Last synced: 26 Oct 2024

https://github.com/nosferalatu/SimpleGPUHashTable

A simple GPU hash table implemented in CUDA using lock free techniques

cuda cuda-programming data-structures gpu gpu-cuda-programs

Last synced: 14 Nov 2024

https://github.com/rapidsai/cucim

cuCIM - RAPIDS GPU-accelerated image processing library

computer-vision cuda digital-pathology gpu image-analysis image-data image-processing medical-imaging microscopy multidimensional-image-processing nvidia segmentation

Last synced: 23 Jan 2025

https://github.com/ibm/aihwkit

IBM Analog Hardware Acceleration Kit

ai analog-devices cuda neural-networks pytorch

Last synced: 20 Jan 2025

https://github.com/uncomplicate/bayadera

High-performance Bayesian Data Analysis on the GPU in Clojure

bayesian bayesian-data-analysis bayesian-inference clojure clojure-library cuda gpu gpu-acceleration gpu-computing high-performance-computing machine-learning markov-chain-monte-carlo mcmc opencl statistics

Last synced: 22 Jan 2025

https://github.com/ingonyama-zk/icicle

A hardware acceleration library for compute intensive cryptography :ice_cube:

cpu cryptography cuda golang msm ntt rust zero-knowledge

Last synced: 24 Jan 2025

https://github.com/nvidia/cuquantum

Home for cuQuantum Python & NVIDIA cuQuantum SDK C++ samples

cuda cuquantum custatevec cutensornet nvidia quantum-computing

Last synced: 23 Jan 2025

https://github.com/nersc/timemory

Modular C++ Toolkit for Performance Analysis and Logging. Profiling API and Tools for C, C++, CUDA, Fortran, and Python. The C++ template API is essentially a framework for creating tools: it is designed to provide a unifying interface for recording various performance measurements alongside data logging and interfaces to other tools.

analysis c cplusplus cpp cross-language cross-platform cuda cupti gotcha hardware-counters instrumentation-api memory-measurements modular-design mpi papi performance performance-measurement python roofline

Last synced: 23 Jan 2025

https://github.com/alpaka-group/alpaka

Abstraction Library for Parallel Kernel Acceleration :llama:

cpp cpp17 cuda gpu header-only heterogeneous-parallel-programming hip hpc openacc openmp rocm tbb

Last synced: 25 Jan 2025

https://github.com/NERSC/timemory

Modular C++ Toolkit for Performance Analysis and Logging. Profiling API and Tools for C, C++, CUDA, Fortran, and Python. The C++ template API is essentially a framework for creating tools: it is designed to provide a unifying interface for recording various performance measurements alongside data logging and interfaces to other tools.

analysis c cplusplus cpp cross-language cross-platform cuda cupti gotcha hardware-counters instrumentation-api memory-measurements modular-design mpi papi performance performance-measurement python roofline

Last synced: 14 Nov 2024

https://github.com/glotzerlab/hoomd-blue

Molecular dynamics and Monte Carlo soft matter simulation on GPUs.

conda-forge cuda docker gpu hard-particle hoomd-blue molecular-dynamics monte-carlo-simulation particle-system python simulation singularity

Last synced: 25 Jan 2025

https://github.com/cybercongress/go-cyber

Your 🔵 Superintelligence

ai blockchain computation-graphs cosmos cosmos-sdk cuda cyber cyber-rank fuckgoogle great-web ipfs knowledge-graph protocol search search-engine soft3 supercomputer tendermint universe-mirror web3

Last synced: 25 Jan 2025

https://github.com/ekondis/mixbench

A GPU benchmark tool for evaluating GPUs and CPUs on mixed operational intensity kernels (CUDA, OpenCL, HIP, SYCL, OpenMP)

benchmark cuda gpu hip opencl openmp sycl

Last synced: 05 Nov 2024

https://github.com/luisagroup/luisarender

High-Performance Cross-Platform Monte Carlo Renderer Based on LuisaCompute

cpp cuda gpu high-performance ispc metal optix path-tracing ray-tracing renderer rendering siggraph-asia-2022

Last synced: 23 Jan 2025

https://github.com/NVIDIA/cuQuantum

Home for cuQuantum Python & NVIDIA cuQuantum SDK C++ samples

cuda cuquantum custatevec cutensornet nvidia quantum-computing

Last synced: 03 Nov 2024

https://github.com/harrism/hemi

Simple utilities to enable code reuse and portability between CUDA C/C++ and standard C/C++.

c-plus-plus cuda cuda-device cuda-kernels gpu hemi

Last synced: 21 Jan 2025

https://github.com/IBM/aihwkit

IBM Analog Hardware Acceleration Kit

ai analog-devices cuda neural-networks pytorch

Last synced: 17 Nov 2024

https://github.com/lambdalabsml/distributed-training-guide

Best practices & guides on how to write distributed pytorch training code

cluster cuda deepspeed distributed-training fsdp gpu gpu-cluster kuberentes lambdalabs mpi nccl pytorch sharding slurm

Last synced: 27 Jan 2025

https://github.com/bruce-lee-ly/cuda_hgemm

Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruction.

cublas cuda gemm gpu hgemm matrix-multiply nvidia tensor-core

Last synced: 27 Jan 2025

https://github.com/agenium-scale/nsimd

Agenium Scale vectorization library for CPUs and GPUs

aarch64 avx avx2 avx512 cpp20 cpp20-library cuda hpc neon neon128 rocm simd simd-instructions simd-library simd-programming sse2 sse42 sve vectorization-library

Last synced: 22 Jan 2025

https://github.com/a2flo/floor

A C++ Compute/Graphics Library and Toolchain enabling same-source CUDA/Host/Metal/OpenCL/Vulkan C++ programming and execution.

c-plus-plus compiler compute cuda graphics ios linux macos metal opencl openxr rendering spir spir-v virtual-reality vulkan windows

Last synced: 27 Jan 2025

https://github.com/UoB-HPC/BabelStream

STREAM, for lots of devices written in many programming models

benchmark cuda gpgpu gpu hpc kokkos memory-bandwidth openacc opencl openmp parallel-processing raja sycl

Last synced: 09 Nov 2024

https://github.com/omlins/ParallelStencil.jl

Package for writing high-level code for parallel high-performance stencil computations that can be deployed on both GPUs and CPUs

cuda gpu julia multi-gpu multi-xpu parallel staggered-grids stencil-codes xpu

Last synced: 30 Oct 2024

https://github.com/knightcrawler25/optix-pathtracer

Simple physically based path tracer based on Nvidia's Optix Ray Tracing Engine

brdf cuda disney gpu optix pathtracing raytracing

Last synced: 24 Jan 2025

https://github.com/charles-r-earp/autograph

A machine learning library for Rust.

cuda machine-learning neural-networks rust

Last synced: 19 Nov 2024

https://github.com/favreau/Sol-R

Open-Source CUDA/OpenCL Speed Of Light Ray-tracer

3d 3d-graphics-engine cuda gpgpu gpu-acceleration gpu-computing graphics-engine interactive opencl path-tracing pathtracing ray-tracing raytracer raytracing raytracing-engine realtime-rendering rendering science virtual-reality vr

Last synced: 12 Nov 2024

https://github.com/kerneltuner/kernel_tuner

Kernel Tuner

auto-tuning autotuning c cplusplus cuda cuda-kernels gpu gpu-computing kernel-tuner machine-learning opencl opencl-kernels optimization python software-development testing

Last synced: 24 Jan 2025

https://github.com/lattice/quda

QUDA is a library for performing calculations in lattice QCD on GPUs.

c c-plus-plus cuda gpu mpi multi-gpu qcd

Last synced: 25 Jan 2025

https://github.com/Bruce-Lee-LY/cuda_hgemm

Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruction.

cublas cuda gemm gpu hgemm matrix-multiply nvidia tensor-core

Last synced: 19 Nov 2024

https://github.com/QMCPACK/qmcpack

Main repository for QMCPACK, an open-source production level many-body ab initio Quantum Monte Carlo code for computing the electronic structure of atoms, molecules, and solids with full performance portable GPU support

c-plus-plus cuda electronic-structure gpu high-performance-computing hpc mpi quantum-chemistry quantum-monte-carlo