{"id":13574806,"url":"https://github.com/wdmapp/gtensor","last_synced_at":"2025-04-04T18:32:16.437Z","repository":{"id":37245404,"uuid":"255336822","full_name":"wdmapp/gtensor","owner":"wdmapp","description":"GTensor is a multi-dimensional array C++14 header-only library for hybrid GPU development.","archived":false,"fork":false,"pushed_at":"2024-09-20T19:28:33.000Z","size":1244,"stargazers_count":34,"open_issues_count":48,"forks_count":9,"subscribers_count":3,"default_branch":"main","last_synced_at":"2024-11-05T10:44:41.719Z","etag":null,"topics":["cpp","cpp14","cuda","gpu","hacktoberfest","rocm","sycl"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/wdmapp.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-04-13T13:34:48.000Z","updated_at":"2024-09-20T19:28:37.000Z","dependencies_parsed_at":"2023-09-23T16:38:13.028Z","dependency_job_id":"3fc82206-db9b-408d-baeb-001b76f79dcf","html_url":"https://github.com/wdmapp/gtensor","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wdmapp%2Fgtensor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wdmapp%2Fgtensor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wdmapp%2Fgtensor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wdmapp%2Fgtensor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/
owners/wdmapp","download_url":"https://codeload.github.com/wdmapp/gtensor/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247229736,"owners_count":20905111,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cpp","cpp14","cuda","gpu","hacktoberfest","rocm","sycl"],"created_at":"2024-08-01T15:00:54.832Z","updated_at":"2025-04-04T18:32:16.424Z","avatar_url":"https://github.com/wdmapp.png","language":"C++","readme":"# gtensor\n\n![CCPP workflow](https://github.com/wdmapp/gtensor/actions/workflows/ccpp.yml/badge.svg)\n![SYCL workflow](https://github.com/wdmapp/gtensor/actions/workflows/sycl.yml/badge.svg)\n\ngtensor is a multi-dimensional array C++14 header-only library for hybrid GPU\ndevelopment. It was inspired by\n[xtensor](https://xtensor.readthedocs.io/en/latest/), and designed to support\nthe GPU port of the [GENE](http://genecode.org) fusion code.\n\nFeatures:\n- multi-dimensional arrays and array views, with easy interoperability\n  with Fortran and thrust\n- automatically generate GPU kernels based on array operations\n- define complex re-usable operations with lazy evaluation. 
This allows\n  operations to be composed in different ways and evaluated once as a single\n  kernel\n- easily support both CPU-only and GPU-CPU hybrid code in the same code base,\n  with only minimal use of #ifdef.\n- multi-dimensional array slicing similar to numpy\n- GPU support for nVidia via CUDA and AMD via HIP/ROCm,\n  and experimental Intel GPU support via SYCL.\n- [Experimental] C library cgtensor with wrappers around common GPU operations\n  (allocate and deallocate, device management, memory copy and set)\n- [Experimental] lightweight wrappers around GPU BLAS, LAPACK, and FFT\n  routines.\n\n## License\n\ngtensor is licensed under the 3-clause BSD license. See the [LICENSE](LICENSE)\nfile for details.\n\n## Installation (cmake)\n\ngtensor uses cmake 3.13+ to build the tests and install:\n```sh\ngit clone https://github.com/wdmapp/gtensor.git\ncd gtensor\ncmake -S . -B build -DGTENSOR_DEVICE=cuda \\\n  -DCMAKE_INSTALL_PREFIX=/opt/gtensor \\\n  -DBUILD_TESTING=OFF\ncmake --build build --target install\n```\nTo build for cpu/host only, use `-DGTENSOR_DEVICE=host`, for AMD/HIP use\n`-DGTENSOR_DEVICE=hip -DCMAKE_CXX_COMPILER=$(which hipcc)`, and for\nIntel/SYCL use `-DGTENSOR_DEVICE=sycl -DCMAKE_CXX_COMPILER=$(which dpcpp)`.\nSee sections below for more device-specific requirements.\n\nNote that gtensor can still be used by applications not using cmake -\nsee [Usage (GNU make)](#usage-gnu-make) for an example.\n\nTo use the internal data vector implementation instead of thrust, set\n`-DGTENSOR_USE_THRUST=OFF`. This has the advantage that device array\nallocations will not be zero initialized, which can improve performance\nsignificantly for some workloads, particularly when temporary arrays are\nused.\n\nTo enable experimental C/C++ library features, set `GTENSOR_BUILD_CLIB`,\n`GTENSOR_BUILD_BLAS`, or `GTENSOR_BUILD_FFT` to `ON`. 
Note that BLAS\nincludes some LAPACK routines for LU factorization.\n\n### nVidia CUDA requirements\n\ngtensor for nVidia GPUs with CUDA requires\n[CUDA Toolkit](https://developer.nvidia.com/cuda-toolkit) 10.0+.\n\n### AMD HIP requirements\n\ngtensor for AMD GPUs with HIP requires ROCm 4.5.0+, and rocthrust and\nrocprim. See the [ROCm installation\nguide](https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html)\nfor details. In Ubuntu, after setting up the ROCm repository, the required\npackages can be installed like this:\n```\nsudo apt install rocm-dkms rocm-dev rocthrust\n```\nThe official packages install to `/opt/rocm`. If using a different install\nlocation, set the `ROCM_PATH` cmake variable. To use coarse-grained\nmanaged memory, ROCm 5.0+ is required.\n\nTo use gt-fft and gt-blas, the rocsolver, rocblas, and rocfft packages need\nto be installed as well.\n\n### Intel SYCL requirements\n\nThe current SYCL implementation requires Intel OneAPI/DPC++ 2022.0 or later, with some known issues in\ngt-blas and gt-fft (npvt getrf/rs, 2d fft). Using the latest available release is recommended.\nWhen using the instructions at\n[install via package managers](https://www.intel.com/content/www/us/en/develop/documentation/installation-guide-for-intel-oneapi-toolkits-linux/top/installation/install-using-package-managers.html),\ninstalling the `intel-oneapi-dpcpp-compiler` package will pull in all required packages\n(the rest of basekit is not required).\n\nThe reason for the dependence on Intel OneAPI is that the implementation uses\nthe USM extension, which is not part of the current SYCL standard.\nCodeplay ComputeCpp 2.0.0 has an experimental implementation that is\nsufficiently different to require extra work to support.\n\nThe default device selector is always used. To control device selection, set the `SYCL_DEVICE_FILTER`\nenvironment variable. 
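For example, to restrict gtensor's SYCL backend to GPU devices (a minimal sketch assuming the Intel llvm filter syntax, where the value `gpu` matches any GPU device):\n```sh\nexport SYCL_DEVICE_FILTER=gpu\n```\n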
See the\n[intel llvm documentation](https://github.com/intel/llvm/blob/sycl/sycl/doc/EnvironmentVariables.md#sycl_device_filter)\nfor details.\n\nThe port is tested with an Intel iGPU, specifically UHD Graphics 630. It\nmay also work with the experimental CUDA backend for nVidia GPUs, but this\nis untested and it's recommended to use the gtensor CUDA backend instead.\n\nBetter support for other SYCL implementations like hipSYCL and ComputeCpp should be possible to add, with the\npossible exception of the gt-blas and gt-fft sub-libraries, which require oneMKL.\n\n### HOST CPU (no device) requirements\n\ngtensor should build with any C++ compiler supporting C++14. It has been\ntested with g++ 7, 8, and 9 and clang++ 8, 9, and 10.\n\n### Advanced multi-device configuration\n\nBy default, gtensor will install support for the device specified by\nthe `GTENSOR_DEVICE` variable (default `cuda`), and also the `host` (cpu only)\ndevice. This can be configured with `GTENSOR_BUILD_DEVICES` as a semicolon (;)\nseparated list. For example, to build support for all four backends\n(assuming a machine with multi-vendor GPUs and associated toolkits installed):\n```\ncmake -S . 
-B build -DGTENSOR_DEVICE=cuda \\\n  -DGTENSOR_BUILD_DEVICES=\"host;cuda;hip;sycl\" \\\n  -DCMAKE_INSTALL_PREFIX=/opt/gtensor \\\n  -DBUILD_TESTING=OFF\n```\n\nThis will cause targets to be created for each device: `gtensor::gtensor_cuda`,\n`gtensor::gtensor_host`, `gtensor::gtensor_hip`, and `gtensor::gtensor_sycl`.\nThe main `gtensor::gtensor` target will be an alias for the default set by\n`GTENSOR_DEVICE` (the cuda target in the above example).\n\n## Usage (cmake)\n\nOnce installed, gtensor can be used by adding this to a project's\n`CMakeLists.txt`:\n\n```cmake\n# if using GTENSOR_DEVICE=cuda\nenable_language(CUDA)\n\nfind_package(gtensor)\n\n# for each C++ target using gtensor\ntarget_gtensor_sources(myapp PRIVATE src/myapp.cxx)\ntarget_link_libraries(myapp gtensor::gtensor)\n```\n\nWhen running `cmake` for a project, add the gtensor\ninstall prefix to `CMAKE_PREFIX_PATH`. For example:\n```bash\ncmake -S . -B build -DCMAKE_PREFIX_PATH=/opt/gtensor\n```\n\nThe default gtensor device, set with the `GTENSOR_DEVICE` cmake variable\nwhen installing gtensor, can be overridden by setting `GTENSOR_DEVICE`\nagain in the client application before the call to `find_package(gtensor)`,\ntypically via the `-D` cmake command line option. This can be useful to debug\nan application by setting `-DGTENSOR_DEVICE=host`, to see if the problem is\nrelated to the hybrid device model or is an algorithmic problem, or to run a\nhost-only interactive debugger. Note that only devices specified with\n`GTENSOR_BUILD_DEVICES` at gtensor install time are available (the default\ndevice and `host` if no option was specified).\n\n### Using gtensor as a subdirectory or git submodule\n\ngtensor also supports usage as a subdirectory of another cmake project. This\nis typically done via git submodules. 
For example:\n```sh\ncd /path/to/app\ngit submodule add https://github.com/wdmapp/gtensor.git external/gtensor\n```\n\nIn the application's `CMakeLists.txt`:\n```cmake\n# set here or on the cmake command-line with `-DGTENSOR_DEVICE=...`.\nset(GTENSOR_DEVICE \"cuda\" CACHE STRING \"\")\n\nif (${GTENSOR_DEVICE} STREQUAL \"cuda\")\n  enable_language(CUDA)\nendif()\n\n# after setting GTENSOR_DEVICE\nadd_subdirectory(external/gtensor)\n\n# for each C++ target using gtensor\ntarget_gtensor_sources(myapp PRIVATE src/myapp.cxx)\ntarget_link_libraries(myapp gtensor::gtensor)\n```\n\n## Usage (GNU make)\n\nAs a header only library, gtensor can be integrated into an existing\nGNU make project as a subdirectory fairly easily for cuda and host devices.\n\nThe subdirectory is typically managed via git submodules, for example:\n```sh\ncd /path/to/app\ngit submodule add https://github.com/wdmapp/gtensor.git external/gtensor\n```\n\nSee [examples/Makefile](examples/Makefile) for a good way of organizing a\nproject's Makefile to provide cross-device support. The examples can be\nbuilt for different devices by setting the `GTENSOR_DEVICE` variable,\ne.g. 
`cd examples; make GTENSOR_DEVICE=host`.\n\n## Getting Started\n\n### Basic Example (host CPU only)\n\nHere is a simple example that computes a matrix with the multiplication\ntable and prints it out row by row using array slicing:\n\n```c++\n#include \u003ciostream\u003e\n\n#include \u003cgtensor/gtensor.h\u003e\n\nint main(int argc, char **argv) {\n    const int n = 9;\n    gt::gtensor\u003cint, 2\u003e mult_table(gt::shape(n, n));\n\n    for (int i=0; i\u003cn; i++) {\n        for (int j=0; j\u003cn; j++) {\n            mult_table(i,j) = (i+1)*(j+1);\n        }\n    }\n\n    for (int i=0; i\u003cn; i++) {\n        std::cout \u003c\u003c mult_table.view(i, gt::all) \u003c\u003c std::endl;\n    }\n}\n\n```\n\nIt can be built like this, using gcc version 5 or later:\n```\ng++ -std=c++14 -I /path/to/gtensor/include -o mult_table mult_table.cxx\n```\n\nand produces the following output:\n```\n{ 1 2 3 4 5 6 7 8 9 }\n{ 2 4 6 8 10 12 14 16 18 }\n{ 3 6 9 12 15 18 21 24 27 }\n{ 4 8 12 16 20 24 28 32 36 }\n{ 5 10 15 20 25 30 35 40 45 }\n{ 6 12 18 24 30 36 42 48 54 }\n{ 7 14 21 28 35 42 49 56 63 }\n{ 8 16 24 32 40 48 56 64 72 }\n{ 9 18 27 36 45 54 63 72 81 }\n```\n\nSee the full [mult\\_table example](examples/src/mult_table.cxx) for different\nways of performing this operation, taking advantage of more gtensor features.\n\n### GPU and CPU Example\n\nThe following program computes the vector product `a*x + y`, where `a` is a scalar\nand `x` and `y` are vectors. 
If built with `GTENSOR_HAVE_DEVICE` defined and\nusing the appropriate compiler (currently either nvcc or hipcc), it will run\nthe computation on a GPU device.\n\nSee the full [daxpy example](examples/src/daxpy.cxx) for more detailed comments\nand an example of using an explicit kernel.\n\n```\n#include \u003ciostream\u003e\n\n#include \u003cgtensor/gtensor.h\u003e\n\nusing namespace std;\n\n// provides convenient shortcuts for common gtensor functions, for example\n// underscore ('_') to represent open slice ends.\nusing namespace gt::placeholders;\n\ntemplate \u003ctypename S\u003e\ngt::gtensor\u003cdouble, 1, S\u003e daxpy(double a, const gt::gtensor\u003cdouble, 1, S\u003e \u0026x,\n                                const gt::gtensor\u003cdouble, 1, S\u003e \u0026y) {\n    return a * x + y;\n}\n\nint main(int argc, char **argv)\n{\n    int n = 1024 * 1024;\n    int nprint = 32;\n\n    double a = 0.5;\n\n    // Define and allocate two 1d vectors of size n on the host.\n    gt::gtensor\u003cdouble, 1, gt::space::host\u003e h_x(gt::shape(n));\n    gt::gtensor\u003cdouble, 1, gt::space::host\u003e h_y = gt::empty_like(h_x);\n    gt::gtensor\u003cdouble, 1, gt::space::host\u003e h_axpy;\n\n    // initialize host vectors\n    for (int i=0; i\u003cn; i++) {\n        h_x(i) = 2.0 * static_cast\u003cdouble\u003e(i);\n        h_y(i) = static_cast\u003cdouble\u003e(i);\n    }\n\n#ifdef GTENSOR_HAVE_DEVICE\n    cout \u003c\u003c \"gtensor have device\" \u003c\u003c endl;\n\n    // Define and allocate device versions of h_x and h_y, and declare\n    // a variable for the result on gpu.\n    gt::gtensor\u003cdouble, 1, gt::space::device\u003e d_x(gt::shape(n));\n    gt::gtensor\u003cdouble, 1, gt::space::device\u003e d_y = gt::empty_like(d_x);\n    gt::gtensor\u003cdouble, 1, gt::space::device\u003e d_axpy;\n\n    // Explicit copies of input from host to device.\n    copy(h_x, d_x);\n    copy(h_y, d_y);\n\n    // This automatically generates a computation kernel to run on 
the\n    // device.\n    d_axpy = daxpy(a, d_x, d_y);\n\n    // Explicit copy of result to host\n    h_axpy = gt::empty_like(h_x);\n    copy(d_axpy, h_axpy);\n#else\n    // host implementation - simply call directly using host gtensors\n    h_axpy = daxpy(a, h_x, h_y);\n#endif // GTENSOR_HAVE_DEVICE\n\n    // Define a slice to print a subset of elements for checking result\n    auto print_slice = gt::gslice(_, _, n/nprint);\n    cout \u003c\u003c \"a       = \" \u003c\u003c a \u003c\u003c endl;\n    cout \u003c\u003c \"x       = \" \u003c\u003c h_x.view(print_slice)  \u003c\u003c endl;\n    cout \u003c\u003c \"y       = \" \u003c\u003c h_y.view(print_slice)  \u003c\u003c endl;\n    cout \u003c\u003c \"a*x + y = \" \u003c\u003c h_axpy.view(print_slice) \u003c\u003c endl;\n}\n```\n\nExample build for nVidia GPU using nvcc:\n```\nGTENSOR_HOME=/path/to/gtensor\nnvcc -x cu -std=c++14 --expt-extended-lambda --expt-relaxed-constexpr \\\n -DGTENSOR_HAVE_DEVICE -DGTENSOR_DEVICE_CUDA -DGTENSOR_USE_THRUST \\\n -DNDEBUG -O3 \\\n -I $GTENSOR_HOME/include \\\n -o daxpy_cuda daxpy.cxx\n```\n\nBuild for AMD GPU using hipcc:\n```\nhipcc -hc -std=c++14 \\\n -DGTENSOR_HAVE_DEVICE -DGTENSOR_DEVICE_HIP -DGTENSOR_USE_THRUST \\\n -DNDEBUG -O3 \\\n -I $GTENSOR_HOME/include \\\n -isystem /opt/rocm/rocthrust/include \\\n -isystem /opt/rocm/include \\\n -isystem /opt/rocm/rocprim/include \\\n -isystem /opt/rocm/hip/include \\\n -o daxpy_hip daxpy.cxx\n```\n\nBuild for Intel GPU using dpcpp:\n```\ndpcpp -fsycl -std=c++14 \\\n -DGTENSOR_HAVE_DEVICE -DGTENSOR_DEVICE_SYCL \\\n -DGTENSOR_DEVICE_SYCL_GPU \\\n -DNDEBUG -O3 \\\n -I $GTENSOR_HOME/include \\\n -o daxpy_sycl daxpy.cxx\n```\n\nBuild for host CPU:\n```\ng++ -std=c++14 \\\n -DNDEBUG -O3 \\\n -I $GTENSOR_HOME/include \\\n -o daxpy_host daxpy.cxx\n```\n\n### Example using gtensor with existing GPU code\n\nIf you have existing code written in CUDA or HIP, you can use the `gt::adapt`\nand `gt::adapt_device` functions to wrap existing 
allocated host and device\nmemory in gtensor span containers. This allows you to use the convenience of\ngtensor for new code without having to do an extensive rewrite.\n\nSee [trig.cu](examples/src/trig.cu) and\n[trig_adapted.cxx](examples/src/trig_adapted.cxx). The same approach will work\nfor HIP with minor modifications.\n\n# Data Types and mutability\n\ngtensor has two types of data objects - those which are containers that own the\nunderlying data, like `gtensor`, and those which behave like span objects or\npointers, like `gtensor_span`. The `gview` objects, which are generally\nconstructed via the helper method `gt::view` or the convenience `view` methods\non `gtensor`, implement the slicing, broadcasting, and axis manipulation\nfunctions, and have hybrid behavior based on the underlying expression. In\nparticular, a `gview` wrapping a `gtensor_span` object will have span-like\nbehavior, and in most other cases will have owning container behavior.\n\nBefore a data object can be passed to a GPU kernel, it must be converted to a\nspan-like object, and must be resident on the device. This generally happens\nautomatically when using expression evaluation and `gtensor_device`, but must\nbe done manually by calling the `to_kernel()` method when using custom kernels\nwith `gt::launch\u003cN\u003e`. What typically happens is that the underlying `gtensor`\nobjects get transformed to `gtensor_span` of the appropriate type. This happens\neven when they are wrapped inside complex `gview` and `gfunction` objects.\n\nThe objects with span like behavior also have shallow const behavior. This\nmeans that even if the outer object is const, they allow modification of the\nunderlying data. This is consistent with `std::span` standardized in C++20. The\nidea is that if copying does not copy the underlying data (shallow copy), all\nother aspects of the interface should behave similarly. This is called\n\"regularity\". 
This also allows non-mutable lambdas to be used for launch\nkernels. Non-mutable lambdas are important because SYCL requires const kernel\nfunctions, so the left hand side of expressions must allow mutation of the\nunderlying data even when const because they may be contained inside a\nnon-mutable lambda and forced to be const.\n\nTo ensure const-correctness whenever possible, the `to_kernel()` routine on\n`const gtensor\u003cT, N, S\u003e` is special cased to return a `gtensor_span\u003cconst T,\nN, S\u003e`. This makes it so even though a non-const reference is returned from the\nelement accessors (shallow const behavior of span like object), modification is\nstill not allowed since the underlying type is const.\n\nTo make this more concrete, here are some examples:\n\n```\ngtensor_device\u003cint, 1\u003e a{1, 2, 3};\nconst gtensor_device\u003cint, 1\u003e a_const_copy = a;\n\na(0) = 10; // fine\na_const_copy(0) = 1; // won't compile, because a_const_copy(0) is const int\u0026\n\nconst auto k_a = a.to_kernel(); // const gtensor_span\u003cint, 1\u003e\nk_a(0) = -1; // allowed, gtensor_span has shallow const behavior\n\nauto k_a_const_copy = a_const_copy.to_kernel(); // gtensor_span\u003cconst int, 1\u003e\nk_a_const_copy(0) = 10; // won't compile, type of LHS is const int\u0026\n\n```\n\n# Streams (experimental)\n\nTo facilitate interoperability with existing libraries and allow\nexperimentation with some advanced multi-stream use cases, there are classes\n`gt::stream` and `gt::stream_view`. The `gt::stream` will create a new stream\nin the default device backend and destroy the stream when the object is\ndestructed. The `gt::stream_view` is constructed with an existing native stream\nobject in the default backend (e.g. 
a `cudaStream_t` for the CUDA backend).\nThey can be used as optional arguments to `gt::launch` and `gt::assign`, in\nwhich case they will execute asynchronously with respect to the default stream on the device.\nNote that the equals operator form of assign does not work with alternate\nstreams - it will always use the default stream. For the SYCL backend, the\nnative stream object is a `sycl::queue`.\n\nSee also `tests/test_stream.cxx`. Note that this API is likely to change; in\nparticular, the stream objects will become templated on space type.\n\n# Library Wrapper Extensions\n\n## gt-blas\n\nProvides wrappers around commonly used BLAS routines. Requires cuBLAS, rocblas,\nor oneMKL, depending on the GPU backend. The interface is mostly C-style, taking\nraw pointers, for easy interoperability with Fortran, with a few higher-level\ngtensor-specific helpers.\n\n```\n#include \"gt-blas/blas.h\"\n\nvoid blas()\n{\n  gt::blas::handle_t h;\n  gt::gtensor_device\u003cdouble, 1\u003e x = gt::arange\u003cdouble\u003e(1, 11);\n  gt::gtensor_device\u003cdouble, 1\u003e y = gt::arange\u003cdouble\u003e(1, 11);\n  gt::blas::axpy(h, 2.0, x, y);\n  std::cout \u003c\u003c \"a*x+y = \" \u003c\u003c y \u003c\u003c std::endl;\n  /* a*x+y = { 3 6 9 12 15 18 21 24 27 30 } */\n}\n```\n\nA naive banded LU solver implementation is also provided, useful in cases where\nthe matrices are banded and the native GPU batched LU solve has not been\noptimized yet. 
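The typical factor-once, solve-many call pattern looks roughly like the following sketch (the `_batched` routine names and argument order here are assumptions for illustration; check `gt-blas/blas.h` for the actual signatures):\n```\n// factor the batch of matrices once\ngt::blas::getrf_batched(h, n, d_Aptrs, lda, d_pivots, d_info, batch_size);\n// ... then solve repeatedly with different right hand sides\ngt::blas::getrs_batched(h, n, nrhs, d_Aptrs, lda, d_pivots, d_Bptrs, ldb, batch_size);\n```\n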
Parallelism for this implementation is on batch only.\n\n## gt-fft\n\nProvides a high-level C++-style interface around cuFFT, rocFFT, and oneMKL DFT.\n\n```\n#include \"gt-fft/fft.h\"\n\nvoid fft()\n{\n  // python: x = np.array([2., 3., -1., 4.])\n  gt::gtensor_device\u003cdouble, 1\u003e x = {2, 3, -1, 4};\n  auto y = gt::empty_device\u003cgt::complex\u003cdouble\u003e\u003e({3});\n\n  // python: y = np.fft.fft(x)\n  gt::fft::FFTPlanMany\u003cgt::fft::Domain::REAL, double\u003e plan({x.shape(0)}, 1);\n  plan(x, y);\n  std::cout \u003c\u003c y \u003c\u003c std::endl;\n  /* { (8,0) (3,1) (-6,0) } */\n}\n```\n\n## gt-solver\n\nProvides a high-level C++-style interface around batched LU solve, in\nparticular for the case where a single set of matrices is used repeatedly to solve\nwith different right hand side vectors. It maintains its own contiguous copy\nof the factored matrices in device memory, and device buffers for staging input\nand output right hand side vectors. This allows the application to build the\nmatrices on host and pass them off to the solver interface for all the\ndevice-specific handling. On some platforms, there are performance issues with solving\nout of managed memory, and the internal device buffers can significantly\nimprove performance on these platforms.\n\nThis should be preferred over directly calling gt-blas routines like getrf and\ngetrs when the use case matches (a single factorization and many solves with different\nright hand sides).\n","funding_links":[],"categories":["Table of Contents"],"sub_categories":["Mathematics and Science"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwdmapp%2Fgtensor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwdmapp%2Fgtensor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwdmapp%2Fgtensor/lists"}