{"id":13670870,"url":"https://github.com/romeric/Fastor","last_synced_at":"2025-04-27T13:33:00.859Z","repository":{"id":38290619,"uuid":"59531018","full_name":"romeric/Fastor","owner":"romeric","description":"A lightweight high performance tensor algebra framework for modern C++","archived":false,"fork":false,"pushed_at":"2024-04-13T00:46:55.000Z","size":3357,"stargazers_count":751,"open_issues_count":32,"forks_count":69,"subscribers_count":28,"default_branch":"master","last_synced_at":"2024-11-11T08:43:17.782Z","etag":null,"topics":["fpga","hpc","multidimensional-arrays","simd","small-blas","tensor-contraction","tensors"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/romeric.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-05-24T01:40:32.000Z","updated_at":"2024-11-08T12:36:09.000Z","dependencies_parsed_at":"2024-04-13T01:59:31.804Z","dependency_job_id":null,"html_url":"https://github.com/romeric/Fastor","commit_stats":null,"previous_names":[],"tags_count":15,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/romeric%2FFastor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/romeric%2FFastor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/romeric%2FFastor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/romeric%2FFastor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/romeric","downlo
ad_url":"https://codeload.github.com/romeric/Fastor/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251145711,"owners_count":21543086,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["fpga","hpc","multidimensional-arrays","simd","small-blas","tensor-contraction","tensors"],"created_at":"2024-08-02T09:00:51.315Z","updated_at":"2025-04-27T13:32:55.840Z","avatar_url":"https://github.com/romeric.png","language":"C++","readme":"[![Build Status](https://travis-ci.com/romeric/Fastor.svg?branch=master)](https://travis-ci.com/romeric/Fastor)\n[![Build status](https://ci.appveyor.com/api/projects/status/hoj5lkq988kly121?svg=true)](https://ci.appveyor.com/project/romeric/fastor)\n![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)\n![GitHub release (latest by date)](https://img.shields.io/github/v/release/romeric/fastor)\n\n# Fastor\n**Fastor** is a high performance tensor (fixed multi-dimensional array) library for modern C++.\n\nFastor offers:\n\n- **High-level interface** for manipulating multi-dimensional arrays in C++ that look and feel native to scientific programmers\n- **Bare metal performance** for small matrix/tensor multiplications, contractions and tensor factorisations [LU, QR etc]. 
Refer to [benchmarks](https://github.com/romeric/Fastor/wiki/Benchmarks) to see how Fastor delivers performance on par with MKL JIT's dedicated API\n- **Compile time operation minimisation** such as graph optimisation, greedy matrix-chain products and nearly symbolic manipulations to reduce the complexity of evaluation of BLAS or non-BLAS type expressions by orders of magnitude\n- Explicit and configurable **SIMD vectorisation** supporting all numeric data types `float32`, `float64`, `complex float32` and `complex float64` as well as integral types\n- Optional **SIMD backends** such as [sleef](https://github.com/shibatch/sleef), [Vc](https://github.com/VcDevel/Vc) or even [std::experimental::simd](https://en.cppreference.com/w/cpp/experimental/simd/simd)\n- **Optional JIT backend** using Intel's [MKL-JIT](https://software.intel.com/en-us/articles/intel-math-kernel-library-improved-small-matrix-performance-using-just-in-time-jit-code) and [LIBXSMM](https://github.com/hfp/libxsmm) for performance portable code\n- Ability to **wrap existing data** and operate on them using Fastor's highly optimised kernels\n- Suitable linear algebra library for **FPGAs, micro-controllers and embedded systems** due to absolutely no dynamic allocations and no RTTI\n- **Lightweight header-only** library with no external dependencies offering **fast compilation times**\n- **Well-tested** on most compilers including GCC, Clang, Intel's ICC and MSVC\n\n\u003c!-- - **Operation minimisation or FLOP reducing algorithms:** Fastor relies on a domain-aware Expression Template (ET) engine that can not only perform lazy and delayed evaluation but also sophisticated mathematical transformations at *compile time* such as graph optimisation, nearly symbolic tensor algebraic manipulation to reduce the complexity of evaluation of BLAS and/or non-BLAS type expressions by orders of magnitude. Some of these functionalities are non-existent in other available C++ ET linear algebra libraries. 
For an example of what Fastor can do with expressions at compile time see the section on [smart expression templates](###Smart-expression-templates).\n- **Data parallelism for streaming architectures** Fastor utilises explicit SIMD instructions (from SSE all the way to AVX512 and FMA) through its built-in `SIMDVector` layer. This backend is configurable and one can switch to a different implementation of SIMD types for instance to [Vc](https://github.com/VcDevel/Vc) or even to C++20 SIMD data types [std::experimental::simd](https://en.cppreference.com/w/cpp/experimental/simd/simd) which will cover ARM NEON, AltiVec and other potential streaming architectures like GPUs.\n- **High performance zero overhead tensor kernels** Combining sophisticated metaprogramming capabilities with statically dispatched bespoke kernels makes Fastor a highly efficient framework for tensor operations whose performance can rival specialised vendor libraries such as [MKL-JIT](https://software.intel.com/en-us/articles/intel-math-kernel-library-improved-small-matrix-performance-using-just-in-time-jit-code) and [LIBXSMM](https://github.com/hfp/libxsmm). See the [benchmarks](https://github.com/romeric/Fastor/wiki/10.-Benchmarks) for standard BLAS routines and other specialised non-standard tensor kernels. In situations where jitted code is deemed more efficient or portable than the statically dispatched kernels, the built-in BLAS layer can be easily configured with an optimised jitted vendor BLAS, see [[using the LIBXSMM/MKL JIT backend](https://github.com/romeric/Fastor/wiki/Using-the-LIBXSMM-backend)]. --\u003e\n\n### Documentation\nDocumentation can be found under the [Wiki](https://github.com/romeric/Fastor/wiki) pages.\n\n### High-level interface\nFastor provides a high level interface for tensor algebra. 
To get a glimpse, consider the following\n~~~c++\nTensor\u003cdouble\u003e scalar = 3.5;                // A scalar\nTensor\u003cfloat,3\u003e vector3 = {1,2,3};          // A vector\nTensor\u003cint,3,2\u003e matrix{{1,2},{3,4},{5,6}};  // A second order tensor\nTensor\u003cdouble,3,3,3\u003e tensor_3;              // A third order tensor with dimension 3x3x3\ntensor_3.arange(0);                         // fill tensor with sequentially ascending numbers\ntensor_3(0,2,1);                            // index a tensor\ntensor_3(all,last,seq(0,2));                // slice a tensor tensor_3[:,-1,:2]\ntensor_3.rank();                            // get rank of tensor, 3 in this case\nTensor\u003cfloat,2,2,2,2,2,2,4,3,2,3,3,6\u003e t_12; // A 12th order tensor\n~~~\n\u003c!-- a sample output of the above code would be\n~~~bash\n[0,:,:]\n⎡      0,       1,       2 ⎤\n⎢      3,       4,       5 ⎥\n⎣      6,       7,       8 ⎦\n[1,:,:]\n⎡      9,      10,      11 ⎤\n⎢     12,      13,      14 ⎥\n⎣     15,      16,      17 ⎦\n[2,:,:]\n⎡     18,      19,      20 ⎤\n⎢     21,      22,      23 ⎥\n⎣     24,      25,      26 ⎦\n~~~ --\u003e\n\n### Tensor contraction\nEinstein summation as well as summing over multiple (i.e. more than two) indices are supported. 
As a complete example consider\n~~~c++\n#include \u003cFastor/Fastor.h\u003e\nusing namespace Fastor;\nenum {I,J,K,L,M,N};\n\nint main() {\n    // An example of Einstein summation\n    Tensor\u003cdouble,2,3,5\u003e A; Tensor\u003cdouble,3,5,2,4\u003e B;\n    // fill A and B\n    A.random(); B.random();\n    auto C = einsum\u003cIndex\u003cI,J,K\u003e,Index\u003cJ,L,M,N\u003e\u003e(A,B);\n\n    // An example of summing over three indices\n    Tensor\u003cdouble,5,5,5\u003e D; D.random();\n    auto E = inner(D);\n\n    // An example of tensor permutation\n    Tensor\u003cfloat,3,4,5,2\u003e F; F.random();\n    auto G = permute\u003cIndex\u003cJ,K,I,L\u003e\u003e(F);\n\n    // Output the results\n    print(\"Our big tensors:\",C,E,G);\n\n    return 0;\n}\n~~~\nYou can compile this by providing the following flags to your compiler `-std=c++14 -O3 -march=native -DNDEBUG`.\n\n### Tensor views: A powerful indexing, slicing and broadcasting mechanism\nFastor provides powerful tensor views for block indexing, slicing and broadcasting familiar to scientific programmers. 
Consider the following examples\n~~~c++\nTensor\u003cdouble,4,3,10\u003e A, B;\nA.random(); B.random();\nTensor\u003cdouble,2,2,5\u003e C; Tensor\u003cdouble,4,3,1\u003e D;\n\n// Dynamic views -\u003e seq(first,last,step)\nC = A(seq(0,2),seq(0,2),seq(0,last,2));                              // C = A[0:2,0:2,0::2]\nD = B(all,all,0) + A(all,all,last);                                  // D = B[:,:,0] + A[:,:,-1]\nA(2,all,3) = 5.0;                                                    // A[2,:,3] = 5.0\n\n// Static views -\u003e fseq\u003cfirst,last,step\u003e\nC = A(fseq\u003c0,2\u003e(),fseq\u003c0,2\u003e(),fseq\u003c0,last,2\u003e());                     // C = A[0:2,0:2,0::2]\nD = B(all, all, fix\u003c0\u003e) + A(all, all, fix\u003clast\u003e());                  // D = B[:,:,0] + A[:,:,-1]\nA(2,all,3) = 5.0;                                                    // A[2,:,3] = 5.0\n\n// Overlapping is also allowed without having undefined/aliasing behaviour\nA(seq(2,last),all,all).noalias() += A(seq(0,last-2),all,all);        // A[2:,:,:] += A[:-2,:,:]\n// Note that in case of perfect overlapping noalias is not required\nA(seq(0,last-2),all,all) += A(seq(0,last-2),all,all);                // A[:-2,:,:] += A[:-2,:,:]\n\n// If, instead of a tensor view, one needs an actual tensor, iseq can be used\n// iseq\u003cfirst,last,step\u003e\nC = A(iseq\u003c0,2\u003e(),iseq\u003c0,2\u003e(),iseq\u003c0,last,2\u003e());                     // C = A[0:2,0:2,0::2]\n// Note that iseq returns an immediate tensor rather than a tensor view and hence cannot appear\n// on the left hand side, for instance\nA(iseq\u003c0,2\u003e(),iseq\u003c0,2\u003e(),iseq\u003c0,last,2\u003e()) = 2; // Will not compile, as left operand is an rvalue\n\n// One can also index a tensor with another tensor(s)\nTensor\u003cfloat,10,10\u003e E; E.fill(2);\nTensor\u003cint,5\u003e it = {0,1,3,6,8};\nTensor\u003csize_t,10,10\u003e t_it; t_it.arange();\nE(it,0) = 2;\nE(it,seq(0,last,3)) /= -1000.;\nE(all,it) 
+= E(all,it) * 15.;\nE(t_it) -= 42 + E;\n\n// Masked and filtered views are also supported\nTensor\u003cdouble,2,2\u003e F;\nTensor\u003cbool,2,2\u003e mask = {{true,false},{false,true}};\nF(mask) += 10;\n~~~\nAll possible combinations of slicing and broadcasting are supported. For instance, one complex slicing and broadcasting example is given below\n~~~c++\nA(all,all) -= log(B(all,all,0)) + abs(B(all,all,1)) + sin(C(all,0,all,0)) - 102. - cos(B(all,all,0));\n~~~\n\n\u003c!-- It should be mentioned that since tensor views work on a view of (reference to) a tensor and do not copy any data in the background, the use of the keyword `auto` can be dangerous at times\n~~~c++\nauto B = A(all,all,seq(0,5),seq(0,3)); // the scope of view expressions ends with ; as a view is a reference to an rvalue\nauto C = B + 2; // Hence this will segfault as B refers to a non-existing piece of memory\n~~~\nTo solve this issue, use immediate construction from a view\n~~~c++\nTensor\u003cdouble,2,2,5,3\u003e B = A(all,all,seq(0,5),seq(0,3)); // B is now permanent\nauto C = B + 2; // This will behave as expected\n~~~ --\u003e\n\u003c!-- From a performance point of view, Fastor tries very hard to vectorise (read SIMD vectorisation) tensor views, but this heavily depends on the compiler's ability to inline multiple recursive functions [as is the case for all expression templates]. If a view appears on the right hand side of an assignment, but not on the left, Fastor automatically vectorises the expression. However if a view appears on the left hand side of an assignment, Fastor does not by default vectorise the expression. To enable vectorisation across all tensor views use the compiler flag `-DFASTOR_USE_VECTORISE_EXPR_ASSIGN`. Also for performance reasons it is beneficial to avoid overlapping assignments, otherwise a copy will be made. If your code does not use any overlapping assignments, then this feature can be turned off completely by issuing `-DFASTOR_NO_ALIAS`. 
At this stage it is also beneficial to consider that while compiling complex and big expressions the inlining limit of the compiler should be increased and tested i.e. `-finline-limit=\u003cbig number\u003e` for GCC, `-mllvm -inline-threshold=\u003cbig number\u003e` for Clang and `-inline-forceinline` for ICC.\n\nTo see how efficient tensor views can be vectorised, as an example consider the following 4th order finite difference example for Laplace equation\n~~~c++\nTensor\u003cdouble,100,100\u003e u, v;\n// fill u and v\n// A complex assignment expression involving multiple tensor views\nu(seq(1,last-1),seq(1,last-1)) =\n    ((  v(seq(0,last-2),seq(1,last-1)) + v(seq(2,last),seq(1,last-1)) +\n        v(seq(1,last-1),seq(0,last-2)) + v(seq(1,last-1),seq(2,last)) )*4.0 +\n        v(seq(0,last-2),seq(0,last-2)) + v(seq(0,last-2),seq(2,last)) +\n        v(seq(2,last),seq(0,last-2))   + v(seq(2,last),seq(2,last)) ) / 20.0;\n~~~\nusing `-O3 -mavx2 -mfma -DNDEBUG -DFASTOR_NO_ALIAS -DFASTOR_USE_VECTORISE_EXPR_ASSIGN` the above expression compiles to\n~~~assembly\nL129:\n  leaq  -768(%rcx), %rdx\n  movq  %rsi, %rax\n  .align 4,0x90\nL128:\n  vmovupd 8(%rax), %ymm0\n  vmovupd (%rax), %ymm1\n  addq  $32, %rdx\n  addq  $32, %rax\n  vaddpd  1576(%rax), %ymm0, %ymm0\n  vaddpd  768(%rax), %ymm0, %ymm0\n  vaddpd  784(%rax), %ymm0, %ymm0\n  vfmadd132pd %ymm3, %ymm1, %ymm0\n  vaddpd  -16(%rax), %ymm0, %ymm0\n  vaddpd  1568(%rax), %ymm0, %ymm0\n  vaddpd  1584(%rax), %ymm0, %ymm0\n  vdivpd  %ymm2, %ymm0, %ymm0\n  vmovupd %ymm0, -32(%rdx)\n  cmpq  %rdx, %rcx\n  jne L128\n  vmovupd 2376(%rsi), %xmm0\n  vaddpd  776(%rsi), %xmm0, %xmm0\n  addq  $800, %rcx\n  addq  $800, %rsi\n  vaddpd  768(%rsi), %xmm0, %xmm0\n  vaddpd  784(%rsi), %xmm0, %xmm0\n  vfmadd213pd -32(%rsi), %xmm5, %xmm0\n  vaddpd  -16(%rsi), %xmm0, %xmm0\n  vaddpd  1568(%rsi), %xmm0, %xmm0\n  vaddpd  1584(%rsi), %xmm0, %xmm0\n  vdivpd  %xmm4, %xmm0, %xmm0\n  vmovups %xmm0, -800(%rcx)\n  cmpq  %r13, %rcx\n  jne 
L129\n~~~\nAside from the unaligned load and store instructions (which are in fact as fast as their aligned counterparts, and unavoidable in this specific case), the rest of the generated code is as efficient as it gets for an `AVX2` architecture, beating the performance of Fortran. With the help of an optimising compiler, Fastor's functionalities come closest to the ideal metal performance for numerical tensor algebra code.\n --\u003e\n\n### SIMD optimised linear algebra kernels for fixed size tensors\nAll basic linear algebra subroutines for small matrices/tensors (where the overhead of calling vendor/optimised `BLAS` is typically high) are fully SIMD vectorised and efficiently implemented. Note that Fastor exposes two functionally equivalent interfaces for linear algebra functions: the more verbose names such as `matmul`, `determinant` and `inverse` that evaluate immediately, and the less verbose ones (`%`, `det`, `inv`) that evaluate lazily\n~~~c++\nTensor\u003cdouble,3,3\u003e A,B;\n// fill A and B\nauto mulab = matmul(A,B);       // matrix matrix multiplication [or equivalently A % B]\nauto norma = norm(A);           // Frobenius norm of A\nauto detb  = determinant(B);    // determinant of B [or equivalently det(B)]\nauto inva  = inverse(A);        // inverse of A [or equivalently inv(A)]\nauto cofb  = cofactor(B);       // cofactor of B [or equivalently cof(B)]\nlu(A, L, U);                    // LU decomposition of A into L and U\nqr(A, Q, R);                    // QR decomposition of A into Q and R\n~~~\n\n\n### Boolean tensor algebra\nA set of boolean operations is available; whenever possible these are performed at compile time\n~~~c++\nisuniform(A);                   // does the tensor expression span equally in all dimensions - generalisation of square matrices\nisorthogonal(A);                // is the tensor expression orthogonal\nisequal(A,B,tol);               // Are two tensor expressions equal within a tolerance\ndoesbelongtoSL3(A);             // does 
the tensor expression belong to the special linear 3D group\ndoesbelongtoSO3(A);             // does the tensor expression belong to the special orthogonal 3D group\nissymmetric\u003caxis_1, axis_3\u003e(A); // is the tensor expression symmetric in the axis_1 x axis_3 plane\nisdeviatoric(A);                // is the tensor expression deviatoric [trace free]\nisvolumetric(A);                // is the tensor expression volumetric [A = 1/3*trace(A) * I]\nall_of(A \u003c B);                  // Are all elements in A less than B\nany_of(A \u003e= B);                 // is any element in A greater than or equal to the corresponding element in B\nnone_of(A == B);                // are no elements of A and B equal\n~~~\n\n### Interfacing with C arrays and external buffers\nAlternatively, Fastor can be used as a pure wrapper over an existing buffer. You can wrap C arrays or map any external piece of memory as Fastor tensors and operate on them just like you would on Fastor's tensors without making any copies, using the `Fastor::TensorMap` feature. For instance\n\n~~~c++\ndouble c_array[4] = {1,2,3,4};\n\n// Map to a Fastor vector\nTensorMap\u003cdouble,4\u003e tn1(c_array);\n\n// Map to a Fastor matrix of 2x2\nTensorMap\u003cdouble,2,2\u003e tn2(c_array);\n\n// You can now operate on these. This will also modify c_array\ntn1 += 1;\ntn2(0,1) = 5;\n~~~\n\n### Basic expression templates\nExpression templates are archetypal of array/tensor libraries in C++ as they provide a means for lazy evaluation of arbitrary chained operations. Consider the following expression\n\n~~~c++\nTensor\u003cfloat,16,16,16,16\u003e tn1, tn2, tn3;\ntn1.random(); tn2.random(); tn3.random();\nauto tn4 = 2*tn1+sqrt(tn2-tn3);\n~~~\n\nHere `tn4` is not another tensor but rather an expression that is not yet evaluated. 
The expression is evaluated if you explicitly assign it to another tensor or call the free function `evaluate` on the expression\n\n~~~c++\nTensor\u003cfloat,16,16,16,16\u003e tn5 = tn4;\n// or\nauto tn6 = evaluate(tn4);\n~~~\n\nThis mechanism helps chain the operations to avoid the need for intermediate memory allocations. Various re-structuring of the expression before evaluation is possible depending on the chosen policy.\n\n### Smart expression templates\n\nAside from basic expression templates, by employing further template metaprogramming techniques Fastor can mathematically transform expressions and/or apply compile time graph optimisation to find optimal contraction indices of complex tensor networks, for instance. This gives Fastor the ability to re-structure or completely re-write an expression and simplify it rather symbolically. As an example, consider the expression `trace(matmul(transpose(A),B))` which is `O(n^3)` in computational complexity. Fastor can determine this to be inefficient and will statically dispatch the call to an equivalent but much more efficient routine, in this case `A_ij*B_ij` or `inner(A,B)` which is `O(n^2)`. Further examples of such mathematical transformations include (but are certainly not limited to)\n~~~c++\ndet(inv(A));             // transformed to 1/det(A), O(n^3) reduction in computation\ntrans(cof(A));           // transformed to adj(A), O(n^2) reduction in memory access\ntrans(adj(A));           // transformed to cof(A), O(n^2) reduction in memory access\nA % B % b;               // transformed to A % (B % b), O(n) reduction in computation [% is the matrix multiplication operator]\n// and many more\n~~~\nThese expressions are not treated as special cases but rather the **Einstein indicial notation** of the whole expression is constructed under the hood and by simply simplifying/collapsing the indices one obtains the most efficient form in which an expression can be evaluated. 
The expression is then sent to an optimised kernel for evaluation. Note that there are situations where the user may write a complex chain of operations in the most verbose/obvious way, perhaps for readability purposes, but Fastor delays the evaluation of the expression and checks if an equivalent but more efficient expression can be computed.\n\n### Operation minimisation for tensor networks\n\nFor tensor networks comprising many higher-rank tensors, a full generalisation of the above mathematical transformation can be performed through a constructive graph search optimisation. This typically involves finding the most optimal pattern of tensor contraction by studying the indices of contraction wherein tensor pairs are multiplied, summed over and factorised out in all possible combinations in order to come up with a cost model. Once again, knowing the dimensions of the tensor and the contraction pattern, Fastor performs this operation minimisation step at *compile time* and further checks the SIMD vectorisability of the tensor contraction loop nest (i.e. full/partial/strided vectorisation). In a nutshell, it not only minimises the number of floating point operations but also generates the most optimal vectorisable loop nest for attaining theoretical peak for those remaining FLOPs. 
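\n\nAs a minimal sketch of such a network contraction (the tensor sizes here are arbitrary, and the multi-operand form of `einsum` is assumed to be available in the same way as the two-tensor examples above), a three-tensor network can be contracted in a single call, leaving the contraction order to Fastor:\n~~~c++\n// Contract the network A_ijk * B_jkl * C_lm in one einsum call;\n// the pairwise contraction order is then chosen at compile time\nenum {I,J,K,L,M};\nTensor\u003cdouble,4,4,4\u003e A; Tensor\u003cdouble,4,4,4\u003e B; Tensor\u003cdouble,4,4\u003e C;\nA.random(); B.random(); C.random();\nauto D = einsum\u003cIndex\u003cI,J,K\u003e,Index\u003cJ,K,L\u003e,Index\u003cL,M\u003e\u003e(A,B,C);\n~~~\n\n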
The following figures show the benefit of operation minimisation (FLOP optimal) over a single expression evaluation (Memory-optimal - as temporaries are not created) approach (for instance, NumPy's `einsum` uses the single expression evaluation technique where the whole expression in `einsum` is computed without being broken up into smaller computations) in contracting a three-tensor-network fitting in `L1`, `L2` and `L3` caches, respectively\n\u003cp align=\"left\"\u003e\n  \u003cimg src=\"docs/imgs/05l1.png\" width=\"250\"\u003e\n  \u003cimg src=\"docs/imgs/05l2.png\" width=\"250\"\u003e\n  \u003cimg src=\"docs/imgs/05l3.png\" width=\"250\"\u003e\n\u003c/p\u003e\nThe x-axis shows the number of FLOPs saved over the single expression evaluation technique. Certainly, the bigger the size of tensors the more reduction in FLOPs is necessary to compensate for the temporaries created during by-pair (FLOP optimal) evaluation.\n\n\n\u003c!-- ### Domain-aware numerical analysis\nFastor tensors are not just multi-dimensional arrays like in other C++ libraries. Fastor tensors have a notion of index notation (which is why it is possible to perform various operation minimisations on them) and manifold transformation. For instance, in the field of computational mechanics it is customary to transform high order tensors to low rank tensors using a given transformation operator such as the Voigt transformation. Fastor has domain-specific features for such tensorial operations. For example, consider the dyadic product `A_ik*B_jl`, that can be computed in Fastor like\n~~~c++\nTensor\u003cdouble,3,3\u003e A,B;\nA.random(); B.random();\nTensor\u003cdouble,6,6\u003e C = einsum\u003cIndex\u003c0,2\u003e,Index\u003c1,3\u003e,Fastor::voigt\u003e(A,B);\n// or alternatively\nenum {I,J,K,L};\nTensor\u003cdouble,6,6\u003e D = einsum\u003cIndex\u003cI,K\u003e,Index\u003cJ,L\u003e,Fastor::voigt\u003e(A,B);\n~~~\n\nThis is generalised to any n-dimensional tensor. 
As you notice, all indices are resolved and the Voigt transformation is performed at compile time, keeping only the cost of computation at runtime. An equivalent implementation of this in C/Fortran requires low-level for-loop style programming that has O(n^4) computational complexity and non-contiguous memory access. Here is the benchmark between Ctran (C/Fortran) for loop code and the equivalent Fastor implementation for the above example, run over a million times (both compiled using `-O3 -mavx`, on `Intel(R) Xeon(R) CPU E5-2650 v2 @2.60GHz` running `Ubuntu 14.04`):\n\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/imgs/cyclic_bench.png\" width=\"600\" align=\"middle\"\u003e\n\u003c/p\u003e\n\nThe performance of Fastor comes from the fact that, when a Voigt transformation is requested, Fastor does not compute the elements which are not needed.\n\n### The tensor cross product and its associated algebra\nBuilding upon its domain specific features, Fastor implements the tensor cross product family of algebra by [Bonet et al.](http://dx.doi.org/10.1016/j.ijsolstr.2015.12.030) which can significantly reduce the amount of algebra involved in tensor derivatives of functionals which are forbiddingly complex to derive using a standard approach. The tensor cross product is a generalisation of the vector cross product to multi-dimensional manifolds. The tensor cross product of two second order tensors is defined as `C_iI = e_ijk*e_IJK*A_jJ*B_kK` where `e` is the third order permutation tensor. As can be seen this product is O(n^6) in computational complexity. 
Using Fastor the equivalent code is only 81 SSE intrinsics\n~~~c++\n// A and B are second order tensors\nusing Fastor::LeviCivita_pd;\nTensor\u003cdouble,3,3\u003e E = einsum\u003cIndex\u003ci,j,k\u003e,Index\u003cI,J,K\u003e,Index\u003cj,J\u003e,Index\u003ck,K\u003e\u003e\n                       (LeviCivita_pd,LeviCivita_pd,A,B);\n// or simply\nTensor\u003cdouble,3,3\u003e F = cross(A,B);\n~~~\nHere is a performance benchmark between Ctran (C/Fortran) code and the equivalent Fastor implementation for the above example, run over a million times (both compiled using `-O3 -mavx`, on `Intel(R) Xeon(R) CPU E5-2650 v2 @2.60GHz` running `Ubuntu 14.04`):\n\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/imgs/tensor_cross_bench.png\" width=\"600\" align=\"middle\"\u003e\n\u003c/p\u003e\n\n\nNotice the almost two orders of magnitude performance gain using Fastor. Again the real performance gain comes from the fact that Fastor eliminates zeros from the computation. --\u003e\n\n\n### Specialised tensors\nA set of specialised tensors is available that provide optimised tensor algebraic computations, for instance `SingleValueTensor` or `IdentityTensor`. Some of the computations performed on these tensors have almost zero cost no matter how big the tensor is. These tensors work in the exact same way as the `Tensor` class and can be assigned to one another. Consider for example the einsum between two `SingleValueTensor`s. A `SingleValueTensor` is a tensor of any dimension and size whose elements are all the same (a matrix of ones for instance).\n\n~~~c++\nSingleValueTensor\u003cdouble,20,20,30\u003e a(3.51);\nSingleValueTensor\u003cdouble,20,30\u003e b(2.76);\nauto c = einsum\u003cIndex\u003c0,1,2\u003e,Index\u003c0,2\u003e\u003e(a,b);\n~~~\n\nThis will incur almost no runtime cost. 
Whereas if the tensors were of type `Tensor`, a heavy computation would ensue.\n\n\n\u003c!-- ### Template meta-programming for powerful tensor contraction/permutation\nFastor utilises a bunch of meta-functions to perform most operations at compile time, consider the following examples\n~~~c++\nTensor\u003cdouble,3,4,5\u003e A;\nTensor\u003cdouble,5,3,4\u003e B;\nTensor\u003cdouble,3,3,3\u003e C;\nauto D = permutation\u003cIndex\u003c2,0,1\u003e\u003e(A); // type of D is deduced at compile time as Tensor\u003cdouble,5,3,4\u003e\nauto E = einsum\u003cIndex\u003cI,J,K\u003e,Index\u003cL,M,N\u003e\u003e(D,B); // type of E is deduced at compile time as Tensor\u003cdouble,5,3,4,5,3,4\u003e\nauto F = einsum\u003cIndex\u003cI,I,J\u003e\u003e(C); // type of F is deduced at compile time as Tensor\u003cdouble,3\u003e\nauto F2 = reduction(C); // type of F2 is deduced at compile time as scalar i.e. Tensor\u003cdouble\u003e\nauto E2 = reduction(D,B); // type of E2 is deduced at compile time as Tensor\u003cdouble\u003e\nTensor\u003cfloat,2,2\u003e G,H;\ntrace(H); // trace of H, in other words H_II\nreduction(G,H); // double contraction of G and H i.e. G_IJ*H_IJ\n~~~\nAs you can observe, with a combination of `permutation`, `contraction`, `reduction` and `einsum` (which itself is a glorified wrapper over the first three) any type of tensor contraction and permutation is possible, and using meta-programming the right amount of stack memory to be allocated is deduced at compile time. --\u003e\n\n\u003c!-- ### A minimal framework\nFastor is extremely lightweight: it is a *header-only* library, requires no build or compilation process and has no external dependencies. It is written in pure C++11 from the foundation. 
--\u003e\n\n### Tested Compilers\nFastor gets frequently tested against the following compilers (on Ubuntu 16.04/18.04/20.04, macOS 10.13+ and Windows 10)\n- GCC 5.1, GCC 5.2, GCC 5.3, GCC 5.4, GCC 6.2, GCC 7.3, GCC 8, GCC 9.1, GCC 9.2, GCC 9.3, GCC 10.1\n- Clang 3.6, Clang 3.7, Clang 3.8, Clang 3.9, Clang 5, Clang 7, Clang 8, Clang 10.0.0\n- Intel 16.0.1, Intel 16.0.2, Intel 16.0.3, Intel 17.0.1, Intel 18.2, Intel 19.3\n- MSVC 2019\n\n### References\nFor academic purposes, Fastor can be cited as\n````latex\n@Article{Poya2017,\n    author=\"Poya, Roman and Gil, Antonio J. and Ortigosa, Rogelio\",\n    title = \"A high performance data parallel tensor contraction framework: Application to coupled electro-mechanics\",\n    journal = \"Computer Physics Communications\",\n    year=\"2017\",\n    doi = \"http://dx.doi.org/10.1016/j.cpc.2017.02.016\",\n    url = \"http://www.sciencedirect.com/science/article/pii/S0010465517300681\"\n}\n````\n","funding_links":[],"categories":["Math","C++","Linear Algebra / Statistics Toolkit","[Libraries](#awesome-robotics-libraries)"],"sub_categories":["General Purpose Tensor Library","[Math](#awesome-robotics-libraries)"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fromeric%2FFastor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fromeric%2FFastor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fromeric%2FFastor/lists"}