Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/mcabbott/Tullio.jl
⅀
https://github.com/mcabbott/Tullio.jl
automatic-differentiation einsum julia notation tensor
Last synced: 1 day ago
JSON representation
⅀
- Host: GitHub
- URL: https://github.com/mcabbott/Tullio.jl
- Owner: mcabbott
- License: mit
- Created: 2019-06-23T16:45:22.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2023-12-09T11:45:41.000Z (11 months ago)
- Last Synced: 2024-11-05T04:42:23.272Z (3 days ago)
- Topics: automatic-differentiation, einsum, julia, notation, tensor
- Language: Julia
- Homepage:
- Size: 966 KB
- Stars: 615
- Watchers: 10
- Forks: 29
- Open Issues: 69
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-sciml - mcabbott/Tullio.jl: ⅀
README
Tullio.jl
[![GitHub
CI](https://img.shields.io/github/actions/workflow/status/mcabbott/Tullio.jl/ci.yml?logo=github)](https://github.com/mcabbott/Tullio.jl/actions?query=workflow%3ACI)
[![Buildkite GPU CI](https://img.shields.io/buildkite/7f7fec35c774174a59cf616fc6e1711c70e94c088248088758?color=eee&label=gpu&logo=nvidia)](https://buildkite.com/julialang/tullio-dot-jl)
[![Tag Version](https://img.shields.io/github/v/tag/mcabbott/Tullio.jl?color=red&logo=)](https://github.com/mcabbott/Tullio.jl/releases)Tullio is a very flexible einsum macro. It understands many array operations written in index notation -- not just matrix multiplication and permutations, but also convolutions, stencils, scatter/gather, and broadcasting. For example:
```julia
@tullio M[x,y,c] := N[x+i, y+j,c] * K[i,j] # sum over i,j, and create M@tullio S[x] = P[x,y] * log(Q[x,y] / R[y]) # sum over y, and write into S
@tullio A[i,j] += B[i,k,l] * C[l,j] * D[k,j] # sum over k,l, and add to values in A
@tullio (*) Z[j] := X[ind[k],j] * exp(-Y[k]) # product over k
```Used by itself the macro writes ordinary nested loops much like [`Einsum.@einsum`](https://github.com/ahwillia/Einsum.jl).
One difference is that it can parse more expressions, and infer ranges for their indices.
Another is that it will use multi-threading (via [`Threads.@spawn`](https://julialang.org/blog/2019/07/multithreading/)) and recursive tiling, on large enough arrays.
But it also co-operates with various other packages, provided they are loaded before the macro is called:* It uses [`LoopVectorization.@avx`](https://github.com/chriselrod/LoopVectorization.jl) to speed many things up. (Disable with keyword `avx=false`.) On a good day this will match the speed of OpenBLAS for matrix multiplication.
* It uses [`KernelAbstractions.@kernel`](https://github.com/JuliaGPU/KernelAbstractions.jl) to make a GPU version. (Disable with `cuda=false`.) This is somewhat experimental, and may not be fast.
The macro also tries to provide a gradient for use with [Tracker](https://github.com/FluxML/Tracker.jl) or (via [ChainRules](https://github.com/JuliaDiff/ChainRules.jl)) for [Zygote](https://github.com/FluxML/Zygote.jl), [Yota](https://github.com/dfdx/Yota.jl), etc.
(Disable with `grad=false`, or `nograd=A`.) This is done in one of two ways:* By default it takes a symbolic derivative of the right hand side expression. This works for reductions over `+` or `min`/`max`. The functions as typed must be known, mostly from [DiffRules](https://github.com/JuliaDiff/DiffRules.jl).
* The option `grad=Dual` uses instead [ForwardDiff](https://github.com/JuliaDiff/ForwardDiff.jl) to differentiate the right hand side (only for reductions over `+`). This allows for more complicated expressions.
The entire right hand side is summed over the full possible range of any indices not appearing on the left.
Pipe operators `|>` or `<|` indicate functions to be performed *outside* the sum, for example:```julia
@tullio lse[j] := log <| exp(mat[i,j]) # vec(log.(sum(exp.(mat), dims=1)))
```The option `@tullio verbose=true` will cause it to print index ranges, symbolic derivatives,
and notices when it is unable to use the packages mentioned above.
And `verbose=2` will print everything.If it's useful in academic work, you can cite it with this DOI: [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.5047409.svg)](https://doi.org/10.5281/zenodo.5047409)
## Notation
Index notation for some simple functions:
```julia
using Pkg; Pkg.add("Tullio")
using Tullio, Test
M = rand(1:20, 3, 7)@tullio S[1,c] := M[r,c] # sum over r ∈ 1:3, for each c ∈ 1:7
@test S == sum(M, dims=1)@tullio Q[ρ,c] := M[ρ,c] + sqrt(S[1,c]) # loop over ρ & c, no sum -- broadcasting
@test Q ≈ M .+ sqrt.(S)mult(M,Q) = @tullio P[x,y] := M[x,c] * Q[y,c] # sum over c ∈ 1:7 -- matrix multiplication
@test mult(M,Q) ≈ M * transpose(Q)R = [rand(Int8, 3, 4) for δ in 1:5]
@tullio T[j,i,δ] := R[δ][i,j] + 10im # three nested loops -- concatenation
@test T == permutedims(cat(R...; dims=3), (2,1,3)) .+ 10im@tullio (max) X[i] := abs2(T[j,i,δ]) # reduce using max, over j and δ
@test X == dropdims(maximum(abs2, T, dims=(1,3)), dims=(1,3))dbl!(M, S) = @tullio M[r,c] = 2 * S[1,c] # write into existing matrix, M .= 2 .* S
dbl!(M, S)
@test all(M[r,c] == 2*S[1,c] for r ∈ 1:3, c ∈ 1:7)
```More complicated examples:
```julia
using Tullio
A = [abs2(i - 11) for i in 1:21]# Downsample -- range of i is that allowed by both terms:
@tullio B[i] := (A[2i] + A[2i+1])/2 # 1:10 == intersect(1:10, 0:10)# Shifts -- range of i calculated in terms of that given for j:
@tullio M[i,j] := A[i+j-1] (j in 1:15) # i in 1:7
@tullio M[i+_,j] := A[i+j] (j in 1:15) # i in 0:6, automatic shift "i+_"using OffsetArrays # Convolve a filter:
K = OffsetArray([1,-1,2,-1,1], -2:2)
@tullio C[i] := A[i+j] * K[j] # j ∈ -2:2 implies i ∈ 3:19# Index by the values in K
@tullio D[i,j] := A[2K[j]+i] ÷ K[j] # extrema(K)==(-1,2) implies i ∈ 3:17# Wrapped & padded:
@tullio M[i,j] := A[mod(i+j)] (j in 1:15, i in 1:15) # wraps around, mod(i+j, axes(A,1))
@tullio M[i,j] := A[clamp(i+j)] (j in 1:15, i in 1:15) # instead repeats "100"
@tullio M[i+_,j] := A[pad(i+j, 3)] (j in 1:15) # fills with zerosusing FFTW # Functions of the indices are OK:
S = [0,1,0,0, 0,0,0,0]
fft(S) ≈ @tullio F[k] := S[x] * exp(-im*pi/8 * (k-1) * x) (k ∈ axes(S,1))# Finalisers <| or |> are applied after sum (the two are equivalent):
@tullio N2[j] := sqrt <| M[i,j]^2 # N2 ≈ map(norm, eachcol(M))
@tullio n3[_] := A[i]^3 |> (_)^(1/3) # n3[1] ≈ norm(A,3), with _ anon. func.# Reduction over any function:
@tullio (*) P[i] := A[i+k] (k in 0:2) # product
@tullio (max) X[i,_] := D[i,j] # maximum(D, dims=2), almostmin1(x,y) = ifelse(first(x) < first(y), x, y); # findmin(D, dims=1), almost:
@tullio (min1) Ts[j+_] := (D[i,j], (i,j)) init=(typemax(Int), (0,0))# Access to fields & arrays -- this uses j ∈ eachindex(first(N).c)
N = [(a=i, b=i^2, c=fill(i^3,3)) for i in 1:10]
@tullio T[i,j] := (N[i].a // 1, N[i].c[j])# Functions which create arrays are evaluated once:
@tullio R[i,j] := abs.((rand(Int8, 5)[i], rand(Int8, 5)[j]))using NamedDims, AxisKeys # Dimension names, plus pretty printing:
@tullio M[row=i, col=j, z=k] := A[i+j-1] (j in 1:15, k in 1:2)
@tullio S[i] := M[col=j-i, z=k, row=i+1] # sum over j,k
```## Fast & Slow
When used with LoopVectorization, on straightforward matrix multiplication of real numbers,
`@tullio` tends to be about as fast as OpenBLAS. Depending on the size, and on your computer.
Here's a speed comparison on mine: [v2.5](https://github.com/mcabbott/Tullio.jl/blob/master/benchmarks/02/matmul-0.2.5-Float64-1.5.0.png).This race is a useful diagnostic, but isn't really the goal. There is little point in avoiding
using BLAS libraries, if you want precisely what they are optimised to give you.
One of the things `@tullio` is often very fast at is weird tensor contractions,
for which you would otherwise need `permutedims`:```julia
using Tullio, LoopVectorization, NNlib, BenchmarkTools# Batched matmul, with batch index first in B:
bmm_rev(A, B) = @tullio C[i,k,b] := A[i,j,b] * B[b,k,j] # (sum over j)A = randn(20,30,500); B = randn(500,40,30);
bmm_rev(A, B) ≈ NNlib.batched_mul(A, permutedims(B, (3,2,1))) # true@btime bmm_rev($A, $B); # 317.526 μs μs, same speed as un-permuted
@btime NNlib.batched_mul($A, permutedims($B, (3,2,1))); # 1.478 ms, with MKL
```Complex numbers aren't handled by LoopVectorization, so will be much slower.
Chained multiplication is also very slow, because it doesn't know there's a better
algorithm. Here it just makes 4 loops, instead of multiplying sequentially,
`30^4` instead of `2 * 30^3` operations:```julia
M1, M2, M3 = randn(30,30), randn(30,30), randn(30,30);
@btime $M1 * $M2 * $M3; # 3.525 μs
@btime @tullio M4[i,l] := $M1[i,j] * $M2[j,k] * $M3[k,l]; # 30.401 μs
```Or slightly less obviously:
```julia
M, Σ = randn(100,100), randn(100,100);
@tullio R4[i, j] := (M[μ, i] - M[μ,j])' * Σ[μ,ν] * (M[ν, i] - M[ν, j]);
begin
S = M' * Σ * M # two N^3 operations, instead of one N^4
@tullio R3[i,j] := S[i,i] + S[j,j] - S[i,j] - S[j,i]
end;
R3 ≈ R4
```Another thing Tullio can be very fast at is broadcast reductions, where it can avoid large allocations. Here LoopVectorization is speeding up `log`, and Tullio is handling tiled memory access and multi-threading:
```julia
sum_opp(X, Y=X) = @tullio s := X[i,j] * log(Y[j,i])
sum_part(X, Y=X) = @tullio S[i] := X[i,j] * log(Y[j,i])X = rand(1000,1000);
@btime sum_opp($X) # 499.814 μs (93 allocations: 3.97 KiB)
@btime sum($X .* log.(transpose($X))) # 8.759 ms (2 allocations: 7.63 MiB)@btime sum_part($X)' # 1.599 ms (not the same computer!)
@btime sum($X .* log.(transpose($X)), dims=2) # 13.292 ms
```At present indices using `pad`, `clamp` or `mod` are also slow. These result in extra
checks or operations at every iteration, not just around the edges:```julia
conv1(x,k) = @tullio y[i+_, j+_] := x[i+a, j+b] * k[a,b]
conv2(x,k) = @tullio y[i+_, j+_] := x[2i-a, 2j-b] * k[a,b]
conv3(x,k) = @tullio y[i+_, j+_] := x[pad(i-a,3), pad(j-b,3)] * k[a,b]x100 = rand(100,100); k7 = randn(7,7);
@btime conv1($x100, $k7); # 25.574 μs
@btime conv2($x100, $k7); # 44.590 μs
@btime conv3($x100, $k7); # 86.228 μsusing Flux
x104 = reshape(x100,(100,100,1,1)); k74 = reshape(k7,(7,7,1,1));
conv1(x100, k7) ≈ @btime CrossCor($k74, false)($x104) # 586.694 μs
conv2(x100, k7) ≈ @btime Conv($k74, false, stride=2)($x104) # 901.573 μs
conv3(x100, k7) ≈ @btime Conv($k74, false, pad=3)($x104) # 932.658 μsusing DSP
@btime DSP.conv($x100, $k7); # 198.331 μs
```## Gradients & GPU
Derivatives & GPU```julia
using Tullio
mul(A, B) = @tullio C[i,k] := A[i,j] * B[j,k]A = rand(3,40); B = rand(40,500);
A * B ≈ mul(A, B) # trueusing Tracker # or Zygote
ΔA = Tracker.gradient((A,B) -> sum(mul(A, B)), A, B)[1]
ΔA ≈ ones(3,500) * B' # trueusing CUDA, KernelAbstractions # Now defined with a GPU version:
mul(A, B) = @tullio C[i,k] := A[i,j] * B[j,k]cu(A * B) ≈ mul(cu(A), cu(B)) # true
cu(ΔA) ≈ Tracker.gradient((A,B) -> sum(mul(A, B)), cu(A), cu(B))[1] # true
# Reduction over min/max:
Tracker.gradient(x -> (@tullio (max) res := x[i]^3), [1,2,3,-2,-1,3])[1]
```Some warnings are in order:
* Complete reductions to a number will not work on the GPU at present.
They were extremely slow, and a re-organisation of multi-threading for the CPU case killed them, sorry.
* Gradients are not calculated for scalars, only arrays.
Thus for example `gradient(a -> (@tullio _ := $a * A[i]), 3.14)` will be zero.
* When using `grad=Dual`, the right hand side is evaluated a second time during the backward pass.
This avoids needing memory to store partials, but if the function is expensive, it may be slow.## Larger Expressions
The expression need not be just one line, for example:
```julia
@tullio out[x, y] := @inbounds(begin # sum over k
a,b = off[k]
mat[mod(x+a), mod(y+b)]
end) (x in axes(mat,1), y in axes(mat,2)) grad=Dual nograd=off
```
Here the macro cannot infer the range of the output's indices `x,y`, so they must be provided explicitly.
(If writing into an existing array, with `out[x,y] = begin ...` or `+=`, then ranges would be taken from there.)
Because it sees assignment being made, it does not attempt to sum over `a,b`, and it assumes that indices could go out of bounds so does not add `@inbounds` for you.
(Although in fact `mod(x+a) == mod(x+a, axes(mat,1))` is safe.)
It will also not be able to take a symbolic derivative, but dual numbers will work fine.More examples:
```julia
using Tullio, OffsetArrays# A convolution with cyclic indices
mat = zeros(10,10,1); mat[2,2] = 101; mat[10,10] = 1;
@tullio kern[i,j] := 1/(1+i^2+j^2) (i in -3:3, j in -3:3)@tullio out[x,y,c] := begin
xi = mod(x+i, axes(mat,1)) # xi = ... means that it won't be summed,
# yj = mod(y+j, axes(mat,2))
@inbounds trunc(Int, mat[xi, mod(y+j), c] * kern[i,j]) # and disables automatic @inbounds,
end (x in 1:10, y in 1:10) # and prevents range of x from being inferred.# A stencil?
offsets = [(a,b) for a in -2:2 for b in -2:2 if a>=b] # vector of tuples@tullio out[x,y,1] = begin
a,b = offsets[k]
i = clamp(x+a, extrema(axes(mat,1))...)
# j = clamp(y+b, extrema(axes(mat,2))...) # can be written clamp(y+b)
@inbounds mat[i, clamp(y+b), 1] * 10
end # ranges of x,y read from out[x,y,1]# Applying a vector of functions
fs = [sin, cos, tan]
xs = randn(3,100)
@tullio ys[r,c] := (fs[r])(xs[r,c])using Zygote, ForwardDiff
rowmap(fs, xs) = @tullio ys[r,c] := (fs[r])(xs[r,c]) grad=Dual nograd=fs
Zygote.gradient(sum∘rowmap, fs, ones(3,2))
[f'(1) for f in fs] # agrees
```## Keyword Options
The default setting is:
```@tullio threads=true fastmath=true avx=true tensor=true cuda=256 grad=Base verbose=false A[i,j] := ...```
* `threads=false` turns off threading, while `threads=64^3` sets a threshold size at which to divide the work (replacing the macro's best guess).
* `avx=false` turns off the use of `LoopVectorization`, while `avx=4` inserts `@avx unroll=4 for i in ...`.
* `grad=false` turns off gradient calculation, and `grad=Dual` switches it to use `ForwardDiff` (which must be loaded).
* `nograd=A` turns of the gradient calculation just for `A`, and `nograd=(A,B,C)` does this for several arrays.
* `tensor=false` turns off the use of `TensorOperations`.
* Assignment `xi = ...` removes `xi` from the list of indices: its range is note calculated, and it will not be summed over. It also disables `@inbounds` since this is now up to you.
* `verbose=true` prints things like the index ranges inferred, and gradient calculations. `verbose=2` prints absolutely everything.
* `A[i,j] := ...` makes a new array, while `A[i,j] = ...` and `A[i,j] += ...` write into an existing one. `A[row=i, col=j] := ...` makes a new `NamedDimsArray`.
* `@tullio (*) A[i,j] := ...` is a product, as is `@tullio A[i,j] *= ...`. For other reductions, `@tullio (f) A[i,j] ^= ...` is an in-place update.
* `init=0.0` gives the initial value for reductions. For `+`, `*`, `min`, `min`, `&`, `|` it has sensible defaults, for other reductions uses zero.Implicit:
* Indices without shifts must have the same range everywhere they appear, but those with shifts (even `A[i+0]`) run over the intersection of possible ranges.
* Shifted output indices must start at 1, unless `OffsetArrays` is visible in the calling module.
* The use of `@avx`, and the calculation of gradients, are switched off by sufficiently complex syntax (such as arrays of arrays).
* Gradient hooks are attached for any or all of `ReverseDiff`, `Tracker` & `Zygote`. These packages need not be loaded when the macro is run.
* Gradients are only defined for reductions over `(+)` (default) and `min`, `max`.
* GPU kernels are only constructed when both `KernelAbstractions` and `CUDA` are visible. The default `cuda=256` is passed to `kernel(CUDA(), 256)`.
* The CPU kernels from `KernelAbstractions` are called only when `threads=false`; they are not at present very fast, but perhaps useful for testing.Extras:
* `A[i] := i^2 (i in 1:10)` is how you specify a range for indices when this can't be inferred.
* `A[i] := B[i, $col] - C[i, 2]` is how you fix one index to a constant (to prevent `col` being summed over).
* `A[i] := $d * B[i]` is the preferred way to include other constants. Note that no gradient is calculated for `d`.
* Within indexing, `A[mod(i), clamp(j)]` both maps `i` & `j` to lie within `axes(A)`, and disables inference of their ranges from `A`.
* Similarly, `A[pad(i,3)]` extends the range of `i`, inserting zeros outside of `A`. Instead of zero, `pad=NaN` uses this value as padding. The implementation of this (and `mod`, `clamp`) is not very fast at present.
* On the left, when making a new array, an underscore like `A[i+_] :=` inserts whatever shift is needed to make `A` one-based.
* `Tullio.@printgrad (x+y)*log(x/z) x y z` prints out how symbolic derivatives will be done.Macros:
* `Tullio.@tensor` is a macro which uses TensorOperations to evaluate expressions, but provides gradient definitions. (Previously this was automatic behaviour, when TensorOperations.jl was loaded & the expression was suitable.)
* `Tullio.@einsum` is a variant with a few changes, to allow the running of Einsum.jl's tests.## How it Works
The following three macros all end up calling the same functions as does `C = A * B`:
```julia
@tensor C[i,j] := A[i,k] * B[k,j] # TensorOperations.jl
@ein C[i,j] := A[i,k] * B[k,j] # OMEinsum.jl
@matmul C[i,j] := sum(k) A[i,k] * B[k,j] # TensorCast.jl
```But this one writes its own for-loops:
```julia
@einsum C[i,j] := A[i,k] * B[k,j] # Einsum.jl
```expanding out to roughly this:
```julia
T = promote_type(eltype(A), eltype(B))
C = Array{T}(undef, size(A,1), size(B,2))
@inbounds for j in 1:size(B,2)
for i in 1:size(A,1)
acc = zero(T)
for k in 1:size(A,2)
acc += A[i,k] * B[k,j]
end
C[i,j] = acc
end
end
```Tullio does something similar, but working through a few functions. Taking a slightly more complicated example, this:
```julia
@tullio C[i,j] := tanh <| A[i,k] * B[k,j]
```expands to roughly this:
```julia
function act!(::Type, C::AbstractArray{T}, A, B, ax_i, ax_j, ax_k, keep=nothing, final=true) where T
@inbounds @fastmath for i in ax_i
for j in ax_j
acc = isnothing(keep) ? zero(T) : C[i,j]
for k in ax_k
acc += A[i,k] * B[k,j]
end
C[i,j] = isnothing(final) ? acc : tanh(acc)
end
end
endfunction make(A, B)
ax_i = axes(A,1)
ax_j = axes(B,2)
ax_k = axes(A,2) # and check this is == axes(B,1)
rhs(A,B,i,j,k) = tanh(A[i,k] * B[k,j])
T = Core.Compiler.return_type(rhs, eltype.((A,B,1,1,1))) # plus a fallback
C = similar(A, T, (ax_i, ax_j))
Tullio.threader(act!, Array{T}, C, (A,B), (ax_i,ax_j), (ax_k,), +, 64^3)
return C
endC = Tullio.Eval(make, ∇make)(A, B)
```This division allows it to dispatch to other methods of `act!`: one generated with `@avx` if LoopVectorization is loaded, and one for `::CuArray` if KernelAbstractions is loaded.
It also allows `threader` to divide the work, calling `act!` many times, from different threads, on small tiles made by dividing the longest axis (say `ax_i`) in half, repeatedly. If it divides up `ax_k`, these are done sequentially, with `keep=true` on all ranges except the first, and `final=nothing` on all except the last. But `ax_i` and `ax_j` are safe to do in parallel.
Finally, `Eval` exists to give Zygote and friends somewhere to attach themselves. The gradient calculation looks roughly like this:
```julia
@adjoint function (e::Eval)(AB...)
C = e.fwd(AB...)
C, ΔC -> e.rev(ΔC, C, AB...)
endfunction ∇act!(::Type, ΔC, ΔA, ΔB, C, A, B, ax_i, ax_j, ax_k, keep)
for k in ax_k, i in ax_i, j in ax_j
ex = ΔC[i,j] * (1-C[i,j])^2
ΔA[i,k] += ex * B[k,j]
ΔB[k,j] += A[i,k] * ex
end
endfunction ∇make(ΔC, C, A, B)
ΔA = similar(A) .= 0
ΔB = similar(B) .= 0
ax_i, ax_k = axes(A); ax_j = axes(B,2)
Tullio.∇threader(∇act!, Array{T}, (ax_k,), (ax_i, ax_j), nothing)
return (ΔA, ΔB)
end
```In this case, it is the loop over `k` which can be safely broken among different threads, since both `ΔA` and `ΔB` have this index. Both `ΔA` and `ΔB` are filled in at once.
Notice that the derivative of `y = tanh(z)` has been written in terms of `y` (the final result of the forward pass) but free of `z` (the result of the sum, which was not saved). If this is not possible, it will fail.
If using `grad=Dual`, the gradient kernel looks different. This method cannot handle finalisers like `tanh` above, but for plain `@tullio C[i,j] := A[i,k] * B[k,j]` it would read:
```julia
function ∇act!(::Type, ΔC, ΔA, ΔB, C, A, B, ax_i, ax_j, ax_k, keep)
eps1 = ForwardDiff.Dual(0, (1,0))
eps2 = ForwardDiff.Dual(0, (0,1))
for k in ax_k, i in ax_i, j in ax_j
res = (A[i,k] + eps1) * (B[k,j] + eps2)
ΔA[i,k] += ForwardDiff.partials(res, 1) * ΔC[i,j]
ΔB[k,j] += ForwardDiff.partials(res, 2) * ΔC[i,j]
end
end
```Writing `@tullio verbose=2` will print all of these functions out.
Scalar reductions, such as `@tullio s := A[i,j] * log(B[j,i])`, are slightly different in that the `act!` function simply returns the sum, i.e. the variable `acc` above.
## Elsewhere
Back-end friends & relatives:
* [LoopVectorization.jl](https://github.com/chriselrod/LoopVectorization.jl) is used here, if available.
* [Gaius.jl](https://github.com/MasonProtter/Gaius.jl) and [PaddedMatrices.jl](https://github.com/chriselrod/PaddedMatrices.jl) build on that.
* [GPUifyLoops.jl](https://github.com/vchuravy/GPUifyLoops.jl) and [KernelAbstractions.jl](https://github.com/JuliaGPU/KernelAbstractions.jl) generate GPU-compatible kernels.
* [ThreadsX.jl](https://github.com/tkf/ThreadsX.jl) does threaded reductions, and much else.
* [Strided.jl](https://github.com/Jutho/Strided.jl) does multi-threaded broadcasting.
Front-end near-lookalikes:
* [Einsum.jl](https://github.com/ahwillia/Einsum.jl) makes simple loops. See [tests/einsum.jl](https://github.com/mcabbott/Tullio.jl/blob/master/test/einsum.jl) where `using Tullio: @einsum` is an almost-seamless replacement.
* [TensorOperations.jl](https://github.com/Jutho/TensorOperations.jl) and [OMEinsum.jl](https://github.com/under-Peter/OMEinsum.jl) identify patterns on which they can call various basic operations. [TensorRules.jl](https://github.com/ho-oto/TensorRules.jl) makes `@tensor` differentiable; see also [TensorGrad.jl](https://github.com/mcabbott/TensorGrad.jl) and [TensorTrack.jl](https://github.com/mcabbott/TensorTrack.jl) for earlier attempts.
* [TensorCast.jl](https://github.com/mcabbott/TensorCast.jl) expresses everything as Julia array operations, broadcasting and reduction. (OMEinsum.jl also treats some cases as a special lazy broadcast-reduction.)
Things you can't run:
* [Tortilla.jl](https://www.youtube.com/watch?v=Rp7sTl9oPNI) seems to exist, publicly, only in this very nice talk.
* [ArrayMeta.jl](https://github.com/shashi/ArrayMeta.jl) was a Julia 0.5 take on some of this.
* [Tokamak.jl](https://github.com/MikeInnes/Tokamak) was another, see [readme here](https://github.com/tkelman/Tokamak.jl).