{"id":13736680,"url":"https://github.com/mratsim/laser","last_synced_at":"2025-04-08T03:12:41.660Z","repository":{"id":38050439,"uuid":"152870087","full_name":"mratsim/laser","owner":"mratsim","description":"The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, JIT Assembler, CPU detection, state-of-the-art vectorized BLAS for floats and integers","archived":false,"fork":false,"pushed_at":"2024-01-04T19:14:57.000Z","size":3827,"stargazers_count":285,"open_issues_count":19,"forks_count":14,"subscribers_count":13,"default_branch":"master","last_synced_at":"2025-03-30T02:03:01.411Z","etag":null,"topics":["assembler","blas","compiler-optimization","convolution","deep-learning","gemm","high-performance-computing","jit","matrix-multiplication","openmp","parallel","runtime-cpu-detection","simd","tensor"],"latest_commit_sha":null,"homepage":"","language":"Nim","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mratsim.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-10-13T12:32:46.000Z","updated_at":"2025-02-11T12:32:44.000Z","dependencies_parsed_at":"2025-01-15T10:12:53.337Z","dependency_job_id":"cb179fbe-39bd-4449-8ac6-04f299f40336","html_url":"https://github.com/mratsim/laser","commit_stats":null,"previous_names":["numforge/laser"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mratsim%2Flaser","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mratsim%
2Flaser/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mratsim%2Flaser/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mratsim%2Flaser/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mratsim","download_url":"https://codeload.github.com/mratsim/laser/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247312077,"owners_count":20918344,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["assembler","blas","compiler-optimization","convolution","deep-learning","gemm","high-performance-computing","jit","matrix-multiplication","openmp","parallel","runtime-cpu-detection","simd","tensor"],"created_at":"2024-08-03T03:01:26.449Z","updated_at":"2025-04-08T03:12:41.639Z","avatar_url":"https://github.com/mratsim.png","language":"Nim","funding_links":[],"categories":["Algorithms"],"sub_categories":["Deep Learning"],"readme":"# Laser - Primitives for high performance computing\n\nCarefully-tuned primitives for running tensor and image-processing code\non CPU, GPUs and accelerators.\n\nThe library is in heavy development. 
For now the CPU backend is being optimised.\n\n## Library content\n\n\u003c!-- TOC --\u003e\n\n- [Laser - Primitives for high performance computing](#laser---primitives-for-high-performance-computing)\n  - [Library content](#library-content)\n    - [SIMD intrinsics for x86 and x86-64](#simd-intrinsics-for-x86-and-x86-64)\n    - [OpenMP templates](#openmp-templates)\n    - [`cpuinfo` for runtime CPU feature detection for x86, x86-64 and ARM](#cpuinfo-for-runtime-cpu-feature-detection-for-x86-x86-64-and-arm)\n    - [JIT Assembler](#jit-assembler)\n    - [Loop-fusion and strided iterators for matrix and tensors](#loop-fusion-and-strided-iterators-for-matrix-and-tensors)\n    - [Raw tensor type](#raw-tensor-type)\n    - [Optimised floating point parallel reduction for sum, min and max](#optimised-floating-point-parallel-reduction-for-sum-min-and-max)\n    - [Optimised logarithmic, exponential, tanh, sigmoid, softmax ...](#optimised-logarithmic-exponential-tanh-sigmoid-softmax)\n    - [Optimised transpose, batched transpose and NCHW \u003c=\u003e NHWC format conversion](#optimised-transpose-batched-transpose-and-nchw--nhwc-format-conversion)\n    - [Optimised strided Matrix-Multiplication for integers and floats](#optimised-strided-matrix-multiplication-for-integers-and-floats)\n      - [In the future](#in-the-future)\n        - [Operation fusion](#operation-fusion)\n        - [Pre-packing](#pre-packing)\n        - [Batched matrix multiplication](#batched-matrix-multiplication)\n        - [Small matrix multiplication](#small-matrix-multiplication)\n    - [Optimised convolutions](#optimised-convolutions)\n    - [State-of-the art random distributions and weighted random sampling](#state-of-the-art-random-distributions-and-weighted-random-sampling)\n  - [Usage \u0026 Installation](#usage--installation)\n  - [License](#license)\n\n\u003c!-- /TOC --\u003e\n\n### SIMD intrinsics for x86 and x86-64\n```Nim\nimport laser/simd\n```\n\nLaser includes a wrapper for x86 and x86-64 
to operate on 128-bit (SSE) and 256-bit (AVX) vectors of floats and integers. SIMD intrinsics are added on an as-needed basis for Laser optimisation needs.\n\n### OpenMP templates\n```Nim\nimport laser/openmp\n```\n\nLaser includes several OpenMP templates to ease data-parallel programming in Nim:\n  - The simple omp parallel for loops\n  - Splitting into chunks and having a per-thread ptr+len pair to parallelize algorithms that take a ptr+len\n  - `omp parallel`, `omp critical`, `omp master`, `omp barrier` and `omp flush` for fine-grained control over parallelism\n  - `attachGC` and `detachGC` if you need to use Nim GC-ed types in a non-master thread.\n\nExamples:\n  - [ex02_omp_parallel_for.nim](./examples/ex02_omp_parallel_for.nim)\n  - [ex03_omp_parallel_chunks](./examples/ex03_omp_parallel_chunks.nim)\n\n### `cpuinfo` for runtime CPU feature detection for x86, x86-64 and ARM\n\n```Nim\nimport laser/cpuinfo\n```\n\nLaser includes a wrapper for [`cpuinfo`](https://github.com/pytorch/cpuinfo) by Facebook's PyTorch team.\nThis allows you to query runtime information about CPU SIMD capabilities and various L1, L2, L3, L4 CPU cache sizes\nto optimize your compute-bound algorithms.\n\nExample: [ex01_cpuinfo.nim](./examples/ex01_cpuinfo.nim)\n\n### JIT Assembler\n\n```Nim\nimport laser/photon_jit\n```\n\nLaser offers its own JIT assembler with features being added on an as-needed basis.\nIt is very lightweight and easy to extend. 
Currently it only supports x86-64 with [the following\nopcodes](./laser/photon_jit/x86_64/x86_64_ops.nim).\n\nExamples:\n  - [ex06_jit_hello_world.nim](./examples/ex06_jit_hello_world.nim)\n  - [ex07_jit_brainfuck_vm.nim](./examples/ex07_jit_brainfuck_vm.nim)\n\n### Loop-fusion and strided iterators for matrix and tensors\n\n```Nim\nimport laser/strided_iteration/foreach\nimport laser/strided_iteration/foreach_staged\n```\n\nUsage - forEach:\n\n```Nim\nforEach x in a, y in b, z in c:\n  x += y * z\n```\n\nLaser includes optimised macros to iterate on contiguous and strided tensors.\nThe iterators work with normal Nim syntax and are parallelized via OpenMP when it makes sense.\n\nAny tensor type works as long as it exposes the following interface:\n  - rank: the number of dimensions\n  - size: the number of elements in the tensor\n  - shape, strides: a container that supports `[]` indexing\n  - unsafe_raw_data: a routine that returns\n    a `ptr UncheckedArray[T]` or\n    any type with `[]` indexing implemented, including mutable indexing.\n\nAn advanced iterator `forEach_staged` provides a lot of flexibility to deal with advanced needs, for example for parallel reduction:\n\n```Nim\nproc reduction_localsum_critical[T](x, y: Tensor[T]): T =\n  forEachStaged xi in x, yi in y:\n    openmp_config:\n      use_openmp: true\n      use_simd: false\n      nowait: true\n      omp_grain_size: OMP_MEMORY_BOUND_GRAIN_SIZE\n    iteration_kind:\n      {contiguous, strided} # Default, \"contiguous\", \"strided\" are also possible\n    before_loop:\n      var local_sum = 0.T\n    in_loop:\n      local_sum += xi + yi\n    after_loop:\n      omp_critical:\n        result += local_sum\n```\n\nExamples:\n  - ex04 - TODO\n  - [ex05_tensor_parallel_reduction](./examples/ex05_tensor_parallel_reduction.nim)\n\nBenchmarks:\n  - [iter_bench.nim](./benchmarks/loop_iteration/iter_bench.nim)\n  - [iter_bench_prod.nim](./benchmarks/loop_iteration/iter_bench_prod.nim)\n\n### Raw tensor 
type\n\n```Nim\nimport laser/tensor/[datatypes, allocator, initialization] # WIP\n```\n\nLaser includes a low-level tensor type with only the low-level allocation and initialization needed:\n  - Aligned allocator\n  - Parallel zero-ing and copy (deep copy, copy from a seq)\n  - Metadata initialisation\n  - Tensor raw data access via pointers relies on the Nim compiler as a safeguard.\n    Immutable objects return a `RawImmutablePtr`\n    and mutable objects return a `RawMutablePtr`\n    to prevent you from accidentally modifying an immutable object when accessing raw memory.\n\nAn example of how to use that to build higher-level `newTensor` or `randomTensor`, `transpose` and `[]` is given in the `iter_bench` in the previous section.\n\n### Optimised floating point parallel reduction for sum, min and max\n\n```Nim\nimport laser/primitives/reductions\n```\n\nFloating-point reductions are not optimised by compilers by default because they can't assume that\n`result = (a+b) + c` is equivalent to `result = a + (b + c)` due to how floating-point rounding works.\nThis forces serial evaluation of reductions unless the `-ffast-math` flag is passed to the compiler.\n\nThe primitives work around that by keeping several accumulators in parallel to avoid waiting for a previous serial evaluation. 
This allows those kernels to maximise the memory bandwidth of your computer.\n\nBenchmarks:\n  - [reduction_packed_sse](./benchmarks/fp_reduction_latency/reduction_packed_sse.nim)\n\n### Optimised logarithmic, exponential, tanh, sigmoid, softmax ...\n\nIn heavy development.\n\nUnfortunately the default logarithm and exponential functions included in C and C++ standard \\\u003cmath.h\\\u003e library are extremely slow.\n\nBenchmarks show that a 10x speed improvement is possible while keeping excellent accuracy.\n\nBenchmarks:\n  - [bench_exp](./benchmarks/vector_math/bench_exp.nim)\n  - [bench_exp_avx2](./benchmarks/vector_math/bench_exp_avx2.nim)\n\n### Optimised transpose, batched transpose and NCHW \u003c=\u003e NHWC format conversion\n\n```Nim\nimport laser/primitives/swapaxes\n```\n\nWhile logical transpose (just swapping the `shape` and `strides` metadata of the tensor/matrix) is often enough, we sometimes might need to transpose data physically in-memory.\n\nLaser provides optimised routines for physical transpose, batched transpose (N matrices) and also transposition of images from and to NCHW and NHWC i.e. 
[Image id, Color, Height, Width] and [Image id, Height, Width, Color].\n\n90% of ML libraries including Nvidia's CuDNN prefer to work in NCHW while often images are decoded in HWC.\n\nBenchmarks:\n  - [transpose_bench](./benchmarks/transpose/transpose_bench.nim)\n\n### Optimised strided Matrix-Multiplication for integers and floats\n\n```Nim\nimport laser/primitives/matrix_multiplication/gemm\n```\n\nMatrix multiplication is at the base of Machine Learning and numerical computing.\n\nThe Dense/Linear/Affine layer of a neural network is just a matrix multiplication, and convolutions are often reframed into matrix multiplication to use the 20 years of optimisation research gone into [BLAS](https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms) libraries.\n\nLaser implements its own multithreaded BLAS with the following details:\n\n  - It reaches 98% of OpenBLAS speed on float64 when multithreaded and 102% when single-threaded\n  - It reaches 97% of OpenBLAS speed on float32 when multithreaded and 99% when single-threaded\n  - It supports strided matrices, for example resulting from slicing every 2 rows\n    or every 2 columns: `myTensor[0::2, :]`.\n    This is very useful when doing cross-validation as you don't need an extra copy before matrix-multiplication.\n  - Contrary to 99% of the BLAS out there, it supports integers: `int32` and `int64` using SSE2 or AVX2 instructions\n  - Extending support to new SIMD instruction sets, including ARM Neon and AVX512, is very easy, and adding a software fallback is easy as well. 
For example this is how to [add AVX2 int32](./laser/primitives/matrix_multiplication/gemm_ukernel_avx2.nim) support with an unfused multiply-add fallback in place of fused multiply-add:\n    ```Nim\n    template int32x8_muladd_unfused_avx2(a, b, c: m256i): m256i =\n      mm256_add_epi32(mm256_mullo_epi32(a, b), c)\n\n    ukernel_generator(\n          x86_AVX2,\n          typ = int32,\n          vectype = m256i,\n          nb_scalars = 8,\n          simd_setZero = mm256_setzero_si256,\n          simd_broadcast_value = mm256_set1_epi32,\n          simd_load_aligned = mm256_load_si256,\n          simd_load_unaligned = mm256_loadu_si256,\n          simd_store_unaligned = mm256_storeu_si256,\n          simd_mul = mm256_mullo_epi32,\n          simd_add = mm256_add_epi32,\n          simd_fma = int32x8_muladd_unfused_avx2\n        )\n    ```\n\n#### In the future\n\n##### Operation fusion\n\nThe BLAS will allow easily fusing unary operations (like `max/relu`, `tanh` or `sigmoid`) and binary operations (like adding a bias) at the end of the matrix multiplication kernels.\n\nAs those operations are memory-bound and not compute-bound, and for matrix multiplication we already have all the data in memory (in the unary case) or half the data (in the binary case), we basically save lots by not looping once again on the matrix to apply them.\n\nSimilarly, you will be able to fuse operations before the matrix multiplication kernel, during the prepacking when data is being re-ordered for high performance processing. This will be useful\nfor backward propagation when before each matrix multiplication we must apply the derivatives of `relu`, `tanh` and `sigmoid`.\n\n##### Pre-packing\n\nPre-packing matrices and working on pre-packed matrices are also being added. 
This is useful for matrices that are being used repeatedly, for example for batched matrix multiplication.\n\nAn `im2col` prepacker that fuses the `convolution-\u003ematrix multiplication` (im2col) step with the matrix multiplication packing is also planned to get very efficient convolutions.\n\n##### Batched matrix multiplication\n\nWe often have to perform batched matrix multiplication, for example N tensors A multiplied by a tensor B, or N tensors A multiplied by N tensors B. This is planned.\n\n##### Small matrix multiplication\n\nIn many cases we don't deal with 1000x1000 matrices. For example the traditional image size is 224x224 and the overhead to re-pack matrices in an efficient format is not justified.\n\nWhen reframing convolutions in terms of matrix multiplication this is even worse as the main convolution kernels are 1x1, 3x3, 5x5.\n\nOptimised small matrix-multiplication is planned.\n\n### Optimised convolutions\n\nIn heavy development.\n\nBenchmarks:\n  - [conv2D_bench](./benchmarks/convolution/conv2d_bench.nim)\n\n### State-of-the art random distributions and weighted random sampling\n\nIn heavy development.\n\nBenchmarks of multinomial sampling for Natural Language Processing and Reinforcement Learning:\n  - [bench_multinomial_samplers](./benchmarks/random_sampling/bench_multinomial_sampler)\n\n## Usage \u0026 Installation\n\nThe library is split into relatively independent modules that can be used without the others.\n\nFor example to just use the SIMD and cpu-detection portion, just do:\n\n```Nim\nimport laser/simd\nimport laser/cpuinfo\n```\n\nTo just use OpenMP:\n\n```Nim\nimport laser/openmp\n```\n\nThe library is unstable and will be published on nimble when more mature.\nBasically it will be published when it's ready to be the CPU backend of [Arraymancer](https://github.com/mratsim/Arraymancer),\nat which point it will automatically profit from the dozens of tests and edge cases handled in the Arraymancer test suite.\n\n## License\n\n* Laser is licensed under the Apache 
License version 2\n* Facebook's cpuinfo is licensed under Simplified BSD (BSD 2 clauses)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmratsim%2Flaser","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmratsim%2Flaser","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmratsim%2Flaser/lists"}