{"id":13442701,"url":"https://github.com/marcoheisig/Petalisp","last_synced_at":"2025-03-20T15:30:33.961Z","repository":{"id":39633079,"uuid":"59042786","full_name":"marcoheisig/Petalisp","owner":"marcoheisig","description":"Elegant High Performance Computing","archived":false,"fork":false,"pushed_at":"2024-09-09T12:51:04.000Z","size":3004,"stargazers_count":467,"open_issues_count":2,"forks_count":16,"subscribers_count":30,"default_branch":"master","last_synced_at":"2024-09-09T15:27:18.037Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Common Lisp","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/marcoheisig.png","metadata":{"files":{"readme":"README.org","changelog":null,"contributing":null,"funding":null,"license":"COPYING","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-05-17T17:09:00.000Z","updated_at":"2024-09-09T12:51:08.000Z","dependencies_parsed_at":"2024-01-31T12:05:00.975Z","dependency_job_id":"18568091-f2ae-49f8-a2c5-019937f1e85c","html_url":"https://github.com/marcoheisig/Petalisp","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/marcoheisig%2FPetalisp","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/marcoheisig%2FPetalisp/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/marcoheisig%2FPetalisp/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/marcoheisig%2FPetalisp/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/marcoheisig","download_url":"https://codeload
.github.com/marcoheisig/Petalisp/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":221772537,"owners_count":16878124,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-31T03:01:49.375Z","updated_at":"2025-03-20T15:30:33.946Z","avatar_url":"https://github.com/marcoheisig.png","language":"Common Lisp","readme":"#+TITLE: Petalisp\n\nPetalisp generates high performance code for parallel computers by\nJIT-compiling array definitions.  It augments the existing general purpose\nprogramming language Common Lisp (and soon also Python) with parallelism and\nlazy evaluation.\n\n** Getting Started\n1. Install Lisp and a suitable IDE.  If unsure, pick [[https://portacle.github.io/][Portacle]].\n2. Download Petalisp via [[https://www.quicklisp.org/][Quicklisp]].\n3. Check out some of the [[file:examples][examples]].\n\n** Showcases\nPetalisp is still under development, so the following examples may still\nchange slightly. 
Nevertheless, they give a good glimpse of what programming\nwith Petalisp will be like.\n\nExample 1: transposing a matrix\n#+BEGIN_SRC lisp\n(defun lazy-transpose (A)\n  (lazy-reshape A (transform m n to n m)))\n#+END_SRC\n\nExample 2: matrix-matrix multiplication\n#+BEGIN_SRC lisp\n(defun matrix-multiplication (A B)\n  (lazy-reduce #'+\n   (lazy #'*\n    (lazy-reshape A (transform m n to n m 1))\n    (lazy-reshape B (transform n k to n 1 k)))))\n#+END_SRC\n\nExample 3: the numerical Jacobi scheme in two dimensions\n#+BEGIN_SRC lisp\n(defun lazy-jacobi-2d (grid iterations)\n  (let ((interior (interior grid)))\n    (if (zerop iterations) grid\n        (lazy-jacobi-2d\n         (lazy-overwrite\n          grid\n          (lazy #'* 0.25\n           (lazy #'+\n            (lazy-reshape grid (transform i0 i1 to (+ i0 1) i1) interior)\n            (lazy-reshape grid (transform i0 i1 to (- i0 1) i1) interior)\n            (lazy-reshape grid (transform i0 i1 to i0 (+ i1 1)) interior)\n            (lazy-reshape grid (transform i0 i1 to i0 (- i1 1)) interior))))\n         (- iterations 1)))))\n#+END_SRC\n\n** Performance\n\nAll subsequent benchmarks have been measured on an Intel i7-8750H CPU, with six\ncores running at 2.20 GHz (4.1 GHz turbocore).  It has the following\ncharacteristics:\n\n- *L3 cache* 9MB, shared across all cores\n- *L2 cache* 256kB per core\n- *L1 cache* 32kB per core\n- *SIMD* Support for AVX, AVX2 and FMA, but not AVX512\n- *Petalisp version* commit 9f9cfd6328141ba3d52a5fee1343c825f6a55bcd\n\nAll benchmark results are given in double-precision floating-point operations per\nsecond.  The reported values are averages over four seconds of repeated\nexecution.\n\n*** daxpy (single-threaded)\n\nThis first benchmark compares the performance of Petalisp and the [[https://hpc.fau.de/research/tools/likwid/][likwid\nbenchmark suite]] for a single kernel of the form $y = a x + y$, where $a$ is a\nscalar and $x$ and $y$ are vectors.  
More precisely, it compares the following\ntwo invocations:\n\n#+begin_src lisp\n(in-package #:petalisp.benchmarks)\n(with-temporary-backend (make-native-backend :threads 1)\n  (print-benchmark-table daxpy 20))\n#+end_src\n#+begin_src sh\nlikwid-bench -t daxpy_avx -w S0:SIZE:1\n#+end_src\n\n[[file:images/daxpy-serial.svg]]\n\nFor a single daxpy run, Petalisp reaches about 45-90% of the single-thread\nperformance of likwid-bench.  This picture changes if multiple runs of daxpy\nare scheduled in succession, because that gives Petalisp the chance to apply\ntemporal blocking.  In such cases, Petalisp can outperform high-quality\nsingle-pass kernels such as those of likwid.\n\n*** daxpy (multi-threaded)\n\nThese daxpy results use an experimental scheduler that is not yet upstream.\nThe upstream scheduler doesn't yet parallelize a single daxpy sweep because it\noperates directly on an input array rather than on partitioned data (stay tuned\nfor future updates!).  Using the experimental scheduler and 6 threads, Petalisp\nreaches 32% to 70% of the performance of the daxpy version of likwid-bench.  A\npossible explanation for the remaining performance difference is the multiple\nthread barriers for synchronization that Petalisp doesn't (yet) optimize away.\n\n[[file:images/daxpy-parallel.svg]]\n\n/Benchmark code:/\n\n#+begin_src lisp\n(in-package #:petalisp.benchmarks)\n(with-temporary-backend (make-native-backend :threads 6)\n  (print-benchmark-table daxpy 20))\n#+end_src\n#+begin_src sh\nlikwid-bench -t daxpy_avx -w S0:SIZE:6\n#+end_src\n\n*** dgemm (single-threaded)\n\nMatrix-matrix multiplication is an interesting case for Petalisp because it\ndoesn't have a built-in reduction operator.  
Instead, Petalisp treats\nreductions naively as repeated summations of all odd and even elements.\nAlthough there are optimizations in place for such repeated operations,\nright now the multiplication of an $m \\times n$ and an $n \\times k$ matrix requires $(m \\times\nn \\times k) / 8$ elements of auxiliary storage.  Consequently, the performance of\nmatrix-matrix multiplication of square matrices is much lower than that of,\nsay, [[https://github.com/OpenMathLib/OpenBLAS][OpenBLAS]] (this issue is being worked on):\n\n[[file:images/dgemm.svg]]\n\n/Benchmark code:/\n\n#+begin_src lisp\n(in-package #:petalisp.benchmarks)\n(with-temporary-backend (make-native-backend :threads 1)\n  (print-benchmark-table dgemm 20))\n#+end_src\n\n\nHowever, this picture changes for skinny matrices. For the multiplication of\nskinny matrices of the form Nx8 @ 8xK, the single-thread performance of\nPetalisp is significantly higher than that of [[https://github.com/OpenMathLib/OpenBLAS][OpenBLAS]]:\n\n[[file:images/dgemm-skinny.svg]]\n\n/Benchmark code:/\n\n#+begin_src lisp\n(in-package #:petalisp.benchmarks)\n(with-temporary-backend (make-native-backend :threads 1)\n  (print-benchmark-table dgemm-n=8 20))\n#+end_src\n\nOpenBLAS underperforms in this setting because it has no special handling of\nskinny matrices.  Petalisp overperforms because it can generate\nspecialized code and, since the reduction loop is short enough, its buffer\npruning technique can eliminate all intermediate storage.\n\n*** Jacobi's Method\n\nFor Jacobi's method in two dimensions, Petalisp achieves between 36% and 89% of\nthe single-core performance of auto-vectorized C++ code.  
The parallel\nperformance is not yet on par with OpenMP-parallelized C++ code, mainly because\nPetalisp's automatic parallel scheduler is very recent and contains a\nsuboptimal synchronization mechanism.\n\n[[file:images/jacobi.svg]]\n\n/Benchmark code:/\n\n#+begin_src lisp\n(in-package #:petalisp.benchmarks)\n(loop for threads from 1 to 6 do\n  (with-temporary-backend (make-native-backend :threads threads)\n    (print-benchmark-table stencil-jacobi-2d 20)))\n#+end_src\n\n*** Red-Black Gauss-Seidel Method\n\nThe Red-Black Gauss-Seidel method differs from Jacobi's method in that it\ntouches elements in a chessboard-like pattern, with two sweeps over the domain\nper iteration.  This results in a more complicated data-flow graph.\nNevertheless, the measured performance is quite similar to that of Jacobi's\nmethod, apart from the cost of having to traverse the domain twice.\n\n[[file:images/rbgs.svg]]\n\n/Benchmark code:/\n\n#+begin_src lisp\n(in-package #:petalisp.benchmarks)\n(loop for threads from 1 to 6 do\n  (with-temporary-backend (make-native-backend :threads threads)\n    (print-benchmark-table rbgs 20)))\n#+end_src\n\n*** Multigrid V-cycle\n\nA [[https://en.wikipedia.org/wiki/Multigrid_method][Multigrid V-Cycle]] combines several numerical primitives to solve partial\ndifferential equations efficiently.  It contains stencils for smoothing\nhigh-frequency components of a grid, interpolation and prolongation for\ntransferring data between smaller and larger grids, and calculations of the\nresidual on each grid level.  
Despite these complexities, Petalisp achieves\ndecent floating-point performance and even a modest parallel speedup:\n\n[[file:images/multigrid-v-cycle.svg]]\n\n/Benchmark code:/\n\n#+begin_src lisp\n(in-package #:petalisp.benchmarks)\n(loop for threads from 1 to 6 do\n  (with-temporary-backend (make-native-backend :threads threads)\n    (print-benchmark-table multigrid-v-cycle 20)))\n#+end_src\n\n** Frequently Asked Questions\n\n*** Is Petalisp similar to NumPy?\nNumPy is a widely used Python library for scientific computing on arrays.\nIt provides powerful N-dimensional arrays and a variety of functions for\nworking with these arrays.\n\nPetalisp works on a more fundamental level.  It provides even more powerful\nN-dimensional arrays, but just a few building blocks for working on them -\nelement-wise function application, reduction, reshaping and array fusion.\n\nSo Petalisp is not a substitute for NumPy.  However, it could be used to\nwrite a library that behaves like NumPy, but that is much faster and fully\nparallelized.  In fact, writing such a library is one of my future goals.\n\n*** Do I have to program Lisp to use Petalisp?\nNot necessarily.  Not everyone has the time to learn Common Lisp.  That is\nwhy I am also working on some [[https://github.com/marcoheisig/petalisp-for-python][convenient Python bindings]] for Petalisp.\n\nBut: If you ever have time to learn Lisp, do it!  
It is an enlightening\nexperience.\n\n*** How can I get Emacs to indent Petalisp code nicely?\n\nPut the following code in your initialization file:\n\n#+begin_src elisp\n(put 'lazy 'common-lisp-indent-function '(1 \u0026rest 1))\n(put 'lazy-reduce 'common-lisp-indent-function '(1 \u0026rest 1))\n(put 'lazy-multiple-value 'common-lisp-indent-function '(1 1 \u0026rest 1))\n(put 'lazy-reshape 'common-lisp-indent-function '(1 \u0026rest 1))\n#+end_src\n\n*** Why is Petalisp licensed under AGPL?\nI am aware that this license prevents some people from using or\ncontributing to this piece of software, which is a shame. But unfortunately\nthe majority of software developers have not yet understood that\n\n1. In a digital world, free software is a necessary prerequisite for a free\n   society.\n2. When developing software, open collaboration is way more efficient than\n   competition.\n\nSo as long as distribution of non-free software is socially accepted,\ncopyleft licenses like the AGPL seem to be the lesser evil.\n\nThat being said, I am willing to discuss relicensing on an individual\nbasis.\n\n*** Why is Petalisp written in Common Lisp?\nI couldn't wish for a better tool for the job. Common Lisp is extremely\nrich in features, standardized, fast, safe and mature. 
The Lisp community\nis amazing and there are excellent libraries for almost every imaginable\ntask.\n\nTo illustrate why Lisp is particularly well suited for a project like\nPetalisp, consider the following implementation of a JIT-compiler for\nmapping a function over a vector of a certain element type:\n\n#+BEGIN_SRC lisp\n(defun vector-mapper (element-type)\n  (compile nil `(lambda (fn vec)\n                  (declare (function fn)\n                           (type (simple-array ,element-type (*)) vec)\n                           (optimize (speed 3) (safety 0)))\n                  (loop for index below (length vec) do\n                    (symbol-macrolet ((elt (aref vec index)))\n                      (setf elt (funcall fn elt)))))))\n#+END_SRC\n\nNot only is this JIT-compiler just 8 lines of code, it is also 20 times\nfaster than invoking GCC or Clang on a roughly equivalent piece of C code.\n","funding_links":[],"categories":["Common Lisp","Python ##","Interfaces to other package managers"],"sub_categories":["Third-party APIs"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmarcoheisig%2FPetalisp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmarcoheisig%2FPetalisp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmarcoheisig%2FPetalisp/lists"}