{"id":17056530,"url":"https://github.com/fuzzypixelz/parallelfft","last_synced_at":"2026-01-04T17:38:38.258Z","repository":{"id":68816458,"uuid":"592947936","full_name":"fuzzypixelz/ParallelFFT","owner":"fuzzypixelz","description":null,"archived":false,"fork":false,"pushed_at":"2023-01-25T04:02:46.000Z","size":203,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-01-28T13:29:32.291Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Futhark","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/fuzzypixelz.png","metadata":{"files":{"readme":"README.org","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-01-24T21:50:12.000Z","updated_at":"2023-01-25T00:49:30.000Z","dependencies_parsed_at":null,"dependency_job_id":"748220c4-afac-4178-b9d5-141e34d17c61","html_url":"https://github.com/fuzzypixelz/ParallelFFT","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fuzzypixelz%2FParallelFFT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fuzzypixelz%2FParallelFFT/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fuzzypixelz%2FParallelFFT/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fuzzypixelz%2FParallelFFT/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/fuzzypixelz","download_url":"https://codeload.github.com/fuzzypixelz/ParallelFFT/tar.gz/refs/head
s/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245066495,"owners_count":20555402,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-14T10:24:46.626Z","updated_at":"2026-01-04T17:38:38.208Z","avatar_url":"https://github.com/fuzzypixelz.png","language":"Futhark","funding_links":[],"categories":[],"sub_categories":[],"readme":"* Introduction\n\nTo quote the official website:\n\n#+begin_quote\nFuthark is a small programming language designed to be compiled to efficient\nparallel code. It is a statically typed, data-parallel, and purely functional\narray language in the ML family, and comes with a heavily optimizing\nahead-of-time compiler that presently generates either GPU code via CUDA and\nOpenCL, or multi-threaded CPU code.\n#+end_quote\n\nI sought to write idiomatic Futhark code without digging too much into compiler\ninternals. This allowed me to see the performance one could expect from Futhark\nwithout knowing exactly how code is executed.\n\nThroughout the project, my model of Futhark semantics was that of\nMapReduce. 
This is why I avoided algorithms with explicit memory operations and\nsearched for methods to calculate the Fourier Transform using only parallel\narray operations.\n\n* Fourier Transform\n\n(If you're reading this on GitHub, kindly refer to the PDF file instead; many\nformulas are not rendered correctly)\n\n** Discrete Fourier Transform\n\nFor any array of complex numbers $\\bf a$ of length $N$, its Discrete Fourier\nTransform (DFT) is defined component-wise as:\n\n$$\\hat{a}_k = \\sum_{i=0}^{N-1}{a_{i}{\\zeta}^{ik}}$$\n\nwhere\n\n$$\\hat{\\bf a} = (\\hat{a}_0, \\dots, \\hat{a}_{N-1})$$\n\nand $\\zeta$ is a principal $N$-th root of unity in the ring of complex numbers.\n\n** Parallel Fast Fourier Transform\n\nThe following algorithm is described in the [[https://dl.acm.org/doi/10.1145/2331684.2331693][MapReduce-SSA]] paper by Tsz-Wo Sze,\nwhere a MapReduce-friendly, parallel and relatively simple FFT algorithm is\nneeded to perform large integer multiplication. The one difference is that the\npaper applies the FFT in the ring of integers modulo $2^n + 1$ while I work with\ncomplex numbers. I will avoid going into the tedious proof.\n\nIf we write $N = PQ$ for some positive integers $P$ and $Q$ (in the code I assume $N$ is\na perfect square) we can compute the Fourier Transform of $\\bf a$ as:\n\n  1. $P$ DFTs of $Q$-point arrays, in parallel\n  2. then, $Q$ DFTs of $P$-point arrays, in parallel\n\nSpecifically, for all $0\\le p \u003c P$, write:\n\n$${\\bf{a}}^{(p)} = (a_p, \\dots, a_{(Q-2)P+p}, a_{(Q-1)P+p})$$\n\nIn the code, these are referred to as =aslices=. The DFTs of these slices are the $P$ DFTs of\n$Q$ points needed; they can be computed in parallel because they do not depend on one another.\n\nNext, for all $0\\le q \u003c Q$, we define:\n\n$${\\bf{z}}^{[q]} = (z_{qP}, \\dots, z_{qP+(P-2)}, z_{qP+(P-1)})$$\n\nwhere\n\n$$z_{qP+p} = \\zeta^{pq} \\widehat{a^{(p)}}_q$$\n\nThis corresponds to =zslices=. 
Likewise, these are the remaining $Q$ DFTs.\n\nFinally, we get for all $p$ and $q$:\n\n$$\\hat{a}_{pQ+q} = \\widehat{z^{[q]}}_p$$\n\n** Implementation\n\nAt first, I naively wrote two separate functions to compute =aslices= and =zslices=\nrespectively:\n\n#+begin_src ml\n  def aslice [n] (f: factorize) (a: [n]complex.complex) (p: i64) =\n    let (p_max, q_max) = f n\n    in map (\\q -\u003e a[q * p_max + p]) (0..\u003cq_max)\n\n  def zslice [n] (f: factorize) (a: [n]complex.complex) (q: i64) =\n    let (p_max, _) = f n\n    let root' p q = root n complex.** (complex.mk_re (f64.i64 (p * q)))\n    in map (\\p -\u003e root' p q complex.* dft (aslice f a p) q) (0..\u003cp_max)\n#+end_src\n\nThe problem with this was that dependencies were not computed in the correct\norder, but were instead recomputed several times in =zslice=. This is an example of how\nthe Futhark compiler isn't a magically parallelizing tool; care should be taken\nto ensure that the parallel parts of one's algorithm map to the few parallel\noperations: =map=, =reduce=, =scatter=, ...\n\nAside from this, Futhark /feels/ like any other ML-style language in that many\nfamiliar idioms transfer naturally (e.g. higher-order functions).\n\n* Results\n\nHaving had some difficulties running OpenCL/CUDA on my machine (complete lack of\nsupport), and on Grid'5000 (random exceptions from the Futhark runtime), I\npresent benchmarks only from the Multicore and Sequential backends.\n\nThe =bench-fft= benchmark uses an existing FFT implementation by the developers of\nFuthark; it uses the Stockham algorithm and is generally more optimized than\nmine.\n\n** Multicore backend\n\n#+begin_src console\nlib/github.com/fuzzypixelz/fft/bench-dft.fut:\n#0 (\"4i32\"):        249μs (95% CI: [     248.0,      250.9])\n#1 (\"5i32\"):       4187μs (95% CI: [    4124.1,     4269.4])\n#2 (\"6i32\"):      59851μs (95% CI: [   58871.0,    63095.1])\n\nlib/github.com/fuzzypixelz/fft/bench-fft.fut:\n#0 (\"4i32\"):         72μs (95% CI: [     
 71.8,       71.9])\n#1 (\"5i32\"):        356μs (95% CI: [     355.6,      356.2])\n#2 (\"6i32\"):       2808μs (95% CI: [    2795.5,     2824.5])\n\nlib/github.com/fuzzypixelz/fft/bench-parallel-dft.fut:\n#0 (\"4i32\"):        121μs (95% CI: [     120.7,      122.5])\n#1 (\"5i32\"):        576μs (95% CI: [     572.4,      578.9])\n#2 (\"6i32\"):       3960μs (95% CI: [    3936.6,     3985.5])\n#+end_src\n\n** Sequential C backend\n\n#+begin_src console\nlib/github.com/fuzzypixelz/fft/bench-dft.fut:\n#0 (\"4i32\"):        569μs (95% CI: [     568.4,      570.2])\n#1 (\"5i32\"):      15077μs (95% CI: [   15040.9,    15125.7])\n#2 (\"6i32\"):     242988μs (95% CI: [  242339.8,   244602.9])\n\nlib/github.com/fuzzypixelz/fft/bench-fft.fut:\n#0 (\"4i32\"):         52μs (95% CI: [      52.4,       52.6])\n#1 (\"5i32\"):        274μs (95% CI: [     273.8,      274.4])\n#2 (\"6i32\"):       2836μs (95% CI: [    2829.2,     2845.0])\n\nlib/github.com/fuzzypixelz/fft/bench-parallel-dft.fut:\n#0 (\"4i32\"):        105μs (95% CI: [     105.1,      105.3])\n#1 (\"5i32\"):        795μs (95% CI: [     793.0,      797.7])\n#2 (\"6i32\"):       7151μs (95% CI: [    7136.9,     7164.9])\n#+end_src\n\n* Conclusion\n\nThe above results show that Futhark is quite useful when the problem can be\ndecomposed neatly into parallelizable array operations. Thanks to its familiar\nsyntax, one could get pretty far without really understanding its\nsemantics. Still, the Futhark compiler is not a parallelizing compiler and one\nshould be explicit about which operations would be done in parallel.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffuzzypixelz%2Fparallelfft","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffuzzypixelz%2Fparallelfft","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffuzzypixelz%2Fparallelfft/lists"}