{"id":15015390,"url":"https://github.com/chrisarg/perlassembly","last_synced_at":"2025-04-09T19:32:22.908Z","repository":{"id":244679218,"uuid":"815944351","full_name":"chrisarg/perlAssembly","owner":"chrisarg","description":"Examples of using Perl to augment NASM and vice versa","archived":false,"fork":false,"pushed_at":"2024-10-21T01:03:46.000Z","size":1969,"stargazers_count":11,"open_issues_count":0,"forks_count":0,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-04-03T05:06:54.754Z","etag":null,"topics":["assembly","educational-project","fafo","perl"],"latest_commit_sha":null,"homepage":"","language":"Perl","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/chrisarg.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-16T15:54:40.000Z","updated_at":"2024-10-29T06:32:54.000Z","dependencies_parsed_at":"2024-06-18T05:03:21.403Z","dependency_job_id":"dcb7468a-8e1a-4037-acb0-cb30544d2554","html_url":"https://github.com/chrisarg/perlAssembly","commit_stats":null,"previous_names":["chrisarg/perlassembly"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chrisarg%2FperlAssembly","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chrisarg%2FperlAssembly/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chrisarg%2FperlAssembly/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chrisarg%2FperlAssembly/manifests","owner_url":"https://repos.ecosyste.ms/api/v
1/hosts/GitHub/owners/chrisarg","download_url":"https://codeload.github.com/chrisarg/perlAssembly/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248097958,"owners_count":21047343,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["assembly","educational-project","fafo","perl"],"created_at":"2024-09-24T19:47:18.956Z","updated_at":"2025-04-09T19:32:22.882Z","avatar_url":"https://github.com/chrisarg.png","language":"Perl","funding_links":[],"categories":[],"sub_categories":[],"readme":"# perlAssembly\n\nThis is probably one of the things that should never be allowed to exist, but why not use Perl and its capabilities to inline foreign code, to FAFO with assembly without a build system? Everything in a single file! In the process one may find ways to use Perl to enhance NASM and vice versa. But for now, I make no such claims : I am just using the perlAssembly git repo to illustrate how one can use Perl to drive (and learn to code!) assembly programs from a single file. 
\n\n## x86-64 examples\n\n### Adding Two Integers\n##### Script: addIntegers.pl\nSimple integer addition in Perl - this is the Hello World version of this git repo\n\n### The sum of an array of integers\n##### Scripts: addArrayofIntegers.pl \u0026 addArrayOfIntegers\\_C.pl\nExplore multiple equivalent ways to add *large* arrays of short integers (-100 to 100 in this implementation) in Perl:\n* ASM\\_blank : tests the speed of calling ASM from Perl (no computations are done)\n* ASM : passes the integers as bytes and then uses conversion operations and scalar floating point addition\n* ASM\\_doubles : passes the array as a packed string of doubles and does scalar double-precision floating point addition in assembly\n* ASM\\_doubles\\_AVX: passes the array as a packed string of doubles and does packed floating point addition in assembly\n* ForLoop : standard for loop in Perl\n* ListUtil: sum function from list utilities\n* PDL : uses summation in PDL\n\nScenarios marked w\\_alloc allocate memory for each iteration to test the speed of pack; those marked\nas wo\\_alloc use a pre-computed data structure to pass the array to the underlying code. \nBenchmarks of the first scenario give the true cost of offloading the summation of a Perl array to a given \nfunction when the source data are in Perl. Timing the second scenario benchmarks the speed of the\nunderlying implementation.\n\nThe script illustrates \n* an important (but not the only one!) strategy to create a data structure\nthat is suitable for Assembly to work with, i.e. a standard array of the appropriate type, \nin which one element is laid adjacent to the previous one in memory\n* the emulation of declaring a pointer as constant in the interface of a C function. In the\nAVX code, we don't FAFO with the pointer (RSI in the calling convention) to the array directly,\nbut first load its address to another register that we manipulate at will.  
\n\n\n#### Results\nThose were obtained on the i7 with the following topology\n\n![Topology of system](i7.png)\n\nAnd here are the timings! \n\n|                              |  mean  | median | stddev |\n|------------------------------|--------|--------|--------|\n|ASM\\_blank                    | 2.3e-06| 2.0e-06| 1.1e-06|\n|ASM\\_doubles\\_AVX\\_w\\_alloc   | 3.6e-03| 3.5e-03| 4.2e-04|\n|ASM\\_doubles\\_AVX\\_wo\\_alloc  | 3.0e-04| 2.9e-04| 2.7e-05|\n|ASM\\_doubles\\_w\\_alloc        | 4.3e-03| 4.1e-03| 4.5e-04|\n|ASM\\_doubles\\_wo\\_alloc       | 8.9e-04| 8.7e-04| 3.0e-05|\n|ASM\\_w\\_alloc                 | 4.3e-03| 4.2e-03| 4.5e-04|\n|ASM\\_wo\\_alloc                | 9.2e-04| 9.1e-04| 4.1e-05|\n|ForLoop                       | 1.9e-02| 1.9e-02| 2.6e-04|\n|ListUtil                      | 4.5e-03| 4.5e-03| 1.4e-04|\n|PDL\\_w\\_alloc                 | 2.1e-02| 2.1e-02| 6.7e-04|\n|PDL\\_wo\\_alloc                | 9.2e-04| 9.0e-04| 3.9e-05|\n\nLet's say we wanted to do this toy experiment in pure C (using Inline::C of course!)\nThis code obtains the integers as a packed \"string\" of doubles and forms the sum in C\n```C\ndouble sum_array_C(char *array_in, size_t length) {\n    double sum = 0.0;\n    double * array = (double *) array_in;\n    for (size_t i = 0; i \u003c length; i++) {\n        sum += array[i];\n    }\n    return sum;\n}\n```\n\nHere are the timing results:\n\n|                              |  mean  | median | stddev |\n|------------------------------|--------|--------|--------|\n|C\\_doubles\\_w\\_alloc          |4.1e-03 |4.1e-03 | 2.3e-04|\n|C\\_doubles\\_wo\\_alloc         |9.0e-04 |8.7e-04 | 4.6e-05|\n\n\nWhat if we used SIMD directives and parallel loop constructs in OpenMP? This was done in\nthe file addArrayOfIntegers\\_C.pl. All three combinations were tested, i.e. 
SIMD directives\nalone (the C equivalent of the AVX code), OpenMP parallel loop threads and SIMD+OpenMP.\nHere are the timings!\n\n|                              |  mean  | median | stddev |\n|------------------------------|--------|--------|--------|\n|C\\_OMP\\_w\\_alloc              |4.0e-03 | 3.7e-03| 1.4e-03|\n|C\\_OMP\\_wo\\_alloc             |3.1e-04 | 2.3e-04| 9.5e-04|\n|C\\_SIMD\\_OMP\\_w\\_alloc        |4.0e-03 | 3.8e-03| 8.6e-04|\n|C\\_SIMD\\_OMP\\_wo\\_alloc       |3.1e-04 | 2.5e-04| 8.5e-04|\n|C\\_SIMD\\_w\\_alloc             |4.1e-03 | 4.0e-03| 2.4e-04|\n|C\\_SIMD\\_wo\\_alloc            |5.0e-04 | 5.0e-04| 8.9e-05|\n\n#### Discussion of the sum of an array of integers example\n* For calculations such as this, the price that must be paid is all in memory currency: it\ntakes time to generate these large arrays, and for code with low arithmetic intensity this\ntime dominates the numeric calculation time.\n* Look how insanely effective sum in List::Util is: even though it has to walk the Perl \narray whose elements (the *doubles*, not the AV*) are not stored in a contiguous area in memory,\nit is no more than 3x slower than the equivalent C code C\\_doubles\\_wo\\_alloc. \n* Look how optimized PDL is compared to the C code in the scenario without memory allocation.\n* Manual SIMD coded in assembly is 40% faster than the equivalent SIMD code in OpenMP (but it is\nmuch more painful to write).\n* The threaded OpenMP version achieved equivalent performance to the single-threaded AVX assembly\nprograms, with no obvious improvement from combining the SIMD and parallel for loop pragmas in OpenMP. \n* For the example considered here, it thus makes ZERO sense to offload a calculation as simple as a \nsummation because ListUtil is already within 15% of the assembly solution (at a later iteration\nwe will also test AVX2 and AVX512 packed addition to see if we can improve the results). 
\n* If, however, one were managing the array not as a Perl array but as an area in memory through \na Perl object, then one COULD consider offloading. It may be fun to consider an example in \nwhich one adds the output of a function that has an efficient PDL and assembly implementation\nto see how the calculus changes (in the to-do list for now).\n\n### Parallel reductions over numerical data using ListUtil, Inline and OpenMP\n##### Script: ListUtil_OMP.pl and analyze_ListUtil_OMP.R\nExploration of reductions for numerical data using Perl loops, Inline and OpenMP.\nVersion 0.01 was created for the [London 2024 Perl and Raku Workshop](https://act.yapc.eu/lpw2024/) on a Xeon E2597v4 with the\nfollowing topology: \n![image](https://github.com/user-attachments/assets/aa1cfde3-8f0d-4d22-b9f6-29fc2d5884c0)\n\nand empirical roofline diagram:\n\n![image](https://github.com/user-attachments/assets/6c93e210-33d7-460c-92b7-5483a0460ff1)\n\nA write-up of the scenario and additional results will be uploaded after the workup. \nVersion 0.02 will likely drop _after_ the [Winter 2024 Perl Community conference](https://science.perlcommunity.org/spj).\n\n\n\n### Disclaimer\nThe code here is NOT meant to be portable. I code on Linux and on x86-64, so if you are looking into the Windows ABI or ARM, you will be disappointed. But as my knowledge of ARM assembly grows, I intend to rewrite some examples in ARM assembly!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchrisarg%2Fperlassembly","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fchrisarg%2Fperlassembly","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchrisarg%2Fperlassembly/lists"}