Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/slothy-optimizer/slothy

Assembly super-optimization via constraint solving
https://github.com/slothy-optimizer/slothy

assembly-language constraint-programming cryptography-tools post-quantum-cryptography superoptimization

Last synced: 4 days ago
JSON representation

Assembly super-optimization via constraint solving

Awesome Lists containing this project

README

        

## SLOTHY: Assembly optimization via constraint solving



## About SLOTHY

**SLOTHY** - **S**uper (**L**azy) **O**ptimization of **T**ricky **H**andwritten assembl**Y** - is an assembly-level superoptimizer
for:
1. Instruction scheduling
2. Register allocation
3. Software pipelining (= periodic loop interleaving)

SLOTHY is generic in the target architecture and microarchitecture. This repository provides instantiations for:
- Armv8.1-M+Helium: Cortex-M55, Cortex-M85
- AArch64: Cortex-A55, and experimentally Cortex-A72, Cortex-X/Neoverse-V, Apple M1 (Firestorm, Icestorm)

SLOTHY is discussed in [Fast and Clean: Auditable high-performance assembly via constraint solving](https://eprint.iacr.org/2022/1303).

### Goal

SLOTHY enables a development workflow where developers write 'clean' assembly by hand, emphasizing the logic of the computation, while SLOTHY automates microarchitecture-specific micro-optimizations. This accelerates development, keeps manually written code artifacts maintainable, and allows to split efforts for formal verification into the separate verification of the clean code and the micro-optimizations.

### How it works

SLOTHY is essentially a constraint solver frontend: It converts the input source into a data flow graph and
builds a constraint model capturing valid instruction schedulings, register renamings, and periodic loop
interleavings. The model is passed to an external constraint solver and, upon success,
a satisfying assignment converted back into the final code. Currently, SLOTHY uses
[Google OR-Tools](https://developers.google.com/optimization) as its constraint solver backend.

### Performance

As a rough rule of thumb, SLOTHY typically optimizes workloads of <50 instructions in seconds to minutes, workloads
up to 150 instructions in minutes to hours, while for larger kernels some heuristics are necessary.

### Applications

SLOTHY has been used to provide the fastest known implementations of various cryptographic and DSP primitives:
For example, the [SLOTHY paper](https://eprint.iacr.org/2022/1303) discusses the NTTs underlying ML-KEM and ML-DSA for
Cortex-{A55, A72, M55, M85}, the FFT for Cortex-{M55,M85}, and the X25519 scalar multiplication for Cortex-A55. You find
the clean and optimized source code for those examples in [`paper/`](paper).

## Getting started

Have a look at the [SLOTHY tutorial](tutorial/README.md) for a hands-on and example-based introduction to SLOTHY.

## Real world uses

* [AWS libcrypto (AWS-LC)](https://github.com/aws/aws-lc): SLOTHY-optimized X25519 code based on our un-interleaved form of the [original code by Emil
Lenngren](https://github.com/Emill/X25519-AArch64) has been [formally verified and
included](https://github.com/awslabs/s2n-bignum/pull/108) in
[s2n-bignum](https://github.com/awslabs/s2n-bignum/) (the bignum component of AWS-LC) and [merged](https://github.com/aws/aws-lc/pull/1469) into
AWS-LC. This was the topic of a [Real World Crypto 2024
talk](https://iacr.org/submit/files/slides/2024/rwc/rwc2024/38/slides.pdf).

* [s2n-bignum](https://github.com/awslabs/s2n-bignum/) routinely employs SLOTHY for finding
further highly optimized ECC implementations (e.g., [P256](https://github.com/awslabs/s2n-bignum/pull/118),
[P384](https://github.com/awslabs/s2n-bignum/pull/122), [P521](https://github.com/awslabs/s2n-bignum/pull/130) and
verifies them through automated equivalence-checking in [HOL-Light](https://hol-light.github.io/).

* [Arm EndpointAI](https://github.com/ARM-software/EndpointAI): SLOTHY-optimized code has been deployed to the CMSIS DSP Library for the radix-4 CFFT routines as part
of the Arm EndpointAI project in [this
commit](https://github.com/ARM-software/EndpointAI/commit/817bb57d8a4a604538a04627851f5e9adb5f08fc).

## Installation

### Requirements

SLOTHY has been successfully used on

- Ubuntu-21.10 and up (64-bit),
- macOS Monterey 12.6 and up.

SLOTHY requires Python >= 3.10. See [requirements.txt](requirements.txt) for package requirements, and install via `pip
install -r requirements.txt`.

**Note:** `requirements.txt` pins versions for reproducibility. If you already have newer versions of some dependencies
installed and don't want them downgraded, consider using a virtual environment:

```
python3 -m venv venv
./venv/bin/python3 -m pip install -r requirements.txt
```

Then, enter the virtual environment via `source venv/bin/activate` prior to running SLOTHY.

### Docker

A dockerfile for an Ubuntu-22.04 based Docker image with all dependencies of SLOTHY and the PQMX+PQAX test
environments setup can be found in [paper/artifact/slothy.dockerfile](paper/artifact/slothy.Dockerfile). See
[paper/artifact/README.md](paper/artifact/README.md) for instructions.

### Quick check

To check that your setup is complete, try the following from the base directory:

```
% python3 example.py --examples aarch64_simple0_a55
```

You should see something like the following:

```
* Example: aarch64_simple0_a55...
INFO:aarch64_simple0_a55:Instructions in body: 20
INFO:aarch64_simple0_a55.slothy:Perform internal binary search for minimal number of stalls...
INFO:aarch64_simple0_a55.slothy:Attempt optimization with max 32 stalls...
INFO:aarch64_simple0_a55.slothy:Objective: minimize number of stalls
INFO:aarch64_simple0_a55.slothy:Invoking external constraint solver (OR-Tools CP-SAT v9.7.2996) ...
INFO:aarch64_simple0_a55.slothy:[0.0721s]: Found 1 solutions so far... objective 19.0, bound 8.0 (minimize number of stalls)
INFO:aarch64_simple0_a55.slothy:[0.0765s]: Found 2 solutions so far... objective 18.0, bound 12.0 (minimize number of stalls)
INFO:aarch64_simple0_a55.slothy:OPTIMAL, wall time: 0.155224 s
INFO:aarch64_simple0_a55.slothy:Booleans in result: 509
INFO:aarch64_simple0_a55.slothy.selfcheck:OK!
INFO:aarch64_simple0_a55.slothy:Minimum number of stalls: 18
```

### Examples

The [SLOTHY Tutorial](tutorial/README.md) and the [examples](examples/naive) directory contain numerous exemplary
assembly snippets. To try them, use `python3 example.py --examples={YOUR_EXAMPLE}`. See `python3 example.py --help` for
the list of all available examples.

The use of SLOTHY from the command line is illustrated in [scripts/](scripts/) supporting the real-world optimizations
for the NTT, FFT and X25519 discussed in [Fast and Clean: Auditable high-performance assembly via constraint
solving](https://eprint.iacr.org/2022/1303).