Benchmarking C, Python, and Rust on the "sp" problem
- Host: GitHub
- URL: https://github.com/eddieantonio/fast-sp
- Owner: eddieantonio
- License: agpl-3.0
- Created: 2023-07-07T22:45:31.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-07-17T15:05:14.000Z (over 1 year ago)
- Last Synced: 2024-04-09T14:49:38.181Z (7 months ago)
- Topics: benchmarking, numpy, python, rust, simd
- Language: Rust
- Homepage: https://eddieantonio.ca/blog/2023/07/12/faster-than-c-with-python/
- Size: 107 KB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Fast SP
Various implementations of the problem in [this blog post][blog].
[blog]: https://owen.cafe/posts/six-times-faster-than-c/
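
As I understand the linked post, the task is to scan a byte buffer and keep a running count, adding 1 for every `s` and subtracting 1 for every `p`. A minimal scalar sketch (my paraphrase, not code from this repo):

```rust
// Scalar restatement of the "sp" problem (my reading of the blog post, not
// code copied from this repo): +1 for every b's', -1 for every b'p',
// everything else ignored.
fn count_sp(buf: &[u8]) -> i64 {
    let mut total: i64 = 0;
    for &byte in buf {
        match byte {
            b's' => total += 1,
            b'p' => total -= 1,
            _ => {}
        }
    }
    total
}
```

This is roughly the "`for` loop and mutable state" shape that several of the implementations below use.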
# Requirements

To run this, you will need **Rust Nightly** and **Python 3.8+** with **numpy**.

## Rust (nightly)

Use [rustup](https://www.rust-lang.org/tools/install) to install a Rust toolchain, then install a nightly toolchain:

```
rustup update -- nightly
```

Then run `rustup override set nightly` to use the Rust nightly toolchain in the current directory.

## Python 3.8+ with NumPy

The test cases were generated using NumPy shenanigans. You probably want to create a virtual environment and install NumPy inside it. Here's one way to do it:

```
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

# Running the benchmarks

**Note**: `cargo bench` **must** be run before profiling Python (I'm sorry).

Benchmark C and Rust:

```
cargo bench
```

Benchmark Python (ensure NumPy is installed):

```
python3 python/benchmark-python.py
```
## "Data analysis"
If you want to try analyzing the results, you will need to install additional Python packages. In your virtual environment, run this:

```
python3 -m pip install -r requirements-dev.txt
```

This part was really janky: I just copy-pasted the output from the benchmarks straight from my terminal into a file called `output.txt`, but I guess you can do this:

```
cargo bench | tee output.txt &&\
    python3 python/benchmark-python.py | tee -a output.txt
```

And then run the script to parse this file:

```
python3 ./analyze-data-from-cargo-bench-output.py output.txt
```
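
For what it's worth, the pasted benchmark output is libtest's usual `test NAME ... bench: N ns/iter (+/- M)` lines. The repo's parser is the Python script above; purely as an illustration of what it has to dig out of each line, here is a hypothetical Rust helper (`parse_bench_line` is made up for this example):

```rust
/// Hypothetical helper (not part of this repo) that pulls the benchmark name
/// and the ns/iter figure out of one libtest bench line, e.g.:
///   test c_original_random_sp ... bench:  42,119,291 ns/iter (+/- 501,957)
fn parse_bench_line(line: &str) -> Option<(&str, u64)> {
    let rest = line.trim().strip_prefix("test ")?;
    let (name, timing) = rest.split_once(" ... bench:")?;
    // Keep only the digits of the "N ns/iter" part (this drops the commas).
    let ns: String = timing
        .split("ns/iter")
        .next()?
        .chars()
        .filter(|c| c.is_ascii_digit())
        .collect();
    Some((name, ns.parse().ok()?))
}
```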
# Implementations
- `c_original` — the original implementation from the [blog post][blog].
- `c_for_loop` — a straightforward C implementation, with buffer size given (no need to find the null-terminator).
- `c_while_loop` — a slight variation on the original.
- `rust_for_loop` — a Rust implementation that uses a `for` loop and mutable state.
- `rust_iter` — a Rust implementation that uses iterator adaptors instead of an explicit loop and mutable state (see the sketch after this list).
- `rust_simd` — a Rust implementation that uses [Portable SIMD][].
- `python_for_loop` — Python code to analyze buffer byte-by-byte.
- `python_numpy` — solution that uses NumPy.

[Portable SIMD]: https://github.com/rust-lang/portable-simd
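
To make the stylistic difference concrete, here is roughly what the iterator-based flavour looks like; this is a hedged sketch, not necessarily the repo's exact `rust_iter`:

```rust
// Iterator-style sketch (not necessarily the repo's exact `rust_iter`):
// the same counting rule as the for-loop version, expressed with iterator
// adaptors instead of an explicit loop and a mutable counter.
fn count_sp_iter(buf: &[u8]) -> i64 {
    buf.iter()
        .map(|&byte| match byte {
            b's' => 1,
            b'p' => -1,
            _ => 0,
        })
        .sum()
}
```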
# Benchmarks
There were two test cases that I used:

- `random_printable`: 12 MiB of random printable ASCII characters.
- `random_sp`: 12 MiB of either the ASCII character `s` or `p`.

I chose 12 MiB as the size of the test data, as that is the size of the L2 cache on the M1's performance cores (allegedly).

Here's how the various implementation strategies perform on my machine (from fastest to slowest):
| Language | Implementation | Test case | Throughput (GiB/s) | Time per iteration |
|:-----------|:-----------------|:------------------|--------------------:|-------------------------------------:|
| Rust | portable\_simd | random\_sp | 19.874 | 589,654 ns/iter ± 5,769 |
| Rust | portable\_simd | random\_printable | 19.870 | 589,766 ns/iter ± 5,726 |
| Python | numpy | random\_printable | 5.800 | 2,020,549 ns/iter ± 28,156 |
| Python | numpy | random\_sp | 5.711 | 2,052,063 ns/iter ± 113,433 |
| Rust | iter | random\_printable | 3.979 | 2,944,916 ns/iter ± 80,495 |
| Rust | iter | random\_sp | 3.978 | 2,946,037 ns/iter ± 33,849 |
| Rust | emulate\_numpy | random\_sp | 3.031 | 3,866,700 ns/iter ± 116,109 |
| Rust | emulate\_numpy | random\_printable | 3.026 | 3,872,550 ns/iter ± 103,355 |
| C | while\_loop | random\_printable | 1.487 | 7,878,845 ns/iter ± 105,742 |
| C | while\_loop | random\_sp | 1.486 | 7,886,922 ns/iter ± 158,662 |
| C | for\_loop | random\_sp | 1.482 | 7,905,779 ns/iter ± 40,898 |
| Rust | for\_loop | random\_sp | 1.482 | 7,906,754 ns/iter ± 635,075 |
| Rust | for\_loop | random\_printable | 1.481 | 7,912,558 ns/iter ± 499,821 |
| C | original | random\_printable | 1.478 | 7,930,400 ns/iter ± 135,758 |
| C | for\_loop | random\_printable | 1.478 | 7,931,450 ns/iter ± 516,610 |
| C | original | random\_sp | 0.278 | 42,119,291 ns/iter ± 501,957 |
| Python | for\_loop | random\_printable | 0.001 | 18,939,466,916 ns/iter ± 44,605,333 |
| Python     | for\_loop        | random\_sp        |               0.001 | 19,451,341,325 ns/iter ± 291,675,532 |
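
The throughput column is consistent with simply dividing the 12 MiB test buffer by the time per iteration; for example, 589,654 ns for 12 MiB works out to about 19.87 GiB/s. A tiny sketch of that conversion, with a helper name of my own invention:

```rust
// Convert "ns per iteration over a 12 MiB buffer" into GiB/s, matching the
// throughput column above (e.g. 589_654.0 gives roughly 19.87).
fn throughput_gib_per_s(ns_per_iter: f64) -> f64 {
    let gib = (12.0 * 1024.0 * 1024.0) / (1024.0 * 1024.0 * 1024.0); // 12 MiB in GiB
    gib / (ns_per_iter * 1e-9)
}
```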
# Analysis

**TODO!** Briefly, Clang generates code for `c_original` that does all of its counting logic with branches, which is probably why it struggles with `random_sp`: the M1 just can't predict the branches. `c_for_loop` and `rust_for_loop` both yield assembly that uses the `cinc` and `csel` instructions, avoiding costly unpredictable branches. Rust/LLVM is able to autovectorize `rust_iter`, but it generates a whole mess of instructions, and I haven't put in the time to understand what those instructions are actually doing. For `rust_simd`, Rust/LLVM generates nice SIMD code that processes 16 bytes at a time in a tiny loop.
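
For reference, a portable-SIMD version of the count might look roughly like this. It's a hedged sketch rather than the repo's actual `rust_simd`: `std::simd` is unstable, and its exact module paths and trait imports have shifted between nightlies.

```rust
#![feature(portable_simd)]
use std::simd::prelude::*; // exact imports vary across nightly versions

// Sketch of a portable-SIMD count (not the repo's actual `rust_simd`):
// compare 16 bytes at a time against b's' and b'p', then count mask bits.
fn count_sp_simd(buf: &[u8]) -> i64 {
    let s = u8x16::splat(b's');
    let p = u8x16::splat(b'p');
    let mut total: i64 = 0;

    let mut chunks = buf.chunks_exact(16);
    for chunk in &mut chunks {
        let v = u8x16::from_slice(chunk);
        total += v.simd_eq(s).to_bitmask().count_ones() as i64;
        total -= v.simd_eq(p).to_bitmask().count_ones() as i64;
    }
    // Handle the tail that doesn't fill a whole 16-byte vector.
    for &byte in chunks.remainder() {
        total += (byte == b's') as i64 - (byte == b'p') as i64;
    }
    total
}
```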
## Details of my testing machine

- Apple MacBook Pro M1, 2020.
- RAM: 8 GB
- Max Clock speed: 3.2 GHz
- L2 cache: 12 MB
- Apple clang version 14.0.0 (clang-1400.0.29.202)
- rustc 1.68.0-nightly (61a415be5 2023-01-12)
- LLVM version: 15.0.6

# License
Copyright © 2023 Eddie Antonio Santos. AGPL-3.0 licensed.