https://github.com/kerollmops/sdset


# SdSet

[![SdSet crate](https://img.shields.io/crates/v/sdset.svg)](https://crates.io/crates/sdset)
[![SdSet documentation](https://docs.rs/sdset/badge.svg)](https://docs.rs/sdset)

Set theory applied on sorted and deduplicated slices. Much performance! Such Wow!

[API Documentation can be found on docs.rs](https://docs.rs/sdset).

`sdset` stands for `sorted-deduplicated-slices-set`, which is a little bit too long.

## Performance

A note about the benchmarks, which are run on ranges of integers. If a benchmark name ends with:
- `two_slices_big`, the first slice contains `0..100` and the second has `1..101`
- `two_slices_big2`, the first contains `0..100` and the second has `51..151`
- `two_slices_big3`, the first contains `0..100` and the second has `100..200`
- `three_slices_big`, the first contains `0..100`, the second has `1..101` and the third has `2..102`
- `three_slices_big2`, the first contains `0..100`, the second has `34..134` and the third has `67..167`
- `three_slices_big3`, the first contains `0..100`, the second has `100..200` and the third has `200..300`

These runs of integers are useful because they overlap to varying degrees, so we can see how performance changes when different parts of the slices overlap.

For more information on why there are no benchmarks on really big slices, see [my response on /r/rust](https://www.reddit.com/r/rust/comments/98ahv5/sdset_set_theory_applied_on_sorted_and/e4ervlc/).
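The inputs listed above can be sketched with plain ranges. This is a minimal illustration, not the crate's benchmark code: the helper names `two_slices` and `overlap_len` are hypothetical, and the `i32` element type is an assumption.

```rust
use std::ops::Range;

/// Builds the two input vectors of a `two_slices_*` benchmark from half-open ranges.
/// The `i32` element type is an assumption made for this sketch.
fn two_slices(first: Range<i32>, second: Range<i32>) -> (Vec<i32>, Vec<i32>) {
    (first.collect(), second.collect())
}

/// Counts how many integers two half-open ranges have in common.
fn overlap_len(a: &Range<i32>, b: &Range<i32>) -> usize {
    let start = a.start.max(b.start);
    let end = a.end.min(b.end);
    if start < end {
        (end - start) as usize
    } else {
        0
    }
}
```

With these helpers, the `big` inputs overlap almost entirely, the `big2` inputs overlap on roughly half of their elements, and the `big3` inputs do not overlap at all.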

To run the benchmarks you must enable the `unstable` feature.

```bash
$ cargo bench --features unstable
```

Note that the `sdset` set operations do not need many allocations, so they start with a serious advantage. For more information, look at the variance of the benchmarks.

The `_btree` benchmarks use *two* or *three* `BTreeSet`s that contain runs of integers (see above); the creation of the `BTreeSet`s is not taken into account. The set operations are done on these sets and the result is accumulated in a final `Vec`.
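As a rough idea of what such a baseline measures, here is a sketch using only the standard library. The function names are hypothetical and the `i32` element type is an assumption; the actual benchmark code may differ.

```rust
use std::collections::BTreeSet;

/// Unions two already-built `BTreeSet`s and accumulates the result in a `Vec`.
fn btree_union(a: &BTreeSet<i32>, b: &BTreeSet<i32>) -> Vec<i32> {
    a.union(b).cloned().collect()
}

/// Intersects two already-built `BTreeSet`s and accumulates the result in a `Vec`.
fn btree_intersection(a: &BTreeSet<i32>, b: &BTreeSet<i32>) -> Vec<i32> {
    a.intersection(b).cloned().collect()
}
```

Because `BTreeSet` iterates in ascending order, the resulting `Vec` is itself sorted and deduplicated.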

The `_fnv` benchmarks use *two* or *three* `HashSet`s that contain runs of integers (see above). They use [a custom `Hasher` named `fnv`](https://github.com/servo/rust-fnv) that is specialized for small values like integers; the creation of the `HashSet`s is not taken into account. The set operations are done on these sets and the result is accumulated in a final `Vec`.
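To show how a custom `Hasher` plugs into a `HashSet`, here is a sketch with a hand-written 64-bit FNV-1a hasher. It is a stand-in written for illustration, not the code of the `fnv` crate; the names `Fnv1a`, `FnvSet` and `fnv_intersection` are hypothetical and the `i32` element type is an assumption.

```rust
use std::collections::HashSet;
use std::hash::{BuildHasherDefault, Hasher};

/// Hand-written 64-bit FNV-1a hasher (illustrative stand-in for the `fnv` crate).
struct Fnv1a(u64);

impl Default for Fnv1a {
    fn default() -> Self {
        // 64-bit FNV offset basis.
        Fnv1a(0xcbf2_9ce4_8422_2325)
    }
}

impl Hasher for Fnv1a {
    fn finish(&self) -> u64 {
        self.0
    }

    fn write(&mut self, bytes: &[u8]) {
        for &byte in bytes {
            // FNV-1a: xor the byte in, then multiply by the 64-bit FNV prime.
            self.0 ^= u64::from(byte);
            self.0 = self.0.wrapping_mul(0x0000_0100_0000_01b3);
        }
    }
}

/// A `HashSet` that hashes its elements with the hasher above.
type FnvSet<T> = HashSet<T, BuildHasherDefault<Fnv1a>>;

/// Intersects two already-built sets and accumulates the result in a `Vec`.
/// The order of the output is unspecified, as with any `HashSet` iteration.
fn fnv_intersection(a: &FnvSet<i32>, b: &FnvSet<i32>) -> Vec<i32> {
    a.intersection(b).cloned().collect()
}
```

Unlike the `BTreeSet` baseline, the output here is not sorted, so an extra sort would be needed to get back a sorted and deduplicated slice.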

The `_vec` benchmarks are available for the union set operation only. They consist of a `Vec` that is populated with the elements of *two* or *three* slices (see above), then sorted and deduplicated.
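This baseline can be sketched as follows. The function name `vec_union` is hypothetical, the `i32` element type is an assumption, and the use of `sort_unstable` rather than `sort` is a choice made for this sketch.

```rust
/// Naive union: concatenate every slice, then sort and deduplicate the result.
fn vec_union(slices: &[&[i32]]) -> Vec<i32> {
    let mut out = Vec::new();
    for slice in slices {
        out.extend_from_slice(slice);
    }
    out.sort_unstable();
    out.dedup();
    out
}
```

This approach ignores the fact that the inputs are already sorted and deduplicated, which is exactly the property a merge-based approach can exploit.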

The `duo` and `multi` measurements are the implementations that are part of this crate: the first one can only do set operations on **two** sets, while the second one can be used for any number of sets.
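To give an intuition for why sorted and deduplicated inputs help, here is a sketch of a linear merge over two slices, plus a pairwise fold for any number of slices. This is not the crate's implementation or API: the function names are hypothetical, the `i32` element type is an assumption, and the crate's `multi` operations may well work differently from a pairwise fold.

```rust
use std::cmp::Ordering;

/// Union of two sorted, deduplicated slices in a single linear pass.
fn duo_union(a: &[i32], b: &[i32]) -> Vec<i32> {
    let mut out = Vec::with_capacity(a.len() + b.len());
    let (mut i, mut j) = (0, 0);
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Less => { out.push(a[i]); i += 1; }
            Ordering::Greater => { out.push(b[j]); j += 1; }
            // Present in both inputs: emit once and advance both cursors.
            Ordering::Equal => { out.push(a[i]); i += 1; j += 1; }
        }
    }
    // At most one of these tails is non-empty.
    out.extend_from_slice(&a[i..]);
    out.extend_from_slice(&b[j..]);
    out
}

/// Intersection of two sorted, deduplicated slices in a single linear pass.
fn duo_intersection(a: &[i32], b: &[i32]) -> Vec<i32> {
    let mut out = Vec::new();
    let (mut i, mut j) = (0, 0);
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => { out.push(a[i]); i += 1; j += 1; }
        }
    }
    out
}

/// Union of any number of slices, obtained here by folding the two-slice union.
fn multi_union(slices: &[&[i32]]) -> Vec<i32> {
    slices.iter().fold(Vec::new(), |acc, slice| duo_union(&acc, slice))
}
```

Neither operation needs hashing, tree traversal or a final sort, and the output is itself sorted and deduplicated.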

### Histograms

Histograms can be generated from the benchmarks by executing the following commands:

```bash
$ export CARGO_BENCH_CMD='cargo bench --features unstable'
$ ./gen_graphs.sh xxx.bench
```

This makes the statistics much easier to read and shows how `sdset` is more performant on already sorted and deduplicated slices than any other kind of collection.

![difference benchmarks](misc/difference.png)

![intersection benchmarks](misc/intersection.png)

![union benchmarks](misc/union.png)

![symmetric difference benchmarks](misc/symmetric_difference.png)