https://github.com/lovasko/aggstat

Constant Time and Space Implementation of Aggregate Functions in C99

Host: GitHub
URL: https://github.com/lovasko/aggstat
Owner: lovasko
License: bsd-2-clause
Created: 2019-10-07T21:43:20.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2021-06-10T23:57:38.000Z (about 5 years ago)
Last Synced: 2024-12-25T11:32:48.175Z (over 1 year ago)
Language: C
Size: 147 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# libagg

`libagg` is a C99 library that implements a set of algorithms for statistical aggregation of a
stream of floating-point values. All aggregate functions are provided in two separate
implementations: _static_ and _streaming_.

The two key properties of the streaming implementations follow from the fact that the full set of
values does not need to be known ahead of time, which enables:

 * large data sets to be processed in a constant amount of memory
 * analysis of data sets of unknown length (e.g. user input, non-deterministic behaviour of a
   system)

These abilities come at a cost: the streaming implementations are less precise than the static
ones. The test suite monitors this trade-off closely in order to document and empirically
demonstrate the magnitude of the error for each aggregate function.
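
As an illustration of the streaming idea, here is a minimal sketch of a running variance next to
its two-pass counterpart. It uses Welford's well-known update rule, which is an assumption made
for this example and not necessarily the algorithm used by `libagg`. All names (`struct run_var`,
`run_var_new`, `run_var_put`, `run_var_get`, `two_pass_var`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical running state: three scalars, regardless of stream length. */
struct run_var {
  size_t cnt; /* number of values seen so far */
  double avg; /* running mean */
  double m2;  /* running sum of squared deviations from the mean */
};

/* Reset the state before the first value arrives. */
void run_var_new(struct run_var *rv)
{
  rv->cnt = 0;
  rv->avg = 0.0;
  rv->m2  = 0.0;
}

/* Fold one value into the state (Welford's update). */
void run_var_put(struct run_var *rv, double val)
{
  double dlt;

  rv->cnt += 1;
  dlt      = val - rv->avg;
  rv->avg += dlt / (double)rv->cnt;
  rv->m2  += dlt * (val - rv->avg);
}

/* Population variance of the values seen so far. */
double run_var_get(const struct run_var *rv)
{
  assert(rv->cnt > 0);
  return rv->m2 / (double)rv->cnt;
}

/* Static counterpart: two passes over an array that is fully known. */
double two_pass_var(const double *arr, size_t len)
{
  double avg = 0.0;
  double acc = 0.0;
  size_t idx;

  assert(len > 0);
  for (idx = 0; idx < len; idx += 1) {
    avg += arr[idx];
  }
  avg /= (double)len;

  for (idx = 0; idx < len; idx += 1) {
    acc += (arr[idx] - avg) * (arr[idx] - avg);
  }
  return acc / (double)len;
}
```

The running version never stores the values themselves, so its memory use does not grow with the
stream. The two results generally agree only up to a small rounding difference.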

## Aggregate Functions

The module provides the following aggregate functions:

 * first value
 * last value
 * count
 * sum
 * minimum
 * maximum
 * average
 * variance
 * standard deviation
 * skewness
 * kurtosis
 * p-quantile
 * median

## API

### Functions

The streaming part of the library consists of the following three functions:

 * `agg_new` to initialize or reset the state
 * `agg_put` to update the statistical aggregate estimate
 * `agg_get` to obtain the statistical aggregate estimate

The static part of the library consists of a single function:

 * `agg_run` to calculate the statistical aggregate

### Types

The streaming part of the library uses a single type:

 * `struct agg`, which keeps track of the state and should be treated as an opaque structure

The static part of the library does not use any custom types.

### Constants

Both parts of the library use the following constants to identify the aggregate functions:

 * `AGG_FNC_FST` for first value
 * `AGG_FNC_LST` for last value
 * `AGG_FNC_CNT` for count
 * `AGG_FNC_SUM` for sum
 * `AGG_FNC_MIN` for minimum
 * `AGG_FNC_MAX` for maximum
 * `AGG_FNC_AVG` for average
 * `AGG_FNC_VAR` for variance
 * `AGG_FNC_DEV` for standard deviation
 * `AGG_FNC_SKW` for skewness
 * `AGG_FNC_KRT` for kurtosis
 * `AGG_FNC_QTL` for p-quantile
 * `AGG_FNC_MED` for median
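
To make the shape of such an interface concrete, the following is a minimal sketch of a state
structure that is initialized with a function identifier, updated one value at a time, and read
on demand. Everything in it is hypothetical: the `mini_*` functions, `struct mini` and the
`MINI_FNC_*` constants are invented for this example and say nothing about how `libagg` is
implemented internally. The boolean return of `mini_get` is likewise just one possible way to
report that no result is available yet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical function identifiers; not the library's AGG_FNC_* values. */
#define MINI_FNC_CNT 0
#define MINI_FNC_MIN 1
#define MINI_FNC_MAX 2

/* Hypothetical state; callers are not expected to touch the members. */
struct mini {
  uint8_t  fnc; /* selected aggregate function */
  uint64_t cnt; /* number of values seen */
  double   val; /* current minimum or maximum */
};

/* Initialize or reset the state for the chosen function. */
void mini_new(struct mini *m, uint8_t fnc)
{
  assert(m != NULL);
  m->fnc = fnc;
  m->cnt = 0;
  m->val = 0.0;
}

/* Update the state with the next value of the stream. */
void mini_put(struct mini *m, double inp)
{
  assert(m != NULL);
  if (m->cnt == 0
   || (m->fnc == MINI_FNC_MIN && inp < m->val)
   || (m->fnc == MINI_FNC_MAX && inp > m->val)) {
    m->val = inp;
  }
  m->cnt += 1;
}

/* Read the current result; false means that it is not defined yet. */
bool mini_get(const struct mini *m, double *out)
{
  assert(m != NULL && out != NULL);
  if (m->fnc == MINI_FNC_CNT) {
    *out = (double)m->cnt;
    return true;
  }
  if (m->cnt == 0) {
    return false;
  }
  *out = m->val;
  return true;
}
```

Storing the function identifier inside the state keeps the update and read calls free of extra
arguments, at the price of one more member in the structure.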

## Examples

The following snippet computes the 99th percentile of the values in a stream, retrieving the
numbers from a hypothetical stream by calling the `get_number` function.

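```c
#include <stdbool.h>
#include <stdint.h>
#include "agg.h"

struct agg agg; /* streaming state, treated as opaque */
double     num; /* latest value read from the stream */
double     p99; /* receives the 99th percentile estimate */
bool       ret;
uint8_t    idx;

/* Select the p-quantile function with p = 0.99. */
agg_new(&agg, AGG_FNC_QTL, 0.99);

/* Feed 32 values into the state, one at a time. */
for (idx = 0; idx < 32; idx += 1) {
  num = get_number();
  agg_put(&agg, num);
}

/* Obtain the current estimate. */
ret = agg_get(&agg, &p99);
```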

A similar computation can be performed using the static algorithm, which expects the whole stream
to be known ahead of time:

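```c
double  num[32]; /* all values, collected up front */
double  p99;     /* receives the 99th percentile */
bool    ret;
uint8_t idx;

/* Collect the whole stream first. */
for (idx = 0; idx < 32; idx += 1) {
  num[idx] = get_number();
}

/* Compute the 99th percentile of the 32 collected values. */
ret = agg_run(&p99, num, 32, AGG_FNC_QTL, 0.99);
```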

## Floating-Point Types

The library uses the `double` type by default, as it is the de facto standard floating-point type
in the C99 language. This is evidenced by the fact that the mathematical functions for `float` and
`long double` are distinguished by a suffix, while the `double` variants carry none, e.g. `sinf`
and `sinl` versus `sin`. Moreover, unsuffixed floating-point literals in the language are `double`
by default too.

To use another floating-point type instead, the `AGG_BIT` macro has to be defined with the
appropriate bit width before the `agg.h` header file is included (and thus processed by the
pre-processor). This creates a trade-off, as the chosen type determines the precision of all
functions. The switch is not dynamic and has to be made when the source code is compiled. The
following table lists the currently supported floating-point types and the respective values
recognised by `AGG_BIT`:

| Type         | `AGG_BIT` |
|--------------|-----------|
| `float`      | 32        |
| `double`     | 64        |
| `__float128` | 128       |
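
A compile-time switch of this kind can be sketched with the pre-processor alone. The snippet
below is only an illustration of the technique: `MY_BIT`, `my_flt`, `MY_LIT` and `my_avg` are
hypothetical names, and the actual contents of `agg.h` may look quite different.

```c
#include <assert.h>
#include <stddef.h>

/* MY_BIT, my_flt and MY_LIT are hypothetical names used only in this sketch. */
#ifndef MY_BIT
  #define MY_BIT 64
#endif

#if MY_BIT == 32
  typedef float my_flt;
  #define MY_LIT(x) x ## f
#elif MY_BIT == 64
  typedef double my_flt;
  #define MY_LIT(x) x
#else
  #error "MY_BIT must be 32 or 64 in this sketch"
#endif

/* The arithmetic is written once, against my_flt. */
my_flt my_avg(const my_flt *arr, size_t len)
{
  my_flt sum = MY_LIT(0.0);
  size_t idx;

  assert(len > 0);
  for (idx = 0; idx < len; idx += 1) {
    sum += arr[idx];
  }
  return sum / (my_flt)len;
}
```

Defining `MY_BIT` as 32 before this code is compiled switches every declaration and literal to
`float` without touching the function bodies.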

## Testing

The library has a particular trade-off at its heart: it sacrifices some precision of the
computations in order to provide the streaming capabilities of the aggregate functions. With the
explicit goal of keeping this trade-off in check, the test suite compares the streaming variants of
the functions to their static versions and computes the difference between the two. Furthermore,
each aggregate function specifies its accepted magnitude of error for a given number of incoming
values. The magnitudes are always upper bounds, rounded to the closest power of ten, whether the
exponent is positive or negative.

The precise values can be found in the [ERROR.md](ERROR.md) file.
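
One way to read "upper bound, rounded to a power of ten" is as the smallest exponent `e` for which
the absolute difference between the two results does not exceed `10^e`. The sketch below computes
that exponent. The helper name `err_exp10` is hypothetical, and the test suite may derive its
numbers differently.

```c
#include <assert.h>

/* Hypothetical helper: smallest integer e with |streaming - exact| <= 10^e.
 * The two inputs must differ, otherwise no finite exponent exists. The powers
 * of ten are built by repeated multiplication and are thus approximate. */
int err_exp10(double streaming, double exact)
{
  double err = streaming - exact;
  double bnd = 1.0;
  int    pwr = 0;

  if (err < 0.0) {
    err = -err;
  }
  assert(err > 0.0);

  /* Grow the bound while it is still below the error. */
  while (bnd < err) {
    bnd *= 10.0;
    pwr += 1;
  }

  /* Shrink the bound while the next smaller power still covers the error. */
  while (bnd / 10.0 >= err) {
    bnd /= 10.0;
    pwr -= 1;
  }

  return pwr;
}
```

For instance, a streaming result of `1.0003` against a static result of `1.0` differs by roughly
`3e-4`, which this helper reports as the exponent `-3`.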

## Memory Usage

The library does not dynamically allocate any memory and thus all aggregations are performed in a
constant amount of memory whose size is known at compile time, typically on the stack. Based on the
chosen floating-point type, `float` or `double`, the core type `struct agg` takes up 92 and 136
bytes, respectively.

## Performance

The vast majority of the code is branchless and hand-optimized for performance. The test suite
measures the average execution time per value, which tends to be on the order of nanoseconds. The
measurements show stable performance with almost no variance, which makes the library suitable for
use in low-latency scenarios.

## Note on Optimizations

All major C99 compilers offer multiple optimization levels, some of which might sacrifice the
correctness of the computation in order to achieve better performance. The `-ffast-math` option,
which is part of the `-Ofast` optimization level, causes a number of changes to the behaviour of
floating-point computations. This in turn causes a slight divergence in the numerical precision of
the algorithms in question. The error testing takes this into account and monitors the divergence
appropriately.

The [ERROR.md](ERROR.md) file contains the columns `double fast` and `float fast`, which represent
builds with the `-Ofast` compilation option.
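
A classic example of code that is sensitive to such options is compensated (Kahan) summation. The
sketch below is a generic illustration and is not taken from `libagg`. The function names are
hypothetical. Under strict IEEE 754 evaluation the correction term recovers bits that plain
summation loses, whereas a compiler that is permitted to reassociate floating-point expressions
may simplify the correction away.

```c
#include <assert.h>
#include <stddef.h>

/* Plain left-to-right summation. */
double naive_sum(const double *arr, size_t len)
{
  double sum = 0.0;
  size_t idx;

  assert(arr != NULL);
  for (idx = 0; idx < len; idx += 1) {
    sum += arr[idx];
  }
  return sum;
}

/* Kahan summation: cmp carries the rounding error of the previous addition. */
double kahan_sum(const double *arr, size_t len)
{
  double sum = 0.0;
  double cmp = 0.0;
  double adj;
  double tmp;
  size_t idx;

  assert(arr != NULL);
  for (idx = 0; idx < len; idx += 1) {
    adj = arr[idx] - cmp;
    tmp = sum + adj;
    /* Algebraically (tmp - sum) - adj is zero, so a compiler that is allowed
     * to reassociate may remove the correction altogether. */
    cmp = (tmp - sum) - adj;
    sum = tmp;
  }
  return sum;
}
```

Summing `1.0` followed by several values of `1e-16` shows the difference: without reassociation,
the plain sum stays at exactly `1.0`, while the compensated sum ends up slightly above it.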

## Note on Randomness

All tests use a very weak source of randomness: a simple linear congruential generator that is not
to be used in a serious production setting where either cryptographic safety or a perfect
distribution is of the essence.

The weak generator was chosen to keep the module free of dependencies: one of the design goals of
the module is to be extremely light. The standard `rand` function was not used in order to avoid
static analysis warnings.
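
For reference, a linear congruential generator fits in a few lines. The sketch below uses the
widely published "Numerical Recipes" constants with a modulus of 2^32. Both the constants and the
names `lcg_next` and `lcg_unit` are assumptions of this example, as the generator used by the test
suite may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One step of a 32-bit LCG: state = a * state + c (mod 2^32). The modulus is
 * implied by the wrap-around of the unsigned 32-bit arithmetic. */
uint32_t lcg_next(uint32_t *state)
{
  assert(state != NULL);
  *state = (*state * UINT32_C(1664525)) + UINT32_C(1013904223);
  return *state;
}

/* Map the 32-bit output to a double in the interval [0, 1). */
double lcg_unit(uint32_t *state)
{
  return (double)lcg_next(state) / 4294967296.0;
}
```

The whole generator state is a single `uint32_t`, so a test can reproduce a sequence simply by
starting from the same seed.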

## Standards

Both the library and the test suite are written in standards-compliant C99. The provided source
code ought to compile on all standard compilers without any warnings. Reports of any compiler or
static analysis warnings are encouraged and will be addressed.

## Future Work

The following areas of focus are not addressed by the library at this time:

 * support for the `long double` type
 * inter-quartile range aggregate function
 * headless mode where the function type is not stored as part of `struct agg`
 * ability to select an integer type size for the count variables

## License

The module is licensed under the 2-clause BSD license (see the LICENSE file for more information).
In case you need a different license, feel free to contact the author.

## Author

Daniel Lovasko
