https://github.com/alandefreitas/scistats

High-Performance Descriptive Statistics and Hypothesis Tests in C++20

bayesian-statistics data-analysis descriptive-statistics hypothesis-testing performance-statistics statistics


# Scistats

> High-Performance Descriptive Statistics and Hypothesis Tests in C++20

[![Scistats](docs/img/banner.gif)](https://alandefreitas.github.io/scistats/)


Statistics help us analyze and interpret data. High-performance statistical algorithms help us analyze and interpret a lot of data. Most environments provide convenient helper functions to calculate basic statistics. Scistats aims to provide high-performance statistical algorithms with an easy and familiar interface. All algorithms can run sequentially or in parallel, depending on how much data you have.


Table of Contents

- [Quick Start](#quick-start)
- [Descriptive statistics](#descriptive-statistics)
  - [Central Tendency](#central-tendency)
  - [Dispersion](#dispersion)
  - [Multivariate Analysis](#multivariate-analysis)
- [Probability Distributions](#probability-distributions)
- [Hypothesis Testing](#hypothesis-testing)
- [Bayesian statistics](#bayesian-statistics)
- [Mathematics](#mathematics)
  - [Parallel Arithmetic](#parallel-arithmetic)
  - [Constants](#constants)
  - [Functions](#functions)
  - [Measuring Time](#measuring-time)
  - [Random Number Generators](#random-number-generators)
- [Roadmap](#roadmap)
- [Integration](#integration)
  - [Build from Source](#build-from-source)
  - [CMake targets](#cmake-targets)
  - [Other build systems](#other-build-systems)
- [Contributing](#contributing)
  - [Contributors](#contributors)

## Quick Start

Scistats extends the numeric facilities of the standard library to include statistics that work with iterators and ranges. This means you can do things like:

```cpp
std::vector v{/*...*/};
float m = scistats::mean(v);
```

or

```cpp
std::vector v{/*...*/};
float s = scistats::stddev(v); // standard deviation
```

or

```cpp
std::vector v1{/*...*/};
std::vector v2{/*...*/};
float c = scistats::cov(v1, v2); // covariance
```

or

```cpp
std::vector v{/*...*/};
float p = scistats::t_test(v); // Student's t hypothesis test
```

All algorithms accept execution policies and iterators, so you can write

```cpp
std::vector v{/*...*/};
float m = scistats::mean(scistats::execution::par, v);
```

to calculate your average in parallel. Or

```cpp
std::vector v{/*...*/};
float m = scistats::mean(scistats::execution::seq, v);
```

to explicitly tell scistats you don't want that to be calculated in parallel. If no execution policy is provided, scistats will choose a policy according to the input size.

As usual, you can also work directly with iterators, so

```cpp
std::vector v{/*...*/};
float m = scistats::mean(v.begin(), v.end());
```

also works.

Note that, when needed, the result type gets promoted to a floating-point type. If the result for a given statistic needs to be floating point, scistats will always promote an integer input type to a corresponding floating type large enough to keep the results without losing precision.
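
As a rough illustration of that promotion rule, here is a minimal standard-library sketch. It is not the scistats implementation: the `sketch` namespace, the `promoted_t` alias, and the choice of `double` as the promoted type for integers are all assumptions made for this example.

```cpp
#include <cassert>
#include <iterator>
#include <numeric>
#include <type_traits>
#include <vector>

namespace sketch {
    // Hypothetical promotion rule: floating-point inputs keep their type,
    // integer inputs are promoted to double.
    template <class T>
    using promoted_t = std::conditional_t<std::is_floating_point_v<T>, T, double>;

    // Arithmetic mean whose result type follows the promotion rule above.
    template <class Range>
    auto mean(const Range &r) {
        using value_type = std::decay_t<decltype(*std::begin(r))>;
        using result_type = promoted_t<value_type>;
        assert(std::begin(r) != std::end(r)); // the mean of an empty range is undefined
        const result_type sum = std::accumulate(std::begin(r), std::end(r), result_type{0});
        const auto n = std::distance(std::begin(r), std::end(r));
        return sum / static_cast<result_type>(n);
    }
} // namespace sketch
```

With this sketch, `sketch::mean(std::vector<int>{1, 2})` yields the `double` value `1.5` instead of truncating to an integer, while a `std::vector<float>` input keeps `float` as its result type.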

## Descriptive statistics

### Central Tendency

With ranges:

```cpp
using namespace scistats;
// ...
mean(x);
```

With iterators:

```cpp
mean(x.begin(), x.end());
```

You can run any algorithm in parallel by changing the execution policy:

```cpp
mean(execution::seq, x);
mean(execution::par, x);
```

If no execution policy is provided, scistats will infer the best execution policy according to the input data.

Other functions to measure central tendency are:

| Function | Description |
| ----------- | --------------- |
| `mean(x)` | Arithmetic mean |
| `median(x)` | Median |
| `mode(x)` | Mode |
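
To make the definitions of the median and the mode concrete, the following standard-library sketch computes both. It only illustrates the statistics and says nothing about how scistats implements them; the `sketch` namespace and the tie-breaking rule for the mode (smallest value wins) are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

namespace sketch {
    // Median: the middle element of the sorted data, or the average of the two
    // middle elements when the size is even. Takes a copy because
    // std::nth_element reorders its input.
    inline double median(std::vector<double> x) {
        assert(!x.empty());
        const std::size_t mid = x.size() / 2;
        std::nth_element(x.begin(), x.begin() + mid, x.end());
        const double upper = x[mid];
        if (x.size() % 2 == 1) return upper;
        // For even sizes, the lower middle is the largest element left of mid.
        const double lower = *std::max_element(x.begin(), x.begin() + mid);
        return (lower + upper) / 2.0;
    }

    // Mode: the most frequent value. Ties resolve to the smallest value
    // because std::map iterates its keys in increasing order.
    inline int mode(const std::vector<int> &x) {
        assert(!x.empty());
        std::map<int, std::size_t> counts;
        for (int v : x) ++counts[v];
        auto best = counts.begin();
        for (auto it = counts.begin(); it != counts.end(); ++it)
            if (it->second > best->second) best = it;
        return best->first;
    }
} // namespace sketch
```

Using `std::nth_element` keeps the median at linear average cost instead of paying for a full sort.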

### Dispersion

To calculate the standard deviation of a data set:

```cpp
stddev(x);
```

If you already know the mean `m`, you can make calculations faster with:

```cpp
stddev(x,m);
```

Other functions to measure dispersion are:

| Function | Description |
| ----------------- | --------------------------- |
| `var(x)` | Variance |
| `stddev(x)` | Standard Deviation |
| `min(x)` | Minimum Value |
| `max(x)` | Maximum Value |
| `bounds(x)` | Minimum and Maximum Values |
| `percentile(x,p)` | Calculate `p`-th percentile |
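
The reason a known mean speeds things up is that the standard deviation otherwise needs two passes over the data: one for the mean and one for the squared deviations. The sketch below shows this with the standard library only. It assumes the sample (`n - 1`) definition of variance, which may or may not match what scistats uses, and the `sketch` namespace is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

namespace sketch {
    inline double mean(const std::vector<double> &x) {
        assert(!x.empty());
        return std::accumulate(x.begin(), x.end(), 0.0) / static_cast<double>(x.size());
    }

    // Sample variance (n - 1 denominator) around a mean the caller already has.
    inline double var(const std::vector<double> &x, double m) {
        assert(x.size() > 1);
        double ss = 0.0; // sum of squared deviations from m
        for (double v : x) ss += (v - m) * (v - m);
        return ss / static_cast<double>(x.size() - 1);
    }

    // One pass over the data when the mean is supplied.
    inline double stddev(const std::vector<double> &x, double m) {
        return std::sqrt(var(x, m));
    }

    // Two passes when it is not: one for the mean, one for the deviations.
    inline double stddev(const std::vector<double> &x) {
        return stddev(x, mean(x));
    }
} // namespace sketch
```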

### Multivariate Analysis

To calculate the covariance of two data sets:

```cpp
cov(x,y);
```
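
For reference, the sample covariance of two equally sized data sets can be sketched with the standard library as follows. This is an illustration of the formula only, assuming the `n - 1` denominator; the `sketch` namespace is hypothetical and the code is not taken from scistats.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

namespace sketch {
    // Sample covariance: sum((x_i - mean_x) * (y_i - mean_y)) / (n - 1).
    inline double cov(const std::vector<double> &x, const std::vector<double> &y) {
        assert(x.size() == y.size() && x.size() > 1);
        const double n = static_cast<double>(x.size());
        const double mx = std::accumulate(x.begin(), x.end(), 0.0) / n;
        const double my = std::accumulate(y.begin(), y.end(), 0.0) / n;
        double s = 0.0;
        for (std::size_t i = 0; i < x.size(); ++i) s += (x[i] - mx) * (y[i] - my);
        return s / (n - 1.0);
    }
} // namespace sketch
```

For `x = {1, 2, 3}` and `y = {2, 4, 6}`, the means are 2 and 4, the sum of products is 4, and the sample covariance is 4 / 2 = 2.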

## Probability Distributions

To get the probability density at `x` in a normal distribution:

```cpp
norm_pdf(x);
```

To get the cumulative probability of `x` in a normal distribution:

```cpp
norm_cdf(x);
```

To get the value `x` that has a cumulative probability `p` in a normal distribution:

```cpp
norm_inv(p);
```

| Probability | Cumulative | Inverse | Description |
| ------------- | ------------- | ------------- | ------------------------ |
| `norm_pdf(x)` | `norm_cdf(x)` | `norm_inv(p)` | Normal distribution |
| `t_pdf(x,df)` | `t_cdf(x,df)` | `t_inv(p,df)` | Student's T distribution |

where `df` is the degrees of freedom in the probability distribution.
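
To show how the three columns relate to each other, here is a standard-library sketch for the standard normal distribution (mean 0, standard deviation 1, which is an assumption about the one-argument functions above). The `sketch` namespace is hypothetical, and the inverse uses plain bisection for clarity rather than whatever method scistats uses.

```cpp
#include <cassert>
#include <cmath>

namespace sketch {
    // Density of the standard normal distribution at x.
    inline double norm_pdf(double x) {
        const double inv_sqrt_2pi = 0.3989422804014327; // 1 / sqrt(2 * pi)
        return inv_sqrt_2pi * std::exp(-0.5 * x * x);
    }

    // Cumulative probability P(X <= x), via the complementary error function.
    inline double norm_cdf(double x) {
        return 0.5 * std::erfc(-x / std::sqrt(2.0));
    }

    // Inverse of norm_cdf by bisection: finds x such that norm_cdf(x) == p.
    inline double norm_inv(double p) {
        assert(p > 0.0 && p < 1.0);
        double lo = -40.0, hi = 40.0;
        for (int i = 0; i < 200; ++i) {
            const double mid = 0.5 * (lo + hi);
            if (norm_cdf(mid) < p) lo = mid; else hi = mid;
        }
        return 0.5 * (lo + hi);
    }
} // namespace sketch
```

The inverse undoes the cumulative function, so `norm_inv(norm_cdf(x))` returns approximately `x` for moderate values of `x`.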

## Hypothesis Testing

To test the hypothesis that the values in `x` come from a distribution whose mean is zero:

```cpp
t_test(x);
```

To test the hypothesis that the values in `x` and `y` have the same mean:

```cpp
t_test(x,y);
```

For a paired test:

```cpp
t_test_paired(x,y);
```

To get a confidence interval for these tests:

```cpp
t_test_interval(x);
t_test_interval(x,y);
```
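
As background, a one-sample t-test is built on the statistic `t = (mean - mu0) / (s / sqrt(n))` with `n - 1` degrees of freedom, where `s` is the sample standard deviation. The sketch below computes only that statistic with the standard library; turning it into a p-value additionally requires the Student's t cumulative distribution. The `sketch` namespace, the `t_statistic` struct, and `one_sample_t` are hypothetical names, not part of scistats.

```cpp
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

namespace sketch {
    struct t_statistic {
        double t;  // test statistic
        double df; // degrees of freedom
    };

    // Statistic for the null hypothesis "the population mean equals mu0".
    // Assumes the data are not all identical (otherwise s is zero).
    inline t_statistic one_sample_t(const std::vector<double> &x, double mu0 = 0.0) {
        assert(x.size() > 1);
        const double n = static_cast<double>(x.size());
        const double m = std::accumulate(x.begin(), x.end(), 0.0) / n;
        double ss = 0.0;
        for (double v : x) ss += (v - m) * (v - m);
        const double s = std::sqrt(ss / (n - 1.0)); // sample standard deviation
        return {(m - mu0) / (s / std::sqrt(n)), n - 1.0};
    }
} // namespace sketch
```

For the data `{1, 2, 3, 4, 5}` and `mu0 = 0`, the mean is 3, the sample variance is 2.5, and the statistic is `3 / sqrt(2.5 / 5)`, roughly 4.24, with 4 degrees of freedom.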

## Bayesian statistics

Given (i) the probability `P(E|H)=likelihood` of the evidence `E` given the hypothesis `H`, (ii) the prior probability `p_hypothesis` of hypothesis `H`, and (iii) the prior probability `p_evidence` of evidence `E`, we can calculate the probability `P(H|E)` of a hypothesis `H` given the evidence `E` with:

```cpp
bayes_theorem(likelihood, p_hypothesis, p_evidence)
```

Given `P(E|H)` and `P(E|not H)`, we can calculate the Bayes factor:

```cpp
bayes_factor(p_evidence_given_h, p_evidence_given_not_h)
```
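
A worked example may help here. The sketch below mirrors the argument order shown above but lives in a hypothetical `sketch` namespace and is not the scistats implementation; the numbers in the usage note are made up for illustration.

```cpp
#include <cassert>

namespace sketch {
    // Bayes' theorem: P(H|E) = P(E|H) * P(H) / P(E).
    inline double bayes_theorem(double likelihood, double p_hypothesis, double p_evidence) {
        assert(p_evidence > 0.0);
        return likelihood * p_hypothesis / p_evidence;
    }

    // Bayes factor: P(E|H) / P(E|not H).
    inline double bayes_factor(double p_evidence_given_h, double p_evidence_given_not_h) {
        assert(p_evidence_given_not_h > 0.0);
        return p_evidence_given_h / p_evidence_given_not_h;
    }
} // namespace sketch
```

Suppose a test detects a condition with probability `P(E|H) = 0.9`, gives a false positive with probability `P(E|not H) = 0.05`, and the condition has prior probability `P(H) = 0.01`. Then `P(E) = 0.9 * 0.01 + 0.05 * 0.99 = 0.0585`, so `bayes_theorem(0.9, 0.01, 0.0585)` is about 0.154 and `bayes_factor(0.9, 0.05)` is 18.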

## Mathematics

### Parallel Arithmetic

To sum the elements of a range in parallel:

```cpp
sum(execution::par, x)
```

Or let `scistats` infer if it is worth doing it in parallel:

```cpp
sum(x)
```

| Function | Description |
|-------------------|-------------------|
| `sum` | summation |
| `prod` | product |

### Constants

The header `scistats/math/constants.h` defines a number of useful constants as `constexpr` functions:

| Function | Description | Approximate Value |
| ------------- | -------------------------------------------------------------------------------- | ----------------- |
| `pi` | The constant pi | 3.14159 |
| `epsilon(scale)` | The machine epsilon for a given scale and type | `epsilon(1.)` = 2.22045e-16 |
| `inf` | The number representing infinity | inf |
| `min` | Smallest positive normalized number | 2.22507e-308 |
| `max` | Largest number | 1.79769e+308 |
| `NaN` | The number representing "not a number" | nan |
| `e` | Euler's number: the base of the natural logarithm | 2.71828 |
| `euler` | Euler–Mascheroni constant (Euler's gamma) | 0.577216 |
| `log2_e` | The base-2 logarithm of e | 1.4427 |
| `log10_e` | The base-10 logarithm of e | 0.434294 |
| `sqrt2` | The square root of two | 1.41421 |
| `sqrt1_2` | The square root of one-half | 0.707107 |
| `sqrt3` | The square root of three | 1.73205 |
| `pi_2` | Pi divided by two | 1.5708 |
| `pi_4` | Pi divided by four | 0.785398 |
| `sqrt_pi` | The square root of pi | 1.77245 |
| `two_sqrt_pi` | Two divided by the square root of pi | 1.12838 |
| `one_by_pi` | The reciprocal of pi (1./pi) | 0.31831 |
| `two_by_pi` | Twice the reciprocal of pi | 0.63662 |
| `ln10` | The natural logarithm of ten | 2.30259 |
| `ln2` | The natural logarithm of two | 0.693147 |
| `lnpi` | The natural logarithm of pi | 1.14473 |
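
To illustrate what a constant exposed as a `constexpr` function can look like, here is a small sketch. The template parameter, the default type, and the `sketch` namespace are assumptions made for this example; the actual signatures are defined in `scistats/math/constants.h`.

```cpp
namespace sketch {
    // Each constant is a function template so the caller can pick the type.
    template <class T = double>
    constexpr T pi() {
        return static_cast<T>(3.141592653589793238462643383279502884L);
    }

    // Euler's number, the base of the natural logarithm.
    template <class T = double>
    constexpr T e() {
        return static_cast<T>(2.718281828459045235360287471352662498L);
    }

    // Euler-Mascheroni constant (Euler's gamma).
    template <class T = double>
    constexpr T euler() {
        return static_cast<T>(0.577215664901532860606512090082402431L);
    }

    // Because the functions are constexpr, they can be used at compile time.
    static_assert(pi<float>() > 3.14f, "pi is usable in constant expressions");
} // namespace sketch
```

A function template avoids committing to one floating-point type: `sketch::pi<float>()` and `sketch::pi<long double>()` each return the value rounded to the requested type.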

### Functions

Some helper functions:

| Function | Description |
| -------------------- | ------------------------------------------------ |
| **Numeric** | |
| `abs` | absolute value (for floating point and integers) |
| `almost_equal` | check if two numbers are almost the same |
| `is_odd` | check if integer is odd |
| `is_even` | check if integer is even |
| **Trigonometric** | |
| `acot` | acot |
| `cot` | cot |
| **Special** | |
| `beta` | beta |
| `beta_inc` | beta_inc |
| `beta_inc_inv` | beta_inc_inv |
| `beta_inc_inv_upper` | beta_inc_inv_upper |
| `beta_inc_upper` | beta_inc_upper |
| `betaln` | betaln |
| `erfinv` | erfinv |
| `gammaln` | gammaln |
| `tgamma` | tgamma |
| `xinbta` | xinbta |
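
As an example of why a helper such as `almost_equal` is useful, floating-point results like `0.1 + 0.2` and `0.3` differ in their last bits, so `==` reports them as different. The sketch below shows one common way to compare with tolerances. The parameter names, default tolerances, and `sketch` namespace are assumptions and do not describe the scistats function.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

namespace sketch {
    // True when a and b differ by at most abs_tol, or by at most rel_tol
    // relative to the larger magnitude of the two.
    inline bool almost_equal(double a, double b, double rel_tol = 1e-9,
                             double abs_tol = 1e-12) {
        assert(rel_tol >= 0.0 && abs_tol >= 0.0);
        if (a == b) return true; // exact match, including equal infinities
        if (!std::isfinite(a) || !std::isfinite(b)) return false; // NaN or mismatched inf
        const double diff = std::abs(a - b);
        const double scale = std::max(std::abs(a), std::abs(b));
        return diff <= std::max(abs_tol, rel_tol * scale);
    }
} // namespace sketch
```

The absolute tolerance handles values near zero, where a purely relative comparison would be too strict.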

### Measuring Time

To measure the time between two operations:

```cpp
double t1 = tic();
// your operations
double t2 = toc();
```

To measure the time it takes to run a function:

```cpp
double t = timeit([](){
// Your function...
});
```

To create a mini-benchmark measuring the time it takes to run a function:

```cpp
std::vector t = minibench([](){
// Your function...
});
std::cout << "Mean: " << mean(t) << std::endl;
std::cout << "Standard Deviation: " << stddev(t) << std::endl;
```
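
For readers curious about what such helpers involve, the following sketch builds a timer and a mini-benchmark on top of `std::chrono::steady_clock`. It is not the scistats implementation: the `sketch` namespace, the `runs` parameter, and the choice of seconds as the unit are assumptions.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

namespace sketch {
    // Runs f once and returns the elapsed wall-clock time in seconds.
    template <class F>
    double timeit(F &&f) {
        const auto start = std::chrono::steady_clock::now();
        f();
        const auto stop = std::chrono::steady_clock::now();
        return std::chrono::duration<double>(stop - start).count();
    }

    // Runs f several times and returns one timing per run, ready to be
    // summarized with statistics such as the mean and standard deviation.
    template <class F>
    std::vector<double> minibench(F &&f, std::size_t runs = 10) {
        assert(runs > 0);
        std::vector<double> timings;
        timings.reserve(runs);
        for (std::size_t i = 0; i < runs; ++i) timings.push_back(timeit(f));
        return timings;
    }
} // namespace sketch
```

`steady_clock` is used because it never jumps backwards, unlike `system_clock`, which can be adjusted while the measurement is running.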

### Random Number Generators

To generate a random integer between `a` and `b` with a reasonable
random number generator:

```cpp
randi(a,b)
```

To generate a random number from a normal distribution:

```cpp
randn()
```

To generate a random number from a uniform distribution between `a` and `b`:

```cpp
rand(a,b)
```
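
These helpers can be approximated with the standard `<random>` header, as in the sketch below. The generator choice (`std::mt19937` seeded from `std::random_device`), the `sketch` namespace, and the name `rand_uniform` (used here instead of `rand` to avoid confusion with `std::rand`) are assumptions, not details taken from scistats.

```cpp
#include <cassert>
#include <random>

namespace sketch {
    // One generator per thread, seeded once from the system entropy source.
    inline std::mt19937 &generator() {
        static thread_local std::mt19937 g{std::random_device{}()};
        return g;
    }

    // Random integer in the closed interval [a, b].
    inline int randi(int a, int b) {
        assert(a <= b);
        return std::uniform_int_distribution<int>(a, b)(generator());
    }

    // Random number from the standard normal distribution.
    inline double randn() {
        return std::normal_distribution<double>(0.0, 1.0)(generator());
    }

    // Random number from the uniform distribution on [a, b).
    inline double rand_uniform(double a, double b) {
        assert(a < b);
        return std::uniform_real_distribution<double>(a, b)(generator());
    }
} // namespace sketch
```

Reusing one generator per thread avoids the cost of reseeding on every call and keeps concurrent callers from sharing mutable state.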

## Roadmap

Some functions we plan to implement are:

* Math
  * Parallel Arithmetic
  * Constants [1](https://www.gnu.org/software/gsl/doc/html/math.html#mathematical-constants)
  * Mini-benchmarks
  * Random Number Generators
* Descriptive statistics [1](https://docs.python.org/3/library/statistics.html) [2](https://docs.scipy.org/doc/scipy/reference/stats.html) [3](https://www.mathworks.com/help/matlab/descriptive-statistics.html)
  * Central tendency
  * Dispersion
  * Correlation
* Hypothesis Tests [1](https://www.mathworks.com/help/stats/hypothesis-tests-1.html)
  * Probability distributions [1](https://www.mathworks.com/help/stats/probability-distributions-1.html)
  * Basic tests [1](https://machinelearningmastery.com/statistical-hypothesis-tests-in-python-cheat-sheet/) [2](https://www.mathworks.com/help/stats/hypothesis-tests-1.html)
  * Non-Parametric tests
  * Anova [1](https://www.mathworks.com/help/stats/analysis-of-variance-anova-1.html)
* Bayesian Statistics [1](https://www.mathworks.com/help/stats/examples/bayesian-analysis-for-a-logistic-regression-model.html)
* Regression Models
* Classification
* Clustering
* Data processing [1](https://www.mathworks.com/help/matlab/preprocessing-data.html)

## Integration

### Build from Source

#### Dependencies

* C++20
* CMake 3.14+

Instructions: Linux/Ubuntu/GCC

Check your GCC version:

```bash
g++ --version
```

The output should be something like:

```console
g++-8 (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0
```

If you see a version before GCC-10, update it with

```bash
sudo apt update
sudo apt install gcc-10
sudo apt install g++-10
```

Once you have installed a newer version of GCC, you can register it with `update-alternatives`. For instance, if you have GCC-7 and GCC-10, you can register both with:

```bash
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-7 7
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-7 7
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-10 10
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-10 10
```

You can now use `update-alternatives` to set your default `gcc` and `g++` to a more recent version:

```bash
update-alternatives --config g++
update-alternatives --config gcc
```

Also check your CMake version:

```bash
cmake --version
```

If it's older than CMake 3.14, update it with

```bash
sudo apt upgrade cmake
```

or download the most recent version from [cmake.org](https://cmake.org/).

[Later](#build-the-examples), when running CMake, make sure you are using GCC-10 or higher by appending the following options:

```bash
-DCMAKE_C_COMPILER=/usr/bin/gcc-10 -DCMAKE_CXX_COMPILER=/usr/bin/g++-10
```

Instructions: macOS/Clang

Check your Clang version:

```bash
clang --version
```

The output should have something like

```console
Apple clang version 11.0.0
```

If you see a version before Clang 11, update LLVM+Clang:

```bash
curl --output clang.tar.xz -L https://github.com/llvm/llvm-project/releases/download/llvmorg-11.0.0/clang+llvm-11.0.0-x86_64-apple-darwin.tar.xz
mkdir clang
tar -xvJf clang.tar.xz -C clang
cd clang/clang+llvm-11.0.0-x86_64-apple-darwin
sudo cp -R * /usr/local/
```

Update CMake with

```bash
brew upgrade cmake
```

or download the most recent version from [cmake.org](https://cmake.org/).

If the last command fails because you don't have [Homebrew](https://brew.sh) on your computer, you can install it with

```bash
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)"
```

or you can follow the instructions in [https://brew.sh](https://brew.sh).

Instructions: Windows/MSVC

* Make sure you have a recent version of [Visual Studio](https://visualstudio.microsoft.com)
* Download Git from [https://git-scm.com/download/win](https://git-scm.com/download/win) and install it
* Download CMake from [https://cmake.org/download/](https://cmake.org/download/) and install it

You can see the dependencies in [`source/CMakeLists.txt`](source/CMakeLists.txt).

#### Build the Examples

This will build the examples in the `build/examples` directory:

```bash
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="-O2"
cmake --build . --parallel 2 --config Release
```

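* Replace `--parallel 2` with `--parallel N`, where `N` is the number of parallel build jobs (for example, the number of cores in your machine)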
* On Windows, replace `-O2` with `/O2`
* On Linux, you might need `sudo` for this last command

#### Installing Scistats from Source

This will install Scistats on your system:

```bash
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="-O2" -DBUILD_EXAMPLES=OFF -DBUILD_TESTS=OFF
cmake --build . --parallel 2 --config Release
cmake --install .
```

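* Replace `--parallel 2` with `--parallel N`, where `N` is the number of parallel build jobs (for example, the number of cores in your machine)
* On Windows, replace `-O2` with `/O2`
* On Linux, you might need `sudo` for the last command (`cmake --install .`)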

### CMake targets

#### Find it as a CMake Package

If you have the library installed, you can call

```cmake
find_package(Scistats)
```

from your CMake build script.

When creating your executable, link the library to the targets you want:

```cmake
add_executable(my_target main.cpp)
target_link_libraries(my_target PUBLIC scistats)
```

Add this header to your source files:

```cpp
#include
```

#### Use it as a CMake subdirectory

You can use Scistats directly in CMake projects without installing it. Check if you have [CMake](http://cmake.org) 3.14+ installed:

```bash
cmake -version
```

Clone the whole project

```bash
git clone https://github.com/alandefreitas/scistats/
```

and add the subdirectory to your CMake project:

```cmake
add_subdirectory(scistats)
```

When creating your executable, link the library to the targets you want:

```cmake
add_executable(my_target main.cpp)
target_link_libraries(my_target PUBLIC scistats)
```

You can now add the [scistats headers](#examples) to your source files.

However, it's always recommended to look for Scistats with `find_package` before including it as a subdirectory. Otherwise, we can get [ODR errors](https://en.wikipedia.org/wiki/One_Definition_Rule) in larger projects.

#### CMake with Automatic Download

Check if you have [CMake](http://cmake.org) 3.14+ installed:

```bash
cmake -version
```

Install [CPM.cmake](https://github.com/TheLartians/CPM.cmake) and then:

```cmake
CPMAddPackage(
NAME scistats
GITHUB_REPOSITORY alandefreitas/scistats
GIT_TAG origin/master # or whatever tag you want
)
# ...
target_link_libraries(my_target PUBLIC scistats)
```

You can now add the [scistats headers](#examples) to your source files.

However, it's always recommended to look for Scistats with `find_package` before including it as a subdirectory. You can use:

```cmake
option(CPM_USE_LOCAL_PACKAGES "Try `find_package` before downloading dependencies" ON)
```

to let CPM.cmake do that for you. Otherwise, we can get [ODR errors](https://en.wikipedia.org/wiki/One_Definition_Rule) in larger projects.

### Other build systems

If you want to use it in another build system, you can either install the library (Section [*Binary Packages*](#binary-packages) or Section [Installing Scistats from Source](#installing-scistats-from-source)) or rewrite the build script for your build system.

If you want to rewrite the build script, your project needs to 1) include the headers, and 2) link with the dependencies described in [`source/CMakeLists.txt`](source/CMakeLists.txt).

## Contributing

There are many ways in which you can contribute to this library:

* Testing the library in new environments
* Contributing with interesting examples
* Contributing with new statistics
* Finding problems in this documentation
* Finding bugs in general
* Whatever idea seems interesting to you

If contributing with code, please leave the pedantic mode ON (`-DBUILD_WITH_PEDANTIC_WARNINGS=ON`), and don't forget cppcheck and clang-format.

Example: CLion

![CLion Settings with Pedantic Mode](docs/img/pedantic_clion.png)

If contributing to the documentation, please edit [`README.md`](README.md) directly, as the files in [`./docs`](./docs) are automatically generated with [mdsplit](https://github.com/alandefreitas/mdsplit).

### Contributors



* alandefreitas (Alan De Freitas)
* rcpsilva (Rcpsilva)