https://github.com/alandefreitas/scistats
High-Performance Descriptive Statistics and Hypothesis Tests in C++20
- Host: GitHub
- URL: https://github.com/alandefreitas/scistats
- Owner: alandefreitas
- License: mit
- Created: 2020-10-28T21:16:41.000Z (about 4 years ago)
- Default Branch: master
- Last Pushed: 2020-12-08T06:50:55.000Z (about 4 years ago)
- Last Synced: 2024-05-21T04:02:48.142Z (7 months ago)
- Topics: bayesian-statistics, data-analysis, descriptive-statistics, hypothesis-testing, performance-statistics, statistics
- Language: C++
- Homepage: https://alandefreitas.github.io/scistats/
- Size: 3.19 MB
- Stars: 5
- Watchers: 4
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: docs/contributing.md
- License: LICENSE
# Scistats
> High-Performance Descriptive Statistics and Hypothesis Tests in C++20
[![Scistats](docs/img/banner.gif)](https://alandefreitas.github.io/scistats/)
Statistics help us analyze and interpret data. High-performance statistical algorithms help us analyze and interpret a lot of data. Most environments provide convenient helper functions to calculate basic statistics. Scistats aims to provide high-performance statistical algorithms with an easy and familiar interface. All algorithms can run sequentially or in parallel, depending on how much data you have.
Table of Contents
- [Quick Start](#quick-start)
- [Descriptive statistics](#descriptive-statistics)
  - [Central Tendency](#central-tendency)
  - [Dispersion](#dispersion)
  - [Multivariate Analysis](#multivariate-analysis)
- [Probability Distributions](#probability-distributions)
- [Hypothesis Testing](#hypothesis-testing)
- [Bayesian statistics](#bayesian-statistics)
- [Mathematics](#mathematics)
  - [Parallel Arithmetic](#parallel-arithmetic)
  - [Constants](#constants)
  - [Functions](#functions)
  - [Measuring Time](#measuring-time)
  - [Random Number Generators](#random-number-generators)
- [Roadmap](#roadmap)
- [Integration](#integration)
  - [Build from Source](#build-from-source)
  - [CMake targets](#cmake-targets)
  - [Other build systems](#other-build-systems)
- [Contributing](#contributing)
  - [Contributors](#contributors)

## Quick Start

Scistats extends the numeric facilities of the standard library to include statistics that work with iterators and ranges. This means you can do things like:
```cpp
std::vector v{/*...*/};
float m = scistats::mean(v);
```

or

```cpp
std::vector v{/*...*/};
float s = scistats::stddev(v); // standard deviation
```

or

```cpp
std::vector v1{/*...*/};
std::vector v2{/*...*/};
float c = scistats::cov(v1, v2); // covariance
```

or

```cpp
std::vector v{/*...*/};
float p = scistats::t_test(v); // Student's t hypothesis test
```

All algorithms accept execution policies and iterators, so you can write

```cpp
std::vector v{/*...*/};
float m = scistats::mean(scistats::execution::par, v);
```

to calculate the mean in parallel, or

```cpp
std::vector v{/*...*/};
float m = scistats::mean(scistats::execution::seq, v);
```

to explicitly tell scistats not to run the calculation in parallel. If no execution policy is provided, scistats chooses a policy according to the input size.

As usual, you can also work directly with iterators, so

```cpp
std::vector v{/*...*/};
float m = scistats::mean(v.begin(), v.end());
```

also works.

Note that, when needed, the result type gets promoted to `float`. If the result for a given statistic needs to be floating point, scistats will always promote an integer input type to a corresponding floating type large enough to keep the results without losing precision.
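
The promotion rule can be illustrated with the standard library alone. The sketch below is not scistats code: `promoted_t` and `mean_sketch` are hypothetical names, and mapping every integral type to `double` is an assumption made only for this example.

```cpp
#include <cassert>
#include <iterator>
#include <numeric>
#include <type_traits>

// Hypothetical helper (not part of scistats): integral element types map to
// double, floating-point types are left unchanged.
template <class T>
using promoted_t =
    typename std::conditional<std::is_integral<T>::value, double, T>::type;

// Arithmetic mean over an iterator pair, accumulated in the promoted type so
// that integer input does not truncate the result.
template <class It>
auto mean_sketch(It first, It last) {
    using value_t = typename std::iterator_traits<It>::value_type;
    using result_t = promoted_t<value_t>;
    assert(first != last && "mean of an empty range is undefined");
    const result_t total = std::accumulate(first, last, result_t{0});
    return total / static_cast<result_t>(std::distance(first, last));
}
```

With this sketch, an `int` range yields a `double` result, while a `float` range stays `float`.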
## Descriptive statistics
### Central Tendency
With ranges:
```cpp
using namespace scistats;
// ...
mean(x);
```

With iterators:

```cpp
mean(x.begin(), x.end());
```

You can run any algorithm in parallel by changing the execution policy:

```cpp
mean(execution::seq, x);
mean(execution::par, x);
```

If no execution policy is provided, scistats will infer the best execution policy according to the input data.

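
As a point of reference for what the median and mode measure, here is a sequential standard-library sketch. It is not how scistats implements them: `median_sketch` and `mode_sketch` are hypothetical names, and breaking mode ties toward the smallest value is an assumption of this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Median: partially sort a copy around the middle element and average the two
// central values when the size is even.
inline double median_sketch(std::vector<double> x) {  // by value: we reorder a copy
    assert(!x.empty());
    const std::size_t mid = x.size() / 2;
    std::nth_element(x.begin(), x.begin() + mid, x.end());
    const double upper = x[mid];
    if (x.size() % 2 == 1) return upper;
    // For even sizes, the lower middle is the largest element left of `mid`.
    const double lower = *std::max_element(x.begin(), x.begin() + mid);
    return (lower + upper) / 2.0;
}

// Mode: the most frequent value; ties resolve to the smallest value because
// std::map iterates in ascending key order and max_element keeps the first hit.
inline int mode_sketch(const std::vector<int>& x) {
    assert(!x.empty());
    std::map<int, std::size_t> counts;
    for (int v : x) ++counts[v];
    const auto best = std::max_element(
        counts.begin(), counts.end(),
        [](const auto& a, const auto& b) { return a.second < b.second; });
    return best->first;
}
```
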
Other functions to measure central tendency are:
| Function | Description |
| ----------- | --------------- |
| `mean(x)` | Arithmetic mean |
| `median(x)` | Median |
| `mode(x)` | Mode |

### Dispersion

To calculate the standard deviation of a data set:
```cpp
stddev(x);
```

If you already know the mean `m`, you can make the calculation faster with:

```cpp
stddev(x, m);
```

Other functions to measure dispersion are:

| Function | Description |
| ----------------- | --------------------------- |
| `var(x)` | Variance |
| `stddev(x)` | Standard Deviation |
| `min(x)` | Minimum Value |
| `max(x)` | Maximum Value |
| `bounds(x)` | Minimum and Maximum Values |
| `percentile(x,p)` | Calculate `p`-th percentile |

### Multivariate Analysis

To calculate the covariance of two data sets:
```cpp
cov(x,y);
```

## Probability Distributions

To get the probability density at `x` in a normal distribution:

```cpp
norm_pdf(x);
```

To get the cumulative probability of `x` in a normal distribution:

```cpp
norm_cdf(x);
```

To get the value `x` that has a cumulative probability `p` in a normal distribution:

```cpp
norm_inv(p);
```

| Probability | Cumulative | Inverse | Description |
| ------------- | ------------- | ------------- | ------------------------ |
| `norm_pdf(x)` | `norm_cdf(x)` | `norm_inv(p)` | Normal distribution |
| `t_pdf(x,df)` | `t_cdf(x,df)` | `t_inv(p,df)` | Student's T distribution |

where `df` is the degrees of freedom of the probability distribution.

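
To make the relationship between the three normal-distribution functions concrete, here is a standard-library sketch. It assumes the standard normal distribution (mean 0, variance 1), which the single-argument calls above suggest but the text does not state. The `*_sketch` names are hypothetical, and the bisection-based inverse is chosen for clarity, not speed.

```cpp
#include <cassert>
#include <cmath>

// Standard normal density: exp(-x^2 / 2) / sqrt(2 * pi).
inline double norm_pdf_sketch(double x) {
    const double inv_sqrt_2pi = 0.3989422804014327;  // 1 / sqrt(2 * pi)
    return inv_sqrt_2pi * std::exp(-0.5 * x * x);
}

// Standard normal CDF via the complementary error function:
// Phi(x) = erfc(-x / sqrt(2)) / 2.
inline double norm_cdf_sketch(double x) {
    return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

// Inverse CDF by bisection on norm_cdf_sketch. Simple and robust, but much
// slower than the closed-form approximations numerical libraries usually use.
inline double norm_inv_sketch(double p) {
    assert(p > 0.0 && p < 1.0);
    double lo = -40.0;
    double hi = 40.0;
    for (int i = 0; i < 200; ++i) {
        const double mid = 0.5 * (lo + hi);
        if (norm_cdf_sketch(mid) < p) {
            lo = mid;
        } else {
            hi = mid;
        }
    }
    return 0.5 * (lo + hi);
}
```

The inverse undoes the CDF: `norm_inv_sketch(norm_cdf_sketch(x))` returns approximately `x` for moderate values of `x`.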
## Hypothesis Testing
To test the hypothesis that the values in `x` come from a distribution whose mean is zero:

```cpp
t_test(x);
```

To test the hypothesis that the values in `x` and `y` come from distributions with the same mean:

```cpp
t_test(x,y);
```

For a paired test:

```cpp
t_test_paired(x,y);
```

To get a confidence interval for these tests:

```cpp
t_test_interval(x);
t_test_interval(x,y);
```

## Bayesian statistics

Given (i) the probability `P(E|H)=likelihood` of the evidence `E` given the hypothesis `H`, (ii) the prior probability `p_hypothesis` of hypothesis `H`, and (iii) the prior probability `p_evidence` of evidence `E`, we can calculate the probability `P(H|E)` of a hypothesis `H` given the evidence `E` with:
```cpp
bayes_theorem(likelihood, p_hypothesis, p_evidence)
```

Given `P(E|H)` and `P(E|not H)`, we can calculate the Bayes factor:

```cpp
bayes_factor(p_evidence_given_h, p_evidence_given_not_h)
```

## Mathematics

### Parallel Arithmetic
To sum the elements of a range in parallel:
```cpp
sum(execution::par, x)
```

Or let `scistats` infer whether it is worth doing in parallel:

```cpp
sum(x)
```

| Function | Description |
|-------------------|-------------------|
| `sum` | Summation |
| `prod` | Product |

### Constants

The header `scistats/math/constants.h` defines a number of useful constants as `constexpr` functions:
| Function | Description | Approximate Value |
| ------------- | -------------------------------------------------------------------------------- | ----------------- |
| `pi` | The constant pi | 3.14159 |
| `epsilon(scale)` | Machine epsilon for a given scale and type | `epsilon(1.)` = 2.22045e-16 |
| `inf` | The number representing infinity | inf |
| `min` | Smallest positive normalized number | 2.22507e-308 |
| `max` | Largest finite number | 1.79769e+308 |
| `NaN` | The number representing "not a number" | nan |
| `e` | Euler's number, the base of the natural logarithm | 2.71828 |
| `euler` | Euler–Mascheroni constant (Euler's gamma) | 0.577216 |
| `log2_e` | The base-2 logarithm of e | 1.4427 |
| `log10_e` | The base-10 logarithm of e | 0.434294 |
| `sqrt2` | The square root of two | 1.41421 |
| `sqrt1_2` | The square root of one-half | 0.707107 |
| `sqrt3` | The square root of three | 1.73205 |
| `pi_2` | Pi divided by two | 1.5708 |
| `pi_4` | Pi divided by four | 0.785398 |
| `sqrt_pi` | The square root of pi | 1.77245 |
| `two_sqrt_pi` | Two divided by the square root of pi | 1.12838 |
| `one_by_pi` | The reciprocal of pi (1./pi) | 0.31831 |
| `two_by_pi` | Twice the reciprocal of pi | 0.63662 |
| `ln10` | The natural logarithm of ten | 2.30259 |
| `ln2` | The natural logarithm of two | 0.693147 |
| `lnpi` | The natural logarithm of pi | 1.14473 |

### Functions

Some helper functions:
| Function | Description |
| -------------------- | ------------------------------------------------ |
| **Numeric** | |
| `abs` | absolute value (for floating point and integers) |
| `almost_equal` | check if two numbers are almost the same |
| `is_odd` | check if integer is odd |
| `is_even` | check if integer is even |
| **Trigonometric** | |
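| `acot` | Inverse cotangent |
| `cot` | Cotangent |
| **Special** | |
| `beta` | Beta function |
| `beta_inc` | Incomplete beta function |
| `beta_inc_inv` | Inverse of the incomplete beta function |
| `beta_inc_inv_upper` | Inverse of the upper incomplete beta function |
| `beta_inc_upper` | Upper incomplete beta function |
| `betaln` | Natural logarithm of the beta function |
| `erfinv` | Inverse error function |
| `gammaln` | Natural logarithm of the gamma function |
| `tgamma` | Gamma function |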
| `xinbta` | xinbta |

### Measuring Time

To measure the time between two operations:
```cpp
double t1 = tic();
// your operations
double t2 = toc();
```

To measure the time it takes to run a function:

```cpp
double t = timeit([](){
    // Your function...
});
```

To create a mini-benchmark measuring the time it takes to run a function:

```cpp
std::vector t = minibench([](){
    // Your function...
});
std::cout << "Mean: " << mean(t) << std::endl;
std::cout << "Standard Deviation: " << stddev(t) << std::endl;
```

### Random Number Generators

To generate a random integer between `a` and `b` with a reasonable random number generator:

```cpp
randi(a,b)
```

To generate a random number from a normal distribution:

```cpp
randn()
```

To generate a random number from a uniform distribution between `a` and `b`:

```cpp
rand(a,b)
```

## Roadmap

Some functions we plan to implement are:
* Math
* Parallel Arithmetic
* Constants [1](https://www.gnu.org/software/gsl/doc/html/math.html#mathematical-constants)
* Mini-benchmarks
* Random Number Generators
* Descriptive statistics [1](https://docs.python.org/3/library/statistics.html) [2](https://docs.scipy.org/doc/scipy/reference/stats.html) [3](https://www.mathworks.com/help/matlab/descriptive-statistics.html)
* Central tendency
* Dispersion
* Correlation
* Hypothesis Tests [1](https://www.mathworks.com/help/stats/hypothesis-tests-1.html)
* Probability distributions [1](https://www.mathworks.com/help/stats/probability-distributions-1.html)
* Basic tests [1](https://machinelearningmastery.com/statistical-hypothesis-tests-in-python-cheat-sheet/) [2](https://www.mathworks.com/help/stats/hypothesis-tests-1.html)
* Non-Parametric tests
* Anova [1](https://www.mathworks.com/help/stats/analysis-of-variance-anova-1.html)
* Bayesian Statistics [1](https://www.mathworks.com/help/stats/examples/bayesian-analysis-for-a-logistic-regression-model.html)
* Regression Models
* Classification
* Clustering
* Data processing [1](https://www.mathworks.com/help/matlab/preprocessing-data.html)

## Integration

### Build from Source
#### Dependencies
* C++20
* CMake 3.14+

Instructions: Linux/Ubuntu/GCC

Check your GCC version:
```bash
g++ --version
```

The output will look something like:

```console
g++-8 (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0
```

If you see a version before GCC-10, update it with

```bash
sudo apt update
sudo apt install gcc-10
sudo apt install g++-10
```

Once you have installed a newer version of GCC, you can register it with `update-alternatives`. For instance, if you have GCC-7 and GCC-10, you can register them with:

```bash
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-7 7
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-7 7
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-10 10
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-10 10
```

You can now use `update-alternatives` to set your default `gcc` and `g++` to a more recent version:

```bash
update-alternatives --config g++
update-alternatives --config gcc
```

Also check your CMake version:

```bash
cmake --version
```

If it's older than CMake 3.14, update it with

```bash
sudo apt upgrade cmake
```

or download the most recent version from [cmake.org](https://cmake.org/).

[Later](#build-the-examples) when running CMake, make sure you are using GCC-10 or higher by appending the following options:

```bash
-DCMAKE_C_COMPILER=/usr/bin/gcc-10 -DCMAKE_CXX_COMPILER=/usr/bin/g++-10
```

Instructions: macOS/Clang

Check your Clang version:
```bash
clang --version
```

The output should look something like

```console
Apple clang version 11.0.0
```

If you see a version before Clang 11, update LLVM+Clang:

```bash
curl --output clang.tar.xz -L https://github.com/llvm/llvm-project/releases/download/llvmorg-11.0.0/clang+llvm-11.0.0-x86_64-apple-darwin.tar.xz
mkdir clang
tar -xvJf clang.tar.xz -C clang
cd clang/clang+llvm-11.0.0-x86_64-apple-darwin
sudo cp -R * /usr/local/
```

Update CMake with

```bash
brew upgrade cmake
```

or download the most recent version from [cmake.org](https://cmake.org/).

If the last command fails because you don't have [Homebrew](https://brew.sh) on your computer, you can install it with

```bash
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)"
```

or you can follow the instructions in [https://brew.sh](https://brew.sh).

Instructions: Windows/MSVC
* Make sure you have a recent version of [Visual Studio](https://visualstudio.microsoft.com)
* Download Git from [https://git-scm.com/download/win](https://git-scm.com/download/win) and install it
* Download CMake from [https://cmake.org/download/](https://cmake.org/download/) and install it

You can see the dependencies in [`source/CMakeLists.txt`](source/CMakeLists.txt).

#### Build the Examples
This will build the examples in the `build/examples` directory:
```bash
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="-O2"
cmake --build . --parallel 2 --config Release
```

* Replace the `2` in `--parallel 2` with the number of parallel build jobs you want (for example, your number of cores)
* On Windows, replace `-O2` with `/O2`
* On Linux, you might need `sudo` for this last command

#### Installing Scistats from Source

This will install Scistats on your system:
```bash
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="-O2" -DBUILD_EXAMPLES=OFF -DBUILD_TESTS=OFF
cmake --build . --parallel 2 --config Release
cmake --install .
```

* Replace the `2` in `--parallel 2` with the number of parallel build jobs you want (for example, your number of cores)
* On Windows, replace `-O2` with `/O2`
* On Linux, you might need `sudo` for this last command

### CMake targets

#### Find it as a CMake Package
If you have the library installed, you can call
```cmake
find_package(Scistats)
```

from your CMake build script.

When creating your executable, link the library to the targets you want:

```cmake
add_executable(my_target main.cpp)
target_link_libraries(my_target PUBLIC scistats)
```

Add this header to your source files:

```cpp
#include
```

#### Use it as a CMake subdirectory

You can use Scistats directly in CMake projects without installing it. Check if you have [CMake](http://cmake.org) 3.14+ installed:

```bash
cmake -version
```

Clone the whole project

```bash
git clone https://github.com/alandefreitas/scistats/
```

and add the subdirectory to your CMake project:

```cmake
add_subdirectory(scistats)
```

When creating your executable, link the library to the targets you want:

```cmake
add_executable(my_target main.cpp)
target_link_libraries(my_target PUBLIC scistats)
```

You can now add the [scistats headers](#examples) to your source files.

However, it's always recommended to look for Scistats with `find_package` before including it as a subdirectory. Otherwise, we can get [ODR errors](https://en.wikipedia.org/wiki/One_Definition_Rule) in larger projects.
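
One way to follow this recommendation is sketched below. It assumes the repository was cloned into a `scistats` directory next to your `CMakeLists.txt`, and that the installed package exposes the same `scistats` target shown above. `Scistats_FOUND` is the result variable CMake sets for `find_package(Scistats)`.

```cmake
# Prefer an installed copy; fall back to the cloned sources otherwise.
find_package(Scistats QUIET)
if(NOT Scistats_FOUND)
    add_subdirectory(scistats)
endif()

add_executable(my_target main.cpp)
target_link_libraries(my_target PUBLIC scistats)
```
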
#### CMake with Automatic Download
Check if you have [CMake](http://cmake.org) 3.14+ installed:

```bash
cmake -version
```

Install [CPM.cmake](https://github.com/TheLartians/CPM.cmake) and then:

```cmake
CPMAddPackage(
    NAME scistats
    GITHUB_REPOSITORY alandefreitas/scistats
    GIT_TAG origin/master # or whatever tag you want
)
# ...
target_link_libraries(my_target PUBLIC scistats)
```

You can now add the [scistats headers](#examples) to your source files.

However, it's always recommended to look for Scistats with `find_package` before including it as a subdirectory. You can use:

```cmake
option(CPM_USE_LOCAL_PACKAGES "Try `find_package` before downloading dependencies" ON)
```

to let CPM.cmake do that for you. Otherwise, we can get [ODR errors](https://en.wikipedia.org/wiki/One_Definition_Rule) in larger projects.

### Other build systems
If you want to use it in another build system, you can either install the library (Section [*Binary Packages*](#binary-packages) or Section [Installing Scistats from Source](#installing-scistats-from-source)) or rewrite the build script yourself.
If you want to rewrite the build script, your project needs to 1) include the headers, and 2) link with the dependencies described in [`source/CMakeLists.txt`](source/CMakeLists.txt).
## Contributing
There are many ways in which you can contribute to this library:
* Testing the library in new environments
* Contributing with interesting examples
* Contributing with new statistics
* Finding problems in this documentation
* Finding bugs in general
* Whatever idea seems interesting to you

If contributing with code, please leave the pedantic mode ON (`-DBUILD_WITH_PEDANTIC_WARNINGS=ON`), and don't forget cppcheck and clang-format.

Example: CLion

![CLion Settings with Pedantic Mode](docs/img/pedantic_clion.png)

If contributing to the documentation, please edit [`README.md`](README.md) directly, as the files in [`./docs`](./docs) are automatically generated with [mdsplit](https://github.com/alandefreitas/mdsplit).

### Contributors