https://github.com/hosseinmoein/DataFrame

C++ DataFrame for statistical, Financial, and ML analysis -- in modern C++ using native types and contiguous memory storage
ai cpp data-analysis data-science dataframe financial-data-analysis financial-engineering heterogeneous-data large-data machine-learning multidimensional-data numerical-analysis pandas polars statistical statistical-analysis tensor tensorboard trading-algorithms trading-strategies


[![C++23](https://img.shields.io/badge/C%2B%2B-23-blue.svg)](https://isocpp.org/std/the-standard )
[![Build status](https://ci.appveyor.com/api/projects/status/hjw01qui3bvxs8yi?svg=true)](https://ci.appveyor.com/project/hosseinmoein/dataframe)
[![Codacy Badge](https://api.codacy.com/project/badge/Grade/db646376a4014c3788c7224e670fe451)](https://app.codacy.com/organizations/gh/hosseinmoein/repositories)


[![GitHub tag (latest by date)](https://img.shields.io/github/tag-date/hosseinmoein/DataFrame.svg?color=blue&label=Official%20Release&style=popout)](https://github.com/hosseinmoein/DataFrame/releases)
[![Conan Center](https://img.shields.io/conan/v/dataframe)](https://conan.io/center/recipes/dataframe)
[![VCPKG package](https://repology.org/badge/version-for-repo/vcpkg/dataframe.svg)](https://vcpkg.link/ports/dataframe)

## DataFrame documentation with code samples
This is a C++ analytical library designed for data analysis, similar to libraries in Python and R. For example, you would compare this to [Pandas](https://pandas.pydata.org), [R data.frame](https://www.w3schools.com/r/r_data_frames.asp), or [Polars](https://www.pola.rs).

You can slice the data in many different ways. You can join, merge, and group-by the data. You can run various statistical, summarization, financial, and ML algorithms on the data. You can easily add your own custom algorithms. You can do multi-column sorts, and custom pick and delete the data. And more …

DataFrame also includes a large collection of analytical algorithms in the form of visitors. These range from basic stats such as Mean, Std Deviation, Return, … to more involved analysis such as Affinity Propagation, Polynomial Fit, Fast Fourier transform of arbitrary length … including a good collection of trading indicators. You can also easily add your own algorithms.
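
To make the visitor idea concrete, here is a minimal sketch that uses only the standard library. It is not the DataFrame API: `SketchMeanVisitor`, `SketchStdDevVisitor`, and `visit_column` are hypothetical names invented for this illustration, and the library's own visitors have a richer interface. The sketch only shows the general shape, assuming a visitor is an object that is fed one value at a time and exposes a result afterwards, so that adding an algorithm means writing one more small class.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical visitor: accumulates a running mean over the values it is fed.
// (Illustrative only; not the library's own visitor type.)
struct SketchMeanVisitor {
    void operator()(double value) {
        sum_ += value;
        ++count_;
    }
    double result() const {
        assert(count_ > 0);
        return sum_ / static_cast<double>(count_);
    }

private:
    double sum_ = 0.0;
    std::size_t count_ = 0;
};

// Hypothetical visitor: sample standard deviation via Welford's online update.
struct SketchStdDevVisitor {
    void operator()(double value) {
        ++count_;
        const double delta = value - mean_;
        mean_ += delta / static_cast<double>(count_);
        m2_ += delta * (value - mean_);
    }
    double result() const {
        assert(count_ > 1);
        return std::sqrt(m2_ / static_cast<double>(count_ - 1));
    }

private:
    double mean_ = 0.0;
    double m2_ = 0.0;
    std::size_t count_ = 0;
};

// Hypothetical driver: feeds every value of one contiguous column to a visitor
// and hands the visitor back so the caller can read its result.
template <typename Visitor>
Visitor visit_column(const std::vector<double>& column, Visitor visitor) {
    for (const double value : column) {
        visitor(value);
    }
    return visitor;
}
```

With this shape, `visit_column(col, SketchMeanVisitor{}).result()` yields the mean of `col`, and swapping in a different visitor type changes the algorithm without touching the driver.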

DataFrame also employs extensive multithreading in almost all its APIs for large datasets. That makes DataFrame especially suitable for analyzing large datasets.

For basic operations to start you off, see [Hello World](examples/hello_world.cc). For a complete list of features with code samples, see the documentation.

I have followed a few principles in this library:

1. [Support any type either built-in or user defined without needing new code](https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/any_type.html)
2. [Never chase pointers à la _linked lists_, _std::any_, _pointer to base_, ...](https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/pointers.html)
3. [Have all column data in contiguous memory space](https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/contiguous_memory.html)
4. [Never use more space than you need à la _unions_, _std::variant_, ...](https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/std_variant.html)
5. [Avoid copying data as much as possible](https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/copying_data.html)
6. [Use multi-threading but only when it makes sense](https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/multithreading.html)
7. [Do not attempt to protect the user against _garbage in_, _garbage out_](https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/garbage_in_garbage_out.html)
8. Keep the DataFrame library self-contained, meaning DataFrame must depend only on the _C++ language_ and its _standard library_

---

### Performance
You have probably heard of the Polars DataFrame. It is implemented in Rust and ported to Python with zero overhead (as long as you don’t have a loop). Many people have asked me to write a comparison of DataFrame vs. Polars, so I finally found some time to learn a bit about Polars and write a very simple benchmark.

I wrote the following identical programs for both Polars and C++ DataFrame (and Pandas). I used Polars version 0.19.14, and I compiled the C++ program as C++20 using clang with the -O3 option. I ran all of them on my somewhat outdated MacBook Pro.

In both cases, I created a dataframe with 3 random columns. The C++ DataFrame also required an additional index column of the same size. Polars doesn’t believe in index columns (that has its own pros and cons, which I am not going through here).

Each program has three identical parts. First, it generates and populates 3 columns with 300 million random numbers each (in the case of C++ DataFrame, it must also generate a sequential index column of the same size). That is the part I am _not_ interested in. In the second part, it calculates the mean of the first column, the variance of the second column, and the Pearson correlation of the second and third columns. In the third part, it does a select (or filter, as Polars calls it) on one of the columns.
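
For readers who want to see exactly which quantities the second and third parts compute, here is a small standard-library sketch. It is not the benchmark code and does not use the DataFrame API; the function names are hypothetical, and it assumes the sample variance (dividing by n - 1), which may differ from what each library computes by default. The real benchmark sources are linked after the results.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Arithmetic mean of one column.
double column_mean(const std::vector<double>& x) {
    assert(!x.empty());
    double sum = 0.0;
    for (const double v : x) sum += v;
    return sum / static_cast<double>(x.size());
}

// Sample variance (divides by n - 1); this choice is an assumption of the sketch.
double column_variance(const std::vector<double>& x) {
    assert(x.size() > 1);
    const double m = column_mean(x);
    double sum_sq = 0.0;
    for (const double v : x) sum_sq += (v - m) * (v - m);
    return sum_sq / static_cast<double>(x.size() - 1);
}

// Pearson correlation: covariance divided by the product of standard deviations.
// The normalizing factors cancel, so raw sums of deviations are enough.
double pearson_correlation(const std::vector<double>& x,
                           const std::vector<double>& y) {
    assert(x.size() == y.size() && x.size() > 1);
    const double mx = column_mean(x);
    const double my = column_mean(y);
    double sxy = 0.0, sxx = 0.0, syy = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double dx = x[i] - mx;
        const double dy = y[i] - my;
        sxy += dx * dy;
        sxx += dx * dx;
        syy += dy * dy;
    }
    return sxy / std::sqrt(sxx * syy);
}

// Select: copy out the values for which the predicate returns true.
template <typename Predicate>
std::vector<double> select_where(const std::vector<double>& x, Predicate keep) {
    std::vector<double> out;
    for (const double v : x) {
        if (keep(v)) out.push_back(v);
    }
    return out;
}
```

Each function is a single pass (or two) over contiguous data, which is why the calculation and selection timings are a reasonable probe of how well a dataframe implementation streams through memory.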

**Results**:

The maximum dataset I could load into Polars was 300 million rows per column. Any bigger dataset blew up the memory and caused the OS to kill the process. I ran C++ DataFrame with 10 billion rows per column, and I am sure it would have run with bigger datasets too. So, I was forced to run both with 300 million rows to compare.
I ran each test 4 times and took the best time. Polars numbers varied a lot from one run to another, especially calculation and selection times. C++ DataFrame numbers were significantly more consistent.
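
The "run several times, keep the best" measurement can be sketched as follows. This is a hypothetical helper (`best_of` is not part of the benchmark sources or of DataFrame), written under the assumption that wall-clock time from a monotonic clock is the quantity of interest.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>

// Hypothetical helper: runs `work` `repeats` times and returns the fastest
// wall-clock time, in seconds, observed across those runs.
template <typename Work>
double best_of(int repeats, Work work) {
    assert(repeats > 0);
    using timer_clock = std::chrono::steady_clock;  // monotonic, unaffected by clock changes
    double best = std::numeric_limits<double>::infinity();
    for (int i = 0; i < repeats; ++i) {
        const auto start = timer_clock::now();
        work();
        const std::chrono::duration<double> elapsed = timer_clock::now() - start;
        best = std::min(best, elapsed.count());
    }
    return best;
}
```

Taking the minimum rather than the average filters out interference from other processes, which matters when timings vary a lot between runs.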

```text
C++ DataFrame:
Data generation/load time: 26.945900 secs
Calculation time: 1.260150 secs
Selection time: 0.742493 secs
Overall time: 28.948600 secs

Polars:
Data generation/load time: 28.468640 secs
Calculation time: 4.876561 secs
Selection time: 3.876561 secs
Overall time: 36.876345 secs

Pandas, for comparison:
Data generation/load time: 36.678976 secs
Calculation time: 40.326350 secs
Selection time: 8.326350 secs
Overall time: 85.845114 secs
```

[C++ DataFrame source file](https://github.com/hosseinmoein/DataFrame/blob/master/benchmarks/dataframe_performance.cc)

[Polars source file](https://github.com/hosseinmoein/DataFrame/blob/master/benchmarks/polars_performance.py)

[Pandas source file](https://github.com/hosseinmoein/DataFrame/blob/master/benchmarks/pandas_performance.py)

---

[**Please consider sponsoring DataFrame, especially if you are using it in a production capacity. It is the strongest form of appreciation**](https://github.com/sponsors/hosseinmoein)

---

### Installing using CMake
```sh
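# Create and enter a build directory (use "Debug" instead for a debug build)
mkdir Release
cd Release

# Configure a Release build
cmake -DCMAKE_BUILD_TYPE=Release -DHMDF_BENCHMARKS=1 -DHMDF_EXAMPLES=1 -DHMDF_TESTING=1 ..
# Or, from a Debug directory, configure a Debug build instead:
# cmake -DCMAKE_BUILD_TYPE=Debug -DHMDF_SANITY_EXCEPTIONS=1 -DHMDF_BENCHMARKS=1 -DHMDF_EXAMPLES=1 -DHMDF_TESTING=1 ..

# Build and install
make
make install

# To uninstall later, run this from the same build directory
make uninstall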
```

### Package managers
DataFrame is available on the [_Conan_](https://conan.io/center/recipes/dataframe) platform. See the [_Conan_ docs](https://docs.conan.io/en/latest/) for more information.

DataFrame is also available on the [_Microsoft VCPKG_](https://vcpkg.link/ports/dataframe) platform. See the [_VCPKG docs_](https://learn.microsoft.com/en-us/vcpkg/) for more information.