https://github.com/firmai/pandapy

PandaPy has the speed of NumPy and the usability of Pandas 10x to 50x faster (by @firmai)
Host: GitHub
URL: https://github.com/firmai/pandapy
Owner: firmai
Created: 2020-01-15T18:21:23.000Z (about 6 years ago)
Default Branch: master
Last Pushed: 2021-10-20T11:36:04.000Z (over 4 years ago)
Last Synced: 2025-05-05T02:51:40.215Z (9 months ago)
Topics: algorithmic-trading, arrays, data-science, data-structures, finance, machine-learning, numpy, pandas, structured-data
Language: Python
Homepage: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3599639
Size: 550 KB
Stars: 547
Watchers: 19
Forks: 66
Open Issues: 2
Metadata Files:
- Readme: README.md
Awesome Lists containing this project

awesome-python-machine-learning-resources - GitHub - 50% open · ⏱️ 20.10.2021): (Data containers and structures)
awesome-starred - firmai/pandapy - PandaPy has the speed of NumPy and the usability of Pandas 10x to 50x faster (by @firmai) (finance)
README

## PandaPy

[![Downloads](https://pepy.tech/badge/pandapy)](https://pepy.tech/project/pandapy)

[![DOI](https://zenodo.org/badge/234144397.svg)](https://zenodo.org/badge/latestdoi/234144397)

> "I came across PandaPy last week and have already used it in my current project. It is a fascinating Python library with a lot of potential to become mainstream."

[SSRN Report](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3599639)

Snow, Derek (2020), PandaPy: A Wrapper Around Structured Arrays to Mimic ‘structs’ in the C Language, SSRN

```

@software{pandapy,

  title = {{PandaPy}: A Wrapper Around Structured Arrays to Mimic ‘structs’ in the C Language.},

  author = {Snow, Derek},

  url = {https://github.com/firmai/pandapy/},

  version = {1.11},

  date = {2020-05-13},

}

```

---------

**Install**

```

!pip3 install pandapy

```

**Load**

```python

import pandapy as pp

```

#### Why PandaPy? 

1. Maintains the full functionality and speed of the structured NumPy datatype (e.g., ```array[col1] + array[col2]``` or ```np.log(array[col1])```).

1. If you have smaller Pandas dataframes (fewer than 50K records) in a production environment, PandaPy is worth considering: you will see a significant speed-up and a large reduction in memory usage.

1. When using mixed data types (int, float, datetime, str), PandaPy generally consumes roughly one third less memory than Pandas.

1. Pandas outperforms PandaPy at the same point where Pandas outperforms __NumPy__. NumPy generally performs better than Pandas for 50K rows or fewer. Pandas generally performs better than NumPy for 500K rows or more; from 50K to 500K rows it is a toss-up depending on the operation.

1. Because both Pandas and PandaPy are built on NumPy, the performance difference can be attributed to Pandas' overhead. For larger datasets, Pandas' hash tables and columnar data format give it the upper hand on many operations.

1. The performance claims therefore hold for small datasets of 1,000-100,000 rows. There are, however, many PandaPy operations that improve relative to Pandas as the number of rows increases: rename, column drop, fillna mean, correlation matrix, filter (``array > 0``), value reads (```a=array[col]```), singular value access (```array[col][pos]```), and atomic functions (```sqrt, power```). For np. calculations (```np.log, np.exp```, etc.) the differences even out.

2. Provides wrapper functions over NumPy to give you the usability of Pandas (e.g., ```pp.group(array, [col1, col2, col2], ['mean', 'std'], ['Adj_Close','Close'])```).

3. If you need Pandas for speciality functions, you can easily convert to Pandas with ```df = pp.pandas(array)``` and back with ```array = pp.structured(df)```.

4. For simple calculations on a small dataset (i.e., plus, mult, log), PandaPy is 25x - 80x faster than Pandas.

5. For table functions (i.e., group, pivot, drop, concat, fillna) on a small dataset, PandaPy is 5x - 100x faster than Pandas.

6. For most use cases with small data, PandaPy is faster than Dask, Modin Ray and Pandas.

7. The best competing Python package for performance on table functions is [datatable](https://github.com/h2oai/datatable); it is 2x - 10x faster than PandaPy.

8. The problem is that datatable is 5x - 10x slower with simple calculations (plus, mult, returns), is less intuitive, does not have a large range of functions, has very few complementary libraries (e.g., matplotlib), and doesn't leave you in a NumPy datatype.

9. For finance applications the speed of simple calculations takes precedence over table function speed.

10. PandaPy is not designed to scale up to clusters for multi-computer processing like Dask, Modin, and Spark; instead, it is focused on speed and usability within a single computer's memory.

11. Machines are getting larger: an EC2 X1 instance has 2TB of RAM and is remarkably affordable. If it can be done on a single machine then it should be done on a single machine. Quoting Dask - "For data that fits into RAM, Pandas {PandaPy, NumPy} can often be faster and easier to use than Dask DataFrame"

12. If your dataset is very small, you can load it using PandaPy's ```read()``` function. For medium-sized data, it is best to load it with datatable or pyspark and convert it to structured NumPy. If it is large, use pyspark, Dask, or Modin; if it is very large, use pyspark.

13. Lastly, PandaPy can take any multidimensional object as input, and columns do not have to conform to the basic NumPy datatypes. A column can hold nested datatypes, subarrays, or functions, as long as it conforms to the array length, which allows a great amount of flexibility. You can, for example, ```add(array, "panda function",[[pd for i in range(len(multiple_stocks))]])``` to create a column holding the pandas (pd) module and access it at any index value: ```array["panda function"][0].read_csv(url)```.

PandaPy, similar to the original Pandas project, is developed to improve the usability of Python for finance. Structured datatypes are designed to mimic ‘structs’ in the C language and share a similar memory layout. PandaPy currently houses more than 30 functions. Structured NumPy arrays are meant for interfacing with C code and for low-level manipulation of structured buffers, for example for interpreting binary blobs. For these purposes they support specialized features such as subarrays, nested datatypes, and unions, and allow control over the memory layout of the structure.
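
The column-wise operations listed above (arithmetic, ```np.log```, filtering, single-value access) come from plain NumPy structured arrays. The following is a minimal sketch using only NumPy, not PandaPy itself; the array name ```prices``` and its three sample rows are illustrative assumptions.

```python
import numpy as np

# Hypothetical price table: one record per trading day, one named field per column.
prices = np.array(
    [("2019-01-02", 310.12, 11658600),
     ("2019-01-03", 300.36, 6965200),
     ("2019-01-04", 317.69, 7394100)],
    dtype=[("Date", "datetime64[D]"), ("Adj_Close", "f8"), ("Volume", "i8")],
)

# Column access returns an ordinary NumPy view, so vectorised math works directly.
log_close = np.log(prices["Adj_Close"])
turnover = prices["Adj_Close"] * prices["Volume"]

# Boolean filtering keeps whole records (all fields of the matching rows).
busy_days = prices[prices["Volume"] > 7_000_000]

# Single-value access by column name and position.
first_close = prices["Adj_Close"][0]
```

Because each column is a view on the same buffer, none of these operations copies the table or builds an index, which is one plausible reason such operations are cheap on small data.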

**Note this is a fledgling project, much room for improvement, all feedback appreciated (issues tab)**

### Description

------------------------

A structured NumPy array is an array of structures. Ordinary NumPy arrays can only contain one data type, but structured arrays in a sense create an array of homogeneous structures. This is done without moving out of NumPy, as is required with Xarray. For structured arrays the data type only has to be the same per column, as in an SQL database. Each column can be another multidimensional object and does not have to conform to the basic NumPy datatypes.

PandaPy comes with functionality similar to Pandas, such as groupby, pivot, and others. The biggest benefit of this approach is that a NumPy dtype (data type) maps directly onto a C structure definition, so the buffer containing the array content can be accessed directly within an appropriately written C program. If you find yourself writing a Python interface to a legacy C or Fortran library that manipulates structured data, you'll probably find structured arrays quite useful.
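
To make the dtype-to-struct mapping concrete, here is a small NumPy-only sketch. The field names and the C declarations in the comments are illustrative assumptions; the exact padding depends on the platform's alignment rules.

```python
import numpy as np

# align=True asks NumPy to pad fields the way a C compiler typically would.
tick_dtype = np.dtype(
    [("ticker", "S8"),        # char ticker[8];
     ("price", "f8"),         # double price;
     ("volume", "i4"),        # int32_t volume;
     ("ohlc", "f4", (4,))],   # float ohlc[4];  (a subarray field)
    align=True,
)

ticks = np.zeros(2, dtype=tick_dtype)
ticks[0] = (b"TSLA", 310.12, 11658600, (306.1, 315.13, 298.8, 310.12))

# Byte offset of each field inside one record, and the size of one record.
offsets = {name: tick_dtype.fields[name][1] for name in tick_dtype.names}
record_size = tick_dtype.itemsize

# The raw buffer is what a C program reading an array of such structs would see.
raw = ticks.tobytes()
```

Inspecting ```offsets``` and ```record_size``` is a quick way to check that the Python-side layout agrees with the struct definition on the C side before passing the buffer across.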

### Additional

1. Play around with [speed tests here](https://colab.research.google.com/drive/1JqvplTUUciIw2KGkuoCNv196prl3eoiL) and some more [here](https://colab.research.google.com/drive/1I4sJOM8o4RAqHp3YU1nlx92UxwoC3WB-).

2. Test and explore the package with this [Google Colab Notebook](https://colab.research.google.com/drive/1j45o36_FFIof9uzp1DoyzxETD4lfpci5).

3. Get in touch on [LinkedIn](https://www.linkedin.com/company/firmai) or [Twitter](https://twitter.com/dereknow?lang=en).

4. Use ```table(array)``` to get a pandas-looking table printout.

5. You can read the paper on [SSRN](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3599639) for a little more information. 

### Functions

#### PandaPy Speed Over Pandas In (X) e.g., (dropnarow) (30x)

----------------------------------

#### Array Structure

    Read In Arrays (read)

    To Pandas (pandas) 

    Pandas to Structured (structured) 

    To Unstructured (to_unstruct) 

    To Structured (to_struct) 

    Print Table (table) 

#### Explorative Functions

    Descriptive Statistics (describe) (5x)

    Correlation Array (corr) (2x)

#### Finance Functions

    Returns (returns) (50x)  

    Portfolio Value (portfolio_value) (50x)

    Cumulative Value (cummulative_return) (50x)

    Column Lags (lags) (7x)

#### Array Functions

    Drop Null Rows (dropnarow) (30x)

    Drop Column/s (drop) (100x)

    Add Column/s (add) (3x)

    Concatenate (concat) (rows 25x columns 70x)

    Merge (merge) (2x)

    Group by (group) (10x)

    Pivot (pivot) (20x)

    Fill Nulls (fillna) (20x)

    Shift Column (shift) (50x)

    Rename (rename) (500x)

#### Other Speed Tests

    Update (array[col] = values) (60x)

    Addition (array[col] + array[col]) (80x)

    Multiplication (array[col] * array[col]) (80x)

    Log (np.log(array[col])) (25x)

    

    

_note speed tests done on financial dataset only_

### Documentation by Example

------------------------

**Read In Arrays**

```python
import numpy as np

# First example: pivot long-format closing prices into one column per ticker
multiple_stocks = pp.read('https://github.com/firmai/random-assets-two/blob/master/numpy/multiple_stocks.csv?raw=true')
closing = multiple_stocks[['Ticker','Date','Adj_Close']]
piv = pp.pivot(closing,"Date","Ticker","Adj_Close"); piv
# Convert the pivoted values back to a structured array named by ticker
closing = pp.to_struct(piv, name_list = list(np.unique(multiple_stocks["Ticker"])))

# Second example: load two single-stock tables and select column subsets
tsla = pp.read('https://github.com/firmai/random-assets-two/raw/master/numpy/tsla.csv')
crm = pp.read('https://github.com/firmai/random-assets-two/raw/master/numpy/crm.csv')
tsla_sub = tsla[["Date","Adj_Close","Volume"]]
crm_sub = crm[["Date","Adj_Close","Volume"]]
crm_adj = crm[['Date','Adj_Close']]
```

```

closing

```

    array([(37.24206924, 100.45429993, 44.57522202, 20.72605705, 130.59109497, 35.80251312,  41.9791832 ,  81.51140594, 66.33999634),

           (35.08446503,  97.62433624, 43.83200836, 20.34561157, 128.53627014, 35.80251312,  41.59314346,  80.89860535, 66.15000153),

           (35.34244537,  97.63354492, 42.79874039, 19.90727234, 125.76422119, 36.07437897,  40.98268127,  80.28580475, 64.58000183),

           ...,

           (21.57999992, 289.79998779, 59.08000183, 11.18000031, 135.27000427, 55.34999847, 158.96000671, 137.53999329, 88.37000275),

           (21.34000015, 291.51998901, 58.65999985, 11.07999992, 132.80999756, 55.27000046, 157.58999634, 136.80999756, 87.95999908),

           (21.51000023, 293.6499939 , 58.47999954, 11.15999985, 134.03999329, 55.34999847, 157.69999695, 136.66999817, 88.08999634)],

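          dtype=[('AA', ...

Descriptive statistics for ```closing```:

| Describe | observations | minimum | maximum | mean | variance | skewness | kurtosis |
|---|---|---|---|---|---|---|---|
| AA | 1258.00 | 15.97 | 60.23 | 31.46 | 99.42 | 0.67 | -0.58 |
| AAPL | 1258.00 | 85.39 | 293.65 | 149.45 | 2119.86 | 0.66 | -0.28 |
| DAL | 1258.00 | 30.73 | 62.69 | 47.15 | 44.33 | -0.01 | -0.78 |
| GE | 1258.00 | 6.42 | 28.67 | 18.85 | 48.45 | -0.25 | -1.54 |
| IBM | 1258.00 | 99.83 | 161.17 | 133.35 | 116.28 | -0.37 | 0.56 |
| KO | 1258.00 | 32.81 | 55.35 | 41.67 | 28.86 | 0.80 | -0.05 |
| MSFT | 1258.00 | 36.27 | 158.96 | 78.31 | 1102.21 | 0.61 | -0.82 |
| PEP | 1258.00 | 78.46 | 139.30 | 102.86 | 229.01 | 0.63 | -0.32 |
| UAL | 1258.00 | 37.75 | 96.70 | 69.22 | 195.65 | 0.02 | -1.04 |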
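
The kind of per-column summary reported by a describe-style function can be sketched in plain NumPy. This is not PandaPy's implementation: the two-ticker sample, the use of sample variance, and the excess-kurtosis convention are all assumptions made for illustration.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Hypothetical two-ticker sample; a real table would have many more rows.
closing = np.array(
    [(37.24, 100.45), (35.08, 97.62), (35.34, 97.63), (36.26, 99.00), (37.29, 102.81)],
    dtype=[("AA", "f8"), ("AAPL", "f8")],
)

# All fields share one float type, so the records can be turned into a 2-D matrix.
values = rfn.structured_to_unstructured(closing)

mean = values.mean(axis=0)
centered = values - mean
std = values.std(axis=0)  # population standard deviation, used for the moments

summary = {
    name: {
        "observations": values.shape[0],
        "minimum": values[:, i].min(),
        "maximum": values[:, i].max(),
        "mean": mean[i],
        "variance": values[:, i].var(ddof=1),  # sample variance: an assumption
        "skewness": (centered[:, i] ** 3).mean() / std[i] ** 3,
        "kurtosis": (centered[:, i] ** 4).mean() / std[i] ** 4 - 3.0,  # excess kurtosis
    }
    for i, name in enumerate(closing.dtype.names)
}
```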

**Drop Column/s**

```python

removed = pp.drop(closing,["AA","AAPL","IBM"]) ; removed[:5]

```

    array([(44.57522202, 20.72605705, 35.80251312, 41.9791832 , 81.51140594, 66.33999634),

           (43.83200836, 20.34561157, 35.80251312, 41.59314346, 80.89860535, 66.15000153),

           (42.79874039, 19.90727234, 36.07437897, 40.98268127, 80.28580475, 64.58000183),

           (42.57216263, 19.91554451, 36.52467346, 41.50337982, 82.63342285, 65.52999878),

           (43.67792892, 20.15538216, 36.966465  , 42.72432327, 84.13523865, 66.63999939)],

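          dtype={'names':['DAL','GE','KO','MSFT','PEP','UAL'], 'formats':[...

Merged TSLA and CRM table (first five rows):

|  | Date | Adj_Close_TSLA | Adj_Close_CRM | Volume |
|---|---|---|---|---|
| 0 | 2019-01-02 | 310.120 | 135.550 | 11658600 |
| 1 | 2019-01-03 | 300.360 | 130.400 | 6965200 |
| 2 | 2019-01-04 | 317.690 | 137.960 | 7394100 |
| 3 | 2019-01-07 | 334.960 | 142.220 | 7551200 |
| 4 | 2019-01-08 | 335.350 | 145.720 | 7008500 |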
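
A merge of two structured arrays on a shared ```Date``` column can be sketched with NumPy alone. This is an illustrative inner join, not PandaPy's ```merge```; the sample rows, the ```_TSLA```/```_CRM``` suffixes, and the assumption that dates are unique within each table are mine.

```python
import numpy as np

tsla_sub = np.array(
    [("2019-01-02", 310.12, 11658600), ("2019-01-03", 300.36, 6965200),
     ("2019-01-04", 317.69, 7394100)],
    dtype=[("Date", "U10"), ("Adj_Close", "f8"), ("Volume", "i8")],
)
crm_adj = np.array(
    [("2019-01-03", 130.40), ("2019-01-04", 137.96), ("2019-01-07", 142.22)],
    dtype=[("Date", "U10"), ("Adj_Close", "f8")],
)

# Inner join on Date: find the shared keys and their positions in each table.
common, left_idx, right_idx = np.intersect1d(
    tsla_sub["Date"], crm_adj["Date"], return_indices=True
)

# Build the output record type, renaming the colliding Adj_Close columns.
merged = np.empty(
    len(common),
    dtype=[("Date", "U10"), ("Adj_Close_TSLA", "f8"),
           ("Adj_Close_CRM", "f8"), ("Volume", "i8")],
)
merged["Date"] = common
merged["Adj_Close_TSLA"] = tsla_sub["Adj_Close"][left_idx]
merged["Adj_Close_CRM"] = crm_adj["Adj_Close"][right_idx]
merged["Volume"] = tsla_sub["Volume"][left_idx]
```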

```python

### This is the new function that you should include above

### You can add the same peculiarities to remove

```

**Add and Concatenate**

```python

tsla = pp.add(tsla,["Ticker"], "TSLA", "U10")

crm = pp.add(crm,["Ticker"], "CRM", "U10")

combine = pp.concat(tsla[0:5], crm[0:5], type="row"); combine

```

    array([(315.13000488, 298.79998779, 306.1000061 , 310.11999512, 11658600, 310.11999512, '2019-01-02', 'TSLA'),

           (309.3999939 , 297.38000488, 307.        , 300.35998535,  6965200, 300.35998535, '2019-01-03', 'TSLA'),

           (318.        , 302.73001099, 306.        , 317.69000244,  7394100, 317.69000244, '2019-01-04', 'TSLA'),

           (336.73999023, 317.75      , 321.72000122, 334.95999146,  7551200, 334.95999146, '2019-01-07', 'TSLA'),

           (344.01000977, 327.01998901, 341.95999146, 335.3500061 ,  7008500, 335.3500061 , '2019-01-08', 'TSLA'),

           (136.83000183, 133.05000305, 133.3999939 , 135.55000305,  4783900, 135.55000305, '2019-01-02', 'CRM'),

           (134.77999878, 130.1000061 , 133.47999573, 130.3999939 ,  6365700, 130.3999939 , '2019-01-03', 'CRM'),

           (139.32000732, 132.22000122, 133.5       , 137.96000671,  6650600, 137.96000671, '2019-01-04', 'CRM'),

           (143.38999939, 138.78999329, 141.02000427, 142.22000122,  9064800, 142.22000122, '2019-01-07', 'CRM'),

           (146.46000671, 142.88999939, 144.72999573, 145.72000122,  9057300, 145.72000122, '2019-01-08', 'CRM')],

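          dtype=[('High', ...

Pivot of the combined rows (```Adj_Close``` by date and ticker):

| Adj_Close | CRM | TSLA |
|---|---|---|
| 2019-01-02 | 135.55 | 310.12 |
| 2019-01-03 | 130.40 | 300.36 |
| 2019-01-04 | 137.96 | 317.69 |
| 2019-01-07 | 142.22 | 334.96 |
| 2019-01-08 | 145.72 | 335.35 |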
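
Pivoting long-format rows (Date, Ticker, Adj_Close) into a wide date-by-ticker table can be sketched in plain NumPy with ```np.unique```. This is an assumption-laden illustration rather than PandaPy's ```pivot```; the array name ```long_rows``` and its four sample rows are made up for the example.

```python
import numpy as np

long_rows = np.array(
    [("2019-01-02", "TSLA", 310.12), ("2019-01-03", "TSLA", 300.36),
     ("2019-01-02", "CRM", 135.55), ("2019-01-03", "CRM", 130.40)],
    dtype=[("Date", "U10"), ("Ticker", "U10"), ("Adj_Close", "f8")],
)

# Sorted unique labels plus, for every row, the position of its label.
dates, row_idx = np.unique(long_rows["Date"], return_inverse=True)
tickers, col_idx = np.unique(long_rows["Ticker"], return_inverse=True)

# Start from NaN so (date, ticker) pairs with no data stay visibly missing.
wide = np.full((len(dates), len(tickers)), np.nan)
wide[row_idx, col_idx] = long_rows["Adj_Close"]
```

Here ```tickers``` gives the column order of ```wide``` and ```dates``` gives the row order; if a (date, ticker) pair appears twice, the last value wins in this sketch.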

**Add New Data types**

```python

tsla_extended = pp.add(tsla,"Month",tsla["Date"],'datetime64[M]')

tsla_extended = pp.add(tsla_extended,"Year",tsla_extended["Date"],'datetime64[Y]')

```
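
The month and year columns rely on NumPy's ```datetime64``` units: casting a daily date to ```datetime64[M]``` or ```datetime64[Y]``` truncates it to the start of that month or year. A NumPy-only sketch of adding such columns, with made-up sample rows and without using PandaPy's ```add```:

```python
import numpy as np

tsla = np.array(
    [("2019-01-02", 310.12), ("2019-02-01", 312.21)],
    dtype=[("Date", "datetime64[D]"), ("Adj_Close", "f8")],
)

# New record type: the existing fields plus two truncated-date fields.
new_dtype = np.dtype(
    tsla.dtype.descr + [("Month", "datetime64[M]"), ("Year", "datetime64[Y]")]
)
extended = np.empty(tsla.shape, dtype=new_dtype)

# Copy the existing columns, then fill the new ones.
for name in tsla.dtype.names:
    extended[name] = tsla[name]
extended["Month"] = tsla["Date"].astype("datetime64[M]")
extended["Year"] = tsla["Date"].astype("datetime64[Y]")
```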

**Update Existing Column**

```python

## faster method elsewhere

year_frame = pp.update(tsla,"Date", [dt.year for dt in tsla["Date"].astype(object)],types="|U10"); year_frame[:5]

```

    array([(315.13000488, 298.79998779, 306.1000061 , 310.11999512, 11658600, 310.11999512, 'TSLA', '2019'),

           (309.3999939 , 297.38000488, 307.        , 300.35998535,  6965200, 300.35998535, 'TSLA', '2019'),

           (318.        , 302.73001099, 306.        , 317.69000244,  7394100, 317.69000244, 'TSLA', '2019'),

           (336.73999023, 317.75      , 321.72000122, 334.95999146,  7551200, 334.95999146, 'TSLA', '2019'),

           (344.01000977, 327.01998901, 341.95999146, 335.3500061 ,  7008500, 335.3500061 , 'TSLA', '2019')],

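          dtype=[('High', ...

Grouped TSLA statistics (```grouped```) by ticker, month and year:

|  | Ticker | Month | Year | Adj_Close_mean | Adj_Close_std | Adj_Close_min | Adj_Close_max | Close_mean | Close_std | Close_min | Close_max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TSLA | 2019-01-01 | 2019-01-01 | 318.494 | 21.098 | 287.590 | 347.310 | 318.494 | 21.098 | 287.590 | 347.310 |
| 1 | TSLA | 2019-02-01 | 2019-01-01 | 307.728 | 8.053 | 291.230 | 321.350 | 307.728 | 8.053 | 291.230 | 321.350 |
| 2 | TSLA | 2019-03-01 | 2019-01-01 | 277.757 | 8.925 | 260.420 | 294.790 | 277.757 | 8.925 | 260.420 | 294.790 |
| 3 | TSLA | 2019-04-01 | 2019-01-01 | 266.656 | 14.985 | 235.140 | 291.810 | 266.656 | 14.985 | 235.140 | 291.810 |
| 4 | TSLA | 2019-05-01 | 2019-01-01 | 219.715 | 24.040 | 185.160 | 255.340 | 219.715 | 24.040 | 185.160 | 255.340 |
| 5 | TSLA | 2019-06-01 | 2019-01-01 | 213.717 | 12.125 | 178.970 | 226.430 | 213.717 | 12.125 | 178.970 | 226.430 |
| 6 | TSLA | 2019-07-01 | 2019-01-01 | 242.382 | 12.077 | 224.550 | 264.880 | 242.382 | 12.077 | 224.550 | 264.880 |
| 7 | TSLA | 2019-08-01 | 2019-01-01 | 225.103 | 7.831 | 211.400 | 238.300 | 225.103 | 7.831 | 211.400 | 238.300 |
| 8 | TSLA | 2019-09-01 | 2019-01-01 | 237.261 | 8.436 | 220.680 | 247.100 | 237.261 | 8.436 | 220.680 | 247.100 |
| 9 | TSLA | 2019-10-01 | 2019-01-01 | 266.355 | 31.463 | 231.430 | 328.130 | 266.355 | 31.463 | 231.430 | 328.130 |
| 10 | TSLA | 2019-11-01 | 2019-01-01 | 338.300 | 13.226 | 313.310 | 359.520 | 338.300 | 13.226 | 313.310 | 359.520 |
| 11 | TSLA | 2019-12-01 | 2019-01-01 | 377.695 | 36.183 | 330.370 | 430.940 | 377.695 | 36.183 | 330.370 | 430.940 |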
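
A group-by with mean, std, min and max per month can be sketched in NumPy using ```np.unique``` and ```np.bincount```. This is not PandaPy's ```group``` implementation; the five sample closes and the choice of sample standard deviation (```ddof=1```) are assumptions.

```python
import numpy as np

month = np.array(
    ["2019-01", "2019-01", "2019-02", "2019-02", "2019-02"], dtype="datetime64[M]"
)
close = np.array([310.12, 300.36, 312.21, 307.51, 305.80])

# One integer label per row identifying its group.
keys, inverse = np.unique(month, return_inverse=True)
counts = np.bincount(inverse)

grouped = np.empty(
    len(keys),
    dtype=[("Month", "datetime64[M]"), ("Close_mean", "f8"), ("Close_std", "f8"),
           ("Close_min", "f8"), ("Close_max", "f8")],
)
grouped["Month"] = keys
grouped["Close_mean"] = np.bincount(inverse, weights=close) / counts
# Sample standard deviation per group (ddof=1 is an assumption).
grouped["Close_std"] = [close[inverse == g].std(ddof=1) for g in range(len(keys))]

# Unbuffered reductions: fold every row into its group's running min and max.
mins = np.full(len(keys), np.inf)
maxs = np.full(len(keys), -np.inf)
np.minimum.at(mins, inverse, close)
np.maximum.at(maxs, inverse, close)
grouped["Close_min"] = mins
grouped["Close_max"] = maxs
```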

**Convert Array to Pandas**

```python

grouped_frame = pp.pandas(grouped); grouped_frame.head()

```

  

    

      

|  | Ticker | Month | Year | Adj_Close_mean | Adj_Close_std | Adj_Close_min | Adj_Close_max | Close_mean | Close_std | Close_min | Close_max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TSLA | 2019-01-01 | 2019-01-01 | 318.494284 | 21.098362 | 287.589996 | 347.309998 | 318.494284 | 21.098362 | 287.589996 | 347.309998 |
| 1 | TSLA | 2019-02-01 | 2019-01-01 | 307.728421 | 8.052522 | 291.230011 | 321.350006 | 307.728421 | 8.052522 | 291.230011 | 321.350006 |
| 2 | TSLA | 2019-03-01 | 2019-01-01 | 277.757140 | 8.924873 | 260.420013 | 294.790009 | 277.757140 | 8.924873 | 260.420013 | 294.790009 |
| 3 | TSLA | 2019-04-01 | 2019-01-01 | 266.655716 | 14.984572 | 235.139999 | 291.809998 | 266.655716 | 14.984572 | 235.139999 | 291.809998 |
| 4 | TSLA | 2019-05-01 | 2019-01-01 | 219.715454 | 24.039647 | 185.160004 | 255.339996 | 219.715454 | 24.039647 | 185.160004 | 255.339996 |

    

  

**From Pandas to Structured**

```python

struct = pp.structured(grouped_frame); struct[:5]

```

    rec.array([('TSLA', '2019-01-01T00:00:00.000000000', '2019-01-01T00:00:00.000000000', 318.49428449, 21.09836186, 287.58999634, 347.30999756, 318.49428449, 21.09836186, 287.58999634, 347.30999756),

               ('TSLA', '2019-02-01T00:00:00.000000000', '2019-01-01T00:00:00.000000000', 307.72842086,  8.05252198, 291.23001099, 321.3500061 , 307.72842086,  8.05252198, 291.23001099, 321.3500061 ),

               ('TSLA', '2019-03-01T00:00:00.000000000', '2019-01-01T00:00:00.000000000', 277.75713966,  8.92487345, 260.42001343, 294.79000854, 277.75713966,  8.92487345, 260.42001343, 294.79000854),

               ('TSLA', '2019-04-01T00:00:00.000000000', '2019-01-01T00:00:00.000000000', 266.65571594, 14.98457194, 235.13999939, 291.80999756, 266.65571594, 14.98457194, 235.13999939, 291.80999756),

               ('TSLA', '2019-05-01T00:00:00.000000000', '2019-01-01T00:00:00.000000000', 219.7154541 , 24.03964724, 185.16000366, 255.33999634, 219.7154541 , 24.03964724, 185.16000366, 255.33999634)],

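              dtype=[('Ticker', 'O'), ('Month', ...

Correlation matrix for ```closing```:

| Correlation | AA | AAPL | DAL | GE | IBM | KO | MSFT | PEP | UAL |
|---|---|---|---|---|---|---|---|---|---|
| AA | 1.00 | 0.21 | 0.24 | -0.17 | 0.39 | -0.09 | 0.05 | -0.04 | 0.12 |
| AAPL | 0.21 | 1.00 | 0.86 | -0.83 | 0.22 | 0.85 | 0.94 | 0.85 | 0.82 |
| DAL | 0.24 | 0.86 | 1.00 | -0.78 | 0.14 | 0.79 | 0.86 | 0.78 | 0.86 |
| GE | -0.17 | -0.83 | -0.78 | 1.00 | 0.06 | -0.76 | -0.86 | -0.69 | -0.76 |
| IBM | 0.39 | 0.22 | 0.14 | 0.06 | 1.00 | 0.07 | 0.15 | 0.24 | 0.18 |
| KO | -0.09 | 0.85 | 0.79 | -0.76 | 0.07 | 1.00 | 0.94 | 0.96 | 0.74 |
| MSFT | 0.05 | 0.94 | 0.86 | -0.86 | 0.15 | 0.94 | 1.00 | 0.93 | 0.83 |
| PEP | -0.04 | 0.85 | 0.78 | -0.69 | 0.24 | 0.96 | 0.93 | 1.00 | 0.75 |
| UAL | 0.12 | 0.82 | 0.86 | -0.76 | 0.18 | 0.74 | 0.83 | 0.75 | 1.00 |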
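
A correlation matrix over the columns of an all-float structured array can be computed by first converting it to a regular 2-D array. The sketch below uses only NumPy (```structured_to_unstructured``` and ```np.corrcoef```) and is not PandaPy's ```corr```; the three-ticker sample is a rounded excerpt used purely for illustration.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

closing = np.array(
    [(37.24, 100.45, 130.59), (35.08, 97.62, 128.54), (35.34, 97.63, 125.76),
     (36.26, 99.00, 124.94), (37.29, 102.81, 127.66)],
    dtype=[("AA", "f8"), ("AAPL", "f8"), ("IBM", "f8")],
)

# Rows are observations and columns are tickers, hence rowvar=False.
matrix = rfn.structured_to_unstructured(closing)
corr = np.corrcoef(matrix, rowvar=False)

# Label the result by ticker for lookups such as corr_by_name["AA"]["IBM"].
names = closing.dtype.names
corr_by_name = {
    a: {b: corr[i, j] for j, b in enumerate(names)} for i, a in enumerate(names)
}
```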

**Log Returns**

```python

pp.returns(closing,"IBM",type="log")

```

    array([        nan, -0.01585991, -0.02180223, ...,  0.0026649 ,

           -0.0183533 ,  0.0092187 ])

**Normal Returns**

```python

loga = pp.returns(closing,"IBM",type="normal"); loga

```

    array([        nan, -0.0157348 , -0.02156628, ...,  0.00266845,

           -0.0181859 ,  0.00926132])
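
For reference, the two return types are r_t = P_t / P_{t-1} - 1 (normal) and r_t = ln(P_t) - ln(P_{t-1}) (log), with the first entry undefined. A NumPy-only sketch follows; the helper names ```simple_returns``` and ```log_returns``` are hypothetical and this is not PandaPy's ```returns``` code. The three prices are the first IBM values from the ```closing``` array shown earlier.

```python
import numpy as np

ibm = np.array([130.59109497, 128.53627014, 125.76422119])

def simple_returns(prices):
    """Period-over-period percentage change; the first entry has no predecessor."""
    out = np.full(prices.shape, np.nan)
    out[1:] = prices[1:] / prices[:-1] - 1.0
    return out

def log_returns(prices):
    """Difference of log prices; the first entry has no predecessor."""
    out = np.full(prices.shape, np.nan)
    out[1:] = np.diff(np.log(prices))
    return out

normal = simple_returns(ibm)
logr = log_returns(ibm)
```

The two are linked by ```logr = np.log1p(normal)```, which is why they agree closely for small daily moves.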

**Add Column**

```python

close_ret = pp.add(closing,"IBM_log_return",loga); close_ret[:5]

```

    array([(37.24206924, 100.45429993, 44.57522202, 20.72605705, 130.59109497, 35.80251312, 41.9791832 , 81.51140594, 66.33999634,         nan),

           (35.08446503,  97.62433624, 43.83200836, 20.34561157, 128.53627014, 35.80251312, 41.59314346, 80.89860535, 66.15000153, -0.0157348 ),

           (35.34244537,  97.63354492, 42.79874039, 19.90727234, 125.76422119, 36.07437897, 40.98268127, 80.28580475, 64.58000183, -0.02156628),

           (36.25707626,  99.00255585, 42.57216263, 19.91554451, 124.94229126, 36.52467346, 41.50337982, 82.63342285, 65.52999878, -0.00653548),

           (37.28897095, 102.80648041, 43.67792892, 20.15538216, 127.65791321, 36.966465  , 42.72432327, 84.13523865, 66.63999939,  0.02173501)],

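          dtype=[('AA', ...

Lagged TSLA table (```tsla_lagged```, first five rows):

|  | High | Low | Open | Close | Volume | Adj_Close | Date | Ticker | Month | Year | Adj_Close_lag_1 | Adj_Close_lag_2 | Adj_Close_lag_3 | Adj_Close_lag_4 | Adj_Close_lag_5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 315.130 | 298.800 | 306.100 | 310.120 | 11658600 | 310.120 | 2019-01-02 | TSLA | 2019-01-01 | 2019-01-01 | nan | nan | nan | nan | nan |
| 1 | 309.400 | 297.380 | 307.000 | 300.360 | 6965200 | 300.360 | 2019-01-03 | TSLA | 2019-01-01 | 2019-01-01 | 310.120 | nan | nan | nan | nan |
| 2 | 318.000 | 302.730 | 306.000 | 317.690 | 7394100 | 317.690 | 2019-01-04 | TSLA | 2019-01-01 | 2019-01-01 | 300.360 | 310.120 | nan | nan | nan |
| 3 | 336.740 | 317.750 | 321.720 | 334.960 | 7551200 | 334.960 | 2019-01-07 | TSLA | 2019-01-01 | 2019-01-01 | 317.690 | 300.360 | 310.120 | nan | nan |
| 4 | 344.010 | 327.020 | 341.960 | 335.350 | 7008500 | 335.350 | 2019-01-08 | TSLA | 2019-01-01 | 2019-01-01 | 334.960 | 317.690 | 300.360 | 310.120 | nan |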
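
Lag columns such as ```Adj_Close_lag_1``` shift a series down by a number of rows and pad the start with ```nan```. A minimal NumPy sketch, where the helper name ```lag``` is hypothetical and ```periods``` is assumed to be at least 1:

```python
import numpy as np

def lag(column, periods):
    """Shift a float column down by `periods` rows, padding the top with NaN."""
    out = np.full(column.shape, np.nan)
    out[periods:] = column[:-periods]
    return out

adj_close = np.array([310.12, 300.36, 317.69, 334.96, 335.35])

# One lagged copy per requested horizon, keyed by a column-style name.
lags = {f"Adj_Close_lag_{k}": lag(adj_close, k) for k in (1, 2, 3)}
```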

**Outliers**

```python

signal = tsla_lagged["Volume"]

z_signal = (signal - np.mean(signal)) / np.std(signal)

```

```python

tsla_lagged = pp.add(tsla_lagged,"z_signal_volume",z_signal)

```

```python

outliers = pp.detect(tsla_lagged["z_signal_volume"]); outliers

```

    [12, 40, 42, 64, 78, 79, 84, 95, 97, 98, 107, 141, 205, 206, 207]

```python

import matplotlib.pyplot as plt

plt.figure(figsize=(15, 7))

plt.plot(np.arange(len(tsla_lagged["Volume"])), tsla_lagged["Volume"])

plt.plot(np.arange(len(tsla_lagged["Volume"])), tsla_lagged["Volume"], 'X', label='outliers',markevery=outliers, c='r')

plt.legend()

plt.show()

```

![png](PandaPy_files/PandaPy_46_0.png)
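
The README does not say how ```pp.detect``` decides what counts as an outlier. One simple rule, shown here purely as an assumption, is to flag positions whose absolute z-score exceeds a threshold; the helper name ```detect_outliers```, the threshold of 2.0, and the sample volumes are all hypothetical.

```python
import numpy as np

def detect_outliers(z_scores, threshold=2.0):
    """Return the positions whose absolute z-score exceeds the threshold."""
    return np.flatnonzero(np.abs(z_scores) > threshold).tolist()

# Hypothetical daily volumes with one obvious spike at position 5.
volume = np.array([7.0e6, 6.9e6, 7.4e6, 7.5e6, 7.0e6, 2.4e7, 7.2e6, 6.8e6])

# Standardise the signal, then flag the extreme positions.
z_signal = (volume - np.mean(volume)) / np.std(volume)
outliers = detect_outliers(z_signal)
```

The returned list of integer positions has the same shape as the ```outliers``` list above, so it can be passed to matplotlib's ```markevery``` in the same way.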

**Remove Noise**

```python

price_signal = tsla_lagged["Close"]

removed_signal = pp.removal(price_signal, 30)

noise = pp.get(price_signal, removed_signal)

```

```python

plt.figure(figsize=(15, 7))

plt.subplot(2, 1, 1)

plt.plot(removed_signal)

plt.title('timeseries without noise')

plt.subplot(2, 1, 2)

plt.plot(noise)

plt.title('noise timeseries')

plt.show()

```

![png](PandaPy_files/PandaPy_48_0.png)
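
The smoothing method behind ```pp.removal``` is not described here. As a stand-in, the sketch below separates a series into a smooth part and a noise part with a centred moving average; the helper name ```moving_average```, the edge padding, and the synthetic random-walk prices are assumptions, not PandaPy's algorithm.

```python
import numpy as np

def moving_average(signal, window):
    """Centred moving average that returns an array as long as the input."""
    left = window // 2
    right = window - 1 - left
    # Repeat the edge values so the first and last points can still be averaged.
    padded = np.pad(signal, (left, right), mode="edge")
    kernel = np.ones(window) / window
    return np.convolve(padded, kernel, mode="valid")

# Synthetic random-walk price series standing in for a Close column.
rng = np.random.default_rng(0)
price_signal = 300.0 + np.cumsum(rng.normal(size=250))

smoothed = moving_average(price_signal, 30)
noise = price_signal - smoothed  # the residual left after smoothing
```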