# **PandasVault** — Advanced Pandas Functions and Code Snippets

The only Pandas utility package you will ever need. It has no exotic external dependencies, and all functions have been compared and tested against alternatives; only the fastest equivalent functions are included. The package contains more than 20 wrapped functions and 100 snippets.

---------

[`Github PandasVault Link`](https://github.com/firmai/pandasvault), [`LinkedIn`](https://www.linkedin.com/company/firmai)

You have the option to view this [Readme](https://github.com/firmai/pandasvault/blob/master/pandasvault.md) or run a [Colab](https://colab.research.google.com/drive/1TRKHPGfQnE2yw6_VPBJZ3nZ8lIPQYiuP) Notebook.

```python
pip install pandasvault
```

If you can identify performance improvements, or improvements in code length and styling, please open a pull request. This package is new, and all help and criticism are appreciated. I would love to hear about any additional function ideas. If you have a **function to contribute**, please open an issue or email me at d.snow(at)nyu.edu.

## List of Code

#### [Table Processing](#table-processing)
- [Configure Pandas](#configure-pandas)
- [Data Frame Formatting](#data-frame-formatting)
- [Data Frames for Testing](#data-frames-for-testing)
- [Lower Case Columns](#lower-case-columns)
- [Front and Back Column Selection](#front-and-back-columns)
- [Fast Data Frame Split](#fast-data-frame-split)
- [Create Features and Labels List](#create-features-and-labels-list)
- [Short Basic Commands](#short-basic-commands)
- [Read Commands](#read-commands)
- [Create Ordered Categories](#create-ordered-categories)
- [Select Columns Based on Regex](#select-columns-based-on-regex)
- [Accessing Group of Groupby Object](#accessing-group-of-groupby-object)
- [Multiple External Selection Criteria](#multiple-external-selection-criteria)
- [Memory Reduction Script](#memory-reduction-script)
- [Verify Primary Key](#verify-primary-key)
- [Shift Columns to Front](#shift-columns-to-front)
- [Multiple Column Assignment](#multiple-column-assignment)
- [Method Chaining Technique](#method-chaining-technique)
- [Load Multiple Files](#load-multiple-files)
- [Drop Rows with Column Substring](#drop-rows-with-column-substring)
- [Explode a Column](#explode-a-column)
- [Nest List Back into Column](#nest-list-back-into-column)
- [Split Cells with List](#split-cells-with-list)

#### [Table Exploration](#table-exploration)

- [Groupby Functionality](#groupby-functionality)
- [Cross Correlation Series Without Duplicates](#cross-correlation-series-without-duplicates)
- [Missing Data Report](#missing-data-report)
- [Duplicated Rows Report](#duplicated-rows-report)
- [Skewness](#skewness)

#### [Feature Processing](#feature-processing)

- [Remove Correlated Pairs](#remove-correlated-pairs)
- [Replace Infrequently Occurring Categories](#replace-infrequently-occurring-categories)
- [Quasi-Constant Feature Detection](#quasi-constant-feature-detection)
- [Filling Missing Values Separately](#filling-missing-values-separately)
- [Conditioned Column Value Replacement](#conditioned-column-value-replacement)
- [Remove Non-numeric Values in Data Frame](#remove-non-numeric-values-in-data-frame)
- [Feature Scaling, Normalisation, Standardisation](#feature-scaling-normalisation-standardisation)
- [Impute Null with Tail Distribution](#impute-null-with-tail-distribution)
- [Detect Outliers](#detect-outliers)
- [Windsorize Outliers](#windsorize-outliers)
- [Drop Outliers](#drop-outliers)
- [Impute Outliers](#impute-outliers)

#### [Feature Engineering](#feature-engineering)

- [Automated Dummy Encoding](#automate-dummy-encodings)
- [Binarise Empty Columns](#binarise-empty-columns)
- [Polynomials](#polynomials)
- [Transformations](#transformations)
- [Genetic Programming](#genetic-programming)
- [Principal Component](#principal-component)
- [Multiple Lags](#multiple-lags)
- [Multiple Rolling](#multiple-rolling)
- [Date Features](#date-features)
- [Haversine Distance](#haversine-distance)
- [Parse Address](#parse-address)
- [Processing Strings in Pandas](#processing-strings-in-pandas)
- [Filtering Strings in Pandas](#filtering-strings-in-pandas)

#### [Model Validation](#model-validation)

- [Classification Metrics](#classification-metrics)

--------------

### List of Functions

```python
import pandas as pd
import numpy as np
import pandasvault as pv

"""TABLE PROCESSING"""
df = pv.list_shuff(["target","c","d"],df)
df = pv.reduce_mem_usage(df)

"""TABLE EXPLORATION"""
df = pv.corr_list(df)
df = pv.missing_data(df)

"""FEATURE PROCESSING"""
df = pv.drop_corr(df, thresh=0.1,keep_cols=["target"])
df = pv.replace_small_cat(df,["cat"])
qconstant_col = pv.constant_feature_detect(data=df,threshold=0.9)
df_train, scl = pv.scaler(df,target="target",cols_ignore=["a"],type="MinMax")
df_test = pv.scaler(df_test,scaler=scl,train=False, target="target",cols_ignore=["a"])
df = pv.impute_null_with_tail(df,cols=df.columns)
index,para = pv.outlier_detect(df,"a",threshold=0.5,method="IQR")
df = pv.windsorization(data=df,col='a',para=para,strategy='both')
df = pv.impute_outlier(data=df,col='a', outlier_index=index,strategy='mean')

"""FEATURE EXTRACTION"""
df = pv.auto_dummy(df, unique=3)
df = pv.binarise_empty(df, frac=0.6)
df = pv.polynomials(df, ["a","b"])
df = pv.transformations(df,["a","b"])
df = pv.pca_feature(df,variance_or_components=0.80,drop_cols=["target","a"])
df = pv.multiple_lags(df, start=1, end=2,columns=["a","target"])
df = pv.multiple_rolling(df, columns=["a"])
df = pv.date_features(df, date="date_fake")
df['distance_central'] = df.apply(pv.haversine_distance,axis=1)

"""MODEL VALIDATION"""
scores = pv.classification_scores(y_test, y_predict, y_prob)

```

## Functions and Snippets Applied
--------------

*If you are running the code for the first time, load this test dataframe:*

```python
!pip install pandasvault
```

```python
import pandas as pd
import numpy as np
import pandasvault as pv

np.random.seed(1)
"""quick way to create a data frame for testing"""
df_test = pd.DataFrame(np.random.randn(3, 4), columns=['a', 'b', 'c', 'd']) \
.assign(target=lambda x: (x['b']+x['a']/x['d'])*x['c'])
```

### **Table Processing**

---

---

**>>> Configure Pandas (func)**

---

```python
import pandas as pd

def pd_config():
    options = {
        'display': {
            'max_colwidth': 25,
            'expand_frame_repr': False,  # Don't wrap to multiple pages
            'max_rows': 14,
            'max_seq_items': 50,         # Max length of printed sequence
            'precision': 4,
            'show_dimensions': False
        },
        'mode': {
            'chained_assignment': None   # Controls SettingWithCopyWarning
        }
    }

    for category, option in options.items():
        for op, value in option.items():
            pd.set_option(f'{category}.{op}', value)  # Python 3.6+

if __name__ == '__main__':
    pv.pd_config()

```
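
If you want to undo session-wide settings like these later, pandas can restore its own defaults; a minimal companion sketch using the built-in `pd.reset_option`:

```python
import pandas as pd

# Restore every display option touched above to its default (regex pattern)
pd.reset_option('^display')

# Or reset a single option
pd.reset_option('display.max_rows')
```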


**>>> Data Frame Formatting**

---

```python
df = df_test.copy()
df["number"] = [3,10,1]
```

```python

df_out = (
    df.style.format({"a": "${:.2f}", "target": "${:.5f}"})
      .hide_index()
      .highlight_min("a", color="red")
      .highlight_max("a", color="green")
      .background_gradient(subset="target", cmap="Blues")
      .bar("number", color="lightblue", align="zero")
      .set_caption("DF with different stylings")
); df_out

```

```See Colab for Output```



**>>> Data Frames For Testing**

---

```python
df1 = pd.util.testing.makeDataFrame() # contains random values
print("Contains missing values")
df2 = pd.util.testing.makeMissingDataframe() # contains missing values
print("Contains datetime values")
df3 = pd.util.testing.makeTimeDataFrame() # contains datetime values
print("Contains mixed values")
df4 = pd.util.testing.makeMixedDataFrame(); df4.head() # contains mixed values

```

Contains missing values
Contains datetime values
Contains mixed values




|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 0.0 | 0.0 | foo1 | 2009-01-01 |
| 1 | 1.0 | 1.0 | foo2 | 2009-01-02 |
| 2 | 2.0 | 0.0 | foo3 | 2009-01-05 |
| 3 | 3.0 | 1.0 | foo4 | 2009-01-06 |
| 4 | 4.0 | 0.0 | foo5 | 2009-01-07 |
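
Note that `pd.util.testing` was deprecated and later removed (these helpers moved to the private `pandas._testing` module), so the calls above may fail on recent pandas. A hedged sketch that builds comparable test frames by hand:

```python
import numpy as np
import pandas as pd

# Random-valued frame, similar to makeDataFrame()
df1 = pd.DataFrame(np.random.randn(30, 4), columns=list("ABCD"))

# Frame with missing values, similar to makeMissingDataframe()
df2 = df1.mask(np.random.rand(*df1.shape) < 0.1)

# Datetime-indexed frame, similar to makeTimeDataFrame()
df3 = pd.DataFrame(np.random.randn(30, 4), columns=list("ABCD"),
                   index=pd.date_range("2000-01-01", periods=30))
```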


**>>> Lower Case Columns**

---

```python
## Lower-case all DataFrame column names
df = df_test.copy() ; df
df.columns = ["A","BGs","c","dag","Target"]
```

```python
df.columns = map(str.lower, df.columns); df
```




|   | a | bgs | c | dag | target |
|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 |


**>>> Front and Back Column Selection**

---

```python
df = df_test.copy(); df
```




|   | a | b | c | d | target |
|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 |

```python
def front(self, n):
    return self.iloc[:, :n]

def back(self, n):
    return self.iloc[:, -n:]

pd.back = back
pd.front = front

pd.back(df, 2)
```




|   | d | target |
|---|---|---|
| 0 | -1.0730 | 1.1227 |
| 1 | -0.7612 | -5.9994 |
| 2 | -2.0601 | -0.5910 |


**>>> Fast Data Frame Split**

---

```python
df = df_test.copy(); df
```




|   | a | b | c | d | target |
|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 |

```python
test = df.sample(frac=0.4)
train = df[~df.isin(test)].dropna(); train
```




|   | a | b | c | d | target |
|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 |


**>>> Create Features and Labels List**

---

```python
df = df_test.head()
y = 'target'
X = [name for name in df.columns if name not in [y, 'd']]
print('y =', y)
print('X =', X)
```

y = target
X = ['a', 'b', 'c']


**>>> Short Basic Commands**

---

```python
df = df_test.copy()
df["category"] = np.where( df["target"]>1, "1", "0")
df["k"] = df["category"].astype(str) +": " + df["d"].round(1).astype(str)
df = pd.concat([df, df], ignore_index=True); df.head()  # DataFrame.append was removed in pandas 2.0
```




|   | a | b | c | d | target | category | k |
|---|---|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 | 1 | 1: -1.1 |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 | 0 | 0: -0.8 |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 | 0 | 0: -2.1 |
| 3 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 | 1 | 1: -1.1 |
| 4 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 | 0 | 0: -0.8 |

```python
"""set display width, col_width etc for interactive pandas session"""
pd.set_option('display.width', 200)
pd.set_option('display.max_colwidth', 20)
pd.set_option('display.max_rows', 100)

"""when you have an excel sheet with spaces in column names"""
df.columns = [c.lower().replace(' ', '_') for c in df.columns]

"""Add prefix to all columns"""
df.add_prefix("1_")

"""Add suffix to all columns"""
df.add_suffix("_Z")

"""Droping column where missing values are above a threshold"""
df.dropna(thresh = len(df)*0.95, axis = "columns")

"""Given a dataframe df to filter by a series ["a","b"]:"""
df[df['category'].isin(["1","0"])]

"""filter by multiple conditions in a dataframe df"""
df[(df['a'] >1) & (df['b'] <1)]

"""filter by conditions and the condition on row labels(index)"""
df[(df.a > 0) & (df.index.isin([0, 1]))]

"""regexp filters on strings (vectorized), use .* instead of *"""
df[df.category.str.contains(r'.*[0-9].*')]

"""logical NOT is like this"""
df[~df.category.str.contains(r'.*[0-9].*')]

"""creating complex filters using functions on rows"""
df[df.apply(lambda x: x['b'] > x['c'], axis=1)]

"""Pandas replace operation"""
df["a"].round(2).replace(0.87, 17, inplace=True)
df["a"][df["a"] < 4] = 19

"""Conditionals and selectors"""
df.loc[df["a"] > 1, ["a","b","target"]]

"""Selecting multiple column slices"""
df.iloc[:, np.r_[0:2, 4:5]]

"""apply and map examples"""
df[["a","b","c"]].applymap(lambda x: x+1)

"""add 2 to row 3 and return the series"""
df[["a","b","c"]].apply(lambda x: x[0]+2,axis=0)

"""add 3 to col A and return the series"""
df.apply(lambda x: x['a']+1,axis=1)

""" Split delimited values in a DataFrame column into two new columns """
df['new1'], df['new2'] = zip(*df['k'].apply(lambda x: x.split(': ', 1)))

""" Doing calculations with DataFrame columns that have missing values
In example below, swap in 0 for df['col1'] cells that contain null """
df['new3'] = np.where(pd.isnull(df['b']),0,df['a']) + df['c']

""" Exclude certain data type or include certain data types """
df.select_dtypes(exclude=['O','float'])
df.select_dtypes(include=['int'])

"""one liner to normalize a data frame"""
(df[["a","b"]] - df[["a","b"]].mean()) / (df[["a","b"]].max() - df[["a","b"]].min())

"""groupby used like a histogram to obtain counts on sub-ranges of a variable, pretty handy"""
df.groupby(pd.cut(df.a, range(0, 1, 2))).size()

"""use a local variable use inside a query of pandas using @"""
mean = df["a"].mean()
df.query("a > @mean")

"""Calculate the % of missing values in each column"""
df.isna().mean()

"""Calculate the % of missing values in each row"""
rows = df.isna().mean(axis=1) ; df.head()

```




|   | a | b | c | d | target | category | k | new1 | new2 | new3 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19.0 | -0.6118 | -0.5282 | -1.0730 | 1.1227 | 1 | 1: -1.1 | 1 | -1.1 | 18.4718 |
| 1 | 19.0 | -2.3015 | 1.7448 | -0.7612 | -5.9994 | 0 | 0: -0.8 | 0 | -0.8 | 20.7448 |
| 2 | 19.0 | -0.2494 | 1.4621 | -2.0601 | -0.5910 | 0 | 0: -2.1 | 0 | -2.1 | 20.4621 |
| 3 | 19.0 | -0.6118 | -0.5282 | -1.0730 | 1.1227 | 1 | 1: -1.1 | 1 | -1.1 | 18.4718 |
| 4 | 19.0 | -2.3015 | 1.7448 | -0.7612 | -5.9994 | 0 | 0: -0.8 | 0 | -0.8 | 20.7448 |


**>>> Read Commands**

---

```python
df = pd.util.testing.makeMixedDataFrame()
df.to_csv("data.csv") ; df
```




|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 0.0 | 0.0 | foo1 | 2009-01-01 |
| 1 | 1.0 | 1.0 | foo2 | 2009-01-02 |
| 2 | 2.0 | 0.0 | foo3 | 2009-01-05 |
| 3 | 3.0 | 1.0 | foo4 | 2009-01-06 |
| 4 | 4.0 | 0.0 | foo5 | 2009-01-07 |

```python
"""To avoid Unnamed: 0 when loading a previously saved csv with index"""
"""To parse dates"""
"""To set data types"""

df_out = pd.read_csv("data.csv", index_col=0,
                     parse_dates=['D'],
                     dtype={"C": "category", "B": "int64"}).set_index("D")

"""Copy data to clipboard; like an excel copy and paste
df = pd.read_clipboard()
"""

"""Read table from website
df = pd.read_html(url, match="table_name")
"""

"""Read pdf into dataframe
!pip install tabula-py
from tabula import read_pdf
df = read_pdf('test.pdf', pages='all')
"""
df_out.head()
```




| D | A | B | C |
|---|---|---|---|
| 2009-01-01 | 0.0 | 0 | foo1 |
| 2009-01-02 | 1.0 | 1 | foo2 |
| 2009-01-05 | 2.0 | 0 | foo3 |
| 2009-01-06 | 3.0 | 1 | foo4 |
| 2009-01-07 | 4.0 | 0 | foo5 |


**>>> Create Ordered Categories**

---

```python
df = df_test.copy()
df["cats"] = ["bad","good","excellent"]; df
```




|   | a | b | c | d | target | cats |
|---|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 | bad |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 | good |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 | excellent |

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

print("Let's create our own categorical order.")
cat_type = CategoricalDtype(["bad", "good", "excellent"], ordered = True)
df["cats"] = df["cats"].astype(cat_type)

print("Now we can use logical sorting.")
df = df.sort_values("cats", ascending = True)

print("We can also filter this as if they are numbers.")
df[df["cats"] > "bad"]

```

Let's create our own categorical order.
Now we can use logical sorting.
We can also filter this as if they are numbers.




|   | a | b | c | d | target | cats |
|---|---|---|---|---|---|---|
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 | good |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 | excellent |


**>>> Select Columns Based on Regex**

---

```python
df = df_test.head(); df
df.columns = ["a_l", "b_l", "c_r","d_r","target"] ; df
```




|   | a_l | b_l | c_r | d_r | target |
|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 |

```python
df_out = df.filter(regex="_l",axis=1) ; df_out
```




|   | a_l | b_l |
|---|---|---|
| 0 | 1.6243 | -0.6118 |
| 1 | 0.8654 | -2.3015 |
| 2 | 0.3190 | -0.2494 |


**>>> Accessing Group of Groupby Object**

---

```python
df = df_test.copy()
df = pd.concat([df, df], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
df["groupie"] = ["falcon","hawk","hawk","eagle","falcon","hawk"]; df
```




|   | a | b | c | d | target | groupie |
|---|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 | falcon |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 | hawk |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 | hawk |
| 3 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 | eagle |
| 4 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 | falcon |
| 5 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 | hawk |

```python
gbdf = df.groupby("groupie")
hawk = gbdf.get_group("hawk").mean(); hawk
```

    a         0.5012
    b        -0.9334
    c         1.5563
    d        -1.6272
    target   -2.3938
    dtype: float64


**>>> Multiple External Selection Criteria**

---

```python
df = df_test.copy()
```

```python
cr1 = df["a"] > 0
cr2 = df["b"] < 0
cr3 = df["c"] > 0
cr4 = df["d"] >-1

df[cr1 & cr2 & cr3 & cr4]

```




|   | a | b | c | d | target |
|---|---|---|---|---|---|
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 |


**>>> Memory Reduction Script (func)**

---

```python
df = df_test.copy(); df
```




|   | a | b | c | d | target |
|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 |

```python
import gc

def reduce_mem_usage(df):
    """Iterate through all the columns of a dataframe and modify the data type
    to reduce memory usage.
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype
        gc.collect()
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df

df_out = pv.reduce_mem_usage(df); df_out
```

Memory usage of dataframe is 0.00 MB
Memory usage after optimization is: 0.00 MB
Decreased by 36.3%




|   | a | b | c | d | target |
|---|---|---|---|---|---|
| 0 | 1.6240 | -0.6118 | -0.5283 | -1.0732 | 1.1230 |
| 1 | 0.8652 | -2.3008 | 1.7451 | -0.7612 | -6.0000 |
| 2 | 0.3191 | -0.2494 | 1.4619 | -2.0605 | -0.5908 |
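
The small value drift in the output (e.g. 1.6243 → 1.6240) is the cost of down-casting to `float16`: precision is traded for memory. A minimal sketch, using the standard `memory_usage` API, to verify the savings and inspect the chosen dtypes:

```python
# Compare exact memory footprints before and after down-casting
before = df_test.memory_usage(deep=True).sum()
after = df_out.memory_usage(deep=True).sum()
print(f"{before} bytes -> {after} bytes")

# Inspect the dtype selected per column
print(df_out.dtypes)
```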


**>>> Verify Primary Key (func)**

---

```python
df = df_test.copy()
df["first_d"] = [0,1,2]
df["second_d"] = [4,1,9] ; df
```




|   | a | b | c | d | target | first_d | second_d |
|---|---|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 | 0 | 4 |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 | 1 | 1 |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 | 2 | 9 |

```python
def verify_primary_key(df, column_list):
    '''Verify whether the columns in column_list can be treated as a primary key'''
    return df.shape[0] == df.groupby(column_list).size().reset_index().shape[0]

verify_primary_key(df, ["first_d", "second_d"])
```

True
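
An equivalent one-liner, assuming the same `df`, via the index machinery pandas already exposes:

```python
# True when the column combination uniquely identifies every row
df.set_index(["first_d", "second_d"]).index.is_unique
```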


**>>> Shift Columns to Front (func)**

---

```python
df = df_test.copy(); df
```




|   | a | b | c | d | target |
|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 |

```python
def list_shuff(items, df):
    "Bring a list of columns to the front"
    cols = list(df)
    for i in range(len(items)):
        cols.insert(i, cols.pop(cols.index(items[i])))
    df = df.loc[:, cols]
    df.reset_index(drop=True, inplace=True)
    return df

df_out = pv.list_shuff(["target","c","d"], df); df_out
```




|   | target | c | d | a | b |
|---|---|---|---|---|---|
| 0 | 1.1227 | -0.5282 | -1.0730 | 1.6243 | -0.6118 |
| 1 | -5.9994 | 1.7448 | -0.7612 | 0.8654 | -2.3015 |
| 2 | -0.5910 | 1.4621 | -2.0601 | 0.3190 | -0.2494 |


**>>> Multiple Column Assignments**

---

```python
df = df_test.copy(); df
```




|   | a | b | c | d | target |
|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 |

```python
df_out = (df.assign(stringed=df["a"].astype(str),
                    ounces=df["b"]*12,
                    gallons=lambda df: df["a"]/128)
            .query("b > -1")
            .style.set_caption("Average consumption"))  # set_caption adds a title
df_out

```

*Average consumption*

|   | a | b | c | d | target | stringed | ounces | gallons |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.624 | -0.6118 | -0.5282 | -1.073 | 1.123 | 1.6243453636632417 | -7.341 | 0.01269 |
| 2 | 0.319 | -0.2494 | 1.462 | -2.06 | -0.591 | 0.31903909605709857 | -2.992 | 0.002492 |


**>>> Method Chaining Technique**

---

```python
df = df_test.copy()
df[df>df.mean()] = None ; df
```




|   | a | b | c | d | target |
|---|---|---|---|---|---|
| 0 | NaN | NaN | -0.5282 | NaN | NaN |
| 1 | 0.8654 | -2.3015 | NaN | NaN | -5.9994 |
| 2 | 0.3190 | NaN | NaN | -2.0601 | NaN |

```python
# with line continuation character
df_out = df.dropna(subset=["b","c"], how="all") \
           .loc[df["a"] > 0] \
           .round(2) \
           .groupby(["target","b"]).max() \
           .unstack() \
           .fillna(0) \
           .rolling(1).sum() \
           .reset_index() \
           .stack() \
           .ffill().bfill()

df_out
```





| index | b | a | c | d | target |
|---|---|---|---|---|---|
| 0 | -2.3 | 0.87 | 0.0 | 0.0 | -6.0 |
| 0 |      | 0.87 | 0.0 | 0.0 | -6.0 |
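
When a chain needs a step that is not a built-in DataFrame method, `DataFrame.pipe` keeps the flow readable; a minimal sketch with a hypothetical helper function:

```python
def drop_low_variance(df, thresh=1e-3):
    # hypothetical helper: drop numeric columns with near-zero variance
    keep = df.var(numeric_only=True) > thresh
    return df[keep[keep].index]

df_out = (df_test.copy()
          .round(2)
          .pipe(drop_low_variance, thresh=1e-3)  # slots into the chain like a method
          .head())
```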


**>>> Load Multiple Files**

---

```python
import os
os.makedirs("folder", exist_ok=True); df_test.to_csv("folder/first.csv", index=False); df_test.to_csv("folder/last.csv", index=False)
```

```python
import glob
files = glob.glob('folder/*.csv')
dfs = [pd.read_csv(fp) for fp in files]
df_out = pd.concat(dfs)
```

```python
df_out
```




|   | a | b | c | d | target |
|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 |
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 |


**>>> Drop Rows with Column Substring**

---

```python
df = df_test.copy()
df["string_feature"] = ["1xZoo", "Safe7x", "bat4"]; df
```




|   | a | b | c | d | target | string_feature |
|---|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 | 1xZoo |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 | Safe7x |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 | bat4 |

```python
substring = ["xZ","7z", "tab4"]

df_out = df[~df.string_feature.str.contains('|'.join(substring))]; df_out
```




|   | a | b | c | d | target | string_feature |
|---|---|---|---|---|---|---|
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 | Safe7x |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 | bat4 |


**>>> Unnest (Explode) a Column**

---

```python
df = df_test.head()
df["g"] = [[str(a)+lista for a in range(4)] for lista in ["a","b","c"]]; df
```




|   | a | b | c | d | target | g |
|---|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 | [0a, 1a, 2a, 3a] |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 | [0b, 1b, 2b, 3b] |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 | [0c, 1c, 2c, 3c] |

```python
df_out = df.explode("g"); df_out.iloc[:5,:]
```




|   | a | b | c | d | target | g |
|---|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 | 0a |
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 | 1a |
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 | 2a |
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 | 3a |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 | 0b |


**>>> Nest List Back into Column**

---

```python
### Run above example first
df = df_out.copy()
```

```python
df_out['g'] = df_out.groupby(df_out.index)['g'].agg(list); df_out.head()
```




|   | a | b | c | d | target | g |
|---|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 | [0a, 1a, 2a, 3a] |
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 | [0a, 1a, 2a, 3a] |
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 | [0a, 1a, 2a, 3a] |
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 | [0a, 1a, 2a, 3a] |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 | [0b, 1b, 2b, 3b] |


**>>> Split Cells With Lists**

---

```python
df = df_test.head()
df["g"] = [",".join([str(a)+lista for a in range(4)]) for lista in ["a","b","c"]]; df
```




|   | a | b | c | d | target | g |
|---|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 | 0a,1a,2a,3a |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 | 0b,1b,2b,3b |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 | 0c,1c,2c,3c |

```python
df_out = df.assign(g = df["g"].str.split(",")).explode("g"); df_out.head()
```




|   | a | b | c | d | target | g |
|---|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 | 0a |
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 | 1a |
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 | 2a |
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 | 3a |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 | 0b |


### **Table Exploration**

---

---

**>>> Groupby Functionality**

---

```python
df = df_test.head()
df["gr"] = [1, 1 , 0] ;df
```




|   | a | b | c | d | target | gr |
|---|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 | 1 |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 | 1 |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 | 0 |

```python
In [34]: gb = df.groupby('gr'); gb.<TAB>  # tab completion on a groupby object  # noqa: E225, E999
gb.agg gb.boxplot gb.cummin gb.describe gb.filter
gb.get_group gb.height gb.last gb.median gb.ngroups
gb.plot gb.rank gb.std gb.transform gb.aggregate
gb.count gb.cumprod gb.dtype gb.first gb.nth
gb.groups gb.hist gb.max gb.min gb.gender
gb.prod gb.resample gb.sum gb.var gb.ohlc
gb.apply gb.cummax gb.cumsum gb.fillna
gb.head gb.indices gb.mean gb.name
gb.quantile gb.size gb.tail gb.weight

```

```python
df_out = df.groupby('gr').agg([np.sum, np.mean, np.std]); df_out.iloc[:,:8]
```




| gr | a sum | a mean | a std | b sum | b mean | b std | c sum | c mean |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.3190 | 0.3190 | NaN | -0.2494 | -0.2494 | NaN | 1.4621 | 1.4621 |
| 1 | 2.4898 | 1.2449 | 0.5367 | -2.9133 | -1.4566 | 1.1949 | 1.2166 | 0.6083 |


**>>> Cross Correlation Series Without Duplicates (func)**

---

```python
df = df_test.copy(); df
```




|   | a | b | c | d | target |
|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 |

```python
def corr_list(df):
    return (df.corr()
              .unstack()
              .sort_values(kind="quicksort", ascending=False)
              .drop_duplicates()
              .iloc[1:])

pv.corr_list(df)
```

    b       target    0.9215
    a       d         0.6605
            target    0.3206
    b       a        -0.0724
    c       d        -0.1764
            b        -0.4545
    target  d        -0.4994
    c       target   -0.7647
    b       d        -0.7967
    a       c        -0.8555
    dtype: float64


**>>> Missing Data Report (func)**

---

```python
df = df_test.copy()
df[df>df.mean()] = None ; df
```




|   | a | b | c | d | target |
|---|---|---|---|---|---|
| 0 | NaN | NaN | -0.5282 | NaN | NaN |
| 1 | 0.8654 | -2.3015 | NaN | NaN | -5.9994 |
| 2 | 0.3190 | NaN | NaN | -2.0601 | NaN |

```python

def missing_data(data):
    "Create a dataframe with a percentage and count of missing values"
    total = data.isnull().sum().sort_values(ascending=False)
    percent = (data.isnull().sum()/data.isnull().count()*100).sort_values(ascending=False)
    return pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

df_out = pv.missing_data(df); df_out
```




|        | Total | Percent |
|--------|---|---|
| b      | 2 | 66.6667 |
| c      | 2 | 66.6667 |
| d      | 2 | 66.6667 |
| target | 2 | 66.6667 |
| a      | 1 | 33.3333 |

**>>> Duplicated Rows Report**

---

```python
df = df_test.copy()
df.loc[2, "a"] = df.loc[1, "a"]  # .loc avoids chained-assignment warnings
df.loc[2, "b"] = df.loc[1, "b"]; df
```




|   | a | b | c | d | target |
|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 |
| 2 | 0.8654 | -2.3015 | 1.4621 | -2.0601 | -0.5910 |

```python
# Get a report of all duplicate records in a dataframe, based on specific columns
df_out = df[df.duplicated(['a', 'b'], keep=False)] ; df_out

```




|   | a | b | c | d | target |
|---|---|---|---|---|---|
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 |
| 2 | 0.8654 | -2.3015 | 1.4621 | -2.0601 | -0.5910 |
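
To drop the duplicates instead of just reporting them, the standard `drop_duplicates` covers the same subset logic:

```python
# Keep only the first occurrence of each (a, b) combination
df_deduped = df.drop_duplicates(subset=['a', 'b'], keep='first')
```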


**>>> Skewness (func)**

---

```python
df = df_test.copy(); df
```




|   | a | b | c | d | target |
|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 |

```python
from scipy.stats import skew

def display_skewness(data):
    '''show skewness information

    Parameters
    ----------
    data : pandas dataframe

    Returns
    -------
    df : pandas dataframe
    '''
    numeric_cols = data.columns[data.dtypes != 'object'].tolist()
    skew_value = []
    for i in numeric_cols:
        skew_value += [skew(data[i])]
    df = pd.concat([pd.Series(numeric_cols),
                    pd.Series(data.dtypes[data.dtypes != 'object'].apply(lambda x: str(x)).values),
                    pd.Series(skew_value)], axis=1)
    df.columns = ['var_name', 'col_type', 'skew_value']
    return df

display_skewness(df)

```




|   | var_name | col_type | skew_value |
|---|---|---|---|
| 0 | a | float64 | 0.1963 |
| 1 | b | float64 | -0.6210 |
| 2 | c | float64 | -0.6659 |
| 3 | d | float64 | -0.5427 |
| 4 | target | float64 | -0.5418 |

### **Feature Processing**

---

---

**>>> Remove Correlated Pairs (func)**

---

```python
df = df_test.copy(); df
```




|   | a | b | c | d | target |
|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 |

```python
def drop_corr(df, thresh=0.99, keep_cols=[]):
    df_corr = df.corr().abs()
    upper = df_corr.where(np.triu(np.ones(df_corr.shape), k=1).astype(bool))  # np.bool was removed in numpy 1.24
    to_remove = [column for column in upper.columns if any(upper[column] > thresh)]
    to_remove = [x for x in to_remove if x not in keep_cols]
    df_corr = df_corr.drop(columns=to_remove)
    return df.drop(to_remove, axis=1)

df_out = pv.drop_corr(df, thresh=0.1, keep_cols=["target"]); df_out
```




|   | a | b | target |
|---|---|---|---|
| 0 | 1.6243 | -0.6118 | 1.1227 |
| 1 | 0.8654 | -2.3015 | -5.9994 |
| 2 | 0.3190 | -0.2494 | -0.5910 |


**>>> Replace Infrequently Occurring Categories**

---

```python
df = df_test.copy()
df = pd.concat([df]*3)  # DataFrame.append was removed in pandas 2.0
df["cat"] = ["bat","bat","rat","mat","mat","mat","mat","mat","mat"]; df
```




|   | a | b | c | d | target | cat |
|---|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 | bat |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 | bat |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 | rat |
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 | mat |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 | mat |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 | mat |
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 | mat |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 | mat |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 | mat |

```python

def replace_small_cat(df, columns, thresh=0.2, term="Other"):
    for col in columns:
        # Step 1: count the frequencies
        frequencies = df[col].value_counts(normalize=True)

        # Step 2: establish your threshold and filter the smaller categories
        small_categories = frequencies[frequencies < thresh].index

        df[col] = df[col].replace(small_categories, term)
    return df

df_out = pv.replace_small_cat(df, ["cat"]); df_out.head()

```




|   | a | b | c | d | target | cat |
|---|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 | bat |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 | bat |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 | Other |
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 | mat |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 | mat |

**>>> Quasi-Constant Features Detection (func)**

---

```python
df = df_test.copy()
df["a"] = 3
```

```python

def constant_feature_detect(data, threshold=0.98):
    """Detect features that show the same value for the
    majority/all of the observations (constant/quasi-constant features)

    Parameters
    ----------
    data : pd.DataFrame
    threshold : threshold to identify the variable as constant

    Returns
    -------
    list of variable names
    """
    data_copy = data.copy(deep=True)
    quasi_constant_feature = []
    for feature in data_copy.columns:
        predominant = (data_copy[feature].value_counts() / float(
            len(data_copy))).sort_values(ascending=False).values[0]  # np.float was removed in numpy 1.24
        if predominant >= threshold:
            quasi_constant_feature.append(feature)
    print(len(quasi_constant_feature), ' variables are found to be almost constant')
    return quasi_constant_feature

# the original dataset has no constant variable
qconstant_col = pv.constant_feature_detect(data=df, threshold=0.9)
df_out = df.drop(qconstant_col, axis=1); df_out
```

1 variables are found to be almost constant




|   | b | c | d | target |
|---|---|---|---|---|
| 0 | -0.6118 | -0.5282 | -1.0730 | 1.1227 |
| 1 | -2.3015 | 1.7448 | -0.7612 | -5.9994 |
| 2 | -0.2494 | 1.4621 | -2.0601 | -0.5910 |

```python
### I will take care of outliers separately
```

**>>> Filling Missing Values Separately**

---

```python
df = df_test.copy()
df[df>df.mean()] = None ; df
```




|   | a | b | c | d | target |
|---|---|---|---|---|---|
| 0 | NaN | NaN | -0.5282 | NaN | NaN |
| 1 | 0.8654 | -2.3015 | NaN | NaN | -5.9994 |
| 2 | 0.3190 | NaN | NaN | -2.0601 | NaN |

```python
# Clean up missing values in multiple DataFrame columns
dict_fill = {'a': 4,
             'b': 3,
             'c': 5,
             'd': 9999,
             'target': "False"}
df = df.fillna(dict_fill); df

```




|   | a | b | c | d | target |
|---|---|---|---|---|---|
| 0 | 4.0000 | 3.0000 | -0.5282 | 9999.0000 | False |
| 1 | 0.8654 | -2.3015 | 5.0000 | 9999.0000 | -5.9994 |
| 2 | 0.3190 | 3.0000 | 5.0000 | -2.0601 | False |


**>>> Conditioned Column Value Replacement**

---

```python
df = df_test.copy(); df
```




|   | a | b | c | d | target |
|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 |

```python
# Set DataFrame column values based on other column values
df.loc[(df['a'] > 1) & (df['c'] < 0), ['target']] = np.nan; df
```




|   | a | b | c | d | target |
|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | NaN |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 |


**>>> Remove Non-numeric Values in Data Frame**

---

```python
df = df_test.copy().assign(target=lambda row: row["a"].round(4).astype(str)+"SC"+row["b"].round(4).astype(str))
df["a"] = "TI4560L" + df["a"].round(4).astype(str) ; df
```




|   | a | b | c | d | target |
|---|---|---|---|---|---|
| 0 | TI4560L1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.6243SC-0.6118 |
| 1 | TI4560L0.8654 | -2.3015 | 1.7448 | -0.7612 | 0.8654SC-2.3015 |
| 2 | TI4560L0.319 | -0.2494 | 1.4621 | -2.0601 | 0.319SC-0.2494 |

```python
df_out = df.replace('[^0-9]+', '', regex=True); df_out
```




|   | a | b | c | d | target |
|---|---|---|---|---|---|
| 0 | 456016243 | -0.6118 | -0.5282 | -1.0730 | 1624306118 |
| 1 | 456008654 | -2.3015 | 1.7448 | -0.7612 | 0865423015 |
| 2 | 45600319 | -0.2494 | 1.4621 | -2.0601 | 031902494 |


**>>> Feature Scaling, Normalisation, Standardisation (func)**

---

```python
df = df_test.copy(); df
```




|   | a | b | c | d | target |
|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 |

```python
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

def scaler(df, scaler=None, train=True, target=None, cols_ignore=None, type="Standard"):

    if cols_ignore:
        hold = df[cols_ignore].copy()
        df = df.drop(cols_ignore, axis=1)
    if target:
        x = df.drop([target], axis=1).values  # returns a numpy array
    else:
        x = df.values
    if train:
        if type == "Standard":
            scal = StandardScaler()
        elif type == "MinMax":
            scal = MinMaxScaler()
        scal.fit(x)
        x_scaled = scal.transform(x)
    else:
        x_scaled = scaler.transform(x)

    if target:
        df_out = pd.DataFrame(x_scaled, index=df.index, columns=df.drop([target], axis=1).columns)
        df_out[target] = df[target]
    else:
        df_out = pd.DataFrame(x_scaled, index=df.index, columns=df.columns)

    if cols_ignore:  # only re-attach the held-out columns when they exist
        df_out = pd.concat((hold, df_out), axis=1)
    if train:
        return df_out, scal
    else:
        return df_out

df_out_train, scl = pv.scaler(df, target="target", cols_ignore=["a"], type="MinMax")
df_out_test = pv.scaler(df_test, scaler=scl, train=False, target="target", cols_ignore=["a"]); df_out_test

```




|   | a | b | c | d | target |
|---|---|---|---|---|---|
| 0 | 1.6243 | 0.8234 | 0.0000 | 0.76 | 1.1227 |
| 1 | 0.8654 | 0.0000 | 1.0000 | 1.00 | -5.9994 |
| 2 | 0.3190 | 1.0000 | 0.8756 | 0.00 | -0.5910 |


**>>> Impute Null with Tail Distribution (func)**

---

```python
df = df_test.copy()
df[df>df.mean()] = None ; df
```




|   | a | b | c | d | target |
|---|---|---|---|---|---|
| 0 | NaN | NaN | -0.5282 | NaN | NaN |
| 1 | 0.8654 | -2.3015 | NaN | NaN | -5.9994 |
| 2 | 0.3190 | NaN | NaN | -2.0601 | NaN |

```python
from warnings import warn

def impute_null_with_tail(df, cols=[]):
    """
    replacing the NA by values that are at the far end of the distribution of that variable,
    calculated as mean + 3*std
    """
    df = df.copy(deep=True)
    for i in cols:
        if df[i].isnull().sum() > 0:
            df[i] = df[i].fillna(df[i].mean() + 3*df[i].std())
        else:
            warn("Column %s has no missing values" % i)
    return df

df_out = pv.impute_null_with_tail(df, cols=df.columns); df_out
```




|   | a | b | c | d | target |
|---|---|---|---|---|---|
| 0 | 1.7512 | NaN | -0.5282 | NaN | NaN |
| 1 | 0.8654 | -2.3015 | NaN | NaN | -5.9994 |
| 2 | 0.3190 | NaN | NaN | -2.0601 | NaN |

Columns with only a single non-null value keep their NaNs: their standard deviation, and therefore the tail value, is undefined.


**>>> Detect Outliers (func)**

---

```python
df = df_test.copy(); df
```




|   | a | b | c | d | target |
|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 |

```python
def outlier_detect(data, col, threshold=3, method="IQR"):

    if method == "IQR":
        IQR = data[col].quantile(0.75) - data[col].quantile(0.25)
        Lower_fence = data[col].quantile(0.25) - (IQR * threshold)
        Upper_fence = data[col].quantile(0.75) + (IQR * threshold)
    if method == "STD":
        Upper_fence = data[col].mean() + threshold * data[col].std()
        Lower_fence = data[col].mean() - threshold * data[col].std()
    if method == "OWN":
        Upper_fence = data[col].mean() + threshold * data[col].std()
        Lower_fence = data[col].mean() - threshold * data[col].std()
    if method == "MAD":
        median = data[col].median()
        median_absolute_deviation = np.median([np.abs(y - median) for y in data[col]])
        modified_z_scores = pd.Series([0.6745 * (y - median) / median_absolute_deviation for y in data[col]])
        outlier_index = np.abs(modified_z_scores) > threshold
        print('Num of outlier detected:', outlier_index.value_counts()[1])
        print('Proportion of outlier detected', outlier_index.value_counts()[1]/len(outlier_index))
        return outlier_index, (median_absolute_deviation, median_absolute_deviation)

    para = (Upper_fence, Lower_fence)
    tmp = pd.concat([data[col] > Upper_fence, data[col] < Lower_fence], axis=1)
    outlier_index = tmp.any(axis=1)
    print('Num of outlier detected:', outlier_index.value_counts()[1])
    print('Proportion of outlier detected', outlier_index.value_counts()[1]/len(outlier_index))
    return outlier_index, para

index, para = pv.outlier_detect(df, "a", threshold=0.5, method="IQR")
```

    Num of outlier detected: 1
    Proportion of outlier detected 0.3333333333333333

**>>> Windsorize Outliers (func)**

---

```python
# RUN above example first
df = df_test.copy(); df

```




|   | a | b | c | d | target |
|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 |

```python
def windsorization(data, col, para, strategy='both'):
    """
    top-coding & bottom-coding (capping the maximum of a distribution
    at an arbitrarily set value, and vice versa)
    """
    data_copy = data.copy(deep=True)
    if strategy == 'both':
        data_copy.loc[data_copy[col] > para[0], col] = para[0]
        data_copy.loc[data_copy[col] < para[1], col] = para[1]
    elif strategy == 'top':
        data_copy.loc[data_copy[col] > para[0], col] = para[0]
    elif strategy == 'bottom':
        data_copy.loc[data_copy[col] < para[1], col] = para[1]
    return data_copy

df_out = pv.windsorization(data=df, col='a', para=para, strategy='both'); df_out
```



|   | a | b | c | d | target |
|---|---|---|---|---|---|
| 0 | 1.5712 | -0.6118 | -0.5282 | -1.0730 | 1.1227 |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 |


**>>> Drop Outliers**

---

```python
## run the top two examples
df = df_test.copy(); df
```




|   | a | b | c | d | target |
|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 |

```python
df_out = df[~index] ; df_out
```




|   | a | b | c | d | target |
|---|---|---|---|---|---|
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 |


**>>> Impute Outliers**

---

```python
def impute_outlier(data, col, outlier_index, strategy='mean'):
    """
    impute outliers with the mean/median/most frequent value of that variable.
    """
    data_copy = data.copy(deep=True)
    if strategy == 'mean':
        data_copy.loc[outlier_index, col] = data_copy[col].mean()
    elif strategy == 'median':
        data_copy.loc[outlier_index, col] = data_copy[col].median()
    elif strategy == 'mode':
        data_copy.loc[outlier_index, col] = data_copy[col].mode()[0]

    return data_copy

df_out = pv.impute_outlier(data=df, col='a', outlier_index=index, strategy='mean'); df_out
```




|   | a | b | c | d | target |
|---|---|---|---|---|---|
| 0 | 0.9363 | -0.6118 | -0.5282 | -1.0730 | 1.1227 |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 |

### **Feature Engineering**

---

---

**>>> Automated Dummy (one-hot) Encoding (func)**

---

```python
df = df_test.copy()
df["e"] = np.where(df["c"]> df["a"], 1, 2)
```

```python
def auto_dummy(df, unique=15):
    # Create dummies for columns with a small number of uniques
    list_dummies = []
    for col in df.columns:
        if len(df[col].unique()) < unique:
            list_dummies.append(col)
    df = pd.get_dummies(df, columns=list_dummies)
    return df

df_out = pv.auto_dummy(df, unique=3); df_out
```



|   | a | b | c | d | target | e_1 | e_2 |
|---|---|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 | 0 | 1 |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 | 1 | 0 |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 | 1 | 0 |


**>>> Binarise Empty Columns (func)**

---

```python
df = df_test.copy()
df[df>df.mean()] = None ; df
```




|   | a | b | c | d | target |
|---|---|---|---|---|---|
| 0 | NaN | NaN | -0.5282 | NaN | NaN |
| 1 | 0.8654 | -2.3015 | NaN | NaN | -5.9994 |
| 2 | 0.3190 | NaN | NaN | -2.0601 | NaN |

```python
def binarise_empty(df, frac=0.80):
    # Binarise mostly-empty columns
    this = []
    for col in df.columns:
        if df[col].dtype != "object":
            is_null = df[col].isnull().astype(int).sum()
            if (is_null/df.shape[0]) > frac:  # if more than `frac` of the column is null, binarise
                print(col)
                this.append(col)
                df[col] = df[col].astype(float)
                df[col] = df[col].apply(lambda x: 0 if np.isnan(x) else 1)
    df = pd.get_dummies(df, columns=this)
    return df

df_out = pv.binarise_empty(df, frac=0.6); df_out
```

b
c
d
target




|   | a | b_0 | b_1 | c_0 | c_1 | d_0 | d_1 | target_0 | target_1 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 |
| 1 | 0.8654 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
| 2 | 0.3190 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |


**>>> Polynomials (func)**

---

```python
df = df_test.copy(); df
```




|   | a | b | c | d | target |
|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 |

```python
def polynomials(df, feature_list):
    for feat in feature_list:
        for feat_two in feature_list:
            if feat == feat_two:
                continue
            else:
                df[feat+"/"+feat_two] = df[feat]/(df[feat_two]-df[feat_two].min())  # zero-division guard
                df[feat+"X"+feat_two] = df[feat]*(df[feat_two])
    return df

df_out = pv.polynomials(df, ["a","b"]); df_out
```




|   | a | b | c | d | target | a/b | aXb | b/a | bXa |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 | 0.9613 | -0.9937 | -0.4687 | -0.9937 |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 | inf | -1.9918 | -4.2124 | -1.9918 |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 | 0.1555 | -0.0796 | -inf | -0.0796 |
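
The min-shift guard still leaves `inf` wherever the shifted denominator is exactly zero (the row holding the column minimum). A small follow-up sketch to neutralise those values before modelling:

```python
import numpy as np

# Replace +/-inf produced by the ratio features with NaN, then impute or drop
df_out = df_out.replace([np.inf, -np.inf], np.nan)
```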


**>>> Transformations (func)**

---

```python
df = df_test.copy(); df
```




|   | a | b | c | d | target |
|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 |

```python
def transformations(df, features):
    df_new = df[features]
    df_new = df_new - df_new.min()

    sqr_name = [str(fa)+"_POWER_2" for fa in df_new.columns]
    log_p_name = [str(fa)+"_LOG_p_one_abs" for fa in df_new.columns]
    rec_p_name = [str(fa)+"_RECIP_p_one" for fa in df_new.columns]
    sqrt_name = [str(fa)+"_SQRT_p_one" for fa in df_new.columns]

    df_sqr = pd.DataFrame(np.power(df_new.values, 2), columns=sqr_name, index=df.index)
    df_log = pd.DataFrame(np.log(df_new.add(1).abs().values), columns=log_p_name, index=df.index)
    df_rec = pd.DataFrame(np.reciprocal(df_new.add(1).values), columns=rec_p_name, index=df.index)
    df_sqrt = pd.DataFrame(np.sqrt(df_new.abs().add(1).values), columns=sqrt_name, index=df.index)

    dfs = [df, df_sqr, df_log, df_rec, df_sqrt]
    df = pd.concat(dfs, axis=1)

    return df

df_out = pv.transformations(df, ["a","b"]); df_out.iloc[:,:8]
```




|   | a | b | c | d | target | a_POWER_2 | b_POWER_2 | a_LOG_p_one_abs |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 | 1.7038 | 2.8554 | 0.8352 |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 | 0.2985 | 0.0000 | 0.4359 |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 | 0.0000 | 4.2114 | 0.0000 |


**>>> Genetic Programming**

---

```python
! pip install gplearn
```

    Collecting gplearn
      Downloading gplearn-0.4.1-py3-none-any.whl (41kB)
    Requirement already satisfied: scikit-learn>=0.20.0 in /usr/local/lib/python3.6/dist-packages (from gplearn) (0.22.1)
    Requirement already satisfied: joblib>=0.13.0 in /usr/local/lib/python3.6/dist-packages (from gplearn) (0.14.1)
    Requirement already satisfied: numpy>=1.11.0 in /usr/local/lib/python3.6/dist-packages (from scikit-learn>=0.20.0->gplearn) (1.17.5)
    Requirement already satisfied: scipy>=0.17.0 in /usr/local/lib/python3.6/dist-packages (from scikit-learn>=0.20.0->gplearn) (1.4.1)
    Installing collected packages: gplearn
    Successfully installed gplearn-0.4.1

```python
df = df_test.copy(); df
```




|   | a | b | c | d | target |
|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 |

```python
from gplearn.genetic import SymbolicTransformer

function_set = ['add', 'sub', 'mul', 'div',
                'sqrt', 'log', 'abs', 'neg', 'inv', 'tan']

gp = SymbolicTransformer(generations=800, population_size=200,
                         hall_of_fame=100, n_components=10,
                         function_set=function_set,
                         parsimony_coefficient=0.0005,
                         max_samples=0.9, verbose=1,
                         random_state=0, n_jobs=6)

gen_feats = gp.fit_transform(df.drop("target", axis=1), df["target"])
df_out = pd.concat((df, pd.DataFrame(gen_feats, columns=["gen_"+str(a) for a in range(gen_feats.shape[1])])), axis=1); df_out.iloc[:,:8]

```

        |   Population Average    |             Best Individual              |
    ---- ------------------------- ------------------------------------------ ----------
     Gen   Length          Fitness   Length          Fitness      OOB Fitness  Time Left
       0    10.14             0.91       22                1                0     43.36m




|   | a | b | c | d | target | gen_0 | gen_1 | gen_2 |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 | -1.8292 | -2.6469 | 0.5059 |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 | -3.5190 | 99.1619 | 3.6243 |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 | -1.4668 | 1.3677 | 3.1826 |

**>>> Principal Component Features (func)**

---

```python
df = df_test.copy(); df
```




|   | a | b | c | d | target |
|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 |

```python
from sklearn.decomposition import PCA, IncrementalPCA

def pca_feature(df, memory_issues=False, mem_iss_component=False, variance_or_components=0.80, drop_cols=None):

    if memory_issues:
        if not mem_iss_component:
            raise ValueError("If you have memory issues, you have to preselect mem_iss_component")
        pca = IncrementalPCA(mem_iss_component)
    else:
        if variance_or_components > 1:
            pca = PCA(n_components=variance_or_components)
        else:  # automated selection based on variance
            pca = PCA(n_components=variance_or_components, svd_solver="full")
    X_pca = pca.fit_transform(df.drop(drop_cols, axis=1))
    df = pd.concat((df[drop_cols], pd.DataFrame(X_pca, columns=["PCA_"+str(i+1) for i in range(X_pca.shape[1])])), axis=1)
    return df

df_out = pv.pca_feature(df, variance_or_components=0.80, drop_cols=["target","a"]); df_out

```




|   | target | a | PCA_1 | PCA_2 |
|---|---|---|---|---|
| 0 | 1.1227 | 1.6243 | -1.2944 | -0.7684 |
| 1 | -5.9994 | 0.8654 | 1.5375 | -0.4537 |
| 2 | -0.5910 | 0.3190 | -0.2431 | 1.2220 |


**>>> Multiple Lags (func)**

---

```python
df = df_test.copy(); df
```




|   | a | b | c | d | target |
|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 |

```python
def multiple_lags(df, start=1, end=3, columns=None):
    if not columns:
        columns = df.columns.to_list()
    lags = range(start, end+1)

    df = df.assign(**{
        '{}_t_{}'.format(col, t): df[col].shift(t)
        for t in lags
        for col in columns
    })
    return df

df_out = pv.multiple_lags(df, start=1, end=2, columns=["a","target"]); df_out
```




|   | a | b | c | d | target | a_t_1 | target_t_1 | a_t_2 | target_t_2 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 | NaN | NaN | NaN | NaN |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 | 1.6243 | 1.1227 | NaN | NaN |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 | 0.8654 | -5.9994 | 1.6243 | 1.1227 |

**>>> Multiple Rolling (func)**

---

```python
df = df_test.copy(); df
```




|   | a | b | c | d | target |
|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 |

```python
def multiple_rolling(df, windows=[1,2], functions=["mean","std"], columns=None):
    windows = [1+a for a in windows]
    if not columns:
        columns = df.columns.to_list()
    rolling_dfs = (df[columns].rolling(i)                    # 1. Create window
                   .agg(functions)                           # 2. Aggregate
                   .rename({col: '{0}_{1:d}'.format(col, i)
                            for col in columns}, axis=1)     # 3. Rename columns
                   for i in windows)                         # For each window
    df_out = pd.concat((df, *rolling_dfs), axis=1)           # 4. Concatenate dataframes
    da = df_out.iloc[:, len(df.columns):]
    da = [col[0] + "_" + col[1] for col in da.columns.to_list()]
    df_out.columns = df.columns.to_list() + da
    return df_out

df_out = pv.multiple_rolling(df, columns=["a"]); df_out
```




|   | a | b | c | d | target | a_2_mean | a_2_std | a_3_mean | a_3_std |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 | NaN | NaN | NaN | NaN |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 | 1.2449 | 0.5367 | NaN | NaN |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 | 0.5922 | 0.3863 | 0.9363 | 0.6555 |

**>>> Date Features**

---

```python
df = df_test.copy()
df["date_fake"] = pd.date_range(start="2019-01-03", end="2019-01-06", periods=len(df)); df
```




|   | a | b | c | d | target | date_fake |
|---|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 | 2019-01-03 00:00:00 |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 | 2019-01-04 12:00:00 |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 | 2019-01-06 00:00:00 |

```python
def date_features(df, date="date"):
    df[date] = pd.to_datetime(df[date])
    df[date+"_month"] = df[date].dt.month.astype(int)
    df[date+"_year"] = df[date].dt.year.astype(int)
    df[date+"_week"] = df[date].dt.week.astype(int)
    df[date+"_day"] = df[date].dt.day.astype(int)
    df[date+"_dayofweek"] = df[date].dt.dayofweek.astype(int)
    df[date+"_dayofyear"] = df[date].dt.dayofyear.astype(int)
    df[date+"_hour"] = df[date].dt.hour.astype(int)
    df[date+"_int"] = pd.to_datetime(df[date]).astype(int)
    return df

df_out = date_features(df, date="date_fake"); df_out.iloc[:,:8]
```




|   | a | b | c | d | target | date_fake | date_fake_month | date_fake_year |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 | 2019-01-03 00:00:00 | 1 | 2019 |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 | 2019-01-04 12:00:00 | 1 | 2019 |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 | 2019-01-06 00:00:00 | 1 | 2019 |
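
On pandas >= 1.1 the `Series.dt.week` accessor used above is deprecated (and removed in 2.0); a hedged drop-in for that one line:

```python
# ISO week number replaces the deprecated .dt.week accessor
df[date+"_week"] = df[date].dt.isocalendar().week.astype(int)
```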


**>>> Haversine Distance (Location Feature) (func)**

---

```python
df = df_test.copy()
df["latitude"] = [39, 35 , 20]
df["longitude"]= [-77, -40 , -10 ]
```

```python
from math import sin, cos, sqrt, atan2, radians

def haversine_distance(row, lat="latitude", lon="longitude"):
    c_lat, c_long = radians(52.5200), radians(13.4050)  # central point: Berlin
    R = 6373.0  # approximate radius of the earth in km
    long = radians(row[lon])
    lat = radians(row[lat])

    dlon = long - c_long
    dlat = lat - c_lat
    a = sin(dlat / 2)**2 + cos(lat) * cos(c_lat) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))

    return R * c

df['distance_central'] = df.apply(pv.haversine_distance, axis=1); df.iloc[:,4:]
```




|   | target | latitude | longitude | distance_central |
|---|---|---|---|---|
| 0 | 1.1227 | 39 | -77 | 6702.7127 |
| 1 | -5.9994 | 35 | -40 | 4583.5988 |
| 2 | -0.5910 | 20 | -10 | 4141.6783 |
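
Row-wise `apply` is slow on large frames; a vectorised NumPy sketch of the same formula (assuming the same Berlin reference point) avoids the per-row Python call:

```python
import numpy as np

def haversine_vectorised(df, lat_col="latitude", lon_col="longitude"):
    # same great-circle formula, computed on whole columns at once
    c_lat, c_lon = np.radians(52.5200), np.radians(13.4050)
    lat, lon = np.radians(df[lat_col]), np.radians(df[lon_col])
    a = (np.sin((lat - c_lat) / 2)**2
         + np.cos(lat) * np.cos(c_lat) * np.sin((lon - c_lon) / 2)**2)
    return 6373.0 * 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

df["distance_central"] = haversine_vectorised(df)
```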


**>>> Parse Address**

---

```python
df = df_test.copy()
df["addr"] = pd.Series([
'Washington, D.C. 20003',
'Brooklyn, NY 11211-1755',
'Omaha, NE 68154' ]) ; df
```




|   | a | b | c | d | target | addr |
|---|---|---|---|---|---|---|
| 0 | 1.6243 | -0.6118 | -0.5282 | -1.0730 | 1.1227 | Washington, D.C. 20003 |
| 1 | 0.8654 | -2.3015 | 1.7448 | -0.7612 | -5.9994 | Brooklyn, NY 11211-1755 |
| 2 | 0.3190 | -0.2494 | 1.4621 | -2.0601 | -0.5910 | Omaha, NE 68154 |

```python
regex = (r'(?P<city>[A-Za-z ]+), '
         r'(?P<state>[A-Z]{2}) '
         r'(?P<zip>\d{5}(?:-\d{4})?)')

df.addr.str.replace('.', '', regex=False).str.extract(regex)
```




|   | city | state | zip |
|---|---|---|---|
| 0 | Washington | DC | 20003 |
| 1 | Brooklyn | NY | 11211-1755 |
| 2 | Omaha | NE | 68154 |


**>>> Processing Strings in Pandas**

---

```python
df = pd.util.testing.makeMixedDataFrame()
df["C"] = df["C"] + " " + df["C"] ; df
```




|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 0.0 | 0.0 | foo1 foo1 | 2009-01-01 |
| 1 | 1.0 | 1.0 | foo2 foo2 | 2009-01-02 |
| 2 | 2.0 | 0.0 | foo3 foo3 | 2009-01-05 |
| 3 | 3.0 | 1.0 | foo4 foo4 | 2009-01-06 |
| 4 | 4.0 | 0.0 | foo5 foo5 | 2009-01-07 |

```python
"""convert column to UPPERCASE"""

col_name = "C"
df[col_name].str.upper()

"""count string occurence in each row"""
df[col_name].str.count(r'\d') # counts number of digits

"""count # o chars in each row"""
df[col_name].str.count('o') # counts number of digits

"""split rows"""
s = pd.Series(["this is a regular sentence", "https://docs.p.org", np.nan])
s.str.split()

"""this creates new columns with the different split values (instead of lists)"""
s.str.split(expand=True)

"""limit the number of splits to 1, and start spliting from the rights side"""
s.str.rsplit("/", n=1, expand=True)

```




|   | 0 | 1 |
|---|---|---|
| 0 | this is a regular sentence | None |
| 1 | https:/ | docs.p.org |
| 2 | NaN | NaN |

**>>> Filtering Strings in Pandas**

---

```python
df = pd.util.testing.makeMixedDataFrame()
df["C"] = df["C"] + " " + df["C"] ; df
```




|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 0.0 | 0.0 | foo1 foo1 | 2009-01-01 |
| 1 | 1.0 | 1.0 | foo2 foo2 | 2009-01-02 |
| 2 | 2.0 | 0.0 | foo3 foo3 | 2009-01-05 |
| 3 | 3.0 | 1.0 | foo4 foo4 | 2009-01-06 |
| 4 | 4.0 | 0.0 | foo5 foo5 | 2009-01-07 |

```python
col_name = "C"

"""check if a certain word/pattern occurs in each row"""
df[col_name].str.contains('oo') # returns True/False for each row

"""find occurences"""
df[col_name].str.findall(r'[ABC]\d') # returns a list of the found occurences of the specified pattern for each row

"""replace Weekdays by abbrevations (e.g. Monday --> Mon)"""
df[col_name].str.replace(r'(\w+day\b)', lambda x: x.groups[0][:3]) # () in r'' creates a group with one element, which we acces with x.groups[0]

"""create dataframe from regex groups (str.extract() uses first match of the pattern only)"""
df[col_name].str.extract(r'(\d?\d):(\d\d)')
df[col_name].str.extract(r'(?P\d?\d):(?P\d\d)')
df[col_name].str.extract(r'(?P

"""if you want to take into account ALL matches in a row (not only first one):"""
df[col_name].str.extractall(r'(\d?\d):(\d\d)') # this generates a multiindex with level 1 = 'match', indicating the order of the match

df[col_name].replace('\n', '', regex=True, inplace=True)

"""remove all the characters after (including ) for column - col_1"""
df[col_name].replace(' .*', '', regex=True, inplace=True)

"""remove white space at the beginning of string"""
df[col_name] = df[col_name].str.lstrip()

```

### **Model Validation**

---

---


**>>> Classification Metrics (func)**

---

```python
y_test = [0, 1, 1, 1, 0]
y_predict = [0, 0, 1, 1, 1]
y_prob = [0.2,0.6,0.7,0.7,0.9]
```

```python
from sklearn.metrics import roc_auc_score, average_precision_score, confusion_matrix
from sklearn.metrics import log_loss, brier_score_loss, accuracy_score

def classification_scores(y_test, y_predict, y_prob):

    confusion_mat = confusion_matrix(y_test, y_predict)

    TN = confusion_mat[0][0]
    FP = confusion_mat[0][1]
    TP = confusion_mat[1][1]
    FN = confusion_mat[1][0]

    # Sensitivity, recall, or true positive rate
    TPR = TP/(TP+FN)
    # Specificity or true negative rate
    TNR = TN/(TN+FP)
    # Precision or positive predictive value
    PPV = TP/(TP+FP)
    # Negative predictive value
    NPV = TN/(TN+FN)
    # Fall-out or false positive rate
    FPR = FP/(FP+TN)
    # False negative rate
    FNR = FN/(TP+FN)
    # False discovery rate
    FDR = FP/(TP+FP)

    ll = log_loss(y_test, y_prob)
    br = brier_score_loss(y_test, y_prob)
    acc = accuracy_score(y_test, y_predict)
    print(acc)
    auc = roc_auc_score(y_test, y_prob)
    print(auc)
    prc = average_precision_score(y_test, y_prob)

    data = np.array([np.arange(1)]*1).T
    df_exec = pd.DataFrame(data)

    df_exec["Average Log Likelihood"] = ll
    df_exec["Brier Score Loss"] = br
    df_exec["Accuracy Score"] = acc
    df_exec["ROC AUC Score"] = auc
    df_exec["Average Precision Score"] = prc
    df_exec["Precision - Bankrupt Firms"] = PPV
    df_exec["False Positive Rate (p-value)"] = FPR
    df_exec["Precision - Healthy Firms"] = NPV
    df_exec["False Negative Rate (recall error)"] = FNR
    df_exec["False Discovery Rate"] = FDR
    df_exec["All Observations"] = TN + TP + FN + FP
    df_exec["Bankruptcy Sample"] = TP + FN
    df_exec["Healthy Sample"] = TN + FP
    df_exec["Recalled Bankruptcy"] = TP + FP
    df_exec["Correct (True Positives)"] = TP
    df_exec["Incorrect (False Positives)"] = FP
    df_exec["Recalled Healthy"] = TN + FN
    df_exec["Correct (True Negatives)"] = TN
    df_exec["Incorrect (False Negatives)"] = FN

    df_exec = df_exec.T[1:]
    df_exec.columns = ["Metrics"]
    return df_exec

met = pv.classification_scores(y_test, y_predict, y_prob); met
```

0.6
0.5




|   | Metrics |
|---|---|
| Average Log Likelihood | 0.7500 |
| Brier Score Loss | 0.2380 |
| Accuracy Score | 0.6000 |
| ROC AUC Score | 0.5000 |
| Average Precision Score | 0.6944 |
| Precision - Bankrupt Firms | 0.6667 |
| False Positive Rate (p-value) | 0.5000 |
| Precision - Healthy Firms | 0.5000 |
| False Negative Rate (recall error) | 0.3333 |
| False Discovery Rate | 0.3333 |
| All Observations | 5.0000 |
| Bankruptcy Sample | 3.0000 |
| Healthy Sample | 2.0000 |
| Recalled Bankruptcy | 3.0000 |
| Correct (True Positives) | 2.0000 |
| Incorrect (False Positives) | 1.0000 |
| Recalled Healthy | 2.0000 |
| Correct (True Negatives) | 1.0000 |
| Incorrect (False Negatives) | 1.0000 |