https://github.com/bacross/datamunger

python package for handling nan's and outliers

# Readme text for datamunger package
datamunger uses a K-Nearest Neighbors (kNN) approach to impute both outliers and missing data.
For each column of a dataframe, and for each NaN within that column, datamunger uses the other available columns as the feature space in which nearest neighbors are found and used to impute the missing data point.
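
The idea above can be sketched with scikit-learn's `KNeighborsRegressor`. This is a minimal illustration, not the package's actual code: the helper name `impute_column_knn` is hypothetical, and filling NaNs in the predictor columns with medians before fitting is an assumption made so that distances can be computed.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor


def impute_column_knn(df, col, k=5):
    """Fill NaNs in `col` by kNN regression on the other columns (hypothetical helper)."""
    others = df.columns.drop(col)
    # Assumption: NaNs in the predictor columns are temporarily replaced by
    # column medians so that neighbor distances are defined.
    features = df[others].fillna(df[others].median())
    missing = df[col].isna()
    if not missing.any() or missing.all():
        # Nothing to impute, or nothing to learn from.
        return df[col].copy()
    n_neighbors = min(k, int((~missing).sum()))
    model = KNeighborsRegressor(n_neighbors=n_neighbors)
    model.fit(features[~missing], df.loc[~missing, col])
    filled = df[col].copy()
    filled[missing] = model.predict(features[missing])
    return filled


rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.standard_normal((200, 4)))
demo.iat[3, 1] = np.nan
demo.iat[10, 1] = np.nan
demo.iat[5, 2] = np.nan
filled_col = impute_column_knn(demo, 1, k=10)
```

Each imputed value is the mean of the target column over the `k` rows closest in the remaining columns.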

Notes:
- Multicore processing does not work well on Windows. Parallelism is possible at two levels: the embarrassingly parallel loop over columns, and the kNN fit itself as exposed by scikit-learn. Right now, the scikit-learn fit is hardcoded to n_jobs=1, and the column loop is set to the number of available cores.
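
A column loop of this kind can be parallelized with joblib roughly as follows. This is a sketch under assumptions: `fill_with_median` is a hypothetical stand-in for the per-column kNN step, and `impute_columns` is not a function of the package.

```python
import numpy as np
import pandas as pd
from joblib import Parallel, delayed


def fill_with_median(series):
    # Hypothetical stand-in for the per-column kNN imputation step.
    return series.fillna(series.median())


def impute_columns(df, n_jobs=1):
    """Run the per-column step on every column, optionally in parallel."""
    results = Parallel(n_jobs=n_jobs)(
        delayed(fill_with_median)(df[col]) for col in df.columns
    )
    # Each result keeps its column label as the Series name.
    return pd.concat(results, axis=1)


rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.standard_normal((100, 4)))
demo.iat[0, 0] = np.nan
demo.iat[5, 2] = np.nan
out = impute_columns(demo, n_jobs=2)
```

Passing `n_jobs=-1` asks joblib to use all available cores, while `n_jobs=1` runs the loop sequentially, which is the safer choice on Windows.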

ToDo's:
1) Handle rows that consist entirely of NaNs.
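
One possible way to detect such rows before imputing, shown only as a sketch (the package does not currently do this, and the variable names are illustrative):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, np.nan, np.nan]})

# True for rows in which every column is NaN; such rows have no
# coordinates from which neighbors could be found.
all_nan_rows = frame.isna().all(axis=1)
usable = frame.loc[~all_nan_rows]

print(all_nan_rows.tolist())  # [False, True, False]
```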

Required Packages:

```python
import numpy as np
import pandas as pd
import random
from sklearn.neighbors import KNeighborsRegressor
import timeit
import sys
from joblib import Parallel, delayed
import importlib
```

```python
sys.path.append('C:/Users/bacro/OneDrive/PythonScripts/MungerProject/datamunger')
```

```python
import imputeKNN as iknn
```

Generate a dataframe of random numbers, then randomly set 10% of its entries to NaN.

```python
df = pd.DataFrame(np.random.randn(1000, 5))
# enumerate every (row, col) position, then blank out a random 10% of them
ix = [(row, col) for row in range(df.shape[0]) for col in range(df.shape[1])]
for row, col in random.sample(ix, int(round(.1 * len(ix)))):
    df.iat[row, col] = np.nan
```

```python
# impute the missing data and time the call
start_time = timeit.default_timer()
newdf = iknn.imputeMissingDataKNN(df,30,multicore=False)
elapsed = timeit.default_timer() - start_time
print(elapsed)
```

1.71197532222

```python
# impute outliers (values outside the lower/upper limits) and time the call
start_time = timeit.default_timer()
cleandf = iknn.imputeOutlierKNN(newdf,lower_lim=0.05,upper_lim=0.95,k=30,multicore=False)
elapsed = timeit.default_timer() - start_time
print(elapsed)
```

1.55103146836
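
The outlier step can be pictured as first turning out-of-range values into NaN and then imputing them like any other missing value. The sketch below assumes that `lower_lim` and `upper_lim` are per-column quantile levels; the README does not state this, and `outliers_to_nan` is a hypothetical helper, not the package's `outlierToNanDF`.

```python
import numpy as np
import pandas as pd


def outliers_to_nan(df, lower_lim=0.05, upper_lim=0.95):
    """Replace values outside each column's quantile band with NaN (hypothetical helper)."""
    # Assumption: the limits are quantile levels computed per column.
    lower = df.quantile(lower_lim)
    upper = df.quantile(upper_lim)
    inside = df.ge(lower, axis=1) & df.le(upper, axis=1)
    # DataFrame.where keeps values where the condition holds and puts NaN elsewhere.
    return df.where(inside)


rng = np.random.default_rng(1)
sample = pd.DataFrame(rng.standard_normal((1000, 3)))
masked = outliers_to_nan(sample)
```

The resulting NaNs can then go through the same imputation path as the originally missing values.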

```python
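# baseline: fill NaNs with each column's median
mediandf = df.apply(lambda x: x.fillna(x.median()), axis=0)
# turn outliers into NaNs, then fill those with the column median as well
medoutdf = iknn.outlierToNanDF(mediandf, lower_lim=0.05, upper_lim=0.95, multicore=False)
meddf = medoutdf.apply(lambda x: x.fillna(x.median()), axis=0)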
```

```python
%matplotlib inline
cleandf.hist(layout=(3,2))
```

![png](output_10_1.png)

```python
df.hist(layout=(3,2))
```

![png](output_11_1.png)

```python
meddf.hist(layout=(3,2))
```

![png](output_12_1.png)