https://github.com/bacross/datamunger

python package for handling nan's and outliers

data data-frame datamunger knn nan outliers python scikit-learn



Host: GitHub
URL: https://github.com/bacross/datamunger
Owner: bacross
License: mit
Created: 2017-11-08T16:09:49.000Z (over 7 years ago)
Default Branch: master
Last Pushed: 2018-11-28T20:36:07.000Z (over 6 years ago)
Last Synced: 2025-03-07T20:11:11.726Z (4 months ago)
Topics: data, data-frame, datamunger, knn, nan, outliers, python, scikit-learn
Language: Python
Size: 247 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

# Readme text for datamunger package

datamunger uses a K-Nearest Neighbors approach to impute both outliers and missing data.

For each column of a dataframe, and for each NaN within that column, datamunger uses the other available columns as the feature space in which the k nearest neighbors are found; those neighbors are then used to impute the missing data point.
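
As a rough illustration of the idea above, the sketch below imputes a single column with scikit-learn's `KNeighborsRegressor`. It is not datamunger's actual code: the helper name `impute_column_knn` is hypothetical, and temporarily median-filling NaNs in the predictor columns is an assumption made here so that every row has a complete feature vector.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor


def impute_column_knn(df, col, k=5):
    """Hypothetical helper: fill NaNs in `col` using kNN on the other columns."""
    out = df[col].copy()
    missing = out.isna()
    if not missing.any() or missing.all():
        return out  # nothing to impute, or nothing to learn from
    # Assumption: NaNs in the predictor columns are median-filled temporarily
    # so that every row has a complete feature vector.
    features = df.drop(columns=[col])
    features = features.fillna(features.median())
    n_known = int((~missing).sum())
    model = KNeighborsRegressor(n_neighbors=min(k, n_known))
    model.fit(features[~missing], out[~missing])
    out[missing] = model.predict(features[missing])
    return out


rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.standard_normal((50, 3)), columns=["a", "b", "c"])
demo.iloc[[3, 10, 27], 0] = np.nan
filled = impute_column_knn(demo, "a", k=5)
```

Rows whose value in `a` was already known are returned unchanged; only the three NaN entries receive a kNN prediction.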

Notes:

- Multicore processing does not work well on Windows. Parallelism is available in two places: the embarrassingly parallel loop over columns, and the kNN fit provided by scikit-learn. Right now, `n_jobs` is hardcoded to 1 in the scikit-learn fit, and the column loop uses the number of available cores.
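
The column-level parallelism described in the note can be sketched with joblib as follows. This is only an illustration, not the package's implementation: `apply_per_column` and `fill_with_median` are hypothetical names, and the median fill merely stands in for the per-column kNN work. On Windows, process-based parallelism may additionally require the calling code to sit under an `if __name__ == "__main__":` guard.

```python
import numpy as np
import pandas as pd
from joblib import Parallel, delayed


def fill_with_median(series):
    """Stand-in for the per-column work (here: a simple median fill)."""
    return series.fillna(series.median())


def apply_per_column(df, func, n_jobs=1):
    """Run `func` on every column independently and reassemble the dataframe."""
    results = Parallel(n_jobs=n_jobs)(delayed(func)(df[c]) for c in df.columns)
    return pd.concat(results, axis=1)


demo = pd.DataFrame(np.random.randn(20, 4))
demo.iat[2, 1] = np.nan
# n_jobs=1 runs serially; n_jobs=-1 would ask joblib to use all available cores.
result = apply_per_column(demo, fill_with_median, n_jobs=1)
```

Because each column is processed independently, the loop needs no shared state, which is what makes it embarrassingly parallel.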

ToDo's:

1) Handling for a row that consists entirely of NaNs
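
Until that is handled, one possible workaround (an assumption on my part, not something the package does) is to detect rows that are entirely NaN with plain pandas and set them aside before imputing, since such rows offer no neighbors' coordinates to work with:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})

# True for rows in which every column is NaN
all_nan = demo.isna().all(axis=1)

# Keep only the rows that have at least one observed value
usable = demo.loc[~all_nan]
```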

Required Packages:

```python
import numpy as np
import pandas as pd
import random
from sklearn.neighbors import KNeighborsRegressor
import timeit
import sys
from joblib import Parallel, delayed
import importlib
```

```python
# add the local datamunger checkout to the import path (adjust to your own location)
sys.path.append('C:/Users/bacro/OneDrive/PythonScripts/MungerProject/datamunger')
```

```python
import imputeKNN as iknn
```

Generate a dataframe of random numbers, then randomly set 10% of its entries to NaN:

```python
df = pd.DataFrame(np.random.randn(1000,5))
# all (row, col) positions in the dataframe
ix = [(row,col) for row in range(df.shape[0]) for col in range(df.shape[1])]
# overwrite a random 10% of the positions with NaN
for row, col in random.sample(ix, int(round(.1*len(ix)))):
    df.iat[row,col]=np.nan
```

```python
# impute the missing data with kNN and time the call
start_time = timeit.default_timer()
newdf = iknn.imputeMissingDataKNN(df,30,multicore=False)
elapsed = timeit.default_timer() - start_time
print(elapsed)
```

    1.71197532222

    

```python
# impute the outliers with kNN and time the call
start_time = timeit.default_timer()
cleandf = iknn.imputeOutlierKNN(newdf,lower_lim=0.05,upper_lim=0.95,k=30,multicore=False)
elapsed = timeit.default_timer() - start_time
print(elapsed)
```

    1.55103146836

    

```python
# median baseline for comparison: median-fill the NaNs, turn outliers into NaNs,
# then median-fill again
mediandf = df.apply(lambda x: x.fillna(x.median()),axis=0)
medoutdf = iknn.outlierToNanDF(mediandf,lower_lim=0.05,upper_lim=0.95,multicore=False)
meddf = medoutdf.apply(lambda x: x.fillna(x.median()),axis=0)
```
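
The README does not say how `outlierToNanDF` decides what counts as an outlier. A minimal sketch, assuming that `lower_lim` and `upper_lim` are per-column quantile cut-offs (the helper name `outliers_to_nan` is hypothetical and this is not the package's code), could look like this:

```python
import numpy as np
import pandas as pd


def outliers_to_nan(series, lower_lim=0.05, upper_lim=0.95):
    """Hypothetical helper: replace values outside the quantile band with NaN."""
    lo, hi = series.quantile([lower_lim, upper_lim])
    # keep values inside [lo, hi]; everything else becomes NaN
    return series.where(series.between(lo, hi))


s = pd.Series(np.arange(100, dtype=float))
masked = outliers_to_nan(s)
```

With the default limits, the five smallest and five largest values of `s` are replaced by NaN, after which they can be filled by a median or by kNN imputation.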

```python
%matplotlib inline
cleandf.hist(layout=(3,2))
```


![png](output_10_1.png)

```python
df.hist(layout=(3,2))
```


![png](output_11_1.png)

```python
meddf.hist(layout=(3,2))
```


![png](output_12_1.png)
