https://github.com/jlehrer1/instanteda

# Instant EDA
Instantly generate common exploratory data plots without having to worry about cleaning your DataFrame.

The package is hosted on PyPI, the Python Package Index,
[here](https://pypi.org/project/quickplotter/1.0/).

It can be installed by running
```shell
pip install quickplotter==1.0
```

To set up the proper development environment, run
```shell
conda env create -f environment.yml
conda update pip
```

To run the test suite, run `pytest`.

## 1. Usage
```python3
import quickplotter

# df is an existing pandas DataFrame
plotter = quickplotter.QuickPlotter(df)  # creates a QuickPlotter object with the given DataFrame

plotter.common(subset=['correlation', 'percent_nan'])  # plots correlation between features, and percent NaN in each column

plotter.distribution(column_subset=df.columns[0:4])  # plots distributions for the first four columns in the DataFrame

plotter.common(column_subset=['body_mass_index', 'blood_type'])  # plots common plots for the given columns
```

**Remember, this is meant to be a quick and dirty tool for exploration, not one that treats each data entry delicately.** Therefore, if `NaN` values make up `<= 5%` of all values in the DataFrame, the rows containing them are dropped and the plots are generated without them.
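
The 5% rule above can be sketched in plain pandas. This is only an illustration, not the package's actual code: `drop_sparse_nans` and `NAN_THRESHOLD` are hypothetical names, and because the text does not say what happens above the threshold, the sketch simply leaves the DataFrame unchanged in that case.

```python
import numpy as np
import pandas as pd

NAN_THRESHOLD = 0.05  # the 5% cutoff described above


def drop_sparse_nans(df: pd.DataFrame, threshold: float = NAN_THRESHOLD) -> pd.DataFrame:
    """Drop rows containing NaN when NaNs are at most `threshold` of all values.

    Hypothetical helper illustrating the rule; not quickplotter's own code.
    """
    if df.size == 0:
        return df
    # Fraction of NaN cells over every cell in the DataFrame
    nan_fraction = df.isna().to_numpy().sum() / df.size
    if nan_fraction <= threshold:
        return df.dropna()
    # Assumption: above the cutoff the data is left untouched
    return df


# 1 NaN out of 40 values (2.5%), so the row holding it is dropped
df = pd.DataFrame({"a": np.arange(20, dtype=float), "b": np.arange(20, dtype=float)})
df.loc[3, "a"] = np.nan
cleaned = drop_sparse_nans(df)

# 20 NaNs out of 40 values (50%), so nothing is dropped
half_missing = pd.DataFrame({"a": [np.nan] * 20, "b": np.arange(20, dtype=float)})
untouched = drop_sparse_nans(half_missing)
```

Here `cleaned` has one row fewer than `df`, while `untouched` keeps all of its rows.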

## 2. subset & diff lists
The quickplotter module works mainly with two kinds of specification, `subset` and `diff`.

For any `subset`-like list, the items in the list will be used. For any `diff`-like list, all items *except* those in the list will be used.

The options are as follows:
- `subset`: Use only the plots specified in the list.
- `diff`: Use all plots *except* those specified in the list.
- `subset_columns`: Use only the columns specified in the list. Columns can be given either as a slice of `df.columns` or by name.
- `diff_columns`: Use all columns *except* those specified in the list. Columns can be given either as a slice of `df.columns` or by name.
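
The difference between `subset`-like and `diff`-like lists can be sketched in a few lines of plain Python. The `select` helper below is hypothetical and not part of the package, and the column name `age` is made up for the example. Rejecting both lists at once is a choice made for this sketch, not documented behaviour.

```python
from typing import Iterable, List, Optional


def select(available: Iterable[str],
           subset: Optional[Iterable[str]] = None,
           diff: Optional[Iterable[str]] = None) -> List[str]:
    """Return the names chosen by a subset-like or diff-like list.

    Hypothetical helper, not part of quickplotter.
    """
    available = list(available)
    if subset and diff:
        raise ValueError("pass either subset or diff, not both")
    if subset:
        wanted = set(subset)
        # subset-like: keep only the listed names, in their original order
        return [name for name in available if name in wanted]
    if diff:
        excluded = set(diff)
        # diff-like: keep everything except the listed names
        return [name for name in available if name not in excluded]
    return available  # neither given: use everything


columns = ["body_mass_index", "blood_type", "age"]  # "age" is illustrative
chosen_by_subset = select(columns, subset=["body_mass_index", "blood_type"])
chosen_by_diff = select(columns, diff=["age"])
```

Both calls select the same two columns, one by naming what to keep and the other by naming what to leave out.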

## 3. Contributing

If you have read this far, I hope you've found this tool useful. I am always looking to learn more and develop as a programmer, so if you have any ideas or contributions, feel free to open a feature request or pull request.