Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/aeturrell/skimpy
skimpy is a light weight tool that provides summary statistics about variables in data frames within the console.
data-science eda exploratory-data-analysis pandas statistics summary-statistics
Last synced: 2 months ago
- Host: GitHub
- URL: https://github.com/aeturrell/skimpy
- Owner: aeturrell
- License: other
- Created: 2021-09-01T19:39:56.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2024-04-23T05:33:48.000Z (9 months ago)
- Last Synced: 2024-04-23T08:59:57.924Z (9 months ago)
- Topics: data-science, eda, exploratory-data-analysis, pandas, statistics, summary-statistics
- Language: Python
- Homepage: https://aeturrell.github.io/skimpy/
- Size: 4.34 MB
- Stars: 357
- Watchers: 10
- Forks: 17
- Open Issues: 6
Metadata Files:
- Readme: README.md
- Contributing: docs/contributing.qmd
- License: LICENSE.md
- Code of conduct: docs/code_of_conduct.qmd
- Citation: CITATION.cff
Awesome Lists containing this project
- awesome-quarto - Documentation website from Jupyter Notebook - Quarto used to generate a website from a Jupyter notebook containing Python module documentation. (Real-life examples / Websites formats)
README
# Skimpy
A lightweight tool for creating summary statistics from dataframes.
![png](docs/logo.png)
[![PyPI](https://img.shields.io/pypi/v/skimpy.svg)](https://pypi.org/project/skimpy/)
[![Status](https://img.shields.io/pypi/status/skimpy.svg)](https://pypi.org/project/skimpy/)
[![Python Version](https://img.shields.io/pypi/pyversions/skimpy)](https://pypi.org/project/skimpy)
[![License](https://img.shields.io/pypi/l/skimpy)](https://opensource.org/licenses/MIT)
[![Read the documentation at https://aeturrell.github.io/skimpy/](https://img.shields.io/badge/docs-passing-brightgreen)](https://aeturrell.github.io/skimpy/)
[![Tests](https://github.com/aeturrell/skimpy/workflows/Tests/badge.svg)](https://github.com/aeturrell/skimpy/actions?workflow=Tests)
[![Codecov](https://codecov.io/gh/aeturrell/skimpy/branch/main/graph/badge.svg)](https://codecov.io/gh/aeturrell/skimpy)
[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
[![Black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Google Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/gist/aeturrell/7bf183c559dc1d15ab7e7aaac39ea0ed/skimpy_demo.ipynb)
[![Downloads](https://static.pepy.tech/badge/skimpy)](https://pepy.tech/project/skimpy)
[![Source](https://img.shields.io/badge/source%20code-github-lightgrey?style=for-the-badge)](https://github.com/aeturrell/skimpy)
![Linux](https://img.shields.io/badge/Linux-FCC624?style=for-the-badge&logo=linux&logoColor=black)
![macOS](https://img.shields.io/badge/mac%20os-000000?style=for-the-badge&logo=macos&logoColor=F0F0F0)
![Windows](https://img.shields.io/badge/Windows-0078D6?style=for-the-badge&logo=windows&logoColor=white)

**skimpy** is a lightweight tool that provides summary statistics about variables in **pandas** or **Polars** data frames within the console or your interactive Python window.
Think of it as a super-charged version of **pandas**' `df.describe()`.
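For reference, here is the baseline that comparison refers to. This is a minimal sketch on a small hypothetical data frame (the column names and values are invented for illustration). With default arguments, `df.describe()` summarises only the numeric columns:

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
df = pd.DataFrame(
    {
        "length": [0.1, 0.5, 0.9, None],
        "class": ["a", "b", "a", "b"],
    }
)

# With default arguments, describe() reports count, mean, std, min,
# quartiles and max for numeric columns only, so "class" is left out.
summary = df.describe()
print(summary)
```

The `count` row excludes missing values, so it is 3 for `length` here.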
[You can find the documentation here](https://aeturrell.github.io/skimpy/).
## Quickstart
`skim` a **pandas** or **Polars** dataframe and produce summary statistics within the console
using:
```python
from skimpy import skim

skim(df)
```
where `df` is a **pandas** or **Polars** dataframe.
If you need a dataset to try *skimpy* out on, you can use the built-in test **pandas** data frame:
```python
from skimpy import generate_test_data, skim

df = generate_test_data()
skim(df)
```
╭──────────────────────────────────────────────── skimpy summary ─────────────────────────────────────────────────╮
│ Data Summary Data Types Categories │
│ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓ ┏━━━━━━━━━━━━━━━━━━━━━━━┓ │
│ ┃ dataframe ┃ Values ┃ ┃ Column Type ┃ Count ┃ ┃ Categorical Variables ┃ │
│ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩ ┡━━━━━━━━━━━━━━━━━━━━━━━┩ │
│ │ Number of rows │ 1000 │ │ float64 │ 3 │ │ class │ │
│ │ Number of columns │ 13 │ │ category │ 2 │ │ location │ │
│ └───────────────────┴────────┘ │ datetime64 │ 2 │ └───────────────────────┘ │
│ │ object │ 2 │ │
│ │ int64 │ 1 │ │
│ │ bool │ 1 │ │
│ │ string │ 1 │ │
│ │ timedelta64 │ 1 │ │
│ └─────────────┴───────┘ │
│ number │
│ ┏━━━━━━━━━━━━━━┳━━━━━┳━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━┓ │
│ ┃ column_name ┃ NA ┃ NA % ┃ mean ┃ sd ┃ p0 ┃ p25 ┃ p50 ┃ p75 ┃ p100 ┃ hist ┃ │
│ ┡━━━━━━━━━━━━━━╇━━━━━╇━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━┩ │
│ │ length │ 0 │ 0 │ 0.5016 │ 0.3597 │ 1.573e-06 │ 0.134 │ 0.4976 │ 0.8602 │ 1 │ ▇▃▃▃▅▇ │ │
│ │ width │ 0 │ 0 │ 2.037 │ 1.929 │ 0.002057 │ 0.603 │ 1.468 │ 2.953 │ 13.91 │ ▇▃▁ │ │
│ │ depth │ 0 │ 0 │ 10.02 │ 3.208 │ 2 │ 8 │ 10 │ 12 │ 20 │ ▁▃▇▆▃▁ │ │
│ │ rnd │ 118 │ 11.8 │ -0.01977 │ 1.002 │ -2.809 │ -0.7355 │ -0.0007736 │ 0.6639 │ 3.717 │ ▁▅▇▅▁ │ │
│ └──────────────┴─────┴──────┴──────────┴────────┴───────────┴─────────┴────────────┴────────┴───────┴────────┘ │
│ category │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓ │
│ ┃ column_name ┃ NA ┃ NA % ┃ ordered ┃ unique ┃ │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩ │
│ │ class │ 0 │ 0 │ False │ 2 │ │
│ │ location │ 1 │ 0.1 │ False │ 5 │ │
│ └──────────────────────────────────┴───────────┴────────────────┴───────────────────────┴────────────────────┘ │
│ bool │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓ │
│ ┃ column_name ┃ true ┃ true rate ┃ hist ┃ │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩ │
│ │ booly_col │ 516 │ 0.52 │ ▇ ▇ │ │
│ └────────────────────────────────────┴─────────────────┴───────────────────────────────┴─────────────────────┘ │
│ datetime │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓ │
│ ┃ column_name ┃ NA ┃ NA % ┃ first ┃ last ┃ frequency ┃ │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩ │
│ │ datetime │ 0 │ 0 │ 2018-01-31 │ 2101-04-30 │ ME │ │
│ │ datetime_no_freq │ 3 │ 0.3 │ 1992-01-05 │ 2023-03-04 │ None │ │
│ └──────────────────────────────┴───────┴──────────┴────────────────────┴───────────────────┴─────────────────┘ │
│ <class 'datetime.date'> │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓ │
│ ┃ column_name ┃ NA ┃ NA % ┃ first ┃ last ┃ frequency ┃ │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩ │
│ │ datetime.date │ 0 │ 0 │ 2018-01-31 │ 2101-04-30 │ ME │ │
│ │ datetime.date_no_freq │ 0 │ 0 │ 1992-01-05 │ 2023-03-04 │ None │ │
│ └──────────────────────────────────┴───────┴──────────┴──────────────────┴──────────────────┴────────────────┘ │
│ timedelta64 │
│ ┏━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┓ │
│ ┃ column_name ┃ NA ┃ NA % ┃ mean ┃ median ┃ max ┃ │
│ ┡━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━┩ │
│ │ time diff │ 5 │ 0.5 │ 8 days 00:05:47 │ 0 days 00:00:00 │ 26 days 00:00:00 │ │
│ └──────────────────┴──────┴─────────┴───────────────────────┴───────────────────────┴────────────────────────┘ │
│ string │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓ │
│ ┃ column_name ┃ NA ┃ NA % ┃ words per row ┃ total words ┃ │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │
│ │ text │ 6 │ 0.6 │ 5.8 │ 5761 │ │
│ └───────────────────────────┴─────────┴────────────┴──────────────────────────────┴──────────────────────────┘ │
╰────────────────────────────────────────────────────── End ──────────────────────────────────────────────────────╯

It is recommended that you set your datatypes before using **skimpy** (for example, converting any text columns to the pandas string datatype), as this produces richer statistical summaries. However, the `skim()` function will try to guess the datatypes of your columns.
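
As a minimal sketch of that advice, assuming a small hypothetical data frame (the column names and values are invented here), the datatypes can be set with standard pandas calls before skimming:

```python
import pandas as pd

# Hypothetical example data: every column starts out with a generic dtype.
df = pd.DataFrame(
    {
        "text": ["red apple", "green pear", None],
        "location": ["UK", "US", "UK"],
        "when": ["2021-09-01", "2021-09-02", "2021-09-03"],
    }
)

# Convert free text to the pandas string dtype and labels to categories.
df = df.astype({"text": "string", "location": "category"})

# Parse ISO-formatted date strings into datetime64 values.
df["when"] = pd.to_datetime(df["when"])

print(df.dtypes)
# skim(df) can then be called on df as in the Quickstart.
```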
## Requirements
You can find a full list of requirements in the [pyproject.toml](https://github.com/aeturrell/skimpy/blob/main/pyproject.toml) file.
You can try this package out right now in your browser using this
[Google Colab notebook](https://colab.research.google.com/gist/aeturrell/7bf183c559dc1d15ab7e7aaac39ea0ed/skimpy_demo.ipynb)
(requires a Google account). Note that the Google Colab notebook uses the latest package released on PyPI (rather than the development release).
## Installation
You can install the latest release of *skimpy* via
[pip](https://pip.pypa.io/) from [PyPI](https://pypi.org/):
```bash
$ pip install skimpy
```
To install the development version from git, use:
```bash
$ pip install git+https://github.com/aeturrell/skimpy.git
```
For development, see [contributing](contributing.qmd).
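
To check which release ended up installed, one option is the standard library's `importlib.metadata`. This is a sketch and not part of *skimpy*; the helper name `installed_version` is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string if skimpy is installed, otherwise None.
print(installed_version("skimpy"))
```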
## License
Distributed under the terms of the [MIT license](https://opensource.org/licenses/MIT), *skimpy* is free and open source software.
## Issues
If you encounter any problems, please [file an issue](https://github.com/aeturrell/skimpy/issues) along with a detailed description.
## Credits
This project was generated from [@cjolowicz](https://github.com/cjolowicz)'s [Hypermodern Python Cookiecutter](https://github.com/cjolowicz/cookiecutter-hypermodern-python) template.
**skimpy** was inspired by the R package [**skimr**](https://docs.ropensci.org/skimr/articles/skimr.html) and by exploratory Python packages including [**ydata_profiling**](https://docs.profiling.ydata.ai) and [**dataprep**](https://dataprep.ai/), from which the `clean_columns` function comes.
This package would not have been possible without the [**Rich**](https://github.com/Textualize/rich) package.
The package is built with [poetry](https://python-poetry.org/), while the documentation is built with [Quarto](https://quarto.org/) and [Quartodoc](https://github.com/machow/quartodoc) (a Python package). Tests are run with [nox](https://nox.thea.codes/en/stable/).
Using **skimpy** in your paper? Let us know by raising an issue beginning with "citation" and we'll add it to this page.