https://github.com/ivankmk/dfaudit

Audit your pandas DataFrame before you trust it - missing values, soft missing, cardinality and top categories in one call.
https://github.com/ivankmk/dfaudit

data-audit data-governance data-profiling data-quality eda exploratory-data-analysis matplotlib missing-data pandas

Last synced: about 1 month ago
JSON representation

Audit your pandas DataFrame before you trust it - missing values, soft missing, cardinality and top categories in one call.

Host: GitHub
URL: https://github.com/ivankmk/dfaudit
Owner: ivankmk
License: mit
Created: 2026-05-05T17:21:29.000Z (2 months ago)
Default Branch: main
Last Pushed: 2026-05-10T14:31:52.000Z (about 2 months ago)
Last Synced: 2026-05-10T16:29:44.838Z (about 2 months ago)
Topics: data-audit, data-governance, data-profiling, data-quality, eda, exploratory-data-analysis, matplotlib, missing-data, pandas
Language: Python
Homepage:
Size: 516 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

          


  

  dfaudit

  Know if you can trust your data before you use it.

  A fast, modern data audit tool for pandas DataFrames -- built for analysts who need to assess data reliability, not just explore it.


  [![PyPI version](https://img.shields.io/pypi/v/dfaudit.svg)](https://pypi.org/project/dfaudit/)

  [![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue.svg)](https://www.python.org/)

  [![License: MIT](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)



---

## What is dfaudit?

**dfaudit** is a lightweight Python package for auditing pandas DataFrames. Before you clean, model, or report -- you need to know whether your data is reliable. dfaudit gives you that answer fast.

Think of it as an auditor's first pass: one call to understand how complete, consistent, and trustworthy your data actually is. Missing values, cardinality, dominant categories -- the things that tell you if the data holds up.

## Install

```bash

pip install dfaudit

```

## Quickstart

```python

import pandas as pd

import dfaudit as dfa

df = pd.read_csv("https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv")

dfa.overview(df)

dfa.missing_matrix(df)

dfa.unique_matrix(df)

```

## What you get

### `overview(df)` - Column-level data quality table

A styled summary table covering every column at once: dtypes, missing counts and percentages, unique value counts, mode, and top-3 categories - all color coded so problems jump out instantly.

![overview table](docs/assets/screenshot_overview.png)

**missing** — standard `NaN` / `None` / `NaT` values pandas considers null.

**soft_missing** — values that are technically non-null but meaningless in practice: empty strings, placeholder text, and numeric sentinel codes commonly used in legacy systems or exports. dfaudit flags these automatically:

- Strings: `""`, `"n/a"`, `"na"`, `"null"`, `"none"`, `"nan"`, `"nil"`, `"-"`, `"--"`, `"?"`, `"unknown"`, `"missing"`, `"not available"`, `"not applicable"`

- Numbers: `-999`, `-9999`, `9999`, `99999`

Missing percentage cells are highlighted red for high missingness, blue for high cardinality.

### `missing_matrix(df)` - Row-level missingness visualization

A matrix plot that shows *where* your data is missing across every row, with a bar chart of missing percentages per column on the right. Spot correlated missingness, systematic gaps, and data loading issues that column-level stats miss entirely.

![missing matrix](docs/assets/screenshot_missing_matrix.png)

```python

dfa.missing_matrix(df, style="vivid")

```

### `unique_matrix(df)` - Row-level unique value visualization

A matrix plot that shows the distribution of unique values across every row and column. Low-cardinality columns (≤ 10 unique values) get a distinct color per value — so a binary column like `Sex` renders in two colors, a 3-class column like `Pclass` in three. High-cardinality columns (`Name`, `Ticket`, `PassengerId`) appear in a uniform gray, making it immediately obvious which columns are categorical vs. continuous or near-unique identifiers. A bar chart on the right shows the exact unique value count per column.

![unique matrix](docs/assets/screenshot_unique_matrix.png)

```python

dfa.unique_matrix(df)

dfa.unique_matrix(df, style="vivid", max_colors=8)

```

## Why I built this

I work with a lot of different data on a daily basis. For years I carried the same audit snippets from notebook to notebook -- copy, paste, tweak, repeat. At some point I decided to stop doing that and package the most common data audit tasks into something reusable, with visuals I actually want to look at.

dfaudit is what I reach for now at the start of every new dataset. I hope it saves you the same time it saves me.

If you work with modern data and want to contribute -- please do, pull requests are very welcome.

---



  Made by Ivan Kumeyko  ·  MIT License

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/ivankmk/dfaudit

Awesome Lists containing this project

README

dfaudit