https://github.com/luanee/pandera-report

Pandera Report for row-based reporting by using the power of pandera.

Host: GitHub
URL: https://github.com/luanee/pandera-report
Owner: Luanee
License: mit
Created: 2023-09-20T17:09:46.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2024-09-27T22:41:02.000Z (over 1 year ago)
Last Synced: 2025-01-31T09:33:50.621Z (over 1 year ago)
Topics: pandera, reporting
Language: Python
Homepage:
Size: 108 KB
Stars: 3
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md

Pandera Extension for row-based reporting

---

## 🚀 Description

> [pandera](https://github.com/unionai-oss/pandera) provides a flexible and expressive API for performing data
> validation on dataframe-like objects to make data processing pipelines more
> readable and robust

If you need to report data quality issues found while validating a dataframe with `pandera`, then `pandera-report` is your friend. Using the validation failures that `pandera` reports, it extends your original dataframe with those issues on a row-level basis.

With `pandera-report`, you can:

- Integrate seamlessly with the `pandera` library to get enhanced data validation capabilities without interfering with pandera's own functionality.
- Enrich your data with information about why specific rows failed validation.

## ⚡ Setup

Using pip:

```bash
pip install pandera-report
```

Using poetry:

```bash
poetry add pandera-report
```

## Quick start

The following example is taken from the `pandera` documentation. It defines a `DataFrameSchema` against which the provided dataframe validates successfully.

```Python
import pandas as pd
import pandera as pa

# data to validate
df = pd.DataFrame({
    "column1": [1, 4, 0, 10, 9],
    "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
    "column3": ["value_1", "value_2", "value_3", "value_2", "value_1"],
})

# define schema
schema = pa.DataFrameSchema({
    "column1": pa.Column(int, checks=pa.Check.le(10)),
    "column2": pa.Column(float, checks=pa.Check.lt(-1.2)),
    "column3": pa.Column(str, checks=[
        pa.Check.str_startswith("value_"),
        # define custom checks as functions that take a series as input and
        # output a boolean or boolean Series
        pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2),
    ]),
})

validated_df = schema(df)
print(validated_df)

#    column1  column2  column3
# 0        1     -1.3  value_1
# 1        4     -1.4  value_2
# 2        0     -2.9  value_3
# 3       10    -10.1  value_2
# 4        9    -20.4  value_1
```

To use `pandera-report` with the same schema and dataframe, you can do this:

```Python
# import path assumed from the package name; adjust if it differs
from pandera_report import DataFrameValidator

validator = DataFrameValidator()  # default is quality_report=True, lazy=True
print(validator.validate(schema, df))

#    column1  column2  column3 quality_issues quality_status
# 0        1     -1.3  value_1           None          Valid
# 1        4     -1.4  value_2           None          Valid
# 2        0     -2.9  value_3           None          Valid
# 3       10    -10.1  value_2           None          Valid
# 4        9    -20.4  value_1           None          Valid
```

You see? The result is the same, but it is extended with the information that every row of the dataframe passed validation. This extension can also be deactivated for the case where everything is 100% valid.

But what if the dataframe contains data quality issues? `pandera` raises `SchemaErrors` or `SchemaError` (depending on whether validation is lazy). Let's see what `pandera-report` does if we change the dataframe so that it violates the schema definition:

```Python
# data to validate
df = pd.DataFrame({
    "column1": [1, 4, 0, 10, 9],
    "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
    "column3": ["value_1", "value_2", "value_3", "value_2", "value1"],
})

validator = DataFrameValidator()
print(validator.validate(schema, df))
```

Why is this useful? Quite simply, it is particularly helpful when you are not the one who has to prepare the file that is eventually processed into a valid DataFrame.
