https://github.com/luanee/pandera-report
Pandera Report for row-based reporting by using the power of pandera.
pandera reporting
Last synced: 9 months ago
- Host: GitHub
- URL: https://github.com/luanee/pandera-report
- Owner: Luanee
- License: mit
- Created: 2023-09-20T17:09:46.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-09-27T22:41:02.000Z (over 1 year ago)
- Last Synced: 2025-01-31T09:33:50.621Z (over 1 year ago)
- Topics: pandera, reporting
- Language: Python
- Homepage:
- Size: 108 KB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md
README
Pandera Extension for row-based reporting
---
## 🚀 Description
> [pandera](https://github.com/unionai-oss/pandera) provides a flexible and expressive API for performing data
> validation on dataframe-like objects to make data processing pipelines more
> readable and robust
If you need to report data quality issues found while validating a dataframe with `pandera`, then `pandera-report` is your friend. It takes the validation failures that `pandera` reports and extends your original dataframe with them, row by row.
With `pandera-report`, you can:
- Integrate seamlessly with the `pandera` library to get enhanced data validation capabilities without interfering with the pandera functionality.
- Enrich your data with information about why specific rows failed validation.
## ⚡ Setup
Using pip:
```bash
pip install pandera-report
```
Using poetry:
```bash
poetry add pandera-report
```
## Quick start
The following example is taken from the `pandera` documentation. It defines a `DataFrameSchema` against which the provided dataframe validates successfully.
```Python
import pandas as pd
import pandera as pa

# data to validate
df = pd.DataFrame({
    "column1": [1, 4, 0, 10, 9],
    "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
    "column3": ["value_1", "value_2", "value_3", "value_2", "value_1"]
})

# define schema
schema = pa.DataFrameSchema({
    "column1": pa.Column(int, checks=pa.Check.le(10)),
    "column2": pa.Column(float, checks=pa.Check.lt(-1.2)),
    "column3": pa.Column(str, checks=[
        pa.Check.str_startswith("value_"),
        # define custom checks as functions that take a series as input and
        # outputs a boolean or boolean Series
        pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2)
    ]),
})

validated_df = schema(df)
print(validated_df)
#    column1  column2  column3
# 0        1     -1.3  value_1
# 1        4     -1.4  value_2
# 2        0     -2.9  value_3
# 3       10    -10.1  value_2
# 4        9    -20.4  value_1
```
To use `pandera-report` with the same schema and dataframe, you can do this:
```Python
# assumed import path, inferred from the package name
from pandera_report import DataFrameValidator

validator = DataFrameValidator()  # default is quality_report=True, lazy=True
print(validator.validate(schema, df))
#    column1  column2  column3  quality_issues  quality_status
# 0        1     -1.3  value_1            None           Valid
# 1        4     -1.4  value_2            None           Valid
# 2        0     -2.9  value_3            None           Valid
# 3       10    -10.1  value_2            None           Valid
# 4        9    -20.4  value_1            None           Valid
```
As you can see, the result is the same, extended with columns stating that every row passed validation. This extension can also be deactivated for the case that everything is 100% valid.
But what if the dataframe contains data quality issues? `pandera` raises `SchemaErrors` or `SchemaError`, depending on whether lazy validation is enabled. Let's see what `pandera-report` does when we change the dataframe so that it violates the schema definition:
```Python
# data to validate
df = pd.DataFrame({
    "column1": [1, 4, 0, 10, 9],
    "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
    "column3": ["value_1", "value_2", "value_3", "value_2", "value1"]
})

validator = DataFrameValidator()
print(validator.validate(schema, df))
#    column1  column2  column3                     quality_issues  quality_status
# 0        1     -1.3  value_1                               None           Valid
# 1        4     -1.4  value_2                               None           Valid
# 2        0     -2.9  value_3                               None           Valid
# 3       10    -10.1  value_2                               None           Valid
# 4        9    -20.4   value1  Column : str_startswith('value_')         Invalid
```
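To give a feel for how issues can end up next to the rows they belong to, here is a minimal pandas-only sketch. It is not how `pandera-report` is implemented. It assumes a failure table whose column names (`column`, `check`, `failure_case`, `index`) mirror the `failure_cases` dataframe that pandera attaches to `SchemaErrors`; the table itself is written by hand here, and `attach_issues` as well as the message format are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "column1": [1, 4, 0, 10, 9],
    "column3": ["value_1", "value_2", "value_3", "value_2", "value1"],
})

# Hand-written stand-in for a failure table. The column names mirror
# pandera's SchemaErrors.failure_cases; the single row is made up for this sketch.
failure_cases = pd.DataFrame({
    "column": ["column3"],
    "check": ["str_startswith('value_')"],
    "failure_case": ["value1"],
    "index": [4],
})


def attach_issues(frame, failures):
    """Hypothetical helper: return a copy of frame with issue and status columns."""
    messages = failures["column"] + ": " + failures["check"]
    # One joined message per row label; a row with several failures gets them all.
    per_row = messages.groupby(failures["index"]).agg("; ".join)
    report = frame.copy()
    # Rows without a failure get a missing value after reindexing.
    report["quality_issues"] = per_row.reindex(report.index)
    has_issue = report["quality_issues"].notna()
    report["quality_status"] = has_issue.map({True: "Invalid", False: "Valid"})
    return report


report = attach_issues(df, failure_cases)
print(report[["column3", "quality_issues", "quality_status"]])
```

The original dataframe is left untouched; only the copy gains the two extra columns.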
Why is this useful? Quite simply, it becomes particularly interesting when someone other than you has to prepare the file so that it can be processed into a valid DataFrame in the end.
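One way to act on such a report is to split it into the rows that can be processed and the rows that go back to whoever prepares the file. The sketch below uses plain pandas on a hand-built dataframe that only imitates the `quality_issues` and `quality_status` columns shown above; the names `accepted`, `rejected` and `feedback_csv` are illustrative.

```python
import pandas as pd

# Hand-built stand-in for a validator result; only the two report
# column names are taken from the example output above.
report = pd.DataFrame({
    "column1": [1, 4, 0, 10, 9],
    "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
    "column3": ["value_1", "value_2", "value_3", "value_2", "value1"],
    "quality_issues": [None, None, None, None, "Column : str_startswith('value_')"],
    "quality_status": ["Valid", "Valid", "Valid", "Valid", "Invalid"],
})

invalid_mask = report["quality_status"] == "Invalid"

# Rows with issues keep their report columns so the reason travels with them.
rejected = report[invalid_mask]

# Clean rows continue through the pipeline without the report columns.
accepted = report[~invalid_mask].drop(columns=["quality_issues", "quality_status"])

# CSV text that could be sent back to the data provider.
feedback_csv = rejected.to_csv()

print(f"accepted={len(accepted)} rejected={len(rejected)}")
```

Keeping the row index in the feedback makes it easier for the provider to find the offending lines in the source file.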