https://github.com/unionai-oss/pandera
A light-weight, flexible, and expressive statistical data testing library
- Host: GitHub
- URL: https://github.com/unionai-oss/pandera
- Owner: unionai-oss
- License: mit
- Created: 2018-11-01T02:18:34.000Z (over 7 years ago)
- Default Branch: main
- Last Pushed: 2024-10-15T02:46:42.000Z (over 1 year ago)
- Last Synced: 2024-10-29T22:56:47.185Z (over 1 year ago)
- Topics: assertions, data-assertions, data-check, data-cleaning, data-processing, data-validation, data-verification, dataframe-schema, dataframes, hypothesis-testing, pandas, pandas-dataframe, pandas-validation, pandas-validator, schema, testing, testing-tools, validation
- Language: Python
- Homepage: https://www.union.ai/pandera
- Size: 4.08 MB
- Stars: 3,343
- Watchers: 20
- Forks: 308
- Open Issues: 392
Metadata Files:
- Readme: README.md
- Contributing: .github/CONTRIBUTING.md
- Funding: .github/FUNDING.yml
- License: LICENSE.txt
- Code of conduct: CODE_OF_CONDUCT.md
Awesome Lists containing this project
- awesome-list - pandera - A light-weight, flexible, and expressive statistical data testing library. (Data Processing / Data Management)
- awesome-data-analysis - Pandera - Data validation through declarative schemas. (Python / Useful Python Tools for Data Analysis)
- best-of-python - GitHub - 41% open · ⏱️ 31.10.2025): (Data Containers & Dataframes)
- awesome-safety-critical-ai - `unionai-oss/pandera`
- awesome-python-data-science - pandera - A lightweight, flexible, and expressive statistical data testing library. (Data Validation / NLP)
- awesome-python-data-science - pandera - A lightweight, flexible, and expressive statistical data testing library. (Data Validation / Synthetic Data)
README
The Open-source Framework for Validating DataFrame-like Objects
Data validation for scientists, engineers, and analysts seeking correctness.
Pandera is a [Union.ai](https://union.ai/blog-post/pandera-joins-union-ai) open
source project that provides a flexible and expressive API for performing data
validation on dataframe-like objects. The goal of Pandera is to make data
processing pipelines more readable and robust with statistically typed
dataframes.
## Install
Pandera supports [multiple dataframe libraries](https://pandera.readthedocs.io/en/stable/supported_libraries.html), including [pandas](http://pandas.pydata.org), [polars](https://docs.pola.rs/), [pyspark](https://spark.apache.org/docs/latest/api/python/index.html), and more. To validate `pandas` DataFrames, install Pandera with the `pandas` extra:
**With `pip`:**
```
pip install 'pandera[pandas]'
```
**With `uv`:**
```
uv pip install 'pandera[pandas]'
```
**With `conda`:**
```
conda install -c conda-forge pandera-pandas
```
## Get started
First, create a dataframe:
```python
import pandas as pd
import pandera.pandas as pa

# data to validate
df = pd.DataFrame({
    "column1": [1, 2, 3],
    "column2": [1.1, 1.2, 1.3],
    "column3": ["a", "b", "c"],
})
```
Validate the data using the object-based API:
```python
# define a schema
schema = pa.DataFrameSchema({
    "column1": pa.Column(int, pa.Check.ge(0)),
    "column2": pa.Column(float, pa.Check.lt(10)),
    "column3": pa.Column(
        str,
        [
            pa.Check.isin([*"abc"]),
            pa.Check(lambda series: series.str.len() == 1),
        ]
    ),
})

print(schema.validate(df))
#    column1  column2 column3
# 0        1      1.1       a
# 1        2      1.2       b
# 2        3      1.3       c
```
Or validate the data using the class-based API:
```python
# define a schema
class Schema(pa.DataFrameModel):
    column1: int = pa.Field(ge=0)
    column2: float = pa.Field(lt=10)
    column3: str = pa.Field(isin=[*"abc"])

    @pa.check("column3")
    def custom_check(cls, series: pd.Series) -> pd.Series:
        return series.str.len() == 1

print(Schema.validate(df))
#    column1  column2 column3
# 0        1      1.1       a
# 1        2      1.2       b
# 2        3      1.3       c
```
> [!WARNING]
> Pandera `v0.24.0` introduces the `pandera.pandas` module, which is now the
> (highly) recommended way of defining `DataFrameSchema`s and `DataFrameModel`s
> for `pandas` data structures like `DataFrame`s. Defining a dataframe schema from
> the top-level `pandera` module will produce a `FutureWarning`:
>
> ```python
> import pandera as pa
>
> schema = pa.DataFrameSchema({"col": pa.Column(str)})
> ```
>
> Update your import to:
>
> ```python
> import pandera.pandas as pa
> ```
>
> After that, the rest of your pandera code should work unchanged. Accessing
> `DataFrameSchema` and the other pandera classes or functions through the
> top-level `pandera` module will be deprecated in version `0.29.0`.
## Next steps
See the [official documentation](https://pandera.readthedocs.io) to learn more.
