https://github.com/multimeric/pandasschema
A validation library for Pandas data frames using user-friendly schemas
data-science pandas schema validation
- Host: GitHub
- URL: https://github.com/multimeric/pandasschema
- Owner: multimeric
- License: gpl-3.0
- Created: 2016-12-05T07:22:21.000Z (almost 8 years ago)
- Default Branch: master
- Last Pushed: 2023-03-24T11:48:47.000Z (over 1 year ago)
- Last Synced: 2024-05-22T09:08:01.847Z (6 months ago)
- Topics: data-science, pandas, schema, validation
- Language: Python
- Homepage: https://multimeric.github.io/PandasSchema/
- Size: 767 KB
- Stars: 185
- Watchers: 6
- Forks: 34
- Open Issues: 37
Metadata Files:
- Readme: README.rst
- License: LICENSE
PandasSchema
************

For the full documentation, refer to the `Github Pages Website
<https://multimeric.github.io/PandasSchema/>`_.

======================================================================

PandasSchema is a module for validating tabulated data, such as CSVs
(Comma Separated Value files), and TSVs (Tab Separated Value files).
It uses the incredibly powerful data analysis tool Pandas to do so
quickly and efficiently.

For example, say your code expects a CSV that looks a bit like this:

.. code:: default

    Given Name,Family Name,Age,Sex,Customer ID
    Gerald,Hampton,82,Male,2582GABK
    Yuuwa,Miyake,27,Male,7951WVLW
    Edyta,Majewska,50,Female,7758NSID

Now you want to be able to ensure that the data in your CSV is in the
correct format:

.. code:: python

    import pandas as pd
    from io import StringIO
    from pandas_schema import Column, Schema
    from pandas_schema.validation import (
        LeadingWhitespaceValidation,
        TrailingWhitespaceValidation,
        CanConvertValidation,
        MatchesPatternValidation,
        InRangeValidation,
        InListValidation,
    )

    schema = Schema([
        Column('Given Name', [LeadingWhitespaceValidation(), TrailingWhitespaceValidation()]),
        Column('Family Name', [LeadingWhitespaceValidation(), TrailingWhitespaceValidation()]),
        Column('Age', [InRangeValidation(0, 120)]),
        Column('Sex', [InListValidation(['Male', 'Female', 'Other'])]),
        Column('Customer ID', [MatchesPatternValidation(r'\d{4}[A-Z]{4}')])
    ])

    test_data = pd.read_csv(StringIO('''Given Name,Family Name,Age,Sex,Customer ID
    Gerald ,Hampton,82,Male,2582GABK
    Yuuwa,Miyake,270,male,7951WVLW
    Edyta,Majewska ,50,Female,775ANSID
    '''))

    errors = schema.validate(test_data)

    for error in errors:
        print(error)

PandasSchema would then output

.. code:: text

    {row: 0, column: "Given Name"}: "Gerald " contains trailing whitespace
    {row: 1, column: "Age"}: "270" was not in the range [0, 120)
    {row: 1, column: "Sex"}: "male" is not in the list of legal options (Male, Female, Other)
    {row: 2, column: "Family Name"}: "Majewska " contains trailing whitespace
    {row: 2, column: "Customer ID"}: "775ANSID" does not match the pattern "\d{4}[A-Z]{4}"
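
To make it clearer what each of the five messages above is checking, here is a rough standard-library sketch that applies the same rules by hand to the same malformed sample. It is not how PandasSchema works internally and it does not use the library at all. The names ``check_row`` and ``check_csv`` are hypothetical helpers invented for this illustration. Two behaviours are assumptions: the range is treated as half-open, as the ``[0, 120)`` notation in the message suggests, and the pattern is searched for anywhere in the value rather than anchored.

```python
import csv
import re
from io import StringIO

# The same malformed sample used in the example above.
SAMPLE = """Given Name,Family Name,Age,Sex,Customer ID
Gerald ,Hampton,82,Male,2582GABK
Yuuwa,Miyake,270,male,7951WVLW
Edyta,Majewska ,50,Female,775ANSID
"""

CUSTOMER_ID = re.compile(r"\d{4}[A-Z]{4}")
LEGAL_SEXES = ["Male", "Female", "Other"]


def check_row(row):
    """Return (column, problem) pairs for one row.

    Hypothetical helper for illustration; not part of PandasSchema.
    """
    problems = []
    for column in ("Given Name", "Family Name"):
        value = row[column]
        if value != value.lstrip():
            problems.append((column, "leading whitespace"))
        if value != value.rstrip():
            problems.append((column, "trailing whitespace"))
    # Assumption: half-open range, so 0 is accepted and 120 is rejected.
    if not 0 <= float(row["Age"]) < 120:
        problems.append(("Age", "out of range"))
    # Membership is case-sensitive, so "male" is rejected.
    if row["Sex"] not in LEGAL_SEXES:
        problems.append(("Sex", "not a legal option"))
    # Assumption: the pattern may match anywhere in the value (re.search).
    if CUSTOMER_ID.search(row["Customer ID"]) is None:
        problems.append(("Customer ID", "pattern mismatch"))
    return problems


def check_csv(text):
    """Return (row index, column, problem) triples for every failing cell."""
    found = []
    for index, row in enumerate(csv.DictReader(StringIO(text))):
        for column, problem in check_row(row):
            found.append((index, column, problem))
    return found


for index, column, problem in check_csv(SAMPLE):
    print(f"row {index}, column {column!r}: {problem}")
```

The sketch flags the same five cells as the PandasSchema output above. What the library adds is a declarative schema, vectorised checks on the data frame, and the formatted messages.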