https://github.com/hmelberg/rulebook

Validate and revise Pandas dataframes

Host: GitHub
URL: https://github.com/hmelberg/rulebook
Owner: hmelberg
License: mit
Created: 2019-01-27T01:09:30.000Z (almost 6 years ago)
Default Branch: master
Last Pushed: 2019-10-13T06:42:12.000Z (about 5 years ago)
Last Synced: 2024-08-09T23:05:39.560Z (3 months ago)
Language: Python
Size: 38.1 KB
Stars: 1
Watchers: 2
Forks: 0
Open Issues: 5
Metadata Files:
- Readme: README.md

README

## Rulebook

  - Validate and revise Pandas dataframes

## Status

  - alpha 

  - The API will change and not everything works well yet; use at your own risk

## Usage

```python
import rulebook as rb

# create a rulebook
rules = rb.RuleBook()

# add rules to the rulebook
rules.add('age>0')  # add one rule for one column
rules.add('no_missing', cols=['id_n', 'date', 'cost'])  # use a predefined rule for many columns

# check the dataframe against the rules (df is an existing pandas DataFrame)
rules.check(df)
```
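
To make concrete what such rules test, the two checks above can be approximated in plain pandas. This is only a sketch and not how `rulebook` implements `check`; the sample dataframe and the names `age_ok` and `no_missing` are invented for illustration.

```python
import pandas as pd

# Hypothetical sample data; column names follow the usage example
df = pd.DataFrame({
    "id_n": [1, 2, 3],
    "age": [34, -1, 52],
    "date": pd.to_datetime(["2019-01-01", None, "2019-03-01"]),
    "cost": [100.0, 250.0, None],
})

# Rough equivalent of the expression rule 'age>0': one boolean per row
age_ok = df.eval("age > 0")

# Rough equivalent of a no-missing rule on several columns: one boolean per cell
no_missing = df[["id_n", "date", "cost"]].notna()

print(age_ok.tolist())
print((~no_missing).sum().to_dict())  # number of missing values per column
```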

    

## Features

  - Succinct: Easy to add many rules to many columns

  - Flexible: Use predefined rules or add your own functions or expressions

  - Smart: The suggest-rules feature saves you the work of writing rules yourself

  - Share: The rulebook, including self-defined functions, can be saved and shared

  - Visualize: Get a quick visualization of the amount and type of invalid data

  

## Installation

```bash
pip install rulebook
```

    

## Requirements

  - Python 3.6 and above

  - Pandas

  

## Licence

  - MIT

  

## Advanced features

  - Add complex rule expressions (pandas expressions) 

```python
# All observations with the same id should also have the same gender
rules.add("df.groupby('id')['gender'].nunique()<2")
```
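
The pandas expression inside that rule can be tried on its own. The sketch below uses a made-up dataframe and shows that the expression yields one result per `id`, not per row; mapping it back to rows (the `row_valid` step) is an illustrative assumption, not necessarily what `rulebook` does internally.

```python
import pandas as pd

# Hypothetical sample data: id 2 is recorded with two different genders
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 3],
    "gender": ["f", "f", "m", "f", "m"],
})

# The rule expression itself: one True/False per id
consistent = df.groupby("id")["gender"].nunique() < 2

# Map the per-id result back to rows to flag the offending observations
row_valid = df["id"].map(consistent)

print(consistent.to_dict())
print(row_valid.tolist())
```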

  - Add rules for revising invalid observations (including self-defined rules) 

```python
# Set all invalid values to missing
rules.add("isin('m', 'f')", cols='gender', invalid='to_missing')

# Check and revise the dataframe
revised_df = rules.check_and_revise(df)
```
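
In plain pandas, a revision of this kind could look roughly like the following. It is a sketch under the assumption that "to missing" means replacing invalid values with `NaN`; the sample data and the names `valid` and `revised` are hypothetical.

```python
import pandas as pd

# Hypothetical sample data with two invalid codes and one missing value
df = pd.DataFrame({"gender": ["m", "f", "x", "F", None]})

# Validity check: only 'm' and 'f' are accepted
valid = df["gender"].isin(["m", "f"])

# Revision: keep valid values, replace everything else with NaN.
# assign() returns a new dataframe, so the original is left untouched.
revised = df.assign(gender=df["gender"].where(valid))

print(revised["gender"].isna().tolist())
```

Returning a new dataframe instead of modifying `df` in place mirrors the `revised_df = ...` pattern in the example above.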

## General structure

  - There are three types of rules that can be added:

    - Expressions

      - Simple: ```rules.add('age>25')```

      - Logical: No pregnant men: ```rules.add("not (gender=='m' and icd=='O82')")```

    - Functions

      - Pre-defined: ```rules.add(['never_negative', 'never_missing'], cols=['id', 'age'])```

      - Self-defined: ```rules.add('THE_NAME_OF_YOUR_FUNCTION')``` 

          Define a function that takes a dataframe (and possibly a column) and returns a Series of True/False values. The name of the function can then be added as a rule.

    - Pandas expressions

      - Series: ```rules.add("name.str.contains('Cathy')")```

      - Dataframe:       

 ```python
 # For each person, age should never decrease as the date increases
 rules.add("df.sort_values(['id', 'date']).groupby('id')['age'].is_monotonic_increasing")
 ```
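
A self-defined function of the kind described above might look like the sketch below, which expresses the "age never decreases" rule as a function that takes a dataframe and returns a True/False Series. The function name, the sample data and the choice to return one value per row are assumptions made for illustration; the README only specifies that the function's name is then passed to `rules.add`.

```python
import pandas as pd

def age_never_decreases(df):
    """Return True for every row belonging to a person whose age never decreases over time."""
    ordered = df.sort_values(["id", "date"])
    # One True/False per id: is age non-decreasing in date order?
    per_id = ordered.groupby("id")["age"].apply(lambda s: s.is_monotonic_increasing)
    # Broadcast the per-id verdict back to the original rows
    return df["id"].map(per_id)

# Hypothetical sample data: person 2 gets younger between the two dates
df = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "date": pd.to_datetime(["2019-01-01", "2020-01-01", "2019-01-01", "2020-01-01"]),
    "age": [40, 41, 30, 29],
})

result = age_never_decreases(df)
print(result.tolist())
```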

 

 ## See also

- [Engarde](https://github.com/TomAugspurger/engarde)

- [assertr](https://github.com/tonyfischetti/assertr)

- [Validada](https://github.com/jnmclarty/validada)

- [Validate (R)](https://cran.r-project.org/web/packages/validate/vignettes/introduction.html)

- [PandasSchema](https://github.com/TMiguelT/PandasSchema)

- [Great Expectations](https://github.com/great-expectations/great_expectations)

 

 ## API info

  - rules=rb.RuleBook()

  - rules.add()

  - rules.delete()

  - rules.view()

  - rules.save()

  - rules.suggest()

  - rules.check()

  - rules.check_and_revise()