The best Python package for comparing two dataframes
https://github.com/gibbsbravo/datadelta

Table of Contents

1. About The Project
2. Getting Started
3. Usage Examples
4. Contributing
5. Example HTML Report Output
6. License
7. Contact

## About The Project

DataDelta is a Python package for comparing two pandas dataframes, for use in data analysis, data engineering, and tracking table changes over time.

DataDelta runs a series of tests (each of which can also be run individually) and summarizes the key changes between two dataframes in a report, produced as both a Python dict and an HTML file. The Python dict is intended for use in DevOps / DataOps pipeline tests that check whether table changes are expected.


## Getting Started

Install DataDelta with pip, or clone the repository locally if you want to make changes.

### Dependencies

DataDelta has very few dependencies:

- pandas: _a fast, powerful, flexible and easy to use open source data analysis and manipulation tool_ - the library DataDelta is built on for comparing dataframes
- numpy: _The fundamental package for scientific computing with Python_ - used for transformations and calculations
- jinja2: _a fast, expressive, extensible templating engine_ - used to generate the HTML report
- pytest (optional): _a mature full-featured Python testing tool that helps you write better programs_ - used for testing

### Installation

- Install from PyPI using pip:
```sh
pip install datadelta
```

OR

- Clone the repo locally:
```sh
git clone https://github.com/gibbsbravo/DataDelta.git
```


## Usage Examples

- Quick-start code to generate a summary report of the changes between two dataframes:

```python
import pandas as pd
import datadelta as delta

old_df = pd.read_csv('MainTestData_old_df.csv')  # Add your old dataframe here
new_df = pd.read_csv('MainTestData_new_df.csv')  # Add your new dataframe here
primary_key = 'A'  # Set the primary key
column_subset = None  # Specify the subset of columns of interest or leave None to compare all columns

# The consolidated_report dictionary will contain the summary changes
consolidated_report, record_changes_comparison_df = delta.create_consolidated_report(
    old_df, new_df, primary_key, column_subset)

# This will create a report named datadelta_html_report.html in the current working directory containing the summary changes
delta.export_html_report(consolidated_report, record_changes_comparison_df,
                         export_file_name='datadelta_html_report.html',
                         overwrite_existing_file=False)
```
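
To try the quick-start code without your own data, the sketch below writes two small CSV files using the file names from the example above. The `write_sample_csvs` helper, the columns other than `A`, and all values are hypothetical and chosen only for illustration. It uses pandas alone and does not call DataDelta.

```python
from pathlib import Path

import pandas as pd


def write_sample_csvs(directory="."):
    """Write two small illustrative CSV files and return their paths."""
    directory = Path(directory)

    # Hypothetical "old" table: column A acts as the primary key.
    old_df = pd.DataFrame({
        "A": [1, 2, 3],
        "B": ["x", "y", "z"],
        "C": [10.0, 20.0, 30.0],
    })

    # Hypothetical "new" table: key 3 removed, key 4 added,
    # and the value of C changed for key 2.
    new_df = pd.DataFrame({
        "A": [1, 2, 4],
        "B": ["x", "y", "w"],
        "C": [10.0, 25.0, 40.0],
    })

    old_path = directory / "MainTestData_old_df.csv"
    new_path = directory / "MainTestData_new_df.csv"
    old_df.to_csv(old_path, index=False)
    new_df.to_csv(new_path, index=False)
    return old_path, new_path
```

Calling `write_sample_csvs()` in the working directory before running the quick-start code gives it two files to compare.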

- Get dataframe summary:

```python
import pandas as pd
import datadelta as delta

new_df = pd.read_csv('MainTestData_new_df.csv')  # Add your new dataframe here
primary_key = 'A'  # Set the primary key
column_subset = None  # Specify the subset of columns of interest or leave None to include all columns

# Returns a report summarizing the key attributes and values of a dataframe
summary_report = delta.get_df_summary(
    input_df=new_df, primary_key=primary_key, column_subset=column_subset, max_cols=15)
```
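
For intuition about what a dataframe summary can contain, here is a minimal plain-pandas sketch. It is not DataDelta's implementation, and the `summarize_dataframe` helper and its dictionary keys are assumptions made for illustration; the fields in the report returned by `get_df_summary` may differ.

```python
import pandas as pd


def summarize_dataframe(input_df, primary_key):
    """Collect a few basic attributes of a dataframe into a plain dict."""
    return {
        "number_of_rows": len(input_df),
        "number_of_columns": input_df.shape[1],
        # A primary key should identify each record uniquely.
        "primary_key_is_unique": bool(input_df[primary_key].is_unique),
        "null_count_by_column": {c: int(n) for c, n in input_df.isna().sum().items()},
        "dtype_by_column": {c: str(t) for c, t in input_df.dtypes.items()},
    }


# Hypothetical sample data with one missing value in column B.
df = pd.DataFrame({"A": [1, 2, 3], "B": ["x", None, "z"]})
summary = summarize_dataframe(df, primary_key="A")
```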

- Get record count changes report:

```python
import pandas as pd
import datadelta as delta

old_df = pd.read_csv('MainTestData_old_df.csv')  # Add your old dataframe here
new_df = pd.read_csv('MainTestData_new_df.csv')  # Add your new dataframe here
primary_key = 'A'  # Set the primary key

# Returns a report summarizing any changes to the number of records (and composition) between two dataframes
record_count_change_report = delta.check_record_count(
    old_df, new_df, primary_key)
```
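
The idea of checking both the number and the composition of records can be sketched in plain pandas. This is an illustration only, not DataDelta's implementation: the `compare_record_keys` helper, its dictionary keys, and the sample data are hypothetical.

```python
import pandas as pd


def compare_record_keys(old_df, new_df, primary_key):
    """Summarize how the set of primary-key values differs between two dataframes."""
    old_keys = set(old_df[primary_key])
    new_keys = set(new_df[primary_key])
    return {
        "old_record_count": len(old_df),
        "new_record_count": len(new_df),
        "records_added": sorted(new_keys - old_keys),
        "records_removed": sorted(old_keys - new_keys),
    }


# Hypothetical sample data: same number of rows, different keys.
old_df = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"]})
new_df = pd.DataFrame({"A": [1, 2, 4], "B": ["x", "y", "w"]})
summary = compare_record_keys(old_df, new_df, primary_key="A")
```

In this sample both dataframes have three rows, yet one key was added and one removed, which is why comparing composition matters in addition to comparing counts.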

Other functions include:

- `check_column_names`: Returns a report summarizing any changes to column names between two dataframes
- `check_datatypes`: Returns a report summarizing any columns with different datatypes
- `check_chg_in_values`: Returns a report summarizing any records with changes in values
- `get_records_in_both_tables`: Returns the records found in both dataframes
- `get_record_changes_comparison_df`: Returns a dataframe comparing any records with changes in values by column
- `export_html_report`: Exports an HTML report of the differences between two dataframes
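
The kinds of checks listed above can be illustrated with plain pandas. The sketch below is not DataDelta's implementation; the variable names and sample data are hypothetical, and the output formats of the DataDelta functions may differ.

```python
import pandas as pd

# Hypothetical sample data: column D is new, column C changes dtype,
# and the value of B changes for the record with key 2.
old_df = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"], "C": [10, 20, 30]})
new_df = pd.DataFrame({
    "A": [1, 2, 4],
    "B": ["x", "q", "w"],
    "C": [10.0, 20.0, 40.0],
    "D": [True, False, True],
})
primary_key = "A"

# Column names: which columns were added or removed.
columns_added = sorted(set(new_df.columns) - set(old_df.columns))
columns_removed = sorted(set(old_df.columns) - set(new_df.columns))

# Datatypes: shared columns whose dtype differs between the two dataframes.
shared_columns = [c for c in old_df.columns if c in new_df.columns]
dtype_changes = {
    c: (str(old_df[c].dtype), str(new_df[c].dtype))
    for c in shared_columns
    if old_df[c].dtype != new_df[c].dtype
}

# Records in both tables: inner join on the primary key.
both = old_df.merge(new_df, on=primary_key, suffixes=("_old", "_new"))

# Changed values: for each shared column, the keys whose value differs.
# Note that NaN != NaN, so missing values on both sides would count as changed here.
value_columns = [c for c in shared_columns if c != primary_key]
changed_keys = {
    c: both.loc[both[f"{c}_old"] != both[f"{c}_new"], primary_key].tolist()
    for c in value_columns
}
```

Here only keys 1 and 2 appear in both dataframes, so value changes are reported for those records alone; key 3 and key 4 would show up in a record count check instead.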


## Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are **greatly appreciated**.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement".
Don't forget to give the project a star! Thanks again!

1. Fork the Project
2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the Branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request


## Example HTML Report Output

![Report Screenshot][report-screenshot]


## License

Distributed under the GNU General Public License v3 (GPLv3). See `LICENSE.txt` for more information.


## Contact

Andrew Gibbs-Bravo - [email protected]

Project Link: [https://github.com/gibbsbravo/DataDelta](https://github.com/gibbsbravo/DataDelta)


[contributors-shield]: https://img.shields.io/github/contributors/gibbsbravo/DataDelta.svg?style=for-the-badge
[contributors-url]: https://github.com/gibbsbravo/DataDelta/graphs/contributors
[forks-shield]: https://img.shields.io/github/forks/gibbsbravo/DataDelta.svg?style=for-the-badge
[forks-url]: https://github.com/gibbsbravo/DataDelta/network/members
[stars-shield]: https://img.shields.io/github/stars/gibbsbravo/DataDelta.svg?style=for-the-badge
[stars-url]: https://github.com/gibbsbravo/DataDelta/stargazers
[issues-shield]: https://img.shields.io/github/issues/gibbsbravo/DataDelta.svg?style=for-the-badge
[issues-url]: https://github.com/gibbsbravo/DataDelta/issues
[license-shield]: https://img.shields.io/github/license/gibbsbravo/DataDelta.svg?style=for-the-badge
[license-url]: https://github.com/gibbsbravo/DataDelta/blob/master/LICENSE.txt
[linkedin-shield]: https://img.shields.io/badge/-LinkedIn-black.svg?style=for-the-badge&logo=linkedin&colorB=555
[report-screenshot]: https://github.com/gibbsbravo/DataDelta/blob/master/images/DatasetComparisonReport.png?raw=true