https://github.com/gibbsbravo/datadelta
The best Python package for comparing two dataframes
- Host: GitHub
- URL: https://github.com/gibbsbravo/datadelta
- Owner: gibbsbravo
- License: gpl-3.0
- Created: 2021-12-21T14:51:22.000Z (over 3 years ago)
- Default Branch: master
- Last Pushed: 2021-12-29T21:50:19.000Z (over 3 years ago)
- Last Synced: 2024-12-14T00:46:04.714Z (4 months ago)
- Topics: analytics, comparison, data, data-analytics, database, database-management, databases, dataops, dataops-platform, devops, pandas, pandas-dataframe, testing, testing-tools, version-control
- Language: Python
- Homepage:
- Size: 514 KB
- Stars: 10
- Watchers: 1
- Forks: 2
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
Table of Contents
- About The Project
- Getting Started
- Usage Examples
- Contributing
- Example HTML Report Output
- License
- Contact
## About The Project
DataDelta is a Python package for comparing two pandas dataframes, for use in data analysis, data engineering, and tracking table changes over time.
DataDelta runs a series of tests (each of which can also be run individually) and generates a report, as both a Python dict and an HTML file, that summarizes the key changes between two dataframes. The Python dict is intended for use in a DevOps / DataOps pipeline, where tests can check that table changes are expected.
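To make the idea of these tests concrete, here is a minimal sketch of two such checks written with plain pandas. It is not DataDelta's implementation: the helper names `sketch_record_count_check` and `sketch_changed_values`, the shape of the returned dict, and the sample data are all assumptions made for illustration.

```python
import pandas as pd


def sketch_record_count_check(old_df, new_df, primary_key):
    """Hypothetical helper: summarize added and removed records by primary key."""
    old_keys = set(old_df[primary_key])
    new_keys = set(new_df[primary_key])
    return {
        "old_count": len(old_df),
        "new_count": len(new_df),
        "added_keys": sorted(new_keys - old_keys),
        "removed_keys": sorted(old_keys - new_keys),
    }


def sketch_changed_values(old_df, new_df, primary_key):
    """Hypothetical helper: list (key, column) pairs whose values differ."""
    shared_cols = [c for c in old_df.columns
                   if c in new_df.columns and c != primary_key]
    # Inner merge keeps only records present in both dataframes
    merged = old_df.merge(new_df, on=primary_key, suffixes=("_old", "_new"))
    changes = []
    for col in shared_cols:
        old_vals = merged[f"{col}_old"]
        new_vals = merged[f"{col}_new"]
        # Treat two missing values as equal so NaN vs NaN is not a change
        differs = (old_vals != new_vals) & ~(old_vals.isna() & new_vals.isna())
        for key in merged.loc[differs, primary_key].tolist():
            changes.append((key, col))
    return changes


# Illustrative data only (assumed, not from the DataDelta test files)
old_df = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"]})
new_df = pd.DataFrame({"A": [2, 3, 4], "B": ["y", "q", "w"]})

record_report = sketch_record_count_check(old_df, new_df, "A")
value_changes = sketch_changed_values(old_df, new_df, "A")
```

In a pipeline, a test could then assert on such a dict, for example failing the run when `removed_keys` is unexpectedly non-empty.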
## Getting Started
Install DataDelta through pip, or clone the repository locally if you want to make changes.
### Dependencies
DataDelta has very few dependencies:
- pandas: _a fast, powerful, flexible and easy to use open source data analysis and manipulation tool_ - the library DataDelta is built on for comparing dataframes
- numpy: _The fundamental package for scientific computing with Python_ - used for transformations and calculations
- jinja2: _a fast, expressive, extensible templating engine_ - used to generate the HTML report
- pytest (optional): _a mature full-featured Python testing tool that helps you write better programs_ - used for testing
### Installation
- Install using pip through PyPI:
```sh
pip install datadelta
```
OR
- Clone the repo locally:
```sh
git clone https://github.com/gibbsbravo/DataDelta.git
```
## Usage Examples
- Quick starter code to get a summary report of the changes between two dataframes:
```python
import pandas as pd
import datadelta as delta

old_df = pd.read_csv('MainTestData_old_df.csv')  # Add your old dataframe here
new_df = pd.read_csv('MainTestData_new_df.csv')  # Add your new dataframe here
primary_key = 'A'  # Set the primary key
column_subset = None  # Specify the subset of columns of interest or leave None to compare all columns

# The consolidated_report dictionary will contain the summary changes
consolidated_report, record_changes_comparison_df = delta.create_consolidated_report(
    old_df, new_df, primary_key, column_subset)

# This will create a report named datadelta_html_report.html in the current working directory containing the summary changes
delta.export_html_report(consolidated_report, record_changes_comparison_df,
                         export_file_name='datadelta_html_report.html',
                         overwrite_existing_file=False)
```
- Get dataframe summary:
```python
import pandas as pd
import datadelta as delta

new_df = pd.read_csv('MainTestData_new_df.csv')  # Add your new dataframe here
primary_key = 'A'  # Set the primary key
column_subset = None  # Specify the subset of columns of interest or leave None to include all columns

# Returns a report summarizing the key attributes and values of a dataframe
summary_report = delta.get_df_summary(
    input_df=new_df, primary_key=primary_key, column_subset=column_subset, max_cols=15)
```
- Get record count changes report:
```python
import pandas as pd
import datadelta as delta

old_df = pd.read_csv('MainTestData_old_df.csv')  # Add your old dataframe here
new_df = pd.read_csv('MainTestData_new_df.csv')  # Add your new dataframe here
primary_key = 'A'  # Set the primary key

# Returns a report summarizing any changes to the number of records (and composition) between two dataframes
record_count_change_report = delta.check_record_count(
    old_df, new_df, primary_key)
```
Other functions include:
- check_column_names: Returns a report summarizing any changes to column names between two dataframes
- check_datatypes: Returns a report summarizing any columns with different datatypes
- check_chg_in_values: Returns a report summarizing any records with changes in values
- get_records_in_both_tables: Returns the records found in both dataframes
- get_record_changes_comparison_df: Returns a dataframe comparing any records with changes in values by column
- export_html_report: Exports an html report of the differences between two dataframes
## Contributing
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are **greatly appreciated**.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement".
Don't forget to give the project a star! Thanks again!
1. Fork the Project
2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the Branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
## Example HTML Report Output
![Report Screenshot][report-screenshot]
## License
Distributed under the GNU General Public License v3 (GPLv3). See `LICENSE.txt` for more information.
## Contact
Andrew Gibbs-Bravo - [email protected]
Project Link: [https://github.com/gibbsbravo/DataDelta](https://github.com/gibbsbravo/DataDelta)
[report-screenshot]: https://github.com/gibbsbravo/DataDelta/blob/master/images/DatasetComparisonReport.png?raw=true