{"id":13806525,"url":"https://github.com/capitalone/datacompy","last_synced_at":"2025-05-14T04:03:07.771Z","repository":{"id":39635528,"uuid":"126487536","full_name":"capitalone/datacompy","owner":"capitalone","description":"Pandas, Polars, Spark, and Snowpark DataFrame comparison for humans and more!","archived":false,"fork":false,"pushed_at":"2025-05-10T03:04:14.000Z","size":11816,"stargazers_count":565,"open_issues_count":9,"forks_count":141,"subscribers_count":21,"default_branch":"develop","last_synced_at":"2025-05-10T04:18:44.616Z","etag":null,"topics":["compare","dask","data","data-science","dataframes","fugue","numpy","pandas","polars","pyspark","python","snowflake","snowpark","spark"],"latest_commit_sha":null,"homepage":"https://capitalone.github.io/datacompy/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/capitalone.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":"CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":"ROADMAP.rst","authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-03-23T13:16:03.000Z","updated_at":"2025-05-10T03:04:18.000Z","dependencies_parsed_at":"2023-09-25T22:52:06.382Z","dependency_job_id":"ef22b3d6-1d72-45a2-86a8-8870371033d4","html_url":"https://github.com/capitalone/datacompy","commit_stats":{"total_commits":172,"total_committers":32,"mean_commits":5.375,"dds":0.8430232558139534,"last_synced_commit":"60c52d2d5d8b1f93954cb979228978c895d2774a"},"previous_names":[],"tags_count":31,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/capitalone%2Fdatacompy","tags_url":"https://repos.ecosyste.ms/api/v1/ho
sts/GitHub/repositories/capitalone%2Fdatacompy/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/capitalone%2Fdatacompy/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/capitalone%2Fdatacompy/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/capitalone","download_url":"https://codeload.github.com/capitalone/datacompy/tar.gz/refs/heads/develop","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254067088,"owners_count":22009075,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["compare","dask","data","data-science","dataframes","fugue","numpy","pandas","polars","pyspark","python","snowflake","snowpark","spark"],"created_at":"2024-08-04T01:01:12.878Z","updated_at":"2025-05-14T04:03:07.713Z","avatar_url":"https://github.com/capitalone.png","language":"Python","readme":"# DataComPy\n\n![PyPI - Python Version](https://img.shields.io/pypi/pyversions/datacompy)\n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/ambv/black)\n[![PyPI version](https://badge.fury.io/py/datacompy.svg)](https://badge.fury.io/py/datacompy)\n[![Anaconda-Server Badge](https://anaconda.org/conda-forge/datacompy/badges/version.svg)](https://anaconda.org/conda-forge/datacompy)\n![PyPI - Downloads](https://img.shields.io/pypi/dm/datacompy)\n\n\nDataComPy is a package to compare two DataFrames (or tables) such as Pandas, Spark, Polars, and\neven Snowflake. 
Originally it was created to be something of a replacement\nfor SAS's ``PROC COMPARE`` for Pandas DataFrames with some more functionality than\njust ``Pandas.DataFrame.equals(Pandas.DataFrame)`` (in that it prints out some stats,\nand lets you tweak how accurate matches have to be). Supported types include:\n\n- Pandas\n- Polars\n- Spark\n- Snowflake (via snowpark)\n- Dask (via Fugue)\n- DuckDB (via Fugue)\n\n\n## Quick Installation\n\n```shell\npip install datacompy\n```\n\nor\n\n```shell\nconda install datacompy\n```\n\n### Installing extras\n\nIf you would like to use Spark or any other backends, please make sure you install via extras:\n\n```shell\npip install datacompy[spark]\npip install datacompy[fugue]\npip install datacompy[snowflake]\n\n```\n\n### Legacy Spark Deprecation\n\nWith version ``v0.12.0`` the original ``SparkCompare`` was replaced with a\nPandas on Spark implementation. The original ``SparkCompare`` implementation differs\nfrom all the other native implementations. To align the API and keep behaviour\nconsistent, we are deprecating the original ``SparkCompare`` into a new module, ``LegacySparkCompare``.\n\nSubsequently, in ``v0.13.0``, a PySpark DataFrame class was introduced (``SparkSQLCompare``),\nwhich accepts ``pyspark.sql.DataFrame`` and should provide better performance. With this version\nthe Pandas on Spark implementation has been renamed to ``SparkPandasCompare`` and all the Spark\nlogic is now under the ``spark`` submodule.\n\nIf you wish to use the old ``SparkCompare`` moving forward, you can import it like so:\n\n```python\nfrom datacompy.spark.legacy import LegacySparkCompare\n```\n\n### SparkPandasCompare Deprecation\n\nStarting with ``v0.14.1``, ``SparkPandasCompare`` is slated for deprecation. 
``SparkSQLCompare`` is the preferred option and is much more performant.\nNote that if you continue to use ``SparkPandasCompare``, ``numpy`` 2+ is not supported due to dependency issues.\n\n\n#### Supported versions and dependencies\n\nDifferent versions of Spark, Pandas, and Python interact differently. Below is a matrix of what we test with.\nWith the move to the Pandas on Spark API and compatibility issues with Pandas 2+, we will, for the time being, not support Pandas 2\nwith the Pandas on Spark implementation. Spark plans to support Pandas 2 in [Spark 4](https://issues.apache.org/jira/browse/SPARK-44101).\n\n\n|             | Spark 3.2.4 | Spark 3.3.4 | Spark 3.4.2 | Spark 3.5.1 |\n|-------------|-------------|-------------|-------------|-------------|\n| Python 3.10 | ✅           | ✅           | ✅           | ✅           |\n| Python 3.11 | ❌           | ❌           | ✅           | ✅           |\n| Python 3.12 | ❌           | ❌           | ❌           | ❌           |\n\n\n|                        | Pandas \u003c 1.5.3 | Pandas \u003e=2.0.0 |\n|------------------------|----------------|----------------|\n| ``Compare``            | ✅              | ✅              |\n| ``SparkPandasCompare`` | ✅              | ❌              |\n| ``SparkSQLCompare``    | ✅              | ✅              |\n| Fugue                  | ✅              | ✅              |\n\n\n\n\u003e [!NOTE]\n\u003e At the current time, Python `3.12` is not supported by Spark, nor by Ray within Fugue.\n\u003e If you are using Python `3.12` and above, please note that not all functionality will be supported.\n\u003e Pandas and Polars support should work fine and are tested.\n\n## Supported backends\n\n- Pandas: ([See documentation](https://capitalone.github.io/datacompy/pandas_usage.html))\n- Spark: ([See documentation](https://capitalone.github.io/datacompy/spark_usage.html))\n- Polars: ([See documentation](https://capitalone.github.io/datacompy/polars_usage.html))\n- Snowflake/Snowpark: ([See 
documentation](https://capitalone.github.io/datacompy/snowflake_usage.html))\n- Fugue is a Python library that provides a unified interface for data processing on Pandas, DuckDB, Polars, Arrow,\n  Spark, Dask, Ray, and many other backends. DataComPy integrates with Fugue to provide a simple way to compare data\n  across these backends. Please note that Fugue will use the Pandas (Native) logic at its lowest level\n  ([See documentation](https://capitalone.github.io/datacompy/fugue_usage.html))\n\n## Contributors\n\nWe welcome and appreciate your contributions! Before we can accept any contributions, we ask that you please be sure to\nsign the [Contributor License Agreement (CLA)](https://cla-assistant.io/capitalone/datacompy).\n\nThis project adheres to the [Open Source Code of Conduct](https://developer.capitalone.com/resources/code-of-conduct/).\nBy participating, you are expected to honor this code.\n\n\n## Roadmap\n\nRoadmap details can be found [here](https://github.com/capitalone/datacompy/blob/develop/ROADMAP.rst)\n","funding_links":[],"categories":["Data Validation","The Data Science Toolbox","Python","Data Comparison"],"sub_categories":["Synthetic Data","Comparison","General-Purpose Machine Learning"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcapitalone%2Fdatacompy","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcapitalone%2Fdatacompy","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcapitalone%2Fdatacompy/lists"}