{"id":31585262,"url":"https://github.com/absaoss/dataset-comparison","last_synced_at":"2025-10-06T01:47:00.547Z","repository":{"id":303775418,"uuid":"987025868","full_name":"AbsaOSS/dataset-comparison","owner":"AbsaOSS","description":"A tool for comparing two datasets and finding their differences","archived":false,"fork":false,"pushed_at":"2025-07-09T10:35:36.000Z","size":35566,"stargazers_count":1,"open_issues_count":7,"forks_count":0,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-07-09T11:25:36.199Z","etag":null,"topics":["comparison-tool"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AbsaOSS.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-20T13:16:33.000Z","updated_at":"2025-07-09T10:35:41.000Z","dependencies_parsed_at":"2025-07-09T11:35:47.118Z","dependency_job_id":null,"html_url":"https://github.com/AbsaOSS/dataset-comparison","commit_stats":null,"previous_names":["absaoss/dataset-comparison"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/AbsaOSS/dataset-comparison","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fdataset-comparison","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fdataset-comparison/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fdataset-comparison/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposi
tories/AbsaOSS%2Fdataset-comparison/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AbsaOSS","download_url":"https://codeload.github.com/AbsaOSS/dataset-comparison/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fdataset-comparison/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278547867,"owners_count":26004773,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-05T02:00:06.059Z","response_time":54,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["comparison-tool"],"created_at":"2025-10-06T01:46:39.818Z","updated_at":"2025-10-06T01:47:00.536Z","avatar_url":"https://github.com/AbsaOSS.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"# CPS-Dataset-Comparison\n\nTool for exact comparison of two Parquet files.\n\n\u003c!-- toc --\u003e\n- [What is CPS-Dataset-Comparison?](#what-is-CPS-Dataset-Comparison)\n    - [Abstract example](#abstract-example)\n    - [Removing noise](#removing-noise)\n    - [Removing same records](#removing-same-records)\n    - [Detailed Analyses](#detailed-analyses)\n- [Project structure](#project-structure)\n    - [bigfiles](#bigfiles)\n    - [smallfiles](#smallfiles)\n\u003c!-- tocstop --\u003e\n\n## What is CPS Dataset Comparison?\n\nThere was a need for a comparison tool that could help when 
we want to migrate from the legacy system to the new one, moving from the Crunch implementation to Spark. The comparison tool should compare the outputs of the new and legacy systems to check that the changes did not affect the behavior and results.\n\nIn this particular solution, we will consider Parquet files as input. The tool will first find rows that are present in only one table. Then it will focus on detailed analyses of differences between samples. You can see the flow in the following chart:\n\n![alt text](images/mainFlow.png)\n\n### Abstract example\n\nLet's say we have two Parquet files with the following content:\n![img.png](images/tables.png)\nFirst, we remove the first column because it is always different/autogenerated ...\n![img_1.png](images/remove_id.png)\n\nWe can see that the 1st and 3rd rows of the first file are exactly the same as the 2nd and 3rd rows of the second file, so we remove them.\n![img_2.png](images/find_match.png)\n\nThen we can find the differences between the remaining rows.\n![img_3.png](images/find_diff.png)\n\n### Removing noise\n\nNoise removal will not be implemented in the first version. It was decided that this could be implemented afterward if noise columns turn out to be a problem. 
However, some noise columns are already known: timestamps and run ID.\nThe approach for finding nondeterministic columns (noise columns) will be to find which columns differ between two Crunch runs (every run is constructed from 2 Crunch runs and one Spark run).\n\n\u003e First, we should compare the schemas of both Parquet files\n\n### Removing same records\n\nWe have decided not to bother with duplicates, so we will remove common rows as described in the following flow chart:\n![alt text](images/removeRecords.png)\n\nFor *hash* we can use: [FNV](https://en.wikipedia.org/wiki/Fowler–Noll–Vo_hash_function), \n[CRC-64-ISO](https://en.wikipedia.org/wiki/Cyclic_redundancy_check), \n[data-hash-tool](https://github.com/AbsaOSS/data-hash-tool) (PoC)\n\n### Detailed analyses\n\nWe have decided to use row-by-row comparison for detailed analyses. We can use more advanced heuristics in the future if this approach does not suit us. You can see the approach in the following chart.\n![alt text](images/analyses.png)\n\n\n## Project structure\n\nThe project is divided into two modules:\n\n### bigfiles\n\n- [How to Run](bigfiles/README.md#how-to-run)\n- a bigfile is a file that does not fit in RAM\n- module for comparing big files\n- written in Scala\n- more about the bigfiles module can be found in [bigfiles README](bigfiles/README.md)\n\n\n### smallfiles\n\n- [How to Run](smallfiles/README.md#how-to-run)\n- a smallfile is a file that fits in RAM\n- module for comparing small files\n- written in Python\n- more about the smallfiles module can be found in [smallfiles README](smallfiles/README.md)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabsaoss%2Fdataset-comparison","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fabsaoss%2Fdataset-comparison","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabsaoss%2Fdataset-comparison/lists"}