{"id":43407352,"url":"https://github.com/dima-ischenko/xoverrr","last_synced_at":"2026-02-02T16:13:28.063Z","repository":{"id":326227929,"uuid":"1078991625","full_name":"dima-ischenko/xoverrr","owner":"dima-ischenko","description":"Data quality library on python","archived":false,"fork":false,"pushed_at":"2026-01-30T17:02:37.000Z","size":189,"stargazers_count":2,"open_issues_count":21,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-01-31T04:18:15.223Z","etag":null,"topics":["clickhouse","comparison","dataquality","greenplum","oracle","postgresql","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dima-ischenko.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-10-18T21:15:35.000Z","updated_at":"2026-01-22T18:51:44.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/dima-ischenko/xoverrr","commit_stats":null,"previous_names":["dima-ischenko/xoverrr"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/dima-ischenko/xoverrr","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dima-ischenko%2Fxoverrr","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dima-ischenko%2Fxoverrr/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dima-ischenko%2Fxoverrr/releases","manifests_url":"https://repos.ecosyst
e.ms/api/v1/hosts/GitHub/repositories/dima-ischenko%2Fxoverrr/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dima-ischenko","download_url":"https://codeload.github.com/dima-ischenko/xoverrr/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dima-ischenko%2Fxoverrr/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29015145,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-02T14:58:54.169Z","status":"ssl_error","status_checked_at":"2026-02-02T14:58:51.285Z","response_time":58,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clickhouse","comparison","dataquality","greenplum","oracle","postgresql","python"],"created_at":"2026-02-02T16:13:27.365Z","updated_at":"2026-02-02T16:13:28.054Z","avatar_url":"https://github.com/dima-ischenko.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# xoverrr (pronounced “crossover”)\n\nA tool for cross-database and intra-source data comparison with detailed discrepancy analysis and reporting.\n\n## Usage Example\n**Sample comparison** (Greenplum vs Oracle):\n\n```python\nfrom xoverrr import DataQualityComparator, DataReference, COMPARISON_SUCCESS, COMPARISON_FAILED, COMPARISON_SKIPPED\nimport os\nfrom datetime import date, 
timedelta\nfrom sqlalchemy import create_engine\n\nUSER_ORA = os.getenv('USER_ORA', '')\nPASSWORD_ORA = os.getenv('PASSWORD_ORA', '')\n\nUSER_GP = os.getenv('USER_GP', '')\nPASSWORD_GP = os.getenv('PASSWORD_GP', '')\n\nHOST_ORA = os.getenv('HOST_ORA', '')\nHOST_GP = os.getenv('HOST_GP', '')\n\ndef create_src_engine(user, password, host):\n    \"\"\"Source engine (Oracle)\"\"\"\n    os.environ['NLS_LANG'] = '.AL32UTF8'\n    return create_engine(f'oracle+oracledb://{user}:{password}@{host}:1521/?service_name=dwh')\n\ndef create_trg_engine(user, password, host):\n    \"\"\"Target engine (Postgres/Greenplum)\"\"\"\n    connection_string = f'postgresql+psycopg2://{user}:{password}@{host}:5432/adb'\n    engine = create_engine(connection_string)\n    return engine\n\n\nsrc_engine = create_src_engine(USER_ORA, PASSWORD_ORA, HOST_ORA)\ntrg_engine = create_trg_engine(USER_GP, PASSWORD_GP, HOST_GP)\n\ncomparator = DataQualityComparator(\n    source_engine=src_engine,\n    target_engine=trg_engine,\n    timezone='Europe/Athens'\n)\n\nsource = DataReference(\"users\", \"schema1\")\ntarget = DataReference(\"users\", \"schema2\")\n\nFORMAT = '%Y-%m-%d'\nrecent_range_end = date.today()\nrecent_range_begin = recent_range_end - timedelta(days=1)\n\nstatus, report, stats, details = comparator.compare_sample(\n    source,\n    target,\n    date_column=\"created_at\",\n    update_column=\"modified_date\",\n    exclude_columns=[\"audit_timestamp\", \"internal_id\"],\n    exclude_recent_hours=3,\n    date_range=(\n        recent_range_begin.strftime(FORMAT),\n        recent_range_end.strftime(FORMAT)\n    ),\n    tolerance_percentage=0\n)\n\nprint(report)\nif status == COMPARISON_FAILED:\n    raise Exception(\"Sample check failed\")\n```\n\n## Key Features\n- **Multi‑DBMS support**: Oracle, PostgreSQL (+ Greenplum), ClickHouse (extensible via adapter layer), for both tables and views.\n- **Universal connections**: Provide SQLAlchemy Engine objects for source and target databases.\n- **Comparison strategies**:\n  * Data sample 
comparison\n  * Count‑based comparison with daily aggregates\n  * Fully custom (raw) SQL‑query comparison\n- **Smart analysis**:\n  * Excludes “fresh” data to mitigate replication lag\n  * Auto‑detection of primary keys and column types from DBMS metadata (PK must be found on at least one side, or may be supplied manually)\n  * Application‑side type conversion\n  * Automatic exclusion of columns with mismatched names\n- **Optimization**: Two samples of 1 million rows × 10 columns (each ~330 MB) compared in ~3 s (Intel Core i5 / 16 GB RAM)\n- **Detailed reporting**: In‑depth column‑level discrepancy analysis with example records (column view / record view)\n- **Flexible configuration**: Column exclusion/inclusion, tolerance thresholds, custom primary‑key specification\n- **Unit tests**: Coverage for comparison methods, functional and performance validation\n- **Integration tests**: Run against real databases started via Docker\n\n## Example Report\n```\n================================================================================\n2025-11-24 20:09:40\nDATA SAMPLE COMPARISON REPORT:\npublic.account\nVS\nstage.account\n================================================================================\ntimezone: Europe/Athens\n\n        SELECT created_at, updated_at, id, code, bank_code, account_type, counterparty_id, special_code, case when updated_at \u003e (now() - INTERVAL '%(exclude_recent_hours)s hours') then 'y' end as xrecently_changed\n        FROM public.account\n        WHERE 1=1\n            AND created_at \u003e= date_trunc('day', %(start_date)s::date)\n            AND created_at \u003c date_trunc('day', %(end_date)s::date)  + interval '1 days'\n\n    params: {'exclude_recent_hours': 1, 'start_date': '2025-11-17', 'end_date': '2025-11-24'}\n----------------------------------------\n\n        SELECT created_at, updated_at, id, code, bank_code, account_type, counterparty_id, special_code, case when updated_at \u003e 
(sysdate - :exclude_recent_hours/24) then 'y' end as xrecently_changed\n        FROM stage.account\n        WHERE 1=1\n            AND created_at \u003e= trunc(to_date(:start_date, 'YYYY-MM-DD'), 'dd')\n            AND created_at \u003c trunc(to_date(:end_date, 'YYYY-MM-DD'), 'dd') + 1\n\n    params: {'exclude_recent_hours': 1, 'start_date': '2025-11-17', 'end_date': '2025-11-24'}\n----------------------------------------\n\nSUMMARY:\n  Source rows: 10966\n  Target rows: 10966\n  Duplicated source rows: 0\n  Duplicated target rows: 0\n  Only source rows: 0\n  Only target rows: 0\n  Common rows (by primary key): 10966\n  Totally matched rows: 10965\n----------------------------------------\n  Source only rows %: 0.00000\n  Target only rows %: 0.00000\n  Duplicated source rows %: 0.00000\n  Duplicated target rows %: 0.00000\n  Mismatched rows %: 0.00912\n  Final discrepancies score: 0.00456\n  Final data quality score: 99.99544\n  Source-only key examples: None\n  Target-only key examples: None\n  Duplicated source key examples: None\n  Duplicated target key examples: None\n  Common attribute columns: created_at, updated_at, code, bank_code, account_type, counterparty_id, special_code\n  Skipped source columns:\n  Skipped target columns: mt_change_date\n\nCOLUMN DIFFERENCES:\n  Discrepancies per column (max %): 0.00912\n  Count of mismatches per column:\n\n column_name  mismatch_count\nspecial_code               1\n  Some examples:\n\nprimary_key                          column_name  source_value target_value\nf8153447-****-****-****-****** special_code       N/A          XYZ\n\nDISCREPANT DATA (first pairs):\nSorted by primary key and dataset:\n\n\ncreated_at          updated_at          id                                   code                 bank_code account_type counterparty_id                      special_code xflg\n2025-11-24 18:58:27 2025-11-24 18:58:27 f8153447-****-****-****-****** 42****************87 0********* 11           
62aa01a6-****-****-****-f17e2b*****4 N/A       src\n2025-11-24 18:58:27 2025-11-24 18:58:27 f8153447-****-****-****-****** 42****************87 0********* 11           62aa01a6-****-****-****-f17e2b*****4 XYZ       trg\n\n================================================================================\n```\n\n## Metric Calculation\n### For `compare_sample` / `compare_custom_query`\n```\nfinal_diff_score =\n (source_dup% × 0.1)\n + (target_dup% × 0.1)\n + (source_only_rows% × 0.15)\n + (target_only_rows% × 0.15)\n + (rows_mismatched_by_any_column% × 0.5)\n```\n\n### For `compare_counts`\n```\nsum_of_absolute_differences = sum of abs(source_count - target_count) over all days\nsum_of_common_counts = sum of min(source_count, target_count) over all days\nfinal_diff_score = 100 × sum_of_absolute_differences / (sum_of_absolute_differences + sum_of_common_counts)\n```\n\n#### Quality score formula (all methods): `100 − final_diff_score`\n#### Scores range from 0 to 100%; higher values indicate better data quality.\n\n## Comparison Methods\n\n### 1. 
Data Sample Comparison (`compare_sample`)\nSuitable for comparing row sets and column values over a date range.\n\n```python\nstatus, report, stats, details = comparator.compare_sample(\n    source_table=DataReference(\"table_name\", \"schema_name\"),\n    target_table=DataReference(\"table_name\", \"schema_name\"),\n    date_column=\"created_at\",\n    update_column=\"modified_date\",\n    date_range=(\"2024-01-01\", \"2024-01-31\"),\n    exclude_columns=[\"audit_timestamp\", \"internal_id\"],\n    include_columns=None,\n    custom_primary_key=[\"id\", \"user_id\"],\n    tolerance_percentage=1.0,\n    exclude_recent_hours=24,\n    max_examples=3\n)\n```\n\n**Parameters:**\n- `source_table`, `target_table` – `DataReference` objects for the tables or views to compare\n- `date_column` – column used for date‑range filtering\n- `update_column` – column used to identify “fresh” (recently modified) rows, which are excluded from both sides\n- `date_range` – tuple `(start_date, end_date)` in `YYYY-MM-DD` format\n- `exclude_columns` – list of columns to omit from comparison (blacklist)\n- `include_columns` – list of columns to include (whitelist)\n- `custom_primary_key` – user‑specified primary key (auto‑detected if not provided)\n- `tolerance_percentage` – acceptable discrepancy threshold (0.0–100.0)\n- `exclude_recent_hours` – exclude data modified within the last N hours\n- `max_examples` – maximum number of discrepancy examples included in the report\n\n### 2. 
Count‑Based Comparison (`compare_counts`)\nEfficient for large‑volume comparisons over extended date ranges, identifying missing rows or duplicates.\n\n```python\nstatus, report, stats, details = comparator.compare_counts(\n    source_table=DataReference(\"users\", \"schema1\"),\n    target_table=DataReference(\"users\", \"schema2\"),\n    date_column=\"created_at\",\n    date_range=(\"2024-01-01\", \"2024-01-31\"),\n    tolerance_percentage=2.0,\n    max_examples=5\n)\n```\n\n**Parameters:**\n- `source_table`, `target_table` – references to the tables/views to compare\n- `date_column` – column for daily grouping\n- `date_range` – date interval for analysis\n- `tolerance_percentage` – acceptable discrepancy threshold\n- `max_examples` – maximum number of daily discrepancy examples included in the report\n\n### 3. Custom‑Query Comparison (`compare_custom_query`)\nCompares data from arbitrary SQL queries. Suitable for complex scenarios.\n\n```python\nstatus, report, stats, details = comparator.compare_custom_query(\n    source_query=\"\"\"SELECT id as user_id, name as user_name, created_at as created_date FROM scott.source_table WHERE status = %(status)s\"\"\",\n    source_params={'status': 'active'},\n    target_query=\"\"\"SELECT user_id, user_name, created_date FROM scott.target_table WHERE status = :status\"\"\",\n    target_params={'status': 'active'},\n    custom_primary_key=[\"user_id\"],\n    exclude_columns=[\"internal_code\"],\n    tolerance_percentage=0.5,\n    max_examples=3\n)\n```\n\n**Parameters:**\n- `source_query`, `target_query` – parameterised SQL queries for the source and target\n- `source_params`, `target_params` – query parameters\n- `custom_primary_key` – mandatory list of column names, as they appear in the query results, constituting the primary key\n- `exclude_columns` – columns to omit from comparison\n- `tolerance_percentage` – acceptable discrepancy threshold\n- `max_examples` – maximum number of discrepancy examples included in the report\n- To automatically exclude recently 
changed records, add the following expression to your SELECT clause in `compare_custom_query`:\n  ```sql\n  case when updated_at \u003e (sysdate - 3/24) then 'y' end as xrecently_changed\n  ```\n\n**Automatic Primary‑Key Detection:**\n- If `custom_primary_key` is not supplied, the system automatically infers the PK from metadata.\n- When source and target PKs differ, the source PK is used with a warning.\n\n**Performance Considerations:**\n- DataFrame size validation (hard limit: 3 GB per sample)\n- Efficient comparison via XOR properties\n- Configurable limits via constants\n\n**Return Values:**\nAll methods return a tuple:\n- `status` – comparison status (`COMPARISON_SUCCESS` / `COMPARISON_FAILED` / `COMPARISON_SKIPPED`)\n- `report` – textual report detailing discrepancies\n- `stats` – `ComparisonStats` dataclass instance containing comparison statistics\n- `details` – `ComparisonDiffDetails` dataclass instance with discrepancy examples and details\n\n### Status Types\n- **COMPARISON_SUCCESS**: Comparison completed within tolerance limits.\n- **COMPARISON_FAILED**: Discrepancies exceed tolerance threshold, or a technical error occurred.\n- **COMPARISON_SKIPPED**: No data available for comparison (both tables empty).\n\n### Structured Logging\nLogs include timing information and structured context:\n```\n2024-01-15 10:30:45 - INFO - xoverrr.core._compare_samples - Query executed in 2.34s\n2024-01-15 10:30:46 - INFO - xoverrr.core._compare_samples - Source: 150000 rows, Target: 149950 rows\n2024-01-15 10:30:47 - INFO - xoverrr.utils.compare_dataframes - Comparison completed in 1.2s\n```\n\n### Tolerance Percentage\n- **tolerance_percentage**: Acceptable discrepancy threshold (0.0–100.0).\n- If `final_diff_score \u003e tolerance`: status = `COMPARISON_FAILED`\n- If `final_diff_score ≤ tolerance`: status = `COMPARISON_SUCCESS`\n- Enables configuration of acceptable discrepancy 
levels.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdima-ischenko%2Fxoverrr","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdima-ischenko%2Fxoverrr","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdima-ischenko%2Fxoverrr/lists"}