https://github.com/dima-ischenko/xoverrr

Data quality library on python
https://github.com/dima-ischenko/xoverrr
clickhouse comparison dataquality greenplum oracle postgresql python
Last synced: 4 months ago
JSON representation
Data quality library on python
Host: GitHub
URL: https://github.com/dima-ischenko/xoverrr
Owner: dima-ischenko
License: mit
Created: 2025-10-18T21:15:35.000Z (7 months ago)
Default Branch: main
Last Pushed: 2026-01-30T17:02:37.000Z (4 months ago)
Last Synced: 2026-01-31T04:18:15.223Z (4 months ago)
Topics: clickhouse, comparison, dataquality, greenplum, oracle, postgresql, python
Language: Python
Homepage:
Size: 185 KB
Stars: 2
Watchers: 1
Forks: 1
Open Issues: 21
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project

README

          # xoverrr (pronounced “crossover”)

A tool for cross-database and intra-source data comparison with detailed discrepancy analysis and reporting.

## Usage Example

**Sample comparison** (Greenplum vs Oracle):

```python

from xoverrr import DataQualityComparator, DataReference, COMPARISON_SUCCESS, COMPARISON_FAILED, COMPARISON_SKIPPED

import os

from datetime import date, timedelta

USER_ORA = os.getenv('USER_ORA', '')

PASSWORD_ORA = os.getenv('PASSWORD_ORA', '')

USER_GP = os.getenv('USER_GP', '')

PASSWORD_GP = os.getenv('PASSWORD_GP', '')

HOST_ORA = os.getenv('HOST_ORA', '')

HOST_GP = os.getenv('HOST_GP', '')

def create_src_engine(user, password, host):

    """Source engine (Oracle)"""

    os.environ['NLS_LANG'] = '.AL32UTF8'

    return create_engine(f'oracle+oracledb://{user}:{password}@{host}:1521/?service_name=dwh')

def create_trg_engine(user, password, host):

    """Target engine (Postgres/Greenplum)"""

    connection_string = f'postgresql+psycopg2://{user}:{password}@{host}:5432/adb'

    engine = create_engine(connection_string)

    return engine

src_engine = create_src_engine(USER_ORA, PASSWORD_ORA, HOST_ORA)

trg_engine = create_trg_engine(USER_GP, PASSWORD_GP, HOST_GP)

comparator = DataQualityComparator(

    source_engine=src_engine,

    target_engine=trg_engine,

    timezone='Europe/Athens'

)

source = DataReference("users", "schema1")

target = DataReference("users", "schema2")

FORMAT = '%Y-%m-%d'

recent_range_end = date.today()

recent_range_begin = recent_range_end - timedelta(days=1)

status, report, stats, details = comparator.compare_sample(

    source,

    target,

    date_column="created_at",

    update_column="modified_date",

    exclude_columns=["audit_timestamp", "internal_id"],

    exclude_recent_hours=3,

    date_range=(

        recent_range_begin.strftime(FORMAT),

        recent_range_end.strftime(FORMAT)

    ),

    tolerance_percentage=0

)

print(report)

if status == COMPARISON_FAILED:

    raise Exception("Sample check failed")

```

## Key Features

- **Multi‑DBMS support**: Oracle, PostgreSQL (+ Greenplum), ClickHouse (extensible via adapter layer) — tables and views.

- **Universal connections**: Provide SQLAlchemy Engine objects for source and target databases.

- **Comparison strategies**:

  * Data sample comparison

  * Count‑based comparison with daily aggregates

  * Fully custom (raw) SQL‑query comparison

- **Smart analysis**:

  * Excludes “fresh” data to mitigate replication lag

  * Auto‑detection of primary keys and column types from DBMS metadata (PK must be found on at least one side, or may be supplied manually)

  * Application‑side type conversion

  * Automatic exclusion of columns with mismatched names

- **Optimization**: Two samples of 1 million rows × 10 columns (each ~330 MB) compared in ~3 s (Intel Core i5 / 16 GB RAM)

- **Detailed reporting**: In‑depth column‑level discrepancy analysis with example records (column view / record view)

- **Flexible configuration**: Column exclusion/inclusion, tolerance thresholds, custom primary‑key specification

- **Unit tests**: Coverage for comparison methods, functional and performance validation

- **Integrations tests**: contains integration tests for xoverrr using real databases started via Docker

## Example Report

```

================================================================================

2025-11-24 20:09:40

DATA SAMPLE COMPARISON REPORT:

public.account

VS

stage.account

================================================================================

timezone: Europe/Athens

        SELECT created_at, updated_at, id, code, bank_code, account_type, counterparty_id, special_code, case when updated_at > (now() - INTERVAL '%(exclude_recent_hours)s hours') then 'y' end as xrecently_changed

        FROM public.account

        WHERE 1=1

            AND created_at >= date_trunc('day', %(start_date)s::date)

            AND created_at < date_trunc('day', %(end_date)s::date)  + interval '1 days'

    params: {'exclude_recent_hours': 1, 'start_date': '2025-11-17', 'end_date': '2025-11-24'}

----------------------------------------

        SELECT created_at, updated_at, id, code, bank_code, account_type, counterparty_id, special_code, case when updated_at > (sysdate - :exclude_recent_hours/24) then 'y' end as xrecently_changed

        FROM stage.account

        WHERE 1=1

            AND created_at >= trunc(to_date(:start_date, 'YYYY-MM-DD'), 'dd')

            AND created_at < trunc(to_date(:end_date, 'YYYY-MM-DD'), 'dd') + 1

    params: {'exclude_recent_hours': 1, 'start_date': '2025-11-17', 'end_date': '2025-11-24'}

----------------------------------------

SUMMARY:

  Source rows: 10966

  Target rows: 10966

  Duplicated source rows: 0

  Duplicated target rows: 0

  Only source rows: 0

  Only target rows: 0

  Common rows (by primary key): 10966

  Totally matched rows: 10965

----------------------------------------

  Source only rows %: 0.00000

  Target only rows %: 0.00000

  Duplicated source rows %: 0.00000

  Duplicated target rows %: 0.00000

  Mismatched rows %: 0.00912

  Final discrepancies score: 0.00456

  Final data quality score: 99.99544

  Source-only key examples: None

  Target-only key examples: None

  Duplicated source key examples: None

  Duplicated target key examples: None

  Common attribute columns: created_at, updated_at, code, bank_code, account_type, counterparty_id, special_code

  Skipped source columns:

  Skipped target columns: mt_change_date

COLUMN DIFFERENCES:

  Discrepancies per column (max %): 0.00912

  Count of mismatches per column:

 column_name  mismatch_count

special_code               1

  Some examples:

primary_key                          column_name  source_value target_value

f8153447-****-****-****-****** special_code       N/A          XYZ

DISCREPANT DATA (first pairs):

Sorted by primary key and dataset:

created_at          updated_at          id                                   code                 bank_code account_type counterparty_id                      special_code xflg

2025-11-24 18:58:27 2025-11-24 18:58:27 f8153447-****-****-****-****** 42****************87 0********* 11           62aa01a6-****-****-****-f17e2b*****4

N/A       src

2025-11-24 18:58:27 2025-11-24 18:58:27 f8153447-****-****-****-****** 42****************87 0********* 11           62aa01a6-****-****-****-f17e2b*****4 XYZ       trg

================================================================================

```

## Metric Calculation

### for compare_sample/compare_custom_query

```

final_diff_score =

 (source_dup% × 0.1)

 + (target_dup% × 0.1)

 + (source_only_rows% × 0.15)

 + (target_only_rows% × 0.15)

 + (rows_mismatched_by_any_column% × 0.5)

```

### for compare_counts

```

sum_of_absolute_differences = `abs(source_count - target_count)` per each day

sum_of_common_counts = `min(source_count, target_count)` per each day

final_diff_score = 100 × (sum_of_absolute_differences) / (sum_of_absolute_differences + sum_of_common_counts)

```

#### Quality score formula all methods: `100 − final_diff_score`

#### Scores range 0–100%; higher values indicate better data quality.

## Comparison Methods

### 1. Data Sample Comparison (`compare_sample`)

Suitable for comparing row sets and column values over a date range.

```python

status, report, stats, details = comparator.compare_sample(

    source_table=DataReference("table_name", "schema_name"),

    target_table=DataReference("table_name", "schema_name"),

    date_column="created_at",

    update_column="modified_date",

    date_range=("2024-01-01", "2024-01-31"),

    exclude_columns=["audit_timestamp", "internal_id"],

    include_columns=None,

    custom_primary_key=["id", "user_id"],

    tolerance_percentage=1.0,

    exclude_recent_hours=24,

    max_examples=3

)

```

**Parameters:**

- `source_table`, `target_table` – names of the tables or views to compare

- `date_column` – column used for date‑range filtering

- `update_column` – column identifying “fresh” data (excluded from both sides)

- `date_range` – tuple `(start_date, end_date)` in “YYYY‑MM‑DD” format

- `exclude_columns` – list of columns to omit from comparison, aka blacklist

- `include_columns` – list of columns to include, aka whitelist

- `custom_primary_key` – user‑specified primary key (if not provided, auto‑detected)

- `tolerance_percentage` – acceptable discrepancy threshold (0.0–100.0)

- `exclude_recent_hours` – exclude data modified within the last N hours

- `max_examples` – maximum number of discrepancy examples included in the report

### 2. Count‑Based Comparison (`compare_counts`)

Efficient for large‑volume comparisons over extended date ranges, identifying missing rows or duplicates.

```python

status, report, stats, details = comparator.compare_counts(

    source_table=DataReference("users", "schema1"),

    target_table=DataReference("users", "schema2"),

    date_column="created_at",

    date_range=("2024-01-01", "2024-01-31"),

    tolerance_percentage=2.0,

    max_examples=5

)

```

**Parameters:**

- `source_table`, `target_table` – references to the tables/views to compare

- `date_column` – column for daily grouping

- `date_range` – date interval for analysis

- `tolerance_percentage` – acceptable discrepancy threshold

- `max_examples` – maximum number of daily discrepancy examples included in the report

### 3. Custom‑Query Comparison (`compare_custom_query`)

Compares data from arbitrary SQL queries. Suitable for complex scenarios.

```python

status, report, stats, details = comparator.compare_custom_query(

    source_query="""SELECT id as user_id, name as user_name, created_at as created_date FROM scott.source_table WHERE status = %(status)s""",

    source_params={'status': 'active'},

    target_query="""SELECT user_id, user_name, created_date FROM scott.target_table WHERE status = :status""",

    target_params={'status': 'active'},

    custom_primary_key=["id"],

    exclude_columns=["internal_code"],

    tolerance_percentage=0.5,

    max_examples=3

)

```

**Parameters:**

- `source_query`, `target_query` – parameterised SQL queries for the source and target

- `source_params`, `target_params` – query parameters

- `custom_primary_key` – mandatory list of column names constituting the primary key

- `exclude_columns` – columns to omit from comparison

- `tolerance_percentage` – acceptable discrepancy threshold

- `max_examples` – maximum number of discrepancy examples included in the report

- To automatically exclude recently changed records, add the following expression to your SELECT clause in `compare_custom_query`:

  ```sql

  case when updated_at > (sysdate - 3/24) then 'y' end as xrecently_changed

  ```

**Automatic Primary‑Key Detection:**

- If `custom_primary_key` is not supplied, the system automatically infers the PK from metadata.

- When source and target PKs differ, the source PK is used with a warning.

**Performance Considerations:**

- DataFrame size validation (hard limit: 3 GB per sample)

- Efficient comparison via XOR properties

- Configurable limits via constants

**Return Values:**

All methods return a tuple:

- `status` – comparison status (`COMPARISON_SUCCESS` / `COMPARISON_FAILED` / `COMPARISON_SKIPPED`)

- `report` – textual report detailing discrepancies

- `stats` – `ComparisonStats` dataclass instance containing comparison statistics

- `details` – `ComparisonDiffDetails` dataclass instance with discrepancy examples and details

### Status Types

- **COMPARISON_SUCCESS**: Comparison completed within tolerance limits.

- **COMPARISON_FAILED**: Discrepancies exceed tolerance threshold, or a technical error occurred.

- **COMPARISON_SKIPPED**: No data available for comparison (both tables empty).

### Structured Logging

Logs include timing information and structured context:

```

2024-01-15 10:30:45 - INFO - xoverrr.core._compare_samples - Query executed in 2.34s

2024-01-15 10:30:46 - INFO - xoverrr.core._compare_samples - Source: 150000 rows, Target: 149950 rows

2024-01-15 10:30:47 - INFO - xoverrr.utils.compare_dataframes - Comparison completed in 1.2s

```

### Tolerance Percentage

- **tolerance_percentage**: Acceptable discrepancy threshold (0.0–100.0).

- If `final_diff_score > tolerance`: status = `COMPARISON_FAILED`

- If `final_diff_score ≤ tolerance`: status = `COMPARISON_SUCCESS`

- Enables configuration of acceptable discrepancy levels.
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/dima-ischenko/xoverrr

Awesome Lists containing this project

README