https://github.com/dima-ischenko/xoverrr
Data quality library on python
https://github.com/dima-ischenko/xoverrr
clickhouse comparison dataquality greenplum oracle postgresql python
Last synced: 2 months ago
JSON representation
Data quality library on python
- Host: GitHub
- URL: https://github.com/dima-ischenko/xoverrr
- Owner: dima-ischenko
- License: mit
- Created: 2025-10-18T21:15:35.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2026-01-30T17:02:37.000Z (2 months ago)
- Last Synced: 2026-01-31T04:18:15.223Z (2 months ago)
- Topics: clickhouse, comparison, dataquality, greenplum, oracle, postgresql, python
- Language: Python
- Homepage:
- Size: 185 KB
- Stars: 2
- Watchers: 1
- Forks: 1
- Open Issues: 21
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# xoverrr (pronounced “crossover”)
A tool for cross-database and intra-source data comparison with detailed discrepancy analysis and reporting.
## Usage Example
**Sample comparison** (Greenplum vs Oracle):
```python
from xoverrr import DataQualityComparator, DataReference, COMPARISON_SUCCESS, COMPARISON_FAILED, COMPARISON_SKIPPED
import os
from datetime import date, timedelta
USER_ORA = os.getenv('USER_ORA', '')
PASSWORD_ORA = os.getenv('PASSWORD_ORA', '')
USER_GP = os.getenv('USER_GP', '')
PASSWORD_GP = os.getenv('PASSWORD_GP', '')
HOST_ORA = os.getenv('HOST_ORA', '')
HOST_GP = os.getenv('HOST_GP', '')
def create_src_engine(user, password, host):
"""Source engine (Oracle)"""
os.environ['NLS_LANG'] = '.AL32UTF8'
return create_engine(f'oracle+oracledb://{user}:{password}@{host}:1521/?service_name=dwh')
def create_trg_engine(user, password, host):
"""Target engine (Postgres/Greenplum)"""
connection_string = f'postgresql+psycopg2://{user}:{password}@{host}:5432/adb'
engine = create_engine(connection_string)
return engine
src_engine = create_src_engine(USER_ORA, PASSWORD_ORA, HOST_ORA)
trg_engine = create_trg_engine(USER_GP, PASSWORD_GP, HOST_GP)
comparator = DataQualityComparator(
source_engine=src_engine,
target_engine=trg_engine,
timezone='Europe/Athens'
)
source = DataReference("users", "schema1")
target = DataReference("users", "schema2")
FORMAT = '%Y-%m-%d'
recent_range_end = date.today()
recent_range_begin = recent_range_end - timedelta(days=1)
status, report, stats, details = comparator.compare_sample(
source,
target,
date_column="created_at",
update_column="modified_date",
exclude_columns=["audit_timestamp", "internal_id"],
exclude_recent_hours=3,
date_range=(
recent_range_begin.strftime(FORMAT),
recent_range_end.strftime(FORMAT)
),
tolerance_percentage=0
)
print(report)
if status == COMPARISON_FAILED:
raise Exception("Sample check failed")
```
## Key Features
- **Multi‑DBMS support**: Oracle, PostgreSQL (+ Greenplum), ClickHouse (extensible via adapter layer) — tables and views.
- **Universal connections**: Provide SQLAlchemy Engine objects for source and target databases.
- **Comparison strategies**:
* Data sample comparison
* Count‑based comparison with daily aggregates
* Fully custom (raw) SQL‑query comparison
- **Smart analysis**:
* Excludes “fresh” data to mitigate replication lag
* Auto‑detection of primary keys and column types from DBMS metadata (PK must be found on at least one side, or may be supplied manually)
* Application‑side type conversion
* Automatic exclusion of columns with mismatched names
- **Optimization**: Two samples of 1 million rows × 10 columns (each ~330 MB) compared in ~3 s (Intel Core i5 / 16 GB RAM)
- **Detailed reporting**: In‑depth column‑level discrepancy analysis with example records (column view / record view)
- **Flexible configuration**: Column exclusion/inclusion, tolerance thresholds, custom primary‑key specification
- **Unit tests**: Coverage for comparison methods, functional and performance validation
- **Integrations tests**: contains integration tests for xoverrr using real databases started via Docker
## Example Report
```
================================================================================
2025-11-24 20:09:40
DATA SAMPLE COMPARISON REPORT:
public.account
VS
stage.account
================================================================================
timezone: Europe/Athens
SELECT created_at, updated_at, id, code, bank_code, account_type, counterparty_id, special_code, case when updated_at > (now() - INTERVAL '%(exclude_recent_hours)s hours') then 'y' end as xrecently_changed
FROM public.account
WHERE 1=1
AND created_at >= date_trunc('day', %(start_date)s::date)
AND created_at < date_trunc('day', %(end_date)s::date) + interval '1 days'
params: {'exclude_recent_hours': 1, 'start_date': '2025-11-17', 'end_date': '2025-11-24'}
----------------------------------------
SELECT created_at, updated_at, id, code, bank_code, account_type, counterparty_id, special_code, case when updated_at > (sysdate - :exclude_recent_hours/24) then 'y' end as xrecently_changed
FROM stage.account
WHERE 1=1
AND created_at >= trunc(to_date(:start_date, 'YYYY-MM-DD'), 'dd')
AND created_at < trunc(to_date(:end_date, 'YYYY-MM-DD'), 'dd') + 1
params: {'exclude_recent_hours': 1, 'start_date': '2025-11-17', 'end_date': '2025-11-24'}
----------------------------------------
SUMMARY:
Source rows: 10966
Target rows: 10966
Duplicated source rows: 0
Duplicated target rows: 0
Only source rows: 0
Only target rows: 0
Common rows (by primary key): 10966
Totally matched rows: 10965
----------------------------------------
Source only rows %: 0.00000
Target only rows %: 0.00000
Duplicated source rows %: 0.00000
Duplicated target rows %: 0.00000
Mismatched rows %: 0.00912
Final discrepancies score: 0.00456
Final data quality score: 99.99544
Source-only key examples: None
Target-only key examples: None
Duplicated source key examples: None
Duplicated target key examples: None
Common attribute columns: created_at, updated_at, code, bank_code, account_type, counterparty_id, special_code
Skipped source columns:
Skipped target columns: mt_change_date
COLUMN DIFFERENCES:
Discrepancies per column (max %): 0.00912
Count of mismatches per column:
column_name mismatch_count
special_code 1
Some examples:
primary_key column_name source_value target_value
f8153447-****-****-****-****** special_code N/A XYZ
DISCREPANT DATA (first pairs):
Sorted by primary key and dataset:
created_at updated_at id code bank_code account_type counterparty_id special_code xflg
2025-11-24 18:58:27 2025-11-24 18:58:27 f8153447-****-****-****-****** 42****************87 0********* 11 62aa01a6-****-****-****-f17e2b*****4
N/A src
2025-11-24 18:58:27 2025-11-24 18:58:27 f8153447-****-****-****-****** 42****************87 0********* 11 62aa01a6-****-****-****-f17e2b*****4 XYZ trg
================================================================================
```
## Metric Calculation
### for compare_sample/compare_custom_query
```
final_diff_score =
(source_dup% × 0.1)
+ (target_dup% × 0.1)
+ (source_only_rows% × 0.15)
+ (target_only_rows% × 0.15)
+ (rows_mismatched_by_any_column% × 0.5)
```
### for compare_counts
```
sum_of_absolute_differences = `abs(source_count - target_count)` per each day
sum_of_common_counts = `min(source_count, target_count)` per each day
final_diff_score = 100 × (sum_of_absolute_differences) / (sum_of_absolute_differences + sum_of_common_counts)
```
#### Quality score formula all methods: `100 − final_diff_score`
#### Scores range 0–100%; higher values indicate better data quality.
## Comparison Methods
### 1. Data Sample Comparison (`compare_sample`)
Suitable for comparing row sets and column values over a date range.
```python
status, report, stats, details = comparator.compare_sample(
source_table=DataReference("table_name", "schema_name"),
target_table=DataReference("table_name", "schema_name"),
date_column="created_at",
update_column="modified_date",
date_range=("2024-01-01", "2024-01-31"),
exclude_columns=["audit_timestamp", "internal_id"],
include_columns=None,
custom_primary_key=["id", "user_id"],
tolerance_percentage=1.0,
exclude_recent_hours=24,
max_examples=3
)
```
**Parameters:**
- `source_table`, `target_table` – names of the tables or views to compare
- `date_column` – column used for date‑range filtering
- `update_column` – column identifying “fresh” data (excluded from both sides)
- `date_range` – tuple `(start_date, end_date)` in “YYYY‑MM‑DD” format
- `exclude_columns` – list of columns to omit from comparison, aka blacklist
- `include_columns` – list of columns to include, aka whitelist
- `custom_primary_key` – user‑specified primary key (if not provided, auto‑detected)
- `tolerance_percentage` – acceptable discrepancy threshold (0.0–100.0)
- `exclude_recent_hours` – exclude data modified within the last N hours
- `max_examples` – maximum number of discrepancy examples included in the report
### 2. Count‑Based Comparison (`compare_counts`)
Efficient for large‑volume comparisons over extended date ranges, identifying missing rows or duplicates.
```python
status, report, stats, details = comparator.compare_counts(
source_table=DataReference("users", "schema1"),
target_table=DataReference("users", "schema2"),
date_column="created_at",
date_range=("2024-01-01", "2024-01-31"),
tolerance_percentage=2.0,
max_examples=5
)
```
**Parameters:**
- `source_table`, `target_table` – references to the tables/views to compare
- `date_column` – column for daily grouping
- `date_range` – date interval for analysis
- `tolerance_percentage` – acceptable discrepancy threshold
- `max_examples` – maximum number of daily discrepancy examples included in the report
### 3. Custom‑Query Comparison (`compare_custom_query`)
Compares data from arbitrary SQL queries. Suitable for complex scenarios.
```python
status, report, stats, details = comparator.compare_custom_query(
source_query="""SELECT id as user_id, name as user_name, created_at as created_date FROM scott.source_table WHERE status = %(status)s""",
source_params={'status': 'active'},
target_query="""SELECT user_id, user_name, created_date FROM scott.target_table WHERE status = :status""",
target_params={'status': 'active'},
custom_primary_key=["id"],
exclude_columns=["internal_code"],
tolerance_percentage=0.5,
max_examples=3
)
```
**Parameters:**
- `source_query`, `target_query` – parameterised SQL queries for the source and target
- `source_params`, `target_params` – query parameters
- `custom_primary_key` – mandatory list of column names constituting the primary key
- `exclude_columns` – columns to omit from comparison
- `tolerance_percentage` – acceptable discrepancy threshold
- `max_examples` – maximum number of discrepancy examples included in the report
- To automatically exclude recently changed records, add the following expression to your SELECT clause in `compare_custom_query`:
```sql
case when updated_at > (sysdate - 3/24) then 'y' end as xrecently_changed
```
**Automatic Primary‑Key Detection:**
- If `custom_primary_key` is not supplied, the system automatically infers the PK from metadata.
- When source and target PKs differ, the source PK is used with a warning.
**Performance Considerations:**
- DataFrame size validation (hard limit: 3 GB per sample)
- Efficient comparison via XOR properties
- Configurable limits via constants
**Return Values:**
All methods return a tuple:
- `status` – comparison status (`COMPARISON_SUCCESS` / `COMPARISON_FAILED` / `COMPARISON_SKIPPED`)
- `report` – textual report detailing discrepancies
- `stats` – `ComparisonStats` dataclass instance containing comparison statistics
- `details` – `ComparisonDiffDetails` dataclass instance with discrepancy examples and details
### Status Types
- **COMPARISON_SUCCESS**: Comparison completed within tolerance limits.
- **COMPARISON_FAILED**: Discrepancies exceed tolerance threshold, or a technical error occurred.
- **COMPARISON_SKIPPED**: No data available for comparison (both tables empty).
### Structured Logging
Logs include timing information and structured context:
```
2024-01-15 10:30:45 - INFO - xoverrr.core._compare_samples - Query executed in 2.34s
2024-01-15 10:30:46 - INFO - xoverrr.core._compare_samples - Source: 150000 rows, Target: 149950 rows
2024-01-15 10:30:47 - INFO - xoverrr.utils.compare_dataframes - Comparison completed in 1.2s
```
### Tolerance Percentage
- **tolerance_percentage**: Acceptable discrepancy threshold (0.0–100.0).
- If `final_diff_score > tolerance`: status = `COMPARISON_FAILED`
- If `final_diff_score ≤ tolerance`: status = `COMPARISON_SUCCESS`
- Enables configuration of acceptable discrepancy levels.