https://github.com/crunchydata/pgcompare
pgCompare – a straightforward utility crafted to simplify the data comparison process, providing a robust solution for comparing data across various database platforms.
https://github.com/crunchydata/pgcompare
compare data migration oracle postgres
Last synced: 10 months ago
JSON representation
pgCompare – a straightforward utility crafted to simplify the data comparison process, providing a robust solution for comparing data across various database platforms.
- Host: GitHub
- URL: https://github.com/crunchydata/pgcompare
- Owner: CrunchyData
- License: apache-2.0
- Created: 2023-11-27T14:41:37.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2025-03-17T21:07:33.000Z (10 months ago)
- Last Synced: 2025-03-29T05:05:43.059Z (10 months ago)
- Topics: compare, data, migration, oracle, postgres
- Language: Java
- Homepage:
- Size: 911 KB
- Stars: 133
- Watchers: 5
- Forks: 20
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE.md
Awesome Lists containing this project
README
pgCompare
Data Compare
[](LICENSE.md)
# Data Compare Made Simple
**pgCompare** is a Java-based tool for validating data consistency after replication or migration between databases. It's designed for scenarios like:
- **Data migration from Oracle/DB2/MariaDB/MySQL/MSSQL to Postgres:** Compare data during or post-migration.
- **Logical replication between same or different database platforms:** Validate data across platforms while minimizing database overhead.
- **Active-Active replication configuration:** Regularly verify data consistency to mitigate risks.
pgCompare uses hashing to compare table data efficiently. Hash values for primary keys and remaining columns are stored in a repository, reducing storage and network demands. Comparisons are processed in parallel, improving performance.
This open-source project is maintained by **Crunchy Data** under the **Apache 2.0 License** and is made available for broader use, testing, and feedback.
# Features
- Supports Oracle, PostgreSQL, DB2, MariaDB, MySQL, and MSSQL.
- Efficient parallel comparisons using hashing.
- Handles batch processing for performance tuning.
- Stores configurations for multiple comparison projects in a central repository.
# Installation
## Requirements
Before initiating the build and installation process, ensure the following prerequisites are met:
1. **Java** 21 or later.
2. **Maven** 3.9 or later.
3. **Postgres** 15 or later (for the repository).
4. Supported JDBC drivers (DB2, Postgres, MySQL, MSSQL and Oracle currently supported).
5. Direct Postgres connections (e.g., no pgBouncer).
## Limitations
- Date/Timestamps compared only to the second (format: DDMMYYYYHH24MISS).
- Unsupported data types: blob, long, longraw, bytea.
- Cross-platform comparison limitations with boolean type.
- Reserved words cannot be used for table/column names.
- If a column is quoted in the RDBMS's native case, you will need to override the `preserve_case` in the `dc_table_column_map` table for that column. For example, if a column was created in Oracle with quotes in upper case ("MYCOL").
# Getting Started
## 1. Fork the repository
## 2. Clone and Build
```shell
git clone --depth 1 git@github.com:/pgCompare.git
cd pgCompare
mvn clean install
```
## 3. Configure Properties
Copy `pgcompare.properties.sample` to `pgcompare.properties` and update the connection parameters for your repository, source, and target databases.
By default, the application looks for the properties file in the execution directory. Use `PGCOMPARE_CONFIG` environment variable to specify a custom properties file location.
At a minimal the `repo-xxxxx` parameters are required in the properties file (or specified by environment parameters). Besides the properties file and environment variables, another alternative is to store the property settings in the `dc_project` table. Settings can be stored in the `project_config` column in JSON format ({"parameter": "value"}). Certain system parameters like log-destination can only be specified via the properties file or environment variables.
## 4. Initialize Repository
Run the script or use the command below to set up the PostgreSQL repository:
```shell
java -jar pgcompare.jar --init
```
## 5. Discover Tables
Discover and map tables in specified schemas:
```shell
java -jar pgcompare.jar --discovery
```
# Usage
## Define Table Mapping
1. Automatic Discovery
Discover and map tables in specified schemas:
```shell
java -jar pgcompare.jar --discovery
```
2. Manual Registration
Insert mappings into `dc_table` and `dc_table_map` tables in the repository.
## Run Data Comparison
```shell
java -jar pgcompare.jar --batch 0
```
Batch 0 processes all data. Use `PGCOMPARE-BATCH` or specify the batch number using the `--batch` argument to specify a batch number.
## Recheck Discrepancies
Revalidate flagged rows:
```shell
java -jar pgcompare.jar --batch 0 --check
```
# Upgrading
## Version 0.3.0 Enhancements
- DB2 support.
- Case-sensitive table/column name handling.
- New project configurations for easier management.
**Note:** Drop and recreate the repository to upgrade to 0.3.0.
# Advanced Configuration
## Properties
Define properties via a file, environment variables, or the `dc_project` table. Environment variables override file settings and must be prefixed with `PGCOMPARE_`.
Examples:
- File: `batch-fetch-size=2000`
- Env: `PGCOMPARE_BATCH_FETCH_SIZE=2000`
## Tuning Performance
- **Batch size:** Adjust `batch-fetch-size` and `batch-commit-size` for memory efficiency.
- **Threads:** Use loader-threads (default: 4) for parallel processing.
- **Observer throttle:** Enable to prevent overloading temporary tables (observer-throttle=true).
- **Java Heap Size:** For larger datasets, there may be a need to increase the Java Heap size. Use the options `-Xms` and `-Xmx` when executing pgCompare (`java -Xms512m -Xmx2g -jar pgcompare.jar`).
## Repository Recommendations
- Minimal requirements: 2 vCPUs, 8 GB RAM.
- PostgreSQL settings:
- shared_buffers=2048MB
- work_mem=256MB
- max_parallel_workers=16
## Projects
Projects allow for the repository to maintain different mappings for different compare objectives. This allows a central pgCompare repository to be used for multiple compare projects. Each table has a `pid` column which is the project id. If no project is specified, the default project (pid = 1) is used.
# Viewing Results
## Summary from Last Run
```sql
WITH mr AS (SELECT max(rid) rid FROM dc_result)
SELECT compare_start, table_name, status, equal_cnt+not_equal_cnt+missing_source_cnt+missing_target_cnt AS total_cnt,
equal_cnt, not_equal_cnt, missing_source_cnt + missing_target_cnt AS missing_cnt
FROM dc_result r
JOIN mr ON (mr.rid = r.rid)
ORDER BY table_name;
```
## Out-of-Sync Rows
```sql
SELECT COALESCE(s.table_name, t.table_name) AS table_name,
CASE
WHEN s.compare_result = 'n' THEN 'out-of-sync'
WHEN s.compare_result = 'm' THEN 'missing target'
WHEN t.compare_result = 'm' THEN 'missing source'
END AS compare_result,
COALESCE(s.pk, t.pk) AS primary_key
FROM dc_source s
FULL OUTER JOIN dc_target t ON s.pk = t.pk and s.tid=t.tid;
```
# Reference
## Column Map
The system will automatically generate a column mapping during the first execution on a table. This column mapping will be stored in the `dc_table_column` and `dc_table_column_map` repository tables. This mapping can be performed ahead of time or the generated mapping modified as needed. If a column mapping is present, the program will not perform a remap unless instructed to using the `maponly` flag.
To create or overwrite current column mappings stored in column_map colum of dc_table, execute the following:
```shell
java -jar pgcompare.jar --batch 0 --maponly
```
## Properties
Properties are categorized into four sections: system, repository, source, and target. Each section has specific properties, as described in detail in the documentation. The properties can be specified via a configuration file, environment variables or a combination of both. To use environment variables, the environment variable will be the name of the property in upper case with dashes '-' converted to underscore '_' and prefixed with PGCOMPARE_. For example, batch-fetch-size can be set by using the environment variable PGCOMPARE_BATCH_FETCH_SIZE.
### System
- batch-fetch-size: Sets the fetch size for retrieving rows from the source or target database.
- batch-commit-size: The commit size controls the array size and number of rows concurrently inserted into the dc_source/dc_target staging tables.
- batch-progress-report-size: Defines the number of rows used in mod to report progress.
- database-source: Determines if the sorting of the rows based on primary key occurs on the source/target database. If set to true, the default, the rows will be sorted before being compared. If set to false, the sorting will take place in the repository database.
- loader-threads: Sets the number of threads to load data into the temporary tables. Default is 4. Set to 0 to disable loader threads.
- log-level: Level to determine the amount of log messages written to the log destination.
- log-destination: Location where log messages will be written. Default is stdout.
- message-queue-size: Size of message queue used by loader threads (nbr messages). Default is 100.
- number-cast: Defines how numbers are cast for hash function (notation|standard). Default is notation (for scientific notation).
- observer-throttle: Set to true or false, instructs the loader threads to pause and wait for the observer thread to catch up before continuing to load more data into the staging tables.
- observer-throttle-size: Number of rows loaded before the loader thread will sleep and wait for clearance from the observer thread.
- observer-vacuum: Set to true or false, instructs the observer whether to perform a vacuum on the staging tables during checkpoints.
### Repository
- repo-dbname: Repository database name.
- repo-host: Host name of server hosting the Postgres repository database.
- repo-password: Postgres database user password.
- repo-port: Repository Postgres instance port.
- repo-schema: Name of schema that owns the repository tables.
- repo-sslmode: Set the SSL mode to use for the database connection (disable|prefer|require)
- repo-user: Postgres database username.
### Source
- source-database-hash: True or false, instructs the application where the hash should be computed (on the database or by the application).
- source-dbname: Database or service name.
- source-host: Database server name.
- source-password: Database password.
- source-port: Database port.
- source-schema: Name of schema that owns the tables.
- source-sslmode: Set the SSL mode to use for the database connection (disable|prefer|require)
- source-type: Database type: oracle, postgres
- source-user: Database username.
### Target
- target-database-hash: True or false, instructs the application where the hash should be computed (on the database or by the application).
- target-dbname: Database or service name.
- target-host: Database server name.
- target-password: Database password.
- target-port: Database port.
- target-schema: Name of schema that owns the tables.
- target-sslmode: Set the SSL mode to use for the database connection (disable|prefer|require)
- target-type: Database type: oracle, postgres
- target-user: Database username.
## Property Precedence
The system contains default values for every parameter. These can be over-ridden using environment variables, properties file, or values saved in the `dc_project` table. The following is the order of precedence used:
- Default values
- Properties file
- Environment variables
- Settings stored in `dc_project` table
# License
**pgCompare** is licensed under the [Apache 2.0 license](LICENSE.md).