{"id":19236824,"url":"https://github.com/crunchydata/pgcompare","last_synced_at":"2025-04-05T06:04:49.389Z","repository":{"id":242098661,"uuid":"724172108","full_name":"CrunchyData/pgCompare","owner":"CrunchyData","description":"pgCompare – a straightforward utility crafted to simplify the data comparison process, providing a robust solution for comparing data across various database platforms.","archived":false,"fork":false,"pushed_at":"2025-03-17T21:07:33.000Z","size":933,"stargazers_count":133,"open_issues_count":0,"forks_count":20,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-03-29T05:05:43.059Z","etag":null,"topics":["compare","data","migration","oracle","postgres"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/CrunchyData.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-11-27T14:41:37.000Z","updated_at":"2025-03-28T12:58:09.000Z","dependencies_parsed_at":"2024-05-31T17:18:24.483Z","dependency_job_id":"f6e7c298-7e2c-49d4-92cd-c1cb63c8a9d2","html_url":"https://github.com/CrunchyData/pgCompare","commit_stats":{"total_commits":64,"total_committers":5,"mean_commits":12.8,"dds":0.390625,"last_synced_commit":"a42a0c21045cf9deb51a2ee441524094bd9ca429"},"previous_names":["crunchydata/pgcompare"],"tags_count":9,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CrunchyData%2FpgCompare","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Crunc
hyData%2FpgCompare/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CrunchyData%2FpgCompare/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CrunchyData%2FpgCompare/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/CrunchyData","download_url":"https://codeload.github.com/CrunchyData/pgCompare/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247294538,"owners_count":20915340,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["compare","data","migration","oracle","postgres"],"created_at":"2024-11-09T16:23:29.920Z","updated_at":"2025-04-05T06:04:49.362Z","avatar_url":"https://github.com/CrunchyData.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv\u003e\n  \u003ch1 style=\"font-size: 70px;text-align: center\"\u003epgCompare\u003c/h1\u003e\n  \u003ch2 style=\"text-align: center\"\u003eData Compare\u003c/h2\u003e\n\u003c/div\u003e\n\u003chr\u003e\n\n[![License](https://img.shields.io/github/license/CrunchyData/postgres-operator)](LICENSE.md)\n\n# Data Compare Made Simple\n\n**pgCompare** is a Java-based tool for validating data consistency after replication or migration between databases. 
It's designed for scenarios like:\n\n\n- **Data migration from Oracle/DB2/MariaDB/MySQL/MSSQL to Postgres:**  Compare data during or post-migration.\n\n- **Logical replication between same or different database platforms:** Validate data across platforms while minimizing database overhead.\n\n- **Active-Active replication configuration:**  Regularly verify data consistency to mitigate risks.\n\npgCompare uses hashing to compare table data efficiently. Hash values for primary keys and remaining columns are stored in a repository, reducing storage and network demands. Comparisons are processed in parallel, improving performance.\n\nThis open-source project is maintained by **Crunchy Data** under the **Apache 2.0 License** and is made available for broader use, testing, and feedback.\n\n# Features\n\n- Supports Oracle, PostgreSQL, DB2, MariaDB, MySQL, and MSSQL.\n- Efficient parallel comparisons using hashing.\n- Handles batch processing for performance tuning.\n- Stores configurations for multiple comparison projects in a central repository.\n\n# Installation\n\n## Requirements\n\nBefore initiating the build and installation process, ensure the following prerequisites are met:\n\n1. **Java** 21 or later.\n2. **Maven** 3.9 or later.\n3. **Postgres** 15 or later (for the repository).\n4. Supported JDBC drivers (DB2, Postgres, MySQL, MSSQL and Oracle currently supported).\n5. Direct Postgres connections (e.g., no pgBouncer).\n\n## Limitations\n\n- Date/Timestamps compared only to the second (format: DDMMYYYYHH24MISS).\n- Unsupported data types: blob, long, longraw, bytea.\n- Cross-platform comparison limitations with boolean type.\n- Reserved words cannot be used for table/column names.\n- If a column is quoted in the RDBMS's native case, you will need to override the `preserve_case` in the `dc_table_column_map` table for that column.  For example, if a column was created in Oracle with quotes in upper case (\"MYCOL\").\n\n# Getting Started\n\n## 1. 
Fork the repository\n\n## 2. Clone and Build\n\n```shell\ngit clone --depth 1 git@github.com:\u003cyour-github-username\u003e/pgCompare.git\ncd pgCompare\nmvn clean install\n```\n\n## 3. Configure Properties\n\nCopy `pgcompare.properties.sample` to `pgcompare.properties` and update the connection parameters for your repository, source, and target databases.\nBy default, the application looks for the properties file in the execution directory. Use the `PGCOMPARE_CONFIG` environment variable to specify a custom properties file location.\n\nAt a minimum, the `repo-xxxxx` parameters are required in the properties file (or specified by environment variables).  Besides the properties file and environment variables, another alternative is to store the property settings in the `dc_project` table.  Settings can be stored in the `project_config` column in JSON format ({\"parameter\": \"value\"}).  Certain system parameters like log-destination can only be specified via the properties file or environment variables.\n\n## 4. Initialize Repository\n\nRun the script or use the command below to set up the PostgreSQL repository:\n\n```shell\njava -jar pgcompare.jar --init\n```\n\n## 5. Discover Tables\n\nDiscover and map tables in specified schemas:\n\n```shell\njava -jar pgcompare.jar --discovery\n```\n\n# Usage\n\n## Define Table Mapping\n\n1. Automatic Discovery\n\n    Discover and map tables in specified schemas:\n\n    ```shell\n    java -jar pgcompare.jar --discovery\n    ```\n\n2. Manual Registration \n\n    Insert mappings into the `dc_table` and `dc_table_map` tables in the repository.\n\n## Run Data Comparison\n\n```shell\njava -jar pgcompare.jar --batch 0\n```\n\nBatch 0 processes all data. 
Use `PGCOMPARE-BATCH` or the `--batch` argument to specify a batch number.\n\n## Recheck Discrepancies\n\nRevalidate flagged rows:\n\n```shell\njava -jar pgcompare.jar --batch 0 --check\n```\n\n# Upgrading\n\n## Version 0.3.0 Enhancements\n\n- DB2 support.\n- Case-sensitive table/column name handling.\n- New project configurations for easier management.\n\n**Note:** Drop and recreate the repository to upgrade to 0.3.0.\n\n# Advanced Configuration\n\n## Properties\n\nDefine properties via a file, environment variables, or the `dc_project` table. Environment variables override file settings and must be prefixed with `PGCOMPARE_`.\n\nExamples:\n- File: `batch-fetch-size=2000`\n- Env: `PGCOMPARE_BATCH_FETCH_SIZE=2000`\n\n## Tuning Performance\n\n- **Batch size:** Adjust `batch-fetch-size` and `batch-commit-size` for memory efficiency.\n- **Threads:** Use `loader-threads` (default: 4) for parallel processing.\n- **Observer throttle:** Enable to prevent overloading temporary tables (`observer-throttle=true`).\n- **Java Heap Size:** For larger datasets, you may need to increase the Java heap size.  Use the options `-Xms` and `-Xmx` when executing pgCompare (`java -Xms512m -Xmx2g -jar pgcompare.jar`). \n\n## Repository Recommendations\n\n- Minimum requirements: 2 vCPUs, 8 GB RAM.\n- PostgreSQL settings:\n  - shared_buffers=2048MB\n  - work_mem=256MB\n  - max_parallel_workers=16\n\n## Projects\n\nProjects allow the repository to maintain different mappings for different compare objectives.  This allows a central pgCompare repository to be used for multiple compare projects.  Each table has a `pid` column, which holds the project id.  
If no project is specified, the default project (pid = 1) is used.\n\n# Viewing Results\n\n## Summary from Last Run\n\n```sql\nWITH mr AS (SELECT max(rid) rid FROM dc_result)\nSELECT compare_start, table_name, status, equal_cnt+not_equal_cnt+missing_source_cnt+missing_target_cnt  AS total_cnt,\n       equal_cnt, not_equal_cnt, missing_source_cnt + missing_target_cnt AS missing_cnt\nFROM dc_result r\n         JOIN mr ON (mr.rid = r.rid)\nORDER BY table_name;\n```\n\n## Out-of-Sync Rows\n\n```sql\nSELECT COALESCE(s.table_name, t.table_name) AS table_name,\n       CASE\n           WHEN s.compare_result = 'n' THEN 'out-of-sync'\n           WHEN s.compare_result = 'm' THEN 'missing target'\n           WHEN t.compare_result = 'm' THEN 'missing source'\n           END AS compare_result,\n       COALESCE(s.pk, t.pk) AS primary_key\nFROM dc_source s\n         FULL OUTER JOIN dc_target t ON s.pk = t.pk and s.tid=t.tid;\n```\n\n# Reference\n\n## Column Map\n\nThe system will automatically generate a column mapping during the first execution on a table.  This column mapping will be stored in the `dc_table_column` and `dc_table_column_map` repository tables. This mapping can be performed ahead of time or the generated mapping modified as needed.  If a column mapping is present, the program will not perform a remap unless instructed to do so with the `maponly` flag.\n\nTo create or overwrite current column mappings stored in the `column_map` column of `dc_table`, execute the following:\n\n```shell\njava -jar pgcompare.jar --batch 0 --maponly\n```\n\n## Properties\n\nProperties are categorized into four sections: system, repository, source, and target. Each section has specific properties, as described in detail in the documentation.  The properties can be specified via a configuration file, environment variables, or a combination of both.  
To set a property through an environment variable, use the property name in upper case, with dashes '-' converted to underscores '_', and prefix it with PGCOMPARE_.  For example, batch-fetch-size can be set by using the environment variable PGCOMPARE_BATCH_FETCH_SIZE.\n\n### System\n- batch-fetch-size: Sets the fetch size for retrieving rows from the source or target database.\n- batch-commit-size:  The commit size controls the array size and number of rows concurrently inserted into the dc_source/dc_target staging tables.\n- batch-progress-report-size:  Number of rows between progress reports (applied as a modulus on the row count).\n- database-source:  Determines if the sorting of the rows based on primary key occurs on the source/target database.  If set to true, the default, the rows will be sorted before being compared.  If set to false, the sorting will take place in the repository database.\n- loader-threads: Sets the number of threads to load data into the temporary tables. Default is 4.  Set to 0 to disable loader threads.\n- log-level:   Level to determine the amount of log messages written to the log destination.\n- log-destination:  Location where log messages will be written.  Default is stdout.\n- message-queue-size:  Size of the message queue used by loader threads (number of messages).  Default is 100.\n- number-cast: Defines how numbers are cast for the hash function (notation|standard).  
Default is notation (for scientific notation).\n- observer-throttle:  Set to true or false, instructs the loader threads to pause and wait for the observer thread to catch up before continuing to load more data into the staging tables.\n- observer-throttle-size:  Number of rows loaded before the loader thread will sleep and wait for clearance from the observer thread.\n- observer-vacuum:  Set to true or false, instructs the observer whether to perform a vacuum on the staging tables during checkpoints.\n\n### Repository\n- repo-dbname:  Repository database name.\n- repo-host: Host name of server hosting the Postgres repository database.\n- repo-password:  Postgres database user password.\n- repo-port:  Repository Postgres instance port.\n- repo-schema:  Name of schema that owns the repository tables.\n- repo-sslmode: Set the SSL mode to use for the database connection (disable|prefer|require)\n- repo-user:  Postgres database username.\n\n### Source\n\n- source-database-hash: True or false, instructs the application where the hash should be computed (on the database or by the application).\n- source-dbname:  Database or service name.\n- source-host:  Database server name.\n- source-password:  Database password.\n- source-port:  Database port.\n- source-schema:  Name of schema that owns the tables.\n- source-sslmode: Set the SSL mode to use for the database connection (disable|prefer|require)\n- source-type:  Database type: oracle, postgres\n- source-user:   Database username.\n\n### Target\n\n- target-database-hash: True or false, instructs the application where the hash should be computed (on the database or by the application).\n- target-dbname:  Database or service name.\n- target-host:  Database server name.\n- target-password:  Database password.\n- target-port:  Database port.\n- target-schema:  Name of schema that owns the tables.\n- target-sslmode: Set the SSL mode to use for the database connection (disable|prefer|require)\n- target-type:  Database type: 
oracle, postgres\n- target-user:  Database username.\n\n## Property Precedence\n\nThe system contains default values for every parameter.  These can be overridden using environment variables, a properties file, or values saved in the `dc_project` table.  The following is the order of precedence, from lowest to highest:\n\n- Default values\n- Properties file\n- Environment variables\n- Settings stored in `dc_project` table\n\n# License\n\n**pgCompare** is licensed under the [Apache 2.0 license](LICENSE.md).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcrunchydata%2Fpgcompare","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcrunchydata%2Fpgcompare","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcrunchydata%2Fpgcompare/lists"}