# pgcopydb

[![Documentation Status](https://readthedocs.org/projects/pgcopydb/badge/?version=latest)](https://pgcopydb.readthedocs.io/en/latest/?badge=latest)

## Introduction

pgcopydb is a tool that automates running `pg_dump | pg_restore` between two
running Postgres servers. To make a copy of a database to another server as
quickly as possible, one would like to use the parallel options of `pg_dump`
while still being able to stream the data to as many parallel `pg_restore`
jobs.

The obvious idea would be to run something like `pg_dump --jobs=N
--format=directory postgres://user@source/dbname | pg_restore --jobs=N
--format=directory -d postgres://user@target/dbname`. Unfortunately, that
command line cannot be made to work, because `pg_dump --format=directory`
writes to local files and directories first, and only afterwards can
`pg_restore --format=directory` read from those files.

Given that, pgcopydb uses pg_dump and pg_restore for the schema parts of the
process, and implements its own multi-process streaming for the data copying
parts.
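Because the directory format has to hit disk, the closest manual alternative
is a two-step dance: a parallel dump into a work directory, then a parallel
restore reading it back. The helper below is a sketch (not part of pgcopydb)
that only prints those two commands; the connection strings, work directory,
and job count are illustrative values:

```bash
#!/bin/sh
# Hypothetical helper: print the manual two-step plan that the directory
# format forces, and that pgcopydb's streaming replaces. Connection
# strings, work directory, and job counts are illustrative; nothing runs.
print_manual_plan() {
    src=$1; tgt=$2; workdir=$3; jobs=$4
    # pg_dump must finish writing the directory-format archive first...
    printf '%s\n' "pg_dump --format=directory --jobs=$jobs --file=$workdir $src"
    # ...and only then can pg_restore read it back, also in parallel.
    printf '%s\n' "pg_restore --format=directory --jobs=$jobs --dbname=$tgt $workdir"
}

print_manual_plan \
    'postgres://user@source/dbname' \
    'postgres://user@target/dbname' \
    /tmp/pgcopydb-dump 4
```

Note that the second step cannot start before the first one has completely
finished; pgcopydb removes both the intermediate directory and that wait.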
pgcopydb also bypasses `pg_restore` index building and drives that step
internally, so that all indexes may be built concurrently.

## Base Copy and Change Data Capture

pgcopydb implements both the base copy of a database and Change Data
Capture, which allows replaying changes from the source database to the
target database. The Change Data Capture facility is implemented using the
Postgres Logical Decoding infrastructure and the wal2json plugin.

The `pgcopydb follow` command implements a logical replication client for
the wal2json logical decoding plugin.

The `pgcopydb clone --follow` command implements a full solution for online
migration. Beware that online migrations involve many more complexities than
offline migrations; it is always a good idea to implement the offline
approach first. The `pgcopydb clone` command implements the offline
migration approach.

## Documentation

Full documentation is available online, including manual pages for all the
pgcopydb sub-commands.
Check out
[https://pgcopydb.readthedocs.io/](https://pgcopydb.readthedocs.io/en/latest/).

```
$ pgcopydb help
  pgcopydb
    clone     Clone an entire database from source to target
    fork      Clone an entire database from source to target
    follow    Replay changes from the source database to the target database
    snapshot  Create and export a snapshot on the source database
  + compare   Compare source and target databases
  + copy      Implement the data section of the database copy
  + dump      Dump database objects from a Postgres instance
  + restore   Restore database objects into a Postgres instance
  + list      List database objects from a Postgres instance
  + stream    Stream changes from the source database
    ping      Attempt to connect to the source and target instances
    help      Print help message
    version   Print pgcopydb version

  pgcopydb compare
    schema  Compare source and target schema
    data    Compare source and target data

  pgcopydb copy
    db           Copy an entire database from source to target
    roles        Copy the roles from the source instance to the target instance
    extensions   Copy the extensions from the source instance to the target instance
    schema       Copy the database schema from source to target
    data         Copy the data section from source to target
    table-data   Copy the data from all tables in database from source to target
    blobs        Copy the blob data from the source database to the target
    sequences    Copy the current value from all sequences in database from source to target
    indexes      Create all the indexes found in the source database in the target
    constraints  Create all the constraints found in the source database in the target

  pgcopydb dump
    schema     Dump source database schema as custom files in work directory
    roles      Dump source database roles as custom file in work directory

  pgcopydb restore
    schema      Restore a database schema from custom files to target database
    pre-data    Restore a database pre-data schema from custom file to target database
    post-data   Restore a database post-data schema from custom file to target database
    roles       Restore database roles from SQL file to target database
    parse-list  Parse pg_restore --list output from custom file

  pgcopydb list
    databases    List databases
    extensions   List all the source extensions to copy
    collations   List all the source collations to copy
    tables       List all the source tables to copy data from
    table-parts  List a source table copy partitions
    sequences    List all the source sequences to copy data from
    indexes      List all the indexes to create again after copying the data
    depends      List all the dependencies to filter-out
    schema       List the schema to migrate, formatted in JSON
    progress     List the progress

  pgcopydb stream
    setup      Setup source and target systems for logical decoding
    cleanup    Cleanup source and target systems for logical decoding
    prefetch   Stream JSON changes from the source database and transform them to SQL
    catchup    Apply prefetched changes from SQL files to the target database
    replay     Replay changes from the source to the target database, live
  + sentinel   Maintain a sentinel table
    receive    Stream changes from the source database
    transform  Transform changes from the source database into SQL commands
    apply      Apply changes from the source database into the target database

  pgcopydb stream sentinel
    setup   Setup the sentinel table
    get     Get the sentinel table values
  + set     Set the sentinel table values

  pgcopydb stream sentinel set
    startpos  Set the sentinel start position LSN
    endpos    Set the sentinel end position LSN
    apply     Set the sentinel apply mode
    prefetch  Set the sentinel prefetch mode
```

## Example

When using `pgcopydb` it is possible to achieve the result outlined above
with this simple command line:

```bash
$ export PGCOPYDB_SOURCE_PGURI="postgres://user@source.host.dev/dbname"
$ export PGCOPYDB_TARGET_PGURI="postgres://role@target.host.dev/dbname"

$ pgcopydb clone --table-jobs 8 --index-jobs 2
```

A typical run produces many lines of logs, then a table summary with a line
per table (timing for the table COPY, cumulative timing for the CREATE INDEX
commands), and finally an overall summary that looks like the following:

```
19:18:24.447 76974 INFO   Running pgcopydb version 0.15.74.gc74047a from "/usr/bin/pgcopydb"
19:18:24.451 76974 INFO   [SOURCE] Copying database from "postgres://pagila:0wn3d@source/pagila?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60"
19:18:24.451 76974 INFO   [TARGET] Copying database into "postgres://pagila:0wn3d@target/pagila?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60"
19:18:24.506 76974 INFO   Using work dir "/tmp/pgcopydb"
19:18:24.519 76974 INFO   Exported snapshot "00000003-00000023-1" from the source database
19:18:24.522 76985 INFO   STEP 1: fetch source database tables, indexes, and sequences
19:18:24.886 76985 INFO   Fetched information for 5 tables (including 0 tables split in 0 partitions total), with an estimated total of 1000 thousands tuples and 128 MB on-disk
19:18:24.892 76985 INFO   Fetched information for 4 indexes (supporting 4 constraints)
19:18:24.894 76985 INFO   Fetching information for 1 sequences
19:18:24.909 76985 INFO   Fetched information for 1 extensions
19:18:25.030 76985 INFO   Found 0 indexes (supporting 0 constraints) in the target database
19:18:25.042 76985 INFO   STEP 2: dump the source database schema (pre/post data)
19:18:25.046 76985 INFO    /usr/bin/pg_dump -Fc --snapshot 00000003-00000023-1 --section=pre-data --section=post-data --file /tmp/pgcopydb/schema/schema.dump 'postgres://pagila:0wn3d@source/pagila?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60'
19:18:25.182 76985 INFO   STEP 3: restore the pre-data section to the target database
19:18:25.202 76985 INFO    /usr/bin/pg_restore --dbname 'postgres://pagila:0wn3d@target/pagila?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60' --section pre-data --jobs 2 --use-list /tmp/pgcopydb/schema/pre-filtered.list /tmp/pgcopydb/schema/schema.dump
19:18:25.354 77000 INFO   STEP 4: starting 8 table-data COPY processes
19:18:25.428 77002 INFO   STEP 8: starting 8 VACUUM processes
19:18:25.451 76985 INFO   Skipping large objects: none found.
2024-06-10 19:18:25.462 +03 [77031] LOG:  unexpected EOF on client connection with an open transaction
19:18:25.471 77001 INFO   STEP 6: starting 2 CREATE INDEX processes
19:18:25.471 77001 INFO   STEP 7: constraints are built by the CREATE INDEX processes
19:18:25.482 76985 INFO   STEP 9: reset sequences values
19:18:25.483 77040 INFO   Set sequences values on the target database
19:18:33.807 76985 INFO   STEP 10: restore the post-data section to the target database
19:18:33.821 76985 INFO    /usr/bin/pg_restore --dbname 'postgres://pagila:0wn3d@target/pagila?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60' --section post-data --jobs 2 --use-list /tmp/pgcopydb/schema/post-filtered.list /tmp/pgcopydb/schema/schema.dump
19:18:33.879 76985 INFO   All step are now done,  9s352 elapsed
19:18:33.880 76985 INFO   Printing summary for 5 tables and 4 indexes

  OID | Schema |             Name | Parts | copy duration | transmitted bytes | indexes | create index duration
------+--------+------------------+-------+---------------+-------------------+---------+----------------------
16398 | public | pgbench_accounts |     1 |         7s130 |             91 MB |       1 |                 878ms
16395 | public |  pgbench_tellers |     1 |          69ms |            1002 B |       1 |                  44ms
16401 | public | pgbench_branches |     1 |          46ms |              71 B |       1 |                  37ms
16386 | public |           table1 |     1 |          56ms |               0 B |       1 |                  40ms
16392 | public |  pgbench_history |     1 |          67ms |               0 B |       0 |                   0ms


                                               Step   Connection    Duration    Transfer   Concurrency
 --------------------------------------------------   ----------  ----------  ----------  ------------
   Catalog Queries (table ordering, filtering, etc)       source       183ms                         1
                                        Dump Schema       source       134ms                         1
                                     Prepare Schema       target       128ms                         1
      COPY, INDEX, CONSTRAINTS, VACUUM (wall clock)         both       8s483                        18
                                  COPY (cumulative)         both       7s368      128 MB             8
                          CREATE INDEX (cumulative)       target       965ms                         2
                           CONSTRAINTS (cumulative)       target        34ms                         2
                                VACUUM (cumulative)       target       120ms                         8
                                    Reset Sequences         both        38ms                         1
                         Large Objects (cumulative)       (null)         0ms                         0
                                    Finalize Schema         both        61ms                         2
 --------------------------------------------------   ----------  ----------  ----------  ------------
                          Total Wall Clock Duration         both       9s352                        24
```

## Installing pgcopydb

See our [documentation](https://pgcopydb.readthedocs.io/en/latest/install.html).

## Design Considerations (why oh why)

`pgcopydb` has been developed mostly to provide two capabilities that cannot
be achieved directly with `pg_dump` and `pg_restore`, and that require just
enough fiddling around that few scripts automating them have been made
available.

### Bypass intermediate files for the TABLE DATA

The first aspect is that for `pg_dump` and `pg_restore` to implement
concurrency, they need to write to intermediate files first.

The [docs for
pg_dump](https://www.postgresql.org/docs/current/app-pgdump.html) say the
following about the `--jobs` parameter:

> You can only use this option with the directory output format because this
> is the only output format where multiple processes can write their data at
> the same time.

The [docs for
pg_restore](https://www.postgresql.org/docs/current/app-pgrestore.html) say
the following about the `--jobs` parameter:

> Only the custom and directory archive formats are supported with this
> option. The input must be a regular file or directory (not, for example, a
> pipe or standard input).

So the first idea with `pgcopydb` is to provide `--jobs` concurrency while
bypassing intermediate files (and directories) altogether, at least as far
as the actual TABLE DATA set is concerned.

The trick to achieving that is that `pgcopydb` must be able to connect to
the source database for the whole duration of the operation, whereas
`pg_restore` can be used from an on-disk export without any remaining
connection to the source database. In the context of `pgcopydb`, requiring
access to the source database is fine.
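For a single table, the same no-intermediate-file principle can be sketched
with stock tools: `COPY ... TO STDOUT` on the source piped straight into
`COPY ... FROM STDIN` on the target (pgcopydb drives this over the protocol
for many tables concurrently). The URIs and table name below are
illustrative, and the pipeline is only printed, so no live servers are
needed to follow the sketch:

```bash
#!/bin/sh
# Illustrative only: build and print the per-table streaming pipeline.
# Replace the URIs and table, then run the printed command to execute it.
SRC='postgres://user@source/dbname'
TGT='postgres://user@target/dbname'
TABLE='public.pgbench_accounts'

PIPELINE="psql $SRC -c '\\copy $TABLE to stdout' | psql $TGT -c '\\copy $TABLE from stdin'"
printf '%s\n' "$PIPELINE"
```

Running the printed command copies that one table with no file in between;
pgcopydb adds snapshot consistency, catalog-driven ordering, and
concurrency across tables on top of this principle.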
In the context of `pg_restore`, requiring such access would not be
acceptable.

### For each table, build all indexes concurrently

The other aspect that `pg_dump` and `pg_restore` are not very smart about is
how they deal with the indexes that support constraints, in particular
unique constraints and primary keys.

Those indexes are exported using the `ALTER TABLE` command directly. This is
fine because the command creates both the constraint and the underlying
index, so the schema ends up as expected.

That said, those `ALTER TABLE ... ADD CONSTRAINT` commands require a level
of locking that prevents any concurrency. As we can read in the [docs for
ALTER TABLE](https://www.postgresql.org/docs/current/sql-altertable.html):

> Although most forms of ADD table_constraint require an ACCESS EXCLUSIVE
> lock, ADD FOREIGN KEY requires only a SHARE ROW EXCLUSIVE lock. Note that
> ADD FOREIGN KEY also acquires a SHARE ROW EXCLUSIVE lock on the referenced
> table, in addition to the lock on the table on which the constraint is
> declared.

The trick is then to first issue a `CREATE UNIQUE INDEX` statement and, once
the index has been built, to issue a second command in the form of `ALTER
TABLE ... ADD CONSTRAINT ... PRIMARY KEY USING INDEX ...`, as in the
following example taken from the logs of an actual `pgcopydb` run:

```
...
21:52:06 68898 INFO  COPY "demo"."tracking";
21:52:06 68899 INFO  COPY "demo"."client";
21:52:06 68899 INFO  Creating 2 indexes for table "demo"."client"
21:52:06 68906 INFO  CREATE UNIQUE INDEX client_pkey ON demo.client USING btree (client);
21:52:06 68907 INFO  CREATE UNIQUE INDEX client_pid_key ON demo.client USING btree (pid);
21:52:06 68898 INFO  Creating 1 indexes for table "demo"."tracking"
21:52:06 68908 INFO  CREATE UNIQUE INDEX tracking_pkey ON demo.tracking USING btree (client, ts);
21:52:06 68907 INFO  ALTER TABLE "demo"."client" ADD CONSTRAINT "client_pid_key" UNIQUE USING INDEX "client_pid_key";
21:52:06 68906 INFO  ALTER TABLE "demo"."client" ADD CONSTRAINT "client_pkey" PRIMARY KEY USING INDEX "client_pkey";
21:52:06 68908 INFO  ALTER TABLE "demo"."tracking" ADD CONSTRAINT "tracking_pkey" PRIMARY KEY USING INDEX "tracking_pkey";
...
```

This trick is worth a lot of performance gains on its own, as
[pgloader](https://github.com/dimitri/pgloader) users have already
discovered and appreciated.

## Dependencies

At run-time `pgcopydb` depends on the `pg_dump` and `pg_restore` tools being
available in the `PATH`. The versions of those tools should match the
Postgres version of the target database.

When you have multiple versions of Postgres installed, consider exporting
the `PG_CONFIG` environment variable to point to the version you want to
use. `pgcopydb` then runs `${PG_CONFIG} --bindir` to find the `pg_dump` and
`pg_restore` binaries it needs.

## Manual Steps

The `pgcopydb` command line also includes entry points that allow
implementing each step on its own.

  1. `pgcopydb snapshot &`
  2. `pgcopydb dump schema`
  3. `pgcopydb restore pre-data`
  4. `pgcopydb copy table-data`
  5. `pgcopydb copy blobs`
  6. `pgcopydb copy sequences`
  7. `pgcopydb copy indexes`
  8. `pgcopydb copy constraints`
  9. `pgcopydb restore post-data`
 10. `kill %1`

Using the individual commands fails to provide the advanced concurrency
capabilities of the main `pgcopydb clone` command, so it is strongly advised
to prefer that main command.

Also, when using separate commands, one has to consider the `--snapshot`
option, which allows for consistent operations. A background process should
then export the snapshot and keep a transaction open for the duration of the
operations. See the documentation for `pgcopydb snapshot`.

## Authors

* [Dimitri Fontaine](https://github.com/dimitri)

## License

Copyright (c) The PostgreSQL Global Development Group.

This project is licensed under the PostgreSQL License, see the LICENSE file
for details.

This project includes bundled third-party dependencies, see the NOTICE file
for details.