{"id":14483116,"url":"https://github.com/somasays/pgwarehouse","last_synced_at":"2025-08-30T03:33:11.357Z","repository":{"id":153209791,"uuid":"609712039","full_name":"somasays/pgwarehouse","owner":"somasays","description":"Easily sync your Postgres database to a Snowflake, ClickHouse, or DuckDB warehouse.","archived":false,"fork":false,"pushed_at":"2024-11-21T14:59:32.000Z","size":194,"stargazers_count":84,"open_issues_count":4,"forks_count":7,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-08-16T09:40:16.018Z","etag":null,"topics":["analytics","clickhouse","data-warehouse","postgres","postgresql","snowflake","synchronization","warehouse"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/somasays.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-03-05T01:59:39.000Z","updated_at":"2025-06-30T13:51:08.000Z","dependencies_parsed_at":"2024-01-17T08:44:09.336Z","dependency_job_id":"cc3b7e63-5c58-4f75-9ecd-5f292cb34188","html_url":"https://github.com/somasays/pgwarehouse","commit_stats":null,"previous_names":["scottpersinger/pgwarehouse"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/somasays/pgwarehouse","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/somasays%2Fpgwarehouse","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/somasays%2Fpgwarehouse/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/somasays%2Fpgwarehouse/releases"
,"manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/somasays%2Fpgwarehouse/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/somasays","download_url":"https://codeload.github.com/somasays/pgwarehouse/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/somasays%2Fpgwarehouse/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":272800743,"owners_count":24995138,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-30T02:00:09.474Z","response_time":77,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["analytics","clickhouse","data-warehouse","postgres","postgresql","snowflake","synchronization","warehouse"],"created_at":"2024-09-03T00:01:31.428Z","updated_at":"2025-08-30T03:33:11.029Z","avatar_url":"https://github.com/somasays.png","language":"Python","funding_links":[],"categories":["Integrations","Python"],"sub_categories":["ETL and Data Processing"],"readme":"# 🚨 **Important notice** 🚨\n\nI'm afraid I have no time to support this library. I'm a bit busy being CEO over at [Supercog.ai](https://supercog.ai). (Come over and \nsign up for our beta to get a smart AI assistant right in Slack 😁). If anyone wants to take this over this project please\nlet me know. 
I can transfer the repo or just give you ownership of the PyPI package.\n\n-------------\n\n# pgwarehouse - quickly sync Postgres data to your cloud warehouse\n\n## Introduction\n\nPostgres is an amazing, general-purpose OLTP database. But it's not designed for heavy analytic (OLAP) usage. Analytic queries are much better served by a columnar store database like Snowflake or ClickHouse.\n\nThis package allows you to easily sync data from a Postgres database into a local or cloud data warehouse (currently [Snowflake](https://docs.snowflake.com/), [ClickHouse](https://clickhouse-docs.vercel.app/docs/en/intro), or [DuckDB](https://duckdb.org/docs/)). You can perform a one-time sync operation, or run periodic incremental syncs to keep your warehouse up to date.\n\n## Features\n\n* High performance by using `COPY` to move lots of data efficiently. `pgwarehouse` can easily sync hundreds of millions of rows of data (tens of GB) per hour.\n* Supports multiple update strategies for immutable or mutable tables.\n* Easy to configure and run.\n\n## Installation\n\n    pip install pgwarehouse\n\nNow you need to configure credentials for your **Postgres** source and the warehouse destination.\n\nYou can place Postgres credentials either in your config file or in your environment. If using the environment you need to set these variables:\n\n    PGHOST\n    PGDATABASE\n    PGUSER\n    PGPASSWORD\n    PGSCHEMA (defaults to 'public')\n\n## Creating a config file\n\nRun this command to create a template config file:\n\n    pgwarehouse init\n\nThis will create a local `pgwarehouse_conf.yaml` file. 
Now you can edit your Postgres credentials in the `postgres` stanza of the config file:\n\n    postgres:\n        pghost: (defaults to $PGHOST)\n        pgdatabase: (defaults to $PGDATABASE)\n        pguser: (defaults to $PGUSER)\n        pgpassword: (defaults to $PGPASSWORD)\n        pgschema: (defaults to 'public')\n\n## Specifying the warehouse credentials\n\nAgain you can use the environment or the config file. Set one of these sets of variables in your environment:\n\n    CLICKHOUSE_HOST\n    CLICKHOUSE_DATABASE\n    CLICKHOUSE_USER\n    CLICKHOUSE_PWD\n    CLICKHOUSE_SECURE\n\nor\n\n    SNOWSQL_ACCOUNT\n    SNOWSQL_DATABASE\n    SNOWSQL_SCHEMA\n    SNOWSQL_WAREHOUSE\n    SNOWSQL_USER\n    SNOWSQL_ROLE\n    SNOWSQL_PWD\n\nor\n\n    DUCKDB_PATH (path to the duckdb database file)\n\n(The Snowflake parameters are the same as those for the [SnowSQL](https://docs.snowflake.com/en/user-guide/snowsql-start)\nCLI tool. The `SNOWSQL_ACCOUNT` value should be your \"account identifier\".)\n\nor set these values in the `warehouse` stanza in the config file:\n\n    warehouse:\n        backend: (clickhouse|snowflake)\n        clickhouse_host: \n        clickhouse_database: \n        clickhouse_user:\n        clickhouse_password:\n        clickhouse_secure:\n        --or--\n        snowsql_account:\n        snowsql_database:\n        snowsql_schema:\n        snowsql_warehouse:\n        snowsql_user:\n        snowsql_pwd:\n        --or--\n        duckdb_path:\n\n# Usage\n\nThe general way to run a sync:\n```\nsource .env.local; pgwarehouse --config .local.yaml sync users\n```\n\nOnce the credentials are configured you can start syncing data. Start by listing tables from the Postgres database:\n\n    pgwarehouse list\n\nAnd you can see which tables exist so far in the warehouse:\n\n    pgwarehouse listwh\n    \nNow use `sync` to sync a table (e.g. 
the 'users' table):\n\n    pgwarehouse sync users\n\nData will be downloaded from the Postgres database into CSV files on the local machine, and then those files will be uploaded to the warehouse. Running `pgwarehouse listwh` will show the new table.\n\n## Updating a table\n\nAfter the initial sync has run, you can update the warehouse table with new records by running `sync` again:\n\n    pgwarehouse sync users\n\nSee [update strategies](#table-update-strategies) for different ways to update your table on each sync.\n\n## Syncing multiple tables\n\nThere are two ways to manage multiple tables. The first is just to pass `all` in place of the table name:\n\n    pgwarehouse sync all\n\nThis will attempt to sync ALL tables from Postgres into the warehouse. This could take a while!\n\nThe other way is to specify the `tables` list in the config file:\n\n    tables:\n        - users\n        - charges\n        - logs\n\nNow when you specify `sync all` the tool will use the list of tables specified in the config file.\n\n**Pro tip!** You can add the `max_records` setting to your `postgres` configuration to limit the number\nof records copied per table. This can be useful for testing the initial sync in case you have some\nlarge tables. Set this value to something reasonable (like 10000) and then try synchronizing all\ntables to make sure they copy properly. Once you have verified the tables in the warehouse, you\ncan remove this setting, drop any large tables, and then copy them in full (just run `sync all` again).\n\n## Table update strategies\n\n#### New Records Only (default)\nThe default update strategy is \"new records only\". This is done by selecting records with a greater value\nfor their primary key column than the greatest value currently in the warehouse. This strategy is simple\nand quick, but only works for monotonically incrementing primary keys, and only finds new records.\n\n#### Reload each time\nAnother supported strategy is \"reload each time\". 
This is the simplest strategy: the\nentire table is reloaded every time we sync. This strategy should be fine for small-ish tables (like \u003c10m rows).\n\n#### Last Modified\nFinally, if your table has a `last modified` column then you can use the \"all modifications\" strategy.\nIn this case all records with a `last modified` timestamp greater than the maximum value found in the\nwarehouse will be selected and \"upserted\" into the warehouse. Records that are already present\n(via matching the primary key) will be updated, and new records will be inserted.\n\n* The Snowflake backend uses the [MERGE](https://docs.snowflake.com/en/sql-reference/sql/merge) operation. \n* The ClickHouse backend uses `ALTER TABLE .. DELETE` to remove matching records and then `INSERT` to insert the new values.\n\n### What about deletes?\n\nThere is no simple way to capture deletes - you have to reload the entire table. A common pattern is\nto apply new records on a daily basis, and reload the entire table every week to remove deleted records.\n\n### What if my table has no primary key?\n\nAll the update strategies except \"reload each time\" require your table to have a primary key column.\n\n## Specifying update strategy at the command line\n\n    pgwarehouse sync \u003ctable\u003e   (defaults to NEW RECORDS)\n    pgwarehouse sync \u003ctable\u003e last_modified=\u003clast modified column\u003e   (MODIFIED RECORDS)\n    pgwarehouse reload \u003ctable\u003e (reloads the whole table)\n\n## Specifying update strategy in the config file\n\nYou can configure the update strategy selectively for each table in the config file. 
To do so,\nspecify the table as a nested dictionary with options:\n\n    tables:\n        - accounts\n        - users:\n            reload: true\n        - orders:\n            last_modified: updated_at\n        - shoppers:\n            last_modified: update_time\n            reload: sun\n        - original_orders:\n            skip: true\n\nIn this example:\n\n* `accounts` will have new records only applied at each sync\n* `users` will be reloaded completely on each sync\n* `orders` will have modified records (found by the 'updated_at' column) applied on each sync\n* `shoppers` will have modified records applied on each sync, except for any sync\nthat happens on Sunday, in which case the entire table will be reloaded.\n* `original_orders` will be skipped entirely\n\nThe `reload` argument can take 3 forms:\n\n    reload: true    - reload the table every sync\n    reload: [sun,mon,tue,wed,thur,fri]  - reload if the sync occurs on this day of the week\n    reload: 1-31    - reload if the sync occurs on this numeric day of the month (don't use 31!)\n\n## Scheduling regular data syncs\n\n`pgwarehouse` does not include any scheduling itself; you will need an external trigger like\n`cron`, [Heroku Scheduler](https://devcenter.heroku.com/articles/scheduler), or a K8s\n[CronJob](https://kubernetes.io/docs/tasks/job/automated-tasks-with-cron-jobs/).\n\nWhen running, the tool will need access to local storage - potentially a lot if you are synchronizing\nbig tables. But nothing needs to persist between sync runs (except the config file) - the tool \nonly relies on state it can query from Postgres or the warehouse.\n\n## Troubleshooting\n\nSometimes when you are testing things out it can be helpful to do the sync in two phases:\n1) download the data, 2) upload the data. 
You can use `extract` and `load` for this:\n\n    pgwarehouse extract \u003ctable\u003e     - only downloads data\n    pgwarehouse load \u003ctable\u003e        - loads the data into the warehouse\n\nWhen the `extract` process runs, it stores data in `./pgw_data/\u003ctable name\u003e_data`. As\nfiles are uploaded they are moved into an `archive` subdirectory. When the **next sync**\nruns, this archive directory will be cleaned up. This allows you to examine\nthe downloaded CSV data in case the upload fails for some reason. \n\n## Development\n\nRequirements:\n```bash\nsudo apt install libpq-dev postgresql postgresql-contrib\n```\n\nRun tests:\n```poetry run python -m pytest```\n\n## Limitations\n\nColumn type mapping today is [very limited](https://github.com/scottpersinger/pgwarehouse/blob/a20dc316bbdbc78317cfdd796216a847919d8755/pgwarehouse/snowflake_backend.py). More esoteric column types like JSON or ARRAY are simply\nmapped as VARCHAR columns. Some of these types are supported in the warehouse and could be\nimplemented more accurately.\n\nComposite primary keys (using multiple columns) have limited support. Today they will only work\nwith the RELOAD strategy.\n\nNon-numeric primary key types (like UUIDs) probably won't work unless they have a good lexicographic\nsort that supports a `\u003e` where clause.\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsomasays%2Fpgwarehouse","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsomasays%2Fpgwarehouse","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsomasays%2Fpgwarehouse/lists"}