{"id":14447617,"url":"https://github.com/paradedb/pg_analytics","last_synced_at":"2025-03-24T21:02:55.429Z","repository":{"id":239073799,"uuid":"798430703","full_name":"paradedb/pg_analytics","owner":"paradedb","description":"DuckDB-powered data lake analytics from Postgres","archived":false,"fork":false,"pushed_at":"2025-03-05T15:40:32.000Z","size":765,"stargazers_count":517,"open_issues_count":0,"forks_count":21,"subscribers_count":5,"default_branch":"dev","last_synced_at":"2025-03-17T19:52:35.160Z","etag":null,"topics":["analytics","arrow","big-data","columnar","database","datafusion","datalake","deltalake","duckdb","iceberg","lakehouse","lakehouse-platform","object-storage","olap","paradedb","parquet","postgres","postgresql","realtime-analytics","sql"],"latest_commit_sha":null,"homepage":"https://paradedb.com","language":"Rust","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"postgresql","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/paradedb.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":["paradedb"],"patreon":null,"open_collective":null,"ko_fi":null,"tidelift":null,"community_bridge":null,"liberapay":null,"issuehunt":null,"otechie":null,"custom":null}},"created_at":"2024-05-09T19:01:45.000Z","updated_at":"2025-03-17T14:26:33.000Z","dependencies_parsed_at":"2024-05-09T21:23:43.783Z","dependency_job_id":"b3ae6395-1352-4028-8a82-d1b720e09510","html_url":"https://github.com/paradedb/pg_analytics","commit_stats":{"total_commits":83,"total_committers":11,"mean_commits":7.545454545454546,"dds":0.469879518072
2891,"last_synced_commit":"95785179ec8f4058fea70bcf1a6e9e8d10a85a41"},"previous_names":["paradedb/pg_analytics"],"tags_count":18,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/paradedb%2Fpg_analytics","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/paradedb%2Fpg_analytics/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/paradedb%2Fpg_analytics/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/paradedb%2Fpg_analytics/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/paradedb","download_url":"https://codeload.github.com/paradedb/pg_analytics/tar.gz/refs/heads/dev","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245351747,"owners_count":20601090,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["analytics","arrow","big-data","columnar","database","datafusion","datalake","deltalake","duckdb","iceberg","lakehouse","lakehouse-platform","object-storage","olap","paradedb","parquet","postgres","postgresql","realtime-analytics","sql"],"created_at":"2024-09-01T07:01:32.118Z","updated_at":"2025-03-24T21:02:55.413Z","avatar_url":"https://github.com/paradedb.png","language":"Rust","readme":"\u003ch1 align=\"center\"\u003e\n  \u003cimg src=\"assets/pg_analytics.svg\" alt=\"pg_analytics\"\u003e\n\u003cbr\u003e\n\u003c/h1\u003e\n\n\u003e **Notice**\n\u003e\n\u003e The `paradedb/pg_analytics` extension has been discontinued and is archived. 
This decision was made because ParadeDB's work on Postgres analytics is now being done in our primary extension, `pg_search`. If you are looking for fast analytics on Postgres, we recommend you check out our [`paradedb/paradedb`](https://github.com/paradedb/paradedb) repository.\n\u003e\n\u003e The code in this repository is no longer maintained.\n\u003e\n\u003e [Learn more](https://github.com/paradedb/paradedb/blob/dev/README.md).\n\n[![Artifact Hub](https://img.shields.io/endpoint?url=https://artifacthub.io/badge/repository/paradedb)](https://artifacthub.io/packages/search?repo=paradedb)\n[![Docker Pulls](https://img.shields.io/docker/pulls/paradedb/paradedb)](https://hub.docker.com/r/paradedb/paradedb)\n[![License](https://img.shields.io/badge/License-PostgreSQL-blue)](https://github.com/paradedb/pg_analytics?tab=PostgreSQL-1-ov-file#readme)\n[![Slack URL](https://img.shields.io/badge/Join%20Slack-purple?logo=slack\u0026link=https%3A%2F%2Fjoin.slack.com%2Ft%2Fparadedbcommunity%2Fshared_invite%2Fzt-2lkzdsetw-OiIgbyFeiibd1DG~6wFgTQ)](https://join.slack.com/t/paradedbcommunity/shared_invite/zt-2lkzdsetw-OiIgbyFeiibd1DG~6wFgTQ)\n[![X URL](https://img.shields.io/twitter/url?url=https%3A%2F%2Ftwitter.com%2Fparadedb\u0026label=Follow%20%40paradedb)](https://x.com/paradedb)\n\n## Overview\n\n`pg_analytics` (formerly named `pg_lakehouse`) puts DuckDB inside Postgres. With `pg_analytics` installed, Postgres can query foreign object stores like AWS S3 and table formats like Iceberg or Delta Lake. Queries are pushed down to DuckDB, a high performance analytical query engine.\n\n`pg_analytics` uses DuckDB v1.1.0 and is supported on Postgres 13+.\n\n### Motivation\n\nToday, a vast amount of non-operational data — events, metrics, historical snapshots, vendor data, etc. — is ingested into data lakes like AWS S3. Querying this data by moving it into a cloud data warehouse or operating a new query engine is expensive and time-consuming. 
The goal of `pg_analytics` is to enable this data to be queried directly from Postgres. This avoids new infrastructure, loss of data freshness, data movement, and the non-Postgres dialects of other query engines.\n\n`pg_analytics` uses the foreign data wrapper (FDW) API to connect to any object store or table format and the executor hook API to push queries to DuckDB. While other FDWs like `aws_s3` have existed in the Postgres extension ecosystem, these FDWs suffer from two limitations:\n\n1. Lack of support for most object stores and table formats\n2. Too slow over large datasets to be a viable analytical engine\n\n`pg_analytics` differentiates itself by supporting a wide breadth of stores and formats and by being very fast (thanks to DuckDB).\n\n### Roadmap\n\n- [x] Read support for `pg_analytics`\n- [x] `EXPLAIN` support\n- [x] `VIEW` support\n- [x] Automatic schema detection\n\n#### Object Stores\n\n- [x] AWS S3\n- [x] S3-compatible stores (MinIO, R2)\n- [x] Google Cloud Storage\n- [x] Azure Blob Storage\n- [x] Azure Data Lake Storage Gen2\n- [x] Hugging Face (`.parquet`, `.csv`, `.jsonl`)\n- [x] HTTP server\n- [x] Local file system\n\n#### File/Table Formats\n\n- [x] Parquet\n- [x] CSV\n- [x] JSON\n- [x] Geospatial (`.geojson`, `.xlsx`)\n- [x] Delta Lake\n- [x] Apache Iceberg\n\n## Installation\n\n### From ParadeDB\n\nThe easiest way to use the extension is to run the ParadeDB Docker image:\n\n```bash\ndocker run --name paradedb -e POSTGRES_PASSWORD=password paradedb/paradedb\ndocker exec -it paradedb psql -U postgres\n```\n\nThis will spin up a PostgreSQL 16 instance with `pg_analytics` preinstalled.\n\n### From Self-Hosted PostgreSQL\n\nBecause this extension uses Postgres hooks to intercept and push queries down to DuckDB, it is **very important** that it is added to `shared_preload_libraries` inside `postgresql.conf`.\n\n```bash\n# Inside postgresql.conf\nshared_preload_libraries = 'pg_analytics'\n```\n\nThis ensures the best query performance 
from the extension.\n\n#### Linux \u0026 macOS\n\nWe provide prebuilt binaries for macOS, Debian, Ubuntu, and Red Hat Enterprise Linux for Postgres 14+. You can download the latest version for your architecture from the [GitHub Releases page](https://github.com/paradedb/paradedb/releases).\n\n#### Windows\n\nWindows is not supported. This restriction is [inherited from pgrx not supporting Windows](https://github.com/pgcentralfoundation/pgrx?tab=readme-ov-file#caveats--known-issues).\n\n## Usage\n\nThe following example uses `pg_analytics` to query an example dataset of 3 million NYC taxi trips from January 2024, hosted in a public `us-east-1` S3 bucket provided by ParadeDB.\n\n```sql\nCREATE EXTENSION pg_analytics;\nCREATE FOREIGN DATA WRAPPER parquet_wrapper HANDLER parquet_fdw_handler VALIDATOR parquet_fdw_validator;\n\n-- Provide S3 credentials\nCREATE SERVER parquet_server FOREIGN DATA WRAPPER parquet_wrapper;\n\n-- Create foreign table with auto schema creation\nCREATE FOREIGN TABLE trips ()\nSERVER parquet_server\nOPTIONS (files 's3://paradedb-benchmarks/yellow_tripdata_2024-01.parquet');\n\n-- Success! 
Now you can query the remote Parquet file like a regular Postgres table\nSELECT COUNT(*) FROM trips;\n  count\n---------\n 2964624\n(1 row)\n```\n\n## Documentation\n\nComplete documentation for `pg_analytics` can be found [here](https://docs.paradedb.com/integrations/overview).\n\n## Development\n\n### Install Rust\n\nTo develop the extension, first install Rust via `rustup`.\n\n```bash\ncurl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh\nrustup install \u003cversion\u003e\n\nrustup default \u003cversion\u003e\n```\n\nNote: While it is possible to install Rust via your package manager, we recommend using `rustup` as we've observed inconsistencies with Homebrew's Rust installation on macOS.\n\n### Install Dependencies\n\nBefore compiling the extension, you'll need to have the following dependencies installed.\n\n```bash\n# macOS\nbrew install make gcc pkg-config openssl\n\n# Ubuntu\nsudo apt-get install -y make gcc pkg-config libssl-dev libclang-dev\n\n# Arch Linux\nsudo pacman -S core/openssl extra/clang\n```\n\n### Install Postgres\n\n```bash\n# macOS\nbrew install postgresql@17\n\n# Ubuntu\nwget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -\nsudo sh -c 'echo \"deb http://apt.postgresql.org/pub/repos/apt/ $(lsb_release -cs)-pgdg main\" \u003e /etc/apt/sources.list.d/pgdg.list'\nsudo apt-get update \u0026\u0026 sudo apt-get install -y postgresql-17 postgresql-server-dev-17\n\n# Arch Linux\nsudo pacman -S extra/postgresql\n```\n\nIf you are using Postgres.app to manage your macOS PostgreSQL, you'll need to add the `pg_config` binary to your path before continuing:\n\n```bash\nexport PATH=\"$PATH:/Applications/Postgres.app/Contents/Versions/latest/bin\"\n```\n\n### Install pgrx\n\nThen, install and initialize `pgrx`:\n\n```bash\n# Note: Replace --pg17 with your version of Postgres, if different (e.g. --pg16)\ncargo install --locked cargo-pgrx --version 0.12.7\n\n# macOS arm64\ncargo pgrx init --pg17=/opt/homebrew/opt/postgresql@17/bin/pg_config\n\n# macOS amd64\ncargo pgrx init --pg17=/usr/local/opt/postgresql@17/bin/pg_config\n\n# Ubuntu\ncargo pgrx init --pg17=/usr/lib/postgresql/17/bin/pg_config\n\n# Arch Linux\ncargo pgrx init --pg17=/usr/bin/pg_config\n```\n\nIf you prefer to use a different version of Postgres, update the `--pg` flag accordingly.\n\n### Running the Extension\n\nFirst, start pgrx:\n\n```bash\ncargo pgrx run\n```\n\nThis will launch an interactive connection to Postgres. Inside Postgres, create the extension by running:\n\n```sql\nCREATE EXTENSION pg_analytics;\n```\n\nYou now have access to all the extension functions.\n\n### Modifying the Extension\n\nIf you make changes to the extension code, follow these steps to update it:\n\n1. Recompile the extension:\n\n```bash\ncargo pgrx run\n```\n\n2. Recreate the extension to load the latest changes:\n\n```sql\nDROP EXTENSION pg_analytics;\nCREATE EXTENSION pg_analytics;\n```\n\n### Running Tests\n\nWe use `cargo test` as our runner for `pg_analytics` tests. Tests are conducted using [testcontainers](https://github.com/testcontainers/testcontainers-rs) to manage testing containers like [LocalStack](https://hub.docker.com/r/localstack/localstack). `testcontainers` will pull any Docker images that it requires to perform the test.\n\nYou also need a running Postgres instance to run the tests. The test suite will look for a connection string on the `DATABASE_URL` environment variable. 
You can set this variable manually, or use a `.env` file with contents like this:\n\n```text\nDATABASE_URL=postgres://\u003cusername\u003e@\u003chost\u003e:\u003cport\u003e/\u003cdatabase\u003e\n```\n\n## License\n\n`pg_analytics` is licensed under the [PostgreSQL License](https://www.postgresql.org/about/licence/).\n","funding_links":["https://github.com/sponsors/paradedb"],"categories":["Rust","Client-Server Setups"],"sub_categories":["Web Clients (WebAssembly)"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fparadedb%2Fpg_analytics","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fparadedb%2Fpg_analytics","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fparadedb%2Fpg_analytics/lists"}