{"id":17508139,"url":"https://github.com/CrunchyData/pg_parquet","last_synced_at":"2025-03-05T13:31:46.697Z","repository":{"id":258615413,"uuid":"852463155","full_name":"CrunchyData/pg_parquet","owner":"CrunchyData","description":"Copy to/from Parquet in S3 or Azure Blob Storage from within PostgreSQL","archived":false,"fork":false,"pushed_at":"2025-02-24T14:32:31.000Z","size":593,"stargazers_count":436,"open_issues_count":21,"forks_count":16,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-03-03T11:12:29.841Z","etag":null,"topics":["columnar","data-ingestion","data-migration","parquet","postgresql"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/CrunchyData.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-09-04T21:11:44.000Z","updated_at":"2025-02-27T21:01:35.000Z","dependencies_parsed_at":"2024-11-17T14:31:10.088Z","dependency_job_id":"edc2c7ac-8938-4bf9-99a5-7d3db56086f9","html_url":"https://github.com/CrunchyData/pg_parquet","commit_stats":null,"previous_names":["crunchydata/pg_parquet"],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CrunchyData%2Fpg_parquet","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CrunchyData%2Fpg_parquet/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CrunchyData%2Fpg_parquet/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Crun
chyData%2Fpg_parquet/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/CrunchyData","download_url":"https://codeload.github.com/CrunchyData/pg_parquet/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":242035059,"owners_count":20061247,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["columnar","data-ingestion","data-migration","parquet","postgresql"],"created_at":"2024-10-20T04:01:48.248Z","updated_at":"2025-03-05T13:31:46.690Z","avatar_url":"https://github.com/CrunchyData.png","language":"Rust","funding_links":[],"categories":["Rust"],"sub_categories":[],"readme":"# pg_parquet\n\n\u003e Copy from/to Parquet files in PostgreSQL!\n\n[![CI lints and tests](https://github.com/CrunchyData/pg_parquet/actions/workflows/ci.yml/badge.svg)](https://github.com/CrunchyData/pg_parquet/actions/workflows/ci.yml)\n[![codecov](https://codecov.io/gh/CrunchyData/pg_parquet/graph/badge.svg?token=6BPS0DSKJ2)](https://codecov.io/gh/CrunchyData/pg_parquet)\n\n`pg_parquet` is a PostgreSQL extension that allows you to read and write [Parquet files](https://parquet.apache.org), which are located in `S3` or `file system`, from PostgreSQL via `COPY TO/FROM` commands. 
It depends on [Apache Arrow](https://arrow.apache.org/rust/arrow/) project to read and write Parquet files and [pgrx](https://github.com/pgcentralfoundation/pgrx) project to extend PostgreSQL's `COPY` command.\n\n```sql\n-- Copy a query result into Parquet in S3\nCOPY (SELECT * FROM table) TO 's3://mybucket/data.parquet' WITH (format 'parquet');\n\n-- Load data from Parquet in S3\nCOPY table FROM 's3://mybucket/data.parquet' WITH (format 'parquet');\n```\n\n## Quick Reference\n- [Installation From Source](#installation-from-source)\n- [Usage](#usage)\n  - [Copy FROM/TO Parquet files TO/FROM Postgres tables](#copy-tofrom-parquet-files-fromto-postgres-tables)\n  - [Inspect Parquet schema](#inspect-parquet-schema)\n  - [Inspect Parquet metadata](#inspect-parquet-metadata)\n- [Object Store Support](#object-store-support)\n- [Copy Options](#copy-options)\n- [Configuration](#configuration)\n- [Supported Types](#supported-types)\n  - [Nested Types](#nested-types)\n- [Postgres Support Matrix](#postgres-support-matrix)\n\n## Installation From Source\nAfter installing `Postgres`, you need to set up `rustup`, `cargo-pgrx` to build the extension.\n\n```bash\n# install rustup\n\u003e curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh\n\n# install cargo-pgrx\n\u003e cargo install cargo-pgrx\n\n# configure pgrx\n\u003e cargo pgrx init --pg17 $(which pg_config)\n\n# append the extension to shared_preload_libraries in ~/.pgrx/data-17/postgresql.conf \n\u003e echo \"shared_preload_libraries = 'pg_parquet'\" \u003e\u003e ~/.pgrx/data-17/postgresql.conf\n\n# run cargo-pgrx to build and install the extension\n\u003e cargo pgrx run\n\n# create the extension in the database\npsql\u003e \"CREATE EXTENSION pg_parquet;\"\n```\n\n## Usage\nThere are mainly 3 things that you can do with `pg_parquet`:\n1. You can export Postgres tables/queries to Parquet files,\n2. You can ingest data from Parquet files to Postgres tables,\n3. 
You can inspect the schema and metadata of Parquet files.\n\n### COPY to/from Parquet files from/to Postgres tables\nYou can use PostgreSQL's `COPY` command to read and write Parquet files. Below is an example of how to write a PostgreSQL table, with complex types, into a Parquet file and then to read the Parquet file content back into the same table.\n\n```sql\n-- create composite types\nCREATE TYPE product_item AS (id INT, name TEXT, price float4);\nCREATE TYPE product AS (id INT, name TEXT, items product_item[]);\n\n-- create a table with complex types\nCREATE TABLE product_example (\n    id int,\n    product product,\n    products product[],\n    created_at TIMESTAMP,\n    updated_at TIMESTAMPTZ\n);\n\n-- insert some rows into the table\ninsert into product_example values (\n    1,\n    ROW(1, 'product 1', ARRAY[ROW(1, 'item 1', 1.0), ROW(2, 'item 2', 2.0), NULL]::product_item[])::product,\n    ARRAY[ROW(1, NULL, NULL)::product, NULL],\n    now(),\n    '2022-05-01 12:00:00-04'\n);\n\n-- copy the table to a parquet file\nCOPY product_example TO '/tmp/product_example.parquet' (format 'parquet', compression 'gzip');\n\n-- show table\nSELECT * FROM product_example;\n\n-- copy the parquet file to the table\nCOPY product_example FROM '/tmp/product_example.parquet';\n\n-- show table\nSELECT * FROM product_example;\n```\n\n### Inspect Parquet schema\nYou can call `SELECT * FROM parquet.schema(\u003curi\u003e)` to discover the schema of the Parquet file at given uri.\n\n```sql\nSELECT * FROM parquet.schema('/tmp/product_example.parquet') LIMIT 10;\n             uri              |     name     | type_name  | type_length | repetition_type | num_children | converted_type | scale | precision | field_id | logical_type \n------------------------------+--------------+------------+-------------+-----------------+--------------+----------------+-------+-----------+----------+--------------\n /tmp/product_example.parquet | arrow_schema |            |             |                 
|            5 |                |       |           |          | \n /tmp/product_example.parquet | id           | INT32      |             | OPTIONAL        |              |                |       |           |        0 | \n /tmp/product_example.parquet | product      |            |             | OPTIONAL        |            3 |                |       |           |        1 | \n /tmp/product_example.parquet | id           | INT32      |             | OPTIONAL        |              |                |       |           |        2 | \n /tmp/product_example.parquet | name         | BYTE_ARRAY |             | OPTIONAL        |              | UTF8           |       |           |        3 | STRING\n /tmp/product_example.parquet | items        |            |             | OPTIONAL        |            1 | LIST           |       |           |        4 | LIST\n /tmp/product_example.parquet | list         |            |             | REPEATED        |            1 |                |       |           |          | \n /tmp/product_example.parquet | element        |            |             | OPTIONAL        |            3 |                |       |           |        5 | \n /tmp/product_example.parquet | id           | INT32      |             | OPTIONAL        |              |                |       |           |        6 | \n /tmp/product_example.parquet | name         | BYTE_ARRAY |             | OPTIONAL        |              | UTF8           |       |           |        7 | STRING\n(10 rows)\n```\n\n### Inspect Parquet metadata\nYou can call `SELECT * FROM parquet.metadata(\u003curi\u003e)` to discover the detailed metadata of the Parquet file, such as column statistics, at given uri.\n\n```sql\nSELECT uri, row_group_id, row_group_num_rows, row_group_num_columns, row_group_bytes, column_id, file_offset, num_values, path_in_schema, type_name FROM parquet.metadata('/tmp/product_example.parquet') LIMIT 1;\n             uri              | row_group_id | row_group_num_rows | 
row_group_num_columns | row_group_bytes | column_id | file_offset | num_values | path_in_schema | type_name \n------------------------------+--------------+--------------------+-----------------------+-----------------+-----------+-------------+------------+----------------+-----------\n /tmp/product_example.parquet |            0 |                  1 |                    13 |             842 |         0 |           0 |          1 | id             | INT32\n(1 row)\n```\n\n```sql\nSELECT stats_null_count, stats_distinct_count, stats_min, stats_max, compression, encodings, index_page_offset, dictionary_page_offset, data_page_offset, total_compressed_size, total_uncompressed_size FROM parquet.metadata('/tmp/product_example.parquet') LIMIT 1;\n stats_null_count | stats_distinct_count | stats_min | stats_max |    compression     |        encodings         | index_page_offset | dictionary_page_offset | data_page_offset | total_compressed_size | total_uncompressed_size \n------------------+----------------------+-----------+-----------+--------------------+--------------------------+-------------------+------------------------+------------------+-----------------------+-------------------------\n                0 |                      | 1         | 1         | GZIP(GzipLevel(6)) | PLAIN,RLE,RLE_DICTIONARY |                   |                      4 |               42 |                   101 |                      61\n(1 row)\n```\n\nYou can call `SELECT * FROM parquet.file_metadata(\u003curi\u003e)` to discover file level metadata of the Parquet file, such as format version, at given uri.\n\n```sql\nSELECT * FROM parquet.file_metadata('/tmp/product_example.parquet')\n             uri              | created_by | num_rows | num_row_groups | format_version \n------------------------------+------------+----------+----------------+----------------\n /tmp/product_example.parquet | pg_parquet |        1 |              1 | 1\n(1 row)\n```\n\nYou can call `SELECT * FROM 
parquet.kv_metadata(\u003curi\u003e)` to query custom key-value metadata of the Parquet file at the given uri.\n\n```sql\nSELECT uri, encode(key, 'escape') as key, encode(value, 'escape') as value FROM parquet.kv_metadata('/tmp/product_example.parquet');\n             uri              |     key      |    value\n------------------------------+--------------+---------------------\n /tmp/product_example.parquet | ARROW:schema | /////5gIAAAQAAAA ...\n(1 row)\n```\n\n## Object Store Support\n`pg_parquet` supports reading and writing Parquet files from/to `S3` and `Azure Blob Storage` object stores.\n\n\u003e [!NOTE]\n\u003e To write to an object store location, you need to grant the `parquet_object_store_write` role to your current postgres user.\n\u003e Similarly, to read from an object store location, you need to grant the `parquet_object_store_read` role to your current postgres user.\n\n#### S3 Storage\n\nThe simplest way to configure object storage is by creating the standard `~/.aws/credentials` and `~/.aws/config` files:\n\n```bash\n$ cat ~/.aws/credentials\n[default]\naws_access_key_id = AKIAIOSFODNN7EXAMPLE\naws_secret_access_key = wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY\n\n$ cat ~/.aws/config \n[default]\nregion = eu-central-1\n```\n\nAlternatively, you can use the following environment variables when starting postgres to configure the S3 client:\n- `AWS_ACCESS_KEY_ID`: the access key ID of the AWS account\n- `AWS_SECRET_ACCESS_KEY`: the secret access key of the AWS account\n- `AWS_SESSION_TOKEN`: the session token for the AWS account\n- `AWS_REGION`: the default region of the AWS account\n- `AWS_ENDPOINT_URL`: the endpoint\n- `AWS_SHARED_CREDENTIALS_FILE`: an alternative location for the credentials file **(only via environment variables)**\n- `AWS_CONFIG_FILE`: an alternative location for the config file **(only via environment variables)**\n- `AWS_PROFILE`: the name of the profile from the credentials and config file (default profile name is `default`) 
**(only via environment variables)**\n- `AWS_ALLOW_HTTP`: allows http endpoints **(only via environment variables)**\n\nConfig source priority order is shown below:\n1. Environment variables,\n2. Config file.\n\nSupported S3 uri formats are shown below:\n- s3:// \\\u003cbucket\\\u003e / \\\u003cpath\\\u003e\n- https:// \\\u003cbucket\\\u003e.s3.amazonaws.com / \\\u003cpath\\\u003e\n- https:// s3.amazonaws.com / \\\u003cbucket\\\u003e / \\\u003cpath\\\u003e\n\nSupported authorization methods' priority order is shown below:\n1. Temporary session tokens by assuming roles,\n2. Long term credentials.\n\n#### Azure Blob Storage\n\nThe simplest way to configure object storage is by creating the standard [`~/.azure/config`](https://learn.microsoft.com/en-us/cli/azure/azure-cli-configuration?view=azure-cli-latest) file:\n\n```bash\n$ cat ~/.azure/config\n[storage]\naccount = devstoreaccount1\nkey = Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==\n```\n\nAlternatively, you can use the following environment variables when starting postgres to configure the Azure Blob Storage client:\n- `AZURE_STORAGE_ACCOUNT`: the storage account name of the Azure Blob\n- `AZURE_STORAGE_KEY`: the storage key of the Azure Blob\n- `AZURE_STORAGE_CONNECTION_STRING`: the connection string for the Azure Blob (overrides any other config)\n- `AZURE_STORAGE_SAS_TOKEN`: the storage SAS token for the Azure Blob\n- `AZURE_TENANT_ID`: the tenant id for client secret auth **(only via environment variables)**\n- `AZURE_CLIENT_ID`: the client id for client secret auth **(only via environment variables)**\n- `AZURE_CLIENT_SECRET`: the client secret for client secret auth **(only via environment variables)**\n- `AZURE_STORAGE_ENDPOINT`: the endpoint **(only via environment variables)**\n- `AZURE_CONFIG_FILE`: an alternative location for the config file **(only via environment variables)**\n- `AZURE_ALLOW_HTTP`: allows http endpoints **(only via environment 
variables)**\n\nConfig source priority order is shown below:\n1. Connection string (read from environment variable or config file),\n2. Environment variables,\n3. Config file.\n\nSupported Azure Blob Storage uri formats are shown below:\n- az:// \\\u003ccontainer\\\u003e / \\\u003cpath\\\u003e\n- azure:// \\\u003ccontainer\\\u003e / \\\u003cpath\\\u003e\n- https:// \\\u003caccount\\\u003e.blob.core.windows.net / \\\u003ccontainer\\\u003e\n\nSupported authorization methods' priority order is shown below:\n1. Bearer token via client secret,\n2. SAS token,\n3. Storage key.\n\n## Copy Options\n`pg_parquet` supports the following options in the `COPY TO` command:\n- `format parquet`: you need to specify this option to read or write Parquet files whose names do not end with the `.parquet[.\u003ccompression\u003e]` extension,\n- `row_group_size \u003cint\u003e`: the number of rows in each row group while writing Parquet files. The default row group size is `122880`,\n- `row_group_size_bytes \u003cint\u003e`: the total byte size of rows in each row group while writing Parquet files. The default row group size bytes is `row_group_size * 1024`,\n- `compression \u003cstring\u003e`: the compression format to use while writing Parquet files. The supported compression formats are `uncompressed`, `snappy`, `gzip`, `brotli`, `lz4`, `lz4raw` and `zstd`. The default compression format is `snappy`. If not specified, the compression format is determined by the file extension,\n- `compression_level \u003cint\u003e`: the compression level to use while writing Parquet files. Compression levels are supported only for the `gzip`, `zstd` and `brotli` compression formats. 
The default compression level is `6` for `gzip (0-10)`, `1` for `zstd (1-22)` and `1` for `brotli (0-11)`.\n\n`pg_parquet` supports the following options in the `COPY FROM` command:\n- `format parquet`: you need to specify this option to read or write Parquet files whose names do not end with the `.parquet[.\u003ccompression\u003e]` extension,\n- `match_by \u003cstring\u003e`: method to match Parquet file fields to PostgreSQL table columns. The available methods are `position` and `name`. The default method is `position`. You can set it to `name` to match the columns by their name rather than by their position in the schema. Matching by `name` is useful when field order differs between the Parquet file and the table, but their names match.\n\n## Configuration\nThere is currently only one GUC parameter, which enables or disables `pg_parquet`:\n- `pg_parquet.enable_copy_hooks`: you can set this parameter to `on` or `off` to enable or disable the `pg_parquet` extension. The default value is `on`.\n\n## Supported Types\n`pg_parquet` has rich type support, including PostgreSQL's primitive, array, and composite types. 
Below is the table of the supported types in PostgreSQL and their corresponding Parquet types.\n\n| PostgreSQL Type   | Parquet Physical Type     | Logical Type     |\n|-------------------|---------------------------|------------------|\n| `bool`            | BOOLEAN                   |                  |\n| `smallint`        | INT16                     |                  |\n| `integer`         | INT32                     |                  |\n| `bigint`          | INT64                     |                  |\n| `real`            | FLOAT                     |                  |\n| `oid`             | INT32                     |                  |\n| `double`          | DOUBLE                    |                  |\n| `numeric`(1)      | FIXED_LEN_BYTE_ARRAY(16)  | DECIMAL(128)     |\n| `text`            | BYTE_ARRAY                | STRING           |\n| `json`            | BYTE_ARRAY                | STRING           |\n| `bytea`           | BYTE_ARRAY                |                  |\n| `date` (2)        | INT32                     | DATE             |\n| `timestamp`       | INT64                     | TIMESTAMP_MICROS |\n| `timestamptz` (3) | INT64                     | TIMESTAMP_MICROS |\n| `time`            | INT64                     | TIME_MICROS      |\n| `timetz`(3)       | INT64                     | TIME_MICROS      |\n| `geometry`(4)     | BYTE_ARRAY                |                  |\n\n### Nested Types\n| PostgreSQL Type   | Parquet Physical Type     | Logical Type     |\n|-------------------|---------------------------|------------------|\n| `composite`       | GROUP                     | STRUCT           |\n| `array`           | element's physical type   | LIST             |\n| `crunchy_map`(5)  | GROUP                     | MAP              |\n\n\u003e [!WARNING]\n\u003e - (1) The `numeric` type is written to the Parquet file with the smallest possible memory width, as follows:\n\u003e    * `numeric(P \u003c= 9, S)` is represented as `INT32` with 
`DECIMAL` logical type\n\u003e    * `numeric(9 \u003c P \u003c= 18, S)` is represented as `INT64` with `DECIMAL` logical type\n\u003e    * `numeric(18 \u003c P \u003c= 38, S)` is represented as `FIXED_LEN_BYTE_ARRAY(9-16)` with `DECIMAL` logical type\n\u003e    * `numeric(38 \u003c P, S)` is represented as `BYTE_ARRAY` with `STRING` logical type\n\u003e    * `numeric` without a specified precision and scale is allowed by Postgres. Such values are represented with a default precision (38) and scale (9) instead of being written as strings. You get a runtime error if your table tries to read or write a numeric value that does not fit the default precision and scale (29 integral digits before the decimal point, 9 digits after the decimal point).\n\u003e - (2) The `date` type is represented according to `Unix epoch` when writing to Parquet files. It is converted back according to `PostgreSQL epoch` when reading from Parquet files.\n\u003e - (3) The `timestamptz` and `timetz` types are adjusted to `UTC` when writing to Parquet files. They are converted back with `UTC` timezone when reading from Parquet files.\n\u003e - (4) The `geometry` type is represented as `BYTE_ARRAY` encoded as `WKB`, specified by [geoparquet spec](https://geoparquet.org/releases/v1.1.0/), when the `postgis` extension is created. Otherwise, it is represented as `BYTE_ARRAY` with `STRING` logical type.\n\u003e - (5) `crunchy_map` is dependent on functionality provided by [Crunchy Bridge](https://www.crunchydata.com/products/crunchy-bridge). The `crunchy_map` type is represented as `GROUP` with `MAP` logical type when the `crunchy_map` extension is created. Otherwise, it is represented as `BYTE_ARRAY` with `STRING` logical type.\n\n\u003e [!WARNING]\n\u003e Any type that does not have a corresponding Parquet type will be represented, as a fallback mechanism, as `BYTE_ARRAY` with `STRING` logical type. e.g. 
`enum`\n\n## Postgres Support Matrix\n`pg_parquet` supports the following PostgreSQL versions:\n| PostgreSQL Major Version | Supported |\n|--------------------------|-----------|\n| 14                       |    ✅     |\n| 15                       |    ✅     |\n| 16                       |    ✅     |\n| 17                       |    ✅     |\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FCrunchyData%2Fpg_parquet","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FCrunchyData%2Fpg_parquet","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FCrunchyData%2Fpg_parquet/lists"}