{"id":20643862,"url":"https://github.com/adjust/parquet_fdw","last_synced_at":"2025-04-05T05:06:38.796Z","repository":{"id":37549614,"uuid":"159898403","full_name":"adjust/parquet_fdw","owner":"adjust","description":"Parquet foreign data wrapper for PostgreSQL","archived":false,"fork":false,"pushed_at":"2024-08-27T12:35:51.000Z","size":436,"stargazers_count":371,"open_issues_count":28,"forks_count":38,"subscribers_count":49,"default_branch":"master","last_synced_at":"2025-03-29T04:05:30.182Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"postgresql","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/adjust.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-12-01T01:49:02.000Z","updated_at":"2025-03-12T23:19:10.000Z","dependencies_parsed_at":"2024-06-03T17:49:49.299Z","dependency_job_id":"b3f88f79-b18d-4c99-93a8-8a96d7cde33a","html_url":"https://github.com/adjust/parquet_fdw","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adjust%2Fparquet_fdw","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adjust%2Fparquet_fdw/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adjust%2Fparquet_fdw/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adjust%2Fparquet_fdw/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/adjust","download_url":"https://codeload.github.com/adjust/pa
rquet_fdw/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247289427,"owners_count":20914464,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-16T16:14:08.991Z","updated_at":"2025-04-05T05:06:38.776Z","avatar_url":"https://github.com/adjust.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![build](https://github.com/adjust/parquet_fdw/actions/workflows/ci.yml/badge.svg)](https://github.com/adjust/parquet_fdw/actions/workflows/ci.yml) ![experimental](https://img.shields.io/badge/status-experimental-orange)\n\n# parquet_fdw\n\nRead-only Apache Parquet foreign data wrapper for PostgreSQL.\n\n## Installation\n\n`parquet_fdw` requires `libarrow` and `libparquet` installed in your system (requires version 0.15+, for previous versions use branch [arrow-0.14](https://github.com/adjust/parquet_fdw/tree/arrow-0.14)). 
Please refer to the [libarrow installation page](https://arrow.apache.org/install/) or the [building guide](https://github.com/apache/arrow/blob/master/docs/source/developers/cpp/building.rst).\nTo build `parquet_fdw` run:\n```sh\nmake install\n```\nor, if PostgreSQL is installed in a custom location:\n```sh\nmake install PG_CONFIG=/path/to/pg_config\n```\nIt is possible to pass additional compilation flags through either custom\n`CCFLAGS` or standard `PG_CFLAGS`, `PG_CXXFLAGS`, `PG_CPPFLAGS` variables.\n\nAfter the extension has been installed successfully, run in `psql`:\n```sql\ncreate extension parquet_fdw;\n```\n\n## Basic usage\n\nTo start using `parquet_fdw` one should first create a server and a user mapping. For example:\n```sql\ncreate server parquet_srv foreign data wrapper parquet_fdw;\ncreate user mapping for postgres server parquet_srv options (user 'postgres');\n```\n\nNow you should be able to create a foreign table for Parquet files.\n```sql\ncreate foreign table userdata (\n    id           int,\n    first_name   text,\n    last_name    text\n)\nserver parquet_srv\noptions (\n    filename '/mnt/userdata1.parquet'\n);\n```\n\n## Advanced\n\nCurrently `parquet_fdw` supports the following column [types](https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h):\n\n|   Arrow type |  SQL type |\n|-------------:|----------:|\n|         INT8 |      INT2 |\n|        INT16 |      INT2 |\n|        INT32 |      INT4 |\n|        INT64 |      INT8 |\n|        FLOAT |    FLOAT4 |\n|       DOUBLE |    FLOAT8 |\n|    TIMESTAMP | TIMESTAMP |\n|       DATE32 |      DATE |\n|       STRING |      TEXT |\n|       BINARY |     BYTEA |\n|         LIST |     ARRAY |\n|          MAP |     JSONB |\n\nCurrently `parquet_fdw` doesn't support structs or nested lists.\n\nA foreign table may be created for a single Parquet file or for a set of files. It is also possible to specify a user-defined function that returns a list of file paths. 
Depending on the number of files and table options `parquet_fdw` may use one of the following execution strategies:\n\n| Strategy                | Description              |\n|-------------------------|--------------------------|\n| **Single File**         | Basic single file reader |\n| **Multifile**           | Reader which processes Parquet files one by one, sequentially |\n| **Multifile Merge**     | Reader which merges presorted Parquet files so that the produced result is also ordered; used when the `sorted` option is specified and the query plan implies ordering (e.g. contains an `ORDER BY` clause) |\n| **Caching Multifile Merge** | Same as `Multifile Merge`, but keeps the number of simultaneously open files limited; used when the number of specified Parquet files exceeds `max_open_files` |\n\nThe following table options are supported:\n* **filename** - space-separated list of paths to Parquet files to read;\n* **sorted** - space-separated list of columns that Parquet files are presorted by; this helps PostgreSQL avoid redundant sorting when running a query with an `ORDER BY` clause or in other cases when having a presorted set is beneficial (Group Aggregate, Merge Join);\n* **files_in_order** - specifies that files specified by `filename` or returned by `files_func` are ordered according to the `sorted` option and that their value ranges do not overlap; this allows a `Gather Merge` node to be used on top of a parallel Multifile scan (default `false`);\n* **use_mmap** - whether memory map operations will be used instead of file read operations (default `false`);\n* **use_threads** - enables Apache Arrow's parallel columns decoding/decompression (default `false`);\n* **files_func** - user-defined function that is used by parquet_fdw to retrieve the list of Parquet files on each query; the function must take one `JSONB` argument and return a text array of full paths to Parquet files;\n* **files_func_arg** - argument for the function, specified by **files_func**;\n* **max_open_files** - the 
limit for the number of Parquet files open simultaneously.\n\nGUC variables:\n* **parquet_fdw.use_threads** - global switch that allows the user to enable or disable threads (default `true`);\n* **parquet_fdw.enable_multifile** - enable the Multifile reader (default `true`);\n* **parquet_fdw.enable_multifile_merge** - enable the Multifile Merge reader (default `true`).\n\n### Parallel queries\n\n`parquet_fdw` also supports [parallel query execution](https://www.postgresql.org/docs/current/parallel-query.html) (not to be confused with the multi-threaded decoding feature of Apache Arrow).\n\n### Import\n\n`parquet_fdw` also supports the [`IMPORT FOREIGN SCHEMA`](https://www.postgresql.org/docs/current/sql-importforeignschema.html) command to discover Parquet files in the specified directory on the filesystem and create foreign tables according to those files. It can be used as follows:\n\n```sql\nimport foreign schema \"/path/to/directory\"\nfrom server parquet_srv\ninto public;\n```\n\nNote that `remote_schema` here is a path to a local filesystem directory and must be double-quoted.\n\nAnother way to import Parquet files into foreign tables is to use `import_parquet` or `import_parquet_explicit`:\n\n```sql\ncreate function import_parquet(\n    tablename   text,\n    schemaname  text,\n    servername  text,\n    userfunc    regproc,\n    args        jsonb,\n    options     jsonb)\n\ncreate function import_parquet_explicit(\n    tablename   text,\n    schemaname  text,\n    servername  text,\n    attnames    text[],\n    atttypes    regtype[],\n    userfunc    regproc,\n    args        jsonb,\n    options     jsonb)\n```\n\nThe only difference between `import_parquet` and `import_parquet_explicit` is that the latter allows specifying a set of attributes (columns) to import. `attnames` and `atttypes` here are the arrays of attribute names and attribute types respectively (see the example below).\n\n`userfunc` is a user-defined function. 
It must take a `jsonb` argument and return a text array of filesystem paths to Parquet files to be imported. `args` is a user-specified `jsonb` object that is passed to `userfunc` as its argument. A simple implementation of such a function and its usage may look like this:\n\n```sql\ncreate function list_parquet_files(args jsonb)\nreturns text[] as\n$$\nbegin\n    return array_agg(args-\u003e\u003e'dir' || '/' || filename)\n           from pg_ls_dir(args-\u003e\u003e'dir') as files(filename)\n           where filename ~~ '%.parquet';\nend\n$$\nlanguage plpgsql;\n\nselect import_parquet_explicit(\n    'abc',\n    'public',\n    'parquet_srv',\n    array['one', 'three', 'six'],\n    array['int8', 'text', 'bool']::regtype[],\n    'list_parquet_files',\n    '{\"dir\": \"/path/to/directory\"}',\n    '{\"sorted\": \"one\"}'\n);\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fadjust%2Fparquet_fdw","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fadjust%2Fparquet_fdw","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fadjust%2Fparquet_fdw/lists"}