https://github.com/adjust/parquet_fdw

Parquet foreign data wrapper for PostgreSQL
https://github.com/adjust/parquet_fdw
Last synced: about 1 year ago
JSON representation
Parquet foreign data wrapper for PostgreSQL
Host: GitHub
URL: https://github.com/adjust/parquet_fdw
Owner: adjust
License: postgresql
Created: 2018-12-01T01:49:02.000Z (over 7 years ago)
Default Branch: master
Last Pushed: 2024-08-27T12:35:51.000Z (almost 2 years ago)
Last Synced: 2025-03-29T04:05:30.182Z (about 1 year ago)
Language: C++
Size: 426 KB
Stars: 371
Watchers: 49
Forks: 38
Open Issues: 28
Metadata Files:
- Readme: README.md
- License: LICENSE.md
Awesome Lists containing this project

README

          [![build](https://github.com/adjust/parquet_fdw/actions/workflows/ci.yml/badge.svg)](https://github.com/adjust/parquet_fdw/actions/workflows/ci.yml) ![experimental](https://img.shields.io/badge/status-experimental-orange)

# parquet_fdw

Read-only Apache Parquet foreign data wrapper for PostgreSQL.

## Installation

`parquet_fdw` requires `libarrow` and `libparquet` installed in your system (requires version 0.15+, for previous versions use branch [arrow-0.14](https://github.com/adjust/parquet_fdw/tree/arrow-0.14)). Please refer to [libarrow installation page](https://arrow.apache.org/install/) or [building guide](https://github.com/apache/arrow/blob/master/docs/source/developers/cpp/building.rst).

To build `parquet_fdw` run:

```sh

make install

```

or in case when PostgreSQL is installed in a custom location:

```sh

make install PG_CONFIG=/path/to/pg_config

```

It is possible to pass additional compilation flags through either custom

`CCFLAGS` or standard `PG_CFLAGS`, `PG_CXXFLAGS`, `PG_CPPFLAGS` variables.

After extension was successfully installed run in `psql`:

```sql

create extension parquet_fdw;

```

## Basic usage

To start using `parquet_fdw` one should first create a server and user mapping. For example:

```sql

create server parquet_srv foreign data wrapper parquet_fdw;

create user mapping for postgres server parquet_srv options (user 'postgres');

```

Now you should be able to create foreign table for Parquet files.

```sql

create foreign table userdata (

    id           int,

    first_name   text,

    last_name    text

)

server parquet_srv

options (

    filename '/mnt/userdata1.parquet'

);

```

## Advanced

Currently `parquet_fdw` supports the following column [types](https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h):

|   Arrow type |  SQL type |

|-------------:|----------:|

|         INT8 |      INT2 |

|        INT16 |      INT2 |

|        INT32 |      INT4 |

|        INT64 |      INT8 |

|        FLOAT |    FLOAT4 |

|       DOUBLE |    FLOAT8 |

|    TIMESTAMP | TIMESTAMP |

|       DATE32 |      DATE |

|       STRING |      TEXT |

|       BINARY |     BYTEA |

|         LIST |     ARRAY |

|          MAP |     JSONB |

Currently `parquet_fdw` doesn't support structs and nested lists.

Foreign table may be created for a single Parquet file and for a set of files. It is also possible to specify a user defined function, which would return a list of file paths. Depending on the number of files and table options `parquet_fdw` may use one of the following execution strategies:

| Strategy                | Description              |

|-------------------------|--------------------------|

| **Single File**         | Basic single file reader

| **Multifile**           | Reader which process Parquet files one by one in sequential manner |

| **Multifile Merge**     | Reader which merges presorted Parquet files so that the produced result is also ordered; used when `sorted` option is specified and the query plan implies ordering (e.g. contains `ORDER BY` clause) |

| **Caching Multifile Merge** | Same as `Multifile Merge`, but keeps the number of simultaneously open files limited; used when the number of specified Parquet files exceeds `max_open_files` |

Following table options are supported:

* **filename** - space separated list of paths to Parquet files to read;

* **sorted** - space separated list of columns that Parquet files are presorted by; that would help postgres to avoid redundant sorting when running query with `ORDER BY` clause or in other cases when having a presorted set is beneficial (Group Aggregate, Merge Join);

* **files_in_order** - specifies that files specified by `filename` or returned by `files_func` are ordered according to `sorted` option and have no intersection rangewise; this allows to use `Gather Merge` node on top of parallel Multifile scan (default `false`);

* **use_mmap** - whether memory map operations will be used instead of file read operations (default `false`);

* **use_threads** - enables Apache Arrow's parallel columns decoding/decompression (default `false`);

* **files_func** - user defined function that is used by parquet_fdw to retrieve the list of parquet files on each query; function must take one `JSONB` argument and return text array of full paths to parquet files;

* **files_func_arg** - argument for the function, specified by **files_func**;

* **max_open_files** - the limit for the number of Parquet files open simultaneously.

GUC variables:

* **parquet_fdw.use_threads** - global switch that allow user to enable or disable threads (default `true`);

* **parquet_fdw.enable_multifile** - enable Multifile reader (default `true`).

* **parquet_fdw.enable_multifile_merge** - enable Multifile Merge reader (default `true`).

### Parallel queries

`parquet_fdw` also supports [parallel query execution](https://www.postgresql.org/docs/current/parallel-query.html) (not to confuse with multi-threaded decoding feature of Apache Arrow).

### Import

`parquet_fdw` also supports [`IMPORT FOREIGN SCHEMA`](https://www.postgresql.org/docs/current/sql-importforeignschema.html) command to discover parquet files in the specified directory on filesystem and create foreign tables according to those files. It can be used as follows:

```sql

import foreign schema "/path/to/directory"

from server parquet_srv

into public;

```

It is important that `remote_schema` here is a path to a local filesystem directory and is double quoted.

Another way to import parquet files into foreign tables is to use `import_parquet` or `import_parquet_explicit`:

```sql

create function import_parquet(

    tablename   text,

    schemaname  text,

    servername  text,

    userfunc    regproc,

    args        jsonb,

    options     jsonb)

create function import_parquet_explicit(

    tablename   text,

    schemaname  text,

    servername  text,

    attnames    text[],

    atttypes    regtype[],

    userfunc    regproc,

    args        jsonb,

    options     jsonb)

```

The only difference between `import_parquet` and `import_parquet_explicit` is that the latter allows to specify a set of attributes (columns) to import. `attnames` and `atttypes` here are the attributes names and attributes types arrays respectively (see the example below).

`userfunc` is a user-defined function. It must take a `jsonb` argument and return a text array of filesystem paths to parquet files to be imported. `args` is user-specified jsonb object that is passed to `userfunc` as its argument. A simple implementation of such function and its usage may look like this:

```sql

create function list_parquet_files(args jsonb)

returns text[] as

$$

begin

    return array_agg(args->>'dir' || '/' || filename)

           from pg_ls_dir(args->>'dir') as files(filename)

           where filename ~~ '%.parquet';

end

$$

language plpgsql;

select import_parquet_explicit(

    'abc',

    'public',

    'parquet_srv',

    array['one', 'three', 'six'],

    array['int8', 'text', 'bool']::regtype[],

    'list_parquet_files',

    '{"dir": "/path/to/directory"}',

    '{"sorted": "one"}'

);

```
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/adjust/parquet_fdw

Awesome Lists containing this project

README