# PostgreSQL -> Parquet

A simple tool for exporting PostgreSQL tables into Parquet files, with support for more esoteric Postgres features than just `int` and `text`.

## Installation

### Download Binary from GitHub

Download the binary from [GitHub Actions](https://github.com/exyi/pg2parquet/actions/workflows/build.yaml?query=branch%3Amain) artifacts (click on the latest run, scroll to the bottom, and choose the build for your system).

### Using Nix flakes

If you use Nix, the following command installs the latest pg2parquet version. It is compiled from source, so the installation will take some time.

```
nix shell github:exyi/pg2parquet
```

Then use `pg2parquet` in the new shell. Note that you might need to add the `--extra-experimental-features 'nix-command flakes'` argument to the nix invocation.
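If flakes are not enabled in your Nix configuration, the full invocation would look like this (a sketch based on the note above):

```
nix --extra-experimental-features 'nix-command flakes' shell github:exyi/pg2parquet
```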

### Using Cargo

```
cargo install pg2parquet
```

### From Sources

Install Rust and Cargo, then clone the repository and build the CLI:

```bash
git clone https://github.com/exyi/pg2parquet
cd pg2parquet/cli
env RUSTFLAGS="-C target-cpu=native" cargo build --release
```

The build should finish in a few minutes (~10 CPU minutes). Keep the `target/release/pg2parquet` file and delete the rest of the target directory (it takes up quite a bit of disk space). You can optionally `strip` the binary, but you'll get a poor stack trace if it crashes.
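The cleanup might look like this (a sketch; the `~/.local/bin` destination is an assumption, use any directory on your `PATH`):

```bash
cp target/release/pg2parquet ~/.local/bin/   # keep only the binary
rm -r target                                 # reclaim the disk space
strip ~/.local/bin/pg2parquet                # optional: smaller file, worse stack traces on crash
```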

## Basic usage

```
pg2parquet export --host localhost.for.example --dbname my_database --output-file output.parquet -t the_table_to_export
```

Alternatively, you can export the result of a SQL query:

```
pg2parquet export --host localhost.for.example --dbname my_database --output-file output.parquet -q 'select column_a, column_b::text from another_table'
```

You can also supply the credentials through the `$PGPASSWORD` and `$PGUSER` environment variables.
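For example (a sketch; the user name and password are placeholders):

```
env PGUSER=my_user PGPASSWORD=my_secret pg2parquet export --host localhost.for.example --dbname my_database --output-file output.parquet -t the_table_to_export
```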

## Supported types

* **Basic SQL types**: `text`, `char`, `varchar` and friends, all kinds of `int`s, `bool`, floating point numbers, `timestamp`, `timestamptz`, `date`, `time`, `uuid`
* `interval` - intervals have lower precision in Parquet (ms) than in Postgres (µs), so the conversion is lossy. There is an option `--interval-handling=struct` which serializes them differently, without rounding (see the example after this list).
* **Decimal numeric types**
  * `numeric` is written with fixed precision according to the `--decimal-scale` and `--decimal-precision` parameters. Alternatively, use `--numeric-handling` to write a float or string instead.
  * `money` is always a 64-bit decimal with 2 decimal places
* **`json` and `jsonb`**: by default serialized as a text field with the JSON. `--json-handling` option allows setting parquet LogicalType to [JSON](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#json), but the feature is not widely supported, thus it's disabled by default.
* **`xml`**: serialized as text
* **`macaddr` and `inet`**: by default written out in text representation. It's possible to serialize macaddr as bytes or Int64 using `--macaddr-handling` option.
* **`bit` and `varbit`**: represented as a text string of `0` and `1` characters
* **[Enums](https://www.postgresql.org/docs/current/datatype-enum.html)**
  * By default serialized as text; use `--enum-handling int` to serialize them as integers
* **[Ranges](https://www.postgresql.org/docs/current/rangetypes.html)**
  - Serialized as `struct { lower: T, upper: T, lower_inclusive: bool, upper_inclusive: bool, is_empty: bool }`
* **[Arrays](https://www.postgresql.org/docs/current/arrays.html)**
  - Serialized as a Parquet List
  - Always serialized as single-dimensional arrays; the information about the starting index is dropped (see the `--array-handling` option)
* **[Composite Types](https://www.postgresql.org/docs/current/rowtypes.html)**
  - Serialized as a Parquet struct type
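For illustration, several of the type-handling options above can be combined in a single invocation (a sketch; the host, database, table, and the chosen precision/scale values are placeholders):

```
pg2parquet export --host localhost.for.example --dbname my_database --output-file output.parquet -t the_table_to_export --interval-handling=struct --numeric-handling=decimal --decimal-precision=38 --decimal-scale=9 --enum-handling=int --json-handling=text-marked-as-json
```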

## Known Limitations (and workarounds)

* Not all PostgreSQL types are supported
  * Workaround: convert the column to text (or another supported type) on the PostgreSQL side: `--query 'select weird_type_column::text from my_table'`
  * Please [submit an issue](https://github.com/exyi/pg2parquet/issues/new) if you run into an unsupported type
* I need the file in a slightly different format (renamed columns, ...)
  * Workaround 1: use the `--query` parameter to shape the resulting schema, as shown below
  * Workaround 2: use DuckDB or Spark to postprocess the Parquet file
    - DuckDB: `COPY (SELECT my_col as myCol, ... FROM 'export.parquet') TO 'export2.parquet' (FORMAT PARQUET);`
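For instance, renaming columns with `--query` might look like this (a sketch; the column and table names are placeholders):

```
pg2parquet export --host localhost.for.example --dbname my_database --output-file output.parquet -q 'select my_col as "myCol", other_col from my_table'
```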

## Options

**`> pg2parquet export --help`**

```
Exports a PostgreSQL table or query to a Parquet file

Usage: pg2parquet export [OPTIONS] --output-file <OUTPUT_FILE> --host <HOST> --dbname <DBNAME>

Options:
-o, --output-file <OUTPUT_FILE>
Path to the output file. If the file exists, it will be overwritten

-q, --query <QUERY>
SQL query to execute. Exclusive with --table

-t, --table <TABLE>
Which table should be exported. Exclusive with --query

--compression <COMPRESSION>
Compression applied on the output file. Default: zstd; change to snappy or none if it's too slow

[possible values: none, snappy, gzip, lzo, brotli, lz4, zstd]

--compression-level <COMPRESSION_LEVEL>
Compression level of the output file compressor. Only relevant for zstd, brotli and gzip. Default: 3

--quiet
Avoid printing unnecessary information (schema and progress). Only errors will be written to stderr

-H, --host <HOST>
Database server host

-U, --user <USER>
Database user name. If not specified, PGUSER environment variable is used

-d, --dbname <DBNAME>

-p, --port <PORT>

--password <PASSWORD>
Password to use for the connection. It is recommended to use the PGPASSWORD environment variable instead, since process arguments are visible to other users on the system

--sslmode <SSLMODE>
Controls whether to use SSL/TLS to connect to the server

Possible values:
- disable: Do not use TLS
- prefer: Attempt to connect with TLS, but allow sessions without (default behavior when compiled with SSL support)
- require: Require the use of TLS

--ssl-root-cert <SSL_ROOT_CERT>
File with a TLS root certificate in PEM or DER (.crt) format. When specified, the default CA certificates are considered untrusted. The option can be specified multiple times. Using this option implies --sslmode=require

--macaddr-handling <MACADDR_HANDLING>
How to handle `macaddr` columns

[default: text]

Possible values:
- text: MAC address is converted to a string
- byte-array: MAC is stored as fixed byte array of length 6
- int64: MAC is stored in Int64 (lowest 6 bytes)

--json-handling <JSON_HANDLING>
How to handle `json` and `jsonb` columns

[default: text]

Possible values:
- text-marked-as-json: JSON is stored as a Parquet JSON type. This is essentially the same as text, but with a different ConvertedType, so it may not be supported in all tools
- text: JSON is stored as a UTF8 text

--enum-handling <ENUM_HANDLING>
How to handle enum (Enumerated Type) columns

[default: text]

Possible values:
- text: Enum is stored as the postgres enum name, Parquet LogicalType is set to ENUM
- plain-text: Enum is stored as the postgres enum name, Parquet LogicalType is set to String
- int: Enum is stored as a 32-bit integer (one-based index of the value in the enum definition)

--interval-handling <INTERVAL_HANDLING>
How to handle `interval` columns

[default: interval]

Possible values:
- interval: Interval is stored as the Parquet INTERVAL type. This has lower precision than the postgres interval (milliseconds instead of microseconds)
- struct: Interval is stored as struct { months: i32, days: i32, microseconds: i64 }, exactly as PostgreSQL stores it

--numeric-handling <NUMERIC_HANDLING>
How to handle `numeric` columns

[default: double]

Possible values:
- decimal: Numeric is stored using the DECIMAL parquet type. Use --decimal-precision and --decimal-scale to set the desired precision and scale
- double: Numeric is converted to float64 (DOUBLE)
- float32: Numeric is converted to float32 (FLOAT)
- string: Convert the numeric to a string and store it as UTF8 text. This option never loses precision. Note that the text "NaN" may be present if NaN is present in the database

--decimal-scale <DECIMAL_SCALE>
How many decimal digits after the decimal point are stored in the Parquet file in DECIMAL data type

[default: 18]

--decimal-precision <DECIMAL_PRECISION>
How many decimal digits are allowed in numeric/DECIMAL column. By default 38, the largest value which fits in 128 bits. If <= 9, the column is stored as INT32; if <= 18, the column is stored as INT64; otherwise BYTE_ARRAY

[default: 38]

--array-handling <ARRAY_HANDLING>
Parquet does not support multi-dimensional arrays, nor arrays with a different starting index. pg2parquet flattens the arrays, and this option allows including the stripped information in additional columns

[default: plain]

Possible values:
- plain: Postgres arrays are simply stored as Parquet LIST
- dimensions: Postgres arrays are stored as struct of { data: List[T], dims: List[int] }
- dimensions+lowerbound: Postgres arrays are stored as struct of { data: List[T], dims: List[int], lower_bound: List[int] }
```
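Putting several of these options together, a TLS-secured export might look like this (a sketch; the host, certificate path, and names are placeholders):

```
pg2parquet export -H db.example.com -p 5432 -d my_database -U my_user --sslmode require --ssl-root-cert ./my-root-ca.crt --compression zstd --compression-level 6 --output-file output.parquet -t the_table_to_export
```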