{"id":21028210,"url":"https://github.com/exyi/pg2parquet","last_synced_at":"2025-04-04T13:05:25.027Z","repository":{"id":63043882,"uuid":"547412556","full_name":"exyi/pg2parquet","owner":"exyi","description":"Export PostgreSQL table or query into Parquet file","archived":false,"fork":false,"pushed_at":"2025-03-12T21:10:10.000Z","size":293,"stargazers_count":70,"open_issues_count":6,"forks_count":14,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-03-28T12:01:43.178Z","etag":null,"topics":["parquet","postgres","postgresql"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/exyi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-10-07T16:35:02.000Z","updated_at":"2025-03-17T11:34:56.000Z","dependencies_parsed_at":"2024-01-18T17:25:27.883Z","dependency_job_id":"1e940524-d5f4-46f0-bafc-543cc3edfafa","html_url":"https://github.com/exyi/pg2parquet","commit_stats":{"total_commits":29,"total_committers":3,"mean_commits":9.666666666666666,"dds":0.06896551724137934,"last_synced_commit":"745323955adc80240ba154756f3e0d1b533f834c"},"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/exyi%2Fpg2parquet","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/exyi%2Fpg2parquet/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/exyi%2Fpg2parquet/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/rep
ositories/exyi%2Fpg2parquet/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/exyi","download_url":"https://codeload.github.com/exyi/pg2parquet/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247174453,"owners_count":20896078,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["parquet","postgres","postgresql"],"created_at":"2024-11-19T11:54:19.089Z","updated_at":"2025-04-04T13:05:24.980Z","avatar_url":"https://github.com/exyi.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PostgreSQL -\u003e Parquet\n\nA simple tool for exporting PostgreSQL tables into Parquet, with support for more esoteric Postgres features than just `int` and `text`.\n\n## Installation\n\n### Download Binary from GitHub\n\nDownload the binary from the [GitHub Actions](https://github.com/exyi/pg2parquet/actions/workflows/build.yaml?query=branch%3Amain) artifacts (click on the latest run, scroll to the bottom, choose your system).\n\n### Using Nix flakes\n\nIf you use Nix, this command will install the latest pg2parquet version. It compiles from source, so the installation will take some time.\n\n```\nnix shell github:exyi/pg2parquet\n```\n\nThen use `pg2parquet` in the new shell. Note that you might need to add the `--extra-experimental-features 'nix-command flakes'` argument to the nix invocation.\n\n### Using Cargo\n\n```\ncargo install pg2parquet\n```\n\n### From Sources\n\nInstall Rust and Cargo. 
Clone the repo.\n\n```bash\ncd cli\nenv RUSTFLAGS=\"-C target-cpu=native\" cargo build --release\n```\n\nIt should finish in a few minutes (~10 CPU minutes). Take the `target/release/pg2parquet` file and delete the rest of the target directory (it takes quite a bit of disk space). You can optionally `strip` the binary, but you'll get a poor stack trace if it crashes.\n\n## Basic usage\n\n```\npg2parquet export --host localhost.for.example --dbname my_database --output-file output.parquet -t the_table_to_export\n```\n\nAlternatively, you can export the result of a SQL query:\n\n```\npg2parquet export --host localhost.for.example --dbname my_database --output-file output.parquet -q 'select column_a, column_b::text from another_table'\n```\n\nYou can also supply credentials through the `$PGPASSWORD` and `$PGUSER` environment variables.\n\n## Supported types\n\n* **Basic SQL types**: `text`, `char`, `varchar` and friends, all kinds of `int`s, `bool`, floating-point numbers, `timestamp`, `timestamptz`, `date`, `time`, `uuid`\n  * `interval` - interval has lower precision in Parquet (ms) than in Postgres (µs), so the conversion is lossy. There is an option `--interval-handling=struct` which serializes it differently without rounding.\n* **Decimal numeric types**\n\t* `numeric` will have fixed precision according to the `--decimal-scale` and `--decimal-precision` parameters. Alternatively use `--numeric-handling` to write a float or string instead.\n\t* `money` is always a 64-bit decimal with 2 decimal places\n* **`json` and `jsonb`**: by default serialized as a text field with the JSON. The `--json-handling` option allows setting the Parquet LogicalType to [JSON](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#json), but the feature is not widely supported, so it is disabled by default.\n* **`xml`**: serialized as text\n* **`macaddr` and `inet`**: by default written out in text representation. 
It's possible to serialize macaddr as bytes or Int64 using the `--macaddr-handling` option.\n* **`bit` and `varbit`**: represented as text of `0` and `1`\n* **[Enums](https://www.postgresql.org/docs/current/datatype-enum.html)**\n\t* By default serialized as text; use `--enum-handling int` to serialize them as integers\n* **[Ranges](https://www.postgresql.org/docs/current/rangetypes.html)**\n\t- Serialized as `struct { lower: T, upper: T, lower_inclusive: bool, upper_inclusive: bool, is_empty: bool }`\n* **[Arrays](https://www.postgresql.org/docs/current/arrays.html)**\n\t- Serialized as Parquet List\n\t- Always flattened to single-dimensional arrays; information about dimensions and the starting index is dropped by default (see `--array-handling`)\n* **[Composite Types](https://www.postgresql.org/docs/current/rowtypes.html)**\n\t- Serialized as a Parquet struct type\n\n## Known Limitations (and workarounds)\n\n* Not all PostgreSQL types are supported\n\t* Workaround: Convert it to text (or another supported type) on the PostgreSQL side: `--query 'select weird_type_column::text from my_table'`\n\t* Please [submit an issue](https://github.com/exyi/pg2parquet/issues/new)\n* I need the file in a slightly different format (rename columns, ...)\n\t* Workaround 1: Use the `--query` parameter to shape the resulting schema\n\t* Workaround 2: Use DuckDB or Spark to postprocess the Parquet file\n\t\t- DuckDB: `COPY (SELECT my_col as myCol, ... FROM 'export.parquet') TO 'export2.parquet' (FORMAT PARQUET);`\n\n\n## Options\n\n**`\u003e pg2parquet export --help`**\n\n```\nExports a PostgreSQL table or query to a Parquet file\n\nUsage: pg2parquet export [OPTIONS] --output-file \u003cOUTPUT_FILE\u003e --host \u003cHOST\u003e --dbname \u003cDBNAME\u003e\n\nOptions:\n  -o, --output-file \u003cOUTPUT_FILE\u003e\n          Path to the output file. If the file exists, it will be overwritten\n\n  -q, --query \u003cQUERY\u003e\n          SQL query to execute. 
Exclusive with --table\n\n  -t, --table \u003cTABLE\u003e\n          Which table should be exported. Exclusive with --query\n\n      --compression \u003cCOMPRESSION\u003e\n          Compression applied on the output file. Default: zstd, change to Snappy or None if it's too slow\n          \n          [possible values: none, snappy, gzip, lzo, brotli, lz4, zstd]\n\n      --compression-level \u003cCOMPRESSION_LEVEL\u003e\n          Compression level of the output file compressor. Only relevant for zstd, brotli and gzip. Default: 3\n\n      --quiet\n          Avoid printing unnecessary information (schema and progress). Only errors will be written to stderr\n\n  -H, --host \u003cHOST\u003e\n          Database server host\n\n  -U, --user \u003cUSER\u003e\n          Database user name. If not specified, PGUSER environment variable is used\n\n  -d, --dbname \u003cDBNAME\u003e\n          \n\n  -p, --port \u003cPORT\u003e\n          \n\n      --password \u003cPASSWORD\u003e\n          Password to use for the connection. It is recommended to use the PGPASSWORD environment variable instead, since process arguments are visible to other users on the system\n\n      --sslmode \u003cSSLMODE\u003e\n          Controls whether to use SSL/TLS to connect to the server\n\n          Possible values:\n          - disable: Do not use TLS\n          - prefer:  Attempt to connect with TLS but allow sessions without (default behavior compiled with SSL support)\n          - require: Require the use of TLS\n\n      --ssl-root-cert \u003cSSL_ROOT_CERT\u003e\n          File with a TLS root certificate in PEM or DER (.crt) format. When specified, the default CA certificates are considered untrusted. The option can be specified multiple times. 
Using this options implies --sslmode=require\n\n      --macaddr-handling \u003cMACADDR_HANDLING\u003e\n          How to handle `macaddr` columns\n          \n          [default: text]\n\n          Possible values:\n          - text:       MAC address is converted to a string\n          - byte-array: MAC is stored as fixed byte array of length 6\n          - int64:      MAC is stored in Int64 (lowest 6 bytes)\n\n      --json-handling \u003cJSON_HANDLING\u003e\n          How to handle `json` and `jsonb` columns\n          \n          [default: text]\n\n          Possible values:\n          - text-marked-as-json: JSON is stored as a Parquet JSON type. This is essentially the same as text, but with a different ConvertedType, so it may not be supported in all tools\n          - text:                JSON is stored as a UTF8 text\n\n      --enum-handling \u003cENUM_HANDLING\u003e\n          How to handle enum (Enumerated Type) columns\n          \n          [default: text]\n\n          Possible values:\n          - text:       Enum is stored as the postgres enum name, Parquet LogicalType is set to ENUM\n          - plain-text: Enum is stored as the postgres enum name, Parquet LogicalType is set to String\n          - int:        Enum is stored as an 32-bit integer (one-based index of the value in the enum definition)\n\n      --interval-handling \u003cINTERVAL_HANDLING\u003e\n          How to handle `interval` columns\n          \n          [default: interval]\n\n          Possible values:\n          - interval: Enum is stored as the Parquet INTERVAL type. 
This has lower precision than postgres interval (milliseconds instead of microseconds)\n          - struct:   Enum is stored as struct { months: i32, days: i32, microseconds: i64 }, exactly as PostgreSQL stores it\n\n      --numeric-handling \u003cNUMERIC_HANDLING\u003e\n          How to handle `numeric` columns\n          \n          [default: double]\n\n          Possible values:\n          - decimal: Numeric is stored using the DECIMAL parquet type. Use --decimal-precision and --decimal-scale to set the desired precision and scale\n          - double:  Numeric is converted to float64 (DOUBLE)\n          - float32: Numeric is converted to float32 (FLOAT)\n          - string:  Convert the numeric to a string and store it as UTF8 text. This option never looses precision. Note that text \"NaN\" may be present if NaN is present in the database\n\n      --decimal-scale \u003cDECIMAL_SCALE\u003e\n          How many decimal digits after the decimal point are stored in the Parquet file in DECIMAL data type\n          \n          [default: 18]\n\n      --decimal-precision \u003cDECIMAL_PRECISION\u003e\n          How many decimal digits are allowed in numeric/DECIMAL column. By default 38, the largest value which fits in 128 bits. If \u003c= 9, the column is stored as INT32; if \u003c= 18, the column is stored as INT64; otherwise BYTE_ARRAY\n          \n          [default: 38]\n\n      --array-handling \u003cARRAY_HANDLING\u003e\n          Parquet does not support multi-dimensional arrays and arrays with different starting index. 
pg2parquet flattens the arrays, and this options allows including the stripped information in additional columns\n          \n          [default: plain]\n\n          Possible values:\n          - plain:                 Postgres arrays are simply stored as Parquet LIST\n          - dimensions:            Postgres arrays are stored as struct of { data: List[T], dims: List[int] }\n          - dimensions+lowerbound: Postgres arrays are stored as struct of { data: List[T], dims: List[int], lower_bound: List[int] }\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fexyi%2Fpg2parquet","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fexyi%2Fpg2parquet","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fexyi%2Fpg2parquet/lists"}