https://github.com/findinpath/parquet-table-generator

Proof of concept project for generating a wide Parquet partitioned table containing a lot of columns.

Host: GitHub
URL: https://github.com/findinpath/parquet-table-generator
Owner: findinpath
Created: 2022-01-19T19:09:10.000Z (over 4 years ago)
Default Branch: master
Last Pushed: 2022-01-21T04:59:14.000Z (over 4 years ago)
Last Synced: 2025-01-29T18:32:14.274Z (over 1 year ago)
Language: Scala
Size: 4.88 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Parquet Table Generator
============================

Proof of concept project for generating:
- wide Parquet partitioned table containing a lot of columns.
- long Parquet partitioned table containing a lot of partitions.
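
To make the two shapes concrete, here is a minimal sketch of what "wide" and "long" mean in terms of table metadata. It is not the generator in this repository: the function names, the `col_NNNN` column names, the `string` type and the Hive-style `key=value` directory layout are all assumptions chosen for illustration.

```python
from typing import Dict, List


def wide_table_columns(column_count: int) -> List[Dict[str, str]]:
    """Column definitions for a wide table (hypothetical names and types)."""
    return [{"Name": f"col_{i:04d}", "Type": "string"} for i in range(column_count)]


def partition_paths(base: str, key: str, partition_count: int) -> List[str]:
    """Directory prefixes for a long table, assuming a Hive-style key=value layout."""
    root = base.rstrip("/")
    return [f"{root}/{key}={i:05d}/" for i in range(partition_count)]


if __name__ == "__main__":
    columns = wide_table_columns(3000)   # wide: many columns
    paths = partition_paths("s3://example-bucket/table", "part", 1000)  # long: many partitions
    print(len(columns), columns[0])
    print(len(paths), paths[0])
```

A wide table multiplies the metadata stored per partition, while a long table multiplies the number of partitions; the two together drive the size of any response that lists partitions with their columns.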

The purpose of generating the wide table is to upload it to AWS Glue and reproduce
the case of having to deal with _UnprocessedKeys_ when retrieving its partitions.

https://docs.aws.amazon.com/cli/latest/reference/glue/batch-get-partition.html

At first it is not obvious why the AWS Glue API response for `batch-get-partition`
contains the field _UnprocessedKeys_:

> A list of the partition values in the request for which partitions were not returned.

When requesting `1000` partitions, each of which carries information about `3000` columns,
the response payload is very large. This is probably why the AWS Glue API developers
introduced the _UnprocessedKeys_ field: it is a soft way of saying that the full response
would be too big. With this rather unusual mechanism the response is effectively batched,
because the client is forced to make multiple smaller batch partition calls.
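
The client-side consequence can be sketched as a loop that resubmits whatever comes back in _UnprocessedKeys_ until nothing is pending. This is an illustration, not code from this repository: `get_all_partitions` is a hypothetical helper, and `glue` is assumed to be any object exposing `batch_get_partition` with the request and response shape of the boto3 Glue client (`PartitionsToGet` in, `Partitions` and `UnprocessedKeys` out).

```python
from typing import Any, Dict, List


def get_all_partitions(glue: Any, database: str, table: str,
                       partition_values: List[List[str]],
                       batch_size: int = 1000) -> List[Dict[str, Any]]:
    """Fetch partitions, resubmitting the keys returned in UnprocessedKeys."""
    pending = [{"Values": list(values)} for values in partition_values]
    found: List[Dict[str, Any]] = []
    while pending:
        # Send at most batch_size keys per request.
        batch, pending = pending[:batch_size], pending[batch_size:]
        response = glue.batch_get_partition(
            DatabaseName=database, TableName=table, PartitionsToGet=batch)
        partitions = response.get("Partitions", [])
        unprocessed = response.get("UnprocessedKeys", [])
        found.extend(partitions)
        # Guard against looping forever when a call makes no progress at all.
        if not partitions and len(unprocessed) == len(batch):
            raise RuntimeError("batch_get_partition made no progress")
        # Unprocessed keys go back to the front of the queue.
        pending = unprocessed + pending
    return found
```

With boto3 installed, the helper could be called as `get_all_partitions(boto3.client("glue"), "my_db", "my_table", values)`, where the database and table names are placeholders. The no-progress guard is a design choice of this sketch: it also fires if a batch consists only of keys that the service keeps reporting as unprocessed, so a caller that expects such keys may prefer to collect them instead of raising.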
