https://github.com/ev2900/glue_aggregate_small_files

PySpark script to aggregate small parquet files in a prefix into larger files. Designed to be run on AWS Glue

# Glue Aggregate Small Parquet Files

When storing data in S3, it is important to consider the size of the files you store. Parquet files have an ideal file size of 512 MB - 1 GB. Storing data in many small files can decrease the performance of data processing tools such as Spark.
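To make the sizing concrete, the sketch below estimates how many output files a prefix would need for a given target size. The helper name `summarize_prefix`, the 512 MB default and the sample file sizes are illustrative assumptions and are not taken from the repository's script.

```python
import math

MB = 1024 * 1024


def summarize_prefix(file_sizes_bytes, target_file_bytes=512 * MB):
    """Hypothetical helper: summarize a prefix and suggest an output file count."""
    total_bytes = sum(file_sizes_bytes)
    # Round up so the average output file does not exceed the target size,
    # and always plan for at least one output file.
    suggested = max(1, math.ceil(total_bytes / target_file_bytes))
    return {
        "file_count": len(file_sizes_bytes),
        "total_bytes": total_bytes,
        "suggested_file_count": suggested,
    }


# Example: 2,000 files of 1 MB each (roughly 2 GB in total)
summary = summarize_prefix([1 * MB] * 2000)
print(summary["file_count"], "files ->", summary["suggested_file_count"], "files")
```

With these sample numbers, 2,000 small files would be consolidated into 4 files of roughly 500 MB each.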

This repository provides a PySpark script [Aggregate_Small_Parquet_Files.py](https://github.com/ev2900/Glue_Aggregate_Small_Files/blob/main/Aggregate_Small_Parquet_Files.py) that can consolidate small parquet files in an S3 prefix into larger parquet files.
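For orientation, a common way to consolidate files in PySpark is to read the prefix, repartition to the desired number of files and write the result back out. The snippet below is a minimal sketch of that general pattern, not the contents of the repository's script. The function name `aggregate_prefix` and the example paths are hypothetical.

```python
def aggregate_prefix(spark, source_path, dest_path, num_output_files):
    """Hypothetical helper: rewrite the parquet files under source_path
    as num_output_files larger parquet files under dest_path."""
    if num_output_files < 1:
        raise ValueError("num_output_files must be at least 1")
    df = spark.read.parquet(source_path)
    # repartition() shuffles the rows into num_output_files partitions;
    # Spark then writes one parquet file per non-empty partition.
    df.repartition(num_output_files).write.mode("overwrite").parquet(dest_path)


# Usage inside a Glue or Spark job (bucket and prefixes are placeholders):
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.getOrCreate()
# aggregate_prefix(spark, "s3://my-bucket/small/", "s3://my-bucket/aggregated/", 4)
```

The sketch writes to a separate destination prefix on purpose: overwriting the same path that Spark is still reading from is not safe.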

## How to run the Glue job to aggregate small parquet files

*Note* if you are testing the [Aggregate_Small_Parquet_Files.py](https://github.com/ev2900/Glue_Aggregate_Small_Files/blob/main/Aggregate_Small_Parquet_Files.py) script and need small parquet files as test data, you can follow the instructions in the [Example](https://github.com/ev2900/Glue_Aggregate_Small_Files/tree/main/Example) folder to create them.

1. Upload the [Aggregate_Small_Parquet_Files.py](https://github.com/ev2900/Glue_Aggregate_Small_Files/blob/main/Aggregate_Small_Parquet_Files.py) file to an S3 bucket

2. Run the CloudFormation stack below to create a Glue job that will aggregate small parquet files

[![Launch CloudFormation Stack](https://sharkech-public.s3.amazonaws.com/misc-public/cloudformation-launch-stack.png)](https://console.aws.amazon.com/cloudformation/home#/stacks/new?stackName=aggregate-small-files-glue&templateURL=https://sharkech-public.s3.amazonaws.com/misc-public/Aggregate_Small_Parquet_File_Glue_Job_Deployment.yaml)

As you follow the prompts to deploy the CloudFormation stack, ensure that you fill out the *S3GlueScriptLocation* parameter with the S3 URI of the [Aggregate_Small_Parquet_Files.py](https://github.com/ev2900/Glue_Aggregate_Small_Files/blob/main/Aggregate_Small_Parquet_Files.py) file that you uploaded to an S3 bucket in the first step.
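If you prefer to script the deployment instead of using the console, the stack could also be created with boto3. This is a sketch under assumptions: it reuses the stack name and template URL from the launch link above, assumes *S3GlueScriptLocation* is the only parameter that needs a value, and assumes the template creates IAM resources (hence the `Capabilities` entry). The helper names `build_stack_request` and `create_stack` are hypothetical.

```python
STACK_NAME = "aggregate-small-files-glue"
TEMPLATE_URL = (
    "https://sharkech-public.s3.amazonaws.com/misc-public/"
    "Aggregate_Small_Parquet_File_Glue_Job_Deployment.yaml"
)


def build_stack_request(script_s3_uri):
    """Hypothetical helper: build the keyword arguments for create_stack."""
    if not script_s3_uri.startswith("s3://"):
        raise ValueError("script_s3_uri must be an S3 URI such as s3://bucket/key")
    return {
        "StackName": STACK_NAME,
        "TemplateURL": TEMPLATE_URL,
        "Parameters": [
            {"ParameterKey": "S3GlueScriptLocation", "ParameterValue": script_s3_uri},
        ],
        # Assumption: the template creates IAM resources for the Glue job.
        "Capabilities": ["CAPABILITY_NAMED_IAM"],
    }


def create_stack(script_s3_uri):
    # boto3 is imported here so the request builder can be used without it.
    import boto3

    client = boto3.client("cloudformation")
    return client.create_stack(**build_stack_request(script_s3_uri))


# Example (placeholder bucket name):
# create_stack("s3://my-bucket/Aggregate_Small_Parquet_Files.py")
```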

3. Update and run the Glue job

The CloudFormation stack deployed a Glue job named *Aggregate_Small_Parquet_Files*. Navigate to the [Glue console](https://us-east-1.console.aws.amazon.com/gluestudio/home), select *ETL jobs* and then select the *Aggregate_Small_Parquet_Files* job.

- Update the S3 bucket name to the bucket that holds the small files to aggregate
- Update the prefix path to the prefix of a single partition that contains the small files
- Optional: update *total_prefix_size* to the desired target size of the aggregated parquet file(s)

After you update the S3 bucket name and the path to the prefix, save and run the Glue job. When the Glue job finishes, the small parquet files in the specified S3 location will have been aggregated into larger files.
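Once the script has been saved with your bucket name and prefix, the job can also be started and monitored from code rather than the console. The sketch below uses the Glue `start_job_run` and `get_job_run` calls. The function name `run_job_and_wait` and the 30 second polling interval are assumptions for illustration.

```python
import time

JOB_NAME = "Aggregate_Small_Parquet_Files"
# Job run states after which Glue will not change the run any further.
FINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def run_job_and_wait(glue, job_name=JOB_NAME, poll_seconds=30, sleep=time.sleep):
    """Hypothetical helper: start the Glue job and poll until it finishes."""
    run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
    while True:
        run = glue.get_job_run(JobName=job_name, RunId=run_id)
        state = run["JobRun"]["JobRunState"]
        if state in FINAL_STATES:
            return state
        sleep(poll_seconds)


# Example:
# import boto3
# print(run_job_and_wait(boto3.client("glue")))
```

Passing the client in as an argument keeps the helper easy to try out with a stub before pointing it at a real AWS account.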