An open API service indexing awesome lists of open source software.

https://github.com/buoyant-data/oxbow

Collection of AWS Lambdas for creating and managing Delta tables
https://github.com/buoyant-data/oxbow

datalake deltalake lambda parquet rust

Last synced: 10 days ago
JSON representation

Collection of AWS Lambdas for creating and managing Delta tables

Awesome Lists containing this project

README

          

ifdef::env-github[]
:tip-caption: :bulb:
:note-caption: :information_source:
:important-caption: :heavy_exclamation_mark:
:caution-caption: :fire:
:warning-caption: :warning:
endif::[]
:toc: macro

= Oxbow

Oxbow is a project to take an existing storage location which contains
link:https://parquet.apache.org[Apache Parquet] files into a
link:https://delta.io[Delta Lake table]. It is intended to run both as an AWS
Lambda or as a command line application.

The project is named after link:https://en.wikipedia.org/wiki/Oxbow_lake[Oxbow
lakes] to keep with the lake theme.

toc::[]

== Using

=== Command Line

Executing `cargo build --release` from a clone of this repository will build
the command line binary `oxbow` which can be used directly to convert a
directory full of `.parquet` files into a Delta table.

This is an _in place_ operation and will convert the specified table location
into a Delta table!

.Simple local files
[source,bash]
----
% oxbow --table ./path/to/my/parquet-files
----

.Files on AWS
[source,bash]
----
% export AWS_REGION=us-west-2
% export AWS_SECRET_ACCESS_KEY=xxxx
# Set other AWS environment variables
% oxbow --table s3://my-bucket/prefix/to/parquet
----

=== Lambda

The `deployment/` directory contains the necessary Terraform to provision the
function, a DynamoDB table for locking, S3 bucket, and IAM permissions.

After configuring the necessary authentication for Terraform, the following
steps can be used to provision:

[source,bash]
----
cargo lambda build --release --output-format zip --bin oxbow-lambda
terraform init
terraform plan
terraform apply
----

[NOTE]
====
Terraform configures the Lambda to run with the smallest amount of memory
allowed. For bucket locations with massive `.parquet` files, this may need to
be tuned.
====

==== Advanced

To help ameliorate
link:https://www.buoyantdata.com/blog/2023-11-27-concurrency-limitations-with-deltalake-on-aws.html[concurrency
challenges for Delta Lake on AWS] with the DynamoDb lock, the `deployment/`
directory also contains an "advanced" pattern which uses the `group-events`
Lambda to help serialize S3 Bucket Notifications into an AWS SQS FIFO with
Message Group IDs.

To build all the necessary code locally for the Advanced pattern, please run
`make build-release`

== Development

Building and testing can be done with cargo: `cargo test`.

In order to deploy this in AWS Lambda, it must first be built with the `cargo
lambda` command line tool, e.g.:

[source,bash]
----
cargo lambda build --release --output-format zip
----

This will produce the file: `target/lambda/oxbow-lambda/bootstrap.zip` which can be
uploaded direectly in the web console, or referenced in the Terraform (see
`deployment.tf`).

=== Design

==== Command Line

When running `oxbow` via command line it is a _one time operation_. It will
take an existing directory or location full of `.parquet` files and create a
Delta table out of it.

==== Lambda

When running `oxbow` inside of a AWS Lambda function it should be configured
with an S3 Event Trigger and create new commits to a Delta Lake table any time
a `.parquet` file is added to the bucket/prefix.

== Licensing

This repository is licensed under the link:https://www.gnu.org/licenses/agpl-3.0.en.html[AGPL 3.0]. If your organization is interested in re-licensing this function for re-use, contact me via email for commercial licensing terms: `rtyler@buoyantdata.com`