https://github.com/buoyant-data/oxbow
Collection of AWS Lambdas for creating and managing Delta tables
https://github.com/buoyant-data/oxbow
datalake deltalake lambda parquet rust
Last synced: 10 days ago
JSON representation
Collection of AWS Lambdas for creating and managing Delta tables
- Host: GitHub
- URL: https://github.com/buoyant-data/oxbow
- Owner: buoyant-data
- License: agpl-3.0
- Created: 2023-05-06T15:35:50.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2026-01-12T17:38:30.000Z (2 months ago)
- Last Synced: 2026-01-12T23:04:45.077Z (about 2 months ago)
- Topics: datalake, deltalake, lambda, parquet, rust
- Language: Rust
- Homepage: https://www.buoyantdata.com
- Size: 604 KB
- Stars: 54
- Watchers: 1
- Forks: 10
- Open Issues: 5
-
Metadata Files:
- Readme: README.adoc
- License: LICENSE.txt
Awesome Lists containing this project
README
ifdef::env-github[]
:tip-caption: :bulb:
:note-caption: :information_source:
:important-caption: :heavy_exclamation_mark:
:caution-caption: :fire:
:warning-caption: :warning:
endif::[]
:toc: macro
= Oxbow
Oxbow is a project to take an existing storage location which contains
link:https://parquet.apache.org[Apache Parquet] files into a
link:https://delta.io[Delta Lake table]. It is intended to run both as an AWS
Lambda or as a command line application.
The project is named after link:https://en.wikipedia.org/wiki/Oxbow_lake[Oxbow
lakes] to keep with the lake theme.
toc::[]
== Using
=== Command Line
Executing `cargo build --release` from a clone of this repository will build
the command line binary `oxbow` which can be used directly to convert a
directory full of `.parquet` files into a Delta table.
This is an _in place_ operation and will convert the specified table location
into a Delta table!
.Simple local files
[source,bash]
----
% oxbow --table ./path/to/my/parquet-files
----
.Files on AWS
[source,bash]
----
% export AWS_REGION=us-west-2
% export AWS_SECRET_ACCESS_KEY=xxxx
# Set other AWS environment variables
% oxbow --table s3://my-bucket/prefix/to/parquet
----
=== Lambda
The `deployment/` directory contains the necessary Terraform to provision the
function, a DynamoDB table for locking, S3 bucket, and IAM permissions.
After configuring the necessary authentication for Terraform, the following
steps can be used to provision:
[source,bash]
----
cargo lambda build --release --output-format zip --bin oxbow-lambda
terraform init
terraform plan
terraform apply
----
[NOTE]
====
Terraform configures the Lambda to run with the smallest amount of memory
allowed. For bucket locations with massive `.parquet` files, this may need to
be tuned.
====
==== Advanced
To help ameliorate
link:https://www.buoyantdata.com/blog/2023-11-27-concurrency-limitations-with-deltalake-on-aws.html[concurrency
challenges for Delta Lake on AWS] with the DynamoDb lock, the `deployment/`
directory also contains an "advanced" pattern which uses the `group-events`
Lambda to help serialize S3 Bucket Notifications into an AWS SQS FIFO with
Message Group IDs.
To build all the necessary code locally for the Advanced pattern, please run
`make build-release`
== Development
Building and testing can be done with cargo: `cargo test`.
In order to deploy this in AWS Lambda, it must first be built with the `cargo
lambda` command line tool, e.g.:
[source,bash]
----
cargo lambda build --release --output-format zip
----
This will produce the file: `target/lambda/oxbow-lambda/bootstrap.zip` which can be
uploaded direectly in the web console, or referenced in the Terraform (see
`deployment.tf`).
=== Design
==== Command Line
When running `oxbow` via command line it is a _one time operation_. It will
take an existing directory or location full of `.parquet` files and create a
Delta table out of it.
==== Lambda
When running `oxbow` inside of a AWS Lambda function it should be configured
with an S3 Event Trigger and create new commits to a Delta Lake table any time
a `.parquet` file is added to the bucket/prefix.
== Licensing
This repository is licensed under the link:https://www.gnu.org/licenses/agpl-3.0.en.html[AGPL 3.0]. If your organization is interested in re-licensing this function for re-use, contact me via email for commercial licensing terms: `rtyler@buoyantdata.com`