https://github.com/mridang/athena-mongodb

MongoDB connector for AWS Athena Federation

apache-arrow athena aws lambda mongodb trino

Host: GitHub
URL: https://github.com/mridang/athena-mongodb
Owner: mridang
License: mit
Created: 2023-09-11T08:12:40.000Z (over 2 years ago)
Default Branch: master
Last Pushed: 2023-09-15T07:10:17.000Z (over 2 years ago)
Last Synced: 2025-02-05T00:43:32.177Z (about 1 year ago)
Topics: apache-arrow, athena, aws, lambda, mongodb, trino
Language: Java
Homepage:
Size: 630 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 19
Metadata Files:
- Readme: README.md

# MongoDB connector for Athena Federation

An enhanced version of the DocDB connector for AWS Athena.

This project was started because the existing DocDB connector for AWS Athena
did not support multi-tenant collections.

A few of the places I've worked at used tenant-specific collections, e.g. `Product_1`, `Product_2`.

This connector adds support for multi-tenant collections by providing a "view" of all
underlying multi-tenant collections.
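
To illustrate the idea, here is a minimal sketch of how tenant collections might be grouped under one logical table. It assumes a regular expression such as `Product_[0-9]+` is matched against the whole collection name (compare the `GLOB_PATTERN` parameter described under Parameters). The class and method names are hypothetical, and this is not the connector's actual code.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class TenantCollections {
    // Returns the collection names that match the pattern in full.
    // Matcher.matches() requires the whole name to match, so the pattern
    // behaves as if it were wrapped in ^ and $ even though they are omitted.
    static List<String> matching(String pattern, List<String> names) {
        Pattern compiled = Pattern.compile(pattern);
        return names.stream()
                .filter(name -> compiled.matcher(name).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> names =
                List.of("Product_1", "Product_2", "Order_1", "Product_1_archive");
        // prints [Product_1, Product_2]
        System.out.println(matching("Product_[0-9]+", names));
    }
}
```

`Product_1_archive` is excluded because the match is against the full name, not a prefix.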

Other improvements include:

* An improved test suite backed by a MongoDB test container, as the previous one relied heavily on mocks and stubs.
* Support for AWS Lambda SnapStart, as this is now supported on Lambda.
* Support for the ARM64 architecture, as the previous implementation used x86_64.
* Support for Zstandard and Zlib compression, which requires shared libraries such as libzstd and libgzip to be bundled.
* Enhanced logging and improved configuration, as the previous implementation did not expose tunable sampling parameters.
* Improved startup performance by switching the GC mode. See https://aws.amazon.com/blogs/compute/optimizing-aws-lambda-function-performance-for-java/

## Deploying

Unfortunately, the connector is not available in any public Maven repository
other than the GitHub Package Registry. For more information on how to install
packages from the GitHub Package Registry, [see the GitHub docs].

The MongoDB connector for AWS Athena can be deployed using the provided
CloudFormation template.

When deployed, the template creates a Lambda function which can then be
configured for use by AWS Athena. More information can be found here:

https://docs.aws.amazon.com/athena/latest/ug/connect-to-a-data-source-lambda.html

#### Enabling compression

You can enable a driver option to compress messages, which reduces the amount of
data passed over the network between the replica set and your application.

The driver supports the following algorithms:

* Snappy: available in MongoDB 3.4 and later.
* Zlib: available in MongoDB 3.6 and later.
* Zstandard: available in MongoDB 4.2 and later.

If you specify multiple compression algorithms, the driver selects the
first one in the list supported by the instance to which it is connected.

You can enable compression for the connection to your instance by adding the
`compressors` parameter, listing the algorithms to use, to your connection string.

`"mongodb+srv://:@/?compressors=snappy,zlib,zstd"`
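
The selection rule described above (first listed algorithm that the instance supports) can be sketched in a few lines. This is a simplified illustration only: the real negotiation happens inside the MongoDB driver during the connection handshake, and the class and method names here are hypothetical.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

class CompressorNegotiation {
    // Picks the first compressor in the client's preference order that the
    // server also supports; an empty result means no compression is used.
    static Optional<String> select(List<String> clientPreference,
                                   Set<String> serverSupported) {
        return clientPreference.stream()
                .filter(serverSupported::contains)
                .findFirst();
    }

    public static void main(String[] args) {
        // Order taken from compressors=snappy,zlib,zstd in the connection string.
        List<String> requested = List.of("snappy", "zlib", "zstd");
        // Hypothetical server that only has zlib and zstd enabled.
        // prints zlib
        System.out.println(select(requested, Set.of("zlib", "zstd")).orElse("none"));
    }
}
```

The order of the list matters: put the algorithm you prefer first, and keep a widely supported one later in the list as a fallback.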

#### Parameters

* `SCHEMA_INFERENCE_NUM_DOCS`: Defines the number of documents that should be
sampled to infer the schema. Default `10`.
* `MONGO_QUERY_BATCH_SIZE`: Defines the number of documents to fetch from MongoDB
in every batch. Default `100`.
* `GLOB_PATTERN`: Defines how collections should be coalesced together
when multi-tenant support is required. The glob pattern is a valid regex with
the leading and trailing regex anchor characters omitted, i.e. `^` and `$`.
If you have multi-tenant collections in the form
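
As a rough sketch of how the numeric parameters and their defaults could be resolved, the snippet below reads them from a map of settings. It assumes the parameters are supplied as Lambda environment variables, which this README does not state explicitly; the class and method names are hypothetical.

```java
import java.util.Map;

class ConnectorSettings {
    // Reads an integer setting, falling back to the default when the
    // variable is unset or blank.
    static int intSetting(Map<String, String> env, String name, int defaultValue) {
        String raw = env.get(name);
        if (raw == null || raw.isBlank()) {
            return defaultValue;
        }
        return Integer.parseInt(raw.trim());
    }

    public static void main(String[] args) {
        // Assumption: settings arrive as environment variables.
        Map<String, String> env = System.getenv();
        int sampleSize = intSetting(env, "SCHEMA_INFERENCE_NUM_DOCS", 10);
        int batchSize = intSetting(env, "MONGO_QUERY_BATCH_SIZE", 100);
        System.out.println("sample=" + sampleSize + " batch=" + batchSize);
    }
}
```

A larger `SCHEMA_INFERENCE_NUM_DOCS` samples more documents per collection, which can help when documents vary in shape, at the cost of more reads during schema inference.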

## Caveats

The current implementation does not support parallel scans across multi-tenant
collections.

A benefit of having multi-tenant collections is that you can parallelize your query.
Assuming you have 100 collections called `foo_<id>` (where `<id>` denotes the
tenant), running a query like `SELECT * FROM foo_id` from Athena will result in
100 sequential queries being made.

Adding support for partitioning to the Lambda would enable you to parallelize by a
factor of "n". You would not run 100 parallel scans as that would thrash your
replica set.
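
One way to picture parallelizing by a factor of "n" is to split the tenant collections into n groups, scan the groups in parallel, and scan the collections inside each group one after another. The sketch below only shows the grouping step. It is not part of the connector, which does not currently support this, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class CollectionPartitioner {
    // Distributes collection names round-robin into at most n groups,
    // so that no more than n scans would ever run at the same time.
    static List<List<String>> partition(List<String> collections, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        int groups = Math.min(n, collections.size());
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i < groups; i++) {
            result.add(new ArrayList<>());
        }
        for (int i = 0; i < collections.size(); i++) {
            result.get(i % groups).add(collections.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> collections = new ArrayList<>();
        for (int tenant = 1; tenant <= 100; tenant++) {
            collections.add("foo_" + tenant);
        }
        List<List<String>> groups = partition(collections, 4);
        // prints 4 groups of 25
        System.out.println(groups.size() + " groups of " + groups.get(0).size());
    }
}
```

Keeping n small relative to the number of collections bounds the load placed on the replica set.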

If parallel scans are needed, upstream pull requests are welcome.

## Authors

* Mridang Agarwalla
* Palantir Technologies
* Amazon Web Services

## License

Apache-2.0 License

[see the GitHub docs]: https://docs.github.com/en/packages/guides/configuring-gradle-for-use-with-github-packages#installing-a-package