{"id":22724784,"url":"https://github.com/mridang/athena-mongodb","last_synced_at":"2025-10-15T19:36:07.314Z","repository":{"id":194021156,"uuid":"689938825","full_name":"mridang/athena-mongodb","owner":"mridang","description":"MongoDB connector for AWS Athena Federation","archived":false,"fork":false,"pushed_at":"2023-09-15T07:10:17.000Z","size":645,"stargazers_count":0,"open_issues_count":19,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-02-05T00:43:32.177Z","etag":null,"topics":["apache-arrow","athena","aws","lambda","mongodb","trino"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mridang.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-09-11T08:12:40.000Z","updated_at":"2023-09-11T11:39:48.000Z","dependencies_parsed_at":"2023-09-11T09:33:08.042Z","dependency_job_id":null,"html_url":"https://github.com/mridang/athena-mongodb","commit_stats":null,"previous_names":["mridang/athena-mongodb"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mridang%2Fathena-mongodb","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mridang%2Fathena-mongodb/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mridang%2Fathena-mongodb/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mridang%2Fathena-mongodb/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mridang","download_url":"https://codeload.github.com/mridang/athena-mongodb/tar
.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246255756,"owners_count":20748122,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-arrow","athena","aws","lambda","mongodb","trino"],"created_at":"2024-12-10T15:07:56.133Z","updated_at":"2025-10-15T19:36:02.282Z","avatar_url":"https://github.com/mridang.png","language":"Java","readme":"# MongoDB connector for Athena Federation\n\nAn enhanced version of the DocDB connector for AWS Athena.\n\nThis project was started because the current DocDB connector for AWS Athena does\nnot support multi-tenant collections.\n\nA few of the places I've worked at used tenant-specific collections, e.g. `Product_1`, `Product_2`.\n\nThis connector adds support for multi-tenant collections by providing a \"view\" of all\nunderlying multi-tenant collections.\n\nOther improvements include:\n\n* An improved test suite backed by a MongoDB test-container, as the previous one relied heavily on mocks and stubs.\n* Support for AWS Lambda SnapStart, as this is now supported on Lambda.\n* Support for the ARM64 architecture, as the previous implementation used x86_64.\n* Support for Zstandard and Zlib compression, as this requires shared libraries such as libzstd and libgzip to be bundled.\n* Enhanced logging and improved configuration, as the previous implementation did not expose tunable sampling parameters.\n* Improved startup performance by switching the GC mode. 
https://aws.amazon.com/blogs/compute/optimizing-aws-lambda-function-performance-for-java/\n\n## Deploying\n\nUnfortunately, the connector is not available in any public Maven repositories except the GitHub Package Registry.\nFor more information on how to install packages from the GitHub Package\nRegistry, [see the GitHub docs].\n\nThe MongoDB connector for AWS Athena can be deployed using the provided\nCloudFormation template.\n\nWhen deployed, the template creates a Lambda function which can then be\nconfigured for use by AWS Athena. More information can be found here:\n\nhttps://docs.aws.amazon.com/athena/latest/ug/connect-to-a-data-source-lambda.html\n\n#### Enabling compression\n\nYou can enable a driver option to compress messages, which reduces the amount of\ndata passed over the network between the replica set and your application.\n\nThe driver supports the following algorithms:\n\n* Snappy: available in MongoDB 3.4 and later.\n* Zlib: available in MongoDB 3.6 and later.\n* Zstandard: available in MongoDB 4.2 and later.\n\nIf you specify multiple compression algorithms, the driver selects the\nfirst one in the list supported by the instance to which it is connected.\n\nYou can enable compression for the connection to your instance by adding the\n`compressors` parameter, with the algorithms to use, to your connection string.\n\n`\"mongodb+srv://\u003cuser\u003e:\u003cpassword\u003e@\u003ccluster-url\u003e/?compressors=snappy,zlib,zstd\"`\n\n#### Parameters\n\n* `SCHEMA_INFERENCE_NUM_DOCS`: Defines the number of documents that should be\n  sampled to infer the schema. Default `10`.\n* `MONGO_QUERY_BATCH_SIZE`: Defines the number of documents to fetch from MongoDB\n  in every batch. Default `100`.\n* `GLOB_PATTERN`: Defines how collections should be coalesced together\n  when multi-tenant support is required. 
The glob pattern is a valid regex with\n  the leading and trailing regex anchor characters omitted, i.e. `^` and `$`.\n  If you have multi-tenant collections in the form\n\n## Caveats\n\nThe current implementation does not support parallel scans across multi-tenant\ncollections.\n\nA benefit of having multi-tenant collections is that you can parallelize your query.\nAssuming you have 100 collections called `foo_\u003cid\u003e` (where `\u003cid\u003e` denotes the\ntenant), running a query like `SELECT * FROM foo_id` from Athena results in\n100 sequential queries being made.\n\nAdding support for partitioning to the Lambda would enable you to parallelize by a\nfactor of \"n\". You would not run 100 parallel scans, as that would thrash your\nreplica set.\n\nIf this is needed, upstream pull requests are welcome.\n\n## Authors\n\n* Mridang Agarwalla \u003cmridang.agarwalla@gmail.com\u003e\n* Palantir Technologies\n* Amazon Web Services\n\n## License\n\nApache-2.0 License\n\n[see the GitHub docs]: https://docs.github.com/en/packages/guides/configuring-gradle-for-use-with-github-packages#installing-a-package\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmridang%2Fathena-mongodb","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmridang%2Fathena-mongodb","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmridang%2Fathena-mongodb/lists"}