https://github.com/dacort/athena-sqlite
A SQLite driver for S3 and Amazon Athena 😳
amazon-athena athena aws lambda-layer s3 sar serverless sqlite vfs
- Host: GitHub
- URL: https://github.com/dacort/athena-sqlite
- Owner: dacort
- License: apache-2.0
- Created: 2019-12-11T07:14:58.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2019-12-16T06:19:07.000Z (about 5 years ago)
- Last Synced: 2024-08-14T07:08:55.617Z (5 months ago)
- Topics: amazon-athena, athena, aws, lambda-layer, s3, sar, serverless, sqlite, vfs
- Language: Python
- Homepage: https://serverlessrepo.aws.amazon.com/#/applications/arn:aws:serverlessrepo:us-east-1:689449560910:applications~AthenaSQLITEConnector
- Size: 85 KB
- Stars: 96
- Watchers: 7
- Forks: 5
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- jimsghstars - dacort/athena-sqlite - A SQLite driver for S3 and Amazon Athena 😳 (Python)
README
# Athena SQLite Driver
Using Athena's new [Query Federation](https://github.com/awslabs/aws-athena-query-federation/) functionality, read SQLite databases from S3.
Install it from the Serverless Application Repository: [AthenaSQLiteConnector](https://serverlessrepo.aws.amazon.com/#/applications/arn:aws:serverlessrepo:us-east-1:689449560910:applications~AthenaSQLITEConnector).
## Why?
I occasionally like to put together fun side projects over Thanksgiving and Christmas holidays.
I'd always joked it would be a crazy idea to be able to read SQLite using Athena, so...here we are!
## How?
- I decided to use Python as I'm most familiar with it and because of the next point
- Using [APSW](https://rogerbinns.github.io/apsw/), we can implement a [Virtual File System](https://rogerbinns.github.io/apsw/vfs.html) (VFS) for S3
- Using the [Athena query federation example](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-example/), we can see what calls need to be implemented

The PyArrow library unfortunately weighs in at over 250MB, so we have to use a custom compilation step to build a Lambda Layer.
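
To make the VFS idea above more concrete: SQLite asks its VFS for `amount` bytes at a given `offset`, and one way to serve that from S3 is an HTTP `Range` request. The sketch below is only an illustration of that mapping, not the code in `vfs.py` (the TODO list suggests the current implementation reads the entire file). `RangedObjectReader` and `in_memory_fetch` are hypothetical names, and the in-memory fetcher stands in for S3 so the example runs without AWS access.

```python
from typing import Callable


class RangedObjectReader:
    """Read-only view over a remote object, addressed by (amount, offset)."""

    def __init__(self, size: int, fetch: Callable[[str], bytes]) -> None:
        self.size = size      # total object size in bytes
        self._fetch = fetch   # hypothetical callable: HTTP Range header -> bytes

    def read(self, amount: int, offset: int) -> bytes:
        """Return up to `amount` bytes starting at `offset`."""
        if amount <= 0 or offset < 0 or offset >= self.size:
            return b""
        # HTTP byte ranges are inclusive on both ends.
        last = min(offset + amount, self.size) - 1
        return self._fetch(f"bytes={offset}-{last}")


def in_memory_fetch(blob: bytes) -> Callable[[str], bytes]:
    """Stand-in for S3: serve byte ranges out of a local bytes object."""
    def fetch(range_header: str) -> bytes:
        first, last = range_header[len("bytes="):].split("-")
        return blob[int(first):int(last) + 1]
    return fetch


blob = bytes(range(256))
reader = RangedObjectReader(len(blob), in_memory_fetch(blob))
header = reader.read(4, 10)    # four bytes starting at offset 10
tail = reader.read(100, 250)   # request runs past the end, so it is truncated
```

With boto3, the fetcher could wrap `s3.get_object(Bucket=..., Key=..., Range=range_header)["Body"].read()`. An APSW `VFSFile.xRead(amount, offset)` method could then delegate to `read`.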
## What?
Drop SQLite databases in a single prefix in S3, and Athena will list each file as a database and automatically detect tables and schemas.
Currently, all data types are strings. I'll fix this eventually. All good things in time.
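
As a rough illustration of what "detect tables and schemas" can look like, here is a minimal sketch using Python's built-in `sqlite3` module on an in-memory database. It is not the connector's actual code (which goes through APSW and the S3 VFS). `describe_tables` is a hypothetical helper, and it reports every column as `string` to mirror the current limitation.

```python
import sqlite3


def describe_tables(conn: sqlite3.Connection) -> dict:
    """Map each user table to its columns, with every column typed as 'string'."""
    schema = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # Quote the identifier so unusual table names do not break the PRAGMA.
        quoted = '"' + table.replace('"', '""') + '"'
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = conn.execute(f"PRAGMA table_info({quoted})").fetchall()
        schema[table] = [(col[1], "string") for col in columns]
    return schema


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER, name TEXT)")
schema = describe_tables(conn)
```

Mapping the declared SQLite types (the third field of each `table_info` row) to Athena types would be the natural next step once non-string types are supported.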
## Status
This project is under active development and very much in its infancy.
Many things are hard-coded or broken into various pieces as I experiment and figure out how everything works.
## Building
The documentation for this is a work in progress. It's currently in between me creating the resources manually and building the assets for the AWS SAR,
and most of the docs will be automated away.

### Requirements
- Docker
- Python 3.7

### Lambda layer
First you need to build the Lambda layer. There are two Dockerfiles and build scripts in the `lambda-layer/` directory.
We'll execute each of the build scripts and copy the results to the target directory. This is referenced by the SAR template, [`athena-sqlite.yaml`](athena-sqlite.yaml).
```
cd lambda-layer
./build.sh
./build-pyarrow.sh
cp -R layer/ ../target/
```

### Upload sample data
For the purpose of this test, we just have a sample SQLite database you can upload.
`aws s3 cp sample-data/sample_data.sqlite s3:////`
Feel free to upload your own SQLite databases as well!
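
If you don't have a SQLite database handy, a small one can be generated with Python's built-in `sqlite3` module. This is a minimal sketch: the table name `records` simply mirrors the sample queries in this README, while the columns, rows, file name, and the `build_sample_db` helper are made up for illustration.

```python
import os
import sqlite3
import tempfile


def build_sample_db(path: str, rows) -> int:
    """Create a small SQLite file with a `records` table and return its row count."""
    conn = sqlite3.connect(path)
    try:
        with conn:  # commits the transaction on success
            conn.execute("CREATE TABLE records (id INTEGER, name TEXT)")
            conn.executemany("INSERT INTO records VALUES (?, ?)", rows)
        return conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
    finally:
        conn.close()


# Hypothetical file name; the file lands in a fresh temporary directory.
db_path = os.path.join(tempfile.mkdtemp(), "my_data.sqlite")
row_count = build_sample_db(db_path, [(1, "alpha"), (2, "beta"), (3, "gamma")])
```

The resulting file can be uploaded with the same `aws s3 cp` command used for the sample database.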
### Lambda function
There are three components to the Lambda code:
- `vfs.py` - A SQLite Virtual File System implementation for S3
- `s3qlite.py` - The actual Lambda function that handles Athena metadata/data requests
- `sqlite_db.py` - Helper functions for accessing SQLite databases on S3

Create a function with the code in [lambda-function/s3qlite.py](lambda-function/s3qlite.py) that uses the previously created layer.
The handler will be `s3qlite.lambda_handler`.
Also include the `vfs.py` and `sqlite_db.py` files in your Lambda function.

Configure two environment variables for your Lambda function:
- `TARGET_BUCKET` - The name of your S3 bucket where SQLite files live
- `TARGET_PREFIX` - The prefix (e.g. `data/sqlite`) that you uploaded the sample SQLite database to

Note that the IAM role you associate the function with will also need `s3:GetObject` and `s3:ListBucket` access to wherever your lovely SQLite databases are stored.
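
For the two S3 permissions mentioned above, a policy document could be assembled along these lines. This is a sketch under assumptions, not a policy shipped with the project: `read_only_policy` is a hypothetical helper and `example-bucket` is a placeholder. The one detail worth noting is that `s3:ListBucket` applies to the bucket ARN, while `s3:GetObject` applies to object ARNs under the prefix.

```python
import json


def read_only_policy(bucket: str, prefix: str) -> dict:
    """Build an IAM policy allowing listing the bucket and reading objects under a prefix."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, so the resource is the bucket itself.
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Reading is an object-level action, scoped to keys under the prefix.
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


# "example-bucket" is a placeholder; "data/sqlite" is the example prefix from above.
policy = read_only_policy("example-bucket", "data/sqlite")
print(json.dumps(policy, indent=2))
```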
### Configure Athena
Follow the Athena documentation for [Connecting to a data source](https://docs.aws.amazon.com/athena/latest/ug/connect-to-a-data-source.html).
The primary thing to note here is that you need to create a workgroup named `AmazonAthenaPreviewFunctionality` and use that for your testing.
Some functionality will work in the primary workgroup, but you'll get weird errors when you try to query data.

I named my function `s3qlite` :)
### Run queries!
Here are a couple of basic queries that should work:
```sql
SELECT * FROM "s3qlite"."sample_data"."records" LIMIT 10;

SELECT COUNT(*) FROM "s3qlite"."sample_data"."records";
```

If you deploy the SAR app, the data catalog isn't registered automatically, but you can still run queries by using the special `lambda:` schema:
```sql
SELECT * FROM "lambda:s3qlite".sample_data.records LIMIT 10;
```

Where `s3qlite` is the value you provided for the `AthenaCatalogName` parameter.
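
The same query can also be submitted programmatically. The sketch below only builds the request parameters, so it runs without AWS credentials. `athena_query_request` is a hypothetical helper and the results location is a placeholder, while the workgroup name is the one this README says to use for testing.

```python
def athena_query_request(catalog: str, database: str, table: str,
                         output_location: str, limit: int = 10) -> dict:
    """Build keyword arguments for Athena's StartQueryExecution call."""
    query = (
        f'SELECT * FROM "lambda:{catalog}".{database}.{table} '
        f"LIMIT {int(limit)}"
    )
    return {
        "QueryString": query,
        # Workgroup named in the "Configure Athena" section.
        "WorkGroup": "AmazonAthenaPreviewFunctionality",
        "ResultConfiguration": {"OutputLocation": output_location},
    }


# The results bucket below is a placeholder, not a real location.
request = athena_query_request(
    "s3qlite", "sample_data", "records", "s3://example-results-bucket/athena/"
)

# With boto3 installed and credentials configured, this could be sent as:
#   import boto3
#   boto3.client("athena").start_query_execution(**request)
```

The helper does no identifier escaping, so it is only suitable for trusted catalog, database, and table names.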
## TODO
- Move these into issues :)
- Move vfs.py into its own module
- Maybe add write support to it someday :scream:
- Publish to SAR
- Add tests...always tests
- struct types, probably
- Don't read the entire file every time :)
- Escape column names with invalid characters
- Implement recursive listing

## Serverless App Repo
These are mostly notes I made while figuring out how to get SAR working.
Need to grant SAR access to the bucket
```shell
aws s3api put-bucket-policy --bucket --region us-east-1 --policy '{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "serverlessrepo.amazonaws.com"
},
"Action": "s3:GetObject",
"Resource": "arn:aws:s3:::/*"
}
]
}'
```

For publishing to the SAR, we just execute two commands
```shell
sam package --template-file athena-sqlite.yaml --s3-bucket --output-template-file target/out.yaml
sam publish --template target/out.yaml --region us-east-1
```

If you want to deploy using CloudFormation, use this command:
```shell
sam deploy --template-file ./target/out.yaml --stack-name athena-sqlite --capabilities CAPABILITY_IAM --parameter-overrides 'DataBucket= DataPrefix=tmp/sqlite' --region us-east-1
```