https://github.com/dacort/athena-gmail
Athena Gmail connector
- Host: GitHub
- URL: https://github.com/dacort/athena-gmail
- Owner: dacort
- License: apache-2.0
- Created: 2021-01-02T06:32:13.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2021-03-20T05:24:56.000Z (about 4 years ago)
- Last Synced: 2025-02-16T04:33:35.445Z (3 months ago)
- Language: Python
- Size: 27.3 KB
- Stars: 1
- Watchers: 2
- Forks: 2
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Athena Gmail Connector
_Another Thanksgiving day experiment from @dacort_
## Overview
Ever wanted to query your email from Athena? Well now you can!
## Usage
You can (eventually) use any advanced search syntax that Gmail supports in your `WHERE` clause.
- `SELECT * FROM gmail.messages WHERE meta_gmailquery='from:amazonaws.com'`
For this experiment, we only load 100 messages.
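As a rough illustration of the query pattern above, here is a minimal Python sketch that wraps a Gmail search string in the same `SELECT` statement. The helper name `build_messages_query` is hypothetical and not part of this repository; the only thing it adds is doubling single quotes so the search string stays inside the SQL literal.

```python
def build_messages_query(gmail_search: str) -> str:
    """Return an Athena SQL statement filtering gmail.messages by a Gmail search string.

    Hypothetical helper for illustration; the table and column names come from
    the example query in this README.
    """
    # Double any single quotes so the search string cannot end the SQL literal early.
    escaped = gmail_search.replace("'", "''")
    return f"SELECT * FROM gmail.messages WHERE meta_gmailquery='{escaped}'"


print(build_messages_query("from:amazonaws.com"))
```

The printed statement is the same one shown in the usage example, so it can be pasted into the Athena console as-is.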
## Requirements
- Create a Google OAuth client configured as a "Desktop App"
- Run `python quickstart.py` to populate local credentials
## Docker Usage
- In this directory, build the Docker image:
```shell
docker build -t gathena .
```
- Start the container
```shell
docker run -p 9000:8080 gathena:latest
```
- Test the endpoint!
```shell
curl -XPOST "http://localhost:9000/2015-03-31/functions/function/invocations" -d '{"@type": "PingRequest", "identity": {"id": "UNKNOWN", "principal": "UNKNOWN", "account": "123456789012", "arn": "arn:aws:iam::123456789012:root", "tags": {}, "groups": []}, "catalogName": "gmail", "queryId": "1681559a-548b-4771-874c-2aa2ea7c39ab"}'
```
## Uploading
- Create a container repository
```shell
export AWS_REGION=us-east-1
aws ecr create-repository --repository-name gathena --image-scanning-configuration scanOnPush=true
docker tag gathena:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/gathena:latest
aws ecr get-login-password | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/gathena:latest
```
- Create a Lambda function with the above container
- Set an environment variable on the Lambda function with a spill bucket
```shell
aws lambda update-function-configuration --function-name gathena_container --environment 'Variables={TARGET_BUCKET=}'
```
- Add a new data source to Athena pointing to the Lambda function
- If changing code, use `AWS_ACCOUNT_ID=123456789012 make docker` to rebuild and update your Lambda function.
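Once the function is deployed, it may be handy to send it the same `PingRequest` used in the local Docker test. The sketch below is one way to do that from Python; `build_ping_request` and `ping_lambda` are hypothetical names, the function name `gathena_container` is taken from the command above, and the client passed in is assumed to behave like `boto3.client("lambda")`. The shape of the response is not documented here, so the sketch simply returns whatever JSON comes back.

```python
import json
import uuid


def build_ping_request(catalog_name: str = "gmail", account: str = "123456789012") -> dict:
    """Build a PingRequest payload mirroring the one in the local curl test."""
    return {
        "@type": "PingRequest",
        "identity": {
            "id": "UNKNOWN",
            "principal": "UNKNOWN",
            "account": account,
            "arn": f"arn:aws:iam::{account}:root",
            "tags": {},
            "groups": [],
        },
        "catalogName": catalog_name,
        # A fresh query id per call; the curl example hard-codes one instead.
        "queryId": str(uuid.uuid4()),
    }


def ping_lambda(lambda_client, function_name: str = "gathena_container") -> dict:
    """Invoke the function synchronously and return its decoded JSON response.

    `lambda_client` is assumed to expose the boto3 Lambda client's `invoke` call.
    """
    response = lambda_client.invoke(
        FunctionName=function_name,
        Payload=json.dumps(build_ping_request()).encode("utf-8"),
    )
    # boto3 returns the body as a stream, so read it before decoding.
    return json.loads(response["Payload"].read())
```

With `boto3` installed and AWS credentials configured, this would be called as `ping_lambda(boto3.client("lambda"))`. Passing the client in, rather than creating it inside the function, keeps the sketch easy to exercise with a stub.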
## Schema thoughts
- old schema
  - source_file (string)
  - ts (string)
  - from (string)
  - to (string)
  - subject (string)
  - message_id (string)
  - in_reply_to_id (string)
  - dt (string) (Partitioned)
- thoughts from https://stackoverflow.com/questions/14641865/email-database-design-schema
  - from : string
  - to : string
  - subject: string
  - date (range): datetime
  - attachments (names & types only) : Object Array
  - message contents : string
  - (optional) mailbox / folder structure: string
- https://cwiki.apache.org/confluence/display/solr/MailEntityProcessor
  - single valued fields :
    - messageId
    - subject
    - from
    - sentDate
    - xMailer
  - multi valued fields :
    - allTo
    - flags : possible flags are 'answered', 'deleted', 'draft', 'flagged' , 'recent', 'seen'
    - content
    - attachment
    - attachmentNames
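To make the "old schema" list a little more concrete, here is a small sketch of how one raw message might map onto those columns using only Python's standard `email` module. `MessageRow` and `row_from_raw` are hypothetical names, not code from this repository, and the choice of `YYYY-MM-DD` for the `dt` partition is an assumption.

```python
from dataclasses import dataclass
from email import message_from_string
from email.utils import parsedate_to_datetime


@dataclass
class MessageRow:
    """One row shaped like the "old schema" list (every column is a string)."""
    source_file: str
    ts: str
    from_: str  # "from" is a Python keyword, hence the trailing underscore
    to: str
    subject: str
    message_id: str
    in_reply_to_id: str
    dt: str  # partition column; assumed here to be the sent date as YYYY-MM-DD


def row_from_raw(raw: str, source_file: str = "") -> MessageRow:
    """Parse an RFC 822 style message and fill the row from its headers.

    Assumes the message carries a Date header; missing text headers become "".
    """
    msg = message_from_string(raw)
    sent = parsedate_to_datetime(msg["Date"])
    return MessageRow(
        source_file=source_file,
        ts=sent.isoformat(),
        from_=msg.get("From", ""),
        to=msg.get("To", ""),
        subject=msg.get("Subject", ""),
        message_id=msg.get("Message-ID", ""),
        in_reply_to_id=msg.get("In-Reply-To", ""),
        dt=sent.date().isoformat(),
    )
```

Keeping every column a string matches the list above; the other two sketches (typed dates, multi-valued `allTo`, attachment names) would need richer column types than this.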