https://github.com/epomatti/aws-glue-athena
Glue ETL crawler and jobs with Athena queries
aws aws-athena aws-glue etl s3 terraform
- Host: GitHub
- URL: https://github.com/epomatti/aws-glue-athena
- Owner: epomatti
- License: mit
- Created: 2022-08-14T02:12:53.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2022-08-17T12:44:26.000Z (about 3 years ago)
- Last Synced: 2025-01-17T18:44:40.185Z (9 months ago)
- Topics: aws, aws-athena, aws-glue, etl, s3, terraform
- Language: HCL
- Homepage:
- Size: 138 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# AWS Glue + Athena
Example of using Glue to extract data from RDS and querying it with Athena.
## Create the infrastructure
Create the Terraform variables file:
```sh
touch .auto.tfvars
```

Add the variables according to your preferences. Example:
```hcl
# The role to be assumed by Terraform to create the resources
assume_role_arn = "arn:aws:iam::000000000000:role/OrganizationAccountAccessRole"

# Region to create the resources
region = "sa-east-1"

# Availability Zones
availability_zones = ["sa-east-1a", "sa-east-1b", "sa-east-1c"]
main_az = "sa-east-1a"

# RDS Aurora credentials
master_username = "etluser"
master_password = "passw0rd"
```

Apply Terraform:
```sh
terraform init
terraform apply -auto-approve
```

Once ready, open Glue Studio and test the connection to the RDS database.
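The database endpoint is generated by AWS, so the host name shown in the next section will differ in your deployment. A minimal sketch for looking it up with the AWS CLI follows. The helper name `get_db_endpoint` is hypothetical, and the default instance identifier `aurora-mysql-instance` is an assumption inferred from the host name used in this README.

```sh
# Hypothetical helper: print the endpoint address of an RDS instance.
# The default identifier is inferred from the host name in this README
# and may not match what Terraform actually created.
get_db_endpoint() {
  instance_id="${1:-aurora-mysql-instance}"
  aws rds describe-db-instances \
    --db-instance-identifier "$instance_id" \
    --query 'DBInstances[0].Endpoint.Address' \
    --output text
}
```

Usage: `DB_HOST=$(get_db_endpoint)`, run from a machine with AWS credentials for the target account.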
## Generate database data
Connect to the jumpbox VM using EC2 Instance Connect, then connect to the database:
```sh
mysql -u 'etluser' -p'passw0rd' \
-h 'aurora-mysql-instance.cq1qsu0anb1o.sa-east-1.rds.amazonaws.com' \
-P 3306 \
-D 'testdb'
```

Apply the [`prepare-database.sql`](./prepare-database.sql) script to generate data.
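One way to apply the script is to redirect it into the MySQL client instead of pasting it into an interactive session. The sketch below assumes the file has already been copied to the jumpbox. The helper name `apply_sql_script` is hypothetical, and the credentials and database name are the example values from this README.

```sh
# Hypothetical helper: feed a local SQL file to the MySQL client.
# $1 = database host, $2 = SQL file (defaults to prepare-database.sql).
apply_sql_script() {
  db_host="$1"
  sql_file="${2:-prepare-database.sql}"

  # Fail early if the script was not copied to this machine.
  if [ ! -f "$sql_file" ]; then
    echo "SQL file not found: $sql_file" >&2
    return 1
  fi

  # Same connection options as the interactive command above.
  mysql -u 'etluser' -p'passw0rd' \
    -h "$db_host" \
    -P 3306 \
    -D 'testdb' < "$sql_file"
}
```

Usage: `apply_sql_script '<your-aurora-endpoint>'`.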
## Glue ETL Job
First, run the crawler to populate the Glue Data Catalog:
```sh
aws glue start-crawler --name 'rds-aurora-crawler'
```

Open AWS Glue Studio and go to the Jobs page. Create a new job:
- Source: AWS Glue Data Catalog
- Target: S3

Select JSON as the output format and fill in the required information.
Save the job. The file [auto-generated-script-example.py](./auto-generated-script-example.py) is a reference of what Glue will generate.
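The job can be run from the Glue Studio console. As an alternative, here is a minimal sketch of starting it from the AWS CLI and polling until it reaches a terminal state. The helper name `run_glue_job` is hypothetical, and the job name is whatever you chose when saving the job. The 15-second polling interval is an arbitrary choice.

```sh
# Hypothetical helper: start a Glue job run and wait for it to finish.
# $1 = name of the job saved in Glue Studio.
run_glue_job() {
  job_name="$1"

  # start-job-run returns the ID of the new run.
  run_id=$(aws glue start-job-run --job-name "$job_name" \
    --query 'JobRunId' --output text) || return 1

  while :; do
    state=$(aws glue get-job-run --job-name "$job_name" --run-id "$run_id" \
      --query 'JobRun.JobRunState' --output text) || return 1
    echo "Job run $run_id: $state"
    case "$state" in
      SUCCEEDED) return 0 ;;
      FAILED|STOPPED|TIMEOUT|ERROR|EXPIRED) return 1 ;;
    esac
    sleep 15
  done
}
```

Usage: `run_glue_job 'my-etl-job'`, where `my-etl-job` is a placeholder.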
Run the ETL job and check the output files in S3:
```json
{"favoriteFood":"Pasta","sex":"M","id":2,"birthday":"1998-03-15","name":"John"}
```

## Athena
Athena needs an S3 data source, so it cannot query the existing catalog table, which points to Aurora.
Use Glue again to prepare an Athena table with an S3 source:
1. Create a new database on Glue.
2. Create a new crawler that will read the S3 JSON data and feed the new Glue database.
3. Run the crawler.
4. Go to Athena and add a query result location on S3.

The table should be created automatically, and you should now be able to run a query against the S3 data:
```sql
SELECT * FROM "crawler-s3"."transform" WHERE favoritefood = 'Lasagna';
```

Done 🏆 Athena will run your queries over S3 using SQL.
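The console steps in this section can also be scripted with the AWS CLI. The sketch below is one possible approach and was not taken from the repository. The crawler name, IAM role ARN, and both S3 locations are placeholders to replace with your own values. Only the database name `crawler-s3` comes from the query above.

```sh
# Placeholders (assumptions, not from the README): replace before use.
S3_CRAWLER_NAME='s3-json-crawler'
CRAWLER_ROLE_ARN='arn:aws:iam::000000000000:role/glue-crawler-role'
ETL_OUTPUT_PATH='s3://my-etl-output-bucket/transform/'
ATHENA_RESULTS='s3://my-athena-results-bucket/'

# Steps 1-3: create the Glue database and crawler, run it, wait until idle.
prepare_athena_table() {
  aws glue create-database --database-input '{"Name": "crawler-s3"}' || return 1
  aws glue create-crawler \
    --name "$S3_CRAWLER_NAME" \
    --role "$CRAWLER_ROLE_ARN" \
    --database-name 'crawler-s3' \
    --targets "{\"S3Targets\": [{\"Path\": \"$ETL_OUTPUT_PATH\"}]}" || return 1
  aws glue start-crawler --name "$S3_CRAWLER_NAME" || return 1

  # The crawler reports READY again once the run has finished.
  while :; do
    state=$(aws glue get-crawler --name "$S3_CRAWLER_NAME" \
      --query 'Crawler.State' --output text) || return 1
    [ "$state" = 'READY' ] && return 0
    echo "Crawler state: $state"
    sleep 15
  done
}

# Step 4 onward: submit a query with an explicit result location,
# wait for it to complete, then print the results as returned by the CLI.
run_athena_query() {
  query_id=$(aws athena start-query-execution \
    --query-string "$1" \
    --result-configuration "OutputLocation=$ATHENA_RESULTS" \
    --query 'QueryExecutionId' --output text) || return 1

  while :; do
    state=$(aws athena get-query-execution --query-execution-id "$query_id" \
      --query 'QueryExecution.Status.State' --output text) || return 1
    case "$state" in
      SUCCEEDED) break ;;
      FAILED|CANCELLED) echo "Query $state" >&2; return 1 ;;
    esac
    sleep 2
  done

  aws athena get-query-results --query-execution-id "$query_id"
}
```

Usage: `prepare_athena_table && run_athena_query "SELECT * FROM \"crawler-s3\".\"transform\" WHERE favoritefood = 'Lasagna'"`.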
---
### Clean-up

Delete the manually created Glue jobs, crawlers, database, and table, plus the related S3 and CloudWatch Logs resources.
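The manual deletions can also be done from the AWS CLI. The sketch below is a starting point only. The helper name `cleanup_manual_resources` is hypothetical, and all four arguments depend on the names you chose when creating the resources by hand. Only the database name `crawler-s3` comes from this README. It removes objects under the given S3 URI and leaves the bucket itself in place.

```sh
# Hypothetical helper: delete the resources that were created by hand.
# $1 = Glue job name, $2 = S3 crawler name,
# $3 = S3 URI holding the job output, $4 = CloudWatch Logs group name.
cleanup_manual_resources() {
  job_name="$1"
  crawler_name="$2"
  s3_uri="$3"
  log_group="$4"

  aws glue delete-job --job-name "$job_name"
  aws glue delete-crawler --name "$crawler_name"
  # Deleting the database also removes the tables it contains.
  aws glue delete-database --name 'crawler-s3'
  # Removes objects only; the bucket itself is left in place.
  aws s3 rm "$s3_uri" --recursive
  aws logs delete-log-group --log-group-name "$log_group"
}
```

Usage: `cleanup_manual_resources 'my-etl-job' 's3-json-crawler' 's3://my-etl-output-bucket/transform/' '<log-group-name>'`, with every value replaced by your own.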
Run `terraform destroy -auto-approve` to remove the infrastructure.