https://github.com/divitmittal/datathon-bigdata
Efficient Data Processing ETL Pipeline for Event Records
aws aws-glue aws-lambda aws-s3 etl-pipeline hadoop spark
Last synced: about 1 month ago
- Host: GitHub
- URL: https://github.com/divitmittal/datathon-bigdata
- Owner: DivitMittal
- License: mit
- Created: 2024-09-21T14:27:56.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2026-01-15T20:06:32.000Z (5 months ago)
- Last Synced: 2026-01-15T22:31:41.511Z (5 months ago)
- Topics: aws, aws-glue, aws-lambda, aws-s3, etl-pipeline, hadoop, spark
- Language: Jupyter Notebook
- Homepage: https://deepwiki.com/DivitMittal/Datathon-BigData
- Size: 4.11 MB
- Stars: 6
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.adoc
- License: LICENSE
README
= Datathon-BigData
== Efficient Data Processing ETL Pipeline for Event Records
This pipeline processes raw product event data: it filters the relevant https://www.bobble.ai/en/home[BobbleAI Keyboard] event records from the five days before July 1, 2024, expands their JSON columns, and stores the final data on S3 in the structured Apache Parquet format.
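
The filter-and-expand step can be illustrated with a minimal, plain-Python sketch. It is not the repository's PySpark code, and it rests on a few assumptions: the field names `app`, `event_time`, and `payload` are hypothetical, the value `"keyboard"` is a placeholder for whatever identifies BobbleAI Keyboard records, and "the five days before July 1, 2024" is read as June 26 through June 30 inclusive. Writing the result to Parquet on S3 is left out.

```python
import json
from datetime import date, datetime, timedelta

# Assumption: the window is half-open, [2024-06-26, 2024-07-01),
# i.e. June 26 through June 30 inclusive.
CUTOFF = date(2024, 7, 1)
WINDOW_START = CUTOFF - timedelta(days=5)


def in_window(event_time: str) -> bool:
    """Return True if an ISO-8601 timestamp falls inside the window."""
    day = datetime.fromisoformat(event_time).date()
    return WINDOW_START <= day < CUTOFF


def expand_json(record: dict, column: str) -> dict:
    """Replace a JSON-string column with prefixed top-level keys."""
    flat = {k: v for k, v in record.items() if k != column}
    for key, value in json.loads(record[column]).items():
        flat[f"{column}_{key}"] = value
    return flat


def transform(records: list, app: str = "keyboard") -> list:
    """Keep records for `app` inside the window, then expand the JSON column."""
    return [
        expand_json(r, "payload")  # "payload" is a hypothetical column name
        for r in records
        if r["app"] == app and in_window(r["event_time"])
    ]


# Hypothetical sample records; only the first one passes both filters.
sample = [
    {"app": "keyboard", "event_time": "2024-06-28T10:15:00",
     "payload": '{"screen": "home", "count": 3}'},
    {"app": "keyboard", "event_time": "2024-07-01T00:00:00",
     "payload": '{"screen": "home", "count": 1}'},
    {"app": "other", "event_time": "2024-06-27T08:00:00",
     "payload": '{"screen": "settings", "count": 2}'},
]
result = transform(sample)
```

In a Spark job the same logic would typically be expressed as a DataFrame filter on the timestamp column followed by parsing the JSON column into separate fields; the half-open window keeps the July 1 boundary itself out of the result.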
== Technology Stack
- **Cloud Services**: AWS (S3, Lambda, Glue)
- **Data Processing**: PySpark on AWS Glue
- **Storage**: S3 (Parquet format)
- **IAM & Security**: AWS IAM roles and policies for access control