https://github.com/aiwithqasim/emr-batch-processing
This Repo includes the code for Batch Processing using PySpark on AWS EMR
- Host: GitHub
- URL: https://github.com/aiwithqasim/emr-batch-processing
- Owner: aiwithqasim
- Created: 2023-11-11T03:32:40.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-11-11T04:02:54.000Z (over 1 year ago)
- Last Synced: 2023-11-12T04:28:18.493Z (over 1 year ago)
- Language: Python
- Size: 4.72 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
## README
### About Batch Data Pipeline:
The Wikipedia activity data is placed in a folder in an S3 bucket. PySpark code running on the EMR cluster fetches the data from S3, performs filtering and aggregation on it, and pushes the processed data back into S3 in another folder. We then use Athena to query this processed data: we create a table on top of it by providing the relevant schema, and query it with ANSI SQL.
Full Blog link: [Batch Processing using PySpark on AWS EMR](https://dev.to/aiwithqasim/batch-processing-using-pyspark-on-aws-emr-59n4)
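Once the processed data lands in S3, the Athena table described above can be sketched roughly as follows. This is an illustrative sketch only: the table name, column list, and S3 path are assumptions for demonstration, not the repo's actual schema.

```sql
-- Illustrative sketch: table/column names and the S3 path are assumptions,
-- not the repo's actual schema.
CREATE EXTERNAL TABLE IF NOT EXISTS wiki_activity_processed (
  channel      string,
  country_name string,
  edit_count   bigint
)
STORED AS PARQUET
LOCATION 's3://your-bucket/processed/';

-- Standard ANSI SQL over the processed data:
SELECT channel, SUM(edit_count) AS total_edits
FROM wiki_activity_processed
GROUP BY channel
ORDER BY total_edits DESC;
```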
### Architecture Diagram:

- **Languages** - Python
- **Package** - PySpark
- **Services** - AWS EMR, AWS S3, AWS Athena

### Dataset:
We'll be using the Wikipedia activity logs JSON dataset, whose payload comprises 15+ fields.
NOTE: In our script we apply two filter conditions: we keep only those payloads where **_isRobot_** is **_False_** and the user's **_Country_** is the **_United States_**.
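The filtering and aggregation step can be sketched in PySpark roughly as below. This is a hedged sketch, not the repo's exact code: the field names (`isRobot`, `countryName`, `channel`), the aggregation, and the S3 paths are assumptions based on the payload described above.

```python
# Hedged sketch of the EMR batch job's filter/aggregate step.
# Field names (isRobot, countryName, channel) are assumptions,
# not confirmed against the repo's actual dataset schema.

def run_job(spark, input_path, output_path):
    """Read raw JSON from S3, keep non-robot U.S. events, aggregate, write back."""
    # Imported lazily so this sketch can be loaded without PySpark installed.
    from pyspark.sql import functions as F

    df = spark.read.json(input_path)

    filtered = df.filter(
        (F.col("isRobot") == False)                  # condition 1: human edits only
        & (F.col("countryName") == "United States")  # condition 2: U.S. users only
    )

    # Example aggregation: edit counts per channel.
    aggregated = filtered.groupBy("channel").agg(F.count("*").alias("edit_count"))

    aggregated.write.mode("overwrite").parquet(output_path)
```

On the cluster this would be driven by `spark-submit` with a `SparkSession`, e.g. `run_job(spark, "s3://your-bucket/raw/", "s3://your-bucket/processed/")` (hypothetical bucket paths).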

For more such content please follow:
- LinkedIn: [https://www.linkedin.com/in/qasimhassan/](https://www.linkedin.com/in/qasimhassan/)
- GitHub: [https://github.com/aiwithqasim](https://github.com/aiwithqasim)
- Join our [AWS Data Engineering WhatsApp](https://chat.whatsapp.com/LoLXrRI18lPJlLiDK7Mson) Group