Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/timkong21/aws-batch-processing
Big data analysis with AWS services, filtering the Wikiticker dataset with Apache Spark on Amazon EMR, storing data in S3, cataloging with AWS Glue, and querying with Amazon Athena. This end-to-end pipeline exemplifies handling and analyzing big data in the cloud.
apache-spark aws aws-athena aws-emr aws-glue aws-s3 big-data cloud-computing data-analysis data-processing
- Host: GitHub
- URL: https://github.com/timkong21/aws-batch-processing
- Owner: TimKong21
- Created: 2024-03-05T04:11:57.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2024-03-07T03:01:36.000Z (8 months ago)
- Last Synced: 2024-10-12T00:03:25.535Z (about 1 month ago)
- Topics: apache-spark, aws, aws-athena, aws-emr, aws-glue, aws-s3, big-data, cloud-computing, data-analysis, data-processing
- Language: Python
- Homepage:
- Size: 8.01 MB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# AWS Big Data Processing
## Overview
This project demonstrates the process of big data analysis using AWS services, focusing on filtering and analyzing the [Wikiticker dataset](https://github.com/apache/druid/blob/master/examples/quickstart/tutorial/wikiticker-2015-09-12-sampled.json.gz). Utilizing technologies such as Amazon EMR, S3, Glue, and Athena, it showcases an end-to-end pipeline from data processing with Spark to data storage, cataloging, and querying.
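The filtering itself runs as a Spark job on EMR (see `Code/filter.py`). As a rough, plain-Python sketch of the kind of filtering involved (the field names `isRobot`, `channel`, and `user` are assumptions based on the Wikiticker sample's edit-event records, not taken from this repo's script):

```python
import json

# Hypothetical stand-in for the Spark filtering step in Code/filter.py:
# parse JSON-lines records and keep only human (non-robot) edit events.
# Field names (`isRobot`, `channel`, `user`) are assumed for illustration.
def filter_events(lines):
    """Return the parsed records whose `isRobot` flag is not set."""
    events = (json.loads(line) for line in lines if line.strip())
    return [e for e in events if not e.get("isRobot", False)]

sample = [
    '{"channel": "#en.wikipedia", "user": "alice", "isRobot": false}',
    '{"channel": "#de.wikipedia", "user": "crawler", "isRobot": true}',
]
print(filter_events(sample))  # only the non-robot edit remains
```

In the actual pipeline the same predicate would be expressed as a Spark DataFrame filter and the results written back to S3 rather than printed.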
## Project Structure
```bash
AWS Big Data Processing
├── Code/
│ └── filter.py # Spark job script for processing the dataset
├── Data/
│ ├── datatypes.json # Schema definition for AWS Glue catalog table
│ └── wikiticker-2015-09-12-sampled.json # Sampled Wikiticker dataset for analysis
└── Project Documentation.pdf # Detailed project documentation
```

## Getting Started
### Prerequisites
- AWS account with access to EMR, S3, Glue, and Athena services.
- AWS CLI installed and configured.

### Setup and Execution
1. **Prepare the Data**: Upload the `wikiticker-2015-09-12-sampled.json` file to your S3 bucket.
2. **Launch an EMR Cluster**: Refer to the `Project Documentation.pdf` for detailed instructions on setting up the EMR cluster.
3. **Run the Spark Job**:
   - SSH into the EMR master node.
   - Use `vi` to create and edit `filter.py` directly on the node:
```bash
vi filter.py
```
   - Insert the Spark script content into `filter.py`. Save and exit the file by typing `:wq!`.
- Execute the script using Spark-submit:
```bash
spark-submit filter.py
```
4. **Catalog the Data**: Use the provided `datatypes.json` to create a schema in AWS Glue for the filtered dataset.
5. **Query with Athena**: Once the Glue catalog table is in place, use Athena to run SQL queries against your data.
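Athena queries are standard SQL over the Glue catalog table. As an illustration of the kind of aggregation this enables (the table and column names below are assumptions, not taken from this repo), the snippet runs an equivalent query against an in-memory SQLite table purely to demonstrate the SQL shape:

```python
import sqlite3

# Stand-in for an Athena query along the lines of:
#   SELECT channel, COUNT(*) AS edits
#   FROM wikiticker_filtered GROUP BY channel ORDER BY edits DESC;
# Table/column names are illustrative; SQLite is used here only to show
# the query shape, and is not part of the AWS workflow.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wikiticker_filtered (channel TEXT, user TEXT)")
conn.executemany(
    "INSERT INTO wikiticker_filtered VALUES (?, ?)",
    [("#en.wikipedia", "alice"),
     ("#en.wikipedia", "bob"),
     ("#fr.wikipedia", "carol")],
)
rows = conn.execute(
    "SELECT channel, COUNT(*) AS edits FROM wikiticker_filtered "
    "GROUP BY channel ORDER BY edits DESC"
).fetchall()
print(rows)  # [('#en.wikipedia', 2), ('#fr.wikipedia', 1)]
```

In Athena the same statement would be issued from the query editor (or via the API) against the Glue table, with results landing in the configured S3 output location.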
### Cleaning Up
Be sure to terminate the EMR cluster and delete any unused resources in S3 to avoid unnecessary charges.
## Further Information
For detailed instructions, configuration options, and best practices, refer to the `Project Documentation.pdf` included in this repository.
## References
The following resources provide foundational lab exercises that inspired the tasks and structure of this project:
- **[Spark Job for Filtering and Processing Wikiticker Data](https://www.projectpro.io/hands-on-labs/spark-job-for-filtering-data)**: Details the tasks involved in developing a Spark job for data filtering, similar to the approach taken in this project.
- **[Create Glue Catalog Table and Query Data in AWS Athena](https://www.projectpro.io/hands-on-labs/aws-glue-catalog-table-example)**: Details the process of creating a Glue catalog table and using Athena for querying, as implemented in the workflow of this project.