https://github.com/mayurasandakalum/ipl-data-engineering-spark-databricks
Comprehensive data engineering and analytics project using IPL dataset with Amazon S3, Apache Spark, Databricks, and SQL. Includes data storage, transformation, analysis, and visualization.
amazon-s3 apache-spark aws big-data cricket-analytics data-analytics data-engineering data-visualization databricks etl-pipeline ipl-dataset machine-learning python sql
- Host: GitHub
- URL: https://github.com/mayurasandakalum/ipl-data-engineering-spark-databricks
- Owner: mayurasandakalum
- Created: 2024-07-31T20:29:47.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-07-31T21:34:21.000Z (almost 2 years ago)
- Last Synced: 2025-03-06T19:43:51.668Z (over 1 year ago)
- Topics: amazon-s3, apache-spark, aws, big-data, cricket-analytics, data-analytics, data-engineering, data-visualization, databricks, etl-pipeline, ipl-dataset, machine-learning, python, sql
- Language: Jupyter Notebook
- Homepage: https://github.com/mayurasandakalum/ipl-data-engineering-spark-databricks
- Size: 2.66 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# IPL Data Engineering and Analytics with Apache Spark and Databricks 🏏📊🚀
This project showcases the development and execution of a data engineering and analytics pipeline using the IPL (Indian Premier League) dataset. The primary objective is to demonstrate data storage, transformation, analysis, and visualization using Amazon S3, Apache Spark, Databricks, SQL, and Python.

## Key Components and Functionalities 🔧
1. **Data Storage 📦**
- **Amazon S3**: Data storage solution where the IPL dataset is uploaded and managed.

2. **Data Transformation and Analysis 🔄**
- **Apache Spark**: Used for data transformation and processing.
- **SQL**: Employed for data querying and analysis.

3. **Data Visualization 📊**
- **Matplotlib and Seaborn**: Used for creating insightful visualizations of the analyzed data.
## Dataset 📚
The IPL dataset used in this project contains ball-by-ball data of all IPL matches up to the 2017 season. The dataset is sourced from data.world and can be accessed [here](https://data.world/raghu543/ipl-data-till-2017).

## Setup 🛠️
### Prerequisites 📋
- AWS account with access to S3.
- Databricks account.
- Python 3.x installed.
### Installation ⚙️
1. **Clone the repository**:
```sh
git clone https://github.com/mayurasandakalum/ipl-data-engineering-spark-databricks.git
```
2. **Navigate to the project directory**:
```sh
cd ipl-data-engineering-spark-databricks
```
## Usage 🚀
### Data Storage on Amazon S3
- Upload the IPL dataset files to your S3 bucket.
- Make sure your Databricks cluster can read the bucket; make the bucket publicly accessible only if required.
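
The upload step can also be scripted. The following is a minimal sketch, assuming the dataset has been downloaded as CSV files into a local folder and that `boto3` is installed with AWS credentials configured. The helper names `plan_uploads` and `upload_dataset`, the `ipl` key prefix, and the folder layout are illustrative assumptions, not part of this repository.

```python
from pathlib import Path


def plan_uploads(data_dir, prefix="ipl"):
    """Pair each local CSV file with the S3 key it would be uploaded to."""
    files = sorted(Path(data_dir).glob("*.csv"))
    return [(path, f"{prefix}/{path.name}") for path in files]


def upload_dataset(data_dir, bucket, prefix="ipl"):
    """Upload every CSV in data_dir to the given bucket (hypothetical helper)."""
    # boto3 is a third-party package; it is imported here so that
    # plan_uploads can be used (and checked) without it.
    import boto3

    s3 = boto3.client("s3")
    for path, key in plan_uploads(data_dir, prefix):
        s3.upload_file(str(path), bucket, key)
        print(f"uploaded {path.name} -> s3://{bucket}/{key}")
```

Calling `plan_uploads("data")` first shows which keys would be written before anything is sent to S3.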
### Data Transformation and Analysis
- **Set up Databricks**:
  - Create a new cluster on Databricks.
  - Import the notebook files and attach them to your cluster.
  - Execute the cells to run the data transformation and analysis steps.
### Data Visualizations
1. **Average Runs Scored by Batsmen in Winning Matches**
2. **Distribution of Scores by Venue**
3. **Impact of Winning Toss on Match Outcomes**
4. **Most Economical Bowlers in Powerplay Overs**
5. **Most Frequent Dismissal Types**
6. **Team Performance After Winning Toss**
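
As an illustration of how one of these charts could be produced, here is a minimal sketch for the dismissal-types plot. It assumes the dismissal values have already been collected into a plain Python list (for example from a small Spark query result); the sample values, the helper names, and the output file name are made up for the example and are not taken from the notebooks.

```python
from collections import Counter


def dismissal_counts(dismissals):
    """Count dismissal types, most frequent first, skipping empty values."""
    counts = Counter(d for d in dismissals if d)
    return counts.most_common()


def plot_dismissals(counts, path="dismissal_types.png"):
    """Draw a horizontal bar chart of (dismissal_type, count) pairs."""
    # matplotlib is imported lazily so the counting helper works without it.
    import matplotlib.pyplot as plt

    labels = [name for name, _ in counts]
    values = [n for _, n in counts]
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.barh(labels, values)
    ax.invert_yaxis()  # most frequent type at the top
    ax.set_xlabel("Dismissals")
    ax.set_title("Most Frequent Dismissal Types")
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)
    return path


# Made-up sample values, only to show the shape of the data
sample = ["caught", "bowled", "caught", "", "lbw", "caught", "bowled"]
print(dismissal_counts(sample))  # [('caught', 3), ('bowled', 2), ('lbw', 1)]
```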

### Code Samples 💻
- **Data Transformation Code**:
```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F  # avoids shadowing Python's built-in sum

spark = SparkSession.builder.appName("IPL Data Analysis").getOrCreate()

# Read data from S3 (replace the bucket name and file path with your own)
df = spark.read.csv("s3://your-bucket-name/ipl-data.csv", header=True, inferSchema=True)

# Aggregate total and average runs per season
transformed_df = df.groupBy("season").agg(
    F.sum("runs").alias("total_runs"),
    F.avg("runs").alias("avg_runs"),
)
transformed_df.show()
```
- **SQL Queries**:
```sql
SELECT player_name, SUM(runs) AS total_runs
FROM ball_by_ball
GROUP BY player_name
ORDER BY total_runs DESC
LIMIT 10;
```
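
The query above needs a table or view named `ball_by_ball` to be visible to Spark SQL. One way to provide it from PySpark is to register a DataFrame as a temporary view and run the query through the session. This is a minimal sketch, assuming `spark` is an active `SparkSession` and `df` is a DataFrame with `player_name` and `runs` columns (the names used in the query; the real dataset's columns may differ). `top_run_scorers` and `TOP_SCORERS_SQL` are hypothetical names introduced for the example.

```python
TOP_SCORERS_SQL = """
SELECT player_name, SUM(runs) AS total_runs
FROM ball_by_ball
GROUP BY player_name
ORDER BY total_runs DESC
LIMIT {limit}
"""


def top_run_scorers(spark, df, limit=10):
    """Register df as the ball_by_ball view and return the top scorers."""
    # A temporary view exists only for the lifetime of the SparkSession.
    df.createOrReplaceTempView("ball_by_ball")
    # int() keeps anything other than a number out of the query text.
    return spark.sql(TOP_SCORERS_SQL.format(limit=int(limit)))
```

In a notebook, `top_run_scorers(spark, df).show()` would then print the result.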
## Contributing 🤝
Feel free to fork this repository, make enhancements, and submit pull requests. Your contributions are welcome!
## Acknowledgments 🙏
Special thanks to the creators of the datasets and the open-source tools used in this project.