# PySpark Exercises

![Python](https://img.shields.io/badge/python-3.9%20%7C%203.10-blue.svg?logo=python&logoColor=white)
![Apache Spark](https://img.shields.io/badge/Apache%20Spark-3.4.0-E25A1C.svg?logo=apache-spark&logoColor=white)
![Jupyter](https://img.shields.io/badge/Jupyter-Notebook-F37626.svg?logo=jupyter&logoColor=white)
![AWS](https://img.shields.io/badge/AWS-EMR%20%7C%20S3-FF9900.svg?logo=amazon-aws&logoColor=white)
![pandas](https://img.shields.io/badge/pandas-2.0.0-150458.svg?logo=pandas&logoColor=white)
![NumPy](https://img.shields.io/badge/NumPy-1.24.0-013243.svg?logo=numpy&logoColor=white)

A comprehensive collection of exercises and mini-projects using [PySpark](https://spark.apache.org/docs/latest/api/python/index.html) (the Python API for Apache Spark). These materials were developed as part of Udacity's [Learn Spark](https://www.udacity.com/course/learn-spark-at-udacity--ud2002) course and provide hands-on experience with Apache Spark's core features and advanced capabilities.

## 🛠 Tech Stack
- Python
- PySpark
- NumPy
- pandas
- Matplotlib
- Jupyter Notebook
- AWS
- GitHub

## 📂 Repository Structure

```
.
├── data_wrangling_with_spark/            # Data processing fundamentals
│   ├── notebooks covering procedural vs. functional programming
│   ├── Spark operations and lazy evaluation
│   ├── DataFrame operations and SQL
│   └── practice datasets
├── debugging_and_optimization/           # Performance tuning
│   └── exercises/
│       ├── data skewness handling
│       ├── broadcast joins
│       └── repartitioning strategies
├── machine_learning_with_spark/          # ML implementations
│   ├── feature engineering
│   ├── linear regression
│   ├── k-means clustering
│   └── model tuning
└── setting_up_spark_clusters_with_aws/   # AWS deployment
    ├── demo_code/
    └── exercises/
        ├── EMR cluster creation
        ├── script submission
        └── S3 integration
```

## 📚 Course Content

### 1. The Power of Spark
- Introduction to the Big Data ecosystem
- MapReduce implementation
- Fundamental Spark concepts
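
To make the MapReduce idea concrete, here is a minimal sketch of the classic word count expressed in map/reduce style on Spark's RDD API (the input path is a placeholder, not a file in this repository):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

# Map each line to words, emit (word, 1) pairs, then reduce by key.
# "data/sample.txt" is a hypothetical path used only for illustration.
counts = (
    spark.sparkContext.textFile("data/sample.txt")
    .flatMap(lambda line: line.split())
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)

print(counts.take(10))
```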

### 2. Data Wrangling with Spark
- Functional programming principles
- DataFrame operations and transformations
- Spark SQL integration
- Data input/output operations
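
As a flavor of what these exercises cover, here is a minimal sketch contrasting the DataFrame API with Spark SQL on the same data (the column names and rows are illustrative, not the course dataset's schema):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("wrangling").getOrCreate()

# Illustrative data; the course exercises work with larger log datasets.
df = spark.createDataFrame(
    [("alice", "NextSong", 3), ("bob", "Home", 1), ("alice", "NextSong", 2)],
    ["userId", "page", "plays"],
)

# DataFrame API: transformations are lazy and only run when an action
# such as show() is called.
df.filter(F.col("page") == "NextSong").groupBy("userId").sum("plays").show()

# The equivalent query through Spark SQL on a temporary view.
df.createOrReplaceTempView("log")
spark.sql(
    "SELECT userId, SUM(plays) AS total_plays "
    "FROM log WHERE page = 'NextSong' GROUP BY userId"
).show()
```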

### 3. Setting up Spark Clusters with AWS
- EMR cluster deployment
- AWS CLI integration
- S3 data storage
- Spark job submission
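
Much of this section happens in the AWS console and CLI, but the scripts submitted to EMR share a common shape. Below is a minimal sketch of a job that reads from and writes to S3; the bucket, key, and column names are hypothetical:

```python
from pyspark.sql import SparkSession

# On EMR this script would typically be launched with spark-submit;
# the SparkSession picks up the cluster configuration automatically.
spark = SparkSession.builder.appName("s3_job").getOrCreate()

# "s3://my-bucket/..." is a placeholder path. On EMR the s3:// scheme
# is handled by EMRFS, so no extra connector setup is needed there.
df = spark.read.json("s3://my-bucket/input/logs.json")

# A trivial transformation (on a hypothetical "page" column) so the
# job does real work end to end before writing results back to S3.
df.filter(df.page == "NextSong").write.mode("overwrite").parquet(
    "s3://my-bucket/output/nextsong_events/"
)
```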

### 4. Debugging and Optimization
- Data skewness handling
- Broadcast join optimization
- Partition management
- Performance tuning strategies
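
For instance, here is a minimal sketch of a broadcast join, one of the optimizations covered: when one side of a join is small, shipping it to every executor avoids shuffling the large side across the cluster (the tables here are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("optimization").getOrCreate()

# Illustrative tables: a large fact table and a small dimension table.
events = spark.range(1_000_000).withColumnRenamed("id", "user_id")
lookup = spark.createDataFrame([(0, "free"), (1, "paid")], ["user_id", "tier"])

# broadcast() hints that the small table should be copied to every
# executor, so the large table is joined in place without a shuffle.
joined = events.join(broadcast(lookup), on="user_id", how="left")
joined.explain()  # the physical plan should show a BroadcastHashJoin
```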

### 5. Machine Learning with Spark
- Feature engineering (numeric and text)
- Linear regression implementation
- K-means clustering
- Model tuning and optimization
- ML pipeline construction
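
As an example of pipeline construction, here is a minimal sketch wiring a feature assembler into a linear regression with `pyspark.ml` (the data and column names are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("ml").getOrCreate()

# Illustrative data: two numeric features and a label column.
train = spark.createDataFrame(
    [(1.0, 2.0, 5.0), (2.0, 1.0, 4.0), (3.0, 4.0, 11.0)],
    ["x1", "x2", "label"],
)

# Stage 1 packs the raw columns into the single vector column Spark ML
# expects; stage 2 fits the model on that vector.
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
lr = LinearRegression(featuresCol="features", labelCol="label")
model = Pipeline(stages=[assembler, lr]).fit(train)

model.transform(train).select("features", "label", "prediction").show()
```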

## 🚀 Getting Started

1. **Environment Setup**
   - Follow PySpark's official [installation guide](https://spark.apache.org/docs/latest/api/python/getting_started/install.html)
   - Set up a Python environment with the required dependencies (a quick smoke test is sketched after this list)
   - Configure AWS credentials (for the cluster-related exercises)

2. **Running the Exercises**
   - Each directory contains Jupyter notebooks and Python scripts
   - Start with the numbered notebooks in each section
   - Solutions are provided for self-assessment
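
Once the environment is set up, a quick local smoke test (assuming `pyspark` is installed in the active environment):

```python
from pyspark.sql import SparkSession

# local[*] runs Spark in-process on all available cores; no cluster needed.
spark = (
    SparkSession.builder.master("local[*]").appName("smoke_test").getOrCreate()
)

spark.range(5).show()  # prints ids 0-4 if the installation works
spark.stop()
```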

## 📝 Notes
- Exercise solutions are available in corresponding `*_solution` notebooks
- AWS-related exercises require active AWS credentials
- Sample datasets are included in respective directories

## 🤝 Contributing
Feel free to submit issues and enhancement requests!