https://github.com/nabilshadman/pyspark-dataframe-sql-ml-exercises

Comprehensive PySpark exercises covering data wrangling, ML pipelines, and AWS deployment from Udacity's Learn Spark course
- Host: GitHub
- URL: https://github.com/nabilshadman/pyspark-dataframe-sql-ml-exercises
- Owner: nabilshadman
- Created: 2023-12-16T05:30:44.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-12-04T18:46:30.000Z (5 months ago)
- Last Synced: 2025-02-09T13:13:01.412Z (3 months ago)
- Topics: apache-spark, aws, data-engineering, data-science, machine-learning, pyspark
- Language: Jupyter Notebook
- Size: 39.4 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# PySpark Exercises

A comprehensive collection of exercises and mini-projects using [PySpark](https://spark.apache.org/docs/latest/api/python/index.html), the Python API for Apache Spark. These materials were developed as part of Udacity's [Learn Spark](https://www.udacity.com/course/learn-spark-at-udacity--ud2002) course and provide hands-on experience with Apache Spark's core features and advanced capabilities.
## 🛠 Tech Stack
- Python
- PySpark
- NumPy
- pandas
- Matplotlib
- Jupyter Notebook
- AWS
- GitHub

## 📂 Repository Structure
```
.
├── data_wrangling_with_spark/           # Data processing fundamentals
│   ├── notebooks covering procedural vs functional programming
│   ├── Spark operations and lazy evaluation
│   ├── DataFrame operations and SQL
│   └── practice datasets
├── debugging_and_optimization/          # Performance tuning
│   └── exercises/
│       ├── data skewness handling
│       ├── broadcast joins
│       └── repartitioning strategies
├── machine_learning_with_spark/         # ML implementations
│   ├── feature engineering
│   ├── linear regression
│   ├── k-means clustering
│   └── model tuning
└── setting_up_spark_clusters_with_aws/  # AWS deployment
    ├── demo_code/
    └── exercises/
        ├── EMR cluster creation
        ├── script submission
        └── S3 integration
```

## 📚 Course Content
### 1. The Power of Spark
- Introduction to Big Data ecosystem
- MapReduce implementation
- Fundamental Spark concepts
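
As a hands-on taste of these concepts, here is a minimal word-count sketch of the MapReduce pattern in PySpark (the input path is a placeholder, not a file from this repo):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

# "Map" turns each line into (word, 1) pairs; "reduce" sums the counts per word.
lines = spark.sparkContext.textFile("data/sample.txt")  # placeholder path
counts = (
    lines.flatMap(lambda line: line.split())  # map: one record per word
         .map(lambda word: (word, 1))         # pair each word with a count
         .reduceByKey(lambda a, b: a + b)     # reduce: sum counts per key
)
print(counts.take(5))  # nothing executes until this action, thanks to lazy evaluation
```
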
### 2. Data Wrangling with Spark
- Functional programming principles
- DataFrame operations and transformations
- Spark SQL integration
- Data input/output operations
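
A minimal sketch combining the DataFrame API and Spark SQL covered in this section (the file path and column names are placeholders, not the course datasets):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wrangling").getOrCreate()

# DataFrame API: read, filter, and aggregate.
df = spark.read.json("data/log.json")  # placeholder dataset
df.filter(df.page == "Home").groupBy("userId").count().show(5)

# Spark SQL: register the same DataFrame as a temporary view and query it.
df.createOrReplaceTempView("log")
spark.sql("SELECT userId, COUNT(*) AS visits FROM log GROUP BY userId").show(5)
```
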
### 3. Setting up Spark Clusters with AWS
- EMR cluster deployment
- AWS CLI integration
- S3 data storage
- Spark job submission
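
For example, once an EMR cluster is up (its S3 connector comes preconfigured), reading from and writing to S3 looks roughly like the sketch below; the bucket and key names are placeholders, and a script like this would typically be shipped to the cluster with `spark-submit`:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-demo").getOrCreate()

# Read a CSV straight from S3 (placeholder bucket and key).
df = spark.read.csv("s3a://my-bucket/cities.csv", header=True, inferSchema=True)
df.printSchema()

# Write the result back to S3 as Parquet for downstream jobs.
df.write.mode("overwrite").parquet("s3a://my-bucket/output/cities")
```
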
### 4. Debugging and Optimization
- Data skewness handling
- Broadcast join optimization
- Partition management
- Performance tuning strategies
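
Two of these techniques, broadcast joins and repartitioning, in a minimal sketch (table paths and the join key are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("tuning").getOrCreate()

events = spark.read.parquet("data/events")  # large fact table (placeholder)
lookup = spark.read.parquet("data/lookup")  # small dimension table (placeholder)

# Broadcast join: ship the small table to every executor so the large table
# is never shuffled across the network.
joined = events.join(broadcast(lookup), on="key", how="left")

# Repartitioning: spread skewed keys across more partitions before a heavy stage.
balanced = joined.repartition(200, "key")
print(balanced.rdd.getNumPartitions())
```
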
### 5. Machine Learning with Spark
- Feature engineering (numeric and text)
- Linear regression implementation
- K-means clustering
- Model tuning and optimization
- ML pipeline construction
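
A minimal sketch of such a pipeline, chaining feature engineering into a linear regression estimator (the data path and column names are placeholders):

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StandardScaler, VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ml-pipeline").getOrCreate()

train = spark.read.parquet("data/train")  # placeholder training set

# Assemble numeric columns into a vector, scale them, then fit a regressor.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="raw")
scaler = StandardScaler(inputCol="raw", outputCol="features")
lr = LinearRegression(featuresCol="features", labelCol="label")

model = Pipeline(stages=[assembler, scaler, lr]).fit(train)
model.transform(train).select("label", "prediction").show(5)
```
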
## 🚀 Getting Started
1. **Environment Setup**
- Follow PySpark's official [installation guide](https://spark.apache.org/docs/latest/api/python/getting_started/install.html)
- Set up Python environment with required dependencies
- Configure AWS credentials (for cluster-related exercises); see the smoke-test sketch after this list
2. **Running the Exercises**
- Each directory contains Jupyter notebooks and Python scripts
- Start with the numbered notebooks in each section
- Solutions are provided for self-assessment
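
As mentioned under Environment Setup, a quick smoke test confirms a local installation works before diving into the notebooks (nothing here is repo-specific):

```python
from pyspark.sql import SparkSession

# Start a session on all local cores and run a trivial job end to end.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("smoke-test")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
assert df.count() == 2
print(spark.version)
spark.stop()
```
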
## 📝 Notes
- Exercise solutions are available in corresponding `*_solution` notebooks
- AWS-related exercises require active AWS credentials
- Sample datasets are included in respective directories

## 🤝 Contributing
Feel free to submit issues and enhancement requests!