https://github.com/shantanujpk/bigdatacloud
Exploration of PySpark for data processing and interview prep — demonstrates handling corrupted records, applying transformations/actions, and building efficient data pipelines with practical examples.
- Host: GitHub
- URL: https://github.com/shantanujpk/bigdatacloud
- Owner: Shantanujpk
- Created: 2025-09-09T02:38:17.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2025-09-24T21:41:34.000Z (9 months ago)
- Last Synced: 2025-09-24T23:32:55.228Z (9 months ago)
- Topics: big-data, data, jupyter-notebook, pipeline, pyspark, python, spark, sparksql
- Language: Jupyter Notebook
- Homepage:
- Size: 38.1 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# PySpark Practice & Interview Prep 🚀
This repository contains my **PySpark practice notebook** where I explored fundamental Spark concepts, practical coding, and interview-style Q&A. It demonstrates how to use PySpark for real-world data handling and analysis while also preparing for data engineering and analyst interviews.
---
## 📂 Notebook Overview
- **Dealing with Corrupted Records**
  - Detecting and handling malformed data in CSV/JSON
  - Read modes: `PERMISSIVE`, `FAILFAST`, `DROPMALFORMED`
  - Printing and storing bad records using `.option("badRecordsPath", ...)` (an option documented for Databricks; with open-source Spark, malformed rows can be inspected through the `_corrupt_record` column instead)
- **Transformations & Actions**
  - Lazy evaluation and DAG execution
  - Narrow vs. wide transformations
  - Common actions: `show`, `collect`, `count`
- **Hands-On PySpark**
  - Building a local Spark session
  - Reading CSV with schema inference
  - Performing `groupBy`, `join`, and aggregations
---
## ⚙️ Tech Stack
- Python 3.x
- Apache Spark (PySpark)
- Jupyter Notebook
---
## 🚀 How to Run
1. Clone the repository
   ```bash
   git clone https://github.com//PySpark-Practice.git
   cd PySpark-Practice
   ```
2. Install dependencies (if not already installed):
   ```bash
   pip install pyspark jupyter
   ```
3. Launch Jupyter Notebook:
   ```bash
   jupyter notebook
   ```
4. Open `PySpark.ipynb` and run cells.

---
## 📌 Use Cases
This project demonstrates how to:
- Clean and process large datasets in PySpark
- Handle corrupted records gracefully without breaking jobs
- Apply transformations and actions effectively to optimize performance
- Explain Spark internals (lazy evaluation, DAG execution) during interviews
- Strengthen hands-on familiarity with PySpark for real-world data pipelines

---
## ❓ Sample Interview Q&A from Notebook
Q: What happens in PERMISSIVE mode when reading corrupted records?
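A: Spark keeps the row, sets the fields it cannot parse to null, and stores the raw malformed record in the corrupt-record column (`_corrupt_record` by default) when the schema includes that column.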
Q: What is the difference between a transformation and an action in Spark?
A: Transformations are lazy and build the DAG, while actions trigger execution and return results.
Q: How do you store corrupted records for debugging?
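A: Use `.option("badRecordsPath", "path/to/store")` when reading the data. This option is documented for Databricks; with open-source Spark, read in `PERMISSIVE` mode and write out the rows whose `_corrupt_record` column is not null.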
Q: What is the role of groupBy and join in transformations?
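A: Both are typically wide transformations: rows that share a key must be brought together, so Spark shuffles data across partitions, which is often the most expensive part of a job. A join against a small table can avoid the shuffle when Spark broadcasts that table.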