https://github.com/neha-dev-dot/pyspark-tutorial
This repository is part of my journey to learn **PySpark**, the Python API for Apache Spark. I explored the fundamentals of distributed data processing using Spark and practiced with real-world data transformation and querying use cases.
- Host: GitHub
- URL: https://github.com/neha-dev-dot/pyspark-tutorial
- Owner: neha-dev-dot
- Created: 2025-06-23T06:33:43.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-06-28T12:09:02.000Z (6 months ago)
- Last Synced: 2025-06-28T13:23:53.737Z (6 months ago)
- Topics: actions, data-partitioning, dataframes, pyspark-basics, pyspark-sql, rdds, sparkbasics, sparkcontext, sparksession, transformations, udfs, window-functions
- Language: Jupyter Notebook
- Homepage: https://github.com/neha-dev-dot/Pyspark-Tutorial
- Size: 230 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# PySpark Essentials
This project is a hands-on collection of notebooks, code snippets, and exercises focused on learning **Apache Spark with Python (PySpark)**. It includes my notes and experiments from exploring **core Spark concepts, transformations, actions, the DataFrame API, and more**.
---
## What is PySpark?
**PySpark** is the Python API for **Apache Spark**, a powerful open-source distributed computing engine for large-scale data processing and analytics. It lets you drive Spark's distributed execution engine entirely from Python.
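For a first taste, here is a minimal sketch (assuming a local installation via `pip install pyspark`; the app name is arbitrary): create a `SparkSession`, the entry point for the DataFrame and SQL APIs, and build a small DataFrame from in-memory data.

```python
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession -- the entry point for DataFrame and SQL APIs.
# local[*] runs Spark locally using all available cores.
spark = SparkSession.builder \
    .appName("pyspark-essentials") \
    .master("local[*]") \
    .getOrCreate()

# Build a small DataFrame from in-memory data
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45)],
    schema=["name", "age"],
)

df.show()

spark.stop()
```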
---
## Topics Covered
- Introduction to Spark & PySpark
- SparkContext & SparkSession
- RDDs (Resilient Distributed Datasets)
- DataFrames & Datasets
- Transformations vs Actions (see the sketch after this list)
- Reading/Writing: JSON, CSV, Parquet
- PySpark SQL & Queries
- GroupBy, Aggregations, Joins
- Handling Nulls & Missing Data
- User-Defined Functions (UDFs)
- Window Functions
- Data Partitioning & Performance Optimization
- Intro to MLlib (Optional)
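To illustrate a few of these topics together, the sketch below (with made-up in-memory data, not taken from the notebooks) contrasts lazy transformations with eager actions, and shows a groupBy aggregation, a window function, and a UDF:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("topics-demo").master("local[*]").getOrCreate()

# Tiny in-memory dataset; the None models a missing value
df = spark.createDataFrame(
    [("north", "2024-01", 100), ("north", "2024-02", 150),
     ("south", "2024-01", 80), ("south", "2024-02", None)],
    schema=["region", "month", "sales"],
)

# Transformations are lazy: filter/groupBy only build an execution plan
filtered = df.filter(F.col("sales").isNotNull())
totals = filtered.groupBy("region").agg(F.sum("sales").alias("total_sales"))

# Actions trigger execution: show() materializes the result
totals.show()

# Window function: running total of sales per region, ordered by month
w = Window.partitionBy("region").orderBy("month")
df.withColumn("running_total", F.sum("sales").over(w)).show()

# UDF for custom logic (prefer built-in functions where possible -- they are faster)
upper_udf = F.udf(lambda s: s.upper(), StringType())
df.withColumn("region_upper", upper_udf(F.col("region"))).show()

spark.stop()
```

Note that nothing is computed until an action like `show()`, `count()`, or `collect()` runs; Spark then optimizes and executes the accumulated plan in one pass.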
---
## How I Learn
I follow a "Learn by Doing" approach.
Each notebook contains:
- Detailed explanations
- Hands-on code examples
- Real-world case studies