https://github.com/neha-dev-dot/pyspark-tutorial
This repository is part of my journey to learn **PySpark**, the Python API for Apache Spark. I explored the fundamentals of distributed data processing using Spark and practiced with real-world data transformation and querying use cases.
- Host: GitHub
- URL: https://github.com/neha-dev-dot/pyspark-tutorial
- Owner: neha-dev-dot
- Created: 2025-06-23T06:33:43.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-06-28T12:09:02.000Z (6 months ago)
- Last Synced: 2025-06-28T13:23:53.737Z (6 months ago)
- Topics: actions, data-partitioning, dataframes, pyspark-basics, pyspark-sql, rdds, sparkbasics, sparkcontext, sparksession, transformations, udfs, window-functions
- Language: Jupyter Notebook
- Homepage: https://github.com/neha-dev-dot/Pyspark-Tutorial
- Size: 230 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# PySpark Essentials
This project is a hands-on collection of notebooks, code snippets, and exercises focused on learning **Apache Spark with Python (PySpark)**. It includes my notes and experiments from exploring **core Spark concepts, transformations, actions, the DataFrame API, and more**.
---
## What is PySpark?
**PySpark** is the Python API for **Apache Spark**, a powerful open-source distributed computing engine for large-scale data processing and analytics. It lets you drive Spark's distributed execution engine entirely from Python.
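For a first taste, here is a minimal sketch (assuming a local installation via `pip install pyspark`; the app name is arbitrary): create a `SparkSession`, the entry point for the DataFrame and SQL APIs, and build a small DataFrame from in-memory data.

```python
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession -- the entry point for DataFrame and SQL APIs.
# local[*] runs Spark locally using all available cores.
spark = SparkSession.builder \
    .appName("pyspark-essentials") \
    .master("local[*]") \
    .getOrCreate()

# Build a small DataFrame from in-memory data
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45)],
    schema=["name", "age"],
)

df.show()

spark.stop()
```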
---
## Topics Covered
- Introduction to Spark & PySpark
- SparkContext & SparkSession
- RDDs (Resilient Distributed Datasets)
- DataFrames & Datasets
- Transformations vs Actions (see the sketch after this list)
- Reading/Writing: JSON, CSV, Parquet
- PySpark SQL & Queries
- GroupBy, Aggregations, Joins
- Handling Nulls & Missing Data
- User-Defined Functions (UDFs)
- Window Functions
- Data Partitioning & Performance Optimization
- Intro to MLlib (Optional)
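To illustrate a few of these topics together, the sketch below (with made-up in-memory data, not taken from the notebooks) contrasts lazy transformations with eager actions, and shows a groupBy aggregation, a window function, and a UDF:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("topics-demo").master("local[*]").getOrCreate()

# Tiny in-memory dataset; the None models a missing value
df = spark.createDataFrame(
    [("north", "2024-01", 100), ("north", "2024-02", 150),
     ("south", "2024-01", 80), ("south", "2024-02", None)],
    schema=["region", "month", "sales"],
)

# Transformations are lazy: filter/groupBy only build an execution plan
filtered = df.filter(F.col("sales").isNotNull())
totals = filtered.groupBy("region").agg(F.sum("sales").alias("total_sales"))

# Actions trigger execution: show() materializes the result
totals.show()

# Window function: running total of sales per region, ordered by month
w = Window.partitionBy("region").orderBy("month")
df.withColumn("running_total", F.sum("sales").over(w)).show()

# UDF for custom logic (prefer built-in functions where possible -- they are faster)
upper_udf = F.udf(lambda s: s.upper(), StringType())
df.withColumn("region_upper", upper_udf(F.col("region"))).show()

spark.stop()
```

Note that nothing is computed until an action like `show()`, `count()`, or `collect()` runs; Spark then optimizes and executes the accumulated plan in one pass.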
---
## How I Learn
I follow a "Learn by Doing" approach.
Each notebook contains:
- Detailed explanations
- Hands-on code examples
- Real-world case studies