Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/dimajix/pyspark-datascience
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/dimajix/pyspark-datascience
- Owner: dimajix
- Created: 2022-03-28T06:30:05.000Z (over 2 years ago)
- Default Branch: master
- Last Pushed: 2023-11-16T16:38:35.000Z (about 1 year ago)
- Last Synced: 2024-03-26T20:25:00.936Z (9 months ago)
- Language: Jupyter Notebook
- Size: 3.21 MB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# PySpark ML Crashcourse
This repository contains exercises and solutions for a one-day crash course
for PySpark and Spark ML. The repository only contains Jupyter Notebooks which
assume a working PySpark kernel with Python 3.5 and Spark 2.1.
## Author
All notebooks have been created by Kaya Kupferschmidt @ dimajix. In case you
have any questions, feel free to contact me at [email protected]
## 01 - PySpark DataFrame Introduction
This notebook contains some simple snippets to get a basic understanding of how
to interact with Spark DataFrames in Python.
## 02 - From Pandas to Spark (skeleton + solution)
These notebooks provide some examples of the differences between Pandas and Spark at the API level.
## 03 - Weather Analysis Exercise (exercise + solution)
A small exercise using some more data for a simple weather analysis.
## 04 - Pandas UDF (skeleton + solution)
An introduction to the various types of Pandas Vectorized UDFs.
## 05 - Grouped Regression (exercise + solution)
A non-trivial example of using Pandas UDFs.
## 06 - House Prices (skeleton + solution)
These notebooks contain a simple linear regression exercise as an introduction
to machine learning with Spark.
## 07 - House Prices (exercise + solution)
These notebooks build on the previous one, but add more structure by using Spark ML pipelines.
## 08 - Text Classification (exercise + solution)
After the introduction to simple linear regression, these notebooks contain an
exercise to perform a simple statistical text classification.
## 09 - Hyper Parameter Tuning (exercise + solution)
As with many complex algorithms and ML pipelines, the text classification has
many hyper parameters. These notebooks show how to perform hyper parameter
tuning with PySpark.