https://github.com/dimajix/pyspark-datascience


Host: GitHub
URL: https://github.com/dimajix/pyspark-datascience
Owner: dimajix
Created: 2022-03-28T06:30:05.000Z (over 3 years ago)
Default Branch: master
Last Pushed: 2024-12-27T06:12:28.000Z (6 months ago)
Last Synced: 2025-01-05T04:26:13.140Z (6 months ago)
Language: Jupyter Notebook
Size: 3.77 MB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# PySpark ML Crashcourse

This repository contains exercises and solutions for a one-day crash course
on PySpark and Spark ML. The repository contains only Jupyter notebooks, which
assume a working PySpark kernel with Python 3.5 and Spark 2.1.

## Author

All notebooks have been created by Kaya Kupferschmidt @ dimajix. In case you
have any questions, feel free to contact me at [email protected]

## 01 - PySpark DataFrame Introduction

This notebook contains some simple snippets that give a basic understanding of
how to interact with Spark DataFrames in Python.

## 02 - From Pandas to Spark (skeleton + solution)

These notebooks provide some examples of the differences between Pandas and Spark at the API level.

## 03 - Weather Analysis Exercise (exercise + solution)

A small exercise that uses more data to perform a simple weather analysis.

## 04 - Pandas UDF (skeleton + solution)

An introduction to the various types of vectorized Pandas UDFs.

## 05 - Grouped Regression (exercise + solution)

A non-trivial example of using Pandas UDFs.
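
As a rough illustration of what a grouped regression with Pandas can look like, the sketch below fits one straight line per group. It is not taken from the notebooks: the column names `station`, `x` and `y`, the helper names `fit_line` and `grouped_regression`, and the sample data are all hypothetical. It also assumes Spark 3.0 or later, where `applyInPandas` is available; older versions (Spark 2.3+) would use a grouped map `pandas_udf` instead.

```python
import numpy as np
import pandas as pd

# Hypothetical column names: "station" is the grouping key, "x" and "y" are numeric.
RESULT_SCHEMA = "station string, slope double, intercept double"


def fit_line(pdf: pd.DataFrame) -> pd.DataFrame:
    """Fit y = slope * x + intercept to the rows of a single group."""
    slope, intercept = np.polyfit(pdf["x"].to_numpy(), pdf["y"].to_numpy(), deg=1)
    return pd.DataFrame({
        "station": [pdf["station"].iloc[0]],
        "slope": [float(slope)],
        "intercept": [float(intercept)],
    })


def grouped_regression(df):
    """Apply fit_line once per group of a Spark DataFrame (assumes Spark 3.0+)."""
    return df.groupBy("station").applyInPandas(fit_line, schema=RESULT_SCHEMA)


# Local check with plain Pandas; no Spark session is needed for this part.
sample = pd.DataFrame({
    "station": ["a", "a", "a"],
    "x": [0.0, 1.0, 2.0],
    "y": [1.0, 3.0, 5.0],
})
result = fit_line(sample)
```

Keeping the per-group function free of Spark dependencies means it can be tried on a small Pandas DataFrame first and handed to Spark afterwards.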

## 06 - House Prices (skeleton + solution)

These notebooks contain a simple linear regression exercise as an introduction
to machine learning with Spark.

## 07 - House Prices (exercise + solution)

These notebooks build on the previous ones, but add more structure by using Spark ML pipelines.

## 08 - Text Classification (exercise + solution)

Following the simple linear regression, these notebooks contain an exercise
on simple statistical text classification.

## 09 - Hyper Parameter Tuning (exercise + solution)

As with many complex algorithms and ML pipelines, the text classification has
many hyperparameters. These notebooks show how to perform hyperparameter
tuning with PySpark.
