https://github.com/dimajix/pyspark-datascience



# PySpark ML Crashcourse

This repository contains exercises and solutions for a one-day crash course
in PySpark and Spark ML. The repository contains only Jupyter notebooks, which
assume a working PySpark kernel with Python 3.5 and Spark 2.1.

## Author

All notebooks have been created by Kaya Kupferschmidt @ dimajix. In case you
have any questions, feel free to contact me at [email protected]

## 01 - PySpark DataFrame Introduction

This notebook contains some simple snippets that give a basic understanding of
how to interact with Spark DataFrames in Python.

## 02 - From Pandas to Spark (skeleton + solution)

These notebooks provide some examples of the differences between Pandas and Spark at the API level.

## 03 - Weather Analysis Exercise (exercise + solution)

A small exercise that performs a simple weather analysis on a somewhat larger data set.

## 04 - Pandas UDF (skeleton + solution)

An introduction to the various types of vectorized Pandas UDFs.

## 05 - Grouped Regression (exercise + solution)

A non-trivial example of using Pandas UDFs.

## 06 - House Prices (skeleton + solution)

These notebooks contain a simple linear regression exercise as an introduction
to machine learning with Spark.

## 07 - House Prices (exercise + solution)

These notebooks build on the previous ones, but add more structure by using Spark ML pipelines.

## 08 - Text Classification (exercise + solution)

Following the simple linear regression, these notebooks contain an exercise in
simple statistical text classification.

## 09 - Hyper Parameter Tuning (exercise + solution)

As with many complex algorithms and ML pipelines, the text classification has
many hyper parameters. These notebooks show how to perform hyper parameter
tuning with PySpark.