
# PySpark Tutorial

* PySpark is the Python API for Spark.

* The purpose of this tutorial is to provide
basic distributed algorithms using PySpark.

* PySpark supports two types of data abstractions:
    * RDDs
    * DataFrames

* **PySpark Interactive Mode**: an interactive shell
(`$SPARK_HOME/bin/pyspark`) for basic testing
and debugging; it is not intended for
production use.

* **PySpark Batch Mode**: use the `$SPARK_HOME/bin/spark-submit`
command to run PySpark programs (suitable for both
testing and production environments)

------

# [Glossary: big data, MapReduce, Spark](https://github.com/mahmoudparsian/big-data-mapreduce-course/blob/master/slides/glossary/README.md)

------

# [Basics of PySpark with Examples](./howto/README.md)

------

# PySpark Examples and Tutorials

* [PySpark Examples: RDDs](./tutorial/pyspark-examples/rdds/)
* [PySpark Examples: DataFrames](./tutorial/pyspark-examples/dataframes/)
* [DNA Base Counting](./tutorial/dna-basecount/README.md)
* [Classic Word Count](./tutorial/wordcount)
* [Find Frequency of Bigrams](./tutorial/bigrams)
* [Join of Two Relations R(K, V1), S(K, V2)](./tutorial/basic-join)
* [Basic Mapping of RDD Elements](./tutorial/basic-map)
* [How to add all RDD elements together](./tutorial/basic-sum)
* [How to multiply all RDD elements together](./tutorial/basic-multiply)
* [Find Top-N and Bottom-N](./tutorial/top-N)
* [Find average by using combineByKey()](./tutorial/combine-by-key)
* [How to filter RDD elements](./tutorial/basic-filter)
* [How to find average](./tutorial/basic-average)
* [Cartesian Product: rdd1.cartesian(rdd2)](./tutorial/cartesian)
* [Sort By Key: sortByKey() ascending/descending](./tutorial/basic-sort)
* [How to Add Indices](./tutorial/add-indices)
* [Map Partitions: mapPartitions() by Examples](./tutorial/map-partitions/README.md)
* [Monoid: Design Principle](https://github.com/mahmoudparsian/data-algorithms-with-spark/blob/master/wiki-spark/docs/monoid/README.md)

------

# Books

### [Data Algorithms with Spark](https://github.com/mahmoudparsian/data-algorithms-with-spark/)

### [Data Algorithms](https://github.com/mahmoudparsian/data-algorithms-book/)

### [PySpark Algorithms](https://github.com/mahmoudparsian/pyspark-algorithms/)

-----

# Miscellaneous

### [Download, Install Spark and Run PySpark](./howto/download_install_run_spark.md)

### [How to Minimize the Verbosity of Spark](./howto/minimize_verbosity.md)

-------

# PySpark Tutorials and References
* [Getting started with PySpark - Part 1](http://www.mccarroll.net/blog/pyspark/)
* [Getting started with PySpark - Part 2](http://www.mccarroll.net/blog/pyspark2/index.html)
* [A really really fast introduction to PySpark](http://www.slideshare.net/hkarau/a-really-really-fast-introduction-to-py-spark-lightning-fast-cluster-computing-with-python-1)
* [PySpark](http://www.slideshare.net/thegiivee/pysaprk?qid=81cf1b31-8b19-4570-89a5-21d03cad6ecd&v=default&b=&from_search=9)
* [Basic Big Data Manipulation with PySpark](http://bigdatasciencebootcamp.com/posts/Part_3/basic_big_data.html)
* [Working in Pyspark: Basics of Working with Data and RDDs](http://www.learnbymarketing.com/618/pyspark-rdd-basics-examples/)

-------

# Questions/Comments
* [View Mahmoud Parsian's profile on LinkedIn](http://www.linkedin.com/in/mahmoudparsian)
* Please send me an email: [email protected]
* [Twitter: @mahmoudparsian](http://twitter.com/mahmoudparsian)

Thank you!

````
best regards,
Mahmoud Parsian
````

------

[//]: # (metadata:)
[//]: # (Spark, PySpark, Python)
[//]: # (MapReduce, Distributed Algorithms, mappers, reducers, partitioners)
[//]: # (Transformations, Actions, RDDs, DataFrames, SQL)