Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/mahmoudparsian/pyspark-tutorial
PySpark-Tutorial provides basic algorithms using PySpark
- Host: GitHub
- URL: https://github.com/mahmoudparsian/pyspark-tutorial
- Owner: mahmoudparsian
- License: other
- Created: 2015-03-12T03:01:53.000Z (almost 10 years ago)
- Default Branch: master
- Last Pushed: 2023-01-20T22:04:40.000Z (about 2 years ago)
- Last Synced: 2025-01-12T00:05:55.822Z (10 days ago)
- Topics: big-data, big-data-analytics, data-algorithms, pyspark, spark, spark-dataframes, spark-rdd
- Language: Jupyter Notebook
- Homepage: http://mapreduce4hackers.com
- Size: 8.97 MB
- Stars: 1,189
- Watchers: 55
- Forks: 474
- Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE.md
# PySpark Tutorial
* PySpark is the Python API for Spark.
* The purpose of this PySpark tutorial is to provide basic distributed algorithms using PySpark.
* PySpark supports two types of data abstractions:
    * RDDs
    * DataFrames
* **PySpark Interactive Mode**: an interactive shell (`$SPARK_HOME/bin/pyspark`) for basic testing and debugging; it is not intended for production use.
* **PySpark Batch Mode**: use the `$SPARK_HOME/bin/spark-submit` command to run PySpark programs (suitable for both testing and production environments).

------
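The two modes can be illustrated with one small script. This is a hedged sketch (the file name `wordcount_sketch.py` and its logic are illustrative, not part of this tutorial): paste the body into the `pyspark` shell for interactive mode, or run the file in batch mode with `spark-submit`.

```python
# wordcount_sketch.py -- illustrative sketch of a batch-mode PySpark job.
# Interactive mode: paste into $SPARK_HOME/bin/pyspark
# Batch mode:       $SPARK_HOME/bin/spark-submit wordcount_sketch.py input.txt

import sys

def tokenize(line):
    # plain-Python helper: split a line into lowercase words
    return line.lower().split()

def main(path):
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
    counts = (spark.sparkContext.textFile(path)   # RDD of lines
              .flatMap(tokenize)                  # RDD of words
              .map(lambda w: (w, 1))              # RDD of (word, 1) pairs
              .reduceByKey(lambda a, b: a + b))   # RDD of (word, count)
    for word, n in counts.collect():
        print(word, n)
    spark.stop()

if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```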
# [Glossary: big data, MapReduce, Spark](https://github.com/mahmoudparsian/big-data-mapreduce-course/blob/master/slides/glossary/README.md)
------
# [Basics of PySpark with Examples](./howto/README.md)
------
# PySpark Examples and Tutorials
* [PySpark Examples: RDDs](./tutorial/pyspark-examples/rdds/)
* [PySpark Examples: DataFrames](./tutorial/pyspark-examples/dataframes/)
* [DNA Base Counting](./tutorial/dna-basecount/README.md)
* [Classic Word Count](./tutorial/wordcount)
* [Find Frequency of Bigrams](./tutorial/bigrams)
* [Join of Two Relations R(K, V1), S(K, V2)](./tutorial/basic-join)
* [Basic Mapping of RDD Elements](./tutorial/basic-map)
* [How to add all RDD elements together](./tutorial/basic-sum)
* [How to multiply all RDD elements together](./tutorial/basic-multiply)
* [Find Top-N and Bottom-N](./tutorial/top-N)
* [Find average by using combineByKey()](./tutorial/combine-by-key)
* [How to filter RDD elements](./tutorial/basic-filter)
* [How to find average](./tutorial/basic-average)
* [Cartesian Product: rdd1.cartesian(rdd2)](./tutorial/cartesian)
* [Sort By Key: sortByKey() ascending/descending](./tutorial/basic-sort)
* [How to Add Indices](./tutorial/add-indices)
* [Map Partitions: mapPartitions() by Examples](./tutorial/map-partitions/README.md)
* [Monoid: Design Principle](https://github.com/mahmoudparsian/data-algorithms-with-spark/blob/master/wiki-spark/docs/monoid/README.md)

------
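As a taste of the examples above, per-key average with `combineByKey()` can be sketched as follows. This is a hedged sketch, not the tutorial's own code, and the function names (`create_combiner`, etc.) are illustrative. The three combiner functions are plain Python, so their logic can be followed without a running Spark cluster.

```python
# Hedged sketch: per-key average using combineByKey(),
# where each key's values are folded into a (sum, count) pair.

def create_combiner(v):
    # first value seen for a key in a partition -> (sum, count)
    return (v, 1)

def merge_value(acc, v):
    # fold another value into a partition-local (sum, count)
    return (acc[0] + v, acc[1] + 1)

def merge_combiners(a, b):
    # combine (sum, count) pairs computed on different partitions
    return (a[0] + b[0], a[1] + b[1])

# With a SparkContext `sc` (e.g. inside the pyspark shell):
#   pairs = sc.parallelize([("a", 1), ("a", 3), ("b", 10)])
#   sums_counts = pairs.combineByKey(create_combiner, merge_value, merge_combiners)
#   averages = sums_counts.mapValues(lambda p: p[0] / p[1])
#   averages.collect()   # per-key averages, e.g. ("a", 2.0) and ("b", 10.0)
```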
# Books
### [Data Algorithms with Spark](https://github.com/mahmoudparsian/data-algorithms-with-spark/)
### [Data Algorithms](https://github.com/mahmoudparsian/data-algorithms-book/)
### [PySpark Algorithms](https://github.com/mahmoudparsian/pyspark-algorithms/)
-----
# Miscellaneous
### [Download, Install Spark and Run PySpark](./howto/download_install_run_spark.md)
### [How to Minimize the Verbosity of Spark](./howto/minimize_verbosity.md)
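One common way to reduce Spark's console verbosity (the linked how-to has the details) is to raise the log threshold in the driver's log4j configuration. The fragment below is a hedged sketch for Spark versions that use log4j 1.x; newer releases (Spark 3.3+) read `conf/log4j2.properties` instead, with a different syntax. From code, `spark.sparkContext.setLogLevel("WARN")` has a similar effect.

```properties
# conf/log4j.properties (sketch; assumes log4j 1.x-style Spark config)
# Show only WARN and above on the console instead of the default INFO.
log4j.rootCategory=WARN, console
```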
-------
# PySpark Tutorials and References
* [Getting started with PySpark - Part 1](http://www.mccarroll.net/blog/pyspark/)
* [Getting started with PySpark - Part 2](http://www.mccarroll.net/blog/pyspark2/index.html)
* [A really really fast introduction to PySpark](http://www.slideshare.net/hkarau/a-really-really-fast-introduction-to-py-spark-lightning-fast-cluster-computing-with-python-1)
* [PySpark](http://www.slideshare.net/thegiivee/pysaprk?qid=81cf1b31-8b19-4570-89a5-21d03cad6ecd&v=default&b=&from_search=9)
* [Basic Big Data Manipulation with PySpark](http://bigdatasciencebootcamp.com/posts/Part_3/basic_big_data.html)
* [Working in Pyspark: Basics of Working with Data and RDDs](http://www.learnbymarketing.com/618/pyspark-rdd-basics-examples/)

-------
# Questions/Comments
* [View Mahmoud Parsian's profile on LinkedIn](http://www.linkedin.com/in/mahmoudparsian)
* Please send me an email: [email protected]
* [Twitter: @mahmoudparsian](http://twitter.com/mahmoudparsian)

Thank you!
````
best regards,
Mahmoud Parsian
````

------
[//]: # (metadata:)
[//]: # (Spark, PySpark, Python)
[//]: # (MapReduce, Distributed Algorithms, mappers, reducers, partitioners)
[//]: # (Transformations, Actions, RDDs, DataFrames, SQL)