O'Reilly Book: [Data Algorithms with Spark] by Mahmoud Parsian
https://github.com/mahmoudparsian/data-algorithms-with-spark

algorithms bigdata data data-abstractions data-algorithms data-transformation dataframes design design-patterns machine-learning mappers mapreduce monoid partitioning-algorithms pyspark python rdd reducers spark transformations


### [Data Algorithms with Spark](https://www.oreilly.com/library/view/data-algorithms-with/9781492082378/) by Mahmoud Parsian



"... This book will be a great resource for both readers looking to implement existing algorithms in a scalable fashion and readers who are developing new, custom algorithms using Spark. ..."

Dr. Matei Zaharia, Original Creator of Apache Spark




-------

### [Data Algorithms with Spark](https://www.oreilly.com/library/view/data-algorithms-with/9781492082378/) by [Mahmoud Parsian](https://www.linkedin.com/in/mahmoudparsian/)

### [Foreword by Dr. Matei Zaharia](./docs/FOREWORD_by_Dr_Matei_Zaharia.md) (Original Creator of Apache Spark)

### Author: [Mahmoud Parsian](https://www.linkedin.com/in/mahmoudparsian/)

### [Goal of this book: Data Algorithms with Spark](./docs/goal_of_book.md)

### [Story of this book: Data Algorithms with Spark](./docs/story_of_book.md)

--------

* [Mahmoud Parsian's Author Page @Amazon](https://www.amazon.com/author/mahmoudparsian/)

* [Mahmoud Parsian's Author Page @LinkedIn](https://www.linkedin.com/in/mahmoudparsian/)

* This [new O'Reilly book](https://www.oreilly.com/library/view/data-algorithms-with/9781492082378/)
is the successor edition of [Data Algorithms](https://www.oreilly.com/library/view/data-algorithms/9781491906170/)
(published by [O'Reilly](https://www.oreilly.com/library/view/data-algorithms-with/9781492082378/))

* This book uses PySpark (much simpler and more readable)

* [Publication date: April 8, 2022](https://www.oreilly.com/library/view/data-algorithms-with/9781492082378/)

* [@OReillyMedia: Data Algorithms with Spark, By @mahmoudparsian](https://twitter.com/OReillyMedia/status/1511796122548903938/)

* Author Contact: [ [![Email](https://support.microsoft.com/images/Mail-GrayScale.png) Email](mailto:[email protected]) ] [ [![Linkedin](https://i.stack.imgur.com/gVE0j.png) Mahmoud Parsian @LinkedIn](https://www.linkedin.com/in/mahmoudparsian/) ][ [![GitHub](https://i.stack.imgur.com/tskMh.png) Mahmoud Parsian @GitHub](https://github.com/mahmoudparsian/) ]

-------

## [GitHub Chapter Solutions](./code/)

* This GitHub repository hosts all source code and scripts for
[Data Algorithms with Spark](https://www.oreilly.com/library/view/data-algorithms-with/9781492082378/)

* Chapter solutions are provided in [PySpark and Scala](./code/)
* [PySpark solutions](./code/) are provided by [Mahmoud Parsian](https://github.com/mahmoudparsian/)
* [Scala solutions](./code/) are provided by [Deepak Kumar](https://github.com/deepakmca05/) and [Biman Mandal](https://github.com/bimanmandal/)

-----

## Software:

All programs are tested with the following software:

| Spark | Python | Scala | Java |
|----------|:----------------:|-------:|-----------:|
| [Apache Spark 3.4.0](http://spark.apache.org/downloads.html) | [Python 3.10.5](https://www.python.org/downloads/) | [Scala 2.13](https://www.scala-lang.org/download/scala2.html) | [Java 11](https://www.oracle.com/java/technologies/javase/jdk11-archive-downloads.html) |

-----

## Table of Contents

| Chapter | Title |
|--------------|------------------|
| Glossary | [Glossary of Big Data, MapReduce, Spark](https://github.com/mahmoudparsian/big-data-mapreduce-course/blob/master/slides/glossary/glossary_of_big_data_and_mapreduce.md) |
| Chapter 1 | [Introduction to Data Algorithms](./code/chap01/) |
| Chapter 2 | [Transformations in Action](./code/chap02/) |
| Chapter 3 | [Mapper Transformations](./code/chap03/) |
| Chapter 4 | [Reductions in Spark](./code/chap04/) |
| Chapter 5 | [Partitioning Data](./code/chap05/) |
| Chapter 6 | [Graph Algorithms](./code/chap06/) |
| Chapter 7 | [Interacting with External Data Sources](./code/chap07/) |
| Chapter 8 | [Ranking Algorithms](./code/chap08/) |
| Chapter 9 | [Fundamental Data Design Patterns](./code/chap09/) |
| Chapter 10 | [Common Data Design Patterns](./code/chap10/) |
| Chapter 11 | [Join Design Patterns](./code/chap11/) |
| Chapter 12 | [Feature Engineering in PySpark](./code/chap12/) |

--------

## Bonus Chapters

| Bonus Chapter | Title / Description |
|-----------------------------------|----------------------|
| Glossary | [Glossary of Big Data, MapReduce, Spark](https://github.com/mahmoudparsian/big-data-mapreduce-course/blob/master/slides/glossary/glossary_of_big_data_and_mapreduce.md) |
| Word Count | [Solutions for Word Count using RDDs and DataFrames](./code/bonus_chapters/wordcount/) |
| Anagrams | [Find words that are anagrams](./code/bonus_chapters/anagrams/) |
| Lambda Expressions | [Using Lambda Expressions in PySpark programs](./code/bonus_chapters/lambda_expressions/) |
| TF-IDF | [Term Frequency - Inverse Document Frequency](./code/bonus_chapters/TF-IDF/) |
| K-mers | [K-mers for DNA Sequences](./code/bonus_chapters/k-mers/) |
| Correlation | [All vs. All Correlation](./code/bonus_chapters/correlation/) |
| Mapping Partitions | [`mapPartitions()` Complete Example](./code/bonus_chapters/mappartitions/) |
| UDF | [User-Defined Function Examples](./code/bonus_chapters/UDF/) |
| DataFrames Transformations | [Examples of Creating and Transforming DataFrames](./code/bonus_chapters/dataframes/) |
| DataFrames Tutorials | [DataFrames Tutorials: from collections and CSV text files](./code/bonus_chapters/dataframes/) |
| Join Operations | [Examples of joining RDDs and DataFrames](./code/bonus_chapters/join/) |
| PySpark Tutorial 101 | [Examples of using PySpark RDDs and DataFrames](./code/bonus_chapters/pyspark_tutorial/) |
| Physical Data Partitioning | [Tutorial on Physical Data Partitioning](./code/bonus_chapters/physical_partitioning/README.md) |
| Monoids and Combiners | [Monoid as a Design Principle](https://github.com/mahmoudparsian/data-algorithms-with-spark/blob/master/wiki-spark/docs/monoid/README.md) |

-------

