https://github.com/mahmoudparsian/data-algorithms-with-spark
O'Reilly Book: [Data Algorithms with Spark] by Mahmoud Parsian
algorithms bigdata data data-abstractions data-algorithms data-transformation dataframes design design-patterns machine-learning mappers mapreduce monoid partitioning-algorithms pyspark python rdd reducers spark transformations
- Host: GitHub
- URL: https://github.com/mahmoudparsian/data-algorithms-with-spark
- Owner: mahmoudparsian
- Created: 2019-12-10T03:40:09.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2023-06-26T01:01:18.000Z (over 1 year ago)
- Last Synced: 2025-01-08T11:09:28.008Z (14 days ago)
- Topics: algorithms, bigdata, data, data-abstractions, data-algorithms, data-transformation, dataframes, design, design-patterns, machine-learning, mappers, mapreduce, monoid, partitioning-algorithms, pyspark, python, rdd, reducers, spark, transformations
- Language: Python
- Homepage:
- Size: 44.9 MB
- Stars: 210
- Watchers: 15
- Forks: 90
- Open Issues: 0
Metadata Files:
- Readme: README.md
### [Data Algorithms with Spark](https://www.oreilly.com/library/view/data-algorithms-with/9781492082378/) by Mahmoud Parsian
"... This book will be a great resource for both readers looking to implement existing algorithms in a scalable fashion and readers who are developing new, custom algorithms using Spark. ..."

Dr. Matei Zaharia, Original Creator of Apache Spark
-------
### [Data Algorithms with Spark](https://www.oreilly.com/library/view/data-algorithms-with/9781492082378/) by [Mahmoud Parsian](https://www.linkedin.com/in/mahmoudparsian/)
### [Foreword by Dr. Matei Zaharia](./docs/FOREWORD_by_Dr_Matei_Zaharia.md) (Original Creator of Apache Spark)
### Author: [Mahmoud Parsian](https://www.linkedin.com/in/mahmoudparsian/)
### [Goal of this book: Data Algorithms with Spark](./docs/goal_of_book.md)
### [Story of this book: Data Algorithms with Spark](./docs/story_of_book.md)
--------
* [Mahmoud Parsian's Author Page @Amazon](https://www.amazon.com/author/mahmoudparsian/)
* [Mahmoud Parsian's Author Page @LinkedIn](https://www.linkedin.com/in/mahmoudparsian/)
* This [new O'Reilly book](https://www.oreilly.com/library/view/data-algorithms-with/9781492082378/) is the successor edition of [Data Algorithms](https://www.oreilly.com/library/view/data-algorithms/9781491906170/) (published by [O'Reilly](https://www.oreilly.com/library/view/data-algorithms-with/9781492082378/))
* This book uses PySpark (much simpler and more readable)
* [Published date: April 8, 2022](https://www.oreilly.com/library/view/data-algorithms-with/9781492082378/)
* [@OReillyMedia: Data Algorithms with Spark, By @mahmoudparsian](https://twitter.com/OReillyMedia/status/1511796122548903938/)
* Author Contact: [ [![Email](https://support.microsoft.com/images/Mail-GrayScale.png) Email](mailto:[email protected]) ] [ [![Linkedin](https://i.stack.imgur.com/gVE0j.png) Mahmoud Parsian @LinkedIn](https://www.linkedin.com/in/mahmoudparsian/) ][ [![GitHub](https://i.stack.imgur.com/tskMh.png) Mahmoud Parsian @GitHub](https://github.com/mahmoudparsian/) ]
-------
## [GitHub Chapter Solutions](./code/)
* This GitHub repository hosts all source code and scripts for [Data Algorithms with Spark](https://www.oreilly.com/library/view/data-algorithms-with/9781492082378/)
* Chapter solutions are provided in [PySpark and Scala](./code/)
* [PySpark solutions](./code/) are provided by [Mahmoud Parsian](https://github.com/mahmoudparsian/)
* [Scala solutions](./code/) are provided by [Deepak Kumar](https://github.com/deepakmca05/) and [Biman Mandal](https://github.com/bimanmandal/)
-----
## Software:
All programs are tested with the following software:
| Spark | Python | Scala | Java |
|----------|:----------------:|-------:|-----------:|
| [Apache Spark 3.4.0](http://spark.apache.org/downloads.html) | [Python 3.10.5](https://www.python.org/downloads/) | [Scala 2.13](https://www.scala-lang.org/download/scala2.html) | [Java 11](https://www.oracle.com/java/technologies/javase/jdk11-archive-downloads.html) |

-----
## Table of Contents
| Chapter | Title |
|--------------|------------------|
| Glossary | [Glossary of Big Data, MapReduce, Spark](https://github.com/mahmoudparsian/big-data-mapreduce-course/blob/master/slides/glossary/glossary_of_big_data_and_mapreduce.md) |
| Chapter 1 | [Introduction to Data Algorithms](./code/chap01/) |
| Chapter 2 | [Transformations in Action](./code/chap02/) |
| Chapter 3 | [Mapper Transformations](./code/chap03/) |
| Chapter 4 | [Reductions in Spark](./code/chap04/) |
| Chapter 5 | [Partitioning Data](./code/chap05/) |
| Chapter 6 | [Graph Algorithms](./code/chap06/) |
| Chapter 7 | [Interacting with External Data Sources](./code/chap07/) |
| Chapter 8 | [Ranking Algorithms](./code/chap08/) |
| Chapter 9 | [Fundamental Data Design Patterns](./code/chap09/) |
| Chapter 10 | [Common Data Design Patterns](./code/chap10/) |
| Chapter 11 | [Join Design Patterns](./code/chap11/) |
| Chapter 12 | [Feature Engineering in PySpark](./code/chap12/) |

--------
## Bonus Chapters
| Bonus Chapter | Title / Description |
|-----------------------------------|----------------------|
| Glossary | [Glossary of Big Data, MapReduce, Spark](https://github.com/mahmoudparsian/big-data-mapreduce-course/blob/master/slides/glossary/glossary_of_big_data_and_mapreduce.md) |
| Word Count | [Solutions for Word Count using RDDs and DataFrames](./code/bonus_chapters/wordcount/) |
| Anagrams | [Find words, which are anagrams](./code/bonus_chapters/anagrams/) |
| Lambda Expressions | [Using Lambda Expressions in PySpark programs](./code/bonus_chapters/lambda_expressions/) |
| TF-IDF | [Term Frequency - Inverse Document Frequency](./code/bonus_chapters/TF-IDF/) |
| K-mers | [K-mers for DNA Sequences](./code/bonus_chapters/k-mers/) |
| Correlation | [All vs. All Correlation](./code/bonus_chapters/correlation/) |
| Mapping Partitions | [`mapPartitions()` Complete Example](./code/bonus_chapters/mappartitions/) |
| UDF | [User-Defined Function Examples](./code/bonus_chapters/UDF/) |
| DataFrames Transformations | [Examples on Creation and Transformation of DataFrames](./code/bonus_chapters/dataframes/) |
| DataFrames Tutorials | [DataFrames Tutorials: from collections and CSV text files](./code/bonus_chapters/dataframes/) |
| Join Operations | [Examples on join of RDDs and DataFrames](./code/bonus_chapters/join/)|
| PySpark Tutorial 101 | [Examples on using PySpark RDDs and DataFrames](./code/bonus_chapters/pyspark_tutorial/) |
| Physical Data Partitioning | [Tutorial of Physical Data Partitioning](./code/bonus_chapters/physical_partitioning/README.md) |
| Monoids and Combiners | [Monoid as a Design Principle](https://github.com/mahmoudparsian/data-algorithms-with-spark/blob/master/wiki-spark/docs/monoid/README.md) |

-------
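As a rough illustration of the idea behind the "Monoids and Combiners" bonus chapter listed above, the sketch below uses plain Python (no Spark) to show why a mean is usually computed through `(sum, count)` pairs. It is not taken from the book or the repository; the sample data and the names `to_pair`, `combine`, `IDENTITY`, and `partial_result` are hypothetical. The assumption is the usual one for combiners: partial results are merged with an associative operation that has an identity element, which is the kind of function Spark's `reduceByKey()` expects (it also requires commutativity).

```python
from functools import reduce

# Hypothetical sample data: the values for one key, split across three partitions.
partitions = [[1, 2, 3], [4, 5], [6]]

# Identity element of the (sum, count) monoid.
IDENTITY = (0, 0)


def to_pair(x):
    """Lift a single number into a (sum, count) pair."""
    return (x, 1)


def combine(a, b):
    """Associative (and commutative) merge of two (sum, count) pairs."""
    return (a[0] + b[0], a[1] + b[1])


def partial_result(partition):
    """Combine one partition locally, as a combiner would."""
    return reduce(combine, map(to_pair, partition), IDENTITY)


# Step 1: each partition is reduced independently.
partials = [partial_result(p) for p in partitions]

# Step 2: the partial results are merged; grouping and order do not matter.
total_sum, total_count = reduce(combine, partials, IDENTITY)
mean = total_sum / total_count

# For contrast: averaging the per-partition means is not a monoid operation
# and gives a different (wrong) answer when partitions differ in size.
naive_mean = sum(sum(p) / len(p) for p in partitions) / len(partitions)

print(partials)
print(mean)
print(naive_mean)
```

The two results differ because the per-partition means discard the partition sizes, whereas the `(sum, count)` pairs keep enough information to be merged in any grouping.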