{"id":18421280,"url":"https://github.com/mahmoudparsian/data-algorithms-with-spark","last_synced_at":"2025-04-07T13:09:58.243Z","repository":{"id":39589741,"uuid":"227022803","full_name":"mahmoudparsian/data-algorithms-with-spark","owner":"mahmoudparsian","description":"O'Reilly Book: [Data Algorithms with Spark] by Mahmoud Parsian","archived":false,"fork":false,"pushed_at":"2023-06-26T01:01:18.000Z","size":47051,"stargazers_count":213,"open_issues_count":0,"forks_count":93,"subscribers_count":13,"default_branch":"master","last_synced_at":"2025-03-31T12:05:35.515Z","etag":null,"topics":["algorithms","bigdata","data","data-abstractions","data-algorithms","data-transformation","dataframes","design","design-patterns","machine-learning","mappers","mapreduce","monoid","partitioning-algorithms","pyspark","python","rdd","reducers","spark","transformations"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mahmoudparsian.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-12-10T03:40:09.000Z","updated_at":"2025-03-08T15:50:24.000Z","dependencies_parsed_at":"2024-11-06T04:37:19.214Z","dependency_job_id":null,"html_url":"https://github.com/mahmoudparsian/data-algorithms-with-spark","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mahmoudparsian%2Fdata-algorithms-with-spark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mahmoudparsian%2Fdata-
algorithms-with-spark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mahmoudparsian%2Fdata-algorithms-with-spark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mahmoudparsian%2Fdata-algorithms-with-spark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mahmoudparsian","download_url":"https://codeload.github.com/mahmoudparsian/data-algorithms-with-spark/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247657281,"owners_count":20974345,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["algorithms","bigdata","data","data-abstractions","data-algorithms","data-transformation","dataframes","design","design-patterns","machine-learning","mappers","mapreduce","monoid","partitioning-algorithms","pyspark","python","rdd","reducers","spark","transformations"],"created_at":"2024-11-06T04:25:01.439Z","updated_at":"2025-04-07T13:09:58.212Z","avatar_url":"https://github.com/mahmoudparsian.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"### [Data Algorithms with Spark](https://www.oreilly.com/library/view/data-algorithms-with/9781492082378/) by Mahmoud Parsian\n\n\u003ctable\u003e\n\u003ctr\u003e\n\u003ctd\u003e\n\u003ca href=\"https://www.oreilly.com/library/view/data-algorithms-with/9781492082378/\"\u003e\n\u003cimg src=\"https://learning.oreilly.com/library/cover/9781492082378/250w/\"\u003e\n\u003c/a\u003e\n\u003c/td\u003e\n\u003ctd\u003e\n\"... 
This  book  will be a  great resource for \u003cbr\u003e\nboth readers looking  to  implement  existing \u003cbr\u003e\nalgorithms in a scalable fashion and readers \u003cbr\u003e\nwho are developing new, custom algorithms  \u003cbr\u003e\nusing Spark. ...\" \u003cbr\u003e\n\u003cbr\u003e\n\u003ca href=\"https://cs.stanford.edu/people/matei/\"\u003eDr. Matei Zaharia\u003c/a\u003e\u003cbr\u003e\nOriginal Creator of Apache Spark \u003cbr\u003e\n\u003cbr\u003e\n\u003ca href=\"https://github.com/mahmoudparsian/data-algorithms-with-spark/blob/master/docs/FOREWORD_by_Dr_Matei_Zaharia.md\"\u003eFOREWORD by Dr. Matei Zaharia\u003c/a\u003e\u003cbr\u003e\n\u003c/td\u003e\n\u003c/tr\u003e   \n\u003c/table\u003e\n\n\n-------\n\n### [Data Algorithms with Spark](https://www.oreilly.com/library/view/data-algorithms-with/9781492082378/) by [Mahmoud Parsian](https://www.linkedin.com/mahmoudparsian/)\n\n### [Foreword by Dr. Matei Zaharia](./docs/FOREWORD_by_Dr_Matei_Zaharia.md) (Original Creator of Apache Spark)\n\n### Author: [Mahmoud Parsian](https://www.linkedin.com/in/mahmoudparsian/) \n\n### [Goal of this book: Data Algorithms with Spark](./docs/goal_of_book.md)\n\n### [Story of this book: Data Algorithms with Spark](./docs/story_of_book.md)\n\n\n--------\n\n* [Mahmoud Parsian's Author Page @Amazon](https://www.amazon.com/author/mahmoudparsian/)\n\n* [Mahmoud Parsian's Author Page @LinkedIn](https://www.linkedin.com/mahmoudparsian/)\n\n* This [new O'Reilly book](https://www.oreilly.com/library/view/data-algorithms-with/9781492082378/) \n  is the successor edition of [Data Algorithms](https://www.oreilly.com/library/view/data-algorithms/9781491906170/) \n  (published by [O'Reilly](https://www.oreilly.com/library/view/data-algorithms-with/9781492082378/))\n\n* This book uses PySpark (much simpler and more readable)\n\t\n* [Published date: April 8, 2022](https://www.oreilly.com/library/view/data-algorithms-with/9781492082378/)\n\n* [@OReillyMedia: Data Algorithms with Spark, By 
@mahmoudparsian](https://twitter.com/OReillyMedia/status/1511796122548903938/)\n\n* Author Contact: [ [![Email](https://support.microsoft.com/images/Mail-GrayScale.png) Email](mailto:mahmoud.parsian@yahoo.com) ]  [  [![Linkedin](https://i.stack.imgur.com/gVE0j.png) Mahmoud Parsian @LinkedIn](https://www.linkedin.com/mahmoudparsian/) ][  [![GitHub](https://i.stack.imgur.com/tskMh.png) Mahmoud Parsian @GitHub](https://github.com/mahmoudparsian/) ]\n\n\n-------\n\n## [GitHub Chapter Solutions](./code/)\n\n* This GitHub repository will host all source code and scripts for \n  [Data Algorithms with Spark](https://www.oreilly.com/library/view/data-algorithms-with/9781492082378/)\n\n* Chapter solutions are provided in [PySpark and Scala](./code/)\n\t* [PySpark solutions](./code/) are provided by [Mahmoud Parsian](https://github.com/mahmoudparsian/)\n\t* [Scala solutions](./code/) are provided by [Deepak Kumar](https://github.com/deepakmca05/) and [Biman Mandal](https://github.com/bimanmandal/)\n\t\n-----\n\n## Software:\n\nAll programs are tested with the following software:\n\n| Spark    |      Python      |  Scala | Java |\n|----------|:----------------:|-------:|-----------:|\n| [Apache Spark 3.4.0](http://spark.apache.org/downloads.html) |  [Python 3.10.5](https://www.python.org/downloads/) | [Scala 2.13](https://www.scala-lang.org/download/scala2.html) | [Java 11](https://www.oracle.com/java/technologies/javase/jdk11-archive-downloads.html) |\n\n-----\n\n## Table of Contents\n\n| Chapter      |      Title       |\n|--------------|------------------|\n| Glossary     | [Glossary of Big Data, MapReduce, Spark](https://github.com/mahmoudparsian/big-data-mapreduce-course/blob/master/slides/glossary/glossary_of_big_data_and_mapreduce.md) |\n| Chapter 1    | [Introduction to Data Algorithms](./code/chap01/) |\n| Chapter 2    | [Transformations in Action](./code/chap02/) |\n| Chapter 3    | [Mapper Transformations](./code/chap03/) |\n| Chapter 4    | [Reductions in 
Spark](./code/chap04/) |\n| Chapter 5    | [Partitioning Data](./code/chap05/) |\n| Chapter 6    | [Graph Algorithms](./code/chap06/) |\n| Chapter 7    | [Interacting with External Data Sources](./code/chap07/) |\n| Chapter 8    | [Ranking Algorithms](./code/chap08/) |\n| Chapter 9    | [Fundamental Data Design Patterns](./code/chap09/) |\n| Chapter 10   | [Common Data Design Patterns](./code/chap10/) |\n| Chapter 11   | [Join Design Patterns](./code/chap11/) |\n| Chapter 12   | [Feature Engineering in PySpark](./code/chap12/) |\n\n\n--------\n\n## Bonus Chapters\n\n\n| Bonus Chapter                     | Title / Description  |\n|-----------------------------------|----------------------|\n| Glossary                          | [Glossary of Big Data, MapReduce, Spark](https://github.com/mahmoudparsian/big-data-mapreduce-course/blob/master/slides/glossary/glossary_of_big_data_and_mapreduce.md)  |\n| Word Count                        | [Solutions for Word Count using RDDs and DataFrames](./code/bonus_chapters/wordcount/)  |\n| Anagrams                          | [Find words that are anagrams](./code/bonus_chapters/anagrams/) |\n| Lambda Expressions                | [Using Lambda Expressions in PySpark programs](./code/bonus_chapters/lambda_expressions/) |\n| TF-IDF                            | [Term Frequency - Inverse Document Frequency](./code/bonus_chapters/TF-IDF/) |\n| K-mers                            | [K-mers for DNA Sequences](./code/bonus_chapters/k-mers/) |\n| Correlation                       | [All vs. 
All Correlation](./code/bonus_chapters/correlation/)  |\n| Mapping Partitions                | [`mapPartitions()` Complete Example](./code/bonus_chapters/mappartitions/)  |\n| UDF                               | [User-Defined Function Examples](./code/bonus_chapters/UDF/)  |\n| DataFrames Transformations        | [Examples of Creating and Transforming DataFrames](./code/bonus_chapters/dataframes/) |\n| DataFrames Tutorials              | [DataFrames Tutorials: from collections and CSV text files](./code/bonus_chapters/dataframes/) |\n| Join Operations                   | [Examples of Joining RDDs and DataFrames](./code/bonus_chapters/join/) |\n| PySpark Tutorial 101              | [Examples of using PySpark RDDs and DataFrames](./code/bonus_chapters/pyspark_tutorial/) |\n| Physical Data Partitioning        | [Tutorial on Physical Data Partitioning](./code/bonus_chapters/physical_partitioning/README.md) |\n| Monoids and Combiners             | [Monoid as a Design Principle](https://github.com/mahmoudparsian/data-algorithms-with-spark/blob/master/wiki-spark/docs/monoid/README.md) |\n\n-------\n\n\u003ca href=\"https://www.oreilly.com/library/view/data-algorithms-with/9781492082378\"\u003e\n    \u003cimg\n        alt=\"Data Algorithms with Spark\"\n        src=\"images/data_algorithms_with_spark_knowledge_is_power.jpeg\"\n\u003e\n\u003c/a\u003e\n\n\u003ca href=\"https://www.oreilly.com/library/view/data-algorithms-with/9781492082378/\"\u003e\n    \u003cimg\n        alt=\"Data Algorithms with Spark\"\n        src=\"images/Data-Algorithms-with-Spark_mech2.png\"\n\u003e\n\u003c/a\u003e\n\n\u003ca href=\"https://www.oreilly.com/library/view/data-algorithms-with/9781492082378/\"\u003e\n    \u003cimg\n        alt=\"Data Algorithms with Spark\"\n        src=\"images/Data_Algorithms_with_Spark_COVER_9781492082385.png\"\n\u003e\n\u003c/a\u003e\n\n------\n\n\n\u003c!---      metadata         --\u003e\n\u003c!---   Data Algorithms with Spark, Spark, PySpark, Python --\u003e\n\u003c!---   MapReduce, Distributed Algorithms, 
mappers, reducers, partitioners --\u003e\n\u003c!---   Transformations, Actions, RDDs, DataFrames, SQL --\u003e\n\u003c!---   Data Design Patterns, monoids --\u003e\n\u003c!---   RDD map transformations: map(), flatMap(), mapPartitions() --\u003e\n\u003c!---   RDD reducers: groupByKey(), reduceByKey(), combineByKey() --\u003e\n\u003c!---   RDD actions: reduce(), collect()  --\u003e\n\u003c!---   RDD Tutorial --\u003e\n\u003c!---   DataFrame Tutorial --\u003e\n\u003c!---   Join operations --\u003e\n\u003c!---   RDD reducers --\u003e\n\u003c!---   DataFrames creation, manipulation, and transformations --\u003e\n\n-------","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmahmoudparsian%2Fdata-algorithms-with-spark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmahmoudparsian%2Fdata-algorithms-with-spark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmahmoudparsian%2Fdata-algorithms-with-spark/lists"}