{"id":14982368,"url":"https://github.com/tirthajyoti/spark-with-python","last_synced_at":"2025-04-05T12:05:43.751Z","repository":{"id":50601119,"uuid":"145349886","full_name":"tirthajyoti/Spark-with-Python","owner":"tirthajyoti","description":"Fundamentals of Spark with Python (using PySpark), code examples","archived":false,"fork":false,"pushed_at":"2022-10-29T13:26:53.000Z","size":9405,"stargazers_count":343,"open_issues_count":1,"forks_count":272,"subscribers_count":10,"default_branch":"master","last_synced_at":"2025-04-05T12:05:37.715Z","etag":null,"topics":["analytics","apache","apache-spark","big-data","database","dataframe","distributed-computing","hadoop","hdfs","machine-learning","map-reduce","mlib","parallel-computing","pyspark","python","spark","sql"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tirthajyoti.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-08-20T01:07:21.000Z","updated_at":"2025-03-15T18:24:10.000Z","dependencies_parsed_at":"2023-01-20T11:17:55.391Z","dependency_job_id":null,"html_url":"https://github.com/tirthajyoti/Spark-with-Python","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tirthajyoti%2FSpark-with-Python","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tirthajyoti%2FSpark-with-Python/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tirthajyoti%2FSpark-with-Python/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHu
b/repositories/tirthajyoti%2FSpark-with-Python/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tirthajyoti","download_url":"https://codeload.github.com/tirthajyoti/Spark-with-Python/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247332604,"owners_count":20921853,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["analytics","apache","apache-spark","big-data","database","dataframe","distributed-computing","hadoop","hdfs","machine-learning","map-reduce","mlib","parallel-computing","pyspark","python","spark","sql"],"created_at":"2024-09-24T14:05:17.332Z","updated_at":"2025-04-05T12:05:43.728Z","avatar_url":"https://github.com/tirthajyoti.png","language":"Jupyter Notebook","readme":"# Spark with Python\n\n## Apache Spark\n\u003ca href=\"https://spark.apache.org/\"\u003eApache Spark\u003c/a\u003e is one of the hottest new trends in the technology domain. It is the framework with probably the **highest potential to realize the fruit of the marriage between Big Data and Machine Learning**. 
It runs fast (up to 100x faster than traditional \u003ca href=\"https://www.tutorialspoint.com/hadoop/hadoop_mapreduce.htm\"\u003eHadoop MapReduce\u003c/a\u003e due to in-memory operation), offers robust, distributed, fault-tolerant data objects (called \u003ca href=\"https://www.tutorialspoint.com/apache_spark/apache_spark_rdd.htm\"\u003eRDD\u003c/a\u003e), and integrates beautifully with the world of machine learning and graph analytics through supplementary packages like \u003ca href=\"https://spark.apache.org/mllib/\"\u003eMLlib\u003c/a\u003e and \u003ca href=\"https://spark.apache.org/graphx/\"\u003eGraphX\u003c/a\u003e.\n\u003cbr\u003e\n\u003cp align='center'\u003e\n\u003cimg src=\"https://raw.githubusercontent.com/tirthajyoti/PySpark_Basics/master/Images/Spark%20ecosystem.png\" width=\"400\" height=\"400\"\u003e\n\u003c/p\u003e\nSpark is implemented on \u003ca href=\"https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html\"\u003eHadoop/HDFS\u003c/a\u003e and written mostly in \u003ca href=\"https://www.scala-lang.org/\"\u003eScala\u003c/a\u003e, a functional programming language that, like Java, runs on the JVM. In fact, Scala requires an up-to-date Java installation on your system. However, Scala is rarely the first language beginners learn when venturing into the world of data science. 
Fortunately, Spark provides a wonderful Python integration, called \u003cb\u003ePySpark, which lets Python programmers interface with the Spark framework and learn how to manipulate data at scale and work with objects and algorithms over a distributed file system.\u003c/b\u003e\n\n## Notebooks\n### RDD and basics\n* [SparkContext and RDD basics](https://github.com/tirthajyoti/Spark-with-Python/blob/master/SparkContext%20and%20RDD%20Basics.ipynb)\n* [SparkContext workers lazy evaluations](https://github.com/tirthajyoti/Spark-with-Python/blob/master/SparkContext_Workers_Lazy_Evaluations.ipynb)\n* [RDD chaining executions](https://github.com/tirthajyoti/Spark-with-Python/blob/master/RDD_Chaining_Execution.ipynb)\n* [Word count example with RDD](https://github.com/tirthajyoti/Spark-with-Python/blob/master/Word_Count.ipynb)\n* [Partitioning and Glomming](https://github.com/tirthajyoti/Spark-with-Python/blob/master/Partioning%20and%20Gloming.ipynb)\n### Dataframe\n* [Dataframe basics](https://github.com/tirthajyoti/Spark-with-Python/blob/master/Dataframe_basics.ipynb)\n* [Dataframe simple operations](https://github.com/tirthajyoti/Spark-with-Python/blob/master/DataFrame_operations_basics.ipynb)\n* [Dataframe row and column objects](https://github.com/tirthajyoti/Spark-with-Python/blob/master/Row_column_objects.ipynb)\n* [Dataframe groupBy and aggregate](https://github.com/tirthajyoti/Spark-with-Python/blob/master/GroupBy_aggregrate.ipynb)\n* [Dataframe SQL operations](https://github.com/tirthajyoti/Spark-with-Python/blob/master/Dataframe_SQL_query.ipynb)\n\n## Setting up Apache Spark with Python 3 and Jupyter notebook\nUnlike most Python libraries, getting PySpark to work properly is not as straightforward as `pip install ...` and `import ...`. Most of us with a Python-based data science and Jupyter/IPython background take this workflow for granted for all popular Python packages. 
We tend to just head over to our CMD or BASH shell, type the pip install command, launch a Jupyter notebook and import the library to start practicing.\n\u003e But, PySpark+Jupyter combo needs a little bit more love :-)\n\u003cbr\u003e\n\u003cp align='center'\u003e\n\u003cimg src=\"https://raw.githubusercontent.com/tirthajyoti/PySpark_Basics/master/Images/Components.png\" width=\"500\" height=\"300\"\u003e\n\u003c/p\u003e\n\n#### Check which version of Python is running. Python 3.4+ is needed.\n`python3 --version`\n\n#### Update apt-get\n`sudo apt-get update`\n\n#### Install pip3 (or pip for Python3)\n`sudo apt install python3-pip`\n\n#### Install Jupyter for Python3\n`pip3 install jupyter`\n\n#### Augment the PATH variable to launch Jupyter notebook\n`export PATH=$PATH:~/.local/bin`\n\n#### Java 8 is shown to work with UBUNTU 18.04  LTS/SPARK-2.3.1-BIN-HADOOP2.7\n```\nsudo add-apt-repository ppa:webupd8team/java\nsudo apt-get install oracle-java8-installer\nsudo apt-get install oracle-java8-set-default\n```\n#### Set Java related PATH variables\n```\nexport JAVA_HOME=/usr/lib/jvm/java-8-oracle\nexport JRE_HOME=/usr/lib/jvm/java-8-oracle/jre\n```\n#### Install Scala\n`sudo apt-get install scala`\n\n#### Install py4j for Python-Java integration\n`pip3 install py4j`\n\n#### Download latest Apache Spark (with pre-built Hadoop) from [Apache download server](https://spark.apache.org/downloads.html). Unpack Apache Spark after downloading\n`sudo tar -zxvf spark-2.3.1-bin-hadoop2.7.tgz`\n\n#### Set variables to launch PySpark with Python3 and enable it to be called from Jupyter notebook. 
Add all the following lines to the end of your .bashrc file\n```\nexport SPARK_HOME='/home/tirtha/Spark/spark-2.3.1-bin-hadoop2.7'\nexport PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH\nexport PYSPARK_DRIVER_PYTHON=\"jupyter\"\nexport PYSPARK_DRIVER_PYTHON_OPTS=\"notebook\"\nexport PYSPARK_PYTHON=python3\nexport PATH=$SPARK_HOME:$PATH:~/.local/bin:$JAVA_HOME/bin:$JAVA_HOME/jre/bin\n```\n#### Source .bashrc\n`source .bashrc`\n\n## Basics of `RDD`\nThe Resilient Distributed Dataset (RDD) is a fundamental data structure of Spark. It is an immutable distributed collection of objects. Each RDD is divided into logical partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.\n\nSpark makes use of the concept of RDD to achieve **faster and more efficient MapReduce operations.**\n\n\u003cimg src=\"https://www.oreilly.com/library/view/data-analytics-with/9781491913734/assets/dawh_0402.png\" width=\"650\" height=\"250\"\u003e\n\nFormally, an RDD is a read-only, partitioned collection of records. RDDs can be created through deterministic operations either on data in stable storage or on other RDDs. An RDD is a fault-tolerant collection of elements that can be operated on in parallel.\n\nThere are two ways to create RDDs:\n* parallelizing an existing collection in your driver program, \n* referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat.\n\n## Basics of the `Dataframe`\n\u003cp align='center'\u003e\u003cimg src=\"https://cdn-images-1.medium.com/max/1202/1*wiXLNwwMyWdyyBuzZnGrWA.png\" width=\"600\" height=\"400\"\u003e\u003c/p\u003e\n\n### DataFrame\n\nIn Apache Spark, a DataFrame is a distributed collection of rows under named columns. 
It is conceptually equivalent to a table in a relational database, an Excel sheet with column headers, or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs. It also shares some common characteristics with RDD:\n\n* __Immutable in nature__: once created, a DataFrame/RDD cannot be changed; applying a transformation produces a new DataFrame/RDD.\n* __Lazy evaluation__: a task is not executed until an action is performed.\n* __Distributed__: both RDDs and DataFrames are distributed in nature.\n\n### Advantages of the Dataframe\n\n* DataFrames are designed for processing large collections of structured or semi-structured data.\n* Observations in a Spark DataFrame are organised under named columns, which helps Apache Spark understand the schema of a DataFrame and optimize the execution plan for queries on it.\n* A DataFrame in Apache Spark can handle petabytes of data.\n* DataFrames support a wide range of data formats and sources.\n* The DataFrame API is available in several languages: Python, R, Scala, and Java.\n\n## Spark SQL\nSpark SQL provides a DataFrame API that can perform relational operations on both external data sources and Spark's built-in distributed collections—at scale!\n\nTo support a wide variety of data sources and algorithms in Big Data, Spark SQL introduces a novel extensible optimizer called Catalyst, which makes it easy to add data sources, optimization rules, and data types for advanced analytics such as machine learning.\nEssentially, Spark SQL leverages the power of Spark to perform distributed, robust, in-memory computations at massive scale on Big Data. 
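The lazy-evaluation behaviour described above can be sketched in plain Python, with no Spark installation required: "transformations" only record a pipeline of work, and nothing executes until an "action" forces it. Note that `lazy_map` and `collect` below are illustrative names for this sketch, not PySpark API.

```python
# A minimal, plain-Python sketch of Spark-style lazy evaluation.
# Transformations build up a generator pipeline without running anything;
# only the action (collect) actually pulls data through the pipeline.

log = []  # records when work actually happens

def lazy_map(func, data):
    """Transformation: returns a generator; nothing is computed yet."""
    return (log.append(x) or func(x) for x in data)

def collect(pipeline):
    """Action: forces the whole pipeline to execute."""
    return list(pipeline)

rdd_like = lazy_map(lambda x: x * 2, [1, 2, 3])
assert log == []              # nothing has executed yet (lazy)
result = collect(rdd_like)    # the action triggers the computation
assert result == [2, 4, 6]
assert log == [1, 2, 3]       # work happened only at collect() time
```

In real PySpark the same shape appears as, e.g., `rdd.map(...)` (lazy transformation) followed by `rdd.collect()` (action).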
\n\nSpark SQL provides state-of-the-art SQL performance and also maintains compatibility with all existing structures and components supported by Apache Hive (a popular Big Data warehouse framework), including data formats, user-defined functions (UDFs), and the metastore. It also helps ingest a wide variety of data formats (such as JSON, Hive tables, and Parquet) from Big Data sources and enterprise data warehouses, and perform a combination of relational and procedural operations for more complex, advanced analytics.\n\n![Spark-2](https://cdn-images-1.medium.com/max/2000/1*OY41hGbe4IB9-hHLRPuCHQ.png)\n\n### Speed of Spark SQL\nSpark SQL has been shown to be extremely fast, even comparable to C++-based engines such as Impala.\n\n![spark_speed](https://opensource.com/sites/default/files/uploads/9_spark-dataframes-vs-rdds-and-sql.png)\n\nThe following graph shows a benchmark of DataFrames vs. RDDs in different languages, which gives an interesting perspective on how optimized DataFrames can be.\n\n![spark-speed-2](https://opensource.com/sites/default/files/uploads/10_comparing-spark-dataframes-and-rdds.png)\n\nWhy is Spark SQL so fast and optimized? 
The reason is a new extensible optimizer, **Catalyst**, built on functional programming constructs in Scala.\n\nCatalyst's extensible design serves two purposes:\n\n* It makes it easy to add new optimization techniques and features to Spark SQL, especially for tackling diverse problems around Big Data, semi-structured data, and advanced analytics.\n* It lets developers extend the optimizer, for example by adding data-source-specific rules that can push filtering or aggregation into external storage systems, or support for new data types.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftirthajyoti%2Fspark-with-python","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftirthajyoti%2Fspark-with-python","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftirthajyoti%2Fspark-with-python/lists"}