{"id":26129519,"url":"https://github.com/mathworks-ref-arch/matlab-apache-spark","last_synced_at":"2025-04-13T18:42:41.017Z","repository":{"id":78586830,"uuid":"439011331","full_name":"mathworks-ref-arch/matlab-apache-spark","owner":"mathworks-ref-arch","description":null,"archived":false,"fork":false,"pushed_at":"2023-02-15T14:06:43.000Z","size":116,"stargazers_count":3,"open_issues_count":0,"forks_count":2,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-03-10T19:55:17.629Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"MATLAB","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mathworks-ref-arch.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.MD","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-12-16T14:06:18.000Z","updated_at":"2023-08-05T12:47:08.000Z","dependencies_parsed_at":"2023-06-10T03:45:30.563Z","dependency_job_id":null,"html_url":"https://github.com/mathworks-ref-arch/matlab-apache-spark","commit_stats":null,"previous_names":[],"tags_count":9,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mathworks-ref-arch%2Fmatlab-apache-spark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mathworks-ref-arch%2Fmatlab-apache-spark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mathworks-ref-arch%2Fmatlab-apache-spark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mathworks-ref-arch%2Fmatlab-apache-spark/manifests","owner_url":"https://repos.ecosyste.ms
/api/v1/hosts/GitHub/owners/mathworks-ref-arch","download_url":"https://codeload.github.com/mathworks-ref-arch/matlab-apache-spark/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248764973,"owners_count":21158196,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-03-10T19:49:16.120Z","updated_at":"2025-04-13T18:42:41.003Z","avatar_url":"https://github.com/mathworks-ref-arch.png","language":"MATLAB","funding_links":[],"categories":[],"sub_categories":[],"readme":"# MATLAB Interface *for Apache Spark*\n\nApache Spark™ is a unified analytics engine for large-scale data processing.\n\n## Requirements\n### MathWorks Products (https://www.mathworks.com)\nRequires MATLAB release R2019a or later.\n* MATLAB\n* (optional) MATLAB Parallel Server\n* (optional) MATLAB Compiler\n* (optional) MATLAB Compiler SDK\n\n### 3rd Party Products\n\n* JDK 8+\n* Apache Maven™ 3.6.3\n\n## Introduction\n\nThis package enables connecting to a local or a remote Apache Spark cluster,\nto read and write data, and perform certain Spark operations on this data.\nMATLAB ships with two workflows involving Apache Spark, \n[Deploy Applications Using the MATLAB API for Spark](https://www.mathworks.com/help/releases/R2022a/compiler/deploy-applications-using-the-matlab-api-for-spark.html)\nand\n[Deploy Tall Arrays to a Spark Enabled Hadoop Cluster](https://www.mathworks.com/help/releases/R2022a/compiler/deploy-tall-arrays-to-a-spark-enabled-hadoop-cluster.html).\n\nThis package provides some different workflows for using MATLAB and 
Spark together.\n* Using interactive Spark sessions to read, write and operate on data on a Spark cluster.\n* Using the MATLAB Compiler SDK to compile MATLAB code into a form that can be used in\nSpark Dataset methods (map, filter, etc.). This can be used from both Java and Python.\n\n## Installation\n\n### Clone this repository recursively\n\n```\n git clone --recursive git@github.com:mathworks-ref-arch/matlab-apache-spark.git\n```\n\u003e **Note:** If you're reading this from a downloaded package,\n\u003e you don't need to clone anything.\n\n### Build MATLAB Spark Utility\n\nBefore the package can be used, some Jar files must be built.\n\nPlease refer to the [Installation section](Documentation/Installation.md) for this.\n\n## Usage\n\n### Interactive Spark\nFirst, create a Spark session\n```matlab\n\n% A second argument can be used for choosing a Spark cluster. This example uses\n% a local (in-process) Spark.\nspark = getDefaultSparkSession('my-spark-app')\n```\nCreate a dataset in Spark\n```matlab\nflightsCSV = which('airlinesmall.csv');\nflights = spark.read.format(\"csv\") ...\n    .option(\"header\", \"true\") ...\n    .option(\"inferSchema\", \"true\") ...\n    .load(addFileProtocol(flightsCSV));\n```\n\nCheck how many rows are in this dataset\n```matlab\n\u003e\u003e fprintf(\"Number of flights: #%d\\n\", flights.count)\nNumber of flights: #123523\n```\n\nChange some datatypes from text to int in the dataset\n```matlab\ncleanFlights = flights ...\n    .withColumn('ArrDelay', flights.col('ArrDelay').cast('int')) ...\n    .withColumn('DepDelay', flights.col('DepDelay').cast('int'));\n```\n\nDo a `groupBy` to count how many flights leave a given airport on a given day of the week,\nthen sort the results, first ascending, then descending.\n```matlab\n% How many flights leave a certain origin on a certain day of week?\n\u003e\u003e dow=cleanFlights.groupBy('Origin', 'DayOfWeek').count();\n\u003e\u003e 
dow.show(5)\n+------+---------+-----+\n|Origin|DayOfWeek|count|\n+------+---------+-----+\n|   CLT|        4|  378|\n|   BNA|        6|  154|\n|   CLE|        5|  224|\n|   MSY|        2|  141|\n|   TPA|        3|  196|\n+------+---------+-----+\nonly showing top 5 rows\n\n% And if we sort them?\n\u003e\u003e dow.sort(\"count\").show(5)\n+------+---------+-----+\n|Origin|DayOfWeek|count|\n+------+---------+-----+\n|   LRD|        2|    1|\n|   DLG|        2|    1|\n|   SMX|        1|    1|\n|   RDM|        2|    1|\n|   CHO|        2|    1|\n+------+---------+-----+\nonly showing top 5 rows\n\n% The default sort is ascending. To sort in descending order,\n% we have to be more precise: provide a column object\n% and set it to be descending.\n\u003e\u003e dow.sort(dow.col(\"count\").desc()).show(10)\n+------+---------+-----+\n|Origin|DayOfWeek|count|\n+------+---------+-----+\n|   ORD|        2| 1011|\n|   ORD|        4|  988|\n|   ORD|        3|  968|\n|   ORD|        1|  942|\n|   ORD|        7|  929|\n|   ORD|        5|  927|\n|   ATL|        1|  914|\n|   ORD|        6|  908|\n|   ATL|        2|  901|\n|   ATL|        5|  899|\n+------+---------+-----+\nonly showing top 10 rows\n```\n\nUse a SQL statement to filter the flights, convert the data to a MATLAB table,\nand plot some data on delays\n```matlab\nAAflights = cleanFlights.filter(\"UniqueCarrier LIKE 'AA' AND DayOfWeek = 3\") ...\n    .select('ArrDelay', 'DepDelay');\nAAflights.count()\n\nAAT = table(AAflights);\nplot([AAT.ArrDelay, AAT.DepDelay]); shg;\ntitle(\"Arrival and Departure delay for Wednesday AA Flights\");\nlegend(\"Arrival\", \"Departure\");\n\nylabel(\"Delay\");\nxlabel(\"Flights\");\n```\n\n### Compiling code to Java or Python\n\n\nPlease see the [documentation](Documentation/README.md) for more information.\n\n## License\n\nThe license for the MATLAB Interface *for Apache Spark* is available in the LICENSE.md file in this GitHub repository.\nThis package 
uses certain third-party content which is licensed under separate license agreements.\nSee the pom.xml file for third-party software downloaded at build time.\n\n## Enhancement Requests\n\nProvide suggestions for additional features or capabilities using the following link:\nhttps://www.mathworks.com/products/reference-architectures/request-new-reference-architectures.html\n\n## Support\n\nEmail: mwlab@mathworks.com\n\n[//]: #  (Copyright 2021-2022 The MathWorks, Inc.)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmathworks-ref-arch%2Fmatlab-apache-spark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmathworks-ref-arch%2Fmatlab-apache-spark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmathworks-ref-arch%2Fmatlab-apache-spark/lists"}