https://github.com/mathworks-ref-arch/matlab-apache-spark

Last synced: about 1 year ago

Host: GitHub
URL: https://github.com/mathworks-ref-arch/matlab-apache-spark
Owner: mathworks-ref-arch
License: other
Created: 2021-12-16T14:06:18.000Z (over 4 years ago)
Default Branch: main
Last Pushed: 2023-02-15T14:06:43.000Z (over 3 years ago)
Last Synced: 2025-03-10T19:55:17.629Z (over 1 year ago)
Language: MATLAB
Size: 113 KB
Stars: 3
Watchers: 5
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.MD
- Security: SECURITY.md


# MATLAB Interface *for Apache Spark*

Apache Spark™ is a unified analytics engine for large-scale data processing.

## Requirements

### MathWorks Products (https://www.mathworks.com)

Requires MATLAB release R2019a or later.

* MATLAB

* (optional) MATLAB Parallel Server

* (optional) MATLAB Compiler

* (optional) MATLAB Compiler SDK

### 3rd Party Products

* JDK 8+

* Apache Maven™ 3.6.3
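The prerequisites above can be checked from the MATLAB prompt before building. The following is a minimal sketch that uses only core MATLAB functions and is not part of this package. It assumes that `mvn` is on the system path visible to MATLAB.

```matlab
% Sketch: check the listed prerequisites from the MATLAB prompt.
% R2019a corresponds to MATLAB version 9.6.
matlabOk = ~verLessThan('matlab', '9.6');
fprintf('MATLAB R2019a or later: %d\n', matlabOk);

% 'mvn -version' reports the Maven version and the JDK that Maven runs on.
% Assumption: mvn is on the system path seen by MATLAB.
[status, out] = system('mvn -version');
if status == 0
    disp(strtrim(out));
else
    warning('Maven was not found on the system path.');
end
```

Compare the reported Maven and Java versions against the list above. The JDK that Maven uses can differ from the JVM bundled with MATLAB.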

## Introduction

This package enables connecting to a local or a remote Apache Spark cluster,

to read and write data, and perform certain Spark operations on this data.

MATLAB ships with two workflows involving Apache Spark:

[Deploy Applications Using the MATLAB API for Spark](https://www.mathworks.com/help/releases/R2022a/compiler/deploy-applications-using-the-matlab-api-for-spark.html)

and

[Deploy Tall Arrays to a Spark Enabled Hadoop Cluster](https://www.mathworks.com/help/releases/R2022a/compiler/deploy-tall-arrays-to-a-spark-enabled-hadoop-cluster.html).

This package provides additional workflows for using MATLAB and Spark together:

* Using interactive Spark sessions to read, write, and operate on data on a Spark cluster.

* Using the MATLAB Compiler SDK to compile MATLAB code into a form that can be used in

Spark Dataset methods (map, filter, etc.). This can be used both with Java and Python.

## Installation

### Clone this repository recursively

```

 git clone --recursive git@github.com:mathworks-ref-arch/matlab-apache-spark.git

```

> **Note:** If you're reading this from a downloaded package,

> you don't need to clone anything.

### Build MATLAB Spark Utility

Before the package can be used, some Jar files must be built.

Please refer to the [Installation section](Documentation/Installation.md) for this.

## Usage

### Interactive Spark

First, create a Spark session

```matlab

% A second argument can be used to choose a Spark cluster. This example uses

% a local (in-process) Spark instance.

spark = getDefaultSparkSession('my-spark-app')

```

Create a dataset in Spark

```matlab

flightsCSV = which('airlinesmall.csv');

flights = spark.read.format("csv") ...

    .option("header", "true") ...

    .option("inferSchema", "true") ...

    .load(addFileProtocol(flightsCSV));

```

Check how many rows are in this dataset

```matlab

>> fprintf("Number of flights: #%d\n", flights.count)

Number of flights: #123523

```

Change some datatypes from text to int in the dataset

```matlab

cleanFlights = flights ...

    .withColumn('ArrDelay', flights.col('ArrDelay').cast('int')) ...

    .withColumn('DepDelay', flights.col('DepDelay').cast('int'));

```

Do a `groupBy` to count how many flights leave each airport on each day of the week,

then sort the results, first in ascending and then in descending order.

```matlab

% How many flights leave a certain origin on a certain day of week?

>> dow=cleanFlights.groupBy('Origin', 'DayOfWeek').count();

>> dow.show(5)

+------+---------+-----+

|Origin|DayOfWeek|count|

+------+---------+-----+

|   CLT|        4|  378|

|   BNA|        6|  154|

|   CLE|        5|  224|

|   MSY|        2|  141|

|   TPA|        3|  196|

+------+---------+-----+

only showing top 5 rows

% And if we sort them?

>> dow.sort("count").show(5)

+------+---------+-----+

|Origin|DayOfWeek|count|

+------+---------+-----+

|   LRD|        2|    1|

|   DLG|        2|    1|

|   SMX|        1|    1|

|   RDM|        2|    1|

|   CHO|        2|    1|

+------+---------+-----+

only showing top 5 rows

% The default sort is ascending, but if we want to see the descending 

% order, we have to get more precise. We have to provide a column object,

% and set it to be descending in order.

>> dow.sort(dow.col("count").desc()).show(10)

+------+---------+-----+

|Origin|DayOfWeek|count|

+------+---------+-----+

|   ORD|        2| 1011|

|   ORD|        4|  988|

|   ORD|        3|  968|

|   ORD|        1|  942|

|   ORD|        7|  929|

|   ORD|        5|  927|

|   ATL|        1|  914|

|   ORD|        6|  908|

|   ATL|        2|  901|

|   ATL|        5|  899|

+------+---------+-----+

only showing top 10 rows

```
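For a small file such as `airlinesmall.csv`, the grouped counts above can be cross-checked locally without Spark. The following is a minimal sketch using only core MATLAB functions (`readtable`, `groupsummary`, `sortrows`). It is not part of this package, and it assumes that `airlinesmall.csv` is on the MATLAB path and that missing values are encoded as `NA`.

```matlab
% Sketch: local cross-check of the Spark groupBy/count, using core MATLAB only.
flightsCSV = which('airlinesmall.csv');

% Assumption: missing values in the file are written as 'NA'.
T = readtable(flightsCSV, 'TreatAsMissing', 'NA');

% With no method given, groupsummary returns one row per group
% and a GroupCount variable holding the number of rows in that group.
G = groupsummary(T, {'Origin', 'DayOfWeek'});

% Sort by count, largest first, and show the leading rows.
G = sortrows(G, 'GroupCount', 'descend');
disp(G(1:10, :));
```

If both approaches read the same file, the leading rows should correspond to the descending Spark result shown above.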

Use a SQL statement to filter the flights, convert the data to a MATLAB table,

and plot some data on delays

```matlab

AAflights = cleanFlights.filter("UniqueCarrier LIKE 'AA' AND DayOfWeek = 3") ...

    .select('ArrDelay', 'DepDelay');

AAflights.count()

AAT = table(AAflights);

plot([AAT.ArrDelay, AAT.DepDelay]); shg;

title("Arrival and Departure delay for Wednesday AA Flights");

legend("Arrival", "Departure");

ylabel("Delay");

xlabel("Flights");

```
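The same selection can be reproduced locally to compare against the table converted from Spark. The following is a minimal sketch using only core MATLAB functions and is not part of this package. The variable name `AATlocal` is hypothetical, and the sketch assumes that `airlinesmall.csv` is on the MATLAB path with missing values encoded as `NA`.

```matlab
% Sketch: local equivalent of the Spark filter and select, core MATLAB only.
% Assumption: missing values in the file are written as 'NA'.
T = readtable(which('airlinesmall.csv'), 'TreatAsMissing', 'NA');

% A LIKE pattern without wildcards behaves as an exact match, hence strcmp.
isAAWed = strcmp(T.UniqueCarrier, 'AA') & T.DayOfWeek == 3;

% Keep only the two delay columns, as in the select call.
AATlocal = T(isAAWed, {'ArrDelay', 'DepDelay'});

fprintf('Rows selected: %d\n', height(AATlocal));
fprintf('Mean arrival delay: %.1f\n', mean(AATlocal.ArrDelay, 'omitnan'));
fprintf('Mean departure delay: %.1f\n', mean(AATlocal.DepDelay, 'omitnan'));
```

The row count printed here can be compared with the result of `AAflights.count()`.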

### Compiling code to Java or Python

Please see the [documentation](Documentation/README.md) for more information.

## License

The license for the MATLAB Interface *for Apache Spark* is available in the LICENSE.md file in this GitHub repository.

This package uses certain third-party content which is licensed under separate license agreements.

See the pom.xml file for third-party software downloaded at build time.

## Enhancement Requests

Provide suggestions for additional features or capabilities using the following link:

https://www.mathworks.com/products/reference-architectures/request-new-reference-architectures.html

## Support

Email: mwlab@mathworks.com

[//]: #  (Copyright 2021-2022 The MathWorks, Inc.)
