https://github.com/dciborow/spark-linalg
- Host: GitHub
- URL: https://github.com/dciborow/spark-linalg
- Owner: dciborow
- Created: 2018-05-22T14:43:28.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2018-11-06T05:28:28.000Z (over 6 years ago)
- Last Synced: 2024-10-18T07:16:57.458Z (4 months ago)
- Language: Scala
- Size: 39.1 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
## Background
The current matrix multiplication implementation in Spark runs into scaling issues. The matrix multiplication plan becomes unrunnable when the number of columns in the first matrix exceeds 40,000. This new approach is designed to overcome that issue and to provide a more scalable implementation of matrix multiplication, one that can complete when matrix sizes exceed millions of rows and columns. A script, SparseSparseRDDTimer, is included to reproduce the observed issues with the current implementation.

## Benchmarks
Benchmark results can be found in Benchmarks.xlsx, and the code that generates them is in Benchmarks.scala. The benchmarks were run on Azure Databricks, and the cluster configurations can be found in each tab of the spreadsheet. For larger benchmarks, it is recommended to first generate and save the dataset, and to also write the dataset to disk. Writing to disk forces Spark's lazy execution without collecting the results onto the driver. Sample code for this is included in DataGeneration.scala.

## Setup
###### Downloads
First, ensure that both Java 1.8 and Maven are installed. Newer versions of Java may cause build issues.

Install Java 1.8 in an easily accessible directory (for this example, ~/java/):
http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
Download Apache Maven 3.3.9 and extract it (un-tar/gunzip) to ~/apache/apache-maven-3.3.9/:
https://maven.apache.org/download.cgi

Create a directory ~/src/:
```
mkdir ~/src/
cd ~/src/
```
Clone the project repository into `~/src`:
```
git clone https://github.com/dciborow/spark-linalg.git
```
###### Building Project with Apache Maven
From the project directory, run the following Maven command to build the project.

JVM build:
```
mvn clean install -DskipTests
```
#### Testing the Project Environment
To launch the shell in local mode with 2 threads, run:
```
$ spark-shell --master "local[2]" --jars "target/linalg-1.0-SNAPSHOT.jar"
```
At the `scala>` prompt, enter:
```
scala> :load examples/bin/SparseSparseDrmTimer.scala
```
This loads the definition of a matrix multiplication timer function. To run the timer:
```
scala> timeSparseDRMMMul(1000,1000,1000,1,.02,1234L)
{...} res3: Long = 16321
```
## Azure Databricks - Setup
First, follow the directions above to build a JAR from the project with Maven. Then upload the JAR as a new library in Databricks and attach the library to your desired cluster. Next, upload Benchmarks.scala as a notebook and run it on that cluster.
## Troubleshooting
If you run into the following error, the sample ratio may be too small for the size of your matrices. Try increasing the sample ratio.
```
java.lang.UnsupportedOperationException: empty collection
```
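
One way to sanity-check a sample ratio before a run is to estimate how many nonzero entries it would produce. The sketch below is plain Scala and is not part of this project. It assumes the sample ratio is the independent probability that each matrix entry is kept, which this README does not confirm, and the names `SampleRatioCheck`, `expectedNonZeros`, and `probEmpty` are hypothetical.

```scala
// Hypothetical helpers, not part of spark-linalg.
// Assumption: each of the rows * cols entries is kept independently
// with probability `sampleRatio`.
object SampleRatioCheck {

  private def validate(rows: Long, cols: Long, sampleRatio: Double): Unit = {
    require(rows > 0 && cols > 0, "matrix dimensions must be positive")
    require(sampleRatio >= 0.0 && sampleRatio <= 1.0, "sampleRatio must be in [0, 1]")
  }

  /** Expected number of nonzero entries under the assumption above. */
  def expectedNonZeros(rows: Long, cols: Long, sampleRatio: Double): Double = {
    validate(rows, cols, sampleRatio)
    rows.toDouble * cols.toDouble * sampleRatio
  }

  /** Probability that no entry is sampled at all, i.e. (1 - ratio)^(rows * cols). */
  def probEmpty(rows: Long, cols: Long, sampleRatio: Double): Double = {
    validate(rows, cols, sampleRatio)
    // log1p keeps precision when sampleRatio is very small.
    math.exp(rows.toDouble * cols.toDouble * math.log1p(-sampleRatio))
  }
}
```

Under that assumption, the 1000 x 1000 example above with a ratio of .02 would expect about 20,000 nonzero entries, while a 10 x 10 matrix with a ratio of 0.001 would come out empty roughly 90% of the time. An empty sample is one plausible cause of the error: Spark's `RDD.reduce` throws `UnsupportedOperationException` with the message `empty collection` when called on an empty RDD. Whether this project hits that exact code path is not confirmed here.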