https://github.com/wesslen/code-tutorials-for-sophi

Tutorials and templates for running Spark on UNCC's SOPHI platform

pyspark scala spark-sql

# SOPHI Code

## Introduction

This repository provides template code for running Spark on [SOPHI](http://sophi.uncc.edu). The code will include a mixture of Scala, PySpark and SparkR.

## Code

| Topics |
| --------------------------------------------------------------- |
| [Twitter Gnip SQL-DataFrame Manipulation with PySpark](/code/PySpark-Dataframe-Processing.md) |
| [Twitter Gnip Summary Count Files with PySpark](/code/PySpark-Gnip-Twitter-Summary-Files.md) |
| [Twitter Gnip Latent Dirichlet Allocation with Scala](/code/Scala-LDA.md) |

## How to access SOPHI

To access SOPHI, you must have an active UNCC ID (student, faculty or staff) and be connected to the UNCC network, either directly (eduroam) or through VPN. See the [VPN setup instructions](https://faq.uncc.edu/pages/viewpage.action?pageId=6653379) if you need to connect from off campus.

SOPHI's Hue interface is available at [https://cci-hadoopm3.uncc.edu](https://cci-hadoopm3.uncc.edu).

To start, open that link and, when prompted, enter your UNCC ID and password.
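
If the login page does not load, it can help to confirm that your machine can reach the Hue host at all. The following is a minimal sketch using only the Python standard library; the helper names `host_and_port` and `is_reachable` are hypothetical and not part of this repository, and the only value taken from this README is the Hue URL.

```python
import socket
from urllib.parse import urlparse

# Hue URL as given in this README.
HUE_URL = "https://cci-hadoopm3.uncc.edu"


def host_and_port(url):
    """Split a URL into (hostname, port), defaulting to 443 for https and 80 otherwise."""
    parsed = urlparse(url)
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    return parsed.hostname, port


def is_reachable(url, timeout=3.0):
    """Return True if a TCP connection to the URL's host and port can be opened."""
    host, port = host_and_port(url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False
```

Calling `is_reachable(HUE_URL)` only tests network reachability, not your credentials. A `False` result suggests checking the eduroam or VPN connection before trying to log in again.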

## How to open a Notebook

Within SOPHI, click the "Notebook" button on the top ribbon and click the "+ Notebook" button to create a new Notebook.

In the new Notebook, start a PySpark or Scala session. SparkR sessions are not available yet.
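
As a first cell to confirm that a PySpark session works, a small hashtag count is sketched below. It assumes the notebook session predefines a SparkContext named `sc`, which is common for Hue PySpark sessions but should be checked on SOPHI. The sample tweets and the helper names `extract_hashtags` and `count_hashtags` are made up for illustration; the tutorials in this repository work on Twitter Gnip files instead.

```python
import re

# Hypothetical in-memory sample; replace with real data once the session works.
SAMPLE_TWEETS = [
    "Learning #Spark on SOPHI",
    "#spark and #PySpark notebooks",
    "No tags here",
]

HASHTAG = re.compile(r"#(\w+)")


def extract_hashtags(text):
    """Return the lowercased hashtags found in one tweet body."""
    return [tag.lower() for tag in HASHTAG.findall(text)]


def count_hashtags(sc, tweets):
    """Count hashtags with an RDD pipeline; `sc` is an existing SparkContext."""
    return (
        sc.parallelize(tweets)             # distribute the list as an RDD
        .flatMap(extract_hashtags)         # one record per hashtag
        .map(lambda tag: (tag, 1))         # pair each hashtag with a count of 1
        .reduceByKey(lambda a, b: a + b)   # sum the counts per hashtag
        .collectAsMap()                    # bring the small result back as a dict
    )


# Assumption: the notebook session defines `sc`. The guard lets the cell
# run without error where no Spark session exists.
if "sc" in globals():
    print(count_hashtags(sc, SAMPLE_TWEETS))
```

If the cell prints a dictionary of hashtag counts, the session is running and you can move on to the tutorials listed in the Code section.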

## Further Links

* [Spark Programming Guide](https://spark.apache.org/docs/latest/programming-guide.html)

* [Spark SQL and DataFrames Tutorial](http://spark.apache.org/docs/latest/sql-programming-guide.html)

* [Spark Machine Learning Library Tutorial](http://spark.apache.org/docs/latest/ml-guide.html)

* [Databricks' Spark Guides](https://docs.cloud.databricks.com/docs/latest/databricks_guide/index.html)

* [Automating PySpark Code through YARN and Oozie](http://gethue.com/how-to-schedule-spark-jobs-with-spark-on-yarn-and-oozie/)

* [PySpark and nltk (Anaconda)](https://docs.continuum.io/anaconda-cluster/howto/spark-nltk)

* [CY Lin's Big Data Analytics PySpark Tutorial](https://www.ee.columbia.edu/~cylin/course/bigdata/EECS6893-BigDataAnalytics-Lecture6.pdf)

* [Matteo Redaelli's PySpark Twitter GitHub Repository](https://github.com/matteoredaelli/pyspark-examples)

* [Duke Computational Statistics PySpark Tutorial](http://people.duke.edu/~ccc14/sta-663-2016/21A_Introduction_To_Spark.html)

* [XD-Deng's GitHub Tutorial on PySpark](https://github.com/XD-DENG/Spark-practice)

* [Charles Rawles' Using Apache Spark for Sports Analytics](https://content.pivotal.io/blog/how-data-science-assists-sports)