https://github.com/wesslen/code-tutorials-for-sophi
Tutorials and templates for running Spark on UNCC's SOPHI platform
pyspark scala spark-sql
- Host: GitHub
- URL: https://github.com/wesslen/code-tutorials-for-sophi
- Owner: wesslen
- Created: 2016-11-16T00:01:54.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2017-02-15T01:01:11.000Z (almost 9 years ago)
- Last Synced: 2025-02-13T00:30:26.436Z (12 months ago)
- Topics: pyspark, scala, spark-sql
- Homepage:
- Size: 17.6 KB
- Stars: 1
- Watchers: 2
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
# SOPHI Code
## Introduction
This repository provides template code for running Spark on [SOPHI](http://sophi.uncc.edu). The code will include a mixture of Scala, PySpark and SparkR.
## Code
| Topics |
| --------------------------------------------------------------- |
| [Twitter Gnip SQL-DataFrame Manipulation with PySpark](/code/PySpark-Dataframe-Processing.md) |
| [Twitter Gnip Summary Count Files with PySpark](/code/PySpark-Gnip-Twitter-Summary-Files.md) |
| [Twitter Gnip Latent Dirichlet Allocation with Scala](/code/Scala-LDA.md) |
## How to access SOPHI
To access SOPHI, you must have an active UNCC ID username (student, faculty, or staff) and be connected to the UNCC network, either directly (eduroam) or through VPN. See the [VPN setup instructions](https://faq.uncc.edu/pages/viewpage.action?pageId=6653379) for how to set up VPN access.
SOPHI's Hue interface is available at [https://cci-hadoopm3.uncc.edu](https://cci-hadoopm3.uncc.edu).
To start, open that link and, when prompted, enter your UNCC ID and password.
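If the login page does not load, the usual cause is that the machine is not on the UNCC network or VPN. The following is a minimal sketch, not part of the original templates, for checking from your own machine whether a host accepts TCP connections. The helper name `can_reach` and the default port 443 (implied by the `https` URL) are assumptions made for illustration.

```python
import socket


def can_reach(host, port=443, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: it only tests network reachability, not whether
    your UNCC credentials are valid.
    """
    try:
        # create_connection resolves the host name and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

Calling `can_reach("cci-hadoopm3.uncc.edu")` should return `False` when you are off the campus network and not on VPN, assuming the host is only reachable from inside the UNCC network as described above.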
## How to open a Notebook
Within SOPHI, click the "Notebook" button on the top ribbon, then click the "+ Notebook" button to create a new notebook.
Inside the new notebook, create a PySpark or Scala session (SparkR is not available yet).
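Once a PySpark session is running, a small job is a quick way to confirm that it works. The sketch below is an illustration rather than one of the repository's templates. It assumes the notebook session predefines a SparkContext named `sc`, which is common for Hue notebook sessions but should be verified on SOPHI. The function names `tokenize` and `run_word_count` and the sample lines are made up for this example.

```python
from operator import add


def tokenize(line):
    """Lowercase a line and split it on whitespace."""
    return line.lower().split()


def run_word_count(sc, lines):
    """Count words across lines using the RDD API; returns a dict."""
    pairs = (
        sc.parallelize(lines)        # distribute the sample lines
        .flatMap(tokenize)           # one record per word
        .map(lambda word: (word, 1))
        .reduceByKey(add)            # sum the counts per word
        .collect()                   # bring the small result to the driver
    )
    return dict(pairs)


# Assumption: the notebook session exposes a SparkContext as `sc`.
# Outside such a session this guard simply skips the Spark job.
if "sc" in globals():
    sample = ["spark on sophi", "Spark SQL"]
    print(sorted(run_word_count(sc, sample).items()))
```

Keeping `tokenize` as a plain Python function means it can be checked without a cluster before it is handed to Spark.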
## Further Links
* [Spark Programming Guide](https://spark.apache.org/docs/latest/programming-guide.html)
* [Spark SQL and DataFrames Tutorial](http://spark.apache.org/docs/latest/sql-programming-guide.html)
* [Spark Machine Learning Library Tutorial](http://spark.apache.org/docs/latest/ml-guide.html)
* [Databricks' Spark Guides](https://docs.cloud.databricks.com/docs/latest/databricks_guide/index.html)
* [Automating PySpark Code through YARN and Oozie](http://gethue.com/how-to-schedule-spark-jobs-with-spark-on-yarn-and-oozie/)
* [PySpark and nltk (Anaconda)](https://docs.continuum.io/anaconda-cluster/howto/spark-nltk)
* [CY Lin's Big Data Analytics PySpark Tutorial](https://www.ee.columbia.edu/~cylin/course/bigdata/EECS6893-BigDataAnalytics-Lecture6.pdf)
* [Matteo Redaelli's PySpark Twitter GitHub Repository](https://github.com/matteoredaelli/pyspark-examples)
* [Duke Computational Statistics PySpark Tutorial](http://people.duke.edu/~ccc14/sta-663-2016/21A_Introduction_To_Spark.html)
* [XD-Deng's GitHub Tutorial on PySpark](https://github.com/XD-DENG/Spark-practice)
* [Charles Rawles' Using Apache Spark for Sports Analytics](https://content.pivotal.io/blog/how-data-science-assists-sports)