https://github.com/addono/id2221-final-project
- Host: GitHub
- URL: https://github.com/addono/id2221-final-project
- Owner: Addono
- Created: 2019-10-18T16:03:00.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2021-05-07T11:42:26.000Z (almost 5 years ago)
- Last Synced: 2024-12-27T00:27:26.355Z (about 1 year ago)
- Language: Scala
- Size: 36.1 KB
- Stars: 0
- Watchers: 4
- Forks: 0
- Open Issues: 0
# ID2221 - Final Project
Builds a directed graph from all public GitHub events (pushing public commits, creating repositories, ...). Each event is an edge that connects a user with a repository.
## Table of Contents
+ [About](#about)
+ [Getting Started](#getting_started)
+ [Usage](#usage)
+ [Resources](#resources)
## About
This project uses the [GitHub Archive](https://gharchive.org) dataset.
At the time of writing, the dataset holds about 789 GiB of compressed data:
```bash
$ gsutil du -sh gs://data.gharchive.org
789.57 GiB gs://data.gharchive.org
```
## Getting Started
### Prerequisites
* [scala](https://scala-lang.org/download/)
* [sbt](https://www.scala-sbt.org/download.html)
* [Google Cloud SDK](https://cloud.google.com/sdk/) (optional, alternatively use the [web console](https://console.cloud.google.com))
### Installing
Clone the repository and build the application:
```bash
sbt assembly
```
## Usage
[Create a DataProc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) if you don't have one running yet.
Launch the project on a DataProc cluster. Make sure to update the cluster name, the cluster's region (if not set to global), the input selector, and the output location.
**Note: The input selector must not match files created before 1 January 2015. For example, `201*-01-01-10` is invalid because it also matches years before 2015.**
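The rule in the note above can be checked locally before submitting a job. The following is a minimal sketch, not part of the project: `selector_is_safe` is a hypothetical helper, and it assumes that a selector is a glob matched against the GH Archive file name stem `YYYY-MM-DD-H` (hour 0 to 23, not zero-padded). It only understands the patterns bash supports in `[[ ]]` (`*`, `?`, `[...]`), so selectors using `{a,b}` alternation would need a different check.

```bash
#!/usr/bin/env bash
# Sketch: reject input selectors that could match GH Archive files from before 2015.
# Assumption: the selector is a glob over the file name stem YYYY-MM-DD-H.

selector_is_safe() {
  local selector=$1 year month day hour stem
  # Enumerate every candidate stem from 2011 through 2014. Impossible dates
  # such as 02-31 are included, which only makes the check more conservative.
  for ((year = 2011; year <= 2014; year++)); do
    for ((month = 1; month <= 12; month++)); do
      for ((day = 1; day <= 31; day++)); do
        for ((hour = 0; hour <= 23; hour++)); do
          printf -v stem '%d-%02d-%02d-%d' "$year" "$month" "$day" "$hour"
          # The unquoted right-hand side makes [[ ]] treat the selector as a glob.
          if [[ $stem == $selector ]]; then
            echo "unsafe: '$selector' matches $stem" >&2
            return 1
          fi
        done
      done
    done
  done
  return 0
}
```

A possible use is to guard the submission, for example `selector_is_safe "2015-01-01-*" && gcloud dataproc jobs submit spark ...`.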
The `GraphBuilder` can be used to construct the graphs without running any community detection algorithms.
```bash
gcloud dataproc jobs submit spark --jars target/scala-2.11/github-graphframe-builder-assembly-0.2.jar --cluster gh-archive-dataproc --region europe-west1 --class GraphBuilder -- "2015-01-01-*" gs://gh-graphframes/2015-01-01
```
The `LabelPropagationRunner` constructs the graph and then runs the label propagation algorithm on it. Its first argument is the location where the output files are written. All following arguments are input selectors, each of which is scheduled as a separate batch.
```bash
gcloud dataproc jobs submit spark --jars target/scala-2.11/github-graphframe-builder-assembly-0.2.jar --cluster gh-archive-dataproc --region europe-west1 --class LabelPropagationRunner -- gs://gh-graphframes "2015-01-01-*"
```
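Because the output location comes first and every further argument is an input selector, several batches can be passed in one submission. The sketch below only prints the resulting command instead of running it. The helper name `lpa_submit_command` and the second selector are illustrative assumptions; the JAR path, cluster, region and bucket are the example values used above.

```bash
#!/usr/bin/env bash
# Sketch: print (not run) a LabelPropagationRunner submission for several selectors.
# lpa_submit_command is a hypothetical helper, not part of the project.

JAR=target/scala-2.11/github-graphframe-builder-assembly-0.2.jar
CLUSTER=gh-archive-dataproc
REGION=europe-west1

lpa_submit_command() {
  if [[ $# -lt 2 ]]; then
    echo "usage: lpa_submit_command OUTPUT SELECTOR..." >&2
    return 1
  fi
  local output=$1
  shift
  # Output location first, then one argument per input selector (one batch each).
  local cmd=(gcloud dataproc jobs submit spark
    --jars "$JAR" --cluster "$CLUSTER" --region "$REGION"
    --class LabelPropagationRunner --
    "$output" "$@")
  # %q quotes each word so the printed line can be pasted into a shell as-is.
  printf '%q ' "${cmd[@]}"
  printf '\n'
}

lpa_submit_command gs://gh-graphframes "2015-01-01-*" "2015-01-02-*"
```

Keeping the selectors quoted matters: an unquoted `2015-01-01-*` would be expanded by the local shell if matching files exist in the working directory.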
### Cleanup
To prevent unnecessary costs, make sure to destroy all resources you no longer use.
The DataProc cluster is probably the most expensive resource to keep running. Delete it by running:
```bash
gcloud dataproc clusters delete gh-archive-dataproc --region europe-west1
```
Along the way we also stored some files in Cloud Storage, e.g. the JAR we assembled or the Parquet files written as job artifacts (see the output location argument). Remove them with:
```bash
# Individual files
gsutil rm gs:///myfilename.txt
# Directories
gsutil rm -r gs:///
```
## Resources
* [Write and run Spark Scala jobs on Cloud Dataproc](https://cloud.google.com/dataproc/docs/tutorials/spark-scala)
* [GitHub Archive](https://www.gharchive.org)
* [GraphFrames Documentation](https://graphframes.github.io/graphframes/docs/_site/index.html)