Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/alecruces/bigdata

Unleash the power of Apache Spark in distributed graph analytics - from triangle counting to stream processing, dive deep into scalable big data computation

apache-spark pyspark streaming

Last synced: 5 days ago

Host: GitHub
URL: https://github.com/alecruces/bigdata
Owner: alecruces
Created: 2024-04-06T18:03:48.000Z (6 months ago)
Default Branch: main
Last Pushed: 2024-04-10T11:01:34.000Z (6 months ago)
Last Synced: 2024-09-28T07:02:41.886Z (5 days ago)
Topics: apache-spark, pyspark, streaming
Language: Python
Homepage:
Size: 31.3 KB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Exploring Distributed Triangle Counting and Stream Processing with Spark
---

## Description
This project implements and compares distributed algorithms for approximating the number of triangles in a graph using Apache Spark, and it also covers sketch-based processing of a data stream. It consists of three parts:

## Part 1: Distributed Triangle Counting with Apache Spark: Node Coloring vs. Spark Partitions
This part implements and compares two distributed algorithms:

- `MR_ApproxTCwithNodeColors`: Utilizes node coloring to assign colors to vertices and aggregates triangles based on these colors.
- `MR_ApproxTCwithSparkPartitions`: Employs Spark's partitioning mechanism to distribute computation efficiently across the cluster.

The implementation relies on Spark transformations such as `mapPartitionsWithIndex` and `groupByKey`, and it validates the input parameters and the input file before running.
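The idea behind both estimators can be sketched in plain Python, without Spark. This is a minimal illustration under stated assumptions, not the repository's code: the linear hash with prime `P = 8191`, the `C**2` rescaling, and the function names `count_triangles`, `approx_tc_with_node_colors`, and `approx_tc_with_partitions` are all hypothetical. The rescaling assumes that a triangle survives the split with probability about `1/C**2` (all three vertices get the same color, or all three edges land in the same partition).

```python
import random
from collections import defaultdict

P = 8191  # prime modulus for the coloring hash (assumed value)


def count_triangles(edges):
    """Exact triangle count for a small edge list, using adjacency sets."""
    unique_edges = {(min(u, v), max(u, v)) for u, v in edges if u != v}
    neighbors = defaultdict(set)
    for u, v in unique_edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    # Each common neighbor of u and v closes one triangle on edge (u, v).
    closed = sum(len(neighbors[u] & neighbors[v]) for u, v in unique_edges)
    return closed // 3  # every triangle is seen once per edge


def approx_tc_with_node_colors(edges, num_colors, seed=None):
    """Estimate triangles from monochromatic edges only (hypothetical sketch)."""
    rng = random.Random(seed)
    a = rng.randint(1, P - 1)
    b = rng.randint(0, P - 1)

    def color(u):
        return ((a * u + b) % P) % num_colors

    # Map + group step: keep edges whose endpoints share a color, keyed by color.
    by_color = defaultdict(list)
    for u, v in edges:
        if color(u) == color(v):
            by_color[color(u)].append((u, v))

    # Reduce step: count triangles per color class, then sum and rescale.
    return num_colors ** 2 * sum(count_triangles(e) for e in by_color.values())


def approx_tc_with_partitions(edges, num_partitions, seed=None):
    """Estimate triangles from a random edge partition (hypothetical sketch)."""
    rng = random.Random(seed)
    partitions = defaultdict(list)
    for edge in edges:
        partitions[rng.randrange(num_partitions)].append(edge)
    return num_partitions ** 2 * sum(
        count_triangles(part) for part in partitions.values()
    )
```

With a single color or a single partition, both functions reduce to the exact count. In a Spark version, the grouping loops would presumably correspond to `groupByKey` and `mapPartitionsWithIndex` respectively.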

## Part 2: Triangle Counting in CloudVeneto Cluster
This part runs a Spark program on the CloudVeneto cluster to estimate the number of triangles in an undirected graph 𝐺=(𝑉,𝐸) using two algorithms:

- `MR_ApproxTCwithNodeColors`: Utilizes node coloring to assign colors to vertices and aggregates triangles based on these colors.
- `MR_ExactTC`: Precisely counts triangles in the graph by exhaustively examining all possible triangles.

Both algorithms leverage Apache Spark's parallel processing capabilities across the cluster. The code also includes functionality to measure execution time and provides options for performance evaluation.
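As a rough illustration of exact counting and of timing a run, here is a small single-machine sketch. It is not the repository's `MR_ExactTC`: the function names `exact_triangle_count` and `average_runtime_ms`, the low-to-high edge orientation, and the `repeats` parameter are assumptions made for this example.

```python
import time
from collections import defaultdict


def exact_triangle_count(edges):
    """Count each triangle once by orienting edges from lower to higher id."""
    higher = defaultdict(set)
    for u, v in edges:
        if u != v:
            higher[min(u, v)].add(max(u, v))
    total = 0
    for u, outs in list(higher.items()):
        for v in outs:
            # Any w in both sets satisfies u < v < w with edges (u, w), (v, w).
            total += len(outs & higher.get(v, set()))
    return total


def average_runtime_ms(func, *args, repeats=3):
    """Run func(*args) several times; return its last result and mean time in ms."""
    elapsed = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        elapsed.append((time.perf_counter() - start) * 1000)
    return result, sum(elapsed) / len(elapsed)
```

Calling `average_runtime_ms(exact_triangle_count, edges, repeats=5)` returns the count together with the mean wall-clock time, which makes it easy to compare an exact run against an approximate one on the same input.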

## Part 3: Count Sketch with Spark Streaming
This part processes an unbounded stream of data in batches using Apache Spark's `StreamingContext`. It implements a count sketch to estimate the frequencies of the distinct items in the stream.
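A count sketch can be illustrated in plain Python, fed one batch at a time. This is a minimal sketch under assumptions, not the repository's implementation: the class name `CountSketch`, the `depth`, `width`, and `seed` parameters, the prime `P = 8191`, and the use of integer items are all hypothetical choices.

```python
import random
from statistics import median

P = 8191  # prime modulus for the hash family (assumed value)


def make_hash(rng, buckets):
    """Return h(x) = ((a*x + b) mod P) mod buckets with random a and b."""
    a = rng.randint(1, P - 1)
    b = rng.randint(0, P - 1)
    return lambda x: ((a * x + b) % P) % buckets


class CountSketch:
    """depth x width table of signed counters for integer items."""

    def __init__(self, depth, width, seed=0):
        rng = random.Random(seed)
        self.table = [[0] * width for _ in range(depth)]
        self.bucket_hashes = [make_hash(rng, width) for _ in range(depth)]
        self.sign_hashes = [make_hash(rng, 2) for _ in range(depth)]

    def _sign(self, row, item):
        # Map the 0/1 hash value to a -1/+1 sign.
        return 2 * self.sign_hashes[row](item) - 1

    def update(self, batch):
        """Add every item of one batch to one counter per row."""
        for item in batch:
            for row in range(len(self.table)):
                col = self.bucket_hashes[row](item)
                self.table[row][col] += self._sign(row, item)

    def estimate(self, item):
        """Median over rows of the signed counter that the item maps to."""
        return median(
            self._sign(row, item) * self.table[row][self.bucket_hashes[row](item)]
            for row in range(len(self.table))
        )
```

Each incoming batch is passed to `update`, and `estimate` can be queried at any point. The estimate is exact when the item collides with nothing else and otherwise approximate, with accuracy depending on `depth` and `width`.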

## Keywords
Big Data computing

## Software and Tools
- Apache Spark
- Python (PySpark)
- CloudVeneto (OpenStack-based cloud)
- Spark Streaming (`StreamingContext`)

## Note
The data used in this project is private and cannot be uploaded.

## Files:
Code:
1. Part 1: `counting_triangles.py`
2. Part 2: `counting_triangles_cloudveneto.py`
3. Part 3: `streaming.py`