https://github.com/dimits-ts/large-scale-data
Distributed computing for data science tasks, executed on an Ubuntu server.
cassandra kafka map-reduce spark vagrant
- Host: GitHub
- URL: https://github.com/dimits-ts/large-scale-data
- Owner: dimits-ts
- Created: 2024-02-11T17:07:08.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2024-03-26T16:51:11.000Z (over 1 year ago)
- Last Synced: 2025-01-21T02:25:49.966Z (6 months ago)
- Topics: cassandra, kafka, map-reduce, spark, vagrant
- Language: Java
- Homepage:
- Size: 27.9 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Large Scale Data Management
This repository consists of two projects:
* A Java [Hadoop MapReduce application](https://github.com/dimits-exe/large-scale-data/tree/master/map-reduce) which:
  - Counts the occurrences of each word in a large file
  - Computes Spotify song statistics for each country and month
* A [Hadoop Spark-Cassandra application](https://github.com/dimits-exe/large-scale-data/tree/master/project-2) which:
  - Generates a configurable stream of test data and posts it to a Kafka cluster
  - Reads, preprocesses and combines the stream data with static data using Spark
  - Periodically writes the combined data to a Cassandra cluster
  - Performs queries using CQL on the Cassandra cluster
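
To illustrate the word-count task from the first project, the following is a minimal sketch of the map and reduce phases in plain Java. It is not the repository's code and does not use the Hadoop API: the class name `WordCountSketch`, its method names, and the tokenization rule (lowercasing and splitting on non-word characters) are all assumptions made for illustration. In a real Hadoop job the framework would distribute the map calls across nodes and perform the grouping step itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical class; names and tokenization are assumptions, not the repository's code.
final class WordCountSketch {

    // "Map" phase: emit a (word, 1) pair for every token in one input line.
    // Assumption: words are lowercased and split on non-word (ASCII) characters.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token, 1));
            }
        }
        return pairs;
    }

    // "Shuffle" and "reduce" phases: group the pairs by word and sum their counts.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    // Runs both phases sequentially over all lines of the input.
    static Map<String, Integer> countWords(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        return reduce(pairs);
    }

    public static void main(String[] args) {
        List<String> lines = List.of("to be or not to be", "that is the question");
        System.out.println(countWords(lines));
    }
}
```

Keeping the map step free of shared state is what would let a framework such as Hadoop run it in parallel over file splits; only the reduce step needs to see all pairs for a given word.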