Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/dimits-ts/large-scale-data
Distributed computing for data science tasks, executed on an Ubuntu server.
cassandra kafka map-reduce spark vagrant
- Host: GitHub
- URL: https://github.com/dimits-ts/large-scale-data
- Owner: dimits-ts
- Created: 2024-02-11T17:07:08.000Z (9 months ago)
- Default Branch: master
- Last Pushed: 2024-03-26T16:51:11.000Z (8 months ago)
- Last Synced: 2024-09-30T04:02:58.889Z (about 1 month ago)
- Topics: cassandra, kafka, map-reduce, spark, vagrant
- Language: Java
- Homepage:
- Size: 27.9 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Large Scale Data Management
This repository consists of two projects:
* A Java [Hadoop map-reduce application](https://github.com/dimits-exe/large-scale-data/tree/master/map-reduce) which:
  - Counts the occurrences of each word in a large file
  - Computes Spotify song statistics for each country and month
* A [Hadoop Spark-Cassandra application](https://github.com/dimits-exe/large-scale-data/tree/master/project-2) which:
  - Generates a configurable stream of test data and posts it to a Kafka cluster
  - Reads, preprocesses and combines the stream data with static data using Spark
  - Periodically writes the combined data to a Cassandra cluster
  - Queries the Cassandra cluster using CQL
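
To illustrate the word-count task in the first project, here is a minimal in-memory sketch of the map, shuffle and reduce phases in plain Java. It is not the repository's code and does not use the Hadoop API; the class name `WordCountSketch`, its method names, and the tokenization rule (lowercasing, splitting on non-word characters) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: mimics the phases of a map-reduce
// word count in memory, without any Hadoop dependency.
public class WordCountSketch {

    // Map phase: split one line into lowercase words and emit (word, 1) for each.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String token : line.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                emitted.add(Map.entry(token, 1));
            }
        }
        return emitted;
    }

    // Shuffle phase: group the emitted counts by word (sorted by key).
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum all counts collected for a single word.
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Runs the three phases over all input lines.
    static Map<String, Integer> countWords(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffle(emitted).forEach((word, counts) -> result.put(word, reduce(counts)));
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("the quick brown fox", "The lazy dog and the fox");
        // prints {and=1, brown=1, dog=1, fox=2, lazy=1, quick=1, the=3}
        System.out.println(countWords(lines));
    }
}
```

In a real Hadoop job the map and reduce steps would run on separate nodes and the framework would perform the shuffle; the sketch only shows how the three phases fit together.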