Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Data modeling with Apache Cassandra
- Host: GitHub
- URL: https://github.com/cschan1828/data-modeling-with-apache-cassandra
- Owner: cschan1828
- Created: 2020-02-07T15:59:54.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2020-02-19T16:11:11.000Z (almost 5 years ago)
- Last Synced: 2024-11-29T02:41:03.084Z (about 2 months ago)
- Topics: cassandra, data-engineering, etl-pipeline
- Language: HTML
- Homepage:
- Size: 844 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Data Modeling with Apache Cassandra
## Introduction
A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. The analytics team is particularly interested in understanding what songs users are listening to. Currently, they don't have an easy way to query their data, which resides in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app. Consequently, a data engineer is needed to create an Apache Cassandra database that can answer queries on song play data and surface business value. My role is to design a database schema and an ETL pipeline for this analysis.
## Dataset
The dataset is a subset of real data from the [Million Song Dataset](https://labrosa.ee.columbia.edu/millionsong/). Each file is in CSV format and contains metadata about user activity events. The files are partitioned by date.

## Model an Apache Cassandra database
In this project, an Apache Cassandra database is adopted. For such a database, the data must be denormalized for fast reads and modeled around the queries in advance. That is, our strategy is one table per query.

## Relevant Files
* Cassandra_ETL.ipynb: Reads the CSV files, processes them, and loads the data into the database.
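The one-table-per-query strategy can be sketched by deriving each table's primary key directly from the columns a query filters and sorts on. A minimal illustration in Python, assuming hypothetical table and column names (the project's actual queries and schemas may differ, and column types are simplified to `text`):

```python
# Sketch of one-table-per-query: each table's primary key mirrors the
# filter (partition key) and ordering (clustering) columns of one query.
# Table and column names below are hypothetical, not taken from the project.

def table_for_query(name, columns, partition_key, clustering=()):
    """Build a CREATE TABLE statement whose primary key serves one query."""
    pk = "(" + ", ".join(partition_key) + ")"
    key = ", ".join([pk, *clustering]) if clustering else pk
    cols = ", ".join(f"{c} text" for c in columns)  # types simplified
    return f"CREATE TABLE IF NOT EXISTS {name} ({cols}, PRIMARY KEY ({key}))"

# Query 1: fetch artist and song by session_id and item_in_session.
ddl1 = table_for_query(
    "song_by_session",
    ["session_id", "item_in_session", "artist", "song"],
    partition_key=["session_id"],
    clustering=["item_in_session"],
)

# Query 2: fetch songs a given user played, ordered within each session.
ddl2 = table_for_query(
    "song_by_user_session",
    ["user_id", "session_id", "item_in_session", "song"],
    partition_key=["user_id", "session_id"],
    clustering=["item_in_session"],
)
print(ddl1)
```

Duplicating the same source rows into several such tables is deliberate: in Cassandra, write-time denormalization is traded for fast, single-table reads.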
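The notebook's read-process-load flow can be sketched roughly as follows. This is a simplified assumption, not the project's actual schema: a throwaway sample file stands in for the real date-partitioned dataset, and the final step in the actual notebook would issue an INSERT per tuple through the Cassandra driver rather than collect the rows in a list:

```python
# Minimal ETL sketch: gather date-partitioned CSV event files, drop rows
# without an artist, and shape each row into the tuple an INSERT would take.
# Column layout and file naming here are illustrative assumptions.
import csv, glob, os, tempfile

# Stand-in for the directory of date-partitioned event CSV files.
tmpdir = tempfile.mkdtemp()
sample = os.path.join(tmpdir, "2018-11-01-events.csv")
with open(sample, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["artist", "song", "session_id", "item_in_session"])
    writer.writerow(["Muse", "Starlight", "23", "0"])
    writer.writerow(["", "", "23", "1"])  # no artist: filtered out below

rows = []
for path in sorted(glob.glob(os.path.join(tmpdir, "*-events.csv"))):
    with open(path, newline="") as f:
        for line in csv.DictReader(f):
            if not line["artist"]:      # skip rows without song-play data
                continue
            rows.append((line["artist"], line["song"],
                         int(line["session_id"]), int(line["item_in_session"])))

print(rows)  # → [('Muse', 'Starlight', 23, 0)]
```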