Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Data modeling with Apache Cassandra
- Host: GitHub
- URL: https://github.com/cschan1828/data-modeling-with-apache-cassandra
- Owner: cschan1828
- Created: 2020-02-07T15:59:54.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2020-02-19T16:11:11.000Z (almost 5 years ago)
- Last Synced: 2024-11-29T02:41:03.084Z (about 2 months ago)
- Topics: cassandra, data-engineering, etl-pipeline
- Language: HTML
- Homepage:
- Size: 844 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Data Modeling with Apache Cassandra
## Introduction
A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. The analytics team is particularly interested in understanding what songs users are listening to. Currently, they don't have an easy way to query their data, which resides in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app. Consequently, a data engineer is needed to create an Apache Cassandra database that can answer queries on song play data and surface business value. My role is to design a database schema and an ETL pipeline for this analysis.
## Dataset
The dataset is a subset of real data from the [Million Song Dataset](https://labrosa.ee.columbia.edu/millionsong/). Each file is in CSV format and contains metadata about user activity events. The files are partitioned by date.

## Model an Apache Cassandra database
In this project, an Apache Cassandra database is adopted. For such a database, the data must be denormalized for fast reads and modeled around the queries in advance. That is, our strategy is one table per query.

## Relevant Files
* Cassandra_ETL.ipynb: Reads the CSV files, processes them, and loads the data into the database.
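The one-table-per-query strategy can be sketched by deriving each table's primary key directly from the columns a query filters and sorts on. A minimal illustration in Python, assuming hypothetical table and column names (the project's actual queries and schemas may differ, and column types are simplified to `text`):

```python
# Sketch of one-table-per-query: each table's primary key mirrors the
# filter (partition key) and ordering (clustering) columns of one query.
# Table and column names below are hypothetical, not taken from the project.

def table_for_query(name, columns, partition_key, clustering=()):
    """Build a CREATE TABLE statement whose primary key serves one query."""
    pk = "(" + ", ".join(partition_key) + ")"
    key = ", ".join([pk, *clustering]) if clustering else pk
    cols = ", ".join(f"{c} text" for c in columns)  # types simplified
    return f"CREATE TABLE IF NOT EXISTS {name} ({cols}, PRIMARY KEY ({key}))"

# Query 1: fetch artist and song by session_id and item_in_session.
ddl1 = table_for_query(
    "song_by_session",
    ["session_id", "item_in_session", "artist", "song"],
    partition_key=["session_id"],
    clustering=["item_in_session"],
)

# Query 2: fetch songs a given user played, ordered within each session.
ddl2 = table_for_query(
    "song_by_user_session",
    ["user_id", "session_id", "item_in_session", "song"],
    partition_key=["user_id", "session_id"],
    clustering=["item_in_session"],
)
print(ddl1)
```

Duplicating the same source rows into several such tables is deliberate: in Cassandra, write-time denormalization is traded for fast, single-table reads.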
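The notebook's read-process-load flow can be sketched roughly as follows. This is a simplified assumption, not the project's actual schema: a throwaway sample file stands in for the real date-partitioned dataset, and the final step in the actual notebook would issue an INSERT per tuple through the Cassandra driver rather than collect the rows in a list:

```python
# Minimal ETL sketch: gather date-partitioned CSV event files, drop rows
# without an artist, and shape each row into the tuple an INSERT would take.
# Column layout and file naming here are illustrative assumptions.
import csv, glob, os, tempfile

# Stand-in for the directory of date-partitioned event CSV files.
tmpdir = tempfile.mkdtemp()
sample = os.path.join(tmpdir, "2018-11-01-events.csv")
with open(sample, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["artist", "song", "session_id", "item_in_session"])
    writer.writerow(["Muse", "Starlight", "23", "0"])
    writer.writerow(["", "", "23", "1"])  # no artist: filtered out below

rows = []
for path in sorted(glob.glob(os.path.join(tmpdir, "*-events.csv"))):
    with open(path, newline="") as f:
        for line in csv.DictReader(f):
            if not line["artist"]:      # skip rows without song-play data
                continue
            rows.append((line["artist"], line["song"],
                         int(line["session_id"]), int(line["item_in_session"])))

print(rows)  # → [('Muse', 'Starlight', 23, 0)]
```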