{"id":15056785,"url":"https://github.com/maxinexiong/data-modelling-with-apache-cassandra","last_synced_at":"2026-01-04T21:02:13.018Z","repository":{"id":256316523,"uuid":"854709640","full_name":"MaxineXiong/Data-Modelling-with-Apache-Cassandra","owner":"MaxineXiong","description":"This project implemented Apache Cassandra data modelling to support Sparkify's analysis of user activity and song play data. It involved consolidating partitioned files into a single CSV, designing and creating tables based on specific queries from Sparkify’s analytics team, and inserting the data from the CSV into the tables using CQL commands.","archived":false,"fork":false,"pushed_at":"2024-09-10T04:12:20.000Z","size":301,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-18T08:21:23.210Z","etag":null,"topics":["apache-cassandra","cql","data-engineering","data-modeling","etl","etl-pipeline","nosql","nosql-database","nosql-query","python"],"latest_commit_sha":null,"homepage":"","language":"Jupyter 
Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MaxineXiong.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-09-09T16:37:46.000Z","updated_at":"2024-12-31T07:23:21.000Z","dependencies_parsed_at":"2024-09-10T05:25:49.882Z","dependency_job_id":"5aec0dc4-08af-46a1-8f84-4829cfbd085b","html_url":"https://github.com/MaxineXiong/Data-Modelling-with-Apache-Cassandra","commit_stats":null,"previous_names":["maxinexiong/data-modelling-with-apache-cassandra"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MaxineXiong%2FData-Modelling-with-Apache-Cassandra","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MaxineXiong%2FData-Modelling-with-Apache-Cassandra/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MaxineXiong%2FData-Modelling-with-Apache-Cassandra/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MaxineXiong%2FData-Modelling-with-Apache-Cassandra/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MaxineXiong","download_url":"https://codeload.github.com/MaxineXiong/Data-Modelling-with-Apache-Cassandra/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254567252,"owners_count":22092738,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-cassandra","cql","data-engineering","data-modeling","etl","etl-pipeline","nosql","nosql-database","nosql-query","python"],"created_at":"2024-09-24T21:56:29.261Z","updated_at":"2026-01-04T21:02:12.931Z","avatar_url":"https://github.com/MaxineXiong.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data Modelling with Apache Cassandra\n[![GitHub](https://badgen.net/badge/icon/GitHub?icon=github\u0026color=black\u0026label)](https://github.com/MaxineXiong)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![Made with Python](https://img.shields.io/badge/Python-\u003e=3.6-blue?logo=python\u0026logoColor=white)](https://www.python.org)\n[![Apache Cassandra](https://img.shields.io/badge/Apache_Cassandra-1287B1?logo=Apache+Cassandra\u0026logoColor=white)](https://cassandra.apache.org/)\n\n\u003cbr\u003e\n\n## Project Description\n\nThis project focuses on building an Apache Cassandra database for *Sparkify*, a startup that offers music streaming services. Sparkify collects large amounts of user activity and song data, and the analytics team wants to query this data to better understand user behaviour, specifically around song preferences. The goal is to create an ETL pipeline to preprocess raw event data stored in multiple CSV files, consolidate it, and design and implement a Cassandra data model to support queries on the song play data.\n\n\u003cbr\u003e\n\n## Project Data\n\nThe dataset contains user activity data from *Sparkify*'s app, partitioned by date in the `event_data` folder. Each CSV file represents a day's worth of events. 
For instance:\n\n```\nevent_data/2018-11-08-events.csv\nevent_data/2018-11-09-events.csv\n```\n\nThese files include information on song titles, user details, and session data, as shown in the image below.\n\n![image](https://github.com/user-attachments/assets/5a2cd3d4-b1d1-4052-837b-8e7ade421e31)\n\n\n\u003cbr\u003e\n\n## Repository Structure\n\nThe repository is organized as follows:\n\n```\nData_Modelling_with_Apache_Cassandra/\n├── Project_Data_Modelling_with_Apache_Cassandra.ipynb\n├── event_data/\n├── .gitignore\n├── README.md\n└── LICENSE\n```\n\n- **Project_Data_Modelling_with_Apache_Cassandra.ipynb**: Jupyter notebook containing the ETL pipeline code for pre-processing data and modelling it in Apache Cassandra.\n- **event_data/**: Directory containing the original CSV files partitioned by date.\n- **.gitignore**: Specifies files and directories that Git should ignore (e.g., system files, large data files).\n- **README.md**: Provides an overview of the project.\n- **LICENSE**: The license governing the usage of this project.\n\n\u003cbr\u003e\n\n## Usage\n\n1. **Pre-requisites**:\n    - Python 3.7 or higher\n    - Apache Cassandra installed and running\n    - Jupyter Notebook (optional for running the `.ipynb` file)\n2. **Steps**:\n    - Run the Jupyter notebook `Project_Data_Modelling_with_Apache_Cassandra.ipynb`.\n    - The notebook pre-processes the data by consolidating the partitioned files into a single streamlined CSV. It then creates tables in Apache Cassandra, each designed around a specific query from *Sparkify*’s analytics team, and finally inserts data from the CSV into those tables using CQL commands.\n    - You can modify the queries or data model to suit your needs.\n\n\u003cbr\u003e\n\n## Contribution\n\nContributions to improve the project are welcome. 
Please open an issue or submit a pull request with your suggestions or bug fixes.\n\n\u003cbr\u003e\n\n## License\n\nThis project is licensed under the [MIT License](https://choosealicense.com/licenses/mit/). Feel free to use, modify, and distribute the project in accordance with the terms of the license.\n\n\u003cbr\u003e\n\n## Acknowledgement\n\nThis project was completed as part of the [Data Engineering Nanodegree at Udacity](https://www.udacity.com/course/data-engineer-nanodegree--nd027?promo=labor\u0026coupon=LABOR40\u0026utm_source=gsem_brand\u0026utm_medium=ads_r\u0026utm_campaign=19692269004_c_individuals\u0026utm_term=151372113572\u0026utm_keyword=udacity%20data%20engineering_e\u0026utm_source=gsem_brand\u0026utm_medium=ads_r\u0026utm_campaign=19692269004_c_individuals\u0026utm_term=151372113572\u0026utm_keyword=udacity%20data%20engineering_e\u0026gad_source=1\u0026gclid=CjwKCAjwufq2BhAmEiwAnZqw8q11WJ-KNhO-d1bBQodev0p2b9gtBIIlBp0_jZotggKBM-bj36SE3hoC968QAvD_BwE). Special thanks to [Udacity](https://www.udacity.com/) for providing the datasets and project specifications.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaxinexiong%2Fdata-modelling-with-apache-cassandra","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmaxinexiong%2Fdata-modelling-with-apache-cassandra","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaxinexiong%2Fdata-modelling-with-apache-cassandra/lists"}