https://github.com/codeslash21/data-modeling-with-apache-cassnadra
Data modeling with Apache Cassandra
cassandra datamodeling jupyter-notebook python3
- Host: GitHub
- URL: https://github.com/codeslash21/data-modeling-with-apache-cassnadra
- Owner: codeslash21
- Created: 2024-03-04T10:21:54.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2024-03-05T07:35:11.000Z (9 months ago)
- Last Synced: 2024-09-30T04:02:41.149Z (about 2 months ago)
- Topics: cassandra, datamodeling, jupyter-notebook, python3
- Language: HTML
- Homepage:
- Size: 861 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Overview
Built an ETL (Extract, Transform, Load) pipeline that combines a set of CSV files in a directory into a single streamlined CSV file, which is then used to model and insert data into Apache Cassandra tables. Created several Cassandra data models, each optimized for one of the queries we needed to execute.

## Files
- **`event_data`** This folder contains all the `csv` files that serve as the source of the ETL pipeline. All the data was provided by Udacity.
- `event_datafile_new.csv` This is the output file of the ETL pipeline. It is used to model and insert data into Apache Cassandra tables.
- `Project_1B_ Project_Template.ipynb` contains the code for the ETL pipeline and the Cassandra data modeling.

## Tools and Packages
### Tools
- Apache Cassandra
- CQL
- Python
- Jupyter Notebook

### Python Packages
- cassandra
- os
- glob
- csv
- prettytable

## Project Steps
- **ETL:** Build an ETL pipeline that transforms data from a set of CSV files within a directory into a single streamlined CSV file that can be used to model and insert data into Apache Cassandra tables.
- **Data Modeling:** Based on the ETL output file and the queries we need to run, design Apache Cassandra data models that are optimized for those queries and return the expected output. The most important decision is choosing the `Partition Key` and `Clustering Columns` properly, so that data is evenly distributed and the right rows are fetched for each query.
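
The ETL step can be illustrated with a minimal sketch that uses only the `os`, `glob`, and `csv` packages listed above. This is not the notebook's actual code: the function name `combine_csv_files` and the cleaning rule (skipping rows that are entirely blank) are assumptions made for illustration, and the real pipeline may select or filter columns differently.

```python
import csv
import glob
import os


def combine_csv_files(source_dir, output_path):
    """Merge every CSV in source_dir into one file with a single header row.

    Hypothetical helper: returns the number of data rows written.
    """
    # Sort the paths so the output order is deterministic, and make sure
    # the output file is never read back in as an input.
    paths = [
        p for p in sorted(glob.glob(os.path.join(source_dir, "*.csv")))
        if os.path.abspath(p) != os.path.abspath(output_path)
    ]
    header = None
    rows_written = 0
    with open(output_path, "w", newline="", encoding="utf-8") as out_file:
        writer = csv.writer(out_file)
        for path in paths:
            with open(path, newline="", encoding="utf-8") as in_file:
                reader = csv.reader(in_file)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # the file is empty
                if header is None:
                    # Write the header once, taken from the first file.
                    header = file_header
                    writer.writerow(header)
                for row in reader:
                    # Assumed cleaning rule: skip rows that are entirely blank.
                    if not any(field.strip() for field in row):
                        continue
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

With the file layout described in this README, a call such as `combine_csv_files("event_data", "event_datafile_new.csv")` would produce the combined output file.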
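
To make the role of the `Partition Key` and `Clustering Columns` concrete, here is a hedged sketch of a query-first table design. The table `events_by_session` and its columns are hypothetical and are not taken from the notebook. The `session` argument is assumed to be a driver session already bound to a keyspace, for example one obtained from `cassandra.cluster.Cluster(...).connect()`; any object with a compatible `execute(query, parameters)` method will work.

```python
# Hypothetical table and column names, chosen only to illustrate key design.
# session_id is the partition key: it decides which node stores the row.
# item_in_session is a clustering column: it orders rows inside a partition.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS events_by_session (
    session_id int,
    item_in_session int,
    payload text,
    PRIMARY KEY ((session_id), item_in_session)
)
"""

INSERT_ROW = (
    "INSERT INTO events_by_session (session_id, item_in_session, payload) "
    "VALUES (%s, %s, %s)"
)

# The WHERE clause filters on the partition key first, then on the
# clustering column, which is the access pattern the table was built for.
SELECT_ROW = (
    "SELECT payload FROM events_by_session "
    "WHERE session_id = %s AND item_in_session = %s"
)


def load_and_query(session, rows, session_id, item_in_session):
    """Create the table, insert the rows, and run the query it was designed for.

    `rows` is an iterable of (session_id, item_in_session, payload) tuples.
    """
    session.execute(CREATE_TABLE)
    for row in rows:
        session.execute(INSERT_ROW, row)
    return session.execute(SELECT_ROW, (session_id, item_in_session))
```

The design choice here is one table per query: a different query, such as a lookup by user, would usually get its own table with its own primary key rather than a secondary filter on this one.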