https://github.com/longnguyen010203/data-warehouse-accident-us-2016-2023
Design and implement a data warehouse to manage automobile accident cases across 49 US states, using Snowflake as the warehouse platform and a star schema as the data model.
apache-airflow apache-spark data-ingestion data-processing data-quality-checks data-transformation data-warehouse dbt decorators-python dimensions docker docker-compose dockerfile fastapi minio powerbi pyspark snowflake star-schema
- Host: GitHub
- URL: https://github.com/longnguyen010203/data-warehouse-accident-us-2016-2023
- Owner: longNguyen010203
- Created: 2024-12-13T15:04:19.000Z (about 2 months ago)
- Default Branch: main
- Last Pushed: 2024-12-14T04:59:34.000Z (about 2 months ago)
- Last Synced: 2024-12-17T13:10:37.754Z (about 2 months ago)
- Topics: apache-airflow, apache-spark, data-ingestion, data-processing, data-quality-checks, data-transformation, data-warehouse, dbt, decorators-python, dimensions, docker, docker-compose, dockerfile, fastapi, minio, powerbi, pyspark, snowflake, star-schema
- Language: Jupyter Notebook
- Homepage:
- Size: 412 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# 📊 DATA-WAREHOUSE-ACCIDENT-US-2016-2023
Design and implement a data warehouse to manage automobile accident cases across 49 US states, using Snowflake as the warehouse platform and a star schema as the data model.
## 🚀 About Project
- **Data Source**: This project uses two datasets from [Kaggle](https://www.kaggle.com/): [US Accidents (2016 - 2023)](https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents) and [Traffic Accidents and Vehicles](https://www.kaggle.com/datasets/tsiaras/uk-road-safety-accidents-and-vehicles).
  - `US Accidents (2016 - 2023)`: A countrywide car accident dataset covering 49 US states. The accident data were collected from February 2016 to March 2023.
  - `Traffic Accidents and Vehicles`: Each row in the file represents the involvement of a unique vehicle in a unique traffic accident, with various vehicle and passenger properties as columns.
- **Extract Data**: Data is `extracted` from `csv` files, then `ingested` into the `bronze` folder of the `MinIO` data lake using `Python` and `Airflow`.
- **Transform Data**: Data is read from the `bronze` folder in `MinIO` using `Spark` and `FastAPI` for `transformation` and `cleaning`, and the output is `loaded` into the `silver` folder in `MinIO`.
- **Load Data**: Once the data has been cleaned, it is loaded into the `Staging` schema of the `Snowflake` data `warehouse` using `Python` and `Airflow`.
- **Warehouse**: With the data in the `staging` schema in `Snowflake`, the `data warehouse` is built and deployed with a `Star Schema` architecture by creating `dimension` and `fact` tables. `dbt` is used to `transform` and `check data`.
- **Serving**: Analyze the data to improve road safety: identify high-risk accident areas where preventative measures can be implemented, and identify factors that contribute to accidents (weather, road conditions, human error). Then visualize the results and create reports with `Power BI`.
- **Package and Orchestration**: Components are packaged using `Docker` and orchestrated using `Apache Airflow`.
## 📦 Technologies
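As a rough illustration of how the `bronze` and `silver` stages described under About Project relate, here is a minimal sketch of a cleaning step. It is not the project's code: the project uses `Spark` for this step, while this sketch uses only the Python standard library as a stand-in. The key layout, the helper names `silver_key` and `clean_rows`, the sample column names, and the cleaning rules themselves (trim, drop incomplete rows, de-duplicate on `ID`) are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical layer prefixes; the README only names the `bronze` and `silver` folders.
BRONZE_PREFIX = "bronze"
SILVER_PREFIX = "silver"


def silver_key(bronze_key: str) -> str:
    """Map an object key under the bronze folder to the same path under silver."""
    if not bronze_key.startswith(BRONZE_PREFIX + "/"):
        raise ValueError(f"not a bronze key: {bronze_key}")
    return SILVER_PREFIX + bronze_key[len(BRONZE_PREFIX):]


def clean_rows(raw_csv: str, required: tuple) -> list:
    """Trim whitespace, drop rows missing a required field, de-duplicate on ID."""
    seen_ids = set()
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        # Normalize every value; short rows yield None, which becomes "".
        row = {k: (v or "").strip() for k, v in row.items() if k is not None}
        if any(not row.get(col) for col in required):
            continue  # incomplete record
        if row["ID"] in seen_ids:
            continue  # duplicate record
        seen_ids.add(row["ID"])
        cleaned.append(row)
    return cleaned


# Made-up sample data with assumed column names.
sample = """ID,State,Severity
A-1, OH ,3
A-1,OH,3
A-2,,2
A-3,CA,1
"""

rows = clean_rows(sample, required=("ID", "State", "Severity"))
print(silver_key("bronze/us_accidents.csv"))
print([r["ID"] for r in rows])
```

In a real deployment the input would be read from `MinIO` and the cleaned output written back under the `silver` key, with `Airflow` scheduling the step.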
- `Apache Airflow`
- `Apache Spark`
- `Docker`
- `dbt`
- `Snowflake`
- `MinIO`
- `FastAPI`
- `Power BI`
## 🔦 Star Schema Diagram
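To make the dimension/fact split from the Warehouse step concrete, here is a small sketch that turns flat staging rows into one dimension table and one fact table. This is only an illustration in plain Python: the project builds its tables with `dbt` on `Snowflake`, and the table names (`dim_location`, `fact_accident`), the column names, and the surrogate-key scheme here are hypothetical, not taken from the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """Natural key of the hypothetical location dimension."""
    state: str
    city: str


def build_star(staging_rows):
    """Split flat staging rows into a location dimension and an accident fact table."""
    dim_location = {}   # Location -> surrogate key
    fact_accident = []  # one row per accident, pointing at the dimension
    for row in staging_rows:
        loc = Location(row["state"], row["city"])
        # Assign the next surrogate key the first time a location is seen.
        location_key = dim_location.setdefault(loc, len(dim_location) + 1)
        fact_accident.append({
            "accident_id": row["id"],
            "location_key": location_key,
            "severity": int(row["severity"]),
        })
    return dim_location, fact_accident


# Made-up staging rows for illustration.
staging = [
    {"id": "A-1", "state": "OH", "city": "Dayton", "severity": "3"},
    {"id": "A-2", "state": "OH", "city": "Dayton", "severity": "2"},
    {"id": "A-3", "state": "CA", "city": "Fresno", "severity": "1"},
]

dim_location, fact_accident = build_star(staging)
print(dim_location)
print(fact_accident)
```

The two Dayton accidents share one dimension row, so the fact table stores only a small integer key per accident instead of repeating the state and city.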