https://github.com/derak-isaack/nyc-taxi-analytics
Data engineering ETL project using OLAP databases and DBT to perform analysis on NYC taxi data.
- Host: GitHub
- URL: https://github.com/derak-isaack/nyc-taxi-analytics
- Owner: derak-isaack
- Created: 2024-06-06T08:01:42.000Z (about 2 years ago)
- Default Branch: master
- Last Pushed: 2024-07-30T19:33:19.000Z (almost 2 years ago)
- Last Synced: 2025-11-08T05:03:06.268Z (7 months ago)
- Topics: data-pipeline, dbt, duckdb, etl, olap-database, powerbi, prefect, python3, sql
- Language: Jupyter Notebook
- Homepage:
- Size: 1.06 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: Readme.md
README
## NEW YORK CITY TLC TAXI DATA PIPELINE



### Project Overview
This is an ETL (`Extract-Transform-Load`) data pipeline that uses `DuckDB` for extraction and loading and `DBT` for transformation. The data comes from the [NYC-TLC-website](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page#:~:text=Yellow%20and%20green%20taxi%20trip,and%20driver%2Dreported%20passenger%20counts.) and covers the month of May 2024. The column descriptions can be found [here](data_dictionary_trip_records_green.pdf).
### Objectives
The objective is to build several transformation models for further analysis:
1. `Route traffic model`, using the `Pick-Up` & `Drop-off` locations.
2. `Hourly daily Server outage model`, using the `Drop-Off` location. It looks for `drop off` locations that are prone to server outages when sending trip details to the server (these trips are marked as `N`), to show which hours of the day are most affected by severe outages and might need further action.
3. `Tip amount model`, to analyze the tips given by different customers to different vendors.
4. `Daily-hourly traffic model`, to analyze passenger counts for each of the 24 hours of every day, for passenger trend analysis.
5. `Pick-up trend model`, for assessing passenger counts at the various pick-up locations.
### Data Extraction & Loading
For the data extraction, [DuckDB](https://duckdb.org/docs/data/parquet/overview) can read Parquet files directly from `SQL`. Calling `fetch_df()` on a query result returns it as a DataFrame, which is useful for checking the structure and format of the data. A preview of what the final transformation models look like can be found [here](taxi.ipynb).
[DuckDB](https://duckdb.org/docs/installation/index?version=stable&environment=cli&platform=win&download_method=package_manager) is also used for loading: the transformed data is stored in table format, as defined in the `transformation models` with the `{{config(materialized='table')}}` configuration.
### Data Transformation
For the transformation, [DBT](https://docs.getdbt.com/docs/introduction) (Data Build Tool) handles the transformation logic using standard `SQL` syntax. The [transformation-models](TLC_NYC/models) are all chained to the first model for further analysis of the data.
To initialize a `DBT project` with a `DuckDB` OLAP database, run the following commands in order:
* `pip install dbt-duckdb`
* `dbt init`
* `dbt debug` to test that everything is working before proceeding.
* `dbt run` after defining the transformation models. In case of any error, check the `logs`.
For a successful dbt run, the following should be printed on the terminal:
![dbt-final-screenshot]()
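The repeatable part of the command list above (`dbt debug`, then `dbt run`) can also be scripted. The sketch below is one possible wrapper, not part of the project: `run_dbt_steps` and `DBT_STEPS` are hypothetical names, and it assumes the `dbt` executable is on the `PATH`. The one-time `pip install dbt-duckdb` and the interactive `dbt init` are left out on purpose.

```python
import subprocess

# The repeatable dbt steps, in the order they should run.
DBT_STEPS = [
    ["dbt", "debug"],  # check the profile and database connection first
    ["dbt", "run"],    # then build the transformation models
]


def run_dbt_steps(steps=DBT_STEPS, project_dir=".", runner=subprocess.run):
    """Run each dbt command in order and stop at the first failure.

    Returns the list of commands that completed successfully.
    `runner` is injectable so the function can be exercised without dbt.
    """
    completed = []
    for command in steps:
        result = runner(command, cwd=project_dir)
        if result.returncode != 0:
            print(f"{' '.join(command)} failed; check the dbt logs")
            break
        completed.append(command)
    return completed
```

Calling `run_dbt_steps(project_dir="TLC_NYC")` from the repository root would run both steps inside the dbt project folder, assuming that folder holds the dbt project.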
### Dashboard
The transformation models are then visualized in `Power BI`, which provides quick, interactive charts of the key `KPIs`.
![Dashboard]()
### Prefect Integration
The `prefect-dbt-flow` [library](https://github.com/datarootsio/prefect-dbt-flow) offers a quick, simple way to add orchestration and pipeline monitoring to the dbt project.
