https://github.com/derak-isaack/nyc-taxi-analytics

Data engineering ETL project using OLAP databases and DBT to perform analysis on NYC taxi data.

data-pipeline dbt duckdb etl olap-database powerbi prefect python3 sql


## NEW YORK CITY TLC TAXI DATA PIPELINE

![dbt](https://img.shields.io/badge/dbt-FF694B?logo=dbt&logoColor=fff&style=for-the-badge)
![Duckdb](https://img.shields.io/badge/DuckDB-FFF000?logo=duckdb&logoColor=000&style=for-the-badge)
![Apache-parquet](https://img.shields.io/badge/Apache%20Parquet-50ABF1?logo=apacheparquet&logoColor=fff&style=for-the-badge)

### Project Overview

This is an ETL (`Extract-Transform-Load`) data pipeline that uses `DuckDB` for extraction and loading and `DBT` for transformation. The data comes from the [NYC-TLC-website](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page#:~:text=Yellow%20and%20green%20taxi%20trip,and%20driver%2Dreported%20passenger%20counts.) and covers the month of May 2024. The column descriptions can be found [here](data_dictionary_trip_records_green.pdf).

### Objectives

The objective is to build the following transformation models for further analysis.

1. `Route traffic model` using the `Pick-Up` & `Drop-off` locations.

2. `Hourly daily Server outage model` using the `Drop-Off` location. It looks for drop-off locations that are prone to server outages when sending trip details to the server. These trips are marked as `N`. The analysis shows which hours of the day are most affected by severe server outages and might need further action.

3. `Tip amount model`. Analyze the tips by different customers to different vendors.

4. `Daily-hourly traffic model`. Analyzes passenger counts for each of the 24 hours of every day for further passenger trend analysis.

5. `Pick-up trend model` for assessing passenger counts in various pick-up locations.
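To make the fourth objective concrete, the aggregation behind a daily-hourly traffic model can be sketched in plain Python. This is only an illustration of the grouping logic, not the project's dbt model, which is written in SQL. The column names `lpep_pickup_datetime` and `passenger_count` are assumed to follow the green taxi data dictionary, and the sample rows are made up for the example.

```python
from collections import defaultdict
from datetime import datetime


def hourly_passenger_counts(trips):
    """Sum passenger counts per (pickup date, pickup hour) bucket."""
    totals = defaultdict(int)
    for trip in trips:
        picked_up = datetime.fromisoformat(trip["lpep_pickup_datetime"])
        # Assumption of this sketch: a missing passenger count is treated as zero.
        passengers = trip.get("passenger_count") or 0
        bucket = (picked_up.date().isoformat(), picked_up.hour)
        totals[bucket] += int(passengers)
    return dict(totals)


# Made-up sample rows, only for illustration.
sample_trips = [
    {"lpep_pickup_datetime": "2024-05-01 08:15:00", "passenger_count": 1},
    {"lpep_pickup_datetime": "2024-05-01 08:40:00", "passenger_count": 2},
    {"lpep_pickup_datetime": "2024-05-01 17:05:00", "passenger_count": 1},
    {"lpep_pickup_datetime": "2024-05-02 08:00:00", "passenger_count": None},
]

hourly_totals = hourly_passenger_counts(sample_trips)
```

In SQL the same idea is a `GROUP BY` on the pickup date and the pickup hour.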

### Data Extraction & Loading

For data extraction, [DuckDB](https://duckdb.org/docs/data/parquet/overview) has extensive options for performing `data extraction` explicitly. When exploring the structure and format of the data, it is important to call `fetch_df()` on the `SQL` queries so that the results come back as a DataFrame. What the final transformation models look like can be seen [here](taxi.ipynb).
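A minimal sketch of that inspection step could look like the following. The helper names `describe_query` and `inspect_schema` are hypothetical and not part of the project, and the Parquet path is whatever local file was downloaded from the TLC website. `fetch_df()` returns a pandas DataFrame, so both `duckdb` and `pandas` are assumed to be installed.

```python
def describe_query(parquet_path: str) -> str:
    """Build a DESCRIBE statement for a Parquet file read by DuckDB."""
    # Double any single quotes so the path is a valid SQL string literal.
    escaped = parquet_path.replace("'", "''")
    return f"DESCRIBE SELECT * FROM read_parquet('{escaped}')"


def inspect_schema(parquet_path: str):
    """Return the column names and types of a Parquet file as a DataFrame."""
    # Imported here so the query helper above can be used without DuckDB installed.
    import duckdb

    con = duckdb.connect()  # in-memory database, nothing is written to disk
    try:
        return con.execute(describe_query(parquet_path)).fetch_df()
    finally:
        con.close()
```

Calling `inspect_schema("green_tripdata_2024-05.parquet")`, with the file name adjusted to the local download, would list each column with its DuckDB type.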

[DuckDB](https://duckdb.org/docs/installation/index?version=stable&environment=cli&platform=win&download_method=package_manager) is also used for loading the transformed data in table format, as defined in the `transformation models` using the `{{config(materialized='table')}}` configuration.

### Data Transformation

For the transformation, [DBT](https://docs.getdbt.com/docs/introduction) (Data Build Tool) handles the transformation logic using plain `SQL` syntax. The [transformation-models](TLC_NYC/models) are all chained to the first model for further analysis of the data.

To initialize a `DBT project` together with the `DuckDB OLAP database`, run the following commands in order.

* `pip install dbt-duckdb`

* `dbt init`

* `dbt debug` to test that everything is working fine before proceeding.

* `dbt run` after defining the transformation models. In case of any error, the `logs` should be checked.
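Once the project has been initialized, the last two steps can also be scripted. The sketch below is one possible way to do it and is not part of the repository: it assumes the `dbt` executable is on the `PATH`, and the function name `run_dbt_steps` and its `runner` parameter are hypothetical. `dbt init` is left out because it prompts for input interactively.

```python
import subprocess

# The non-interactive steps from the list above, in the order they should run.
DBT_STEPS = [["dbt", "debug"], ["dbt", "run"]]


def run_dbt_steps(project_dir, steps=DBT_STEPS, runner=subprocess.run):
    """Run each dbt command inside project_dir and stop at the first failure."""
    for command in steps:
        # check=True raises CalledProcessError when dbt exits with a non-zero code.
        runner(command, cwd=project_dir, check=True)
    return [" ".join(command) for command in steps]
```

For example, `run_dbt_steps("TLC_NYC")` would run both commands, assuming `TLC_NYC` is the dbt project directory. The `runner` argument exists only so the function can be exercised without dbt installed.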

When the dbt models build successfully, the following should be printed on the terminal:

![dbt-final-screenshot]()

### Dashboard

The transformation models are then visualized in `Power BI`, which offers quick, interactive charts showing the key `KPIs`.

![Dashboard]()

### Prefect Integration

The `Prefect-dbt-flow` [library](https://github.com/datarootsio/prefect-dbt-flow) offers a quick, simple integration for orchestration and pipeline monitoring.

![Orchestration-prefect](prefect-flow.png)