
https://github.com/aimanamri/yellow-taxi-trips-etl-data-engineering-project



azure data-engineering etl-pipeline jupyter-notebook


# Yellow Taxi Trips Data Analytics | Data Engineering Azure Project




## Introduction
The "Yellow Taxi Trips Data Analytics" project extracts insights from New York City's yellow taxi trip records. I use Python, SQL, Azure services (Data Factory, Databricks, and Synapse Analytics), and Power BI to process, analyze, and visualize the data.

## Architecture

## Technologies Used
1. Python
2. SQL
3. Azure Data Factory
4. Azure Databricks
5. Azure Synapse Analytics
6. Power BI

## Dataset Used
1. Source: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
2. Data Dictionary: https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf

The data is published as one file per month for each year, so I wrote a simple Python script to download all the Parquet files and combine them by year. The combined dataset is stored in `.parquet.gzip` format to keep storage costs down. Because the full files were too large to store on GitHub (without Git LFS), I reduced their size for this side project by filtering the rows and keeping the result in CSV/Parquet format: `20,000` randomly selected rows from each month are used.
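
The download-and-combine step described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the script used in this repository: the `BASE_URL` pattern, the function names `monthly_urls` and `build_year_sample`, the random seed, and the output file name are all hypothetical, and it assumes `pandas` with a Parquet engine such as `pyarrow` is installed.

```python
from pathlib import Path

# Assumption: URL pattern of the monthly files linked from the TLC page.
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"
ROWS_PER_MONTH = 20_000  # rows kept from each monthly file


def monthly_urls(year: int) -> list[str]:
    """Return the 12 monthly Parquet URLs for one year (hypothetical helper)."""
    return [
        f"{BASE_URL}/yellow_tripdata_{year}-{month:02d}.parquet"
        for month in range(1, 13)
    ]


def build_year_sample(year: int, out_dir: Path, seed: int = 42) -> Path:
    """Download each month, keep a random sample, and write one gzip Parquet file."""
    # Imported here so the URL helper above works without pandas installed.
    import pandas as pd

    frames = []
    for url in monthly_urls(year):
        month_df = pd.read_parquet(url)
        # Guard against months with fewer rows than the sample size.
        n = min(ROWS_PER_MONTH, len(month_df))
        frames.append(month_df.sample(n=n, random_state=seed))

    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"yellow_tripdata_{year}.parquet.gzip"
    pd.concat(frames, ignore_index=True).to_parquet(out_path, compression="gzip")
    return out_path
```

Calling `build_year_sample(2023, Path("data"))` would download twelve files and write a single combined file for that year. Fixing `random_state` keeps the sample reproducible between runs; drop it if a fresh sample each time is preferred.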

## Data Model

## Insights