Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/lefosg/azure-taxi-data-analytics
Code for running queries on a dataset about taxi rides in Azure (Azure functions + Event Hub)
Last synced: 19 days ago
- Host: GitHub
- URL: https://github.com/lefosg/azure-taxi-data-analytics
- Owner: lefosg
- Created: 2024-05-20T20:50:03.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2024-05-20T21:08:32.000Z (8 months ago)
- Last Synced: 2024-11-07T17:55:45.156Z (2 months ago)
- Language: Python
- Size: 355 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
This repo contains all the code for an Azure Function App that processes the dataset in this repo. Datasets are uploaded to Blob Storage; the Azure Function Blob trigger then fires and runs calculations on the uploaded dataset.
Streaming of mini-batches is also supported through an Event Hub: the Event Hub trigger handles the mini-batch stream.
- preprocess_dataset.py x1 x2 x3: takes the first x1 lines of the dataset and writes them to a local CSV file called output.csv. It then creates x2 batches of x3 lines each. Empirically, due to Event Hub limitations, 20 lines per mini-batch CSV file is the maximum supported size.
- upload.py: supports two operations: 'upload.py single name_of_file.csv', which uploads the dataset to Blob Storage (invoking the Blob trigger), and 'upload.py batch', which reads the mini-batch files from a directory created by preprocess_dataset.py and streams them to the Event Hub.
- function_appy.py: the code executed in the Azure Function. It includes 4 queries for the uploaded dataset and the two triggers (Blob and Event Hub).
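The preprocessing step described in the list above can be sketched roughly as follows. This is not the repo's actual preprocess_dataset.py. The function name `preprocess`, the `batches` directory, the `batch_N.csv` file names, and the assumption that the first CSV row is a header (not counted among the x1 lines) are all hypothetical choices made for illustration.

```python
import csv
from itertools import islice
from pathlib import Path


def preprocess(src, x1, x2, x3, out="output.csv", batch_dir="batches"):
    """Keep the first x1 data rows of src, then cut up to x2 batches of x3 rows."""
    with open(src, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)             # assumption: the first row is a header
        rows = list(islice(reader, x1))   # first x1 data rows

    # Write the truncated dataset locally.
    with open(out, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)

    # Write the mini-batch files, each with its own copy of the header.
    Path(batch_dir).mkdir(exist_ok=True)
    paths = []
    for i in range(x2):
        chunk = rows[i * x3:(i + 1) * x3]
        if not chunk:
            break                         # ran out of rows before reaching x2 batches
        path = Path(batch_dir) / f"batch_{i}.csv"
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(chunk)
        paths.append(path)
    return paths
```

Given the limit noted above, a call such as `preprocess("dataset.csv", 1000, 50, 20)` would keep each mini-batch at 20 data rows; `dataset.csv` is a placeholder name.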
Results of the calculations are stored in Blob Storage, and some specific results are also saved to Redis Cache.
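The two upload.py operations described earlier ('single' and 'batch') could be dispatched with a small argument parser like the one below. This is a minimal sketch and not the repo's code: `build_parser` and `describe` are hypothetical names, and the actual Blob Storage upload and Event Hub streaming are replaced by descriptive strings so the example runs without any Azure dependency.

```python
import argparse


def build_parser():
    """Parser for 'upload.py single <file>' and 'upload.py batch'."""
    parser = argparse.ArgumentParser(prog="upload.py")
    sub = parser.add_subparsers(dest="mode", required=True)
    single = sub.add_parser("single", help="upload one dataset to Blob Storage")
    single.add_argument("filename")
    sub.add_parser("batch", help="stream mini-batch files to the Event Hub")
    return parser


def describe(argv):
    """Return a description of the action the arguments select (stub, no I/O)."""
    args = build_parser().parse_args(argv)
    if args.mode == "single":
        return f"upload {args.filename} to Blob Storage"
    return "stream mini-batch files to Event Hub"
```

In a real script, each branch would call the corresponding Azure client instead of returning a string.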