https://github.com/imsanjoykb/etl-project

The goal of this project is to illustrate Extract Transform Load (ETL) using Python and SQL. ETL is a process commonly done in computing, which takes raw data, cleans it and stores it for later use. The extraction phase targets and retrieves the data. Transform manipulates and cleans the data. Then load stores the data, typically in a data warehouse.
https://github.com/imsanjoykb/etl-project

data-engineering database datalake datawarehouse etl etl-automation etl-pipeline etl-solutions

Last synced: about 2 months ago
JSON representation

Host: GitHub
URL: https://github.com/imsanjoykb/etl-project
Owner: imsanjoykb
License: mit
Created: 2021-08-21T17:11:32.000Z (about 4 years ago)
Default Branch: master
Last Pushed: 2021-09-07T16:08:11.000Z (about 4 years ago)
Last Synced: 2024-10-30T13:55:03.130Z (11 months ago)
Topics: data-engineering, database, datalake, datawarehouse, etl, etl-automation, etl-pipeline, etl-solutions
Language: Jupyter Notebook
Homepage: https://imsanjoykb.github.io/
Size: 285 KB
Stars: 20
Watchers: 3
Forks: 7
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

# Covid-19 Infection ETL Project
By Sanjoy Biswas

## Project Proposal
Based upon the data compiled by John Hopkins University, I want to explore ''' Insert reasons here'''
This will be done by extracting the CSV data and migrating it to a PostgreSQL Database.

## Project Description
I found data from `data.data.org` that had been compiled from John Hopkins University. I filtered the data for March 2020 and evaluated the number of cases with respect to the deaths, recovery, and the confirmed cases.

## Finding Data
All of the data that we used were from `https://data.humdata.org/dataset/novel-coronavirus-2019-ncov-cases` where they compiled by the John Hopkins University Center for Systems Science and Engineering (JHU CSSE) from various sources that include the World Health Organisations, Hong Kong Department of Health, European Centre for Disease Prevention and Control, etc. This data is always being updated so we are narrowing the scope to the month of March.

## Data Cleanup and Analysis

* TRANSFORMATION STEPS
My transformation steps I needed to clean the data to be readable, presentable, and easy for me to query in the later stages. This was done by:
* Developing a cleaning function in python that would select the data in the month of March 2020. This was applied to all datasets that I have.
* All of the dates that have in the data sets we treated as values through the `pd.melt` function in Pandas.
* Found a way of finding the daily increase with respect to each table. This value was converted from a float to an integer

* LOADING STEPS
* I established a connection to a local PostgreSQL server in our desktop to store the data
* I have a schema that just makes the tables and we can confirm it throught `engine.table_names()`
* I pushed the Pandas DataFrame to the local PostgreSQL server so I can retrieve and query the data in our Jupyter Notebook

* Analysis / SQL Queries
In this part, I want to find:
* Top 5 countries with the most/least confirmed cases
* Top 5 countries with the most/least deaths
* Top 5 countries with the most/least recovered
* Date in March with the most confirmed/deaths/recovered

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/imsanjoykb/etl-project

Awesome Lists containing this project

README