https://github.com/banknatchapol/us-immigration-data-pipeline

Create a data pipeline for US immigration data using Spark.

data-pipeline spark

Host: GitHub
URL: https://github.com/banknatchapol/us-immigration-data-pipeline
Owner: BankNatchapol
Created: 2021-01-30T17:12:31.000Z (about 4 years ago)
Default Branch: main
Last Pushed: 2021-02-02T10:01:05.000Z (about 4 years ago)
Last Synced: 2024-11-29T02:32:22.028Z (3 months ago)
Topics: data-pipeline, spark
Language: Jupyter Notebook
Homepage:
Size: 2.41 MB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

[![LinkedIn][linkedin-shield]][linkedin-url]

US Immigration Data Pipeline

Create an ETL data pipeline using Spark.


## About The Project
This project builds an ETL data pipeline with Spark for four datasets: immigration, airport, US demographics, and world temperature data.

I chose Spark for this project because it is fast and handles large-scale data well. It also abstracts away the complexity of accessing data across many kinds of storage, such as the Hadoop Distributed File System (HDFS), relational databases, fast-moving data streams, and other distributed file systems.

There are 5 detailed steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

## Step 1: Scope the Project and Gather Data

### Scope
- Use Spark to load the data into the workspace.
- Run exploratory data analysis (EDA) to check for missing values.
- Clean the data based on the EDA results.
- Use Spark to write the cleaned data to Parquet files.
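
The scope above can be sketched as one small function. This is a minimal illustration, not the notebook's actual code: the function names, the `output_root` layout, and the choice to drop fully empty rows and exact duplicates are all assumptions made for the example.

```python
def parquet_target(output_root: str, table: str) -> str:
    """Build the output directory for one table (hypothetical layout)."""
    return f"{output_root.rstrip('/')}/{table}"


def csv_to_parquet(spark, csv_path: str, output_root: str, table: str, sep: str = ",") -> str:
    """Load a CSV with Spark, apply basic cleaning, and write Parquet.

    `spark` is an existing SparkSession, for example
    SparkSession.builder.appName("etl-sketch").getOrCreate().
    """
    # Load: read the raw file and let Spark infer column types.
    df = spark.read.csv(csv_path, header=True, inferSchema=True, sep=sep)

    # Clean (assumed rules): drop rows where every column is null,
    # then drop exact duplicate rows.
    cleaned = df.dropna(how="all").dropDuplicates()

    # Write: overwrite any previous output for this table.
    target = parquet_target(output_root, table)
    cleaned.write.mode("overwrite").parquet(target)
    return target
```

The real cleaning rules depend on what the EDA in Step 2 finds, so the two cleaning calls here are placeholders for that logic.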

## Describe and Gather Data
### I94 Immigration Data
This data comes from the US National Tourism and Trade Office.
Each report contains international visitor arrival statistics by world regions and select countries (including top 20), type of visa, mode of transportation, age groups, states visited (first intended address only), and the top ports of entry (for select countries).
The immigration data is stored in a folder at the following path: ../../data/18-83510-I94-Data-2016/. There is one file for each month of the year, and each file name contains the three-letter abbreviation of the month. An example file name is i94_apr16_sub.sas7bdat, so the full file path for June is ../../data/18-83510-I94-Data-2016/i94_jun16_sub.sas7bdat.
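
The naming rule above is easy to encode. The helper below is a hypothetical sketch (the names `I94_DIR`, `MONTHS`, and `i94_file_path` are not from the project) that builds the path for any month from the folder and file pattern described above.

```python
I94_DIR = "../../data/18-83510-I94-Data-2016"
MONTHS = ("jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec")


def i94_file_path(month: str) -> str:
    """Return the 2016 I94 file path for a month name or abbreviation."""
    abbr = month.strip().lower()[:3]
    if abbr not in MONTHS:
        raise ValueError(f"unknown month: {month!r}")
    return f"{I94_DIR}/i94_{abbr}16_sub.sas7bdat"


# One path per month, in calendar order.
ALL_I94_FILES = [i94_file_path(m) for m in MONTHS]
```

Reading `.sas7bdat` files in Spark needs a third-party data source (the `spark-sas7bdat` package is one option); which reader the notebook uses is not stated here.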

### World Temperature Data
The World Temperature dataset comes from Kaggle and represents global land temperatures by city.

### U.S. City Demographic Data
This data comes from OpenDataSoft and contains information about the demographics of all US cities and census-designated places with a population greater than or equal to 65,000. The original data comes from the US Census Bureau's 2015 American Community Survey.

### Airport Code Table
This is a simple table of airport codes and corresponding cities.

## Step 2: Explore and Assess the Data
### Exploratory data analysis and data cleaning include the following visualization steps

#### Visualize World Temperature Data missing values

#### Visualize I94 Immigration Data missing values

#### Visualize U.S. City Demographic Data missing values

#### Visualize Airport Code Table missing values
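
The numbers behind a missing-value chart can be computed as a per-column null fraction. The sketch below is one way to do that in PySpark; the function names and the 0.9 drop threshold are assumptions for illustration, not values taken from the notebook.

```python
def null_fractions(df) -> dict:
    """Return {column: fraction of rows that are null} for a Spark DataFrame."""
    # Imported here so the helpers below stay usable without Spark installed.
    from pyspark.sql import functions as F

    total = df.count()
    if total == 0:
        return {c: 0.0 for c in df.columns}

    # Backticks keep column names with spaces or dots intact.
    counts = df.select(
        [F.sum(F.col(f"`{c}`").isNull().cast("int")).alias(c) for c in df.columns]
    ).first().asDict()
    return {c: counts[c] / total for c in df.columns}


def columns_to_drop(fractions: dict, threshold: float = 0.9) -> list:
    """Pick columns whose null fraction exceeds the (assumed) threshold."""
    return sorted(c for c, frac in fractions.items() if frac > threshold)
```

The dictionary returned by `null_fractions` can be passed to any plotting library as a bar chart, and `columns_to_drop` turns the same numbers into a cleaning decision.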

## Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
This project uses a star schema for the data model because it is simpler to implement than alternatives such as a snowflake schema.

#### 3.2 Mapping Out Data Pipelines
Pipeline steps:
- Load the datasets.
- Clean the data and handle missing values.
- Transform the raw data into the data model tables.

## Step 4: Run Pipelines to Model the Data
#### immigration fact table
#### immigration time table
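
A time table is typically derived from the date column of the fact table. The sketch below assumes the arrival date is stored as a SAS numeric date, which counts days since 1960-01-01; that storage format and all field names here are assumptions, not details confirmed by this README.

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)  # SAS date values count days from this day


def sas_to_date(sas_days) -> date:
    """Convert a SAS numeric date (days since 1960-01-01) to a date."""
    return SAS_EPOCH + timedelta(days=int(sas_days))


def time_row(sas_days) -> dict:
    """Build one row of a time dimension (hypothetical column names)."""
    d = sas_to_date(sas_days)
    return {
        "arrival_date": d.isoformat(),
        "day": d.day,
        "month": d.month,
        "year": d.year,
        "week": d.isocalendar()[1],   # ISO week number
        "weekday": d.isoweekday(),    # Monday = 1 ... Sunday = 7
    }
```

In Spark the same conversion can be applied to a whole column, for example by wrapping `sas_to_date` in a UDF or by adding the day count to the epoch date with built-in date functions.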
#### airport table
#### us demographic table
#### land temperature table

## Step 5: Project Write Up
This section describes how the project would handle the following scenarios.
#### The data was increased by 100x.
We can store the data in a cloud service such as Amazon Redshift and increase the size of our machines to handle the larger volume.

#### The pipelines would be run on a daily basis by 7 am every day.
We can use a scheduler such as Apache Airflow to run the pipeline at a specific time every day.
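
A daily 7 am schedule could look like the sketch below. It assumes Airflow 2.4 or a later 2.x release (where `DAG` accepts `schedule`) and a hypothetical `etl.py` script run through `spark-submit`; the DAG id, task id, and start date are also placeholders.

```python
from datetime import datetime

# Cron fields: minute hour day-of-month month day-of-week.
DAILY_7AM_CRON = "0 7 * * *"


def build_dag():
    """Create a DAG that triggers the ETL job every day at 07:00."""
    # Imported here so this file can be read without Airflow installed.
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="us_immigration_etl",       # hypothetical name
        schedule=DAILY_7AM_CRON,
        start_date=datetime(2021, 1, 1),   # placeholder start date
        catchup=False,                     # do not backfill past days
    ) as dag:
        BashOperator(
            task_id="run_spark_etl",
            bash_command="spark-submit etl.py",  # hypothetical script
        )
    return dag
```

In a real DAG file, assign the result at module level (`dag = build_dag()`) so the scheduler can discover it. The cron time is interpreted in the scheduler's configured timezone, and if results must be ready by 7 am the trigger hour should be set earlier.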

#### The database needed to be accessed by 100+ people.
We can configure our cloud database service to auto-scale, which improves availability when many users query it at once.

## Contact

Facebook - [@Natchapol Patamawisut](https://www.facebook.com/natchapol.patamawisut/)

Project Link: [https://github.com/BankNatchapol/US-Immigration-Data-Pipeline](https://github.com/BankNatchapol/US-Immigration-Data-Pipeline)

[linkedin-shield]: https://img.shields.io/badge/-LinkedIn-black.svg?style=for-the-badge&logo=linkedin&colorB=555
[linkedin-url]: https://www.linkedin.com/in/natchapol-patamawisut