https://github.com/sravanigodavarthi/wdi-pyspark-etl

A PySpark ETL pipeline is developed to transform and model World Development Indicators (WDI) data. The process involves performing data quality checks, exporting cleaned data to CSV files, and conducting time series analysis to derive insightful metrics.

pivot pyspark time-series-analysis


# WDI-PySpark-ETL
## Project Objective:
The objective of this project is to transform and model World Development Indicators (WDI) datasets using PySpark, ensuring data quality and readiness for analysis. The project involves:

1. **Data Exploration and Transformation**: Analyzing and transforming datasets (`WDICountry.csv`, `WDISeries.csv`, `WDIData.csv`).
2. **Data Quality Checks**: Performing data quality checks and resolving any issues.
3. **Data Output**: Writing the cleaned data to `CSV` files.
4. **Time Series Analysis**: Pivoting the time series data and analyzing the resulting metrics for insights.
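
The first three steps could be sketched roughly as follows for a single file. This is a minimal illustration under stated assumptions, not the repository's actual code: the function names, the duplicate-key check, and the drop-duplicates resolution are all hypothetical, and `spark` is expected to be an existing `SparkSession`.

```python
def null_count_exprs(columns):
    """Build one Spark SQL expression per column that counts its NULLs."""
    # Backticks are needed because the WDI headers contain spaces
    # ("Country Name") and, in WDIData.csv, bare years ("1960").
    return [f"count(*) - count(`{c}`) AS `{c}`" for c in columns]


def run_basic_etl(spark, in_path, out_dir, key_columns):
    """Load one WDI CSV, report NULLs and duplicate keys, write a cleaned copy."""
    df = spark.read.csv(in_path, header=True, inferSchema=True)

    # Quality check 1: NULL count per column, shown as a single-row table.
    df.selectExpr(*null_count_exprs(df.columns)).show(truncate=False)

    # Quality check 2: keys that occur on more than one row.
    duplicates = df.groupBy(*key_columns).count().filter("`count` > 1")
    print("duplicate keys:", duplicates.count())

    # One possible resolution (an assumption): keep a single row per key.
    cleaned = df.dropDuplicates(key_columns)

    # Spark writes a directory of part files, not a single CSV file.
    cleaned.write.csv(out_dir, header=True, mode="overwrite")
    return cleaned
```

A call might look like `run_basic_etl(spark, "WDICountry.csv", "out/country", ["Country Code"])`, assuming `Country Code` identifies a row in `WDICountry.csv`.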

### Data Source:
The World Development Indicators are available from the World Bank Data Catalog: https://datacatalog.worldbank.org/search/dataset/0037712/World-Development-Indicators

The World Development Indicators (WDI) from the World Bank provide comprehensive economic, social, and environmental data for more than 200 countries. The indicators cover areas such as GDP, health, education, and infrastructure, and are sourced from international and national agencies. This data supports global development analysis and policymaking.

## Problem Statement:

**Cellular and Broadband Penetration Analysis:**

We aim to measure cellular and broadband penetration relative to each country's population. We also want annual global aggregates of these metrics.
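
One way to express penetration is as subscriptions per 100 people. The sketch below assumes a DataFrame `pivoted` with one row per country and year and the hypothetical columns `country_code`, `year`, `cellular`, `broadband`, and `population` (for example the WDI series `IT.CEL.SETS`, `IT.NET.BBND`, and `SP.POP.TOTL`, which should be checked against `WDISeries.csv`). It also assumes that aggregate rows such as regions or "World" have already been removed, since they would otherwise be double counted in the global totals.

```python
def per_100_expr(numerator, denominator, alias):
    """Spark SQL expression for numerator per 100 units of denominator."""
    return f"100.0 * `{numerator}` / `{denominator}` AS `{alias}`"


def penetration_by_country(pivoted):
    """Per-country, per-year penetration rates."""
    return pivoted.selectExpr(
        "country_code",
        "year",
        per_100_expr("cellular", "population", "cellular_per_100"),
        per_100_expr("broadband", "population", "broadband_per_100"),
    )


def penetration_global(pivoted):
    """Annual global aggregates: sum over countries, then take the ratio."""
    # GroupedData.sum names its outputs "sum(cellular)" and so on, which is
    # why per_100_expr wraps every column name in backticks.
    totals = pivoted.groupBy("year").sum("cellular", "broadband", "population")
    return totals.selectExpr(
        "year",
        per_100_expr("sum(cellular)", "sum(population)", "cellular_per_100"),
        per_100_expr("sum(broadband)", "sum(population)", "broadband_per_100"),
    ).orderBy("year")
```

Because `sum` skips NULLs, a country that reports population but no subscription figure for a given year still contributes to the denominator. Filtering out such rows first may be preferable, depending on how the aggregate is meant to be read.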

**Regional Metrics Exploration:**

Becky finds the regional metrics interesting and wants to explore these metrics at a country level for each year. Can you adapt the regional pivot computed earlier to get the metrics for each country by year?
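
A country-by-year pivot could be built along the following lines. This is a sketch, not the project's implementation: it assumes `WDIData.csv` is in the wide layout with `Country Code` and `Indicator Code` columns followed by one column per year, and the indicator codes and output column names in `INDICATORS` are illustrative choices. The year columns are first unpivoted with Spark SQL's `stack` generator, then the indicators are pivoted into columns.

```python
# Assumed WDI series codes mapped to hypothetical, dot-free column names.
INDICATORS = {
    "IT.CEL.SETS": "cellular",
    "IT.NET.BBND": "broadband",
    "SP.POP.TOTL": "population",
}


def year_columns(columns):
    """Pick the wide-format year columns, e.g. "1960", "1961", ..."""
    return [c for c in columns if len(c) == 4 and c.isdigit()]


def stack_expr(years):
    """Spark SQL stack() call that turns year columns into (year, value) rows."""
    pairs = ", ".join(f"'{y}', CAST(`{y}` AS DOUBLE)" for y in years)
    return f"stack({len(years)}, {pairs}) AS (year, value)"


def country_year_pivot(wdi_data):
    """One row per country and year, one column per indicator."""
    years = year_columns(wdi_data.columns)
    long_df = (
        wdi_data
        .filter(wdi_data["Indicator Code"].isin(list(INDICATORS)))
        .selectExpr(
            "`Country Code` AS country_code",
            "`Indicator Code` AS indicator_code",
            stack_expr(years),
        )
        .filter("value IS NOT NULL")
    )
    long_df = long_df.withColumn("year", long_df["year"].cast("int"))

    # Each (country, year, indicator) has at most one value, so max() simply
    # carries that value into the pivoted column.
    pivoted = (
        long_df.groupBy("country_code", "year")
        .pivot("indicator_code", list(INDICATORS))
        .max("value")
    )
    # Rename the dotted indicator codes to names that are easier to reference.
    for code, name in INDICATORS.items():
        pivoted = pivoted.withColumnRenamed(code, name)
    return pivoted
```

Passing the expected values to `pivot` explicitly avoids an extra pass over the data to discover them and fixes the column order of the result.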

**Business Environment Analysis:**

Kat wants to identify the countries that are conducive to starting a business. She is interested in the most recent metrics for the following indicators:

* Gross National Income (GNI)

* Cost of business start-up procedures

* Number of days required to start a business

* Number of start-up procedures to register a business

* GDP

* GDP per capita

* Business Regulatory Environment

* Ease of doing business index (available only for 2017)

The data should be written to a CSV file.
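
The "most recent metric per indicator" requirement could be approached as below. This is a minimal sketch under several assumptions: `long_df` is a hypothetical long-format DataFrame with columns `country_code`, `indicator_code`, `year` (integer), and `value`; the series codes are best guesses for the indicators listed above and should be verified against `WDISeries.csv`; and the output column names are invented for illustration.

```python
# Assumed WDI series codes mapped to hypothetical output column names.
BUSINESS_INDICATORS = {
    "NY.GNP.MKTP.CD": "gni",
    "IC.REG.COST.PC.ZS": "startup_cost_pct_gni_per_capita",
    "IC.REG.DURS": "days_to_start_business",
    "IC.REG.PROC": "startup_procedures",
    "NY.GDP.MKTP.CD": "gdp",
    "NY.GDP.PCAP.CD": "gdp_per_capita",
    "IQ.CPA.BREG.XQ": "business_regulatory_environment",
    "IC.BUS.EASE.XQ": "ease_of_doing_business",
}


def latest_business_metrics(long_df, out_dir):
    """Latest non-NULL value per country and indicator, written as CSV."""
    wanted = long_df.filter(
        long_df["indicator_code"].isin(list(BUSINESS_INDICATORS))
    ).filter("value IS NOT NULL")

    # Most recent year that has data, per country and indicator.
    latest_year = (
        wanted.groupBy("country_code", "indicator_code")
        .max("year")
        .withColumnRenamed("max(year)", "year")
    )
    # Joining back on the key keeps only the rows from that latest year.
    latest = wanted.join(
        latest_year, on=["country_code", "indicator_code", "year"], how="inner"
    )

    result = (
        latest.groupBy("country_code")
        .pivot("indicator_code", list(BUSINESS_INDICATORS))
        .max("value")
    )
    for code, name in BUSINESS_INDICATORS.items():
        result = result.withColumnRenamed(code, name)

    # coalesce(1) yields a single part file inside the out_dir directory.
    result.coalesce(1).write.csv(out_dir, header=True, mode="overwrite")
    return result
```

Since each indicator may have a different latest year, the values on one output row are not necessarily from the same year. Keeping the year alongside each value would make that explicit if it matters for the comparison.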