https://github.com/spshah1701/world-development-indicators
Analysis of World Development Indicators (WDI) using big data technologies, specifically Databricks, Apache Spark, and Scala.
apache-spark big-data data-analysis spark-sql
- Host: GitHub
- URL: https://github.com/spshah1701/world-development-indicators
- Owner: spshah1701
- Created: 2024-01-01T22:51:10.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-01-12T22:55:11.000Z (about 1 year ago)
- Last Synced: 2024-01-13T13:20:56.841Z (about 1 year ago)
- Topics: apache-spark, big-data, data-analysis, spark-sql
- Homepage:
- Size: 107 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# World Development Indicators Analysis
## Introduction
The World Development Indicators (WDI) is the World Bank's most comprehensive collection of cross-country development data. Its website provides access to the data as well as information about data coverage, curation, and methodologies, and lets users discover which types of indicators are available.
+ [World Bank](https://www.worldbank.org/en/home)
+ [World Development Indicators](https://databank.worldbank.org/source/world-development-indicators)

## Tools and Technologies
+ Databricks
+ Apache Spark
+ Scala

## Data Description
1. **Country.csv**
247 rows representing the countries. \
31 columns describing various attributes of the countries.

2. **Indicators.csv**
5656458 rows representing data instances. \
6 columns describing indicators of the countries.

#### Need for using Big Data Technologies
The Indicators.csv file is about 550 MB, which motivates the use of Apache Spark with Scala on Databricks. This combination provides a scalable framework for processing large datasets efficiently.
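With a file of this size, one option worth considering is an explicit schema: `inferSchema` makes Spark read the data an extra time to guess the column types. The following is a minimal sketch and not part of the original notebook. It assumes the six columns are named `CountryName`, `CountryCode`, `IndicatorName`, `IndicatorCode`, `Year`, and `Value`; only four of these names appear in this README's SQL query, so `CountryCode` and `IndicatorName`, the column types, and the helper name `readIndicators` are all assumptions to check against the actual file.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}

// Assumed column layout for Indicators.csv; adjust names and types to match the real file.
val indicatorsSchema = StructType(Seq(
  StructField("CountryName", StringType, nullable = true),
  StructField("CountryCode", StringType, nullable = true),   // assumed column name
  StructField("IndicatorName", StringType, nullable = true), // assumed column name
  StructField("IndicatorCode", StringType, nullable = true),
  StructField("Year", IntegerType, nullable = true),
  StructField("Value", DoubleType, nullable = true)
))

// Hypothetical helper: reads the CSV with the declared schema instead of inferring it.
def readIndicators(spark: SparkSession, path: String): DataFrame =
  spark.read
    .option("header", "true")
    .schema(indicatorsSchema) // no inference pass over the file
    .csv(path)
```

In a Databricks notebook, where `spark` is already defined, this could be called as `readIndicators(spark, "/FileStore/tables/Indicators.csv")`.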
## My Implementation
+ [Published Notebook](https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1225760005808135/549596638656775/7992345167110499/latest.html)

**Note**: This link will be valid until 01-06-2024.
## Project Setup and Replication
- Create a free Databricks Community Edition account
- Create a new cluster and wait till it is active and running
- Upload the *World Development Indicators.dbc* Notebook to Databricks and connect it to the above cluster.
- Upload the data (CSV files) to Databricks after downloading it from the [source](https://databank.worldbank.org/source/world-development-indicators).
- Run the cells, then view and analyse the data as desired.

### Access the data
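```
%scala
// Read the uploaded CSV into a DataFrame, using the first row as column names
// and letting Spark infer the column types.
val Indicators = sqlContext.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/FileStore/tables/Indicators.csv")

// Render the DataFrame as a table in the notebook.
display(Indicators)
```

### Create or Replace Temporary view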
A temporary view lets you run SQL queries against the DataFrame as if it were a SQL table.
```
%scala
Indicators.createOrReplaceTempView("Indicators")
```

### Write desired SQL queries for data visualization and analysis
```
%sql
select CountryName, Value, Year
from Indicators
where IndicatorCode in ("NY.GNP.PCAP.CD")
  and Year = 1962
  and CountryName in ("Japan", "China", "France", "United States")
order by Value asc;
```

#### Output
![image](https://github.com/spshah1701/World-Development-Indicators/assets/142957290/5203ac11-ceab-49ca-bc49-f02b7f1fe50b)
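The SQL query above can also be written with the DataFrame API, which may be convenient when the indicator code, year, or country list should be parameters. This is a minimal sketch rather than code from the original notebook. The helper name `indicatorForYear` is hypothetical, and it assumes the DataFrame has the `CountryName`, `IndicatorCode`, `Year`, and `Value` columns used in the query.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper mirroring the SQL query: keep one indicator for one year
// and a chosen set of countries, sorted by ascending value.
def indicatorForYear(
    indicators: DataFrame,
    code: String,
    year: Int,
    countries: Seq[String]): DataFrame =
  indicators
    .filter(col("IndicatorCode") === code && col("Year") === year)
    .filter(col("CountryName").isin(countries: _*))
    .select("CountryName", "Value", "Year")
    .orderBy(col("Value").asc)

// In a Databricks notebook, with the Indicators DataFrame already loaded:
// display(indicatorForYear(Indicators, "NY.GNP.PCAP.CD", 1962,
//   Seq("Japan", "China", "France", "United States")))
```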