# Capstone project
We want to analyze the weather in Spain. The State Meteorological Agency of Spain (AEMET) gathers weather information from more than 2,500 meteorological stations every day. These stations are located all over the country, so we can aggregate the data by city and analyze the metrics in detail.
## Table of contents
- [Description](#description)
- [Structure](#structure)
- [Requirements](#requirements)
  - [Cloning the repository](#cloning-the-repository)
  - [Requesting the AEMET API Key](#requesting-the-aemet-api-key)
- [How to use](#how-to-use)
  - [Running Apache Airflow](#running-apache-airflow)
  - [Configuring connections](#configuring-connections)
    - [AEMET API connection](#aemet-api-connection)
    - [Application database connection](#application-database-connection)
  - [Running the Capstone Project DAG](#running-the-capstone-project-dag)
  - [Analyzing the data](#analyzing-the-data)
- [Cleaning the environment](#cleaning-the-environment)

---

## Description
The premises are:
- The weather data is retrieved via API from an external provider (AEMET); it can arrive in any format and may be wrongly typed, so we need a mechanism that retrieves, cleans and types the data properly.
- Once the master data is ready, we must store it somewhere. As we want to analyze the weather information, we must think about aggregations: relational databases are a good option.
- Lastly, we are going to cluster the data by city and by time range (monthly, quarterly and yearly). We can use the same relational database to create these fact tables.

We can use [Apache Airflow](https://airflow.apache.org/) and [PostgreSQL](https://www.postgresql.org/) to orchestrate the pipeline. It's a simple pipeline, so we can run it locally as a set of Docker services. The architecture model looks like this:
- Apache Airflow with a LocalExecutor
- A PostgreSQL instance for Apache Airflow's Database Backend
- A PostgreSQL instance for the application database.

How this is accomplished with Docker is explained [later](#running-apache-airflow).
The data pipeline is shown below:
- The source data is retrieved from the AEMET API using a custom Apache Airflow hook. This hook queries the API and transforms the response, a JSON object, into a [Pandas](https://pandas.pydata.org/) DataFrame to ease the data wrangling.
- A custom Apache Airflow operator takes the weather DataFrame and pushes it into a staging table in PostgreSQL. This master data will be used as the source for the fact tables.
- Another custom Apache Airflow operator queries the staging table to aggregate the data and inserts it into fact tables, also in PostgreSQL.

The data pipeline can be monitored from the Apache Airflow console: follow the instructions below to set it up!
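To make the extraction step concrete, here is a minimal sketch of the kind of work the custom hook does: call the API, follow the response envelope to the actual data, and type the DataFrame columns. It is not the repository's `aemet.py`; the endpoint path, the column names and the `AEMET_API_KEY` environment variable are assumptions for illustration only.

```python
# Illustrative sketch only (not the project's actual hook).
# Assumes the `requests` and `pandas` packages and an AEMET_API_KEY env variable.
import os

import pandas as pd
import requests

BASE_URL = "https://opendata.aemet.es/opendata"
API_KEY = os.environ["AEMET_API_KEY"]  # the JWT requested from AEMET


def fetch_daily_weather(from_date: str, to_date: str) -> pd.DataFrame:
    """Query the AEMET OpenData API and return the payload as a typed DataFrame."""
    # Example endpoint: daily climatological values for all stations.
    endpoint = (
        f"{BASE_URL}/api/valores/climatologicos/diarios/datos/"
        f"fechaini/{from_date}T00:00:00UTC/fechafin/{to_date}T23:59:59UTC/todasestaciones"
    )
    # AEMET answers with a small JSON envelope whose "datos" field points to the data.
    envelope = requests.get(endpoint, params={"api_key": API_KEY}, timeout=30).json()
    records = requests.get(envelope["datos"], timeout=30).json()

    # Flatten the JSON records and coerce the temperature columns, which come
    # as strings with a comma as the decimal separator.
    df = pd.json_normalize(records)
    for column in ("tmed", "tmax", "tmin"):
        if column in df.columns:
            df[column] = pd.to_numeric(
                df[column].str.replace(",", ".", regex=False), errors="coerce"
            )
    if "fecha" in df.columns:
        df["fecha"] = pd.to_datetime(df["fecha"], errors="coerce")
    return df


# Example: df = fetch_daily_weather("2019-06-20", "2019-07-10")
```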
## Structure

This tree shows the repository structure. Only the project's main files are described.
```
.
├── images
│ ├── airflow-adhoc-query-01.png
│ ├── airflow-adhoc-query-02.png
│ ├── airflow-adhoc-query-03.png
│ ├── airflow-adhoc-query-04.png
│ ├── airflow-connections.png
│ ├── airflow-dag-01.png
│ ├── airflow-dag-02.png
│ ├── airflow-dag-03.png
│ ├── architecture-model.png
│ ├── data-flow.png
│ └── request-aemet-api-key.png
├── src
│ ├── airflow
│ │ ├── dags
│ │ │ └── capstone.py # The capstone project main DAG
│ │ └── plugins
│ │ └── capstone_plugin
│ │ ├── helpers
│ │ │ ├── __init__.py
│ │ │ └── queries.py # Queries used by the custom operators
│ │ ├── hooks
│ │ │ ├── __init__.py
│ │ │ └── aemet.py # AEMET API hook
│ │ ├── operators
│ │ │ ├── __init__.py
│ │ │ ├── aggregate_table.py # Custom operator to aggregate data
│ │ │ ├── create_table.py # Custom operator to create tables
│ │ │ ├── data_quality.py # Custom operator to test the data quality
│ │ │ └── import_weather.py # Custom operator to import weather data
│ │ └── __init__.py
├── .editorconfig
├── .gitignore
├── docker-compose.yml # Descriptor for the capstone project deployment
└── README.md
```
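The custom operators listed above follow the usual Airflow pattern: a `BaseOperator` subclass whose `execute` method does the work, exposed to the DAG through the plugin's `__init__.py`. The skeleton below is a rough, hedged illustration only (not the project's `data_quality.py`; the class name and the row-count check are made up, and the Airflow 1.10-style imports are an assumption based on the project's 2019 timeframe):

```python
# Illustrative skeleton of a custom operator, in the spirit of data_quality.py.
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class RowCountCheckOperator(BaseOperator):
    """Fail the task if the given table is empty."""

    @apply_defaults
    def __init__(self, ddbb_conn_id, table, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.ddbb_conn_id = ddbb_conn_id
        self.table = table

    def execute(self, context):
        hook = PostgresHook(postgres_conn_id=self.ddbb_conn_id)
        records = hook.get_records(f"SELECT COUNT(*) FROM {self.table}")
        if not records or records[0][0] == 0:
            raise ValueError(f"Data quality check failed: {self.table} is empty")
        self.log.info("Data quality check passed: %s has %s rows", self.table, records[0][0])
```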
---

## Requirements

It is assumed that the tools below are properly installed locally:
- [Docker Engine / Desktop](https://hub.docker.com/search/?type=edition&offering=community): the project runs Apache Airflow and PostgreSQL as Docker containers orchestrated with Docker Compose, so Docker must be installed and running.
### Requesting the AEMET API Key
We need an API Key in order to retrieve weather data from AEMET. It can be requested from the [AEMET OpenData site](https://opendata.aemet.es/centrodedescargas/inicio) just by providing an email address:
You will be sent a [JSON Web Token](https://jwt.io/): keep it on hand, as we will use it shortly.
### Cloning the repository

The first step is to clone this repository. Just type the following commands in your terminal:
```bash
# Clone the repository...
git clone https://github.com/vermicida/capstone-project.git

# ...and move to its directory
cd capstone-project
```

---
## How to use

Here are the steps to follow to make the pipeline work.
### Running Apache Airflow

We lean on Docker to run Apache Airflow. In the root directory of the project you will find the file `docker-compose.yml`: that's the one that makes the magic happen! It creates the following container topology:
- A container running Apache Airflow with a [LocalExecutor](https://www.astronomer.io/guides/airflow-executors-explained/)
- A container running PostgreSQL as the Apache Airflow's [Database Backend](https://airflow.readthedocs.io/en/stable/howto/initialize-database.html)
- A container running PostgreSQL as the application database.

It also mounts the directories `src/airflow/dags` and `src/airflow/plugins` in the Apache Airflow container so that we can work with our Capstone Project DAG.
Let's do it! Run this command in your terminal:
```bash
docker-compose up
```

Wait a minute while Docker starts the services, then open your browser and navigate to `http://localhost:8080`: Apache Airflow is up and running!
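If you would rather not keep refreshing the browser, a tiny optional helper like the one below (not part of the project; it only assumes the `requests` package is installed locally) can poll the web server until it answers:

```python
# Optional helper: wait until the Airflow webserver answers on localhost:8080.
import time

import requests

for attempt in range(30):
    try:
        if requests.get("http://localhost:8080", timeout=2).ok:
            print("Apache Airflow is up!")
            break
    except requests.ConnectionError:
        pass
    time.sleep(5)  # try again in a few seconds
else:
    print("Airflow did not come up after 30 attempts")
```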
#### Configuring connections

In the Apache Airflow console, go to the menu **Admin** and select **Connections**:
#### AEMET API connection

Create a new connection for the AEMET API using the following values:
- **Conn Id:** `aemet_conn`
- **Conn Type:** `HTTP`
- **Host:** `https://opendata.aemet.es/`
- **Extra:** `{"api_key": "your-aemet-api-key"}`

Remember to replace the value of the `api_key` property in the **Extra** field with the JSON Web Token you were given [before](#requesting-the-aemet-api-key).
Click **Save**.
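For context, this is roughly how a custom hook or operator can read that connection back at runtime, using Airflow's standard connection API. A minimal sketch (not the project's `aemet.py`):

```python
# Sketch: reading the AEMET connection created above from inside a hook/operator.
from airflow.hooks.base_hook import BaseHook

conn = BaseHook.get_connection("aemet_conn")
base_url = conn.host                    # https://opendata.aemet.es/
api_key = conn.extra_dejson["api_key"]  # the JWT stored in the Extra field
```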
#### Application database connection
Create a new connection for the application database using the following values:
- **Conn Id:** `ddbb_conn`
- **Conn Type:** `Postgres`
- **Host:** `ddbb`
- **Schema:** `weather`
- **Login:** `admin`
- **Password:** `P4ssw0rd`
- **Port:** `5432`

Click **Save**.
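Tasks reach this database through the connection id rather than hard-coded credentials. A minimal, illustrative sanity check (not project code) could look like this:

```python
# Sketch: opening the application database through the ddbb_conn connection.
from airflow.hooks.postgres_hook import PostgresHook

hook = PostgresHook(postgres_conn_id="ddbb_conn")
print(hook.get_records("SELECT 1"))  # should print [(1,)]
```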
### Running the Capstone Project DAG
You can go to the DAGs menu and see the Capstone Project DAG listed. By default, the operator in charge of importing the weather data from the AEMET API retrieves everything between **Jun 20th** and **Jul 10th**, but you can change that by editing the DAG file, located at `src/airflow/dags/capstone.py`, in the instantiation of the `ImportWeatherOperator` operator:
```python
...
import_weather = ImportWeatherOperator(
task_id='import_weather',
ddbb_conn_id='ddbb_conn',
aemet_conn_id='aemet_conn',
from_date='2019-06-20', # This is the initial date
to_date='2019-07-10' # This is the ending date
)
...
```

Use the standard [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) format (`YYYY-MM-DD`) to set the dates.
Turn **On** the switch next to the DAG name to make the Apache Airflow scheduler run the defined tasks:
You can navigate the DAG details by clicking on its name. The **Graph view** shows a diagram of how the tasks will be executed:
You can also check how the Apache Airflow scheduler is doing with the DAG tasks on the **Tree view** tab. Dark green means good!
### Analyzing the data

You can use the Apache Airflow console to run queries over the fact tables. Go to the **Ad Hoc Query** option in the **Data Profiling** menu:
And run queries like these (don't forget to select the connection `ddbb_conn`):
```sql
/*
The average, max and min temperatures in Madrid by month
*/
select *
from temps_by_month
where city = 'MADRID'
```
```sql
/*
The average, max and min temperatures of the second quarter in the Basque Country (highest max temperature first)
*/
select *
from temps_by_quarter
where quarter = 2
and city = any('{BIZKAIA,GIPUZKOA,ARABA/ALAVA}')
order by tmax desc
```
```sql
/*
The average, max and min temperatures of the 10 hottest cities this year
*/
select *
from temps_by_year
order by tavg desc
limit 10
```
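You can also pull the fact tables into [Pandas](https://pandas.pydata.org/) instead of using the console. The sketch below reuses the `ddbb_conn` connection through Airflow's `PostgresHook`; note that the `ddbb` host only resolves inside the Docker network, so this is meant to run inside the Airflow container (for example via `docker-compose exec`):

```python
# Sketch: loading a fact table into a DataFrame from inside the Airflow container.
from airflow.hooks.postgres_hook import PostgresHook

hook = PostgresHook(postgres_conn_id="ddbb_conn")
madrid = hook.get_pandas_df("select * from temps_by_month where city = 'MADRID'")
print(madrid.head())
```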
---
## Cleaning the environment

Once the DAG has run, you have checked that it finished well, and you have queried the fact tables, you can clean up the environment. Stop the Apache Airflow and PostgreSQL containers with `Ctrl + C` and, when the processes stop, run this command:
```bash
docker-compose down
```

It will delete all the resources created to run Apache Airflow (containers, networks, etc.).
And that's all :-)