Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/sli239/featured_articles_metadata_etl
Airflow ETL Pipeline to Collect Metadata of Featured Articles from Wikipedia, for Creating a Data Lake on AWS
airflow aws docker docker-compose etl pipeline
- Host: GitHub
- URL: https://github.com/sli239/featured_articles_metadata_etl
- Owner: SLI239
- Created: 2024-10-28T02:50:26.000Z (20 days ago)
- Default Branch: main
- Last Pushed: 2024-10-29T17:04:42.000Z (18 days ago)
- Last Synced: 2024-10-29T18:29:19.237Z (18 days ago)
- Topics: airflow, aws, docker, docker-compose, etl, pipeline
- Language: Python
- Homepage:
- Size: 48.8 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# ETL Pipeline for Creating a Catalog of Featured Articles on AWS
This Airflow pipeline collects metadata of featured articles from Wikipedia in order to create a data lake on AWS. It 1) fetches metadata of the daily featured articles from Wikimedia, 2) filters and cleans the dataset, and 3) uploads the dataset to S3 and triggers an AWS Glue crawler. The pipeline creates/updates the Glue Data Catalog on a daily basis, and you can easily query the catalog using AWS Athena. The future goal of this project is to build the foundation of a collective article collection on AWS by adding more data sources.
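The DAG itself is not reproduced on this page, but a minimal sketch of the shape such a pipeline can take is shown below. The schedule, task names, field selection, S3 key layout, and the Wikimedia feed endpoint are illustrative assumptions, not the repository's actual code:
```
# Illustrative sketch only -- schedule, task names, field selection, and the
# assumed Wikimedia feed endpoint are not taken from the repository's DAG.
import json

import pendulum
import requests

from airflow.decorators import dag, task
from airflow.models import Variable
from airflow.providers.amazon.aws.hooks.s3 import S3Hook


@dag(schedule="@daily", start_date=pendulum.datetime(2024, 10, 1), catchup=False)
def featured_articles_metadata_etl():

    @task
    def fetch(ds=None):
        # Wikimedia featured-content feed for the logical date (assumed endpoint);
        # authentication with the stored client credentials is omitted here.
        year, month, day = ds.split("-")
        url = f"https://api.wikimedia.org/feed/v1/wikipedia/en/featured/{year}/{month}/{day}"
        return requests.get(url, timeout=30).json()

    @task
    def clean(payload: dict) -> dict:
        # Keep only the fields of interest from today's featured article ("tfa").
        tfa = payload.get("tfa", {})
        return {key: tfa.get(key) for key in ("title", "description", "timestamp")}

    @task
    def upload(record: dict, ds=None):
        # Write the cleaned record to S3; the Glue crawler picks it up from there.
        S3Hook(aws_conn_id="MY_AWS_CONN").load_string(
            string_data=json.dumps(record),
            key=f"featured_articles/{ds}.json",
            bucket_name=Variable.get("S3_BUCKET_NAME"),
            replace=True,
        )

    upload(clean(fetch()))
    # A final task triggers the Glue crawler named in GLUE_CRAWLER_NAME
    # (see the custom operator noted at the end of this README).


featured_articles_metadata_etl()
```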
## Prerequisites
1. **Wikimedia Client ID and Client Secret**\
Take a look at [Getting started with Wikimedia APIs](https://api.wikimedia.org/wiki/Getting_started_with_Wikimedia_APIs)
2. **AWS Stack**\
Create an S3 bucket and Glue crawler using the AWS CloudFormation template (`aws_create_stack.yaml`). Make sure that the S3 bucket is named *articles-metadata-bucket* and the Glue crawler is named *articles-metadata-crawler*. Once the stack is created, download your AWS access key (a quick sanity check follows this list).
3. **Docker Desktop**\
The installation guide on the official website is [here](https://docs.docker.com/compose/install/)
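Before wiring anything into Airflow, it can be worth confirming that the CloudFormation stack produced the exact resource names required above. A minimal sketch, assuming boto3 is installed and configured with the access key downloaded in step 2 (this check is not part of the repository):
```
# Optional sanity check (not part of this repository): confirm the CloudFormation
# stack created the bucket and crawler under the names the pipeline expects.
import boto3

s3 = boto3.client("s3")
glue = boto3.client("glue")

s3.head_bucket(Bucket="articles-metadata-bucket")     # raises ClientError if missing
glue.get_crawler(Name="articles-metadata-crawler")    # raises EntityNotFoundException if missing
print("S3 bucket and Glue crawler are in place.")
```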
## How to Run
1. Clone the repository and set up the directory structure
```
git clone https://github.com/SLI239/featured_articles_metadata_etl.git
cd featured_articles_metadata_etl
mkdir ./config ./data ./logs ./plugins
```
2. Build the Docker image and run Docker Compose
```
docker build --build-arg AIRFLOW_VERSION=2.10.2 -t apache/airflow-custom:2.10.2 .
docker-compose up
```
3. Set up Variables and Connections on the Airflow web UI (the default login ID and password are both `airflow`); the sketch after this list shows how these values are read back at runtime

   - Variables

     | Key | Value |
     | ----------------------- | ------- |
     | **WIKI_ACCESS_TOKEN** | *{"client_id":"`your client id`", "client_secret":"`your client secret`"}* |
     | **S3_BUCKET_NAME** | *articles-metadata-bucket* |
     | **GLUE_CRAWLER_NAME** | *articles-metadata-crawler* |

   - Connections

     | Conn Id | Conn Type | Extra |
     | ----------------------- | ------- | ----------------------- |
     | **MY_AWS_CONN** | Amazon Web Services | *{"aws_access_key_id":"`your access key id`", "aws_secret_access_key": "`your secret access key`"}* |
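For reference, this is roughly how values stored under those keys are read back inside a task. A short sketch assuming the Airflow Amazon provider is installed; the repository's actual helper code may differ:
```
# Sketch: reading the Variables and the AWS connection from inside a task.
from airflow.models import Variable
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

creds = Variable.get("WIKI_ACCESS_TOKEN", deserialize_json=True)  # {"client_id": ..., "client_secret": ...}
bucket = Variable.get("S3_BUCKET_NAME")                           # "articles-metadata-bucket"
crawler = Variable.get("GLUE_CRAWLER_NAME")                       # "articles-metadata-crawler"

# The AWS connection is referenced by its Conn Id; hooks and operators pick up
# the access key pair stored in the connection's Extra field automatically.
s3_hook = S3Hook(aws_conn_id="MY_AWS_CONN")
```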
## Note
- **Take extra caution in securing the Wikimedia token**: the API token is a bearer token that is pushed to XCom and can easily be found on the web UI
- `GlueTriggerCrawlerOperator` is from [data-pipelines-with-apache-airflow](https://github.com/BasPH/data-pipelines-with-apache-airflow/blob/master/chapter16/dags/custom/operators.py); a minimal stand-in is sketched after this list
- More information about running Airflow with the CeleryExecutor in Docker is [here](https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html)
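If you would rather not pull in the book's operator, the core of what it does can be approximated with a small custom operator. This is a rough stand-in, not the book's implementation; the class name and polling logic are illustrative:
```
# Illustrative stand-in for a crawler-trigger operator (not the book's code):
# start the Glue crawler and poll until it returns to the READY state.
import time

from airflow.models import BaseOperator
from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook


class TriggerGlueCrawlerOperator(BaseOperator):
    def __init__(self, aws_conn_id, crawler_name, poll_interval=30, **kwargs):
        super().__init__(**kwargs)
        self.aws_conn_id = aws_conn_id
        self.crawler_name = crawler_name
        self.poll_interval = poll_interval

    def execute(self, context):
        # Boto3 Glue client built from the Airflow AWS connection.
        glue = AwsBaseHook(aws_conn_id=self.aws_conn_id, client_type="glue").get_conn()
        glue.start_crawler(Name=self.crawler_name)
        while glue.get_crawler(Name=self.crawler_name)["Crawler"]["State"] != "READY":
            time.sleep(self.poll_interval)
```
Recent versions of the Amazon provider also ship a built-in `GlueCrawlerOperator` that covers similar ground.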