https://github.com/kevinknights29/airflow_wikipedia_pageviews
This project implements the Airflow DAG presented in chapter 4 of the book `Data Pipelines with Apache Airflow` by B. Harenslak and J. de Ruiter
- Host: GitHub
- URL: https://github.com/kevinknights29/airflow_wikipedia_pageviews
- Owner: kevinknights29
- License: apache-2.0
- Created: 2024-02-22T04:30:59.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-02-23T08:21:59.000Z (about 1 year ago)
- Last Synced: 2025-01-27T23:46:58.169Z (4 months ago)
- Topics: airflow, astro-cli, python, sql
- Language: Python
- Homepage:
- Size: 62.5 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Airflow_Wikipedia_Pageviews
This project implements the Airflow DAG presented in chapter 4 of the book [_Data Pipelines with Apache Airflow_ by B. Harenslak and J. de Ruiter](https://amzn.to/49qSLIV)
## Results
This pipeline fetches page views from `https://dumps.wikimedia.org/`.
Pages of interest are:
- Meta
- Microsoft
- Apple
- Amazon
- Netflix
- Nvidia

The overall pipeline runs in less than **20 seconds**. This includes fetching the compressed dump, decompressing it, processing the results, inserting them into Postgres, and running the analytics query.
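
The fetch and processing steps can be illustrated with a minimal sketch. It is not the code from `wikipedia_pageviews.py`; the URL layout and the four-column line format (`domain_code page_title view_count response_size`) are assumptions based on the public hourly pageviews dumps, and the names `dump_url` and `count_pageviews` are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Pages of interest, as listed in this README.
PAGES = {"Meta", "Microsoft", "Apple", "Amazon", "Netflix", "Nvidia"}


def dump_url(ts: datetime) -> str:
    """Build the URL of the hourly pageviews dump for the given hour.

    Assumption: dumps live under /other/pageviews/ and are gzip-compressed.
    """
    return (
        "https://dumps.wikimedia.org/other/pageviews/"
        f"{ts:%Y}/{ts:%Y-%m}/pageviews-{ts:%Y%m%d-%H}0000.gz"
    )


def count_pageviews(lines, pages=PAGES, domain="en"):
    """Sum view counts for the pages of interest on a single wiki domain."""
    counts = Counter({page: 0 for page in pages})
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip lines that do not match the assumed format
        domain_code, title, views, _size = parts
        if domain_code == domain and title in pages:
            counts[title] += int(views)
    return dict(counts)


if __name__ == "__main__":
    print(dump_url(datetime(2024, 2, 22, 4)))
    # Made-up sample lines, only to show the expected shape of the input.
    sample = ["en Apple 120 0", "de Apple 7 0", "en Netflix 30 0"]
    print(count_pageviews(sample))
```

Filtering on the domain code keeps only the English Wikipedia rows, so a page with the same title on another wiki is not counted.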

## Prerequisites
- [ ] Have Docker installed
To install, see: [Docker Desktop Install](https://www.docker.com/products/docker-desktop/)
- [ ] Have Astro CLI installed
If you use brew, you can run: `brew install astro`
For other systems, please refer to: [Install Astro CLI](https://docs.astronomer.io/astro/cli/install-cli)
## Getting Started
1. Run `astro dev init` to create the necessary files for your environment.
2. Run `astro dev start` to start the Airflow services with Docker.
3. Configure the Postgres connection by following these steps:
   1. Run `astro dev bash` to open a shell in the Airflow container.
   2. Run the following command to add the connection:

      ```bash
      airflow connections add \
          --conn-type postgres \
          --conn-host host.docker.internal \
          --conn-login postgres \
          --conn-password postgres \
          postgres_default
      ```

   Using `localhost` as the host here will cause a connection error, because inside the Airflow container `localhost` refers to the container itself rather than to your machine. For an in-depth explanation, see: [Connect to local Postgres from docker airflow](https://stackoverflow.com/questions/72452675/connect-to-local-postgres-from-docker-airflow)
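
As an alternative to the CLI command, Airflow can also read a connection from an environment variable named `AIRFLOW_CONN_<CONN_ID>` whose value is a connection URI. The sketch below builds such a URI from the same values used above. The helper name `postgres_conn_uri` is hypothetical, and whether this approach suits your setup is an assumption to verify against the Airflow documentation.

```python
from urllib.parse import quote


def postgres_conn_uri(login, password, host, port=None, schema=""):
    """Build an Airflow-style connection URI for a Postgres connection.

    Login, password, and schema are percent-encoded so that characters
    such as '@' or ':' do not break the URI.
    """
    auth = f"{quote(login, safe='')}:{quote(password, safe='')}"
    netloc = f"{auth}@{host}" + (f":{port}" if port else "")
    return f"postgres://{netloc}/{quote(schema, safe='')}"


# Same values as the `airflow connections add` command in this README.
uri = postgres_conn_uri("postgres", "postgres", "host.docker.internal")
print(f"AIRFLOW_CONN_POSTGRES_DEFAULT={uri}")
```

Note that the host is still `host.docker.internal`, for the same reason as in the CLI command.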
## Execution
To execute the DAG, open the [Airflow UI](http://localhost:8080/).
In the DAGs section, you should see a DAG called `wikipedia_pageviews`.

> NOTE: On a fresh install, your runs section will be empty instead of showing the colored run indicators you see in the image.

Click the DAG to open it, then click the trigger (`play`) button in the top-right corner to run it.

To take a look at the process flow of the pipeline, select the `Graph` view.

## Project Structure
```text
.
├── Dockerfile
├── LICENSE
├── README.md
├── dags
│   ├── sql
│   │   └── most_popular_hour_per_page.sql
│   └── wikipedia_pageviews.py
├── packages.txt
├── pyproject.toml
├── requirements.txt
└── tests
    └── dags
        └── test_dag_example.py
```

Generated with: `tree --gitignore --prune`
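
The file name `dags/sql/most_popular_hour_per_page.sql` suggests that the analytics step finds, for each page, the hour of the day with the highest average view count. The query itself is not reproduced here, so the following is only a plain-Python sketch of that idea under this assumption; the function name and the `(pagename, hour, views)` row shape are hypothetical.

```python
from collections import defaultdict


def most_popular_hour_per_page(rows):
    """Return {pagename: (hour, average_views)} for the busiest hour per page.

    `rows` is an iterable of (pagename, hour, views) tuples, one per
    pipeline run, where `hour` is the hour of day (0-23).
    """
    # (pagename, hour) -> [sum of views, number of observations]
    totals = defaultdict(lambda: [0, 0])
    for page, hour, views in rows:
        bucket = totals[(page, hour)]
        bucket[0] += views
        bucket[1] += 1

    best = {}
    for (page, hour), (total, n) in totals.items():
        avg = total / n
        # Keep the hour with the highest average for each page.
        if page not in best or avg > best[page][1]:
            best[page] = (hour, avg)
    return best


if __name__ == "__main__":
    # Made-up rows, only to show the expected input shape.
    rows = [("Apple", 9, 100), ("Apple", 9, 200), ("Apple", 10, 120)]
    print(most_popular_hour_per_page(rows))
```

In SQL, the same result would typically come from grouping by page and hour, averaging the counts, and keeping the top-ranked hour per page.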
### Have fun! 😄
## Reference
- [_Data Pipelines with Apache Airflow_ by B. Harenslak and J. de Ruiter](https://amzn.to/49qSLIV)
- [Develop your Astro project](https://docs.astronomer.io/astro/cli/develop-project)
- [Airflow Docs](https://airflow.apache.org/docs/apache-airflow/stable/index.html)
- [TemplateNotFound error when running simple Airflow BashOperator](https://stackoverflow.com/questions/42147514/templatenotfound-error-when-running-simple-airflow-bashoperator)
- [How to Change the Timezone of a Postgres Database](https://www.commandprompt.com/education/how-to-change-the-timezone-of-a-postgres-database/)
- [Airflow PostgresHook Example](https://gist.github.com/antweiss/a6716339983bcc93aa505fd0c620b013)
- [Start a process when the container starts](https://code.visualstudio.com/remote/advancedcontainers/start-processes)
- [Read JSON file using Python](https://www.geeksforgeeks.org/read-json-file-using-python/)
- [Reading and Writing JSON to a File in Python](https://www.geeksforgeeks.org/reading-and-writing-json-to-a-file-in-python/)
- [Passing a command line argument to airflow BashOperator](https://stackoverflow.com/questions/42016491/passing-a-command-line-argument-to-airflow-bashoperator)
- [Templates reference](https://airflow.apache.org/docs/apache-airflow/stable/templates-ref.html)
- [Time Zones](https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/timezone.html)