ETL of newspaper article keywords using Apache Airflow, Newspaper3k, Quilt T4 and AWS S3
- Host: GitHub
- URL: https://github.com/robnewman/etl-airflow-s3
- Owner: robnewman
- License: MIT
- Created: 2019-01-25T20:41:10.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2024-10-26T00:07:54.000Z (19 days ago)
- Last Synced: 2024-10-31T21:42:08.702Z (13 days ago)
- Language: Python
- Homepage:
- Size: 143 KB
- Stars: 15
- Watchers: 3
- Forks: 4
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
# etl-airflow-s3
ETL of newspaper article keywords using Apache Airflow, Newspaper3k, Quilt T4 and AWS S3.

## Setup
As of writing, Apache Airflow does not support Python 3.7 (my default install), so we need to install a local version of Python 3.6. We do this using [pyenv](https://github.com/pyenv/pyenv).
1. Install the Homebrew version of pyenv (on macOS):
`$ brew install pyenv`
2. Install the latest Python 3.6 (as of writing, 3.6.8):
`$ pyenv install 3.6.8`
3. Use Python 3.6.8 for this project
`$ pyenv local 3.6.8`
4. Create a new virtual environment for the project (using `venv` and our installed Python 3.6.8):
`$ python -m venv /path/to/virtual-environment`
5. Activate the virtual environment:
`$ source /path/to/virtual-environment/bin/activate`
6. Install Apache Airflow:
`$ pip install apache-airflow`
7. Install Quilt T4:
`$ pip install t4`
8. Install Newspaper3k (a minimal usage sketch follows this list):
```
$ brew install libxml2 libxslt
$ brew install libtiff libjpeg webp little-cms2
$ pip install newspaper3k
$ curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python
```

9. Change `AIRFLOW_HOME` (default `~/airflow`) to your project directory:
`$ export AIRFLOW_HOME=/path/to/your/project`
10. Initialize Airflow's database (defaults to SQLite):
`$ airflow initdb`
11. Start the scheduler:
`$ airflow scheduler`
12. Start the webserver (DAG interface):
`$ airflow webserver`
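
With setup complete, you can sanity-check the extract step outside Airflow. Below is a minimal sketch of keyword extraction with Newspaper3k; the URL is a placeholder, not one of the sources this project actually scrapes:

```
from newspaper import Article

# Placeholder URL -- substitute any online news article
url = "https://example.com/some-news-article"

article = Article(url)
article.download()  # fetch the raw HTML
article.parse()     # extract title, authors, body text, etc.
article.nlp()       # keyword/summary extraction (needs the corpora from step 8)

print(article.title)
print(article.keywords)  # list of extracted keyword strings
```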
If you are storing your DAGs in a local repository (as part of a
larger version-controlled data engineering infrastructure) rather
than globally in `AIRFLOW_HOME/dags`, you'll need to update the entry
in `airflow.cfg` to reflect the new DAG folder location:

```
[core]
# The home folder for airflow, default is ~/airflow
airflow_home = /Users//airflow

# The folder where your airflow pipelines live, most likely a
# subfolder in a code repository
# This path must be absolute
dags_folder = /Users///dags
```
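
For the scheduler to pick up a file in that `dags_folder`, it only needs to define a `DAG` object at module level. Here is a minimal sketch against the Airflow 1.x API used in this README; the DAG id and callable are illustrative, not taken from this repo:

```
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def say_hello():
    print("hello from a version-controlled DAG")


# The module-level DAG object is what the scheduler discovers in dags_folder
dag = DAG(
    dag_id="example_repo_dag",  # illustrative id, not part of this project
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

hello = PythonOperator(
    task_id="say_hello",
    python_callable=say_hello,
    dag=dag,
)
```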
## Simplify the DAG web interface

By default, Airflow helpfully loads ~15 example DAGs: great for
learning, but they clutter the UI. You can remove them (which you
will definitely want to do before moving to a production environment)
by setting the `load_examples` flag to `False` in the `[core]` section
of `AIRFLOW_HOME/airflow.cfg`:

```
# Whether to load the examples that ship with Airflow. It's good to
# get started, but you probably want to set this to False in
# a production environment
load_examples = False
```

Note: If you started the Airflow scheduler and webserver _before_
updating this setting, you'll still see the example DAGs in the web
UI. To reset the view, run the following command (warning: this
will destroy all current DAG information!):

`$ airflow resetdb`
## Check your DAGs
List current DAGs:
`$ airflow list_dags`
Check each task in your DAG:
`$ airflow list_tasks <dag_id>`
Test DAG tasks end-to-end:
`$ airflow test <dag_id> <task_id> <execution_date>`
## Scrape headlines
Scrape headlines from online news sources by running the `headlines` DAG in
the Airflow user interface, or by running the following atomic tasks:

```
$ airflow test headlines scrape_articles YYYY-MM-DD
$ airflow test headlines write_to_json YYYY-MM-DD
$ airflow test headlines add_to_package YYYY-MM-DD
```

where `YYYY-MM-DD` is today's date.
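
The three task ids above suggest a linear pipeline. A hedged sketch of how such a DAG might be wired in Airflow 1.x follows; the source list, file paths, and T4 package/bucket names are illustrative guesses, not the repo's actual code, and the `t4` calls assume its `Package` API matches the later `quilt3` library:

```
import json
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from newspaper import Article

# Illustrative values -- not taken from this repository
SOURCES = ["https://example.com/article-1"]
JSON_PATH = "/tmp/headlines.json"


def scrape_articles(**context):
    keywords = {}
    for url in SOURCES:
        article = Article(url)
        article.download()
        article.parse()
        article.nlp()  # populates article.keywords
        keywords[url] = article.keywords
    return keywords  # returned value lands in XCom for the next task


def write_to_json(**context):
    keywords = context["ti"].xcom_pull(task_ids="scrape_articles")
    with open(JSON_PATH, "w") as f:
        json.dump(keywords, f)


def add_to_package(**context):
    import t4  # assumption: Package API mirrors the later quilt3 library
    pkg = t4.Package()
    pkg.set("headlines.json", JSON_PATH)
    pkg.push("examples/headlines", "s3://example-bucket")  # placeholder names


dag = DAG(
    dag_id="headlines",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

scrape = PythonOperator(task_id="scrape_articles", python_callable=scrape_articles,
                        provide_context=True, dag=dag)
write = PythonOperator(task_id="write_to_json", python_callable=write_to_json,
                       provide_context=True, dag=dag)
package = PythonOperator(task_id="add_to_package", python_callable=add_to_package,
                         provide_context=True, dag=dag)

scrape >> write >> package
```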