ETL of newspaper article keywords using Apache Airflow, Newspaper3k, Quilt T4 and AWS S3
- Host: GitHub
- URL: https://github.com/robnewman/etl-airflow-s3
- Owner: robnewman
- License: MIT
- Created: 2019-01-25T20:41:10.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2024-10-26T00:07:54.000Z (19 days ago)
- Last Synced: 2024-10-31T21:42:08.702Z (13 days ago)
- Language: Python
- Homepage:
- Size: 143 KB
- Stars: 15
- Watchers: 3
- Forks: 4
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
# etl-airflow-s3
ETL of newspaper article keywords using Apache Airflow, Newspaper3k, Quilt T4 and AWS S3.

## Setup
As of writing, Apache Airflow does not support Python 3.7 (my default install), so we need to install a local version of Python 3.6. We do this using [pyenv](https://github.com/pyenv/pyenv).
1. Install the Homebrew version of pyenv (on macOS):
`$ brew install pyenv`
2. Install the latest Python 3.6 (as of writing, 3.6.8):
`$ pyenv install 3.6.8`
3. Use Python 3.6.8 for this project
`$ pyenv local 3.6.8`
4. Create a new virtual environment for the project (using `venv` and our installed Python 3.6.8):
`$ python -m venv /path/to/virtual-environment`
5. Activate the virtual environment:
`$ source /path/to/virtual-environment/bin/activate`
6. Install Apache Airflow:
`$ pip install apache-airflow`
7. Install Quilt T4:
`$ pip install t4`
8. Install Newspaper3k (a minimal usage sketch follows this list):
```
$ brew install libxml2 libxslt
$ brew install libtiff libjpeg webp little-cms2
$ pip install newspaper3k
$ curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python
```

9. Change `AIRFLOW_HOME` (default `~/airflow`) to your project directory:
`$ export AIRFLOW_HOME=/path/to/your/project`
10. Initialize Airflow's database (defaults to SQLite):
`$ airflow initdb`
11. Start the scheduler:
`$ airflow scheduler`
12. Start the webserver (DAG interface):
`$ airflow webserver`
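
With setup complete, you can sanity-check the extract step outside Airflow. Below is a minimal sketch of keyword extraction with Newspaper3k; the URL is a placeholder, not one of the sources this project actually scrapes:

```
from newspaper import Article

# Placeholder URL -- substitute any online news article
url = "https://example.com/some-news-article"

article = Article(url)
article.download()  # fetch the raw HTML
article.parse()     # extract title, authors, body text, etc.
article.nlp()       # keyword/summary extraction (needs the corpora from step 8)

print(article.title)
print(article.keywords)  # list of extracted keyword strings
```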
If you are storing your DAGs in a local repository (as part of a
larger version-controlled data engineering infrastructure) rather
than globally in `AIRFLOW_HOME/dags`, you'll need to update the entry
in `airflow.cfg` to reflect the new DAG folder location:

```
[core]
# The home folder for airflow, default is ~/airflow
airflow_home = /Users//airflow

# The folder where your airflow pipelines live, most likely a
# subfolder in a code repository
# This path must be absolute
dags_folder = /Users///dags
```
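
For the scheduler to pick up a file in that `dags_folder`, it only needs to define a `DAG` object at module level. Here is a minimal sketch against the Airflow 1.x API used in this README; the DAG id and callable are illustrative, not taken from this repo:

```
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def say_hello():
    print("hello from a version-controlled DAG")


# The module-level DAG object is what the scheduler discovers in dags_folder
dag = DAG(
    dag_id="example_repo_dag",  # illustrative id, not part of this project
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

hello = PythonOperator(
    task_id="say_hello",
    python_callable=say_hello,
    dag=dag,
)
```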
## Simplify the DAG web interface

By default, Airflow helpfully loads ~15 example DAGs: great for
learning, but they clutter the UI. You can remove them (which you
will definitely want to do before moving to a production environment)
by setting the `load_examples` flag to `False` in the `[core]` section
of `AIRFLOW_HOME/airflow.cfg`:

```
# Whether to load the examples that ship with Airflow. It's good to
# get started, but you probably want to set this to False in
# a production environment
load_examples = False
```

Note: If you started the Airflow scheduler and webserver _before_
updating this setting, you'll still see the example DAGs in the web
UI. To reset the view, run the following command (warning: this
will destroy all current DAG information!):

`$ airflow resetdb`
## Check your DAGs
List current DAGs:
`$ airflow list_dags`
Check each task in your DAG:
`$ airflow list_tasks <dag_id>`
Test DAG tasks end-to-end:
`$ airflow test <dag_id> <task_id> <execution_date>`
## Scrape headlines
Scrape headlines from online news sources by running the `headlines` DAG in
the Airflow user interface, or by running the following atomic tasks:

```
$ airflow test headlines scrape_articles YYYY-MM-DD
$ airflow test headlines write_to_json YYYY-MM-DD
$ airflow test headlines add_to_package YYYY-MM-DD
```

where `YYYY-MM-DD` is today's date.
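
The three task ids above suggest a linear pipeline. A hedged sketch of how such a DAG might be wired in Airflow 1.x follows; the source list, file paths, and T4 package/bucket names are illustrative guesses, not the repo's actual code, and the `t4` calls assume its `Package` API matches the later `quilt3` library:

```
import json
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from newspaper import Article

# Illustrative values -- not taken from this repository
SOURCES = ["https://example.com/article-1"]
JSON_PATH = "/tmp/headlines.json"


def scrape_articles(**context):
    keywords = {}
    for url in SOURCES:
        article = Article(url)
        article.download()
        article.parse()
        article.nlp()  # populates article.keywords
        keywords[url] = article.keywords
    return keywords  # returned value lands in XCom for the next task


def write_to_json(**context):
    keywords = context["ti"].xcom_pull(task_ids="scrape_articles")
    with open(JSON_PATH, "w") as f:
        json.dump(keywords, f)


def add_to_package(**context):
    import t4  # assumption: Package API mirrors the later quilt3 library
    pkg = t4.Package()
    pkg.set("headlines.json", JSON_PATH)
    pkg.push("examples/headlines", "s3://example-bucket")  # placeholder names


dag = DAG(
    dag_id="headlines",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

scrape = PythonOperator(task_id="scrape_articles", python_callable=scrape_articles,
                        provide_context=True, dag=dag)
write = PythonOperator(task_id="write_to_json", python_callable=write_to_json,
                       provide_context=True, dag=dag)
package = PythonOperator(task_id="add_to_package", python_callable=add_to_package,
                         provide_context=True, dag=dag)

scrape >> write >> package
```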