An open API service indexing awesome lists of open source software.

https://github.com/matmoore/etl

Data pipelines for rescuing my data from various web services
https://github.com/matmoore/etl

Last synced: 7 months ago
JSON representation

Data pipelines for rescuing my data from various web services

Awesome Lists containing this project

README

          

# ETL
This repository contains data pipelines for rescuing my data from various web services.

These are implemented in Python and Airflow.

Goals:

- I have a backup of my data in case the service goes away or I stop using it
- I can easily query my data without writing code
- I can join and aggregate data from different sources
- I can filter data by the date it was generated

## Local development

### Python setup
Install Python 3.8.

```
python -m venv env
. env/bin/activate
pip install -r requirements.txt --use-deprecated legacy-resolver
```

### Airflow setup
(Optional) [configure airflow](https://airflow.apache.org/docs/apache-airflow/stable/howto/set-config.html). By default, `~/airflow/airflow.cfg` will be generated when you install airflow.

By default this will use the SequentialExecutor with SQLite, which is not recommended for production usage. For a more production-like setup using Postgres, change the following settings:

```
sql_alchemy_conn = postgres://etl:etl@localhost/airflow
executor = LocalExecutor
```

By default `dags_folder` will be set to `$AIRFLOW_HOME/dags`. To run these dags, configure it to point to the `src/dags` directory ([either in the airflow config or via environment variables](https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#dags-folder)).

Run `airflow initdb`

### Database setup
Create a postgres database for all the data to go into. E.g. on ubuntu:

`sudo -u postgres createdb `

Create a database user as well:

```
sudo -u postgres psql
> create user with password '';
> grant all privileges on database dbname to ;
```

### Generate API keys
Generate an Airtable API key on the [account page](https://airtable.com/account).

### Configure the local environment
Create a `.env` file in the root of the project. Set the following environment variables:

```
AIRTABLE_API_KEY=####
AIRFLOW_CONN_POSTGRES_MOVIES=postgresql://:@localhost/
```

### Debugging commands
```
# Run a task in isolation
airflow test syncing_movie_and_tv_data create_my_ratings 2015-06-01

# Backfill
airflow backfill syncing_movie_and_tv_data -s 2020-12-06
```

### Automated tests
There is a CI build defined in `.github/workflows/ci.yaml`.

You can run this build locally using [act](https://github.com/nektos/act).

```
# Workaround for ubuntu-latest not working
cd .github/workflows/ubuntu2004
docker build .
docker image tag sha256: github-ubuntu-20.04

act -P ubuntu-latest=github-ubuntu-20.04
```

## Pipelines

### TV and Movies
This builds a database of TV shows and movies I've watched.

How this should work:

- Identify shows and movies by IMDB ID and/or https://www.themoviedb.org/
- Get my ratings from Airtable and/or Letterboxd
- Store images in S3
- Copy movie metadata to Airtable

## Licence

All code is licenced under MIT.