https://github.com/matmoore/etl
Data pipelines for rescuing my data from various web services
- Host: GitHub
- URL: https://github.com/matmoore/etl
- Owner: MatMoore
- Created: 2020-12-06T13:41:47.000Z (over 5 years ago)
- Default Branch: main
- Last Pushed: 2020-12-31T14:14:42.000Z (over 5 years ago)
- Last Synced: 2025-04-02T22:30:41.226Z (about 1 year ago)
- Language: Python
- Homepage:
- Size: 7.81 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# ETL
This repository contains data pipelines for rescuing my data from various web services.
These are implemented in Python and Airflow.
Goals:
- I have a backup of my data in case the service goes away or I stop using it
- I can easily query my data without writing code
- I can join and aggregate data from different sources
- I can filter data by the date it was generated
## Local development
### Python setup
Install Python 3.8.
```
python -m venv env
. env/bin/activate
pip install -r requirements.txt --use-deprecated legacy-resolver
```
### Airflow setup
(Optional) [Configure Airflow](https://airflow.apache.org/docs/apache-airflow/stable/howto/set-config.html). By default, `~/airflow/airflow.cfg` is generated the first time you run an Airflow command.
Out of the box, Airflow uses the SequentialExecutor with SQLite, which is not recommended for production usage. For a more production-like setup using Postgres, change the following settings:
```
sql_alchemy_conn = postgresql://etl:etl@localhost/airflow
executor = LocalExecutor
```
By default, `dags_folder` is set to `$AIRFLOW_HOME/dags`. To run these DAGs, point it at the `src/dags` directory ([either in the Airflow config or via environment variables](https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#dags-folder)).
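As a rough illustration of the environment-variable route, Airflow reads config overrides from variables named `AIRFLOW__{SECTION}__{KEY}`. The sketch below builds that name for `dags_folder` and prints an `export` line you could paste into a shell. The helper name `airflow_env_var` is made up for this example, and it assumes the script is run from the root of the repository checkout.

```python
from pathlib import Path


def airflow_env_var(section: str, key: str) -> str:
    """Build the environment variable name Airflow reads for a config option.

    Hypothetical helper for illustration; it only applies the
    AIRFLOW__{SECTION}__{KEY} naming convention.
    """
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


# Assumption: the current working directory is the root of this repository.
dags_folder = Path.cwd() / "src" / "dags"

name = airflow_env_var("core", "dags_folder")

# Print a line that can be pasted into a shell before running airflow commands.
print(f"export {name}={dags_folder}")
```

Setting the variable in the shell (rather than in `airflow.cfg`) keeps the override local to this project.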
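Run `airflow initdb` to initialise the Airflow metadata database (on Airflow 2.x the equivalent command is `airflow db init`).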
### Database setup
Create a Postgres database for all the data to go into. E.g. on Ubuntu:
`sudo -u postgres createdb <dbname>`
Create a database user as well:
```
sudo -u postgres psql
> create user <username> with password '<password>';
> grant all privileges on database dbname to <username>;
```
### Generate API keys
Generate an Airtable API key on the [account page](https://airtable.com/account).
### Configure the local environment
Create a `.env` file in the root of the project. Set the following environment variables:
```
AIRTABLE_API_KEY=####
AIRFLOW_CONN_POSTGRES_MOVIES=postgresql://<username>:<password>@localhost/<dbname>
```
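To make the shape of the connection variable concrete, here is a minimal sketch that splits such a URI into its parts using only the standard library. Airflow derives the connection ID from the part of the variable name after `AIRFLOW_CONN_`, lowercased. The function `describe_connection` and the example credentials are invented for illustration and are not part of this repository.

```python
from urllib.parse import urlsplit

PREFIX = "AIRFLOW_CONN_"


def describe_connection(env_name: str, uri: str) -> dict:
    """Split an AIRFLOW_CONN_* variable into a connection ID and URI parts.

    Hypothetical helper: it mirrors the naming convention but does not
    call Airflow itself.
    """
    if not env_name.startswith(PREFIX):
        raise ValueError(f"{env_name} does not start with {PREFIX}")
    parts = urlsplit(uri)
    return {
        "conn_id": env_name[len(PREFIX):].lower(),
        "scheme": parts.scheme,
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "database": parts.path.lstrip("/"),
    }


# Placeholder credentials, not real ones.
info = describe_connection(
    "AIRFLOW_CONN_POSTGRES_MOVIES",
    "postgresql://example_user:example_password@localhost/example_db",
)
print(info["conn_id"], info["host"], info["database"])
```

Under this convention, tasks would refer to the connection as `postgres_movies`.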
### Debugging commands
```
# Run a task in isolation
airflow test syncing_movie_and_tv_data create_my_ratings 2015-06-01
# Backfill
airflow backfill syncing_movie_and_tv_data -s 2020-12-06
```
### Automated tests
There is a CI build defined in `.github/workflows/ci.yaml`.
You can run this build locally using [act](https://github.com/nektos/act).
```
# Workaround for ubuntu-latest not working
cd .github/workflows/ubuntu2004
docker build .
docker image tag sha256:<image-id> github-ubuntu-20.04
act -P ubuntu-latest=github-ubuntu-20.04
```
## Pipelines
### TV and Movies
This builds a database of TV shows and movies I've watched.
How this should work:
- Identify shows and movies by IMDb ID and/or [TMDB](https://www.themoviedb.org/) ID
- Get my ratings from Airtable and/or Letterboxd
- Store images in S3
- Copy movie metadata to Airtable
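The first two steps above could be sketched roughly as follows: pull an IMDb ID (`tt` followed by digits) out of a URL, then use it as the key to attach a rating to a metadata record. This is not the pipeline's actual code. The record fields (`imdb_url`, `imdb_id`, `rating`, `year`) and the sample rows are assumptions made up for the example.

```python
import re
from typing import Dict, List, Optional

# IMDb title IDs look like "tt" followed by at least seven digits.
IMDB_ID = re.compile(r"tt\d{7,}")


def extract_imdb_id(text: str) -> Optional[str]:
    """Return the first IMDb title ID found in the text, or None."""
    match = IMDB_ID.search(text)
    return match.group(0) if match else None


def join_ratings(ratings: List[Dict], metadata: List[Dict]) -> List[Dict]:
    """Attach each rating to the metadata record sharing its IMDb ID.

    Ratings with no matching metadata are skipped.
    """
    by_id = {record["imdb_id"]: record for record in metadata}
    joined = []
    for rating in ratings:
        imdb_id = extract_imdb_id(rating["imdb_url"])
        if imdb_id in by_id:
            joined.append({**by_id[imdb_id], "rating": rating["rating"]})
    return joined


# Made-up sample rows standing in for ratings and metadata exports.
ratings = [
    {"imdb_url": "https://www.imdb.com/title/tt1234567/", "rating": 8},
    {"imdb_url": "https://www.imdb.com/title/tt7654321/", "rating": 6},
]
metadata = [{"imdb_id": "tt1234567", "title": "Example Film A", "year": 2001}]

joined = join_ratings(ratings, metadata)
print(joined)
```

Keying on the IMDb ID rather than the title avoids mismatches between services that spell or translate titles differently.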
## Licence
All code is licensed under the MIT licence.