https://github.com/iht/tfx-airflow-summit-2022
- Host: GitHub
- URL: https://github.com/iht/tfx-airflow-summit-2022
- Owner: iht
- License: apache-2.0
- Created: 2022-05-15T14:51:41.000Z (about 4 years ago)
- Default Branch: main
- Last Pushed: 2022-05-18T13:35:44.000Z (about 4 years ago)
- Last Synced: 2025-01-13T02:37:04.287Z (over 1 year ago)
- Language: Jupyter Notebook
- Size: 24.4 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# tfx-airflow-summit-2022
Python 3.7
Install MariaDB: `brew install mariadb`
Initialize the Airflow database: `airflow db init`
Create an admin user: `airflow users create --role Admin --username admin --email admin --firstname admin --lastname admin --password admin`
Start the webserver: `airflow webserver -p 8080`
Set `expose_config = True` in `~/airflow/airflow.cfg`
Run `airflow scheduler` in another terminal
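
The `expose_config` change can also be scripted. The following is a minimal sketch using Python's standard `configparser`, assuming the option lives in the `[webserver]` section (as in Airflow 2.x) and that losing the comments in `airflow.cfg` on rewrite is acceptable. `set_expose_config` is a hypothetical helper name, not part of Airflow.

```python
import configparser
from pathlib import Path


def set_expose_config(cfg_path: Path, value: bool = True) -> None:
    """Set [webserver] expose_config in an Airflow config file (hypothetical helper)."""
    # interpolation=None leaves '%' characters (e.g. in log formats) untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(cfg_path)
    if not parser.has_section("webserver"):
        parser.add_section("webserver")
    parser.set("webserver", "expose_config", str(value))
    # Caveat: configparser drops comments when it writes the file back.
    with cfg_path.open("w") as f:
        parser.write(f)


# Example call (assumes the default Airflow home used in this README):
# set_expose_config(Path.home() / "airflow" / "airflow.cfg")
```

Airflow also reads the same setting from the `AIRFLOW__WEBSERVER__EXPOSE_CONFIG` environment variable, which avoids editing the file at all. Either way, the webserver probably needs a restart to pick up the change.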
Run the BigQuery step without setting any Beam options, confirm that it fails, and check the BQ jobs in the console UI: https://console.cloud.google.com/bigquery?project=tfx-airflow-summit-2022
Add Beam pipeline args for the Direct Runner
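
A sketch of what such args might look like, not necessarily the ones used in this repo. It assumes the project ID from the console URL above and a hypothetical bucket `gs://my-bucket` for temporary files. `direct_runner_beam_args` is an illustrative helper name. The flags themselves are standard Beam pipeline options.

```python
from typing import List


def direct_runner_beam_args(project: str, temp_location: str,
                            num_workers: int = 0) -> List[str]:
    """Build Beam pipeline args for reading from BigQuery with the Direct Runner."""
    return [
        "--runner=DirectRunner",
        # BigQuery reads need a GCP project and a GCS location for temp files.
        f"--project={project}",
        f"--temp_location={temp_location}",
        # Run workers as separate processes; 0 asks Beam to pick the count.
        "--direct_running_mode=multi_processing",
        f"--direct_num_workers={num_workers}",
    ]


BEAM_PIPELINE_ARGS = direct_runner_beam_args(
    project="tfx-airflow-summit-2022",
    temp_location="gs://my-bucket/tmp",  # hypothetical bucket, replace with yours
)
```

The resulting list would be passed as `beam_pipeline_args` when constructing the TFX pipeline.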
Check the output in /tmp:
`find /tmp/tfx-airflow-summit-2022/BigQueryExampleGen/examples/`
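
The same check can be done from a notebook. This is a small standard-library sketch that groups the files under the output directory by subdirectory. `list_artifacts` is a hypothetical helper, and the exact directory layout (execution IDs, split names) depends on the TFX version.

```python
from pathlib import Path
from typing import Dict, List


def list_artifacts(root: Path) -> Dict[str, List[str]]:
    """Map each subdirectory (relative to root) to the file names inside it."""
    found: Dict[str, List[str]] = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel_dir = path.parent.relative_to(root).as_posix()
            found.setdefault(rel_dir, []).append(path.name)
    return found


# Example call, using the path from the step above:
# for rel_dir, names in list_artifacts(
#         Path("/tmp/tfx-airflow-summit-2022/BigQueryExampleGen/examples")).items():
#     print(rel_dir, names)
```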
Add the statistics, schema and validator components (StatisticsGen, SchemaGen, ExampleValidator)
Check the stats (compare train and eval for drift in distribution) and decide to normalize some features using the Transform component
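
To make the idea concrete, here is a standard-library sketch of the arithmetic. It compares the mean of a feature across splits, then z-score normalizes both splits with the train statistics. The values are made up, and z-scoring is only one common choice of normalization. In a real `preprocessing_fn`, a function such as `tft.scale_to_z_score` would play this role.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def summarize(values: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, population standard deviation) of a feature."""
    return mean(values), pstdev(values)


def z_score(values: Sequence[float], mu: float, sigma: float) -> List[float]:
    """Scale values to zero mean and unit variance using the given statistics."""
    if sigma == 0:
        return [0.0 for _ in values]  # constant feature: nothing to scale
    return [(v - mu) / sigma for v in values]


# Made-up values for one numeric feature in each split.
train = [10.0, 12.0, 14.0]
eval_ = [11.0, 13.0, 18.0]

train_mean, train_std = summarize(train)
eval_mean, _ = summarize(eval_)

# Shift of the eval mean, measured in train standard deviations.
mean_shift = abs(eval_mean - train_mean) / train_std

# Both splits are scaled with the statistics computed on train only.
train_scaled = z_score(train, train_mean, train_std)
eval_scaled = z_score(eval_, train_mean, train_std)
```

Scaling eval with the train statistics mirrors how Transform computes its full-pass statistics once and then applies the same transformation everywhere.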
Put the `preprocessing_fn` module in a directory accessible by Airflow (for the Direct Runner), then move it to GCS
Metadata DB: it should be available to Airflow, but metadata is passed along through the TFX components, so the components themselves don't need access to the metadata DB