https://github.com/atlasoflivingaustralia/pipelines-airflow
About Airflow DAGs and supporting files for running pipelines on Apache Airflow with Elastic Map Reduce.
- Host: GitHub
- URL: https://github.com/atlasoflivingaustralia/pipelines-airflow
- Owner: AtlasOfLivingAustralia
- License: other
- Created: 2023-09-05T11:49:56.000Z (over 1 year ago)
- Default Branch: develop
- Last Pushed: 2025-04-17T04:38:00.000Z (about 1 month ago)
- Last Synced: 2025-04-17T18:42:16.628Z (about 1 month ago)
- Topics: airflow, spark
- Language: Python
- Homepage:
- Size: 250 KB
- Stars: 0
- Watchers: 3
- Forks: 1
- Open Issues: 8
- Metadata Files:
  - Readme: README.md
  - License: LICENSE.md
README
# pipelines-airflow
Airflow DAGs and supporting files for running pipelines on Apache Airflow
with Elastic Map Reduce.

## Installation
These scripts have been tested with Amazon Managed Workflows for Apache Airflow (MWAA) and Amazon EMR.
## DAGs
This section describes some of the important DAGs in this project.
### [load_dataset_dag.py](dags/load_dataset_dag.py)
Steps:
* Look up the dataset in the collectory
* Retrieve the details of the DwCA associated with the dataset
* Copy the DwCA to S3 for ingestion
* Determine the file size of the dataset, and run the pipelines on either:
  * a single-node cluster for a small dataset, or
  * a multi-node cluster for a large dataset (see the sketch below)
* Run all pipelines to ingest the dataset, excluding SOLR indexing.
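
The size-based branching could look roughly like the following TaskFlow sketch. The bucket name, key layout, 500 MB threshold and task ids are illustrative assumptions, not values taken from this repository.

```python
# Minimal sketch only: bucket name, key layout, the 500 MB threshold and the
# task ids are assumptions, not values taken from this repository.
import boto3
from airflow.decorators import dag, task
from pendulum import datetime


@dag(schedule=None, start_date=datetime(2023, 1, 1), catchup=False)
def load_dataset_sketch():

    @task.branch
    def choose_cluster(dataset_id: str = "dr123"):
        s3 = boto3.client("s3")
        # Head the staged DwCA to get its size without downloading it.
        head = s3.head_object(
            Bucket="example-pipelines-bucket",                   # assumed bucket
            Key=f"dwca-imports/{dataset_id}/{dataset_id}.zip",   # assumed layout
        )
        size_mb = head["ContentLength"] / (1024 * 1024)
        # Small datasets go to a single-node cluster, large ones to multi-node.
        return "ingest_single_node" if size_mb < 500 else "ingest_multi_node"

    @task(task_id="ingest_single_node")
    def ingest_single_node():
        ...  # submit the pipeline steps to a single-node EMR cluster

    @task(task_id="ingest_multi_node")
    def ingest_multi_node():
        ...  # submit the pipeline steps to a multi-node EMR cluster

    choose_cluster() >> [ingest_single_node(), ingest_multi_node()]


load_dataset_sketch()
```

Only the path returned by `choose_cluster` runs for a given dataset; the other ingest task is skipped.

### [load_provider_dag.py](dags/load_provider_dag.py)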
Steps:
* Look up the data provider in the collectory
* Retrieve the details of the DwCAs associated with the provider's datasets
* Copy the DwCAs for all of the provider's datasets to S3, ready for ingestion
* Run all pipelines to ingest the datasets, excluding SOLR indexing

This can be used to load all the datasets associated with an IPT.
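
The provider lookup and copy step could be sketched as follows. The collectory endpoint paths, response fields, bucket name and key layout are assumptions for illustration, not this project's real configuration.

```python
# Rough sketch only: endpoint paths, response fields, bucket and key layout
# are assumptions, not taken from this repository.
import boto3
import requests

COLLECTORY_URL = "https://collections.ala.org.au/ws"   # assumed base URL
TARGET_BUCKET = "example-pipelines-bucket"             # assumed bucket


def stage_provider_archives(provider_uid: str) -> list[str]:
    """Find the datasets for a data provider and copy each DwCA to S3."""
    s3 = boto3.client("s3")
    provider = requests.get(
        f"{COLLECTORY_URL}/dataProvider/{provider_uid}", timeout=30
    ).json()
    staged = []
    # Assumed response shape: the provider record lists its data resources.
    for resource in provider.get("dataResources", []):
        uid = resource["uid"]
        detail = requests.get(f"{COLLECTORY_URL}/dataResource/{uid}", timeout=30).json()
        archive_url = detail.get("connectionParameters", {}).get("url")  # assumed field
        if not archive_url:
            continue
        # Stream the DwCA straight into S3, ready for the ingest pipelines.
        with requests.get(archive_url, stream=True, timeout=300) as resp:
            resp.raise_for_status()
            s3.upload_fileobj(resp.raw, TARGET_BUCKET, f"dwca-imports/{uid}/{uid}.zip")
        staged.append(uid)
    return staged
```

Each staged archive can then go through the same ingestion path as a single dataset load.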
### [ingest_small_datasets_dag.py](dags/ingest_small_datasets_dag.py)
A DAG used by the `Ingest_all_datasets` DAG to load large numbers of small datasets using a **single-node cluster** in EMR.
This will not run SOLR indexing.
Includes the following options:
* `load_images` - whether to load images for archives
* `skip_dwca_to_verbatim` - skip the DWCA to Verbatim stage (which is expensive), and just reprocess
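
These options could be read from the DAG run configuration roughly as sketched below; only the option names come from this README, and the command-line flags handed to the pipelines are assumptions.

```python
# Minimal sketch of reading the ingest options; the flags passed to the
# pipelines are assumptions, only the option names come from this README.
from airflow.decorators import dag, task
from pendulum import datetime


@dag(
    dag_id="ingest_small_datasets_sketch",
    schedule=None,
    start_date=datetime(2023, 1, 1),
    catchup=False,
    params={"load_images": False, "skip_dwca_to_verbatim": False},
)
def ingest_small_datasets_sketch():

    @task
    def build_pipeline_args(params=None) -> list[str]:
        # With Airflow's default dag_run_conf_overrides_params setting, values
        # supplied in the trigger conf override these DAG-level defaults.
        args = ["--cluster-type=single-node"]        # assumed flag
        if params["load_images"]:
            args.append("--load-images")             # assumed flag
        if not params["skip_dwca_to_verbatim"]:
            args.append("--run-dwca-to-verbatim")    # assumed flag
        return args

    build_pipeline_args()


ingest_small_datasets_sketch()
```

The same pattern applies to the large-dataset DAG below, which differs mainly in the EMR cluster it provisions.

### [ingest_large_datasets_dag.py](dags/ingest_large_datasets_dag.py)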
A DAG used by the `Ingest_all_datasets` DAG to load large numbers of large datasets using a **multi-node cluster** in EMR.
This will not run SOLR indexing.
Includes the following options:
* `load_images` - whether to load images for archives
* `skip_dwca_to_verbatim` - skip the DWCA to Verbatim stage (which is expensive), and just reprocess
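
Provisioning a multi-node EMR cluster with the Amazon provider's operator could look roughly like the sketch below; the release label, instance types and counts are assumptions, not this project's real cluster configuration.

```python
# Minimal sketch only: the EMR release label, instance types and counts are
# assumptions, not this project's real cluster configuration.
from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrCreateJobFlowOperator
from pendulum import datetime

MULTI_NODE_JOB_FLOW = {
    "Name": "pipelines-large-datasets",
    "ReleaseLabel": "emr-6.15.0",                    # assumed EMR release
    "Applications": [{"Name": "Spark"}],
    "Instances": {
        "InstanceGroups": [
            {"Name": "Primary", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Workers", "InstanceRole": "CORE",
             "InstanceType": "m5.2xlarge", "InstanceCount": 4},  # assumed sizing
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

with DAG(
    dag_id="ingest_large_datasets_sketch",
    schedule=None,
    start_date=datetime(2023, 1, 1),
    catchup=False,
):
    EmrCreateJobFlowOperator(
        task_id="create_multi_node_cluster",
        job_flow_overrides=MULTI_NODE_JOB_FLOW,
    )
```

A single-node variant would differ mainly by dropping the CORE instance group.

### [ingest_all_datasets_dag.py](dags/ingest_all_datasets_dag.py)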
Steps:
* Retrieve a list of all available DwCAs in S3
* Run all pipelines to ingest each dataset. To do this, it creates (see the sketch below):
  * Several single-node clusters for small datasets
  * Several multi-node clusters for large datasets
  * A single multi-node cluster for the largest dataset (eBird)
Includes the following options:
* `load_images` - whether to load images for archives
* `skip_dwca_to_verbatim` - skip the DWCA to Verbatim stage (which is expensive), and just reprocess
* `run_index` - whether to run a complete reindex on completion of ingestion
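
The split of work across clusters amounts to partitioning the archives found in S3 by size, roughly as below; the size threshold and the eBird placeholder id are assumptions.

```python
# Minimal sketch: the 500 MB threshold and the eBird placeholder id are
# assumptions, not values from this repository.
from collections.abc import Iterable


def partition_datasets(
    datasets: Iterable[tuple[str, int]],
    small_limit_bytes: int = 500 * 1024 * 1024,
    ebird_uid: str = "dr-ebird",          # placeholder id, not the real one
) -> dict[str, list[str]]:
    """Split (dataset_id, archive_size) pairs into the three cluster groups."""
    groups: dict[str, list[str]] = {"small": [], "large": [], "ebird": []}
    for dataset_id, size in datasets:
        if dataset_id == ebird_uid:
            groups["ebird"].append(dataset_id)
        elif size < small_limit_bytes:
            groups["small"].append(dataset_id)
        else:
            groups["large"].append(dataset_id)
    return groups


# Example: three archives discovered in S3.
print(partition_datasets([
    ("dr1", 10_000_000),
    ("dr2", 900_000_000),
    ("dr-ebird", 5_000_000_000),
]))
```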
### [full_index_to_solr.py](dags/full_index_to_solr.py)
Steps:
* Run Sampling of environmental and contextual layers
* Run Jackknife environmental outlier detection
* Run Clustering
* Run Expert Distribution outlier detection
* Run SOLR indexing for all datasets
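
These stages run in sequence, which could be expressed as an ordered list of EMR steps, roughly as below; the spark-submit class names, jar location and conf key are assumptions, only the stage names come from this README.

```python
# Minimal sketch: the spark-submit class names, jar location and the
# cluster_id conf key are assumptions.
from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator
from pendulum import datetime

STAGES = [
    "Sampling",
    "Jackknife",
    "Clustering",
    "ExpertDistributionOutliers",
    "SolrIndexing",
]


def spark_step(stage: str) -> dict:
    """Build one EMR step definition for a pipeline stage."""
    return {
        "Name": stage,
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--class", f"au.org.ala.pipelines.{stage}Pipeline",  # assumed class
                "s3://example-bucket/pipelines.jar",                 # assumed jar
            ],
        },
    }


with DAG(
    dag_id="full_index_to_solr_sketch",
    schedule=None,
    start_date=datetime(2023, 1, 1),
    catchup=False,
):
    # EMR runs the steps in order, one after another, on the same cluster.
    EmrAddStepsOperator(
        task_id="run_index_stages",
        job_flow_id="{{ dag_run.conf['cluster_id'] }}",   # assumed conf key
        steps=[spark_step(stage) for stage in STAGES],
    )
```

### [solr_dataset_indexing](dags/solr_dataset_indexing.py)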
Run SOLR indexing for a single dataset into the live index.
This does not run the all-dataset processes (Jackknife etc.).
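
For reference, a run for one dataset could be triggered through the Airflow stable REST API roughly as below; the host, credentials, DAG id and the `datasetId` conf key are assumptions.

```python
# Minimal sketch: host, credentials, DAG id and the conf key are assumptions.
import requests

AIRFLOW_URL = "https://airflow.example.org/api/v1"    # assumed Airflow host

response = requests.post(
    f"{AIRFLOW_URL}/dags/solr_dataset_indexing/dagRuns",  # assumed DAG id
    auth=("airflow_user", "airflow_password"),            # assumed auth setup
    json={"conf": {"datasetId": "dr123"}},                # assumed conf key
    timeout=30,
)
response.raise_for_status()
print(response.json()["dag_run_id"])
```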