https://github.com/atlasoflivingaustralia/pipelines-airflow

# pipelines-airflow

Airflow DAGs and supporting files for running pipelines on Apache Airflow
with Amazon Elastic MapReduce (EMR).

## Installation

These scripts have been tested with Amazon Managed Workflows for Apache Airflow (MWAA) and Amazon EMR.

## DAGs

This section describes some of the important DAGs in this project.

### [load_dataset_dag.py](dags/load_dataset_dag.py)
Steps:
* Look up the dataset in the collectory
* Retrieve the details of the DwCA associated with the dataset
* Copy the DwCA to S3 for ingestion
* Determine the file size of the dataset and run the pipelines on either (see the branching sketch below):
  * a single node cluster for a small dataset
  * a multi-node cluster for a large dataset
* Run all pipelines to ingest the dataset, excluding SOLR indexing
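
The DAG itself is not reproduced here, but a minimal sketch of the size-based branch could look like the following (assuming Airflow 2.4+ with the Amazon provider installed; the conf key, bucket, key layout, size threshold and task ids are illustrative, not taken from the real DAG):

```python
import pendulum
from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

SMALL_DATASET_THRESHOLD_BYTES = 500 * 1024 * 1024  # assumed cut-off, not the project's


def choose_cluster(**context):
    """Pick a branch based on the size of the DwCA already staged in S3."""
    dataset_id = context["dag_run"].conf["datasetId"]  # hypothetical conf key
    meta = S3Hook(aws_conn_id="aws_default").head_object(
        key=f"dwca-imports/{dataset_id}.zip",      # hypothetical key layout
        bucket_name="example-pipelines-bucket",    # hypothetical bucket
    )
    if meta["ContentLength"] < SMALL_DATASET_THRESHOLD_BYTES:
        return "ingest_on_single_node_cluster"
    return "ingest_on_multi_node_cluster"


with DAG("load_dataset_sketch", start_date=pendulum.datetime(2022, 1, 1),
         schedule=None, catchup=False):
    branch = BranchPythonOperator(task_id="choose_cluster_size",
                                  python_callable=choose_cluster)
    # placeholder ingestion tasks; the real DAG runs the pipelines on EMR
    branch >> [EmptyOperator(task_id="ingest_on_single_node_cluster"),
               EmptyOperator(task_id="ingest_on_multi_node_cluster")]
```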

### [load_provider_dag.py](dags/load_provider_dag.py)
Steps:
* Look up the data provider in the collectory
* Retrieve the details of the DwCAs associated with its datasets
* Copy the DwCAs for all of this provider's datasets to S3, ready for ingestion (sketched below)
* Run all pipelines to ingest each dataset, excluding SOLR indexing

This can be used to load all of the datasets associated with an IPT.

![load_provider](https://user-images.githubusercontent.com/444897/158418989-52229ae7-5c12-485d-b479-a26bc894d1f4.jpg)
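
A minimal sketch of the lookup-and-stage pattern described above, assuming Airflow 2.4+ with the Amazon provider; the registry endpoint, response shape, conf key, bucket and key layout are all hypothetical:

```python
import pendulum
import requests
from airflow import DAG
from airflow.decorators import task
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

with DAG("load_provider_sketch", start_date=pendulum.datetime(2022, 1, 1),
         schedule=None, catchup=False):

    @task
    def list_provider_datasets(**context):
        """Ask the collectory for the dataset UIDs belonging to one provider."""
        provider_uid = context["dag_run"].conf["providerId"]  # hypothetical conf key
        resp = requests.get(
            f"https://collections.example.org/ws/dataProvider/{provider_uid}")  # hypothetical endpoint
        resp.raise_for_status()
        return [dr["uid"] for dr in resp.json().get("dataResources", [])]

    @task
    def copy_dwca_to_s3(dataset_uid: str):
        """Download one dataset's DwCA and stage it in S3 for ingestion."""
        archive = requests.get(
            f"https://collections.example.org/archives/{dataset_uid}.zip")  # hypothetical URL
        archive.raise_for_status()
        S3Hook(aws_conn_id="aws_default").load_bytes(
            archive.content,
            key=f"dwca-imports/{dataset_uid}.zip",     # hypothetical key layout
            bucket_name="example-pipelines-bucket",    # hypothetical bucket
            replace=True,
        )

    # one staging task per dataset via dynamic task mapping (Airflow 2.3+)
    copy_dwca_to_s3.expand(dataset_uid=list_provider_datasets())
```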

### [ingest_small_datasets_dag.py](dags/ingest_small_datasets_dag.py)
A DAG used by the `Ingest_all_datasets` DAG to load large numbers of small datasets using a **single node cluster** in EMR.
This will not run SOLR indexing.
Includes the following options (see the sketch below):
* `load_images` - whether to load images for archives
* `skip_dwca_to_verbatim` - skip the DwCA to Verbatim stage (which is expensive) and just reprocess
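
Options like these are normally supplied as DAG run configuration when the DAG is triggered. A minimal sketch of honouring such a flag with a branch (the downstream task ids are hypothetical; `load_images` would be read from the same configuration):

```python
from airflow.decorators import task


@task.branch
def choose_first_stage(**context):
    """Honour `skip_dwca_to_verbatim` by jumping straight to reprocessing."""
    conf = context["dag_run"].conf or {}
    if conf.get("skip_dwca_to_verbatim", False):
        return "interpret_verbatim"   # hypothetical downstream task id
    return "dwca_to_verbatim"         # hypothetical downstream task id
```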

### [ingest_large_datasets_dag.py](dags/ingest_large_datasets_dag.py)
A DAG used by the `Ingest_all_datasets` DAG to load large numbers of large datasets using a **multi-node cluster** in EMR (see the cluster sketch below).
This will not run SOLR indexing.
Includes the following options:
* `load_images` - whether to load images for archives
* `skip_dwca_to_verbatim` - skip the DwCA to Verbatim stage (which is expensive) and just reprocess
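
For illustration, a multi-node EMR cluster would typically be created from Airflow with `EmrCreateJobFlowOperator`; the release label, instance types and counts, roles and connection id below are assumptions, not this project's actual configuration:

```python
from airflow.providers.amazon.aws.operators.emr import EmrCreateJobFlowOperator

# Illustrative EMR spec only: all values are assumed, not the project's.
MULTI_NODE_JOB_FLOW = {
    "Name": "pipelines-large-dataset",
    "ReleaseLabel": "emr-6.5.0",
    "Applications": [{"Name": "Spark"}],
    "Instances": {
        "InstanceGroups": [
            {"Name": "Primary", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Workers", "InstanceRole": "CORE",
             "InstanceType": "m5.2xlarge", "InstanceCount": 4},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

# inside the DAG definition
create_cluster = EmrCreateJobFlowOperator(
    task_id="create_multi_node_cluster",
    job_flow_overrides=MULTI_NODE_JOB_FLOW,
    aws_conn_id="aws_default",
)
```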

### [ingest_all_datasets_dag.py](dags/ingest_all_datasets_dag.py)
Steps:
* Retrieve a list of all available DwCAs in S3
* Run all pipelines to ingest each dataset. To do this it creates (see the partitioning sketch below):
  * several single node clusters for small datasets
  * several multi-node clusters for large datasets
  * a single multi-node cluster for the largest dataset (eBird)

Includes the following options:
* `load_images` - whether to load images for archives
* `skip_dwca_to_verbatim` - skip the DwCA to Verbatim stage (which is expensive) and just reprocess
* `run_index` - whether to run a complete reindex on completion of ingestion
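
A minimal sketch of how the staged archives might be partitioned by size before clusters are created (the bucket, prefix, threshold and eBird identifier are assumptions):

```python
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

LARGE_THRESHOLD_BYTES = 500 * 1024 * 1024  # assumed cut-off, not the project's
EBIRD_UID = "dr-ebird"                     # hypothetical identifier for the eBird archive


def partition_archives(bucket="example-pipelines-bucket", prefix="dwca-imports/"):
    """Group every staged DwCA by size so each group gets a suitably sized cluster."""
    s3 = S3Hook(aws_conn_id="aws_default")
    groups = {"small": [], "large": [], "ebird": []}
    for key in s3.list_keys(bucket_name=bucket, prefix=prefix):
        if EBIRD_UID in key:
            groups["ebird"].append(key)
        elif s3.head_object(key=key, bucket_name=bucket)["ContentLength"] < LARGE_THRESHOLD_BYTES:
            groups["small"].append(key)
        else:
            groups["large"].append(key)
    return groups
```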

### [full_index_to_solr.py](dags/full_index_to_solr.py)
Steps (see the ordering sketch below):
* Run Sampling of environmental and contextual layers
* Run Jackknife environmental outlier detection
* Run Clustering
* Run Expert Distribution outlier detection
* Run SOLR indexing for all datasets
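
A minimal sketch of the strict ordering above using placeholder tasks (task ids are illustrative; in the real DAG each stage runs actual pipeline work rather than an empty operator):

```python
import pendulum
from airflow import DAG
from airflow.models.baseoperator import chain
from airflow.operators.empty import EmptyOperator

with DAG("full_index_sketch", start_date=pendulum.datetime(2022, 1, 1),
         schedule=None, catchup=False):
    # placeholder stages in the order listed above
    stages = [EmptyOperator(task_id=tid) for tid in (
        "sampling", "jackknife", "clustering",
        "expert_distribution_outliers", "solr_indexing")]
    chain(*stages)  # enforce the strict sequential ordering
```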

### [solr_dataset_indexing.py](dags/solr_dataset_indexing.py)
Runs SOLR indexing for a single dataset into the live index (see the sketch below).
This does not run the all-dataset processes (Jackknife etc.).
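
For example, another DAG could hand a single dataset to this one with a trigger like the following (the conf key and dataset id are hypothetical):

```python
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

# inside another DAG definition
index_one_dataset = TriggerDagRunOperator(
    task_id="index_dataset_in_solr",
    trigger_dag_id="solr_dataset_indexing",  # assumed to match the DAG id in this file
    conf={"datasetId": "dr123"},             # hypothetical conf key and dataset id
)
```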