Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/os-climate/essd-ingest-pipeline
Ingestion pipeline for Earth System Science Data (ESSD).
Last synced: 7 days ago
- Host: GitHub
- URL: https://github.com/os-climate/essd-ingest-pipeline
- Owner: os-climate
- License: apache-2.0
- Created: 2021-11-19T09:46:06.000Z (almost 3 years ago)
- Default Branch: master
- Last Pushed: 2024-10-21T19:55:28.000Z (24 days ago)
- Last Synced: 2024-10-22T13:38:41.541Z (23 days ago)
- Language: Jupyter Notebook
- Homepage:
- Size: 8.16 MB
- Stars: 0
- Watchers: 2
- Forks: 3
- Open Issues: 7
Metadata Files:
- Readme: README.md
- License: LICENSE
- Codeowners: .github/CODEOWNERS
README
> [!IMPORTANT]
> On June 26, 2024, the Linux Foundation announced the merger of its financial services umbrella, the Fintech Open Source Foundation ([FINOS](https://finos.org)), with OS-Climate, an open source community dedicated to building data technologies, modeling, and analytic tools that will drive global capital flows into climate change mitigation and resilience. OS-Climate projects are in the process of transitioning to the [FINOS governance framework](https://community.finos.org/docs/governance). Read more at [finos.org/press/finos-join-forces-os-open-source-climate-sustainability-esg](https://finos.org/press/finos-join-forces-os-open-source-climate-sustainability-esg).

# ESSD Ingestion Pipeline
[Earth System Science Data](https://www.earth-system-science-data.net/) (ESSD) is an international, interdisciplinary journal for the publication of articles on original research data (sets), furthering the reuse of high-quality data of benefit to Earth system sciences. Its open data licensing, high-quality data sources, and editorial review processes make it an excellent source of data for the [Data Commons](https://github.com/os-climate/os_c_data_commons) in general, and for region-based GHG timeseries data in particular.
This data pipeline was initially forked from the AI CoE [project template](https://github.com/aicoe-aiops/project-template), which is geared toward AI/ML data extraction. The ESSD data is already highly curated, so we use little of that structure here. However, a common build and run environment keeps the pipeline friendly to our CI/CD systems and GitHub, and helps connect all the pipelines to the Data Commons in a consistent fashion.
The principal ingestion code can be found in the [notebooks](notebooks) directory. At present there are two steps in the pipeline:
1. Extract (which copies data into a Pachyderm repository to support data reproducibility)
2. Load (which loads data into Trino, builds the dbt transforms, and initializes metadata for OpenMetadata).

A third pipeline step may be to elaborate and curate the metadata to better support an ever-expanding Data Catalog.
The data transformation step runs from the [dbt/essd_transform](dbt/essd_transform) directory and is, or will be, documented there.
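
As a rough illustration of the extract-then-load flow described above, here is a minimal, self-contained sketch. It is not the code in the notebooks. It assumes a CSV input, uses a checksum-named local directory as a stand-in for a Pachyderm repository, and uses Python's built-in `sqlite3` in place of Trino so that it runs without any services. The function names `extract` and `load` and the table name are hypothetical.

```python
import csv
import hashlib
import shutil
import sqlite3
from pathlib import Path


def extract(source: Path, staging_dir: Path) -> Path:
    """Copy a source file into a checksum-named staging directory.

    Stand-in for the Extract step: the real pipeline copies data into a
    Pachyderm repository. Here a SHA-256 prefix in the path plays the
    role of a reproducible version identifier (an assumption).
    """
    digest = hashlib.sha256(source.read_bytes()).hexdigest()
    target_dir = staging_dir / digest[:12]
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name
    shutil.copy2(source, target)
    return target


def load(staged_csv: Path, conn: sqlite3.Connection, table: str) -> int:
    """Load a staged CSV into a SQL table and return the number of rows.

    Stand-in for the Load step: sqlite3 replaces Trino here. The table
    and column names are treated as trusted input in this sketch.
    """
    with staged_csv.open(newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)  # first row holds the column names
        rows = list(reader)
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()
    return len(rows)
```

Calling `extract` first and passing its return value to `load` mirrors the ordering of the two steps. Because the staged path is derived from the file contents, re-running the extract on unchanged input yields the same location.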
Metadata for the tables we have ingested can be viewed from our [OpenMetadata portal](https://openmetadata-openmetadata.apps.odh-cl2.apps.os-climate.org/explore/tables/?searchFilter=databaseschema%3Dessd) (GitHub User ID and ODH User access tokens required).
If you have questions, please file [Issues](https://github.com/os-climate/essd-ingest-pipeline/issues). If you have answers, please contribute [Pull Requests](https://github.com/os-climate/essd-ingest-pipeline/pulls)!
---
Project based on the cookiecutter data science project template. #cookiecutterdatascience