https://github.com/eth-library/data-assets-pipeline
A pipeline for ingesting and processing digital archive assets, extracting metadata from METS files, and orchestrating archive workflows.
- Host: GitHub
- URL: https://github.com/eth-library/data-assets-pipeline
- Owner: eth-library
- Created: 2024-11-21T22:41:38.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2025-08-26T13:06:57.000Z (about 1 month ago)
- Last Synced: 2025-08-26T16:49:30.069Z (about 1 month ago)
- Topics: dagster, data-engineering, data-pipeline, digital-archive, digital-preservation, mets, mets-xml, oais
- Language: Python
- Homepage:
- Size: 70.3 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Data Archive Assets Pipeline
## Overview
This project implements a digital asset processing pipeline based on the Submission Information Package (SIP) component of the Open Archival Information System (OAIS) reference model and the METS (Metadata Encoding and Transmission Standard) specification. It processes and manages digital assets within a data archive by extracting metadata from METS files and organizing them into structured SIPs.

The system uses [Dagster](https://dagster.io/) as its core data orchestrator, providing robust workflow management for complex archiving processes. The implementation provides:

- **OAIS SIP Processing**: Implements the OAIS Submission Information Package model with structured metadata handling
- **METS Standard Support**: Full parsing and processing of METS XML files
- **Data Validation**: Robust validation using Pydantic models
- **Scalable Architecture**: Modular design for handling complex archiving workflows
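The repository's actual model classes are not shown in this README, but the entities handled by the pipeline (SIPs, intellectual entities, representations, files, fixities) map naturally onto Pydantic models. The sketch below is illustrative only; all class and field names are assumptions rather than the project's API:

```python
from pydantic import BaseModel, Field


class Fixity(BaseModel):
    """Checksum recorded for a file (e.g. METS CHECKSUM/CHECKSUMTYPE attributes)."""
    algorithm: str
    value: str


class File(BaseModel):
    """A single file referenced in the METS file section."""
    path: str
    mimetype: str | None = None
    fixity: Fixity | None = None


class Representation(BaseModel):
    """A group of files forming one representation of an intellectual entity."""
    label: str
    files: list[File] = Field(default_factory=list)


class IntellectualEntity(BaseModel):
    """A described object within the SIP."""
    identifier: str
    representations: list[Representation] = Field(default_factory=list)


class SIP(BaseModel):
    """Top-level Submission Information Package assembled from a METS file."""
    identifier: str
    intellectual_entities: list[IntellectualEntity] = Field(default_factory=list)
```

With models along these lines, metadata extracted from a METS file is validated at construction time, so a malformed package fails fast instead of propagating downstream.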
## Setup

### Recommended Nix + Direnv Setup
We recommend the fully automatic setup based on Nix flakes and direnv:
#### Prerequisites
- [Nix](https://nixos.org/download.html) package manager with [flakes](https://wiki.nixos.org/wiki/Flakes) enabled
- [direnv](https://direnv.net/docs/installation.html) for environment management

#### Steps
1. Clone the repository
2. Allow direnv in the project directory:

```bash
direnv allow
```

This will automatically:
- Create a Python 3.12 virtual environment in `.venv`
- Install all dependencies using the UV package manager
- Set up the development environment

If you need to manually activate the environment without direnv:
```bash
nix develop
```

## Dependency Management
Dependencies are managed using [UV](https://github.com/astral-sh/uv), a modern Python package manager:
- `pyproject.toml`: Defines project dependencies (requires Python 3.12+)
- `uv.lock`: Locks dependencies to specific versions

Common UV commands:
```bash
# Sync the environment with the lock file (installs dependencies)
uv sync

# Update the lock file
uv lock

# Install the project and its dependencies (for manual setup)
uv pip install -e .
```

## Usage
### Starting the Dagster UI
Launch the Dagster web interface:
```bash
dagster dev
```

Access the UI at http://localhost:3000
### Pipeline Structure
The pipeline consists of the following components:
1. **Assets**:
- `sip_asset`: Parses METS XML files into a structured SIP model
- `intellectual_entities`: Extracts and processes Intellectual Entity models
- `representations`: Collects and processes file representations
- `files`: Extracts and processes file metadata
- `fixities`: Extracts and processes file checksums

2. **Jobs**:
- `ingest_sip_job`: Orchestrates the complete SIP creation process

3. **Sensors**:
- `xml_file_sensor`: Monitors for new METS XML files and triggers processing
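The following sketch shows how pieces like these typically fit together in Dagster. The asset, job, and sensor names match the list above, but the configuration schema, watch directory, and parsing logic are placeholders, not the project's actual implementation:

```python
import os

from dagster import (
    Config,
    Definitions,
    RunRequest,
    asset,
    define_asset_job,
    sensor,
)


class SIPConfig(Config):
    """Run configuration for the SIP ingest (hypothetical field name)."""
    mets_path: str


@asset
def sip_asset(config: SIPConfig) -> dict:
    # The real asset parses the METS XML into a structured SIP model;
    # here we only return a placeholder.
    return {"mets_path": config.mets_path}


@asset
def intellectual_entities(sip_asset: dict) -> list[dict]:
    # Downstream asset: depends on sip_asset via its parameter name.
    return []


# Job that materializes the SIP-related assets together.
ingest_sip_job = define_asset_job(
    "ingest_sip_job", selection=[sip_asset, intellectual_entities]
)

WATCH_DIR = "incoming_mets"  # hypothetical drop directory for METS files


@sensor(job=ingest_sip_job)
def xml_file_sensor(context):
    # One run per new XML file; run_key de-duplicates files already processed.
    for name in sorted(os.listdir(WATCH_DIR)):
        if name.endswith(".xml"):
            yield RunRequest(
                run_key=name,
                run_config={
                    "ops": {
                        "sip_asset": {
                            "config": {"mets_path": os.path.join(WATCH_DIR, name)}
                        }
                    }
                },
            )


# The Definitions object is what a Dagster code location exposes.
defs = Definitions(
    assets=[sip_asset, intellectual_entities],
    jobs=[ingest_sip_job],
    sensors=[xml_file_sensor],
)
```

In Dagster, the `Definitions` object at the end is what a code location exposes, and `workspace.yaml` (see Project Configuration below) tells Dagster where to find it.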
### Running Tests

Execute the test suite:
```bash
pytest da_pipeline_tests
```
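The contents of `da_pipeline_tests` are not shown here. As a purely illustrative, self-contained sketch, a test in such a suite might verify that checksums are recovered from a METS file section; the sample document and `extract_checksums` helper below are hypothetical:

```python
import xml.etree.ElementTree as ET

METS_NS = {"mets": "http://www.loc.gov/METS/"}

SAMPLE_METS = """\
<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:fileSec>
    <mets:fileGrp USE="master">
      <mets:file ID="f1" CHECKSUM="abc123" CHECKSUMTYPE="MD5" MIMETYPE="image/tiff"/>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>
"""


def extract_checksums(mets_xml: str) -> dict[str, str]:
    """Map METS file IDs to their recorded checksums."""
    root = ET.fromstring(mets_xml)
    return {
        el.attrib["ID"]: el.attrib["CHECKSUM"]
        for el in root.findall(".//mets:file", METS_NS)
        if "CHECKSUM" in el.attrib
    }


def test_checksum_is_extracted():
    assert extract_checksums(SAMPLE_METS) == {"f1": "abc123"}
```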
## Project Configuration

- `flake.nix`: Defines the development environment and dependencies
- `.envrc`: Configures direnv to use the Nix flake
- `pyproject.toml`: Defines Python package metadata and dependencies
- `workspace.yaml`: Configures Dagster code locations
- `uv.lock`: Locks dependencies to specific versions