{"id":31036978,"url":"https://github.com/eth-library/data-assets-pipeline","last_synced_at":"2025-10-08T08:10:06.992Z","repository":{"id":311769335,"uuid":"892324339","full_name":"eth-library/data-assets-pipeline","owner":"eth-library","description":"A pipeline for ingesting and processing digital archive assets, extracting metadata from METS files, and orchestrating archive workflows.","archived":false,"fork":false,"pushed_at":"2025-08-26T13:06:57.000Z","size":72,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-08-26T16:49:30.069Z","etag":null,"topics":["dagster","data-engineering","data-pipeline","digital-archive","digital-preservation","mets","mets-xml","oais"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/eth-library.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-11-21T22:41:38.000Z","updated_at":"2025-08-26T14:03:46.000Z","dependencies_parsed_at":"2025-08-26T16:55:42.814Z","dependency_job_id":"c2899287-e2b5-4fe6-b44b-87dc6a90be25","html_url":"https://github.com/eth-library/data-assets-pipeline","commit_stats":null,"previous_names":["eth-library/data-assets-pipeline"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/eth-library/data-assets-pipeline","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eth-library%2Fdata-assets-pipeline","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eth-library%2Fd
ata-assets-pipeline/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eth-library%2Fdata-assets-pipeline/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eth-library%2Fdata-assets-pipeline/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/eth-library","download_url":"https://codeload.github.com/eth-library/data-assets-pipeline/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eth-library%2Fdata-assets-pipeline/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":275062968,"owners_count":25398888,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-14T02:00:10.474Z","response_time":75,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dagster","data-engineering","data-pipeline","digital-archive","digital-preservation","mets","mets-xml","oais"],"created_at":"2025-09-14T04:45:05.922Z","updated_at":"2025-10-08T08:10:01.957Z","avatar_url":"https://github.com/eth-library.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data Archive Assets Pipeline\n\n## Overview\n\nThis project implements a digital asset processing pipeline that implements the Submission Information Package (SIP)\ncomponent of the Open Archival Information System (OAIS) reference model and METS (Metadata 
Encoding and Transmission\nStandard) specifications. It processes and manages digital assets within a data archive by extracting metadata from METS\nfiles and organizing them into structured SIPs.\n\nThe system uses [Dagster](https://dagster.io/) as its core data orchestrator, providing robust workflow management\nfor complex archiving processes. The implementation provides:\n\n- **OAIS SIP Processing**: Implements the OAIS Submission Information Package model with structured metadata handling\n- **METS Standard Support**: Full parsing and processing of METS XML files\n- **Data Validation**: Robust validation using Pydantic models\n- **Scalable Architecture**: Modular design for handling complex archiving workflows\n\n## Setup\n\n### Recommended Nix + Direnv Setup\n\nWe recommend the fully automatic setup with Nix Flakes and direnv:\n\n#### Prerequisites\n\n- [Nix](https://nixos.org/download.html) package manager with [flakes](https://wiki.nixos.org/wiki/Flakes) enabled\n- [direnv](https://direnv.net/docs/installation.html) for environment management\n\n#### Steps\n\n1. Clone the repository\n2. 
Allow direnv in the project directory:\n\n```bash\ndirenv allow\n```\n\nThis will automatically:\n- Create a Python 3.12 virtual environment in `.venv`\n- Install all dependencies using the UV package manager\n- Set up the development environment\n\nIf you need to manually activate the environment without direnv:\n\n```bash\nnix develop\n```\n\n\n## Dependency Management\n\nDependencies are managed using [UV](https://github.com/astral-sh/uv), a modern Python package manager:\n\n- `pyproject.toml`: Defines project dependencies (requires Python 3.12+)\n- `uv.lock`: Locks dependencies to specific versions\n\nCommon UV commands:\n\n```bash\n# Install dependencies and sync the environment with uv.lock (also used for manual setup)\nuv sync\n\n# Update lock file\nuv lock\n```\n\n## Usage\n\n### Starting the Dagster UI\n\nLaunch the Dagster web interface:\n\n```bash\ndagster dev\n```\n\nAccess the UI at http://localhost:3000\n\n### Pipeline Structure\n\nThe pipeline consists of the following components:\n\n1. **Assets**:\n   - `sip_asset`: Parses METS XML files into a structured SIP model\n   - `intellectual_entities`: Extracts and processes Intellectual Entity models\n   - `representations`: Collects and processes file representations\n   - `files`: Extracts and processes file metadata\n   - `fixities`: Extracts and processes file checksums\n\n2. **Jobs**:\n   - `ingest_sip_job`: Orchestrates the complete SIP creation process\n\n3. 
**Sensors**:\n   - `xml_file_sensor`: Monitors for new METS XML files and triggers processing\n\n### Running Tests\n\nExecute the test suite:\n\n```bash\npytest da_pipeline_tests\n```\n\n## Project Configuration\n\n- `flake.nix`: Defines the development environment and dependencies\n- `.envrc`: Configures direnv to use the Nix flake\n- `pyproject.toml`: Defines Python package metadata and dependencies\n- `workspace.yaml`: Configures Dagster code locations\n- `uv.lock`: Locks dependencies to specific versions\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feth-library%2Fdata-assets-pipeline","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Feth-library%2Fdata-assets-pipeline","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feth-library%2Fdata-assets-pipeline/lists"}