{"id":34701735,"url":"https://github.com/hubmapconsortium/reharmonize-legacy-metadata","last_synced_at":"2026-03-13T22:34:03.459Z","repository":{"id":317417760,"uuid":"1062012147","full_name":"hubmapconsortium/reharmonize-legacy-metadata","owner":"hubmapconsortium","description":"Resources and outputs for aligning legacy metadata with new schema standards through mappings, patches and glossaries","archived":false,"fork":false,"pushed_at":"2026-02-20T17:30:19.000Z","size":16787,"stargazers_count":1,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-02-20T21:57:46.959Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hubmapconsortium.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-22T17:27:06.000Z","updated_at":"2026-02-20T17:30:20.000Z","dependencies_parsed_at":"2025-10-15T04:14:33.020Z","dependency_job_id":"44926567-b991-4471-b1dd-963a0515301c","html_url":"https://github.com/hubmapconsortium/reharmonize-legacy-metadata","commit_stats":null,"previous_names":["hubmapconsortium/reharmonize-legacy-metadata"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/hubmapconsortium/reharmonize-legacy-metadata","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hubmapconsortium%2Freharmonize-legacy-metadata","tags_url":"https://repos.ecosyste.ms/api/v1/hosts
/GitHub/repositories/hubmapconsortium%2Freharmonize-legacy-metadata/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hubmapconsortium%2Freharmonize-legacy-metadata/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hubmapconsortium%2Freharmonize-legacy-metadata/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hubmapconsortium","download_url":"https://codeload.github.com/hubmapconsortium/reharmonize-legacy-metadata/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hubmapconsortium%2Freharmonize-legacy-metadata/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30478168,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-13T20:45:58.186Z","status":"ssl_error","status_checked_at":"2026-03-13T20:45:20.133Z","response_time":60,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-12-24T22:52:56.100Z","updated_at":"2026-03-13T22:34:03.451Z","avatar_url":"https://github.com/hubmapconsortium.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"# HubMAP Legacy Metadata Standardization\n\n## Overview\n\nThis project standardizes legacy metadata from 17 different assay types into schema-compliant formats for the Human BioMolecular Atlas Program (HubMAP). 
The transformation ensures data quality, consistency, and compliance with current metadata standards.\n\n**Per October 31, 2025**: 2,192 legacy metadata files processed across 12 assay types\n\n**Per December 4, 2025**: 2,568 legacy metadata files processed across 15 assay types, and 2,192 are reviewed\n\n**Per March 2, 2026**: 2,568 legacy metadata files processed and reviewed\n\n---\n\n## Deliverable 1: Standardized Metadata by Dataset Type\n\nThe `metadata/` folder contains processed metadata organized by assay type. Each subfolder represents a distinct experimental methodology:\n\n### Dataset Types\n\n| Dataset Type | # of Files | Description | Processed | Reviewed | Todo | Summary |\n|-------------|------------|-------------|:---------:|:--------:|:----:|:-------:|\n| **[rnaseq](https://github.com/hubmapconsortium/reharmonize-legacy-metadata/tree/main/metadata/rnaseq)** | 639 | RNA sequencing metadata | ✅ | ✅ | [link](https://github.com/hubmapconsortium/reharmonize-legacy-metadata/tree/main/metadata/rnaseq/todo) | [link](https://github-html-preview.dohyeon5626.com/?https://github.com/hubmapconsortium/reharmonize-legacy-metadata/blob/main/metadata/rnaseq/transformation-summary.html) |\n| **[atacseq](https://github.com/hubmapconsortium/reharmonize-legacy-metadata/tree/main/metadata/atacseq)** | 567 | ATAC sequencing metadata | ✅ | ✅ | [link](https://github.com/hubmapconsortium/reharmonize-legacy-metadata/tree/main/metadata/atacseq/todo) | [link](https://github-html-preview.dohyeon5626.com/?https://github.com/hubmapconsortium/reharmonize-legacy-metadata/blob/main/metadata/atacseq/transformation-summary.html) |\n| **[lcms](https://github.com/hubmapconsortium/reharmonize-legacy-metadata/tree/main/metadata/lcms)** | 267 | Liquid chromatography-mass spectrometry metadata | ✅ | ✅ | [link](https://github.com/hubmapconsortium/reharmonize-legacy-metadata/tree/main/metadata/lcms/todo) | 
[link](https://github-html-preview.dohyeon5626.com/?https://github.com/hubmapconsortium/reharmonize-legacy-metadata/blob/main/metadata/lcms/transformation-summary.html) |\n| **[mibi](https://github.com/hubmapconsortium/reharmonize-legacy-metadata/tree/main/metadata/mibi)** | 211 | Multiplexed ion beam imaging metadata | ✅ | ✅ | [link](https://github.com/hubmapconsortium/reharmonize-legacy-metadata/tree/main/metadata/mibi/todo) | [link](https://github-html-preview.dohyeon5626.com/?https://github.com/hubmapconsortium/reharmonize-legacy-metadata/blob/main/metadata/mibi/transformation-summary.html) |\n| **[af](https://github.com/hubmapconsortium/reharmonize-legacy-metadata/tree/main/metadata/af)** | 136 | Auto-fluorescence imaging metadata | ✅ | ✅ | [link](https://github.com/hubmapconsortium/reharmonize-legacy-metadata/tree/main/metadata/af/todo) | [link](https://github-html-preview.dohyeon5626.com/?https://github.com/hubmapconsortium/reharmonize-legacy-metadata/blob/main/metadata/af/transformation-summary.html) |\n| **[codex](https://github.com/hubmapconsortium/reharmonize-legacy-metadata/tree/main/metadata/codex)** | 133 | CODEX imaging metadata | ✅ | ✅ | [link](https://github.com/hubmapconsortium/reharmonize-legacy-metadata/tree/main/metadata/codex/todo) | [link](https://github-html-preview.dohyeon5626.com/?https://github.com/hubmapconsortium/reharmonize-legacy-metadata/blob/main/metadata/codex/transformation-summary.html) |\n| **[maldi](https://github.com/hubmapconsortium/reharmonize-legacy-metadata/tree/main/metadata/maldi)** | 93 | MALDI imaging mass spectrometry metadata | ✅ | ✅ | [link](https://github.com/hubmapconsortium/reharmonize-legacy-metadata/tree/main/metadata/maldi/todo) | [link](https://github-html-preview.dohyeon5626.com/?https://github.com/hubmapconsortium/reharmonize-legacy-metadata/blob/main/metadata/maldi/transformation-summary.html) |\n| **[histology](https://github.com/hubmapconsortium/reharmonize-legacy-metadata/tree/main/metadata/histology)** 
| 77 | Histology imaging metadata | ✅ | ✅ | [link](https://github.com/hubmapconsortium/reharmonize-legacy-metadata/tree/main/metadata/histology/todo) | [link](https://github-html-preview.dohyeon5626.com/?https://github.com/hubmapconsortium/reharmonize-legacy-metadata/blob/main/metadata/histology/transformation-summary.html) |\n| **[celldive](https://github.com/hubmapconsortium/reharmonize-legacy-metadata/tree/main/metadata/celldive)** | 32 | Cell DIVE imaging metadata | ✅ | ✅ | [link](https://github.com/hubmapconsortium/reharmonize-legacy-metadata/tree/main/metadata/celldive/todo) | [link](https://github-html-preview.dohyeon5626.com/?https://github.com/hubmapconsortium/reharmonize-legacy-metadata/blob/main/metadata/celldive/transformation-summary.html) |\n| **[desi](https://github.com/hubmapconsortium/reharmonize-legacy-metadata/tree/main/metadata/desi)** | 15 | DESI imaging mass spectrometry metadata | ✅ | ✅ | [link](https://github.com/hubmapconsortium/reharmonize-legacy-metadata/tree/main/metadata/desi/todo) | [link](https://github-html-preview.dohyeon5626.com/?https://github.com/hubmapconsortium/reharmonize-legacy-metadata/blob/main/metadata/desi/transformation-summary.html) |\n| **[imc-2d](https://github.com/hubmapconsortium/reharmonize-legacy-metadata/tree/main/metadata/imc-2d)** | 13 | Imaging mass cytometry 2D metadata | ✅ | ✅ | [link](https://github.com/hubmapconsortium/reharmonize-legacy-metadata/tree/main/metadata/imc-2d/todo) | [link](https://github-html-preview.dohyeon5626.com/?https://github.com/hubmapconsortium/reharmonize-legacy-metadata/blob/main/metadata/imc-2d/transformation-summary.html) |\n| **[lightsheet](https://github.com/hubmapconsortium/reharmonize-legacy-metadata/tree/main/metadata/lightsheet)** | 9 | Light sheet microscopy metadata | ✅ | ✅ | [link](https://github.com/hubmapconsortium/reharmonize-legacy-metadata/tree/main/metadata/lightsheet/todo) | 
[link](https://github-html-preview.dohyeon5626.com/?https://github.com/hubmapconsortium/reharmonize-legacy-metadata/blob/main/metadata/lightsheet/transformation-summary.html) |\n| **[10x-multiome](https://github.com/hubmapconsortium/reharmonize-legacy-metadata/tree/main/metadata/cedar-10x-multiome)** | 102 | 10X Multiome metadata (from CEDAR instances) | ✅ | ✅ | [link](https://github.com/hubmapconsortium/reharmonize-legacy-metadata/tree/main/metadata/cedar-10x-multiome/todo) | [link](https://github-html-preview.dohyeon5626.com/?https://github.com/hubmapconsortium/reharmonize-legacy-metadata/blob/main/metadata/cedar-10x-multiome/transformation-summary.html) |\n| **[histology](https://github.com/hubmapconsortium/reharmonize-legacy-metadata/tree/main/metadata/cedar-histology)** | 88 | Histology imaging metadata (from CEDAR instances) | ✅ | ✅ | [link](https://github.com/hubmapconsortium/reharmonize-legacy-metadata/tree/main/metadata/cedar-histology/todo) | [link](https://github-html-preview.dohyeon5626.com/?https://github.com/hubmapconsortium/reharmonize-legacy-metadata/blob/main/metadata/cedar-histology/transformation-summary.html) |\n| **[visium-no-probes](https://github.com/hubmapconsortium/reharmonize-legacy-metadata/tree/main/metadata/cedar-visium-no-probes)** | 83 | Visium (no probes) metadata (from CEDAR instances) | ✅ | ✅ | [link](https://github.com/hubmapconsortium/reharmonize-legacy-metadata/tree/main/metadata/cedar-visium-no-probes/todo) | [link](https://github-html-preview.dohyeon5626.com/?https://github.com/hubmapconsortium/reharmonize-legacy-metadata/blob/main/metadata/cedar-visium-no-probes/transformation-summary.html) |\n| **[rnaseq](https://github.com/hubmapconsortium/reharmonize-legacy-metadata/tree/main/metadata/cedar-rnaseq)** | 50 | RNA sequencing metadata (from CEDAR instances) | ✅ | ✅ | [link](https://github.com/hubmapconsortium/reharmonize-legacy-metadata/tree/main/metadata/cedar-rnaseq/todo) | 
[link](https://github-html-preview.dohyeon5626.com/?https://github.com/hubmapconsortium/reharmonize-legacy-metadata/blob/main/metadata/cedar-rnaseq/transformation-summary.html) |\n| **[af](https://github.com/hubmapconsortium/reharmonize-legacy-metadata/tree/main/metadata/cedar-af)** | 28 | Auto-fluorescence imaging metadata (from CEDAR instances) | ✅ | ✅ | [link](https://github.com/hubmapconsortium/reharmonize-legacy-metadata/tree/main/metadata/cedar-af/todo) | [link](https://github-html-preview.dohyeon5626.com/?https://github.com/hubmapconsortium/reharmonize-legacy-metadata/blob/main/metadata/cedar-af/transformation-summary.html) |\n| **[music](https://github.com/hubmapconsortium/reharmonize-legacy-metadata/tree/main/metadata/cedar-music)** | 14 | MuSIC metadata (from CEDAR instances) | ✅ | ✅ | [link](https://github.com/hubmapconsortium/reharmonize-legacy-metadata/tree/main/metadata/cedar-music/todo) | [link](https://github-html-preview.dohyeon5626.com/?https://github.com/hubmapconsortium/reharmonize-legacy-metadata/blob/main/metadata/cedar-music/transformation-summary.html) |\n| **[lightsheet](https://github.com/hubmapconsortium/reharmonize-legacy-metadata/tree/main/metadata/cedar-lightsheet)** | 8 | Light sheet microscopy metadata (from CEDAR instances) | ✅ | ✅ | [link](https://github.com/hubmapconsortium/reharmonize-legacy-metadata/tree/main/metadata/cedar-lightsheet/todo) | [link](https://github-html-preview.dohyeon5626.com/?https://github.com/hubmapconsortium/reharmonize-legacy-metadata/blob/main/metadata/cedar-lightsheet/transformation-summary.html) |\n| **[maldi](https://github.com/hubmapconsortium/reharmonize-legacy-metadata/tree/main/metadata/cedar-maldi)** | 3 | MALDI imaging mass spectrometry metadata (from CEDAR instances) | ✅ | ✅ | [link](https://github.com/hubmapconsortium/reharmonize-legacy-metadata/tree/main/metadata/cedar-maldi/todo) | 
[link](https://github-html-preview.dohyeon5626.com/?https://github.com/hubmapconsortium/reharmonize-legacy-metadata/blob/main/metadata/cedar-maldi/transformation-summary.html) |\n\n**Acknowledgement**: Metadata review conducted by Jean G. Rosario ([jrosar7](https://github.com/jrosar7)) at University of Pennsylvania.\n\n### Folder Structure\n\nEach dataset type folder contains:\n\n```\nmetadata/{dataset-type}/\n├── input/                          # Original legacy metadata files\n├── output/                         # Transformed, schema-compliant files\n├── todo/                           # Excel reports for curator review (grouped by institution)\n│   ├── {Institution Name}.xlsx    # Institution-specific review spreadsheet\n│   └── summary-report.json        # Aggregated statistics\n├── {dataset-type}-field-mappings.csv    # Field name mapping rules\n├── {dataset-type}-patches.json    # Conditional transformation rules\n├── {dataset-type}-nonstandard-values.json  # Quality assurance analysis results\n└── transformation-summary.html    # HTML report summarizing all transformations\n```\n\n#### Input Folder (`input/`)\nContains original legacy metadata JSON files with inconsistent field names and value formats from various data providers.\n\n#### Output Folder (`output/`)\nContains standardized metadata JSON files with:\n- Schema-compliant field names and values\n- Full transformation provenance (processing logs)\n- JSON patches applied during transformation\n- Both original and modified metadata for comparison\n\nEach output file includes:\n```json\n{\n  \"uuid\": \"...\",\n  \"hubmap_id\": \"...\",\n  \"metadata\": { /* original legacy metadata */ },\n  \"modified_metadata\": { /* standardized metadata */ },\n  \"json_patch\": [ /* transformation operations applied */ ],\n  \"processing_log\": {\n    /* complete audit trail of:\n       - field name changes\n       - value standardizations\n       - excluded data\n       - transformation decisions */\n  
}\n}\n```\n\n#### Todo Folder (`todo/`)\nContains Excel spreadsheets grouped by institution for data provider review. Each spreadsheet identifies:\n- **Non-standard values:** Legacy values with no current schema equivalent\n- **Missing required data:** Required fields that are null or empty\n- **Validation issues:** Values that don't meet schema constraints (e.g., regex patterns)\n\nData providers use these spreadsheets to:\n1. Review flagged values and determine appropriate actions\n2. Update value mappings to include new standard equivalents\n3. Request missing data from data providers\n4. Propose schema updates for legitimate legacy values\n\n#### Transformation Summary (`transformation-summary.html`)\nAn HTML report providing a comprehensive overview of all transformations applied to the dataset type. The report includes:\n- **Field mappings table:** Shows the mapping from legacy field names to target schema field names\n- **Value mappings table:** Aggregated and deduplicated value standardizations extracted from output files\n- **Patches list:** Human-readable narration of conditional transformation rules applied\n\nThis report serves as documentation for reviewers to understand exactly what transformations were performed without needing to inspect individual output files.\n\n---\n\n## Deliverable 2: Metadata Processing System\n\n### Overview\nAn automated, rule-based transformation pipeline with quality assurance analysis.\n\n### Components\n\n#### Core Transformation Tools (`tools/`)\n\n**metadata-transformer** (v1.2.0)\n- Command-line tool for 4-phase metadata transformation\n- Phases: Conditional patching → Field mapping → Value standardization → Schema compliance\n- Comprehensive logging for full traceability\n\n**json-rules-engine** (v1.0.0)\n- Conditional transformation engine\n- Supports complex if/then logic for context-dependent transformations\n\n#### Analysis Scripts (`scripts/`)\n\n**generate-field-mapping.py**\n- Converts human-readable CSV field 
mappings to machine-readable JSON format\n\n**generate-target-schema.py**\n- Fetches and converts schemas from HubMAP repository\n- Ensures alignment with current metadata standards\n\n**find-nonstandard-values.py**\n- Quality assurance analysis after transformation\n- Identifies values requiring curator review\n- Generates institution-grouped Excel reports\n\n**generate-transformation-summary.py**\n- Generates HTML report summarizing all transformations for a dataset type\n- Aggregates field mappings, value mappings, and conditional patches\n- Produces human-readable narration of patch rules using templates\n\n### Transformation Pipeline\n\n```\nLegacy Metadata → [4-Phase Transformation] → Standardized Metadata → [QA Analysis] → Data Provider Review\n```\n\n- **Phase 0:** Apply conditional patches (complex transformations)\n- **Phase 1:** Rename fields (legacy → standard names)\n- **Phase 2:** Standardize values (legacy → standard values)\n- **Phase 3:** Apply schema (ensure all required fields present)\n- **Phase 4:** Generate logs (complete audit trail)\n\n### Automation\n\nGitHub Actions workflows automate the complete transformation and analysis pipeline for each dataset type:\n- Triggered manually via workflow dispatch\n- Steps: Generate mappings → Transform → Analyze → Commit results\n- Ensures reproducible, version-controlled transformations\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhubmapconsortium%2Freharmonize-legacy-metadata","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhubmapconsortium%2Freharmonize-legacy-metadata","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhubmapconsortium%2Freharmonize-legacy-metadata/lists"}