{"id":50876093,"url":"https://github.com/wayscience/iceberg-bioimage","last_synced_at":"2026-06-15T10:01:24.657Z","repository":{"id":347670268,"uuid":"1194488310","full_name":"WayScience/iceberg-bioimage","owner":"WayScience","description":"A format-agnostic framework for cataloging and querying bioimaging data (Parquet, OME-Zarr, OME-TIFF) with Apache Iceberg","archived":false,"fork":false,"pushed_at":"2026-04-21T20:15:17.000Z","size":2908,"stargazers_count":1,"open_issues_count":1,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-21T20:42:57.315Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://wayscience.github.io/iceberg-bioimage/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/WayScience.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-28T12:31:00.000Z","updated_at":"2026-04-21T20:14:50.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/WayScience/iceberg-bioimage","commit_stats":null,"previous_names":["d33bs/iceberg-bioimage","wayscience/iceberg-bioimage"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/WayScience/iceberg-bioimage","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WayScience%2Ficeberg-bioimage","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WayScience%2Ficeber
g-bioimage/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WayScience%2Ficeberg-bioimage/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WayScience%2Ficeberg-bioimage/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/WayScience","download_url":"https://codeload.github.com/WayScience/iceberg-bioimage/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WayScience%2Ficeberg-bioimage/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34357282,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-15T02:00:07.085Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-15T10:01:21.457Z","updated_at":"2026-06-15T10:01:24.651Z","avatar_url":"https://github.com/WayScience.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cimg src=\"https://raw.githubusercontent.com/wayscience/iceberg-bioimage/main/docs/src/_static/iceberg-bioimage-logo.png\" alt=\"iceberg-bioimage logo\" width=\"150\"\u003e\n\n# iceberg-bioimage\n\n[![Software DOI badge](https://zenodo.org/badge/DOI/10.5281/zenodo.19672521.svg)](https://doi.org/10.5281/zenodo.19672521)\n[![PyPI - 
Version](https://img.shields.io/pypi/v/iceberg-bioimage)](https://pypi.org/project/iceberg-bioimage/)\n[![Build Status](https://github.com/wayscience/iceberg-bioimage/actions/workflows/run-tests.yml/badge.svg?branch=main)](https://github.com/wayscience/iceberg-bioimage/actions/workflows/run-tests.yml?query=branch%3Amain)\n[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)\n[![uv](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/uv/main/assets/badge/v0.json)](https://github.com/astral-sh/uv)\n\n`iceberg-bioimage` is a Python package for cataloging bioimaging metadata with Apache Iceberg and exporting Cytomining-compatible warehouse layouts.\n\nIt is designed for teams that want:\n\n- Iceberg as the control plane for cataloging, schemas, joins, and snapshots.\n- Cytomining-compatible Parquet warehouses as a first-class export target.\n- Flexible image data planes, including Zarr, OME-TIFF, and OME-Arrow-centered workflows.\n- Adapters that normalize source formats into a single `ScanResult` model.\n- Integration with external execution/query tools such as DuckDB, xarray, and tifffile.\n\n## Key capabilities\n\n- Scan supported source stores, including Zarr and OME-TIFF, into canonical `ScanResult` objects.\n- Summarize scanned datasets into user-facing `DatasetSummary` objects.\n- Publish `image_assets` and `chunk_index` metadata tables with PyIceberg.\n- Ingest one or more datasets into Cytotable-compatible Iceberg warehouses.\n- Export new or existing datasets into Cytomining-compatible Parquet warehouses.\n- Validate profile tables against the microscopy join contract.\n- Join scanned image metadata to profile tables through a simple top-level API.\n- Query canonical metadata through optional DuckDB helpers.\n- Load catalog-backed metadata tables into Arrow for downstream joins.\n\n## Project 
layout\n\n```text\nsrc/iceberg_bioimage/\n  __init__.py\n  api.py\n  cli.py\n  adapters/\n  integrations/\n  models/\n  publishing/\n  validation/\n```\n\n## Dependencies\n\nCore runtime dependencies include:\n\n- `pyarrow` for Arrow/Parquet table operations\n- `pyiceberg` for catalog/table publishing\n- `tifffile` for OME-TIFF metadata scanning when OME-TIFF sources are used\n- `zarr` for Zarr metadata scanning\n\nOptional integration groups:\n\n- `duckdb` for query helpers and examples\n- `ome-arrow` for Arrow-native tabular image payloads and lazy image access\n\n## Getting started\n\n- If you want a catalog-free first run, start with Cytomining export:\n  `iceberg-bioimage export-cytomining --warehouse-root warehouse-root data/experiment.zarr`\n- If you want Iceberg-backed publishing, configure a PyIceberg catalog first.\n- For step-by-step setup, see `docs/src/getting-started.md` and `docs/src/catalog-setup.md`.\n\n## Zarr support\n\n`iceberg-bioimage` keeps the user-facing API simple: use `scan_store(...)` for\nboth local Zarr v2 stores and local Zarr v3 metadata stores.\n\n- Zarr v2 arrays are scanned through the `zarr` Python package\n- Local Zarr v3 stores are scanned from `zarr.json` metadata without requiring\n  a separate API\n- Summaries report the storage variant as `zarr-v2` or `zarr-v3`\n- The base package allows either Zarr 2 or Zarr 3 runtimes so that optional\n  forward-facing integrations can coexist in the same environment\n\n## Quickstart\n\n```python\nfrom iceberg_bioimage import (\n    export_store_to_cytomining_warehouse,\n    ingest_stores_to_warehouse,\n    join_profiles_with_store,\n    register_store,\n    summarize_store,\n    validate_microscopy_profile_table,\n)\n\nregistration = register_store(\n    \"data/experiment.zarr\",\n    \"default\",\n    \"bioimage.cytotable\",\n)\nprint(registration.to_dict())\n\nsummary = summarize_store(\"data/experiment.zarr\")\nprint(summary.to_dict())\n\ncontract = 
validate_microscopy_profile_table(\"data/cells.parquet\")\nprint(contract.is_valid)\n\n# Requires the optional DuckDB integration:\n#   pip install 'iceberg-bioimage[duckdb]'\njoined = join_profiles_with_store(\"data/experiment.zarr\", \"data/cells.parquet\")\nprint(joined.num_rows)\n\nwarehouse = ingest_stores_to_warehouse(\n    [\"data/experiment-a.zarr\", \"data/experiment-b.zarr\"],\n    \"default\",\n    \"bioimage.cytotable\",\n)\nprint(warehouse.to_dict())\n\ncytomining_export = export_store_to_cytomining_warehouse(\n    \"data/experiment-a.zarr\",\n    \"warehouse-root\",\n    profiles=\"data/cells.parquet\",\n    profile_dataset_id=\"experiment-a\",\n)\nprint(cytomining_export.to_dict())\n```\n\n```bash\niceberg-bioimage scan data/experiment.zarr\niceberg-bioimage summarize data/experiment.zarr\niceberg-bioimage register --catalog default --namespace bioimage.cytotable data/experiment.zarr\niceberg-bioimage ingest --catalog default --namespace bioimage.cytotable data/experiment-a.zarr data/experiment-b.zarr\niceberg-bioimage export-cytomining --warehouse-root warehouse-root data/experiment.zarr\niceberg-bioimage publish-chunks --catalog default --namespace bioimage.cytotable data/experiment.zarr\niceberg-bioimage register --catalog default --namespace bioimage.cytotable --publish-chunks data/experiment.zarr\niceberg-bioimage validate-contract data/cells.parquet\niceberg-bioimage join-profiles data/experiment.zarr data/cells.parquet --output joined.parquet\n```\n\n- `examples/quickstart.py` for a minimal scan, publish, and validation script\n- `examples/catalog_duckdb.py` for a catalog-backed query workflow\n- `examples/synthetic_workflow.py` for a self-contained local workflow\n\nInstall optional integrations with:\n\n```bash\npip install 'iceberg-bioimage[duckdb]'\npip install 'iceberg-bioimage[ome-arrow]'\n```\n\n## DuckDB helpers\n\nDuckDB is supported as an optional integration layer, not as a required engine.\nThe join helpers also accept common 
`pycytominer` and `coSMicQC`-style\n`Metadata_*` aliases for `dataset_id`, `image_id`, `plate_id`, `well_id`, and\n`site_id`. If a profile table is missing `dataset_id` but all rows belong to\none dataset, pass `profile_dataset_id=...` to the high-level join helpers.\n\n```python\nimport pyarrow as pa\n\nfrom iceberg_bioimage import join_image_assets_with_profiles, query_metadata_table\n\nimage_assets = pa.table(\n    {\n        \"dataset_id\": [\"ds-1\"],\n        \"image_id\": [\"img-1\"],\n        \"array_path\": [\"0\"],\n        \"uri\": [\"data/example.zarr\"],\n    }\n)\nprofiles = pa.table(\n    {\n        \"dataset_id\": [\"ds-1\"],\n        \"image_id\": [\"img-1\"],\n        \"cell_count\": [42],\n    }\n)\n\njoined = join_image_assets_with_profiles(image_assets, profiles)\nfiltered = query_metadata_table(\n    joined,\n    filters=[(\"cell_count\", \"\u003e\", 10)],\n)\n```\n\nInstall the optional integration with `uv sync --group duckdb`.\n\n## Cytomining warehouse export\n\nThe package supports Cytomining interoperability as a primary workflow.\nBesides publishing canonical metadata to Iceberg, it can materialize a\nParquet-backed warehouse root that tools like `pycytominer` can consume\ndirectly.\n\n```python\nfrom iceberg_bioimage import export_store_to_cytomining_warehouse\n\nresult = export_store_to_cytomining_warehouse(\n    \"data/experiment.zarr\",\n    \"warehouse-root\",\n    profiles=\"data/profiles.parquet\",\n    profile_dataset_id=\"experiment\",\n)\nprint(result.to_dict())\n```\n\nThis writes one or more of:\n\n- `images/image_assets/`\n- `images/chunk_index/`\n- `profiles/joined_profiles/`\n\nIt can also append downstream Cytomining tables into the same warehouse root,\nusing namespaces that match table semantics, for example:\n\n- `profiles/pycytominer_profiles/`\n- `quality_control/cosmicqc_profiles/`\n\n## OME-Arrow helpers\n\nOME-Arrow is available as an optional forward-facing integration for tabular\nimage payloads stored in 
Arrow-compatible formats.\nProjects may also choose an OME-Arrow-first workflow for source image handling.\n\n```python\nfrom iceberg_bioimage import create_ome_arrow, scan_ome_arrow\n\noa = create_ome_arrow(\"image.ome.tiff\")\nlazy_oa = scan_ome_arrow(\"image.ome.parquet\")\n```\n\nInstall it with `uv sync --group ome-arrow` or\n`pip install 'iceberg-bioimage[ome-arrow]'`.\n\n## Local synthetic workflow\n\nFor a catalog-free onboarding path, `examples/synthetic_workflow.py` creates a\nsmall Zarr store and profile table, validates the join contract, derives\ncanonical metadata rows, and joins them with the optional DuckDB helpers.\n\nRun it with:\n\n```bash\nuv run --group duckdb python examples/synthetic_workflow.py\n```\n\n## Catalog-backed query workflow\n\nIf you already published canonical metadata tables, you can read them from a\ncatalog and join them to analysis outputs directly:\n\n```python\nimport pyarrow as pa\n\nfrom iceberg_bioimage import join_catalog_image_assets_with_profiles\n\nprofiles = pa.table(\n    {\n        \"dataset_id\": [\"ds-1\"],\n        \"image_id\": [\"img-1\"],\n        \"cell_count\": [42],\n    }\n)\n\njoined = join_catalog_image_assets_with_profiles(\n    \"default\",\n    \"bioimage.cytotable\",\n    profiles,\n    chunk_index_table=\"chunk_index\",\n)\n```\n\n## Documentation\n\n- `docs/src/getting-started.md` for first-time setup\n- `docs/src/catalog-setup.md` for catalog configuration\n- `docs/src/cytomining.md` for warehouse export workflows\n- `docs/src/warehouse-spec.md` for the warehouse interoperability specification\n- `docs/src/workflow.md` for CLI-driven end-to-end examples\n\n## Troubleshooting\n\n- `DuckDB helpers require the optional duckdb dependency group`:\n  install with `pip install 'iceberg-bioimage[duckdb]'` or `uv sync --group duckdb`.\n- `Profiles do not satisfy the microscopy join contract`:\n  run `iceberg-bioimage validate-contract ...` and pass\n  `--profile-dataset-id` when `dataset_id` is missing 
but implied.\n- `Missing table: ...` for catalog-backed paths:\n  verify catalog configuration, namespace, and table names.\n\n## Architecture note\n\nThe package focuses on metadata scanning, publishing, Cytomining warehouse\nexport, validation, and joins. OME-Arrow remains the place for Arrow-native\nimage payload handling and lazy image access.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwayscience%2Ficeberg-bioimage","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwayscience%2Ficeberg-bioimage","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwayscience%2Ficeberg-bioimage/lists"}