https://github.com/wayscience/iceberg-bioimage

A format-agnostic framework for cataloging and querying bioimaging data (Parquet, OME-Zarr, OME-TIFF) with Apache Iceberg

Host: GitHub
URL: https://github.com/wayscience/iceberg-bioimage
Owner: WayScience
License: bsd-3-clause
Created: 2026-03-28T12:31:00.000Z (3 months ago)
Default Branch: main
Last Pushed: 2026-04-21T20:15:17.000Z (about 2 months ago)
Last Synced: 2026-04-21T20:42:57.315Z (about 2 months ago)
Language: Python
Homepage: https://wayscience.github.io/iceberg-bioimage/
Size: 2.77 MB
Stars: 1
Watchers: 0
Forks: 1
Open Issues: 1
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Citation: CITATION.cff


# iceberg-bioimage

[![Software DOI badge](https://zenodo.org/badge/DOI/10.5281/zenodo.19672521.svg)](https://doi.org/10.5281/zenodo.19672521)

[![PyPI - Version](https://img.shields.io/pypi/v/iceberg-bioimage)](https://pypi.org/project/iceberg-bioimage/)

[![Build Status](https://github.com/wayscience/iceberg-bioimage/actions/workflows/run-tests.yml/badge.svg?branch=main)](https://github.com/wayscience/iceberg-bioimage/actions/workflows/run-tests.yml?query=branch%3Amain)

[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)

[![uv](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/uv/main/assets/badge/v0.json)](https://github.com/astral-sh/uv)

`iceberg-bioimage` is a Python package for cataloging bioimaging metadata with Apache Iceberg and exporting Cytomining-compatible warehouse layouts.

It is designed for teams that want:

- Iceberg as the control plane for cataloging, schemas, joins, and snapshots.

- Cytomining-compatible Parquet warehouses as a first-class export target.

- Flexible image data planes, including Zarr, OME-TIFF, and OME-Arrow-centered workflows.

- Adapters that normalize source formats into a single `ScanResult` model.

- Integration with external execution/query tools such as DuckDB, xarray, and tifffile.

## Key capabilities

- Scan supported source stores, including Zarr and OME-TIFF, into canonical `ScanResult` objects.

- Summarize scanned datasets into user-facing `DatasetSummary` objects.

- Publish `image_assets` and `chunk_index` metadata tables with PyIceberg.

- Ingest one or more datasets into CytoTable-compatible Iceberg warehouses.

- Export new or existing datasets into Cytomining-compatible Parquet warehouses.

- Validate profile tables against the microscopy join contract.

- Join scanned image metadata to profile tables through a simple top-level API.

- Query canonical metadata through optional DuckDB helpers.

- Load catalog-backed metadata tables into Arrow for downstream joins.

## Project layout

```text
src/iceberg_bioimage/
  __init__.py
  api.py
  cli.py
  adapters/
  integrations/
  models/
  publishing/
  validation/
```

## Dependencies

Core runtime dependencies include:

- `pyarrow` for Arrow/Parquet table operations

- `pyiceberg` for catalog/table publishing

- `tifffile` for OME-TIFF metadata scanning when OME-TIFF sources are used

- `zarr` for Zarr metadata scanning

Optional integration groups:

- `duckdb` for query helpers and examples

- `ome-arrow` for Arrow-native tabular image payloads and lazy image access

## Getting started

- If you want a catalog-free first run, start with Cytomining export:

  `iceberg-bioimage export-cytomining --warehouse-root warehouse-root data/experiment.zarr`

- If you want Iceberg-backed publishing, configure a PyIceberg catalog first.

- For step-by-step setup, see `docs/src/getting-started.md` and `docs/src/catalog-setup.md`.

## Zarr support

`iceberg-bioimage` keeps the user-facing API simple: use `scan_store(...)` for
both local Zarr v2 stores and local Zarr v3 metadata stores.

- Zarr v2 arrays are scanned through the `zarr` Python package.

- Local Zarr v3 stores are scanned from `zarr.json` metadata without requiring
  a separate API.

- Summaries report the storage variant as `zarr-v2` or `zarr-v3`.

- The base package allows either Zarr 2 or Zarr 3 runtimes so that optional
  forward-facing integrations can coexist in the same environment.
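The distinction between the two storage variants can be illustrated with a minimal, standard-library-only sketch. It relies on the on-disk convention that Zarr v3 nodes carry a `zarr.json` document while Zarr v2 nodes carry `.zgroup` or `.zarray` files. The `detect_zarr_variant` helper is hypothetical and is not part of `iceberg-bioimage`; how `scan_store(...)` actually decides may differ.

```python
import json
import tempfile
from pathlib import Path


def detect_zarr_variant(store_path):
    """Return "zarr-v3", "zarr-v2", or None for a local directory store.

    Hypothetical helper for illustration only; not the package's API.
    """
    root = Path(store_path)
    # Zarr v3 keeps node metadata in a single zarr.json document.
    if (root / "zarr.json").is_file():
        return "zarr-v3"
    # Zarr v2 marks groups with .zgroup and arrays with .zarray.
    if (root / ".zgroup").is_file() or (root / ".zarray").is_file():
        return "zarr-v2"
    return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Build two tiny metadata-only stores to exercise the helper.
        v2 = Path(tmp) / "v2.zarr"
        v2.mkdir()
        (v2 / ".zgroup").write_text(json.dumps({"zarr_format": 2}))

        v3 = Path(tmp) / "v3.zarr"
        v3.mkdir()
        (v3 / "zarr.json").write_text(
            json.dumps({"zarr_format": 3, "node_type": "group"})
        )

        print(detect_zarr_variant(v2), detect_zarr_variant(v3))
```

The sketch only inspects the store root; nested arrays and remote stores are out of scope here.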

## Quickstart

```python
from iceberg_bioimage import (
    export_store_to_cytomining_warehouse,
    ingest_stores_to_warehouse,
    join_profiles_with_store,
    register_store,
    summarize_store,
    validate_microscopy_profile_table,
)

registration = register_store(
    "data/experiment.zarr",
    "default",
    "bioimage.cytotable",
)
print(registration.to_dict())

summary = summarize_store("data/experiment.zarr")
print(summary.to_dict())

contract = validate_microscopy_profile_table("data/cells.parquet")
print(contract.is_valid)

# Requires the optional DuckDB integration:
#   pip install 'iceberg-bioimage[duckdb]'
joined = join_profiles_with_store("data/experiment.zarr", "data/cells.parquet")
print(joined.num_rows)

warehouse = ingest_stores_to_warehouse(
    ["data/experiment-a.zarr", "data/experiment-b.zarr"],
    "default",
    "bioimage.cytotable",
)
print(warehouse.to_dict())

cytomining_export = export_store_to_cytomining_warehouse(
    "data/experiment-a.zarr",
    "warehouse-root",
    profiles="data/cells.parquet",
    profile_dataset_id="experiment-a",
)
print(cytomining_export.to_dict())
```

```bash
iceberg-bioimage scan data/experiment.zarr
iceberg-bioimage summarize data/experiment.zarr
iceberg-bioimage register --catalog default --namespace bioimage.cytotable data/experiment.zarr
iceberg-bioimage ingest --catalog default --namespace bioimage.cytotable data/experiment-a.zarr data/experiment-b.zarr
iceberg-bioimage export-cytomining --warehouse-root warehouse-root data/experiment.zarr
iceberg-bioimage publish-chunks --catalog default --namespace bioimage.cytotable data/experiment.zarr
iceberg-bioimage register --catalog default --namespace bioimage.cytotable --publish-chunks data/experiment.zarr
iceberg-bioimage validate-contract data/cells.parquet
iceberg-bioimage join-profiles data/experiment.zarr data/cells.parquet --output joined.parquet
```

Example scripts in the repository:

- `examples/quickstart.py` for a minimal scan, publish, and validation script

- `examples/catalog_duckdb.py` for a catalog-backed query workflow

- `examples/synthetic_workflow.py` for a self-contained local workflow

Install optional integrations with:

```bash
pip install 'iceberg-bioimage[duckdb]'
pip install 'iceberg-bioimage[ome-arrow]'
```

## DuckDB helpers

DuckDB is supported as an optional integration layer, not as a required engine.

The join helpers also accept common `pycytominer` and `coSMicQC`-style
`Metadata_*` aliases for `dataset_id`, `image_id`, `plate_id`, `well_id`, and
`site_id`. If a profile table is missing `dataset_id` but all rows belong to
one dataset, pass `profile_dataset_id=...` to the high-level join helpers.

```python
import pyarrow as pa

from iceberg_bioimage import join_image_assets_with_profiles, query_metadata_table

image_assets = pa.table(
    {
        "dataset_id": ["ds-1"],
        "image_id": ["img-1"],
        "array_path": ["0"],
        "uri": ["data/example.zarr"],
    }
)

profiles = pa.table(
    {
        "dataset_id": ["ds-1"],
        "image_id": ["img-1"],
        "cell_count": [42],
    }
)

joined = join_image_assets_with_profiles(image_assets, profiles)

filtered = query_metadata_table(
    joined,
    filters=[("cell_count", ">", 10)],
)
```

Install the optional integration with `uv sync --group duckdb`.
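To make the `Metadata_*` alias handling and the `profile_dataset_id` fallback described in this section more concrete, here is a rough sketch in plain Python. The alias names in `ALIASES` and the `normalize_profile_columns` function are assumptions made for illustration; the package's real alias list and internals are not shown in this README and may differ.

```python
# Hypothetical alias map; the package's actual accepted aliases may differ.
ALIASES = {
    "Metadata_dataset_id": "dataset_id",
    "Metadata_image_id": "image_id",
    "Metadata_Plate": "plate_id",
    "Metadata_Well": "well_id",
    "Metadata_Site": "site_id",
}


def normalize_profile_columns(columns, profile_dataset_id=None):
    """Rename aliased columns and optionally fill a constant dataset_id.

    `columns` maps column names to equal-length lists of values.
    """
    normalized = {}
    for name, values in columns.items():
        canonical = ALIASES.get(name, name)
        if canonical in normalized:
            raise ValueError(
                f"duplicate column after alias resolution: {canonical}"
            )
        normalized[canonical] = list(values)

    # Mirror the documented fallback: when dataset_id is absent but all rows
    # belong to one dataset, fill the column with the supplied identifier.
    if "dataset_id" not in normalized and profile_dataset_id is not None:
        num_rows = len(next(iter(normalized.values()), []))
        normalized["dataset_id"] = [profile_dataset_id] * num_rows
    return normalized


if __name__ == "__main__":
    profiles = {"Metadata_image_id": ["img-1", "img-2"], "cell_count": [42, 7]}
    print(normalize_profile_columns(profiles, profile_dataset_id="ds-1"))
```

Raising on a collision (for example, both `image_id` and `Metadata_image_id` present) is a design choice of this sketch, not documented package behavior.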

## Cytomining warehouse export

The package supports Cytomining interoperability as a primary workflow.
Besides publishing canonical metadata to Iceberg, it can materialize a
Parquet-backed warehouse root that tools like `pycytominer` can consume
directly.

```python
from iceberg_bioimage import export_store_to_cytomining_warehouse

result = export_store_to_cytomining_warehouse(
    "data/experiment.zarr",
    "warehouse-root",
    profiles="data/profiles.parquet",
    profile_dataset_id="experiment",
)
print(result.to_dict())
```

This writes one or more of:

- `images/image_assets/`

- `images/chunk_index/`

- `profiles/joined_profiles/`

It can also append downstream Cytomining tables into the same warehouse root,
using namespaces that match table semantics, for example:

- `profiles/pycytominer_profiles/`

- `quality_control/cosmicqc_profiles/`
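After an export, it can be handy to check which of the table directories listed above were actually written. The sketch below does this with `pathlib` only. The `list_warehouse_tables` helper is hypothetical, and it assumes that each table directory holds files with a `.parquet` suffix somewhere beneath it; the real on-disk layout is defined by the warehouse specification in `docs/src/warehouse-spec.md`.

```python
import tempfile
from pathlib import Path

# Relative table directories named in this README.
EXPECTED_TABLES = (
    "images/image_assets",
    "images/chunk_index",
    "profiles/joined_profiles",
)


def list_warehouse_tables(warehouse_root, expected=EXPECTED_TABLES):
    """Map each expected table directory to its Parquet files, or None.

    Hypothetical helper; assumes files end in ".parquet".
    """
    root = Path(warehouse_root)
    found = {}
    for relative in expected:
        table_dir = root / relative
        if not table_dir.is_dir():
            # Table was not written by this export.
            found[relative] = None
            continue
        found[relative] = sorted(
            p.relative_to(table_dir).as_posix()
            for p in table_dir.rglob("*.parquet")
        )
    return found


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Simulate a warehouse root containing a single table.
        table = Path(tmp) / "images" / "image_assets"
        table.mkdir(parents=True)
        (table / "part-0.parquet").touch()
        print(list_warehouse_tables(tmp))
```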

## OME-Arrow helpers

OME-Arrow is available as an optional forward-facing integration for tabular
image payloads stored in Arrow-compatible formats.

Projects may also choose an OME-Arrow-first workflow for source image handling.

```python
from iceberg_bioimage import create_ome_arrow, scan_ome_arrow

oa = create_ome_arrow("image.ome.tiff")
lazy_oa = scan_ome_arrow("image.ome.parquet")
```

Install it with `uv sync --group ome-arrow` or
`pip install 'iceberg-bioimage[ome-arrow]'`.

## Local synthetic workflow

For a catalog-free onboarding path, `examples/synthetic_workflow.py` creates a
small Zarr store and profile table, validates the join contract, derives
canonical metadata rows, and joins them with the optional DuckDB helpers.

Run it with:

```bash
uv run --group duckdb python examples/synthetic_workflow.py
```

## Catalog-backed query workflow

If you have already published canonical metadata tables, you can read them
from a catalog and join them to analysis outputs directly:

```python
import pyarrow as pa

from iceberg_bioimage import join_catalog_image_assets_with_profiles

profiles = pa.table(
    {
        "dataset_id": ["ds-1"],
        "image_id": ["img-1"],
        "cell_count": [42],
    }
)

joined = join_catalog_image_assets_with_profiles(
    "default",
    "bioimage.cytotable",
    profiles,
    chunk_index_table="chunk_index",
)
```

## Documentation

- `docs/src/getting-started.md` for first-time setup

- `docs/src/catalog-setup.md` for catalog configuration

- `docs/src/cytomining.md` for warehouse export workflows

- `docs/src/warehouse-spec.md` for the warehouse interoperability specification

- `docs/src/workflow.md` for CLI-driven end-to-end examples

## Troubleshooting

- `DuckDB helpers require the optional duckdb dependency group`:
  install with `pip install 'iceberg-bioimage[duckdb]'` or `uv sync --group duckdb`.

- `Profiles do not satisfy the microscopy join contract`:
  run `iceberg-bioimage validate-contract ...` and pass
  `--profile-dataset-id` when `dataset_id` is missing but implied.

- `Missing table: ...` for catalog-backed paths:
  verify catalog configuration, namespace, and table names.
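When the first error above appears, a quick way to see which optional integrations are importable in the current environment is to probe for them with the standard library. This is a minimal sketch: the `check_optional_integrations` helper is hypothetical, and the import names (`duckdb` for the DuckDB extra, `ome_arrow` for the OME-Arrow extra) are assumptions about how those distributions are imported.

```python
import importlib.util

# Extra name -> assumed top-level import name (assumptions, see lead-in).
OPTIONAL_MODULES = {
    "duckdb": "duckdb",
    "ome-arrow": "ome_arrow",
}


def check_optional_integrations(modules=OPTIONAL_MODULES):
    """Report which optional modules can be located without importing them.

    Only top-level module names are supported by this sketch.
    """
    return {
        extra: importlib.util.find_spec(module) is not None
        for extra, module in modules.items()
    }


if __name__ == "__main__":
    for extra, available in check_optional_integrations().items():
        status = "available" if available else "missing"
        print(f"{extra}: {status}")
```

Any extra reported as missing can be installed with the `pip install 'iceberg-bioimage[...]'` commands shown earlier.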

## Architecture note

The package focuses on metadata scanning, publishing, Cytomining warehouse
export, validation, and joins. OME-Arrow remains the place for Arrow-native
image payload handling and lazy image access.
