https://github.com/lincc-frameworks/rubin-dash

DRP Afterburner for Super HATS - importing rubin catalogs to HATS
https://github.com/lincc-frameworks/rubin-dash
Last synced: 18 days ago
JSON representation
DRP Afterburner for Super HATS - importing rubin catalogs to HATS
Host: GitHub
URL: https://github.com/lincc-frameworks/rubin-dash
Owner: lincc-frameworks
License: mit
Created: 2026-04-16T19:21:11.000Z (3 months ago)
Default Branch: main
Last Pushed: 2026-06-08T09:41:44.000Z (22 days ago)
Last Synced: 2026-06-08T11:24:30.538Z (22 days ago)
Language: Python
Homepage:
Size: 111 KB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project

README

          
# rubin-dash

**D**RP **A**fterburner for **S**uper **HATS** — converts Rubin DRP outputs into

[HATS](https://hats.readthedocs.io/) catalogs suitable for use with

[lsdb](https://lsdb.readthedocs.io/).

[![Template](https://img.shields.io/badge/Template-LINCC%20Frameworks%20Python%20Project%20Template-brightgreen)](https://lincc-ppt.readthedocs.io/en/latest/)

[![PyPI](https://img.shields.io/pypi/v/rubin-dash?color=blue&logo=pypi&logoColor=white)](https://pypi.org/project/rubin-dash/)

[![GitHub Workflow Status](https://img.shields.io/github/actions/workflow/status/lincc-frameworks/rubin-dash/smoke-test.yml)](https://github.com/lincc-frameworks/rubin-dash/actions/workflows/smoke-test.yml)

[![Codecov](https://codecov.io/gh/lincc-frameworks/rubin-dash/branch/main/graph/badge.svg)](https://codecov.io/gh/lincc-frameworks/rubin-dash)

[![Read The Docs](https://img.shields.io/readthedocs/rubin-dash)](https://rubin-dash.readthedocs.io/)

## Overview

The pipeline runs a sequence of stages that read from a Butler repository and

write HATS catalogs to an output directory:

| Stage | Description                                           |

|---|-------------------------------------------------------|

| `butler` | Find catalog parquet files from the Butler repository |

| `raw_sizes` | Measure raw parquet file sizes                        |

| `import` | Import catalogs into HATS format                      |

| `postprocess` | Post-process imported catalogs                        |

| `nesting` | Build nested (light-curve) catalogs                   |

| `collections` | Generate HATS collections                             |

| `crossmatch` | Cross-match against external surveys (e.g. ZTF, PS1)  |

| `generate_json` | Generate JSON metadata for the HATS collections       |

## Setting up the environment

This pipeline requires IDAC access and is normally run on USDF SLAC nodes. It

cannot be run on the login node. It is *highly recommended* to use `tmux` or `screen` so

you can detach and reattach without losing your session. The pipeline typically

takes at least ~5h and can take closer to ~15h.

### Request a reserved node

Your connection path should look like this:

```mermaid

graph LR

    L["login node"] --> T("tmux/screen")

    T --> I["interactive node"]

    I --> R["reserved node"]

style T fill:lightblue,stroke:darkblue,stroke-width:2px

```

From an interactive node, request a reserved node:

```shell

srun --pty --exclusive --nodes=1 --time=48:00:00 \

     --partition=torino --account=rubin:commissioning bash

```

Do not exit the reserved node shell directly — use `tmux detach` or screen's `ctrl+a -> d` instead so the

job keeps running.

### Load the LSST stack

```shell

source /sdf/group/rubin/sw/loadLSST.sh

setup lsst_distrib

```

### Install rubin-dash

```shell

pip install git+https://github.com/lincc-frameworks/rubin-dash.git

```

## Running the pipeline

### 1. Create a config file

The package ships a `default_config.toml` with sensible defaults for all

catalogs, nested catalogs, collections, crossmatch surveys, and Dask settings.

Your config file is merged on top of those defaults — you only need to specify

what changes for your run.

Copy `example_config.toml` and fill in the `[run]` section. The values come

from the JIRA ticket associated with the weekly release. For example, the

collection string `LSSTCam/runs/DRP/20250417_20250921/w_2025_49/DM-53545`

breaks down as:

```toml

[run]

instrument = "LSSTCam"

repo       = "/repo/embargo"         # Butler repo path

version    = "w_2025_49"

collection = "DM-53545"

output_dir = "/sdf/data/rubin/shared/lsdb_commissioning"

run        = "20250417_20250921"      # optional — omit for releases without a run segment

```

#### Overriding stages

By default all stages run. Restrict to a subset:

```toml

[stages]

enabled = ["butler", "raw_sizes", "import", "postprocess"]

```

#### Overriding catalogs

By default all six catalogs are processed: `dia_object`, `dia_source`,

`dia_object_forced_source`, `object`, `source`, `object_forced_source`.

Restrict to a subset:

```toml

[catalogs]

enabled = ["dia_object", "object"]

```

Override settings for a specific catalog:

```toml

[catalogs.object]

chunksize = 100_000   # DimensionParquetReader batch size (default 250_000 for object)

[catalogs.object.import_args]

pixel_threshold = 500_000   # override any hats-import argument

```

Add a custom catalog not in the defaults (all fields required):

```toml

[catalogs.my_catalog]

dims            = ["tract"]

group_by        = ["tract"]

flux_columns    = []

add_mjds        = false

use_schema_file = false

chunksize       = 500_000

[catalogs.my_catalog.import_args]

ra_column       = "ra"

dec_column      = "dec"

catalog_type    = "object"

pixel_threshold = 1_000_000

```

#### Overriding nested catalogs

The defaults define two nested catalogs (`dia_object_lc` and `object_lc`).

Override settings or restrict which ones are built:

```toml

[nested]

enabled = ["object_lc"]   # omit to run all

[nested.object_lc]

pixel_threshold       = 20_000   # override any field

highest_healpix_order = 10

```

#### Overriding collections

```toml

[collections]

enabled = ["object_collection"]   # omit to run all

[collections.object_collection]

margin_threshold = 10.0

```

#### Overriding crossmatch surveys

The defaults cross-match against ZTF DR22 and PS1. Add, remove, or reconfigure:

```toml

# Disable all crossmatches by leaving surveys empty

[crossmatch]

# Or override a survey's search radius

[crossmatch.surveys.ztf_dr22]

radius_arcsec = 0.5

```

#### Overriding Dask settings

Global settings apply to all stages; stage-specific sections override them for

that stage only:

```toml

[dask]

n_workers        = 32

threads_per_worker = 1

memory_limit     = "16GB"

[dask.stages.nesting]

n_workers    = 8

memory_limit = "32GB"

```

#### Layering multiple config files

You can split settings across files and layer them at run time — later files

override earlier ones:

```shell

rubin-dash run --config base.toml --config this_week.toml --config overrides.toml

```

### 2. Run the full pipeline

```shell

rubin-dash run --config my_config.toml

```

### CLI options

```

rubin-dash run --config CONFIG [--config CONFIG ...]

               [--stages butler,import,postprocess]

               [--from-stage STAGE]

               [--catalogs dia_object,object]

               [--nestings object_lc]

               [--collections object_collection]

```

| Option | Description |

|---|---|

| `--config` | TOML config file. Repeat to layer overrides (later files win). |

| `--stages` | Comma-separated list of stages to run. |

| `--from-stage` | Run all enabled stages starting from this one. |

| `--catalogs` | Restrict to a subset of catalogs. |

| `--nestings` | Restrict to specific nested catalogs. |

| `--collections` | Restrict to specific collections. |

Examples:

```shell

# Re-run only the import and postprocess stages

rubin-dash run --config my_config.toml --stages import,postprocess

# Resume from the nesting stage onward

rubin-dash run --config my_config.toml --from-stage nesting

# Layer a base config with per-run overrides

rubin-dash run --config base.toml --config overrides.toml

```

### 3. Interactive notebook access

To open the notebooks interactively from within the processing environment:

```shell

rubin-dash notebook --port 8769

```

This starts a Jupyter server and prints the SSH tunnel command you need to run

on your laptop to forward the port. It will look something like:

```shell

ssh -J user@s3dflogin.slac.stanford.edu,user@sdfiana004 \

    -L 8769:localhost:8769 \

    user@sdfmilan005

```

### 4. Rerunning a single stage after a failure

If the pipeline fails partway through, you can rerun from a specific stage:

```shell

rubin-dash run --config my_config.toml --from-stage import

```

Or run a single stage in isolation:

```shell

rubin-dash run --config my_config.toml --stages import

```

If you need to debug interactively, the `notebooks/` directory contains a

notebook for each stage. Run them individually after confirming the environment

variables are set. If you encounter unexpected issues with upstream data, reach

out in `#dm-algorithms-pipelines` on the Rubin Observatory Slack.

## Development

```shell

conda create -n rubin-dash python=3.11

conda activate rubin-dash

pip install -e ".[dev]"

chmod +x .setup_dev.sh

./.setup_dev.sh

```
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/lincc-frameworks/rubin-dash

Awesome Lists containing this project

README