https://github.com/MultimodalUniverse/MultimodalUniverse

Large-Scale Multimodal Dataset of Astronomical Data

# Multimodal Universe: Enabling Large-Scale Machine Learning with 70TBs of Astronomical Scientific Data

Dataset on Hugging Face
[![Demo on Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MultimodalUniverse/MultimodalUniverse/blob/main/notebooks/getting_started.ipynb) [![Test](https://github.com/AstroPile/AstroPile_prototype/actions/workflows/tiny_dset_test.yml/badge.svg)](https://github.com/AstroPile/AstroPile_prototype/actions/workflows/tiny_dset_test.yml) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![All Contributors](https://img.shields.io/badge/all_contributors-28-orange.svg)](#contributors-)

## Overview

The Multimodal Universe dataset is a large-scale collection of multimodal astronomical data, including images, spectra, and light curves. It aims to enable research into foundation models for astrophysics and beyond.

![image](assets/astropile.png)

## Quick Start

All datasets can be previewed directly from our [Hugging Face hub](https://huggingface.co/MultimodalUniverse) and accessed via `load_dataset('MultimodalUniverse/dataset_name')`.
Preview datasets include ~1k examples from each survey.

```py
from datasets import load_dataset

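# Stream the dataset from the Hugging Face hub instead of downloading it first.
dset = load_dataset('MultimodalUniverse/plasticc',
                    split='train', streaming=True)

# Fetch the first example from the stream.
example = next(iter(dset))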
```
You can try this out with our [getting started notebook](https://colab.research.google.com/github/MultimodalUniverse/MultimodalUniverse/blob/main/notebooks/getting_started.ipynb)!
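Each streamed example is a dictionary, often with nested fields, so it helps to summarize its structure before writing a training loop. The helper below is a minimal sketch of one way to do that. `describe_example` is a hypothetical name, and the field names in `toy_example` are made up for illustration rather than taken from the PLAsTiCC schema.

```python
def describe_example(example, prefix=''):
    """Return one summary line per field of a (possibly nested) example dict."""
    lines = []
    for key, value in example.items():
        name = f'{prefix}{key}'
        if isinstance(value, dict):
            # Recurse into nested records and prefix their keys with the parent name.
            lines.extend(describe_example(value, prefix=f'{name}.'))
        elif hasattr(value, 'shape'):
            # Array-like objects (e.g. NumPy arrays) report their shape.
            lines.append(f'{name}: {type(value).__name__} shape={tuple(value.shape)}')
        elif isinstance(value, (list, tuple)):
            lines.append(f'{name}: {type(value).__name__} len={len(value)}')
        else:
            lines.append(f'{name}: {type(value).__name__}')
    return lines


# Stand-in example with made-up field names; in practice use `next(iter(dset))`.
toy_example = {
    'object_id': '12345',
    'lightcurve': {'time': [0.0, 1.0, 2.0], 'flux': [1.2, 1.1, 1.3]},
    'redshift': 0.3,
}
for line in describe_example(toy_example):
    print(line)
```

Passing a real streamed example in place of `toy_example` lists the fields that survey actually provides.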

## Data Access

To access the full dataset, we recommend downloading the data locally. This is necessary for using the provided cross-matching utilities.

The full dataset content is hosted at the Flatiron Institute and available either through HTTPS or through [GLOBUS](https://www.globus.org/):

- https://users.flatironinstitute.org/~flanusse/MultimodalUniverse
- https://app.globus.org/file-manager?origin_id=58a4d334-d750-454d-88a3-9d8256d091a6

GLOBUS is strongly preferred when downloading large amounts of data or a large number of files. The cross-matching utilities require a local copy of the full data in its native HDF5 format.
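For a handful of files, the HTTPS endpoint can also be scripted with the Python standard library. The following is a minimal sketch, not an official download tool: `remote_url` and `download_file` are hypothetical helpers, and the relative path in the usage comment is a placeholder, since the directory layout on the server is not described here.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import quote

BASE_URL = 'https://users.flatironinstitute.org/~flanusse/MultimodalUniverse'


def remote_url(relative_path, base_url=BASE_URL):
    """Join the base URL and a relative path, percent-encoding each segment."""
    segments = [quote(part) for part in relative_path.split('/') if part]
    return '/'.join([base_url.rstrip('/')] + segments)


def download_file(relative_path, dest_dir, base_url=BASE_URL):
    """Stream one remote file into dest_dir, mirroring its relative path."""
    target = Path(dest_dir) / relative_path
    target.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(remote_url(relative_path, base_url)) as response:
        with open(target, 'wb') as out:
            # copyfileobj copies in chunks, so large files are never held in memory.
            shutil.copyfileobj(response, out)
    return target


# Usage (placeholder path; browse the HTTPS listing to find real file names):
# download_file('plasticc/some_file.hdf5', 'data')
```

This approach handles one file per call with no retries or resume support, which is part of why GLOBUS remains the better choice for bulk transfers.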

After downloading the data, you can use Hugging Face's `datasets` library to load the data directly from your local copy. For example, to load the PLAsTiCC dataset:
```py
from datasets import load_dataset

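# Point load_dataset at the local copy instead of the hub.
dset = load_dataset('path/to/downloaded/plasticc',
                    split='train', streaming=True)
# Return fields as NumPy arrays rather than Python lists.
dset = dset.with_format('numpy')

example = next(iter(dset))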
```

## Datasets

The Multimodal Universe currently contains data from the following surveys/modalities:

| **Survey** | **Modality** | **Science Use Case** | **# samples** |
|----------------------|---------------------|----------------------|---------------|
| Legacy Surveys DR10 | Images | Galaxies | 124M |
| Legacy Surveys North | Images | Galaxies | 15M |
| HSC | Images | Galaxies | 477k |
| BTS | Images | Supernovae | 400k |
| JWST | Images | Galaxies | 300k |
| Gaia BP/RP | Spectra | Stars | 220M |
| SDSS-II | Spectra | Galaxies, Stars | 4M |
| DESI | Spectra | Galaxies | 1M |
| APOGEE SDSS-III | Spectra | Stars | 716k |
| GALAH | Spectra | Stars | 325k |
| Chandra | Spectra | Galaxies, Stars | 129k |
| VIPERS | Spectra | Galaxies | 91k |
| MaNGA SDSS-IV | Hyperspectral Image | Galaxies | 12k |
| PLAsTiCC | Time Series | Time-varying objects | 3.5M |
| TESS | Time Series | Exoplanets | 160k |
| CfA Sample | Time Series | Supernovae | 1k |
| YSE | Time Series | Supernovae | 2k |
| PS1 SNe Ia | Time Series | Supernovae | 369 |
| DES Y3 SNe Ia | Time Series | Supernovae | 248 |
| SNLS | Time Series | Supernovae | 239 |
| Foundation | Time Series | Supernovae | 180 |
| CSP SNe Ia | Time Series | Supernovae | 134 |
| Swift SNe Ia | Time Series | Supernovae | 117 |
| Gaia | Tabular | Stars | 220M |
| PROVABGS | Tabular | Galaxies | 221k |
| Galaxy10 DECaLS | Tabular | Galaxies | 15k |

We are accepting new datasets! Check out our [contribution guidelines](./CONTRIBUTING.md) for more details.

## Data License

We openly distribute the Multimodal Universe dataset under the [Creative Commons Attribution (CC BY) 4.0](https://creativecommons.org/licenses/by/4.0/) license. Note, however, that when using specific subsets, the license and conditions of use of those subsets must be respected.

## Architecture

Illustration of the methodology behind the Multimodal Universe. Domain scientists with expertise in a given astronomical survey provide data download and formatting scripts through Pull Requests. All datasets are then downloaded from their original source and made available as Hugging Face datasets that share a common data schema for each modality and its associated metadata. End users can then combine any subsets with the provided cross-matching utilities to build multimodal datasets.
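To make the cross-matching step concrete, here is a minimal sketch of positional cross-matching, assuming it means pairing objects from two surveys whose sky coordinates agree within a small radius (the usual meaning in astronomy). This is not the project's own utility: `angular_separation_deg` and `cross_match` are hypothetical names, the coordinates are made up, and the brute-force loop only suits toy catalogs.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two sky positions (haversine)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(min(1.0, a))))


def cross_match(left, right, radius_arcsec=1.0):
    """Pair each left object with its nearest right object within the radius.

    Brute force, O(len(left) * len(right)); fine for a toy catalog only.
    """
    radius_deg = radius_arcsec / 3600.0
    pairs = []
    for i, (ra_l, dec_l) in enumerate(left):
        best_j, best_sep = None, radius_deg
        for j, (ra_r, dec_r) in enumerate(right):
            sep = angular_separation_deg(ra_l, dec_l, ra_r, dec_r)
            if sep <= best_sep:
                best_j, best_sep = j, sep
        if best_j is not None:
            pairs.append((i, best_j))
    return pairs


# Toy (ra, dec) positions in degrees; the values are made up.
images = [(150.0000, 2.0000), (150.1000, 2.1000), (10.0, -5.0)]
spectra = [(150.1001, 2.1000), (150.0000, 2.0001), (200.0, 45.0)]
print(cross_match(images, spectra, radius_arcsec=1.0))  # [(0, 1), (1, 0)]
```

Each returned pair holds the indices of one matched object in the two catalogs, which is what is needed to join an image with its spectrum. At survey scale, a spatial index such as the one behind astropy's `SkyCoord.match_to_catalog_sky` would replace the double loop.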

Please see the [Design Document](https://github.com/MultimodalUniverse/MultimodalUniverse/blob/main/DESIGN.md) for more context about the project.

## Contributors

#### Full Contribution List



| **Contributor** | **Contributions** |
|-------------------------|-------------------|
| Francois Lanusse | 📆 💡 💻 |
| Liam Parker | 📆 💡 💻 |
| Micah Bowles | 📆 💡 💻 |
| mhuertascompany | 📆 💡 💻 |
| Mike Smith | 📆 💡 💻 |
| Helen Qu | 📆 💡 💻 |
| Aaron | 💡 💻 |
| Ben Boyd | 💡 💻 |
| Brian Cherinka | 💻 |
| Connor Stone, PhD | 💡 |
| David Chemaly | 💡 💻 |
| Erin Hayes | 💡 💻 |
| Henry Leung | 💻 |
| Ioana Ciucă | 🖋 |
| Jeff Shen | 💻 |
| jeraud | 💡 💻 |
| John F. Wu | 🖋 |
| CambridgeAstroStat | 🧑‍🏫 |
| Kartheik Iyer | 💻 |
| Lucas Meyer | 💻 |
| Matthew Grayling | 💡 💻 |
| Maja Jabłońska | 💻 |
| Mike Walmsley | 💡 💻 |
| Miles Cranmer | 🖋 |
| Peter Melchior | 💻 |
| Rafael Martínez-Galarza | 💻 |
| Tom Hehir | 💡 💻 |
| Shirley Ho | 🔍 🖋 |