[![Build Status](https://github.com/saalfeldlab/n5-spark/actions/workflows/build-main.yml/badge.svg)](https://github.com/saalfeldlab/n5-spark/actions/workflows/build-main.yml)

# N5 Spark

A collection of utilities for [N5](https://github.com/saalfeldlab/n5) datasets that can run on an Apache Spark cluster.

Supported operations:
* thresholding and labeling of connected components
* resaving using a different block size / data type / compression
* downsampling (isotropic / non-isotropic)
* max intensity projection
* conversion between N5 and slice TIFF series
* parallel removal of groups and datasets

N5 Spark can run on **Amazon Web Services** and **Google Cloud**; more information is available in the [wiki](https://github.com/saalfeldlab/n5-spark/wiki).

## Usage

Clone the repository with submodules:

```bash
git clone --recursive https://github.com/saalfeldlab/n5-spark.git
```

If you have already cloned the repository, run the following to fetch the submodules:
```bash
git submodule update --init --recursive
```

You can use the library in your Spark-based project or as a standalone tool. To use it in your project, add a Maven dependency and make sure that your application is compiled as a shaded jar that bundles all of its dependencies.
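
As an illustration, here is a minimal sketch of running such a shaded jar on Spark (the main class and jar name below are hypothetical placeholders for your own project, not part of n5-spark):

```bash
# Hypothetical example: submit your application's shaded jar to a local Spark master.
# com.example.MyN5App and the jar path are placeholders for your own project.
spark-submit \
  --class com.example.MyN5App \
  --master 'local[*]' \
  target/my-n5-app-1.0-shaded.jar
```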

Alternatively, to use it as a standalone tool, compile the package for the desired execution environment:

Compile for running on the Janelia cluster:

```bash
python build.py
```


While OpenJDK and Maven's Surefire plugin are affected by this joint [bug](https://stackoverflow.com/questions/53010200/maven-surefire-could-not-find-forkedbooter-class), you will need to build with:
```bash
_JAVA_OPTIONS=-Djdk.net.URLClassPath.disableClassPathURLCheck=true ./build.py
```

Compile for running on a local machine:

```bash
python build-spark-local.py
```


While OpenJDK and Maven's Surefire plugin are affected by this joint [bug](https://stackoverflow.com/questions/53010200/maven-surefire-could-not-find-forkedbooter-class), you will need to build with:
```bash
_JAVA_OPTIONS=-Djdk.net.URLClassPath.disableClassPathURLCheck=true ./build-spark-local.py
```


The scripts for starting the application are located under `startup-scripts/spark-janelia` and `startup-scripts/spark-local`, and their usage is explained below.

If running locally, you can access the Spark web UI at http://localhost:4040/ to monitor the progress of the tasks.

### N5 connected components

Run on Janelia cluster

```bash
spark-janelia/n5-connected-components.py
<number of cluster nodes>
-n <path to n5 root>
-i <input dataset>
-o <output dataset>
[-t <min threshold value>]
[-s <neighborhood shape>]
[-m <min component size in pixels>]
[-b <output block size>]
[-c <output compression scheme>]
```

Run on local machine

```bash
spark-local/n5-connected-components.py
-n <path to n5 root>
-i <input dataset>
-o <output dataset>
[-t <min threshold value>]
[-s <neighborhood shape>]
[-m <min component size in pixels>]
[-b <output block size>]
[-c <output compression scheme>]
```

Finds and labels all connected components in a binary mask extracted from the input N5 dataset and saves the relabeled dataset as a `uint64` output dataset.

Optional parameters:
* **Threshold**: min value in the input data to be included in the binary mask. The condition is `input >= threshold`. If omitted, all input values higher than 0 are included in the binary mask (`input > 0`).
* **Neighborhood shape**: specifies how pixels are grouped into connected components. There are two options:
  * Diamond (default): only direct neighbors are considered (4-neighborhood in 2D, 6-neighborhood in 3D).
  * Box (`-s box`): diagonal pixels are considered as well (8-neighborhood in 2D, 26-neighborhood in 3D). This can be used to decrease the number of trivial components.
* **Min size**: minimum size of a connected component in pixels. If specified, components containing fewer pixels are discarded from the resulting set. By default all components are kept.
* **Block size**: comma-separated list. If omitted, the block size of the input dataset is used.
* **Compression scheme**: if omitted, the compression scheme of the input dataset is used.

When the processing is complete, statistics about the extracted connected components are stored in the file `stats.txt` in the directory of the output dataset (this works only for filesystem-based N5).
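
For illustration, a hypothetical local run (the container path and dataset names below are made up):

```bash
# Hypothetical example: label connected components of a binary mask, grouping
# diagonal neighbors as well (box shape) and discarding components smaller
# than 100 pixels.
spark-local/n5-connected-components.py \
  -n /data/export.n5 \
  -i raw/mask \
  -o segmentation/labels \
  -s box \
  -m 100
```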

### N5 converter

Run on Janelia cluster

```bash
spark-janelia/n5-convert.py
<number of cluster nodes>
-ni <path to input n5 root>
-i <input dataset>
[-no <path to output n5 root>]
-o <output dataset>
[-b <output block size>]
[-c <output compression scheme>]
[-t <output data type>]
[-min <min value of input range>]
[-max <max value of input range>]
[--force to overwrite the output dataset if it already exists]
```

Run on local machine

```bash
spark-local/n5-convert.py
-ni <path to input n5 root>
-i <input dataset>
[-no <path to output n5 root>]
-o <output dataset>
[-b <output block size>]
[-c <output compression scheme>]
[-t <output data type>]
[-min <min value of input range>]
[-max <max value of input range>]
[--force to overwrite the output dataset if it already exists]
```

Resaves an N5 dataset, optionally changing any of the following dataset attributes:
* *block size*: if omitted, the block size of the input dataset is used.
* *compression scheme*: if omitted, the compression scheme of the input dataset is used.
* *data type*: if omitted, the data type of the input dataset is used.

If the output data type is specified and differs from the input data type, the values are mapped from the input value range to the output value range.

The optional `-min` and `-max` arguments specify the input data value range. If omitted, the input value range is derived from the input data type for integer types, or set to `[0,1]` for real types by default.

The output value range is derived from the output data type for integer types, or set to `[0,1]` for real types.
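
A hypothetical local run as an example (the container path and dataset names below are made up):

```bash
# Hypothetical example: resave a uint16 dataset as uint8 with gzip compression.
# With no -min/-max given, the full uint16 range [0,65535] is mapped onto the
# uint8 range [0,255].
spark-local/n5-convert.py \
  -ni /data/export.n5 \
  -i raw/s0 \
  -o raw-8bit/s0 \
  -t uint8 \
  -c gzip
```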

### N5 downsampling

Generates a single downsampled export:

* N-dimensional downsampling: performs a single downsampling step with given factors. The downsampling factors parameter is formatted as a comma-separated list, for example, `2,2,2`.

Run on Janelia cluster

```bash
spark-janelia/n5-downsample.py
<number of cluster nodes>
-n <path to n5 root>
-i <input dataset>
-o <output dataset>
-f <downsampling factors>
[-b <output block size>]
```


Run on local machine

```bash
spark-local/n5-downsample.py
-n <path to n5 root>
-i <input dataset>
-o <output dataset>
-f <downsampling factors>
[-b <output block size>]
```
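
For illustration, a hypothetical local run (the container path and dataset names below are made up):

```bash
# Hypothetical example: downsample a dataset by a factor of 2 in X and Y
# while keeping Z at full resolution.
spark-local/n5-downsample.py \
  -n /data/export.n5 \
  -i raw/s0 \
  -o raw/s1 \
  -f 2,2,1
```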


* N-dimensional label downsampling: performs a single downsampling step with given factors. The downsampling factors parameter is formatted as a comma-separated list, for example, `2,2,2`.

The most frequent value is used instead of averaging. In case of multiple values with the same frequency, the smallest value among them is selected.

Run on Janelia cluster

```bash
spark-janelia/n5-downsample-label.py
<number of cluster nodes>
-n <path to n5 root>
-i <input dataset>
-o <output dataset>
-f <downsampling factors>
[-b <output block size>]
```


Run on local machine

```bash
spark-local/n5-downsample-label.py
-n <path to n5 root>
-i <input dataset>
-o <output dataset>
-f <downsampling factors>
[-b <output block size>]
```

* N-dimensional offset downsampling: performs a single downsampling step with given factors and offset. The downsampling factors and offset parameters are formatted as comma-separated lists, for example, `2,2,2`.

Run on Janelia cluster

```bash
spark-janelia/n5-downsample-offset.py
<number of cluster nodes>
-n <path to n5 root>
-i <input dataset>
-o <output dataset>
-f <downsampling factors>
-s <offset>
[-b <output block size>]
```


Run on local machine

```bash
spark-local/n5-downsample-offset.py
-n <path to n5 root>
-i <input dataset>
-o <output dataset>
-f <downsampling factors>
-s <offset>
[-b <output block size>]
```

Generates a scale pyramid:

* N-dimensional scale pyramid: generates a scale pyramid with given factors. The downsampling factors parameter specifies relative scaling between any two consecutive scale levels in the output scale pyramid, and is formatted as a comma-separated list, for example, `2,2,2`.

Run on Janelia cluster

```bash
spark-janelia/n5-scale-pyramid.py
<number of cluster nodes>
-n <path to n5 root>
-i <input dataset>
-f <downsampling factors>
[-o <output group>]
```


Run on local machine

```bash
spark-local/n5-scale-pyramid.py
-n <path to n5 root>
-i <input dataset>
-f <downsampling factors>
[-o <output group>]
```
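
As an example, a hypothetical local run (the container path and dataset name below are made up):

```bash
# Hypothetical example: build a scale pyramid s1, s2, ... next to the input
# dataset, halving each dimension at every level.
spark-local/n5-scale-pyramid.py \
  -n /data/export.n5 \
  -i raw/s0 \
  -f 2,2,2
```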


* N-dimensional offset scale pyramid: generates a scale pyramid with given factors and half-pixel offset applied at every scale level. The downsampling factors parameter specifies relative scaling between any two consecutive scale levels in the output scale pyramid, and is formatted as a comma-separated list, for example, `2,2,2`.

Run on Janelia cluster

```bash
spark-janelia/n5-scale-pyramid-offset.py
<number of cluster nodes>
-n <path to n5 root>
-i <input dataset>
-f <downsampling factors>
-s <offset>
[-o <output group>]
```


Run on local machine

```bash
spark-local/n5-scale-pyramid-offset.py
-n <path to n5 root>
-i <input dataset>
-f <downsampling factors>
-s <offset>
[-o <output group>]
```


* Non-isotropic scale pyramid: generates a scale pyramid of a dataset with different resolution in X/Y and Z. Depending on whether the resolution is better in X/Y than in Z or vice versa, the downsampling factors are adjusted to make the scale levels as close to isotropic as possible. The pixel resolution parameter is given in microns, formatted as a comma-separated list, for example, `0.097,0.097,0.18`.

If the optional argument `-p` is provided, all downsampling factors are forced to be powers of two. This mode is faster as it does not require any intermediate downsampling steps.

Only the first 3 dimensions of the input data are downsampled. If the input data is of higher dimensionality than 3, the rest of the dimensions are written out as is.

Run on Janelia cluster

```bash
spark-janelia/n5-scale-pyramid-nonisotropic.py
<number of cluster nodes>
-n <path to n5 root>
-i <input dataset>
-r <pixel resolution>
[-o <output group>]
[-p]
```


Run on local machine

```bash
spark-local/n5-scale-pyramid-nonisotropic.py
-n <path to n5 root>
-i <input dataset>
-r <pixel resolution>
[-o <output group>]
[-p]
```
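
For illustration, a hypothetical local run (the container path and dataset name below are made up):

```bash
# Hypothetical example: build a near-isotropic scale pyramid for a volume with
# 0.097 x 0.097 x 0.18 micron voxels, forcing power-of-two downsampling factors.
spark-local/n5-scale-pyramid-nonisotropic.py \
  -n /data/export.n5 \
  -i raw/s0 \
  -r 0.097,0.097,0.18 \
  -p
```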

If the output group argument is omitted for scale pyramid exporters, the resulting datasets will be stored in the same group as the input dataset. The naming scheme for the lower-resolution datasets is `s1`, `s2`, `s3`, and so on.

If the block size argument is omitted, the resulting dataset will have the same block size as the input dataset. Downsampling factors are written into the attributes metadata of the lower-resolution datasets.

### Slice TIFF series to N5 converter

Run on Janelia cluster

```bash
spark-janelia/slice-tiff-to-n5.py
<number of cluster nodes>
-i <input directory with slice TIFF images>
-n <path to output n5 root>
-o <output dataset>
-b <output block size>
[-c <output compression scheme>]
```

Run on local machine

```bash
spark-local/slice-tiff-to-n5.py
-i <input directory with slice TIFF images>
-n <path to output n5 root>
-o <output dataset>
-b <output block size>
[-c <output compression scheme>]
```

The tool lists all slice TIFF images contained in the input directory and converts them into a 3D N5 dataset.

The slice images are automatically sorted by their filenames in natural order, such that `1.tif` and `2.tif` are placed before `10.tif`.

The block size can be specified as three comma-separated values, or as a single value as a shortcut for cube-shaped blocks.

The input images are assumed to be XY slices.
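
A hypothetical local run as an example (the directories and dataset name below are made up):

```bash
# Hypothetical example: assemble a directory of TIFF slices into a 3D N5
# dataset with 128x128x64 blocks.
spark-local/slice-tiff-to-n5.py \
  -i /data/slices \
  -n /data/export.n5 \
  -o raw/s0 \
  -b 128,128,64
```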

### N5 to slice TIFF series converter

Run on Janelia cluster

```bash
spark-janelia/n5-to-slice-tiff.py
<number of cluster nodes>
-n <path to n5 root>
-i <input dataset>
-o <output directory>
[-d <slice dimension>]
[-c <tiff compression>]
[-f <filename format>]
[--fill <fill value>]
```

Run on local machine

```bash
spark-local/n5-to-slice-tiff.py
-n <path to n5 root>
-i <input dataset>
-o <output directory>
[-d <slice dimension>]
[-c <tiff compression>]
[-f <filename format>]
[--fill <fill value>]
```

The tool converts a given dataset into a slice TIFF series and saves the slices in the specified output folder.

The slice dimension can be specified as `-d x`, `-d y`, or `-d z` (default) to generate YZ, XZ, or XY slices respectively.

The filename format is expected to contain a single placeholder for an integer representing the index of the slice, such as:
* `-f slice%dz.tif` to produce filenames such as *slice0z.tif*, *slice1z.tif*, *slice2z.tif*, etc.
* `-f slice%03dz.tif` to produce filenames such as *slice000z.tif*, *slice001z.tif*, *slice002z.tif*, etc.

The default pattern is `%d.tif`, so by default the resulting filenames are simply *0.tif*, *1.tif*, *2.tif*, etc.

If the input dataset is sparse and some of the N5 blocks do not exist, the corresponding empty space is filled with 0 in the TIFF images by default. The `--fill` parameter can be used to change this fill value.

Output TIFF images are written uncompressed by default. LZW compression can be enabled by supplying `-c lzw`.

**WARNING:** the LZW compressor can be very slow. It is not recommended for general use unless saving disk space is crucial.
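
For illustration, a hypothetical local run (the container path, dataset name, and output directory below are made up):

```bash
# Hypothetical example: export XY slices named slice000z.tif, slice001z.tif, ...
spark-local/n5-to-slice-tiff.py \
  -n /data/export.n5 \
  -i raw/s0 \
  -o /data/slices \
  -f slice%03dz.tif
```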

### N5 max intensity projection

Run on Janelia cluster

```bash
spark-janelia/n5-mips.py
<number of cluster nodes>
-n <path to n5 root>
-i <input dataset>
-o <output directory>
[-m <MIP step>]
[-c <tiff compression>]
```

Run on local machine

```bash
spark-local/n5-mips.py
-n <path to n5 root>
-i <input dataset>
-o <output directory>
[-m <MIP step>]
[-c <tiff compression>]
```

The tool generates max intensity projections in X/Y/Z directions and saves them as TIFF images in the specified output folder.

By default the entire volume is used to create a single MIP in each of the X/Y/Z directions. You can specify a MIP step as the number of cells included in a single MIP along each dimension (e.g. `-m 5,5,3`).

Output TIFF images are written uncompressed by default. LZW compression can be enabled by supplying `-c lzw`.

**WARNING:** the LZW compressor can be very slow. It is not recommended for general use unless saving disk space is crucial.
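
As an example, a hypothetical local run (the container path, dataset name, and output directory below are made up):

```bash
# Hypothetical example: generate MIPs, including 5 cells in X and Y and 3 cells
# in Z per projection.
spark-local/n5-mips.py \
  -n /data/export.n5 \
  -i raw/s0 \
  -o /data/mips \
  -m 5,5,3
```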

### N5 remove

Run on Janelia cluster

```bash
spark-janelia/n5-remove.py
<number of cluster nodes>
-n <path to n5 root>
-i <input group or dataset>
```

Run on local machine

```bash
spark-local/n5-remove.py
-n <path to n5 root>
-i <input group or dataset>
```

The tool removes a group or dataset, parallelizing over the inner groups. This is typically much faster than deleting the group on a single machine, in particular when removing groups with many nested groups and/or N5 blocks.
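
For illustration, a hypothetical local run (the container path and dataset name below are made up):

```bash
# Hypothetical example: remove an obsolete dataset in parallel.
spark-local/n5-remove.py \
  -n /data/export.n5 \
  -i raw/old-export
```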