https://github.com/vmikk/phylonext

A pipeline for phylogenetic diversity analysis of GBIF-mediated data
Host: GitHub
URL: https://github.com/vmikk/phylonext
Owner: vmikk
License: mit
Created: 2022-02-09T11:21:45.000Z (about 4 years ago)
Default Branch: main
Last Pushed: 2025-05-30T09:52:34.000Z (9 months ago)
Last Synced: 2025-05-30T12:40:39.424Z (9 months ago)
Topics: beta-diversity, biodiverse, docker, endemism, gbif, nextflow, phylodiversity, phylogenetic-diversity, r, randomisations, singularity
Language: R
Homepage: https://phylonext.github.io
Size: 18.1 MB
Stars: 13
Watchers: 4
Forks: 1
Open Issues: 12
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE

# PhyloNext - PD (Phylogenetic Diversity) in the cloud

![GitHub (latest release)](https://img.shields.io/github/v/release/vmikk/PhyloNext?label=GitHub%20release)

[![Nextflow](https://img.shields.io/badge/Nextflow%20DSL2-%E2%89%A522.10.0-23aa62.svg?labelColor=000000)](https://www.nextflow.io/)

[![run with docker](https://img.shields.io/badge/run%20with-docker-0db7ed?labelColor=000000&logo=docker)](https://www.docker.com/)

[![run with singularity](https://img.shields.io/badge/run%20with-singularity-blue?style=flat&logo=singularity)](https://sylabs.io/docs/)

[![GitHub license](https://img.shields.io/github/license/vmikk/PhyloNext)](https://github.com/vmikk/PhyloNext/blob/main/LICENSE)  

CI/CD status:

[![Nextflow (full pipeline)](https://github.com/vmikk/PhyloNext/actions/workflows/Nextflow_test.yml/badge.svg)](https://github.com/vmikk/PhyloNext/actions/workflows/Nextflow_test.yml)

[![OToL](https://github.com/vmikk/PhyloNext/actions/workflows/OToL_test.yml/badge.svg)](https://github.com/vmikk/PhyloNext/actions/workflows/OToL_test.yml)

[![Biodiverse](https://github.com/vmikk/PhyloNext/actions/workflows/Biodiverse_test.yml/badge.svg)](https://github.com/vmikk/PhyloNext/actions/workflows/Biodiverse_test.yml)  

[![DOI - 10.1186/s12862-024-02256-9](https://img.shields.io/badge/DOI-10.1186%2Fs12862--024--02256--9-24B064)](https://bmcecolevol.biomedcentral.com/articles/10.1186/s12862-024-02256-9)

[![DOI](https://zenodo.org/badge/457327826.svg)](https://zenodo.org/badge/latestdoi/457327826)

PhyloNext is an automated pipeline for the analysis of phylogenetic diversity using [GBIF occurrence data](https://www.gbif.org/occurrence/search?occurrence_status=present), species phylogenies from [Open Tree of Life](https://tree.opentreeoflife.org), and [Biodiverse software](https://shawnlaffan.github.io/biodiverse/).

## Introduction

This pipeline brings together two critical research data infrastructures, the Global Biodiversity Information Facility [(GBIF)](https://www.gbif.org/) and Open Tree of Life [(OToL)](https://tree.opentreeoflife.org), to make them more accessible to non-experts.

The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool that runs tasks across multiple compute infrastructures in a portable manner. It uses [Docker](https://www.docker.com/) containers, which makes installation trivial and results highly reproducible. The [Nextflow DSL2](https://www.nextflow.io/docs/latest/dsl2.html) implementation of this pipeline uses one container per process, which makes it much easier to maintain and update software dependencies.

The pipeline can be launched in a cloud environment (e.g., [Microsoft Azure](https://azure.microsoft.com/en-us/), [Amazon Web Services (AWS)](https://aws.amazon.com/), or [Google Cloud](https://cloud.google.com/)).

## Pipeline summary

1. Filtering of GBIF species occurrences for various taxonomic clades and geographic areas

2. Removal of non-terrestrial records and spatial outliers (using density-based clustering)

3. Preparation of a phylogenetic tree (currently, only pre-constructed phylogenetic trees are available; with the update of OToL, phylogenetic trees will be downloaded automatically using the API) and name-matching with GBIF species keys

4. Spatial binning of species occurrences using Uber’s H3 system (hexagonal hierarchical spatial index)

5. Estimation of phylogenetic diversity and endemism indices using the [Biodiverse program](https://shawnlaffan.github.io/biodiverse/)

6. Visualization of the obtained results

## Quick Start

An example command to run the pipeline:

```bash
nextflow run vmikk/phylonext -r main \
  --input "/mnt/GBIF/Parquet/2022-01-01/occurrence.parquet/" \
  --classis "Mammalia" --family  "Felidae,Canidae" \
  --country "DE,PL,CZ"  \
  --minyear 2000  \
  --dbscan true  \
  --phytree $(realpath "${HOME}/.nextflow/assets/vmikk/phylonext/test_data/phy_trees/Mammals.nwk") \
  --iterations 100  \
  -resume
```
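For analyses with many options, the same settings can be kept in a parameter file and passed with the `-params-file` option listed in the help message. The sketch below writes a hypothetical `Mammals.yaml` that mirrors the example above, then prints the matching launch command instead of running it. The file name, the `results` output directory, and the choice of the `docker` profile are assumptions made for illustration.

```bash
#!/usr/bin/env bash
set -eo pipefail

# Hypothetical parameter file; keys are pipeline options without the leading "--"
PARAMS_FILE="Mammals.yaml"

cat > "${PARAMS_FILE}" <<EOF
input: "/mnt/GBIF/Parquet/2022-01-01/occurrence.parquet/"
outdir: "results"          # assumed output location
classis: "Mammalia"
family: "Felidae,Canidae"
country: "DE,PL,CZ"
minyear: 2000
dbscan: true
phytree: "${HOME}/.nextflow/assets/vmikk/phylonext/test_data/phy_trees/Mammals.nwk"
iterations: 100
EOF

# Assemble the launch command as an array so that each argument stays intact
CMD=(nextflow run vmikk/phylonext -r main -profile docker -params-file "${PARAMS_FILE}" -resume)

# Print the command for review; to launch it, run: "${CMD[@]}"
printf '%s ' "${CMD[@]}"; echo
```

Keeping the parameters in a file makes a run easier to repeat and to share alongside the results.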

## Web GUI

To make the PhyloNext pipeline easier to explore, [Thomas Stjernegaard Jeppesen](https://github.com/thomasstjerne) has developed a user-friendly, web-based graphical user interface (GUI).

The GUI is available at [https://phylonext.gbif.org/](https://phylonext.gbif.org/).

**NB!** To access the GUI, users must have a GBIF user account. To register an account, please visit https://www.gbif.org/.

## Documentation

The PhyloNext pipeline comes with documentation about the pipeline usage at [https://phylonext.github.io/](https://phylonext.github.io/).

The main pipeline parameters and outputs are described here:

- [parameters](https://phylonext.github.io/parameters/)

- [output](https://phylonext.github.io/outputs/)

To show a help message, run `nextflow run vmikk/phylonext -r main --help`.

```
=====================================================================
PhyloNext: GBIF phylogenetic diversity pipeline :  Version 1.4.0
=====================================================================

Pipeline Usage:
To run the pipeline, enter the following in the command line:
    nextflow run vmikk/phylonext -r main --input ... --outdir ...

Options:

REQUIRED:
    --input               Path to the directory with parquet files (GBIF occurrence dump)
    --outdir              The output directory where the results will be saved

OPTIONAL:
    --phylum              Phylum to analyze (multiple comma-separated values allowed); e.g., "Chordata"
    --classis             Class to analyze (multiple comma-separated values allowed); e.g., "Mammalia"
    --order               Order to analyze (multiple comma-separated values allowed); e.g., "Carnivora"
    --family              Family to analyze (multiple comma-separated values allowed); e.g., "Felidae,Canidae"
    --genus               Genus to analyze (multiple comma-separated values allowed); e.g., "Felis,Canis,Lynx"
    --specieskeys         Custom list of GBIF specieskeys (file with a single column, with header)
    --phytree             Custom phylogenetic tree
    --taxgroup            Specific taxonomy group in Open Tree of Life (default, "All_life")
    --phylabels           Type of tip labels on a phylogenetic tree ("OTT" or "Latin")
    --maxage              Manually assign root age for a tree obtained from Open Tree of Life; e.g., 127
    --phyloonly           Prune Open Tree tips for which there are no phylogenetic inputs; logical, default, false
    --country             Country code, ISO 3166 (multiple comma-separated values allowed); e.g., "DE,PL,CZ"
    --latmin              Minimum latitude of species occurrences (decimal degrees); e.g., 5.1
    --latmax              Maximum latitude of species occurrences (decimal degrees); e.g., 15.5
    --lonmin              Minimum longitude of species occurrences (decimal degrees); e.g., 47.0
    --lonmax              Maximum longitude of species occurrences (decimal degrees); e.g., 55.5
    --minyear             Minimum year of record's occurrences; default, 1945
    --maxyear             Maximum year of record's occurrences; default, none
    --coordprecision      Coordinate precision threshold (less than maximum allowed value; default, 0.1)
    --coorduncertainty    Maximum allowed coordinate uncertainty, meters (default, 10000)
    --coorduncertaintyexclude Black list of coordinate uncertainty values (default, "301,3036,999,9999")
    --basisofrecordinclude Basis of record to include from the data; e.g., "PRESERVED_SPECIMEN"
    --basisofrecordexclude Basis of record to exclude from the data; e.g., "FOSSIL_SPECIMEN,LIVING_SPECIMEN"
    --polygon             Custom area of interest (a file with polygons in GeoPackage format)
    --wgsrpd              Polygons of World Geographical Regions; e.g., "pipeline_data/WGSRPD.RData"
    --regions             Names of World Geographical Regions; e.g., "L1_EUROPE,L1_ASIA_TEMPERATE"
    --noextinct           File with extinct species specieskeys for their removal (file with a single column, with header)
    --excludehuman        Logical, exclude genus "Homo" from occurrence data (default, true)
    --roundcoords         Numeric, round spatial coordinates to N decimal places, to reduce the dataset size (default, 2; set to negative to disable rounding)
    --h3resolution        Spatial resolution of the H3 geospatial indexing system; e.g., 4
    --dbscan              Logical, remove spatial outliers with density-based clustering; e.g., "false"
    --dbscannoccurrences  Minimum species occurrence to perform DBSCAN; e.g., 30
    --dbscanepsilon       DBSCAN parameter epsilon, km; e.g., "700"
    --dbscanminpts        DBSCAN min number of points; e.g., "3"
    --terrestrial         Land polygon for removal of non-terrestrial occurrences; e.g., "pipeline_data/Land_Buffered_025_dgr.RData"
    --rmcountrycentroids  Polygons with country and province centroids; e.g., "pipeline_data/CC_CountryCentroids_buf_1000m.RData"
    --rmcountrycapitals   Polygons with country capitals; e.g., "pipeline_data/CC_Capitals_buf_10000m.RData"
    --rminstitutions      Polygons with biological institutions and museums; e.g., "pipeline_data/CC_Institutions_buf_100m.RData"
    --rmurban             Polygons with urban areas; e.g., "pipeline_data/CC_Urban.RData"
    --deriveddataset      Prepare a list of DOIs for the datasets used (default, true)
    --indices             Comma-separated list of diversity and endemism indices; e.g., "calc_richness,calc_pd,calc_pe"
    --randname            Randomisation scheme type; e.g., "rand_structured"
    --iterations          Number of randomisation iterations; e.g., 1000
    --biodiversethreads   Number of Biodiverse threads; e.g., 10
    --randconstrain       Polygons to perform spatially constrained randomization (GeoPackage format)

Leaflet interactive visualization:
    --leaflet_var         Variables to plot; e.g., "RICHNESS_ALL,PD,SES_PD,PD_P,ENDW_WE,SES_ENDW_WE,PE_WE,SES_PE_WE,CANAPE,Redundancy"
    --leaflet_canapesuper Include the `superendemism` class in CANAPE results (default, false)
    --leaflet_color       Color scheme for continuous variables (default, "RdYlBu")
    --leaflet_palette     Color palette for continuous variables (default, "quantile")
    --leaflet_bins        Number of color bins for continuous variables (default, 5)
    --leaflet_sescolor    Color scheme for standardized effect sizes, SES (default, "threat"; alternative - "hotspots")
    --leaflet_redundancy  Redundancy threshold for hiding the grid cells with low number of records (default, 0 = display all grid cells)

Static visualization:
    --plotvar             Variables to plot (multiple comma-separated values allowed); e.g., "RICHNESS_ALL,PD,PD_P"
    --plottype            Plot type
    --plotformat          Plot format (jpg,pdf,png)
    --plotwidth           Plot width (default, 18 inches)
    --plotheight          Plot height (default, 18 inches)
    --plotunits           Plot size units (in,cm)
    --world               World basemap

NEXTFLOW-SPECIFIC:
    -qs                   Queue size (max number of processes that can be executed in parallel); e.g., 8
    -w                    Path to the working directory to store intermediate results (default, "./work")
    -resume               Execute the pipeline using the cached results. Useful to continue executions that were stopped by an error
    -profile              Configuration profile; e.g., "docker"
    -params-file          Parameter file in YAML or JSON format (e.g., "Mammals.yaml")
    -c / -C               Configuration file (`-C` ignores all default values) (default, "nextflow.config")
```
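To make the record-level filters more concrete, here is a toy sketch that applies four of the options above to a small comma-separated table: `--minyear`, `--coorduncertainty`, `--coorduncertaintyexclude`, and `--roundcoords`. It is not the pipeline's own code, which operates on GBIF Parquet files. The column layout, the sample records, and the treatment of thresholds as inclusive are assumptions made for illustration.

```bash
#!/usr/bin/env bash
set -eo pipefail

# Toy occurrence table (hypothetical records); uncertainty is in meters
cat > occurrences.csv <<'EOF'
species,lat,lon,year,uncertainty
Felis silvestris,50.1234,8.6789,2005,250
Felis silvestris,50.1249,8.6812,2010,100
Canis lupus,52.5200,13.4050,1930,500
Canis lupus,51.0500,13.7400,2015,999
Lynx lynx,49.0000,12.1000,2018,25000
EOF

MINYEAR=1945                   # default of --minyear
MAX_UNCERTAINTY=10000          # default of --coorduncertainty
EXCLUDE="301,3036,999,9999"    # default of --coorduncertaintyexclude
DECIMALS=2                     # default of --roundcoords (negative disables rounding)

awk -F',' -v OFS=',' \
    -v minyear="${MINYEAR}" -v maxunc="${MAX_UNCERTAINTY}" \
    -v exclude="${EXCLUDE}" -v dec="${DECIMALS}" '
  BEGIN {
    # Turn the comma-separated black list into a lookup table
    n = split(exclude, vals, ",")
    for (i = 1; i <= n; i++) bad[vals[i]] = 1
    fmt = "%." dec "f"
  }
  NR == 1      { print; next }   # keep the header unchanged
  $4 < minyear { next }          # record is older than the minimum year
  $5 > maxunc  { next }          # coordinate uncertainty is too large
  ($5 in bad)  { next }          # uncertainty value is on the black list
  {
    # Round latitude and longitude unless rounding is disabled
    if (dec >= 0) { $2 = sprintf(fmt, $2); $3 = sprintf(fmt, $3) }
    print
  }
' occurrences.csv > filtered.csv

cat filtered.csv
```

With these sample records, only the two `Felis silvestris` rows pass the filters, and after rounding they share the same coordinates, which illustrates how rounding can shrink a dataset once duplicates are merged.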

Source code for the documentation can be found at [https://github.com/PhyloNext/phylonext.github.io](https://github.com/PhyloNext/phylonext.github.io).

## Credits

The PhyloNext pipeline was developed by [Vladimir Mikryukov](https://github.com/vmikk) and [Kessy Abarenkov](https://github.com/kessya).

The [Biodiverse program](https://shawnlaffan.github.io/biodiverse/) and the Perl scripts accompanying PhyloNext were written by [Shawn Laffan](https://github.com/shawnlaffan) (Laffan et al., 2010).

Scripts for getting an induced subtree from the Open Tree of Life were developed by [Emily Jane McTavish](https://github.com/snacktavish).

We thank the following people for their extensive assistance in the development of this pipeline: Joe Miller, Shawn Laffan, Tim Robertson, Emily Jane McTavish, John Waller, Thomas Stjernegaard Jeppesen, and Matthew Blissett.

We are also very grateful to [Manuele Simi](https://github.com/manuelesimi) and the [nf-core](https://nf-co.re/) community for helpful advice on the development of this pipeline.

For more details, please see the [Acknowledgments section](https://phylonext.github.io/acknowledgements/) in the docs.

## Funding

This work is supported by the grant “PD (Phylogenetic Diversity) in the Cloud”, awarded to GBIF as supplemental funds from the GEO-Microsoft Planetary Computer Programme.

## Contributions and Support

If you would like to contribute to this pipeline, please see the [contributing guidelines](CONTRIBUTING.md).

For further information or help, don't hesitate to file an [issue on GitHub](https://github.com/vmikk/PhyloNext/issues).

## Future plans

- Add support for [`Shifter`](https://nersc.gitlab.io/development/shifter/how-to-use/) or [`Charliecloud`](https://hpc.github.io/charliecloud/) containers.

## Citations

If you use the PhyloNext pipeline for your analysis, please cite it as:

Mikryukov V, Abarenkov K, Laffan S, Robertson T, McTavish EJ, Jeppesen TS, Waller J, Blissett M, Kõljalg U, Miller JT (2024). PhyloNext: A pipeline for phylogenetic diversity analysis of GBIF-mediated data. BMC Ecology and Evolution, 24(1), 76. [DOI:10.1186/s12862-024-02256-9](https://bmcecolevol.biomedcentral.com/articles/10.1186/s12862-024-02256-9)

Laffan SW, Lubarsky E, Rosauer DF (2010) Biodiverse, a tool for the spatial analysis of biological and related diversity. Ecography, 33: 643-647. [DOI: 10.1111/j.1600-0587.2010.06237.x](https://onlinelibrary.wiley.com/doi/10.1111/j.1600-0587.2010.06237.x)

An extensive list of references for the tools used by the pipeline can be found in the [Citations](https://phylonext.github.io/citations/) section in the documentation.