https://github.com/vinuesa/get_phylomarkers
A pipeline to select optimal markers for microbial phylogenomics and species tree estimation using the multispecies coalescent and concatenation approaches
- Host: GitHub
- URL: https://github.com/vinuesa/get_phylomarkers
- Owner: vinuesa
- License: other
- Created: 2017-05-12T00:09:24.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2025-12-05T13:14:30.000Z (6 months ago)
- Last Synced: 2025-12-08T23:50:33.265Z (6 months ago)
- Topics: bash, clock-hypothesis, codon-alignment, core-genome, genomics, markers, maximum-likelihood, microbiology, orthologues, perl, phylogenetic-trees, phylogenetics, phylogenomics, phylogenomics-pipeline, pipeline, population-genetics, r, recombination, species-delimitation, species-trees
- Language: Perl
- Homepage:
- Size: 115 MB
- Stars: 59
- Watchers: 3
- Forks: 9
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: CHANGES.txt
- License: LICENSE.txt
# GET_PHYLOMARKERS
**GET_PHYLOMARKERS** ([Vinuesa et al., 2018](https://www.frontiersin.org/articles/10.3389/fmicb.2018.00771/full)) **is an open-source software package for selecting optimal markers for microbial phylogenomics and species tree estimation**. It implements a [**bioinformatics pipeline**](https://www.frontiersin.org/files/Articles/351767/fmicb-09-00771-HTML/image_m/fmicb-09-00771-g001.jpg) to filter core-genome gene clusters computed by the companion package [**GET_HOMOLOGUES**](https://github.com/eead-csic-compbio/get_homologues), and selects only those with optimal attributes for phylogenetic inference using maximum likelihood (ML). The multiple sequence alignments of the filtered loci are concatenated into a supermatrix to estimate a species tree using the state-of-the-art fast ML tree searching algorithms [FastTree](http://www.microbesonline.org/fasttree) or [IQ-TREE](http://www.iqtree.org). It also estimates ML and parsimony trees from the pan-genome matrix, including unsupervised learning methods. We have also tested it successfully with **plant coding sequences** (details [here](https://github.com/vinuesa/get_phylomarkers?tab=readme-ov-file#manual-and-tutorials)).
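To make the supermatrix idea concrete, here is a minimal Python sketch of concatenating per-locus alignments, assuming every locus contains the same genomes (as core-genome clusters do) and that each locus is already aligned. The function name `concatenate_alignments`, the toy genome names, and the sequences are hypothetical illustrations; this is not the code used by the pipeline.

```python
def concatenate_alignments(alignments):
    """Concatenate per-locus alignments into a supermatrix.

    alignments: list of dicts mapping genome name -> aligned sequence.
    Returns (supermatrix, partitions), where partitions holds the
    1-based, inclusive (start, end) coordinates of each locus.
    """
    if not alignments:
        return {}, []
    taxa = sorted(alignments[0])
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    offset = 0
    for locus in alignments:
        # Core-genome loci must share exactly the same set of genomes.
        if sorted(locus) != taxa:
            raise ValueError("every locus must contain the same genomes")
        lengths = {len(seq) for seq in locus.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within a locus must have equal length")
        length = lengths.pop()
        for taxon in taxa:
            pieces[taxon].append(locus[taxon])
        partitions.append((offset + 1, offset + length))
        offset += length
    supermatrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return supermatrix, partitions


# Hypothetical toy data: two short aligned loci for three genomes.
toy_loci = [
    {"genomeA": "ATGAAA", "genomeB": "ATGAAG", "genomeC": "ATGCAA"},
    {"genomeA": "ATGTTT", "genomeB": "ATGTTC", "genomeC": "ATG---"},
]
supermatrix, partitions = concatenate_alignments(toy_loci)
for name, seq in supermatrix.items():
    print(name, seq)
print(partitions)
```

Keeping the partition coordinates alongside the concatenated sequences is a common design choice, since ML programs can use them to fit separate models per locus.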
## GET_PHYLOMARKERS 2
Starting with **release 2.0.0** (2022-11-20), [GET_PHYLOMARKERS](https://github.com/vinuesa/get_phylomarkers) also computes a **concatenation-independent species tree** from the ML gene trees estimated from top-scoring alignments using [**ASTRAL-III**](https://github.com/smirarab/ASTRAL).
**Release 2.1.0 (2024-03-31)** introduced **maximal matched-pairs tests**, as implemented in IQ-TREE, to assess whether the data violate the stationarity, reversibility, and homogeneity (**SRH**) assumptions made by maximum-likelihood phylogenetic models. Additionally, [**ASTRAL-IV**](https://github.com/chaoszhang/ASTER/blob/master/tutorial/astral4.md) is used to estimate the species tree directly from the filtered ML source gene trees, computing terminal and internal branch lengths in substitutions-per-site units.
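As a rough illustration of the idea behind matched-pairs tests, the sketch below computes Bowker's test statistic of symmetry for a single pair of aligned nucleotide sequences. It is a simplified building block under stated assumptions: only unambiguous A, C, G, T sites are counted, the p-value (from a chi-square distribution with the returned degrees of freedom) is omitted, and how IQ-TREE selects the sequence pair and combines the related tests is not reproduced here. The function name `bowker_symmetry_statistic` is hypothetical.

```python
from itertools import combinations

NUCLEOTIDES = "ACGT"


def bowker_symmetry_statistic(seq1, seq2, alphabet=NUCLEOTIDES):
    """Return (statistic, degrees_of_freedom) of Bowker's symmetry test.

    The divergence matrix counts[i][j] holds the number of sites with
    state i in seq1 and state j in seq2. Under symmetry, counts[i][j]
    and counts[j][i] are expected to be equal.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    index = {state: i for i, state in enumerate(alphabet)}
    k = len(alphabet)
    counts = [[0] * k for _ in range(k)]
    for a, b in zip(seq1.upper(), seq2.upper()):
        # Skip gaps and ambiguous characters (a simplifying assumption).
        if a in index and b in index:
            counts[index[a]][index[b]] += 1
    statistic = 0.0
    degrees_of_freedom = 0
    for i, j in combinations(range(k), 2):
        total = counts[i][j] + counts[j][i]
        if total > 0:
            statistic += (counts[i][j] - counts[j][i]) ** 2 / total
            degrees_of_freedom += 1
    return statistic, degrees_of_freedom


# Hypothetical toy pair: four A->C sites versus one C->A site.
stat, df = bowker_symmetry_statistic("AAAAAACCGT", "CCCCAAACGT")
print(stat, df)
```

A large statistic relative to its degrees of freedom suggests the two sequences did not evolve under the same stationary, reversible, and homogeneous process.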
[**Release 2.2.0** (v2.2.0, 2024-04-14)](https://github.com/vinuesa/get_phylomarkers/releases/tag/2.2.0) introduced new features, most notably:
- Significant extension of run mode 2 (run_get_phylomarkers_pipeline.sh -R 2) for **population genetics** (multiple sequences from the same species), computing a **population tree from a SNP matrix** of top-scoring, neutral alignments.
- Reporting of a detailed numeric overview of the different filtering steps
- run_get_phylomarkers_pipeline.sh now also calls the C binary [**WEIGHTED-ASTRAL**](https://github.com/chaoszhang/ASTER) to estimate a species tree using as input the filtered gene trees estimated by [iqtree2](http://www.iqtree.org/) or [FastTree](http://www.microbesonline.org/fasttree/)
- Complex protein mixture models are used for concatenated protein alignments
[**Release 2.2.1 (2024-04-16)**](https://github.com/vinuesa/get_phylomarkers/releases/tag/v2.2.1) includes a static binary of [**snp-sites**](https://github.com/sanger-pathogens/snp-sites), which is called by run_get_phylomarkers_pipeline.sh (version v2.8.1.0_2024-04-15 or later) under run mode 2 (-R 2, population genetics) to generate **SNP matrices in FASTA and [VCF](https://samtools.github.io/hts-specs/VCFv4.2.pdf) formats** from the concatenated alignments of filtered, highly informative, and neutral loci. The FASTA SNP matrix is used for estimating an **ML population tree**. Thanks to Alfredo Hernández @ccg_unam for compiling snp-sites-static.
- This release was used to build the latest [**Docker GET_PHYLOMARKERS image (20240418)**](https://hub.docker.com/r/vinuesa/get_phylomarkers), ready to pull from Docker Hub (docker pull vinuesa/get_phylomarkers:latest). This image is significantly lighter (2.0GB) than the previous one (v20240414, 2.09GB) because several unnecessary R packages were removed. On [Docker Hub](https://hub.docker.com/r/vinuesa/get_phylomarkers), you will find **detailed instructions on installing and configuring the Docker client** on your machine, pulling the latest image, and running the containerized instance of the [GET_PHYLOMARKERS](https://github.com/vinuesa/get_phylomarkers) pipeline.
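
To illustrate what a SNP matrix derived from a concatenated alignment looks like (the task handled by snp-sites under run mode 2), here is a minimal Python sketch. It is a simplification and not a reimplementation of snp-sites: a column is kept when at least two different unambiguous bases (A, C, G, T) occur in it, and the handling of gaps and ambiguous characters is an assumption made for this example. The function name `extract_snp_columns` and the strain names are hypothetical.

```python
def extract_snp_columns(alignment):
    """Return (snp_matrix, snp_positions) for an aligned set of sequences.

    alignment: dict mapping strain name -> aligned sequence.
    snp_positions are 1-based columns of the input alignment.
    """
    names = list(alignment)
    seqs = [alignment[name].upper() for name in names]
    if len({len(seq) for seq in seqs}) > 1:
        raise ValueError("sequences must be aligned to the same length")
    length = len(seqs[0]) if seqs else 0
    snp_positions = []
    for col in range(length):
        # Only unambiguous bases count towards variability (assumption).
        bases = {seq[col] for seq in seqs} & set("ACGT")
        if len(bases) >= 2:
            snp_positions.append(col + 1)
    snp_matrix = {
        name: "".join(seq[pos - 1] for pos in snp_positions)
        for name, seq in zip(names, seqs)
    }
    return snp_matrix, snp_positions


# Hypothetical toy alignment of three strains of the same species.
toy_alignment = {
    "strain1": "ATGAAAATGTTT",
    "strain2": "ATGAAGATGTTC",
    "strain3": "ATGCAAATG---",
}
snp_matrix, snp_positions = extract_snp_columns(toy_alignment)
print(snp_positions)
for name, snps in snp_matrix.items():
    print(name, snps)
```

Because invariant columns are discarded, the resulting matrix is much smaller than the source alignment, which is convenient for closely related genomes of a single species.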
[**GET_PHYLOMARKERS**](https://github.com/vinuesa/get_phylomarkers) has a detailed [**manual**](https://vinuesa.github.io/get_phylomarkers/#get_phylomarkers-manual) and step-by-step [**tutorials**](https://vinuesa.github.io/get_phylomarkers/#get_phylomarkers-tutorial) that document the software and help the user get up and running quickly. For convenience, [**html**](https://vinuesa.github.io/get_phylomarkers/) and [**markdown**](https://github.com/vinuesa/get_phylomarkers/blob/master/docs/GET_PHYLOMARKERS_manual.md) versions of the documentation are available.
## Installation, dependencies and Docker image
For detailed instructions and dependencies please check [**INSTALL.md**](INSTALL.md).
A [**GET_PHYLOMARKERS Docker image**](https://hub.docker.com/r/vinuesa/get_phylomarkers) is available, as well as an [**image bundling GET_PHYLOMARKERS + GET_HOMOLOGUES**](https://github.com/eead-csic-compbio/get_homologues), ready to use. Detailed instructions for setting up the Docker environment are provided in [**INSTALL.md**](INSTALL.md). How to run container instances with the test sequences distributed with **GET_PHYLOMARKERS** is described in the [**tutorial**](https://vinuesa.github.io/get_phylomarkers/#get_phylomarkers-tutorial).
## Aim
**GET_PHYLOMARKERS** ([Vinuesa et al. 2018](https://www.frontiersin.org/articles/10.3389/fmicb.2018.00771/full)) implements a series of sequential filters (detailed below) to select markers with optimal attributes for phylogenomic inference from the homologous gene clusters produced by [GET_HOMOLOGUES](https://github.com/eead-csic-compbio/get_homologues). It estimates **gene trees** and **species trees** under the **maximum likelihood (ML) optimality criterion** using state-of-the-art fast ML tree searching algorithms. The species tree is estimated from the supermatrix of concatenated, top-scoring alignments that passed the quality filters outlined in the figures below and explained in detail in the [**manual**](https://vinuesa.github.io/get_phylomarkers/#get_phylomarkers-manual) and [publication](https://www.frontiersin.org/articles/10.3389/fmicb.2018.00771/full).
 
Figure 1A. Simplified flow-chart of the GET_PHYLOMARKERS pipeline showing only those parts used and described in this work. The left branch, starting at the top of the diagram, is fully under control of the master script run_get_phylomarkers_pipeline.sh. The names of the worker scripts called by the master program are indicated at the relevant points along the flow, as detailed in the [**manual**](https://vinuesa.github.io/get_phylomarkers/#get_phylomarkers-manual). The image corresponds to [**Fig. 1 of Vinuesa et al. 2018**](https://www.frontiersin.org/files/Articles/351767/fmicb-09-00771-HTML/image_m/fmicb-09-00771-g001.jpg).
Figure 1B. Combined filtering actions performed by GET_HOMOLOGUES and GET_PHYLOMARKERS to select top-ranking phylogenetic markers to be concatenated for phylogenomic analyses, and benchmark results of the performance of the FastTree (FT) and IQ-TREE (IQT) maximum-likelihood (ML) phylogeny inference programs. The image corresponds to [**Fig. 3 of Vinuesa et al. 2018**](https://www.frontiersin.org/files/Articles/351767/fmicb-09-00771-HTML/image_m/fmicb-09-00771-g003.jpg).
**GET_HOMOLOGUES** is a genome-analysis software package for microbial pan-genomics and comparative genomics originally described in the following publications:
- [Contreras-Moreira and Vinuesa, AEM 2013](https://www.ncbi.nlm.nih.gov/pubmed/24096415)
- [Vinuesa and Contreras-Moreira, Meth. Mol. Biol. 2015](https://www.ncbi.nlm.nih.gov/pubmed/25343868)
More recently we developed [GET_HOMOLOGUES-EST](https://github.com/eead-csic-compbio/get_homologues),
which can be used to cluster eukaryotic genes and transcripts, as described in [Contreras-Moreira et al., Front. Plant Sci. 2017](http://journal.frontiersin.org/article/10.3389/fpls.2017.00184/full).
If GET_HOMOLOGUES-EST is fed both .fna and .faa files of CDS sequences, it will produce **identical output to that of GET_HOMOLOGUES, which can therefore be analyzed with GET_PHYLOMARKERS all the same**.
* * *
**GET_PHYLOMARKERS** is primarily tailored towards selecting CDSs (gene markers) to infer DNA-level phylogenies of different species of the same genus or family. It can also select optimal markers for population genetics, when the source genomes belong to the same species ([Vinuesa et al. 2018](https://www.frontiersin.org/articles/10.3389/fmicb.2018.00771/full)).
For more divergent genome sequences, classified in different genera, families, orders or higher taxa,
the pipeline should be run using protein instead of DNA sequences.
 
Figure 2A. Best maximum-likelihood **core-genome phylogeny** for the genus Stenotrophomonas found in the IQ-TREE search, based on the supermatrix obtained by concatenation of 55 top-ranking alignments. The image corresponds to [**Fig. 5 of Vinuesa et al. 2018**](https://www.frontiersin.org/files/Articles/351767/fmicb-09-00771-HTML/image_m/fmicb-09-00771-g005.jpg).
Figure 2B. Maximum-likelihood **pan-genome phylogeny** estimated with IQ-TREE from the consensus pan-genome clusters displayed in the Venn diagram. Clades of lineages belonging to the *S. maltophilia* complex are collapsed and are labeled as in Figure 2A. Numbers on the internal nodes represent the approximate Bayesian posterior probability/UFBoot2 bipartition support values (see methods). The tabular inset shows the results of fitting either the binary (GTR2) or morphological (MK) models implemented in IQ-TREE, indicating that the former has an overwhelmingly better fit. The scale bar represents the number of expected substitutions per site under the binary GTR2+F0+R4 substitution model. The image corresponds to [**Fig. 6 of Vinuesa et al. 2018**](https://www.frontiersin.org/files/Articles/351767/fmicb-09-00771-HTML/image_m/fmicb-09-00771-g006.jpg).
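
As a hedged illustration of the kind of input a pan-genome phylogeny is built from, the sketch below encodes cluster membership as a binary presence/absence matrix, with one two-state character per cluster. The function name `presence_absence_matrix`, the cluster identifiers, and the genome names are hypothetical; the actual matrix files are produced by GET_HOMOLOGUES and their format is not reproduced here.

```python
def presence_absence_matrix(clusters, genomes):
    """Encode cluster membership as binary strings, one per genome.

    clusters: dict mapping cluster id -> set of genomes that have at
              least one member sequence in that cluster.
    genomes:  list of genome names, defining the row order.
    Returns (cluster_ids, matrix), where matrix maps each genome to a
    string of '1' (present) and '0' (absent), ordered as cluster_ids.
    """
    cluster_ids = sorted(clusters)
    matrix = {
        genome: "".join(
            "1" if genome in clusters[cid] else "0" for cid in cluster_ids
        )
        for genome in genomes
    }
    return cluster_ids, matrix


# Hypothetical toy pan-genome: one core, one shell, one unique cluster.
toy_genomes = ["genomeA", "genomeB", "genomeC"]
toy_clusters = {
    "cluster1": {"genomeA", "genomeB", "genomeC"},
    "cluster2": {"genomeA", "genomeB"},
    "cluster3": {"genomeC"},
}
cluster_ids, pa_matrix = presence_absence_matrix(toy_clusters, toy_genomes)
for genome in toy_genomes:
    print(genome, pa_matrix[genome])
```

Clusters present in every genome, such as `cluster1` above, produce constant columns and therefore carry no signal about relationships among the genomes.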
* * *
## Manual and tutorials
Please follow the links for a detailed [**manual**](https://vinuesa.github.io/get_phylomarkers/#get_phylomarkers-manual) and [**tutorials**](https://vinuesa.github.io/get_phylomarkers/#get_phylomarkers-tutorial), including a [**graphical flowchart**](https://vinuesa.github.io/get_phylomarkers/#brief-presentation-and-graphical-overview-of-the-pipeline) of the pipeline and explanations of the implementation details.
See also our [**plant tutorial**](http://eead-csic-compbio.github.io/get_homologues/plant_pangenome/protocol.html#downstream-phylogenomic-analyses).
## Citation
Pablo Vinuesa, Luz-Edith Ochoa-Sanchez and Bruno Contreras-Moreira (2018).
GET_PHYLOMARKERS, a software package to select optimal orthologous clusters for phylogenomics
and inferring pan-genome phylogenies, used for a critical geno-taxonomic revision of the
genus *Stenotrophomonas*. [Front. Microbiol. | doi: 10.3389/fmicb.2018.00771](https://doi.org/10.3389/fmicb.2018.00771)
Published in the Research Topic on "Microbial Taxonomy, Phylogeny and Biodiversity"
http://journal.frontiersin.org/researchtopic/5493/microbial-taxonomy-phylogeny-and-biodiversity
## Code
- Source code is freely available from [GitHub](https://github.com/vinuesa/get_phylomarkers) and released under a [GPLv3-like license](./LICENSE.txt).
- Docker images ready to pull:
  - [GET_PHYLOMARKERS Docker image](https://hub.docker.com/repository/docker/vinuesa/get_phylomarkers)
  - [GET_HOMOLOGUES+GET_PHYLOMARKERS Docker image](https://hub.docker.com/r/csicunam/get_homologues)
## Developers
The code is developed and maintained by [Pablo Vinuesa](https://www.ccg.unam.mx/~vinuesa/)
at [CCG-UNAM, Mexico](https://www.ccg.unam.mx) and
Bruno Contreras-Moreira at [EEAD-CSIC, Spain](https://www.eead.csic.es/compbio).
## Acknowledgements
We thank Alfredo J. Hernández and Víctor del Moral at CCG-UNAM for technical support with server administration.
### Funding
We gratefully acknowledge the funding provided over the years by [DGAPA-PAPIIT/UNAM](https://dgapa.unam.mx/index.php/impulso-a-la-investigacion/papiit) (grants IN201806-2, IN211814, IN206318, and IN216424) and [CONAHCyT-Mexico](https://conahcyt.mx/) (grants P1-60071, 179133 and A1-S-11242) to [Pablo Vinuesa](https://www.ccg.unam.mx/~vinuesa/), as well as by the Fundación ARAID, Consejo Superior de Investigaciones Científicas (grant 200720I038) and Spanish MINECO (AGL2013-48756-R) to Bruno Contreras-Moreira.