# PHIST
[![Bioconda downloads](https://img.shields.io/conda/dn/bioconda/phist.svg?style=flag&label=Bioconda%20downloads)](https://anaconda.org/bioconda/phist)
[![C/C++ CI](https://github.com/refresh-bio/PHIST/workflows/C/C++%20CI/badge.svg)](https://github.com/refresh-bio/PHIST/actions)

**Phage-Host Interaction Search Tool**

A tool to predict prokaryotic hosts for phage (meta)genomic sequences. PHIST links viruses to hosts based on the number of *k*-mers shared between their sequences.
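To make the idea of "shared *k*-mers" concrete, here is a minimal Python sketch that counts the distinct *k*-mers two sequences have in common. It is an illustration only, not PHIST's implementation (PHIST delegates this work to Kmer-db). The function names are hypothetical, and treating a *k*-mer and its reverse complement as the same key (canonical *k*-mers) is an assumption of this sketch.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def canonical_kmers(seq, k):
    """Collect the distinct canonical k-mers of seq.

    The canonical form is the lexicographically smaller of a k-mer and its
    reverse complement, so both strands map to the same key (an assumption
    of this sketch).
    """
    seq = seq.upper()
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip windows with ambiguous bases
            kmers.add(min(kmer, reverse_complement(kmer)))
    return kmers


def shared_kmer_count(virus, host, k=25):
    """Number of distinct canonical k-mers present in both sequences."""
    return len(canonical_kmers(virus, k) & canonical_kmers(host, k))


if __name__ == "__main__":
    # The toy host contains the reverse complement of the toy virus.
    print(shared_kmer_count("ACGGTA", "TTACCGTT", k=4))  # 3
```

A set-based count like this is only practical for small inputs; it is meant to show what is being counted, not how to count it at scale.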

## Quick start
```bash
git clone --recurse-submodules https://github.com/refresh-bio/PHIST

cd PHIST
make

./phist.py ./example/virus ./example/host ./out/

```

## Installation

PHIST uses [Kmer-db](https://github.com/refresh-bio/kmer-db) as a submodule, so the repository must be cloned recursively:
```
git clone --recurse-submodules https://github.com/refresh-bio/PHIST
```
Under Linux/OS X, the package can be built by running `make` in the project directory (tested with G++ 5.3):
```
cd PHIST
make
```
Under Windows, one has to build the Visual Studio 2015 solutions in the *kmer-db* and *utils* subdirectories (use the Release 64-bit configuration, as the Python script depends on the default VS output directory structure).

## Usage

PHIST takes as input genomic sequences of viruses and candidate hosts in FASTA files (gzipped or not). Virus genomes may be provided in a single FASTA file or in a directory containing multiple FASTA files (one genome per file). Candidate host genomes should be stored individually in a directory (one genome per FASTA file) (see [example](./example/)).

```
./phist.py [options] <virus_path> <host_dir> <out_dir>
```

Positional arguments:
* `virus_path` Input FASTA file or directory with files (plain or gzip)
* `host_dir` Input directory w/ host FASTA files (plain or gzip)
* `out_dir` Output directory (will be created if it does not exist)

Options:
* `-k` *k*-mer length (default: 25, max: 30)
* `-t` Number of threads (default: number of cores)
* `-h, --help` Show this help message and exit
* `--keep_temp` Keep temporary kmer-db files (default: False)
* `--version` Show tool's version number and exit

### Usage example

```
./phist.py example/virus/ example/host/ out/
```

```
./phist.py example/virus_multifasta.fna example/host/ out/
```
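When PHIST is driven from another Python script, the command line can be assembled programmatically. The sketch below uses only the options documented above; `build_phist_command` and `run_phist` are hypothetical helper names, and the script location `./phist.py` assumes the current directory is the repository root.

```python
import subprocess


def build_phist_command(virus_path, host_dir, out_dir, k=25, threads=None,
                        keep_temp=False, script="./phist.py"):
    """Assemble the argument list for one PHIST run (hypothetical helper)."""
    if not 0 < k <= 30:
        raise ValueError("k must be positive and at most 30 (documented maximum)")
    cmd = [script, "-k", str(k)]
    if threads is not None:  # otherwise PHIST defaults to the number of cores
        cmd += ["-t", str(threads)]
    if keep_temp:
        cmd.append("--keep_temp")
    # Positional arguments: viruses, candidate hosts, output directory.
    cmd += [str(virus_path), str(host_dir), str(out_dir)]
    return cmd


def run_phist(*args, **kwargs):
    """Run PHIST and raise CalledProcessError on a non-zero exit status."""
    subprocess.run(build_phist_command(*args, **kwargs), check=True)


if __name__ == "__main__":
    # Only print the command here; call run_phist(...) to execute it.
    print(" ".join(build_phist_command("example/virus/", "example/host/", "out/")))
```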

## Output format

PHIST outputs two CSV files: one containing a table of common *k*-mers between phages and hosts, and another containing the virus-host predictions.

### Common *k*-mers table

The [common_kmers.csv](./example/common_kmers.csv) file stores the numbers of common *k*-mers between phages (in columns) and hosts (in rows) in a sparse form. Specifically, zeros are omitted, while non-zero *k*-mer counts are represented as pairs (*column_number* : *value*) with 1-based column indexing. Thus, rows may have different numbers of elements, e.g.:

| | | | | | |
| :---: | :---: | :---: | :---: | :---: | :---: |
| kmer-length: *k* fraction: *f* | phages | *φ1* | *φ2* | ... | *φn* |
| hosts | total-kmers | \|*φ1*\| | \|*φ2*\| | ... | \|*φn*\| |
| *h1* | \|*h1*\| | *i11* : \|*h1 ∩ φi11*\| | *i12* : \|*h1 ∩ φi12*\| | | |
| *h2* | \|*h2*\| | *i21* : \|*h2 ∩ φi21*\| | *i22* : \|*h2 ∩ φi22*\| | *i23* : \|*h2 ∩ φi23*\| | |
| *h3* | \|*h3*\| | | | | |
| ... | ... | ... | | | |
| *hm* | \|*hm*\| | *im1* : \|*hm ∩ φim1*\| | | | |

where:
* *k* - k-mer length,
* *φ1*, *φ2*, ..., *φn* - phage names,
* *h1*, *h2*, ..., *hm* - host names,
* |*a*| - number of k-mers in sample *a*,
* |*a ∩ b*| - number of k-mers common for samples *a* and *b*.
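The sparse layout described above can be expanded into a per-host dictionary with a few lines of Python. This is a sketch under stated assumptions: the file is comma-separated, the first two rows are the two header rows shown in the table, pairs are written as `column:value`, and column 1 refers to the first phage in the header. The sample data and the function name `read_common_kmers` are made up for illustration.

```python
import csv
import io

# Made-up sample in the layout described above (not real PHIST output).
SAMPLE = """\
kmer-length: 25 fraction: 1,phages,phiA,phiB,phiC
hosts,total-kmers,1000,1200,900
hostX,50000,1:12,3:7
hostY,48000,2:30
hostZ,51000
"""


def read_common_kmers(handle):
    """Return (phage_names, {host: {phage: shared k-mer count}})."""
    rows = csv.reader(handle)
    header = next(rows)   # k-mer length cell, "phages", then phage names
    next(rows)            # "hosts", "total-kmers", then per-phage totals
    phages = header[2:]
    table = {}
    for row in rows:
        if not row:
            continue
        host, cells = row[0], row[2:]  # row[1] is the host's total k-mer count
        counts = {}
        for cell in cells:
            if not cell.strip():
                continue  # tolerate empty trailing fields
            column, value = cell.split(":")
            # Assumption: column numbers are 1-based indices into the phage list.
            counts[phages[int(column) - 1]] = int(value)
        table[host] = counts
    return phages, table


phages, table = read_common_kmers(io.StringIO(SAMPLE))
print(table["hostX"])  # {'phiA': 12, 'phiC': 7}
```

Hosts without any shared *k*-mers (such as `hostZ` in the sample) end up with an empty dictionary, which mirrors the omitted zeros in the file.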

### Host predictions

The [predictions.csv](./example/predictions.csv) file assigns each phage to its most likely host (i.e., the one with the most *k*-mers in common). If multiple potential hosts share the same number of common *k*-mers, all of them are reported. Each virus-host interaction is followed by its *p*-value and its *p*-value adjusted for multiple comparisons.

| phage | host | common *k*-mers | *p*-value | adj. *p*-value |
| :---: | :---: | :---: | :---: | :---: |
| *φ1* | *host*(*φ1*) | \|*φ1* ∩ *host*(*φ1*)\| | ... | ... |
| *φ2* | *host*(*φ2*) | \|*φ2* ∩ *host*(*φ2*)\| | ... | ... |
| *φ3* | *host1*(*φ3*) | \|*φ3* ∩ *host1*(*φ3*)\| | ... | ... |
| *φ3* | *host2*(*φ3*) | \|*φ3* ∩ *host2*(*φ3*)\| | ... | ... |
| ... | ... | ... | ... | ... |
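The selection rule (highest number of common *k*-mers, with ties reported together) can be sketched as follows. The input is a hypothetical nested dictionary mapping each host to its per-phage counts; the function name `best_hosts` is an assumption, and the sketch does not compute *p*-values.

```python
def best_hosts(table):
    """Map each phage to (max shared k-mer count, hosts reaching that count)."""
    best = {}
    for host, counts in table.items():
        for phage, n in counts.items():
            if phage not in best or n > best[phage][0]:
                best[phage] = (n, [host])       # new leader for this phage
            elif n == best[phage][0]:
                best[phage][1].append(host)     # tie: keep every top host
    return best


# Made-up counts: phage p1 is tied between hosts h1 and h2.
example = {
    "h1": {"p1": 10, "p2": 5},
    "h2": {"p1": 10},
    "h3": {"p2": 8},
}
print(best_hosts(example))
# {'p1': (10, ['h1', 'h2']), 'p2': (8, ['h3'])}
```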

## Further analysis

The `utils/matcher` tool retrieves the list of all exact matches of length >= *k* for a given pair of phage and host FASTA sequences. Each match is reported with its coordinates in the viral genome and in the corresponding bacterial genome (a reversed interval in the latter indicates a reverse-complement match).

### Usage

```
./utils/matcher [options] <virus> <host> <output>
```

Positional arguments:
* `virus` virus FASTA file (gzipped or not),
* `host` host FASTA file (gzipped or not),
* `output` output CSV file

Options:
* `-k`, `--k` *k*-mer length (default: 25, max: 30; may differ from the one used in the PHIST execution)

### Example

```
./utils/matcher example/virus/NC_024123.fna example/host/NC_017548.fna shared_regions.csv
```

```
example/virus/NC_024123.fna,example/host/NC_017548.fna
NC_024123.1:52942-52968,NC_017548.1:1456873-1456847
NC_024123.1:52970-53009,NC_017548.1:1456845-1456806
NC_024123.1:53011-53102,NC_017548.1:1456804-1456713
NC_024123.1:53107-53147,NC_017548.1:1456708-1456668
NC_024123.1:53830-53854,NC_017548.1:2647971-2647947
NC_024123.1:54794-54827,NC_017548.1:679998-679965
```
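The output above can be post-processed with a short Python sketch such as the one below. It assumes the layout shown in the example (one header row with the two file paths, then one `virus_interval,host_interval` pair per line) and treats coordinates as inclusive, which is consistent with the shortest match in the example spanning exactly 25 positions. The function names are hypothetical.

```python
import csv
import io
import re

# Three rows copied from the example output above.
SAMPLE = """\
example/virus/NC_024123.fna,example/host/NC_017548.fna
NC_024123.1:52942-52968,NC_017548.1:1456873-1456847
NC_024123.1:53830-53854,NC_017548.1:2647971-2647947
NC_024123.1:54794-54827,NC_017548.1:679998-679965
"""

INTERVAL = re.compile(r"^(?P<seq>.+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_interval(text):
    """Split 'sequence_id:start-end' into (sequence_id, start, end)."""
    match = INTERVAL.match(text.strip())
    if match is None:
        raise ValueError(f"unexpected interval format: {text!r}")
    return match["seq"], int(match["start"]), int(match["end"])


def read_matches(handle):
    """Parse matcher output into a list of dictionaries, one per match."""
    rows = csv.reader(handle)
    next(rows)  # header row: virus and host FASTA paths
    matches = []
    for row in rows:
        if len(row) != 2:
            continue  # skip blank or malformed lines
        virus_id, v_start, v_end = parse_interval(row[0])
        host_id, h_start, h_end = parse_interval(row[1])
        matches.append({
            "virus": virus_id, "virus_start": v_start, "virus_end": v_end,
            "host": host_id, "host_start": h_start, "host_end": h_end,
            "length": v_end - v_start + 1,  # assumes inclusive coordinates
            # A reversed host interval marks a reverse-complement match.
            "strand": "-" if h_start > h_end else "+",
        })
    return matches


for m in read_matches(io.StringIO(SAMPLE)):
    print(m["virus_start"], m["length"], m["strand"])
```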

## Citing
Zielezinski A, Deorowicz S, Gudyś A. PHIST: fast and accurate prediction of prokaryotic hosts from metagenomic viral sequences, Bioinformatics. 2022, 38(5):1447-9. doi:[10.1093/bioinformatics/btab837](https://doi.org/10.1093/bioinformatics/btab837).