https://github.com/refresh-bio/phist

Phage-Host Interaction Search Tool

bacterial-genomes bioinformatics genomics host-prediction k-mers phages

Host: GitHub
URL: https://github.com/refresh-bio/phist
Owner: refresh-bio
License: gpl-3.0
Created: 2021-08-11T20:18:23.000Z (almost 5 years ago)
Default Branch: master
Last Pushed: 2024-07-08T14:49:09.000Z (about 2 years ago)
Last Synced: 2024-07-08T18:27:50.061Z (about 2 years ago)
Topics: bacterial-genomes, bioinformatics, genomics, host-prediction, k-mers, phages
Language: C++
Homepage:
Size: 12.2 MB
Stars: 26
Watchers: 4
Forks: 1
Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

awesome-virome - PHIST - K-mer based phage-host prediction. [source] [C++] (Host Prediction / RNA Virus Identification)

README

# PHIST

[![Bioconda downloads](https://img.shields.io/conda/dn/bioconda/phist.svg?style=flag&label=Bioconda%20downloads)](https://anaconda.org/bioconda/phist)

[![C/C++ CI](https://github.com/refresh-bio/PHIST/workflows/C/C++%20CI/badge.svg)](https://github.com/refresh-bio/PHIST/actions)

**Phage-Host Interaction Search Tool**

A tool to predict prokaryotic hosts for phage (meta)genomic sequences. PHIST links viruses to hosts based on the number of *k*-mers shared between their sequences.
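
To make the idea of "shared *k*-mers" concrete, here is a minimal Python sketch that counts the distinct *k*-mers two sequences have in common. It is not PHIST's implementation (PHIST delegates this work to Kmer-db). The function names are hypothetical, and treating a *k*-mer and its reverse complement as the same key is an assumption made for illustration.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def canonical_kmers(seq: str, k: int) -> set:
    """Collect the distinct canonical k-mers of seq.

    Assumption: a k-mer and its reverse complement are stored under one key
    (the lexicographically smaller of the two), so both strands are covered.
    """
    seq = seq.upper()
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip windows containing N or other symbols
            kmers.add(min(kmer, reverse_complement(kmer)))
    return kmers


def shared_kmer_count(phage: str, host: str, k: int) -> int:
    """Number of distinct canonical k-mers present in both sequences."""
    return len(canonical_kmers(phage, k) & canonical_kmers(host, k))
```

For example, `shared_kmer_count("ACGGT", "AACCGTT", 4)` finds two shared 4-mers because the second sequence contains the reverse complement of the first.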



## Quick start

```bash

git clone --recurse-submodules https://github.com/refresh-bio/PHIST

cd PHIST

make

./phist.py ./example/virus ./example/host ./out/

```

## Installation

PHIST uses [Kmer-db](https://github.com/refresh-bio/kmer-db) as a submodule, so the repository must be cloned recursively:

```

git clone --recurse-submodules https://github.com/refresh-bio/PHIST

```

Under Linux/OS X, the package can be built by running `make` in the project directory (tested with G++ 5.3):

```

cd PHIST

make

```

Under Windows, build the Visual Studio 2015 solutions in the *kmer-db* and *utils* subdirectories (use the Release 64-bit configuration, because the Python script depends on the default VS output directory structure).

## Usage

PHIST takes as input genomic sequences of viruses and candidate hosts in FASTA files (gzipped or not). Virus genomes may be provided in a single FASTA file or in a directory containing multiple FASTA files (one genome per file). Candidate host genomes should be stored individually in a directory (one genome per FASTA file) (see [example](./example/)).

```

./phist.py [options] <virus_path> <host_dir> <out_dir>

```

Positional arguments:

  * `virus_path`         Input FASTA file or directory with files (plain or gzip)

  * `host_dir`          Input directory with host FASTA files (plain or gzip)

  * `out_dir`           Output directory (will be created if it does not exist)

Options:

* `-k `   *k*-mer length (default: 25, max: 30)

* `-t `  Number of threads (default: number of cores)

* `-h, --help`             Show this help message and exit

* `--keep_temp`         Keep temporary kmer-db files [False]

* `--version`              Show tool's version number and exit

### Usage example

```

./phist.py example/virus/ example/host/ out/ 

```

```

./phist.py example/virus_multifasta.fna example/host/ out/

```
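
When PHIST is driven from a pipeline, it can help to assemble the command line programmatically. The sketch below only builds the argument list from the positional arguments and options documented above. `build_phist_command` is a hypothetical helper, not part of PHIST, and the lower bound of 1 on `k` is an assumption (the README states only the maximum of 30).

```python
def build_phist_command(virus_path, host_dir, out_dir, k=None, threads=None,
                        keep_temp=False, script="./phist.py"):
    """Return the phist.py invocation as a list suitable for subprocess.run."""
    # The README documents a maximum k of 30; the minimum of 1 is assumed here.
    if k is not None and not 1 <= k <= 30:
        raise ValueError("k must be between 1 and 30")
    cmd = [script]
    if k is not None:
        cmd += ["-k", str(k)]
    if threads is not None:
        cmd += ["-t", str(threads)]
    if keep_temp:
        cmd.append("--keep_temp")
    # Positional arguments follow the options, as in the usage line.
    cmd += [str(virus_path), str(host_dir), str(out_dir)]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the PHIST directory. Omitting `k` and `threads` leaves PHIST's own defaults in effect.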

## Output format

PHIST outputs two CSV files: a table of common *k*-mers between phages and hosts, and a file with virus-host predictions.

### Common *k*-mers table

The [common_kmers.csv](./example/common_kmers.csv) file stores the numbers of common *k*-mers between phages (in columns) and hosts (in rows) in a sparse form. Specifically, zeros are omitted, while non-zero *k*-mer counts are represented as pairs (*column_number* : *value*) with 1-based column indexing. Thus, rows may have different numbers of elements, e.g.:

| | | | | | |
| :---: | :---: | :---: | :---: | :---: | :---: |
| kmer-length: *k* fraction: *f* | phages | *φ₁* | *φ₂* | ... | *φ_n* |
| hosts | total-kmers | \|*φ₁*\| | \|*φ₂*\| | ... | \|*φ_n*\| |
| *h₁* | \|*h₁*\| | *i₁₁* : \|*h₁ ∩ φ_i₁₁*\| | *i₁₂* : \|*h₁ ∩ φ_i₁₂*\| | | |
| *h₂* | \|*h₂*\| | *i₂₁* : \|*h₂ ∩ φ_i₂₁*\| | *i₂₂* : \|*h₂ ∩ φ_i₂₂*\| | *i₂₃* : \|*h₂ ∩ φ_i₂₃*\| | |
| *h₃* | \|*h₃*\| | | | | |
| ... | ... | ... | | | |
| *h_m* | \|*h_m*\| | *i_m1* : \|*h_m ∩ φ_i_m1*\| | | | |

where:

* *k* - k-mer length,

* *φ₁*, *φ₂*,  ...,   *φ_n* - phage names,

* *h₁*, *h₂*,  ...,   *h_m* - host names,

* |*a*| - number of k-mers in sample *a*,

* |*a ∩ b*| - number of k-mers common for samples *a* and *b*.
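
The sparse layout described above can be read back into a nested dictionary with a few lines of Python. This is a sketch under stated assumptions: fields are comma-separated, each non-zero entry is written as `column:value`, and column 1 refers to the first phage in the header row. `parse_common_kmers` is a hypothetical helper, so check these assumptions against the example file before relying on it.

```python
import csv
import io


def parse_common_kmers(text):
    """Parse the sparse table into (phages, phage_totals, host_totals, counts).

    counts maps host name -> {phage name: number of common k-mers}.
    """
    rows = [[cell.strip() for cell in row] for row in csv.reader(io.StringIO(text))]
    rows = [row for row in rows if any(row)]  # drop empty lines

    # Row 1: "kmer-length: k fraction: f", "phages", then the phage names.
    phages = [name for name in rows[0][2:] if name]
    # Row 2: "hosts", "total-kmers", then the k-mer total of each phage.
    phage_totals = {p: int(v) for p, v in zip(phages, rows[1][2:])}

    host_totals, counts = {}, {}
    for row in rows[2:]:
        host = row[0]
        host_totals[host] = int(row[1])
        counts[host] = {}
        for cell in row[2:]:
            if not cell:
                continue
            col, value = cell.split(":")
            # Assumed: 1-based index into the list of phage columns.
            counts[host][phages[int(col) - 1]] = int(value)
    return phages, phage_totals, host_totals, counts
```

Hosts with no common *k*-mers end up with an empty dictionary, which mirrors the empty rows in the table.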

### Host predictions

The [predictions.csv](./example/predictions.csv) file assigns each phage to its most likely host (i.e., the one having the most *k*-mers in common). If multiple potential hosts share the same number of common *k*-mers, all are reported. Each virus-host interaction is followed by its *p*-value and its *p*-value adjusted for multiple comparisons.

| phage | host | common *k*-mers | *p*-value | adj. *p*-value |
| :---: | :---: | :---: | :---: | :---: |
| *φ₁* | *host*(*φ₁*) | \|*φ₁* ∩ *host*(*φ₁*)\| | ... | ... |
| *φ₂* | *host*(*φ₂*) | \|*φ₂* ∩ *host*(*φ₂*)\| | ... | ... |
| *φ₃* | *host₁*(*φ₃*) | \|*φ₃* ∩ *host₁*(*φ₃*)\| | ... | ... |
| *φ₃* | *host₂*(*φ₃*) | \|*φ₃* ∩ *host₂*(*φ₃*)\| | ... | ... |
| ... | ... | ... | ... | ... |
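
The selection rule stated above (take the host with the most common *k*-mers and report every host on a tie) can be sketched as follows. `best_hosts` is a hypothetical helper that works on a host-to-phage count mapping. It does not compute *p*-values, because the README does not describe how PHIST derives them.

```python
def best_hosts(counts):
    """Pick the top-scoring host(s) for every phage.

    counts maps host name -> {phage name: number of common k-mers}.
    Returns phage name -> (list of tied best hosts, common k-mer count).
    """
    best = {}
    for host, per_phage in counts.items():
        for phage, n in per_phage.items():
            top_hosts, top_n = best.get(phage, ([], 0))
            if n > top_n:
                best[phage] = ([host], n)   # new strict maximum
            elif n == top_n and n > 0:
                top_hosts.append(host)      # tie: keep every host at the maximum
    return best
```

A phage that shares no *k*-mers with any host does not appear in the result.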

## Further analysis

The `utils/matcher` tool retrieves the list of all exact matches of length >= *k* for a given pair of phage and host FASTA sequences. Each match is reported with its coordinates in the viral genome and in the corresponding bacterial genome (a reversed interval in the latter indicates a reverse-complement match).

### Usage

```

./utils/matcher [options] <virus> <host> <output>

```

Positional arguments:

  * `virus`             virus FASTA file (gzipped or not),

  * `host`              host FASTA file (gzipped or not),

  * `output`            output CSV file

Options:

* `-k --k `   *k*-mer length (default: 25, max: 30; may differ from the one used in the PHIST execution)

### Example

```

./utils/matcher example/virus/NC_024123.fna example/host/NC_017548.fna shared_regions.csv

```

```

example/virus/NC_024123.fna,example/host/NC_017548.fna

NC_024123.1:52942-52968,NC_017548.1:1456873-1456847

NC_024123.1:52970-53009,NC_017548.1:1456845-1456806

NC_024123.1:53011-53102,NC_017548.1:1456804-1456713

NC_024123.1:53107-53147,NC_017548.1:1456708-1456668

NC_024123.1:53830-53854,NC_017548.1:2647971-2647947

NC_024123.1:54794-54827,NC_017548.1:679998-679965

```
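
The output shown above is easy to post-process. The following Python sketch parses each line into coordinates and flags reverse-complement matches by checking whether the host interval is reversed. `Match` and `parse_matches` are hypothetical names, and the `length` property assumes that both interval ends are inclusive.

```python
from typing import NamedTuple


class Match(NamedTuple):
    virus_id: str
    virus_start: int
    virus_end: int
    host_id: str
    host_start: int
    host_end: int

    @property
    def reverse_complement(self) -> bool:
        # A reversed host interval marks a reverse-complement match.
        return self.host_start > self.host_end

    @property
    def length(self) -> int:
        # Assumes inclusive coordinates on both ends.
        return abs(self.virus_end - self.virus_start) + 1


def _parse_interval(field):
    """Split 'ID:start-end' into (ID, start, end)."""
    seq_id, span = field.rsplit(":", 1)
    start, end = span.split("-")
    return seq_id, int(start), int(end)


def parse_matches(text):
    """Parse matcher output; the first line names the two input files."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    matches = []
    for line in lines[1:]:
        virus_field, host_field = line.split(",")
        matches.append(Match(*_parse_interval(virus_field),
                             *_parse_interval(host_field)))
    return matches
```

Applied to the example output, every match is flagged as reverse-complement because each host interval runs from a higher to a lower coordinate.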

## Citing

Zielezinski A, Deorowicz S, Gudyś A. PHIST: fast and accurate prediction of prokaryotic hosts from metagenomic viral sequences, Bioinformatics. 2022, 38(5):1447-9. doi:[10.1093/bioinformatics/btab837](https://doi.org/10.1093/bioinformatics/btab837).
