https://github.com/refresh-bio/phist
Phage-Host Interaction Search Tool
bacterial-genomes bioinformatics genomics host-prediction k-mers phages
- Host: GitHub
- URL: https://github.com/refresh-bio/phist
- Owner: refresh-bio
- License: gpl-3.0
- Created: 2021-08-11T20:18:23.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2024-07-08T14:49:09.000Z (almost 2 years ago)
- Last Synced: 2024-07-08T18:27:50.061Z (almost 2 years ago)
- Topics: bacterial-genomes, bioinformatics, genomics, host-prediction, k-mers, phages
- Language: C++
- Size: 12.2 MB
- Stars: 26
- Watchers: 4
- Forks: 1
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-virome - PHIST - K-mer based phage-host prediction. [source] [C++] (Host Prediction / RNA Virus Identification)
README
# PHIST
[Bioconda package](https://anaconda.org/bioconda/phist)
[GitHub Actions builds](https://github.com/refresh-bio/PHIST/actions)
**Phage-Host Interaction Search Tool**
A tool to predict prokaryotic hosts for phage (meta)genomic sequences. PHIST links viruses to hosts based on the number of *k*-mers shared between their sequences.
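
To make the idea of shared *k*-mers concrete, here is a toy Python sketch that counts the distinct *k*-mers two short sequences have in common. It is only an illustration, not how PHIST or Kmer-db are implemented. The function names are hypothetical, and treating a *k*-mer and its reverse complement as the same *k*-mer is an assumption of this sketch.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def canonical_kmers(seq, k):
    """Collect the distinct canonical k-mers of a sequence.

    Assumption of this sketch: a k-mer and its reverse complement count as
    one k-mer, represented by the lexicographically smaller of the two.
    """
    seq = seq.upper()
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip windows with ambiguous bases
            kmers.add(min(kmer, reverse_complement(kmer)))
    return kmers


def shared_kmer_count(virus, host, k):
    """Number of distinct canonical k-mers present in both sequences."""
    return len(canonical_kmers(virus, k) & canonical_kmers(host, k))


# "GGGTTT" is the reverse complement of "AAACCC", so all four 3-mers are shared.
print(shared_kmer_count("AAACCC", "GGGTTT", 3))
```

Real genomes are far too large for this naive set-based approach at scale, which is why PHIST delegates the counting to Kmer-db.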

## Quick start
```bash
git clone --recurse-submodules https://github.com/refresh-bio/PHIST
cd PHIST
make
./phist.py ./example/virus ./example/host ./out/
```
## Installation
PHIST uses [Kmer-db](https://github.com/refresh-bio/kmer-db) as a submodule, therefore a recursive repository clone must be performed:
```
git clone --recurse-submodules https://github.com/refresh-bio/PHIST
```
Under Linux/OS X, the package can be built by running `make` in the project directory (tested with G++ 5.3):
```
cd PHIST
make
```
Under Windows, one has to build the Visual Studio 2015 solutions in the *kmer-db* and *utils* subdirectories (use the Release 64-bit configuration, as the Python script depends on the default VS output directory structure).
## Usage
PHIST takes as input genomic sequences of viruses and candidate hosts in FASTA files (gzipped or not). Virus genomes may be provided in a single FASTA file or in a directory containing multiple FASTA files (one genome per file). Candidate host genomes should be stored individually in a directory (one genome per FASTA file) (see [example](./example/)).
```
./phist.py [options] <virus_path> <host_dir> <out_dir>
```
Positional arguments:
* `virus_path` Input FASTA file or directory with files (plain or gzip)
* `host_dir` Input directory w/ host FASTA files (plain or gzip)
* `out_dir` Output directory (will be created if it does not exist)
Options:
* `-k ` *k*-mer length (default: 25, max: 30)
* `-t ` Number of threads (default: number of cores)
* `-h, --help` Show this help message and exit
* `--keep_temp` Keep temporary kmer-db files [False]
* `--version` Show tool's version number and exit
### Usage example
```
./phist.py example/virus/ example/host/ out/
```
```
./phist.py example/virus_multifasta.fna example/host/ out/
```
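
When PHIST is called from a pipeline, the command line can be assembled programmatically. The following is a minimal sketch that uses only the options listed above (`-k`, `-t`, `--keep_temp`). The helper names `build_phist_command` and `run_phist` are hypothetical, and the script location `./phist.py` is assumed to be relative to the current working directory.

```python
import subprocess


def build_phist_command(virus_path, host_dir, out_dir, k=25, threads=None,
                        keep_temp=False, script="./phist.py"):
    """Assemble the argument list for a PHIST run (hypothetical helper)."""
    if not 1 <= k <= 30:
        raise ValueError("k must be between 1 and 30 (documented maximum: 30)")
    cmd = [script, "-k", str(k)]
    if threads is not None:          # otherwise PHIST uses the number of cores
        cmd += ["-t", str(threads)]
    if keep_temp:                    # keep temporary kmer-db files
        cmd.append("--keep_temp")
    cmd += [virus_path, host_dir, out_dir]
    return cmd


def run_phist(*args, **kwargs):
    """Run PHIST and raise CalledProcessError on a non-zero exit status."""
    subprocess.run(build_phist_command(*args, **kwargs), check=True)


print(build_phist_command("example/virus/", "example/host/", "out/", threads=4))
```

Passing the arguments as a list rather than a shell string avoids quoting problems with paths that contain spaces.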
## Output format
PHIST outputs two CSV files: one containing a table of common *k*-mers between phages and hosts, and a second with virus-host predictions.
### Common *k*-mers table
The [common_kmers.csv](./example/common_kmers.csv) file stores the numbers of common *k*-mers between phages (in columns) and hosts (in rows) in a sparse form. Specifically, zeros are omitted, while non-zero *k*-mer counts are represented as pairs (*column_number* : *value*) with 1-based column indexing. Thus, rows may have a different number of elements, e.g.:
| | | | | | |
| :---: | :---: | :---: | :---: | :---: | :---: |
| kmer-length: *k* fraction: *f* | phages | *φ1* | *φ2* | ... | *φn* |
| hosts | total-kmers | \|*φ1*\| | \|*φ2*\| | ... | \|*φn*\| |
| *h1* | \|*h1*\| | *i11* : \|*h1 ∩ φi11*\| | *i12* : \|*h1 ∩ φi12*\| | | |
| *h2* | \|*h2*\| | *i21* : \|*h2 ∩ φi21*\| | *i22* : \|*h2 ∩ φi22*\| | *i23* : \|*h2 ∩ φi23*\| | |
| *h3* | \|*h3*\| | | | | |
| ... | ... | ... | | | |
| *hm* | \|*hm*\| | *im1* : \|*hm ∩ φim1*\| | | | |
where:
* *k* - k-mer length,
* *φ1*, *φ2*, ..., *φn* - phage names,
* *h1*, *h2*, ..., *hm* - host names,
* |*a*| - number of k-mers in sample *a*,
* |*a ∩ b*| - number of k-mers common for samples *a* and *b*.
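
The sparse layout described above can be read back into a nested dictionary with a few lines of Python. This is a minimal sketch under stated assumptions: the file is comma-separated, the first line carries the phage names from the third column onward, the second line carries per-phage *k*-mer totals, and each later row holds a host name, its total, and `column: value` cells. The function name `parse_common_kmers` and the sample names (`phiA`, `h1`, ...) are hypothetical.

```python
import csv
import io


def parse_common_kmers(text):
    """Parse a sparse common k-mers table into {host: {phage: count}}."""
    rows = list(csv.reader(io.StringIO(text)))
    # Phage names start in the third column of the first line.
    phages = [name.strip() for name in rows[0][2:] if name.strip()]
    counts = {}
    for row in rows[2:]:                 # rows[1] holds per-phage k-mer totals
        if not row:
            continue
        host = row[0].strip()
        shared = {}
        for cell in row[2:]:             # row[1] is the host's total k-mer count
            cell = cell.strip()
            if not cell:                 # omitted zeros leave nothing to parse
                continue
            column, value = cell.split(":")
            shared[phages[int(column) - 1]] = int(value)  # 1-based column index
        counts[host] = shared
    return counts


sample = """kmer-length: 25 fraction: 1,phages,phiA,phiB,phiC
hosts,total-kmers,1000,1200,900
h1,50000,1: 12,3: 40
h2,48000,2: 7
h3,51000
"""
print(parse_common_kmers(sample))
```

Hosts that share no *k*-mers with any phage, such as `h3` in the sample, simply map to an empty dictionary.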
### Host predictions
The [predictions.csv](./example/predictions.csv) file assigns each phage to its most likely host (i.e., the one having the most *k*-mers in common). If multiple potential hosts share the same number of common *k*-mers, all are reported. Each virus-host interaction is followed by a *p*-value and a *p*-value adjusted for multiple comparisons.
| phage | host | common *k*-mers | *p*-value | adj. *p*-value |
| :---: | :---: | :---: | :---: | :---: |
| *φ1* | *host*(*φ1*) | \|*φ1* ∩ *host*(*φ1*)\| | ... | ... |
| *φ2* | *host*(*φ2*) | \|*φ2* ∩ *host*(*φ2*)\| | ... | ... |
| *φ3* | *host1*(*φ3*) | \|*φ3* ∩ *host1*(*φ3*)\| | ... | ... |
| *φ3* | *host2*(*φ3*) | \|*φ3* ∩ *host2*(*φ3*)\| | ... | ... |
| ... | ... | ... | ... | ... |
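
The selection rule (most common *k*-mers wins, ties are all reported) can be sketched in Python as follows. This only illustrates the rule as described here; it does not reproduce PHIST's code and does not compute the *p*-values. The function name `best_hosts` and the sample names are hypothetical, and the input is assumed to be a mapping of host to per-phage shared *k*-mer counts.

```python
def best_hosts(counts):
    """Map each phage to (max shared k-mers, [hosts reaching that maximum]).

    counts: {host: {phage: shared_kmer_count}}
    """
    best = {}
    for host, shared in counts.items():
        for phage, n in shared.items():
            if n <= 0:                       # zero counts never make a prediction
                continue
            top, hosts = best.setdefault(phage, (n, []))
            if n > top:
                best[phage] = (n, [host])    # new best host replaces the old ones
            elif n == top:
                hosts.append(host)           # tie: report every host at the maximum
    return best


counts = {
    "h1": {"phiA": 12, "phiC": 40},
    "h2": {"phiB": 7, "phiC": 40},
    "h3": {"phiA": 3},
}
for phage, (n, hosts) in sorted(best_hosts(counts).items()):
    print(phage, n, hosts)
```

In this sample, `phiC` shares 40 *k*-mers with both `h1` and `h2`, so both hosts are reported for it.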
## Further analysis
The `utils/matcher` tool retrieves the list of all exact matches of length >= *k* for a given pair of phage and host FASTA sequences. The matches are provided with their coordinates in the viral and corresponding bacterial genome (a reversed interval in the latter indicates a reverse complement match).
### Usage
```
./utils/matcher [options] <virus> <host> <output>
```
Positional arguments:
* `virus` virus FASTA file (gzipped or not),
* `host` host FASTA file (gzipped or not),
* `output` output CSV file
Options:
* `-k --k ` *k*-mer length (default: 25, max: 30; may differ from the one used in the PHIST execution)
### Example
```
./utils/matcher example/virus/NC_024123.fna example/host/NC_017548.fna shared_regions.csv
```
```
example/virus/NC_024123.fna,example/host/NC_017548.fna
NC_024123.1:52942-52968,NC_017548.1:1456873-1456847
NC_024123.1:52970-53009,NC_017548.1:1456845-1456806
NC_024123.1:53011-53102,NC_017548.1:1456804-1456713
NC_024123.1:53107-53147,NC_017548.1:1456708-1456668
NC_024123.1:53830-53854,NC_017548.1:2647971-2647947
NC_024123.1:54794-54827,NC_017548.1:679998-679965
```
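
The output above can be post-processed with a short Python script, for example to flag reverse complement matches and compute match lengths. This is a sketch under assumptions: coordinates are taken to be inclusive (which gives a length of 27 for the first match), and any line that does not look like `id:start-end,id:start-end` (such as the first line with file names) is skipped. The names `Match` and `parse_matches` are hypothetical.

```python
import re
from typing import NamedTuple

# One match per line: <virus_id>:<start>-<end>,<host_id>:<start>-<end>
LINE_PATTERN = re.compile(r"([^,]+):(\d+)-(\d+),([^,]+):(\d+)-(\d+)")


class Match(NamedTuple):
    virus_id: str
    virus_start: int
    virus_end: int
    host_id: str
    host_start: int
    host_end: int

    @property
    def length(self):
        # Assumes inclusive coordinates.
        return abs(self.virus_end - self.virus_start) + 1

    @property
    def reverse_complement(self):
        # A reversed host interval indicates a reverse complement match.
        return self.host_start > self.host_end


def parse_matches(lines):
    """Parse matcher output lines, skipping anything that is not a match."""
    matches = []
    for line in lines:
        found = LINE_PATTERN.fullmatch(line.strip())
        if found is None:
            continue
        v_id, v_start, v_end, h_id, h_start, h_end = found.groups()
        matches.append(Match(v_id, int(v_start), int(v_end),
                             h_id, int(h_start), int(h_end)))
    return matches


sample_lines = [
    "example/virus/NC_024123.fna,example/host/NC_017548.fna",
    "NC_024123.1:52942-52968,NC_017548.1:1456873-1456847",
    "NC_024123.1:54794-54827,NC_017548.1:679998-679965",
]
for match in parse_matches(sample_lines):
    print(match.virus_id, match.length, match.reverse_complement)
```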
## Citing
Zielezinski A, Deorowicz S, Gudyś A. PHIST: fast and accurate prediction of prokaryotic hosts from metagenomic viral sequences, Bioinformatics. 2022, 38(5):1447-9. doi:[10.1093/bioinformatics/btab837](https://doi.org/10.1093/bioinformatics/btab837).