https://github.com/tseemann/sixess
🔬🐛 Rapid 16s rRNA identification from isolate FASTQ files
- Host: GitHub
- URL: https://github.com/tseemann/sixess
- Owner: tseemann
- License: gpl-3.0
- Created: 2017-01-26T02:34:42.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2018-04-24T11:49:53.000Z (over 7 years ago)
- Last Synced: 2024-06-13T00:03:07.697Z (over 1 year ago)
- Topics: bacteria, fastq, genomics, species
- Language: Shell
- Homepage:
- Size: 36.9 MB
- Stars: 23
- Watchers: 7
- Forks: 2
- Open Issues: 3
:warning: **THIS SOFTWARE IS STILL UNDER DEVELOPMENT - USE AT OWN RISK**
# sixess
Rapid 16S rDNA identification from isolate FASTQ files

## Introduction
`sixess` is a command-line software tool to identify
bacterial species based on 16S rDNA sequence directly
from WGS FASTQ data. It includes databases from
NCBI (default), RDP and SILVA.

## Quick start
```
# just give it sequences!
% sixess R1.fastq.gz
Staphylococcus epidermidis

# sometimes there is no match
% sixess /dev/null
No matches

# give it as many sequence files as needed
% sixess R1.fq R2.fq
Enterococcus faecium

# we provide different databases you can choose
% sixess -d RDP contigs.fa
Bacillus cereus

# you can pipe to stdin too
% bzcat chernobyl.fq.bz2 | sixess -
Deinococcus radiodurans
```

## Installation
### Source
```
cd $HOME
git clone https://github.com/tseemann/sixess
export PATH=$HOME/sixess/bin:$PATH
```
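
After cloning, it can help to confirm that the script and the aligner it relies on are reachable from your shell. The sketch below assumes `minimap2` is the only external tool worth checking (it is the aligner named elsewhere in this README); `require_tools` is a hypothetical helper and not part of `sixess`.

```shell
# Hypothetical helper: report every named command that is not on PATH.
# Returns 0 if all commands are found, 1 otherwise.
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example: check the script and its aligner, then print the version
# require_tools sixess minimap2 && sixess -V
```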
### Homebrew
```
brew install brewsci/bio/sixess # COMING SOON
```
### Bioconda
```
conda install -c bioconda -c conda-forge sixess # COMING SOON
```

## Usage
### Input
The input can be one or more sequence files, or `-` denoting `stdin`.
The input data can be FASTQ or FASTA, and may be `.gz` compressed.
Any read length is accepted, even whole chromosomes.

### Output
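
Because the result arrives as one line on `stdout`, either `Genus species` or `No matches`, it is easy to fold into a table when screening many isolates. The following is a minimal sketch under that assumption; `split_call` is a hypothetical helper name, and the file names in the commented loop are placeholders.

```shell
# Hypothetical helper: turn one line of sixess output into "genus<TAB>species".
# Assumes the line is either "No matches" or "Genus species".
split_call() {
    case "$1" in
        ""|"No matches") printf 'NA\tNA\n' ;;
        *" "*)           printf '%s\t%s\n' "${1%% *}" "${1#* }" ;;
        *)               printf '%s\tNA\n' "$1" ;;
    esac
}

# Example: one row per isolate (placeholder file names)
# for r1 in *_R1.fastq.gz; do
#     printf '%s\t' "$r1"
#     split_call "$(sixess "$r1")"
# done
```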
The output is a *single line* to `stdout`.
If a match was found, it will be `Genus species`.
If no prediction could be made, it will be `No matches`.

### Options
```
-q Quiet mode, no output
-p DIR Database folder (/home/tseemann/git/sixess/db)
-d FILE Database {NCBI RDP SILVA.gz} (NCBI)
-t NUM CPU threads (1)
-m FILE Save alignments to FILE in PAF format
-V Print version and exit
```

* `-q` enables "quiet mode", which only prints to stderr for errors
* `-p` is the location of the sequence databases
* `-d` selects the database, which can be `.gz` compressed (see [Databases](#databases))
* `-t` increases threads; 3 is the suggested value for `minimap2`
* `-m` allows you to save the PAF output of `minimap2`
* `-V` prints the version and exits, *e.g.* `sixess 1.0`

## Databases
### NCBI (bundled, default)
The [NCBI 16S ribosomal RNA project](https://www.ncbi.nlm.nih.gov/refseq/targetedloci/)
contains curated 16S ribosomal RNA bacteria and archaea RefSeq entries.
It has ~20,000 entries.

```
# `which sixess` is the script itself, so take its directory before adding ../db
esearch -db nucleotide -query '33175[BioProject] OR 33317[BioProject]' \
  | efetch -db nuccore -format fasta \
  > "$(dirname "$(which sixess)")/../db/NCBI"
```

### RDP (bundled)
Bacterial 16S rDNA sequences for "type strains"
from the [RDP](https://rdp.cme.msu.edu/) database
are included. These are denoted with `(T)` in the
FASTA headers. It contains ~10,000 entries.

```
wget --no-check-certificate https://rdp.cme.msu.edu/download/current_Bacteria_unaligned.fa.gz
# keep only type strains, marked "(T)" in the header, and uppercase the sequence
gunzip -c current_Bacteria_unaligned.fa.gz \
  | bioawk -cfastx '/\(T\)/{print ">" $name " " $comment "\n" toupper($seq)}' \
  > "$(dirname "$(which sixess)")/../db/RDP"
```

### SILVA (bundled)
[SILVA](https://www.arb-silva.de/)
is a comprehensive on-line resource for quality checked and
aligned ribosomal RNA sequence data.
The filtered version of the aligned 16S/18S/SSU database
contains ~100,000 entries.

```
# replace "132" with latest version as needed
wget https://www.arb-silva.de/fileadmin/silva_databases/release_132/Exports/SILVA_132_SSURef_Nr99_tax_silva.fasta.gz
# -c writes to stdout so the data reaches the pipe (-v alone decompresses in place)
gunzip -c SILVA_132_SSURef_Nr99_tax_silva.fasta.gz \
  | bioawk -cfastx \
    '$comment ~ /^Bacteria;|^Archaea;/ \
     && $comment !~ /(;unidentified|Mitochondria;|;Chloroplast|;uncultured| sp\.)/ \
     { sub(/^.*;/,"",$comment);
       gsub("U","T",$seq);
       print ">" $name " " $comment "\n" $seq }' \
  | seqtk seq -l 60 -U \
  > SILVA.tmp1
# collapse identical sequences
cd-hit-est -i SILVA.tmp1 -o SILVA.tmp2 -c 1.0 -T 0 -M 2000 -d 250
cp SILVA.tmp2 "$(dirname "$(which sixess)")/../db/SILVA"
rm -f SILVA.tmp1 SILVA.tmp2 SILVA.tmp2.clstr
```

## Custom databases
Assuming you have a FASTA file of 16S DNA sequences
called, say, `/home/alex/GG.fa`, you can do this:

### Global installation
```
cp /home/alex/GG.fa "$(dirname "$(which sixess)")/../db/GG"
sixess -d GG R1.fastq.gz
```

### Local installation
```
sixess -p /home/alex -d GG.fa R1.fastq.gz
```

## Algorithm
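
As a rough illustration of the counting and selection steps listed in this section, the sketch below tallies alignments per reference from PAF input and prints the most frequent target. It is not the code `sixess` itself runs: the tally is unweighted, it counts alignments rather than reads, and it prints the reference sequence name (PAF column 6) rather than a species. `top_hit` is a hypothetical helper, and the `-x sr` preset in the commented usage line is an assumption.

```shell
# Hypothetical helper: read PAF on stdin and print the target sequence name
# (PAF column 6) with the most alignments, or "No matches" if there are none.
# Ties are broken by choosing the name that sorts first.
top_hit() {
    awk -F'\t' '
        NF >= 12 { count[$6]++ }
        END {
            best = ""; max = 0
            for (t in count)
                if (count[t] > max || (count[t] == max && t < best)) {
                    best = t; max = count[t]
                }
            print (max > 0 ? best : "No matches")
        }'
}

# Example (assumed invocation; minimap2 writes PAF by default):
# minimap2 -x sr db/NCBI R1.fastq.gz | top_hit
```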
1. Identify reads which look like 16S (`minimap2`)
2. Count up how many reads hit each 16S sequence (possibly weighted)
3. Choose the top hit and report it

## Feedback
Report bugs and give suggestions on the
[Issues page](https://github.com/tseemann/sixess/issues)

## License
[GPL Version 3](https://raw.githubusercontent.com/tseemann/sixess/master/LICENSE)
## Author
[Torsten Seemann](http://tseemann.github.io)