https://github.com/dcdanko/kraken-linked


# KrakenLinked

This is a modification of KrakenUniq that adds two algorithms specific to metagenomic linked reads: Tree-Pruning and Taxa Promotion. The algorithms are described in [this preprint](https://www.biorxiv.org/content/10.1101/549667v1).

This work is still under review and this codebase should be considered preliminary. Most of this repo is unchanged from KrakenUniq.

KrakenUniq (formerly KrakenHLL) taxonomic sequence classification system with unique k-mer counting
===============================================

[Kraken](https://github.com/DerrickWood/kraken) is a fast taxonomic classifier for metagenomics data. This project, KrakenUniq, adds some functionality on top of it, most notably a unique k-mer count using the HyperLogLog algorithm. Spurious identifications due to sequence contamination in the dataset or database often attract many reads; however, those reads usually cover only a small portion of the genome.

KrakenUniq computes the number of unique k-mers observed for each taxon, which makes it possible to filter out more false positives. Here's a small example of a classification against a viral database with k=25. Three species are identified by just one read: Enterobacteria phage BP-4795, Salmonella phage SEN22, and Sulfolobus monocaudavirus SMV1. Of those, the identification of Salmonella phage SEN22 is the strongest, as the read was matched with 116 k-mers that are unique to the sequence, while the match to Sulfolobus monocaudavirus SMV1 is based on only a single 25-mer.

```
99.0958 2192 2192 255510 272869 no rank 0 unclassified
0.904159 20 0 2361 2318 no rank 1 root
0.904159 20 0 2361 2318 superkingdom 10239 Viruses
0.904159 20 0 2361 2318 no rank 35237 dsDNA viruses, no RNA stage
0.768535 17 0 2074 2063 order 548681 Herpesvirales
0.768535 17 0 2074 2063 family 10292 Herpesviridae
0.768535 17 0 2074 2063 subfamily 10374 Gammaherpesvirinae
0.768535 17 0 2074 2063 genus 10375 Lymphocryptovirus
0.768535 17 16 2001 1987 species 10376 Human gammaherpesvirus 4
0.045208 1 1 4 4 sequence 1000041143 KC207814.1 Human herpesvirus 4 strain Mutu, complete genome
0.0904159 2 0 254 254 order 28883 Caudovirales
0.045208 1 0 28 28 family 10699 Siphoviridae
0.045208 1 0 28 28 genus 186765 Lambdavirus
0.045208 1 0 28 28 no rank 335795 unclassified Lambda-like viruses
0.045208 1 1 28 28 species 196242 Enterobacteria phage BP-4795
0.045208 1 0 116 116 family 10744 Podoviridae
0.045208 1 0 116 116 no rank 196895 unclassified Podoviridae
0.045208 1 0 116 116 no rank 1758253 Escherichia phage phi191 sensu lato
0.045208 1 1 116 116 species 1647458 Salmonella phage SEN22
0.045208 1 0 1 1 no rank 51368 unclassified dsDNA viruses
0.045208 1 1 1 1 species 1351702 Sulfolobus monocaudavirus SMV1
```
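A report like this can be filtered on the k-mer count to drop weakly supported taxa. The following is a minimal sketch, not part of KrakenUniq: `filter_report` is a hypothetical helper, and it assumes the whitespace-separated layout shown above with the k-mer count in column 4 (adjust the column if your report differs).

```shell
# filter_report MIN: read a report on stdin and keep rows whose k-mer count
# (assumed to be column 4, as in the sample above) is at least MIN.
filter_report() {
    awk -v min="$1" '$4 ~ /^[0-9]+$/ && $4 + 0 >= min + 0'
}

# Three single-read species rows copied from the sample report.
printf '%s\n' \
    '0.045208 1 1 28 28 species 196242 Enterobacteria phage BP-4795' \
    '0.045208 1 1 116 116 species 1647458 Salmonella phage SEN22' \
    '0.045208 1 1 1 1 species 1351702 Sulfolobus monocaudavirus SMV1' |
    filter_report 25
```

With a threshold of 25, the SMV1 row, which is supported by a single k-mer, is dropped, while the other two rows pass.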

## Usage

For usage, see `krakenuniq --help`. Note that you can use the same database as Kraken, with one difference: instead of the files `DB_DIR/taxonomy/nodes.dmp` and `DB_DIR/taxonomy/names.dmp` that Kraken relies on, KrakenUniq needs the file `DB_DIR/taxDB`. This can be generated with the script `build_taxdb`: `KRAKEN_DIR/build_taxdb DB_DIR/taxonomy/names.dmp DB_DIR/taxonomy/nodes.dmp > DB_DIR/taxDB`. The code behind the taxDB is based on [k-SLAM](https://github.com/aindj/k-SLAM).

### Differences to `kraken`
- Use `krakenuniq --report-file FILENAME ...` to write the kraken report to `FILENAME`.
- Use `krakenuniq --db DB1 --db DB2 --db DB3 ...` to attempt, for each k-mer, to assign it based on DB1 first, then DB2, then DB3. You can use this to prefer identifications based on DB1 (e.g. human and contaminant sequences), then DB2 (e.g. completed bacterial genomes), then DB3, etc. Note that this option is incompatible with `krakenuniq-build --generate-taxonomy-ids-for-sequences`, since the taxDB must be identical across all databases.
- Add a `.gz` suffix to output file names to generate gzipped output files.
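
The options in this list can be combined in one call. The sketch below only prints the resulting command (a dry run). The database directory names and `reads.fa` are placeholders, `print_classify_cmd` is a hypothetical helper, and passing the input file as a positional argument is an assumption carried over from Kraken's command line.

```shell
# Placeholder paths (assumptions, not taken from this README).
HOST_DB=host_db          # consulted first for every k-mer
BACTERIA_DB=bacteria_db  # consulted second
VIRAL_DB=viral_db        # consulted last

# Echo the command instead of running it, so the sketch works without
# KrakenUniq installed. Remove `echo` to run the classification.
print_classify_cmd() {
    echo krakenuniq \
        --db "$HOST_DB" --db "$BACTERIA_DB" --db "$VIRAL_DB" \
        --report-file report.tsv.gz \
        "$1"
}

print_classify_cmd reads.fa
```

The order of the `--db` flags sets the priority, and the `.gz` suffix on the report name should, per the list above, produce a gzipped report.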

### Differences to `kraken-build`
- Use `krakenuniq-build --generate-taxonomy-ids-for-sequences ...` to add pseudo-taxonomy IDs for each sequence header. An example of the result is in the output above: one read has been assigned specifically to `KC207814.1 Human herpesvirus 4 strain Mutu, complete genome`.
- `seqid2taxid.map`, which maps sequence IDs to taxonomy IDs, does NOT parse or require `>gi|`; instead, the sequence ID is the header up to just before the first space.

## FAQ

### Installing KrakenUniq on macOS
By default, macOS links `g++` to `clang`, which lacks OpenMP support. You can install GCC with Homebrew and use the `-c` option of `install_krakenuniq.sh` to specify the Homebrew `g++` (replace `g++-8` with the version Homebrew installed):
```
brew install gcc
./install_krakenuniq.sh -c g++-8
```

### Installing Jellyfish v1.1.11

Currently, building KrakenUniq databases depends on Jellyfish v1.1.11. To install Jellyfish alongside KrakenUniq, use the `-j` flag of the `install_krakenuniq.sh` script. Alternatively, you can point `krakenuniq` at an existing Jellyfish binary with `krakenuniq --jellyfish-bin /usr/bin/jellyfish1`.

### Building a microbial nt database

KrakenUniq supports building databases on subsets of the NCBI nucleotide collection nr/nt, best known as the standard database for BLASTn. On the command line, you can choose to extract all bacterial, viral, archaeal, protozoan, fungal, and helminth sequences. The list of protozoan taxa is based on [Kaiju's](https://raw.githubusercontent.com/bioinformatics-centre/kaiju/master/util/taxonlist.tsv).

Example command line:
```
krakenuniq-download --db DB --taxa "archaea,bacteria,viral,fungi,protozoa,helminths" --dust --exclude-environmental-taxa nt
```

### Custom databases with NCBI taxonomy
To build a custom database with the NCBI taxonomy, first download the taxonomy files with
```
krakenuniq-download --db DBDIR taxonomy
```
Then you can add the desired sequence files to the `DBDIR/library` directory:
```
cp SEQ1.fa SEQ2.fa DBDIR/library
```
KrakenUniq needs a _sequence ID to taxonomy ID mapping_ for each sequence. These mappings can be provided in `DBDIR/library/*.map` files; KrakenUniq pools all `.map` files inside the `library/` folder before building the database. Format: three tab-separated fields that are, in order, the sequence ID (i.e. the sequence header without '>' up to the first space), the taxonomy ID, and the genome or assembly name:
```
Strain1_Chr1_Seq	562	E. coli Strain Foo
Strain1_Chr2_Seq	562	E. coli Strain Foo
Strain1_Plasmid1_Seq	562	E. coli Strain Foo
Strain2_Chr1_Seq	621	S. boydii Strain Bar
Strain2_Plasmid1_Seq	621	S. boydii Strain Bar
```
The third column is optional and is used by KrakenUniq only when `--taxids-for-genomes` is specified for `krakenuniq-build`, which adds new nodes to the taxonomy tree for each genome. If you would also like the sequence identifiers in the taxonomy report, specify `--taxids-for-sequences` for `krakenuniq-build` as well.
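
A `.map` file in the format described above can be derived from the FASTA headers. This is a minimal sketch rather than a KrakenUniq tool: `make_map` is a hypothetical helper, and it assumes every sequence in the file belongs to the same taxon and genome.

```shell
# make_map FASTA TAXID [NAME]: print one tab-separated mapping line per
# sequence in FASTA. NAME (the optional third column) is omitted if not given.
make_map() {
    awk -v taxid="$2" -v name="${3:-}" '
        /^>/ {
            id = substr($1, 2)   # header without ">" up to the first whitespace
            if (name != "") print id "\t" taxid "\t" name
            else            print id "\t" taxid
        }' "$1"
}

# Demo on a throwaway FASTA file with made-up sequences.
demo_fa=$(mktemp)
printf '>Strain1_Chr1_Seq chromosome 1\nACGT\n>Strain1_Plasmid1_Seq\nGGCC\n' > "$demo_fa"
make_map "$demo_fa" 562 "E. coli Strain Foo"
rm -f "$demo_fa"
```

For a real library file, redirect the output next to the sequences, for example `make_map SEQ1.fa 562 "E. coli Strain Foo" > DBDIR/library/SEQ1.map`. Note that `awk` also splits the header on tabs, not only on spaces.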

Finally, run `krakenuniq-build`:
```
krakenuniq-build --db DBDIR --taxids-for-genomes --taxids-for-sequences
```

Note that for custom databases with fewer sequences you might want to choose a smaller k (default: `--kmer-len 31`) and minimizer length (default: `--minimizer-len 15`).

### Custom databases with custom taxonomies

When using custom taxonomies, please provide `DBDIR/taxonomy/nodes.dmp` and `DBDIR/taxonomy/names.dmp` according to the format of NCBI taxonomy dumps.
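
NCBI taxonomy dumps separate fields with a tab, a pipe, and a tab, and end each row with a tab and a pipe. The sketch below writes a tiny three-node taxonomy in that layout. The taxonomy IDs and names are made up for illustration, and only the leading columns of `nodes.dmp` (tax ID, parent tax ID, rank) are written; whether KrakenUniq needs the remaining columns of a full NCBI dump is not stated here, so treat that as an assumption to verify.

```shell
# Use a scratch directory unless DBDIR is already set.
DBDIR=${DBDIR:-$(mktemp -d)}
mkdir -p "$DBDIR/taxonomy"

# nodes.dmp: tax_id | parent tax_id | rank   (the root is its own parent)
printf '%s\t|\t%s\t|\t%s\t|\n' \
    1   1   'no rank' \
    100 1   genus \
    101 100 species > "$DBDIR/taxonomy/nodes.dmp"

# names.dmp: tax_id | name_txt | unique name | name class
printf '%s\t|\t%s\t|\t%s\t|\t%s\t|\n' \
    1   root                    '' 'scientific name' \
    100 Examplegenus            '' 'scientific name' \
    101 'Examplegenus exemplum' '' 'scientific name' > "$DBDIR/taxonomy/names.dmp"

cat "$DBDIR/taxonomy/nodes.dmp" "$DBDIR/taxonomy/names.dmp"
```

`printf` reuses its format string for each group of arguments, so every group becomes one row of the dump file.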