https://github.com/plasmapower/intersections
Analyzes intersections between genes and sequences
https://github.com/plasmapower/intersections
Last synced: about 1 year ago
JSON representation
Analyzes intersections between genes and sequences
- Host: GitHub
- URL: https://github.com/plasmapower/intersections
- Owner: PlasmaPower
- License: mit
- Created: 2017-05-29T19:57:24.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2017-09-14T18:30:36.000Z (almost 9 years ago)
- Last Synced: 2024-10-12T23:20:41.179Z (over 1 year ago)
- Language: Rust
- Homepage:
- Size: 30.3 KB
- Stars: 1
- Watchers: 2
- Forks: 1
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Intersections
This program finds the overlap of sequences and genes using format 6 blastn output (http://www.metagenomics.wiki/tools/blast/blastn-output-format-6)
```
qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
Query_1 accn|JISN01000002 100.000 28 0 0 29 56 37930 37957 1.32e-08 52.8
```
and gff3 output (from prokka)
```
##gff-version 3
##sequence-region accn_JISN01000001 1 334949
...
accn_JISN01000001 Prodigal:2.6 CDS 240 2849 . + 0 ID=NKHGEDLF_00001;Name=clpB;gene=clpB;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:Q7A6G6;locus_tag=NKHGEDLF_00001;product=Chaperone protein ClpB
...
>accn_JISN01000001
AATTAATTATCGACCAAGAAAGTGTTTAAATTGGAAGTTTCCTTATGAAGTTTTAT
...
```
Lines 9 and 10 of the blastn output are compared to lines 4 and 5 of the gff3 file (section type 2) for overlap. Any number of bla files can be intersected with an equal number of MATCHING gff files.
## Prerequisites
Folder of .bla files and .gff files MATCHED by NAME (I.E. genome1.bla genome1.gff genome2.bla genome2.gff). Bla files are files created in blastn format 6 by the blasting of one or more sequences against the respective genome. Gff3 files are created (for example) by prokka v1.12 (http://www.vicbioinformatics.com/software.prokka.shtml) for a respective genome.
## Installing
First download rust (instructions from https://rustup.rs/)
```
curl https://sh.rustup.rs -sSf | sh
```
Then download the crate for intersections
```
cargo +nightly install sequence-intersections
```
Intersections can then be found in ~/.cargo/bin/
If a previous version of intersections already exists in the directory use
```
cargo +nightly install -f sequence-intersections
```
## Output and Options
| Column | Description |
| --- | --- |
| name | Name of gene according to gff file. Regions between two genes are denoted Between(GeneNameBefore, GeneNameAfter). Hypothetical proteins are denoted HypotheticalAfter(GeneName) or HypotheticalBefore(GeneName) |
| product | Product of gene according to gff file. Same style as name. |
| total_overlap | Amount of sequence which intersected at this gene. If a sequence of 31 in the blast in put file completely overlapped with this gene (IE blast was in ID_1 and spanned 1000-1031 and the gene was in ID_1 and spanned 1000-1500) then the total_overlap for this gene would add +31. |
| genome_count | The number of genomes which had at least one sequence overlap this gene with at least 1 total_overlap. |
| start_avg | The average start for this gene according to the gff file. |
| start_stdev | The standard deviation of the start of this gene. |
| end_avg | The average end for this gene according to the gff file. |
| end_stdev | The standard deviation of the end of this gene. |
| length_avg | The average span of each gene (# of nucleotides long). Is not related to start or end location but only length of the gene. |
## Example
Example blast and gff intersections at: https://github.com/dUmich/intersections-example
## Errors
Run with this command preceding to get warnings
```
RUST_LOG=warn
```
## Built with
## Versioning
## Authors
* **Lee Bousfield** - *Free-lance code wizard* - [PlasmaPower](https://github.com/PlasmaPower)
* **Daniel Harris** - *Researcher, Snitkin Lab, University of Michigan* - [dUmich](https://github.com/dUmich)