https://github.com/yhoogstrate/egfr-v3-determiner

Free open-source tool to extract counts of spliced EGFR vIII / EGFR non-vIII (wt) reads

# egfr-v3-determiner #

EGFRvIII is the most common mutation found in the EGFR gene in glioblastoma.
It originates from a genomic deletion or multiple inversions, resulting in the
loss of exons 2 - 7 at the RNA level.

Using RNA-seq, EGFRvIII can be determined based on reads that:

- (1) fall exactly over the splice junction (use `-s` / `--spliced-reads-only` to count only these)
- (2) are 'spanning' the splice junction and are thus mapped perfectly within exons 1 and 8
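
To make the first read type concrete, here is a minimal sketch of how a spliced read could be recognised from its alignment position and CIGAR string. It is not the tool's actual implementation; the coordinates `EXON1_END` and `EXON8_START` are hypothetical toy values, not real hg19/hg38 EGFR positions, and the function names are made up for illustration.

```python
import re

# Hypothetical toy coordinates (1-based, inclusive); NOT real EGFR positions.
EXON1_END = 1000     # last base of exon 1
EXON8_START = 5001   # first base of exon 8

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def splice_junctions(pos, cigar):
    """Return (last base before gap, first base after gap) for each N operation."""
    junctions = []
    ref = pos  # next reference position to be consumed (1-based)
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op == "N":
            junctions.append((ref - 1, ref + length))
        if op in "MDN=X":  # operations that consume reference bases
            ref += length
    return junctions


def is_viii_spliced(pos, cigar):
    """True if the read splices exactly from the end of exon 1 to the start of exon 8."""
    return (EXON1_END, EXON8_START) in splice_junctions(pos, cigar)


print(is_viii_spliced(951, "50M4000N50M"))  # junction matches the toy exon 1 / exon 8 borders
print(is_viii_spliced(901, "100M"))         # unspliced read
```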

**All you need are properly aligned RNA-seq BAM files and this tool.**

Please note that type 2 reads can also arise from other structural variants
in combination with large insert sizes. For instance, if exons 2 - 6 are
deleted and the inner sequence of the RNA fragment is longer than
exon 7, the read pair could be wrongly interpreted as an EGFRvIII read.
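
The caveat comes down to simple arithmetic, sketched below. The 801-base figure is taken from the help text of the `-s` option; the exon 7 length of 150 and the list of inner distances are hypothetical values chosen for illustration, and the function names are not part of the tool.

```python
# Taken from the `-s` help text: inserts longer than this can yield
# wild-type read pairs that cover exon 1 and exon 8.
WT_THRESHOLD = 801


def can_mimic_viii(inner_distance, retained_length):
    """True if the unsequenced inner part of a fragment is longer than the
    transcript sequence retained between exon 1 and exon 8."""
    return inner_distance > retained_length


def fraction_at_risk(inner_distances, retained_length=WT_THRESHOLD):
    """Fraction of fragments whose mates could land in exons 1 and 8."""
    if not inner_distances:
        return 0.0
    hits = sum(1 for d in inner_distances if can_mimic_viii(d, retained_length))
    return hits / len(inner_distances)


# Exons 2 - 6 deleted: only exon 7 remains between exon 1 and exon 8.
HYPOTHETICAL_EXON7_LENGTH = 150  # assumption, not the real exon length
print(can_mimic_viii(200, HYPOTHETICAL_EXON7_LENGTH))

# Wild-type transcript with hypothetical inner distances.
print(fraction_at_risk([100, 200, 900, 1000]))
```

If a noticeable fraction of fragments exceeds the threshold, restricting the analysis to spliced reads with `-s` is the safer choice.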

## What it does ##

We designed a small Python tool for estimating the read counts and/or extracting
the actual read names, which allows further analysis of the sequencing data.

It estimates the number of EGFRvIII and EGFR-wt reads directly from BAM files:

```
$ egfr-v3-determiner -r hg38 tmp/test_001.bam tmp/test_002_wt_non-spliced.bam
```

This produces text output like the following:

| sample | wt-reads | vIII-reads |
|--------|----------|------------|
| tmp/test_001.bam | 0 | 1 |
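
For downstream analysis, the counts can be turned into a vIII fraction per sample. The sketch below assumes the output is tab-delimited and uses exactly the column names shown in the table above; both are assumptions, and `viii_fractions` is a hypothetical helper, not part of the package.

```python
import csv
import io


def viii_fractions(table_text):
    """Map each sample to vIII / (vIII + wt), or None when no reads were counted."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    fractions = {}
    for row in reader:
        wt = int(row["wt-reads"])
        viii = int(row["vIII-reads"])
        total = wt + viii
        fractions[row["sample"]] = viii / total if total else None
    return fractions


# Assumed tab-delimited rendering of the example table.
example = "sample\twt-reads\tvIII-reads\ntmp/test_001.bam\t0\t1\n"
print(viii_fractions(example))
```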

## egfr-v3-determiner Manuscript ##

If you would like to read more about the application of egfr-v3-determiner, or if you use this package for scientific purposes, please read/cite the following manuscript:

[10.1093/neuonc/noab231](https://doi.org/10.1093/neuonc/noab231)

```
Youri Hoogstrate, Santoesha A Ghisai, Maurice de Wit, Iris de Heer,
Kaspar Draaisma, Job van Riet, Harmen J G van de Werken, Vincent Bours,
Jan Buter, Isabelle Vanden Bempt, Marica Eoli, Enrico Franceschi,
Jean-Sebastien Frenel, Thierry Gorlia, Monique C Hanse, Ann Hoeben,
Melissa Kerkhof, Johan M Kros, Sieger Leenstra, Giuseppe Lombardi,
Slávka Lukacova, Pierre A Robe, Juan M Sepulveda, Walter Taal,
Martin Taphoorn, René M Vernhout, Annemiek M E Walenkamp, Colin Watts,
Michael Weller, Filip Y F de Vos, Guido W Jenster, Martin van den Bent,
Pim J French.

The EGFRvIII transcriptome in glioblastoma, a meta-omics analysis,

Neuro-Oncology, 2021; noab231,
https://doi.org/10.1093/neuonc/noab231
```

This will indirectly help me to further maintain this software package.

Thank you in advance, Youri

## Installation ##

```
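# Get the source code
git clone https://github.com/yhoogstrate/egfr-v3-determiner.git
cd egfr-v3-determiner

# Create and activate an isolated Python 3 environment
virtualenv -p python3 .venv
source .venv/bin/activate

# Install the package into the environment
python setup.py install

# Check that the command-line tool is available
egfr-v3-determiner --help

# Run the test suite
nosetests tests/*.py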
```

## Usage ##

```
Usage: egfr-v3-determiner [OPTIONS] [INPUT_BAM]...

Options:
  --version                       Show the version and exit.
  -r, --reference-build [hg19|hg38]
                                  Used reference genome (needed for EGFR exon
                                  coordinates)  [required]
  -s, --spliced-reads-only        If paired end reads with an insert size
                                  longer than 801 bases can be expected, wild-
                                  type exon-1 to exon-8 covering reads can can
                                  be expected. Enabling this flag only uses
                                  spliced reads for vIII determination.
  -n, --read-names                Report all read-names instead of the read
                                  counts.
  -i, --include-interchromosomal  Include paired-end reads that have an
                                  interchromosomal mapped mate (disabled by
                                  default).
  -d, --dataset-suffix TEXT       Adds this suffix to the column names; tabs
                                  and newlines not allowed.
  -f, --include-duplicates        Force including duplicate reads (as marked
                                  by samtools/sambamba/picard etc. - disabled
                                  by default).
  -w, --exons-wt TEXT             Exons included for counting (--spliced-
                                  reads-only disabled). Options: '2' '2,3'
                                  (equals 'default'), '2,3,4', '2,3,4,5',
                                  '2,3,4,5,6', '2,3,4,5,6,7' (equals 'all'),
                                  default: '2,3'.
  -v, --exons-viii TEXT           Exons included for counting (--spliced-
                                  reads-only disabled). Options: '8' '8,9'
                                  (equals 'default'), '8,9,10' (equals 'all'),
                                  default: '8,9'.
  --help                          Show this message and exit.
```

### Genomic reference ###

The genomic reference (hg19/hg38) can be changed using the `-r` or
`--reference-build` argument. Genomic references that name their sequences
'>1' rather than '>chr1' are resolved automatically.
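
One way such a naming difference can be resolved is sketched below. This is an illustration only and not necessarily how the tool does it; `resolve_chromosome` is a hypothetical helper that takes the sequence names found in a BAM header. EGFR is located on chromosome 7, hence the default.

```python
def resolve_chromosome(reference_names, chrom="chr7"):
    """Return the spelling of `chrom` that is present in the BAM header.

    Tries the name as given first, then the variant with or without
    the 'chr' prefix.
    """
    names = set(reference_names)
    if chrom in names:
        return chrom
    alternative = chrom[3:] if chrom.startswith("chr") else "chr" + chrom
    if alternative in names:
        return alternative
    raise ValueError(f"neither {chrom!r} nor {alternative!r} found in the header")


print(resolve_chromosome(["chr1", "chr7", "chrX"]))  # UCSC-style names
print(resolve_chromosome(["1", "7", "X"]))           # names without the 'chr' prefix
```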

### Interchromosomal reads ###

It may happen that one of the mates splices perfectly over exons 1 - 8 but
that its mate is mapped to another chromosome. As these are likely derived
from other structural variants or odd reads, they are excluded by default.

It may nevertheless be interesting to analyse these reads, for instance if
other structural variants are suspected. By using the `-i` /
`--include-interchromosomal` argument, these reads will be included.
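
In SAM/BAM terms, a mate on another chromosome shows up as an RNEXT field that differs from RNAME. The sketch below illustrates that check on plain strings; it is not the tool's code, and the function name and example reads are hypothetical.

```python
def has_interchromosomal_mate(rname, rnext):
    """True if the mate is mapped to a different reference sequence.

    In the SAM format, RNEXT '=' means 'same as RNAME' and '*' means
    that no mate information is available.
    """
    if rnext in ("=", "*"):
        return False
    return rnext != rname


# Hypothetical (read name, RNAME, RNEXT) records.
reads = [
    ("read-1", "chr7", "="),
    ("read-2", "chr7", "chr12"),
    ("read-3", "chr7", "chr7"),
]

kept = [name for name, rname, rnext in reads
        if not has_interchromosomal_mate(rname, rnext)]
print(kept)
```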

### Suffix output column names ###

It may be convenient to add a suffix to the column name in the output, for
instance '-EGFRvIII-reads'. This can be achieved by using the `-d` /
`--dataset-suffix` argument:

```
egfr-v3-determiner -r hg38 -d '-EGFRvIII-reads' sample-1.bam sample-2.bam > EGFRvIII.cnts.txt
```
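
The effect on the header can be pictured as follows. This sketch assumes that the suffix is appended to the count columns only and that the base column names are those of the example table; neither is confirmed by the help text. The restriction on tabs and newlines is taken from the `-d` help text. The suffix `-cohort-A` is a made-up example.

```python
def suffixed_header(suffix=""):
    """Build the header row with `suffix` appended to the count columns."""
    if "\t" in suffix or "\n" in suffix:
        raise ValueError("suffix must not contain tabs or newlines")
    count_columns = ("wt-reads", "vIII-reads")  # assumed base names
    return ["sample"] + [name + suffix for name in count_columns]


print("\t".join(suffixed_header("-cohort-A")))
```

Distinct suffixes per dataset keep the column names unique when several count tables are merged later on.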

## I NEED HELP / I FOUND A BUG / I WANT A FEATURE ##

Great, please get in touch :) I am more than happy to help.
If you find some odd reads that I need to take a look at, feel free
to send them to me.

If you prefer private communication you can e-mail me at:

y {dot} hoogstrate {at} erasmusmc {.} nl

## Duplicate reads ##

Reads marked as duplicate (using sambamba / Picard / samtools etc.) are by
default not included for counting. This is because they typically account
for more data points but not for more information. You can compare it with
scaling a bitmap figure; it contains more pixels but not more information.

If the data is single-end sequenced to an extreme depth, you may want to
include reads marked as duplicate, as they can be coincidentally marked as
duplicate while originating from distinct RNA molecules. In that case, one may
wonder why duplicate marking was performed in the first place.

Nevertheless, to enable counting of duplicate reads, use `-f` /
`--include-duplicates`.
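
Duplicate marking is stored in the SAM flag field as bit `0x400`. The sketch below shows the kind of filter this implies; it is an illustration with a hypothetical function name, not the tool's own code.

```python
DUPLICATE_FLAG = 0x400  # SAM flag bit: PCR or optical duplicate


def keep_read(flag, include_duplicates=False):
    """Decide whether a read counts, given its SAM flag."""
    if include_duplicates:
        return True
    return not (flag & DUPLICATE_FLAG)


print(keep_read(99))                              # not marked as duplicate
print(keep_read(99 | DUPLICATE_FLAG))             # marked as duplicate, default behaviour
print(keep_read(99 | DUPLICATE_FLAG, True))       # marked, but duplicates are included
```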

## Modify used exons ##

The EGFR exons used for counting can be changed using the following arguments:

```
  -w, --exons-wt TEXT             Exons included for counting (--spliced-
                                  reads-only disabled). Options: '2' '2,3'
                                  (equals 'default'), '2,3,4', '2,3,4,5',
                                  '2,3,4,5,6', '2,3,4,5,6,7' (equals 'all'),
                                  default: '2,3'.
  -v, --exons-viii TEXT           Exons included for counting (--spliced-
                                  reads-only disabled). Options: '8' '8,9'
                                  (equals 'default'), '8,9,10' (equals 'all'),
                                  default: '8,9'.
```

This only works when non-spliced (spanning) reads are counted, i.e. when
`--spliced-reads-only` is not set.
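
The accepted values follow a simple pattern: a consecutive run of exons starting at exon 2 (wt) or exon 8 (vIII), or one of the aliases 'default' and 'all'. A minimal sketch of a parser for such values is given below; the alias tables are copied from the help text, while `parse_exons` itself is a hypothetical helper and not the tool's implementation.

```python
WT_ALIASES = {"default": "2,3", "all": "2,3,4,5,6,7"}
VIII_ALIASES = {"default": "8,9", "all": "8,9,10"}


def parse_exons(value, aliases, first, last):
    """Turn e.g. '2,3,4' or 'all' into a list of exon numbers.

    Only consecutive runs starting at `first` and ending at or before
    `last` are accepted, mirroring the options listed in the help text.
    """
    value = aliases.get(value, value)
    exons = [int(item) for item in value.split(",")]
    if exons[-1] > last or exons != list(range(first, exons[-1] + 1)):
        raise ValueError(f"unsupported exon selection: {value!r}")
    return exons


print(parse_exons("default", WT_ALIASES, 2, 7))
print(parse_exons("8,9,10", VIII_ALIASES, 8, 10))
```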

## LICENSE ##

This is FREE software without liability or warranty, released under the GPL-3
software license.