https://github.com/nellore/omfgene

discordant alignment filter

Host: GitHub
URL: https://github.com/nellore/omfgene
Owner: nellore
License: mit
Created: 2016-12-20T17:30:57.000Z (about 9 years ago)
Default Branch: master
Last Pushed: 2017-12-11T23:32:47.000Z (about 8 years ago)
Last Synced: 2024-12-30T00:24:49.356Z (about 1 year ago)
Language: Jupyter Notebook
Homepage:
Size: 5.9 MB
Stars: 0
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# omfgene

`omfgene.awk` is an awk script that filters a paired-end RNA-seq SAM/BAM for mates that either a) align to different chromosomes or b) align to the same chromosome more than 500,000 bases apart. Such alignments may be evidence of gene fusions. Each output line of `omfgene.awk` has the format
```
mate 1 chrom mate 1 alignment start coordinate rounded to nearest 100 \
mate 2 chrom mate 2 alignment start coordinate rounded to nearest 100
```
We ran
```
samtools view | mawk -f omfgene.awk | sort | uniq -c \
| gzip >.discord.tsv.gz
```
across TCGA RNA-seq BAMs that were previously aligned to hg38 with STAR's 2-pass protocol on [Seven Bridges' Cancer Genomics Cloud](https://cgc.sbgenomics.com/) (CGC); `mawk` is [a fast implementation](http://invisible-island.net/mawk/) of awk. This was done by

1. creating the Docker image omfgene using `Dockerfile` in `cgcrun/`. We ran the following sequence of commands:

       cd /path/to/omfgene/cgcrun
       docker build -t biomawk .
       docker run biomawk
       docker login cgc-images.sbgenomics.com
       docker commit $(docker ps -a | head -n2 | tail -n1 | cut -d' ' -f1) cgc-images.sbgenomics.com/anellor1/omfgene:latest
       docker push cgc-images.sbgenomics.com/anellor1/omfgene:latest

   Note that the `docker login` command above required entering a username (ours was `anellor1`) and a password, which was our CGC auth token.
2. using the CGC tool editor to create the tool `omfgene`, which was set up to execute the command beginning with `samtools view` above on an `m1.small` Amazon EC2 instance. The base command we wrote in the editor was

       {return "samtools view " + $job.inputs.input_file.path + " | mawk -f /data/cgc_outputs/omfgene.awk \
       | sort | uniq -c | gzip"}

   The stdout value we entered was

       { filepath = $job.inputs.input_file.path; filename = filepath.split("/").pop();
       return filename + ".discord.tsv.gz"}

   This was also what we entered as the "Glob" of an output port.
3. using the CGC workflow editor to create the workflow `omfgene-wrapper`, which was set up to allow batching inputs to the `omfgene` tool by file. We set `sbg:AWSInstanceType` to `c4.8xlarge` and `sbg:maxNumberOfParallelInstances` to `10`.
4. running `omfgene-wrapper` on all 11,096 STAR 2-pass _hg38_ RNA-seq BAMs on CGC using the `omfgene_submit.ipynb` IPython notebook. This notebook was created by Raunaq Malhotra and Erik Lehnert at Seven Bridges.
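
To make the two expressions entered in the tool editor (step 2) easier to follow, here is a minimal Python sketch of what they compute for a given input path. The function names `base_command` and `output_name` and the example path are hypothetical and are not part of the repository; only the command text and the `.discord.tsv.gz` suffix come from the expressions above.

```python
# Path to the awk script inside the Docker image, as written in the base command.
AWK_SCRIPT = "/data/cgc_outputs/omfgene.awk"


def base_command(input_path):
    """Mirror the base-command expression: build the shell pipeline for one BAM."""
    return (
        "samtools view " + input_path
        + " | mawk -f " + AWK_SCRIPT
        + " | sort | uniq -c | gzip"
    )


def output_name(input_path):
    """Mirror the stdout/Glob expression: last path component plus a fixed suffix."""
    filename = input_path.split("/")[-1]  # same as filepath.split("/").pop()
    return filename + ".discord.tsv.gz"


# Hypothetical input path, for illustration only.
example = "/path/to/sample.bam"
print(base_command(example))
print(output_name(example))  # sample.bam.discord.tsv.gz
```

Because the stdout value and the output port's Glob are the same expression, the file the tool writes is the file the platform collects as output.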

We next downloaded the output from CGC and ran `discordex.py` to obtain `samples.tsv` and `discordex.v2.hg38.tsv.gz`, which is indexed in an experimental [Snaptron](http://snaptron.cs.jhu.edu/) instance.
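
For readers who prefer Python to awk, the filtering and counting described at the top of this README can be sketched as follows. This is not a port of `omfgene.awk`: it assumes standard SAM columns (FLAG, RNAME, POS, RNEXT, PNEXT), assumes each pair is reported once from the record flagged as first in pair, and assumes halves round up. The names `discordant_key` and `count_discordant` are hypothetical.

```python
from collections import Counter

MIN_DISTANCE = 500_000  # same-chromosome mates must be farther apart than this
BIN_SIZE = 100          # start coordinates are rounded to the nearest 100


def round_to_bin(pos, bin_size=BIN_SIZE):
    """Round a coordinate to the nearest multiple of bin_size (halves round up)."""
    return (pos + bin_size // 2) // bin_size * bin_size


def discordant_key(sam_line):
    """Return (chrom1, start1, chrom2, start2) for a discordant pair, else None."""
    fields = sam_line.rstrip("\n").split("\t")
    if sam_line.startswith("@") or len(fields) < 11:
        return None  # header line or malformed record
    flag = int(fields[1])
    rname, pos = fields[2], int(fields[3])
    rnext, pnext = fields[6], int(fields[7])
    # Assumption: keep paired reads with both mates mapped, and emit each pair
    # only from the first-in-pair record (0x40) so it is not counted twice.
    if not flag & 0x1 or flag & 0x4 or flag & 0x8 or not flag & 0x40:
        return None
    if rname == "*" or rnext == "*":
        return None
    mate_chrom = rname if rnext == "=" else rnext
    if mate_chrom == rname and abs(pos - pnext) <= MIN_DISTANCE:
        return None  # same chromosome and not far enough apart
    return (rname, round_to_bin(pos), mate_chrom, round_to_bin(pnext))


def count_discordant(sam_lines):
    """Tally identical keys, playing the role of `sort | uniq -c`."""
    return Counter(k for k in map(discordant_key, sam_lines) if k is not None)
```

Passing an iterable of SAM text lines (for example, the decoded output of `samtools view`) to `count_discordant` yields a mapping from each rounded pair of positions to the number of read pairs supporting it, which corresponds to the counts in the `.discord.tsv.gz` files.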
