{"id":18813205,"url":"https://github.com/nellore/omfgene","last_synced_at":"2026-01-12T08:30:17.368Z","repository":{"id":152814482,"uuid":"76977726","full_name":"nellore/omfgene","owner":"nellore","description":"discordant alignment filter","archived":false,"fork":false,"pushed_at":"2017-12-11T23:32:47.000Z","size":6187,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-12-30T00:24:49.356Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nellore.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-12-20T17:30:57.000Z","updated_at":"2017-12-10T17:11:13.000Z","dependencies_parsed_at":null,"dependency_job_id":"bb6d52a6-0dc9-461a-b2e8-55c4ea0ce5bf","html_url":"https://github.com/nellore/omfgene","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nellore%2Fomfgene","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nellore%2Fomfgene/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nellore%2Fomfgene/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nellore%2Fomfgene/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nellore","download_url":"https://codeload.github.com/nellore/omfgene/tar.gz/refs/heads/master","host":{"name":"GitHub","u
rl":"https://github.com","kind":"github","repositories_count":239748297,"owners_count":19690237,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-07T23:36:43.967Z","updated_at":"2025-02-19T23:18:02.826Z","avatar_url":"https://github.com/nellore.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# omfgene\n\n`omfgene.awk` is an awk script that filters a paired-end RNA-seq SAM/BAM for mates that either a) align to different chromosomes or b) align to the same chromosome more than 500,000 bases apart. Such alignments may be evidence of gene fusions. The output of `omfgene.awk` is in the format\n```\nmate 1 chrom \u003cTAB\u003e mate 1 alignment start coordinate rounded to nearest 100 \\\n    \u003cTAB\u003e mate 2 chrom \u003cTAB\u003e mate 2 alignment start coordinate rounded to nearest 100\n```\nWe ran\n```\nsamtools view \u003ca TCGA RNA-seq BAM\u003e | mawk -f omfgene.awk | sort | uniq -c \\\n    | gzip \u003e\u003ca TCGA RNA-seq BAM\u003e.discord.tsv.gz\n```\nacross TCGA RNA-seq BAMs that were previously aligned to hg38 with STAR's 2-pass protocol on [Seven Bridges' Cancer Genomics Cloud](https://cgc.sbgenomics.com/), where `mawk` is [a fast implementation](http://invisible-island.net/mawk/) of awk. This was done by\n\n1. creating the Docker image omfgene using `Dockerfile` in `cgcrun/`. 
We ran the following sequence of commands:\n\n        cd /path/to/omfgene/cgcrun\n        docker build -t biomawk .\n        docker run biomawk\n        docker login cgc-images.sbgenomics.com\n        docker commit $(docker ps -a | head -n2 | tail -n1 | cut -d' ' -f1) cgc-images.sbgenomics.com/anellor1/omfgene:latest\n        docker push cgc-images.sbgenomics.com/anellor1/omfgene:latest\n\n   Note that the `docker login` command above required entering a username (ours was `anellor1`) and a password, which was our CGC auth token.\n2. using the CGC tool editor to create the tool `omfgene`, which was set up to execute the command beginning with `samtools view` above on an `m1.small` Amazon EC2 instance. The base command we wrote in the editor was:\n\n        {return \"samtools view \" + $job.inputs.input_file.path + \" | mawk -f /data/cgc_outputs/omfgene.awk \\\n            | sort | uniq -c | gzip\"}\n\n   The stdout value we entered was:\n\n        {  filepath = $job.inputs.input_file.path;  filename = filepath.split(\"/\").pop();\n            return filename + \".discord.tsv.gz\"}\n\n   This was also what we entered as the \"Glob\" of an output port.\n3. using the CGC workflow editor to create the workflow `omfgene-wrapper`, which was set up to allow batching inputs to the `omfgene` tool by file. We set `sbg:AWSInstanceType` to `c4.8xlarge` and `sbg:maxNumberOfParallelInstances` to `10`.\n4. running `omfgene-wrapper` on all 11,096 STAR 2-pass _hg38_ RNA-seq BAMs on CGC using the `omfgene_submit.ipynb` IPython notebook. 
This notebook was created by Raunaq Malhotra and Erik Lehnert at Seven Bridges.\n\nWe next downloaded the output from CGC and ran `discordex.py` to obtain `samples.tsv` and `discordex.v2.hg38.tsv.gz`, which is indexed in an experimental [Snaptron](http://snaptron.cs.jhu.edu/) instance.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnellore%2Fomfgene","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnellore%2Fomfgene","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnellore%2Fomfgene/lists"}