{"id":18813256,"url":"https://github.com/nellore/gi2015","last_synced_at":"2026-01-31T17:32:01.300Z","repository":{"id":152814463,"uuid":"44988371","full_name":"nellore/gi2015","owner":"nellore","description":"Scripts and processed data for reproducing Genome Informatics 2015 talk","archived":false,"fork":false,"pushed_at":"2015-11-22T00:03:14.000Z","size":35344,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-06-09T13:52:12.906Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Mathematica","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nellore.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-10-26T18:18:46.000Z","updated_at":"2015-10-26T18:30:14.000Z","dependencies_parsed_at":"2023-04-13T16:41:06.206Z","dependency_job_id":null,"html_url":"https://github.com/nellore/gi2015","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/nellore/gi2015","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nellore%2Fgi2015","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nellore%2Fgi2015/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nellore%2Fgi2015/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nellore%2Fgi2015/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nellore","download_url":"https://codeload.github.com/nellore/gi2015/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositori
es/nellore%2Fgi2015/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28948460,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-31T14:26:55.697Z","status":"ssl_error","status_checked_at":"2026-01-31T14:26:52.545Z","response_time":128,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-07T23:36:52.996Z","updated_at":"2026-01-31T17:32:01.277Z","avatar_url":"https://github.com/nellore.png","language":"Mathematica","funding_links":[],"categories":[],"sub_categories":[],"readme":"# gi2015\n\nThis repo contains scripts and processed data for reproducing my Genome Informatics 2015 talk on junctions found across ~21,500 SRA samples.\n\nThe presentation itself is in `gi2015.key` and `gi2015.pdf`. The Python script `gi2015.py` generates all the data used in the talk (and much more), but it depends on a list of junctions that's currently unreleased. The junction list may nonetheless be reproduced by following the instructions at the end of this document. Results from running `gi2015.py` are contained in the following files whose formats are described below. 
See `gi2015.py`'s docstring for still more information.\n\n### gi2015.venn.txt\nIntersections between junctions obtained from the SEQC protocol and junctions\nobtained from Rail for the 1720 samples studied by both SEQC and Rail. See\nfile for details.\n\n### gi2015.sim.txt\nResults of running HISAT2 2.0.0-beta and STAR 2.4.2a on the two simulated\nhuman samples from the RGASP spliced alignment paper\n\"Systematic evaluation of spliced alignment programs for RNA-seq data\"\n(40 million read pairs each). Junction and overlap accuracy as defined\nin the Rail-RNA preprint are given for two protocols: in one, the junctions\nthat are actually in the sample are provided as annotation, and in the other,\nthe union of the three gene annotations considered here\n(Gencode v19, Ensembl v75, and refGene) is provided.\n\n### gi2015.sex.txt\n#### Tab-separated fields, in order of descending field 12\n\n1. project accession number\n2. sample accession number\n3. experiment accession number\n4. run accession number\n5. total Y chromosome junction overlaps\n6. total junction overlaps in XIST (chrX:73040486-73072588)\n7. total overlaps\n8. total junctions on chrY\n9. total junctions in XIST (chrX:73040486-73072588)\n10. total junctions in sample\n11. 1 if sample is annotated as male on SRA; 0 if sample is annotated as female\n12. field 5 / field 7\n\n### gi2015.[type].tsv, where [type] is in {bottom_10_pct, top_10_pct, all_words}\nGives common words among samples in bottom and top 10 percent in terms of\nproportion of junctions that are annotated -- as well as for all samples.\nOnly samples with \u003e= 10k junctions are considered.\n#### Tab-separated fields, in order of descending field 2\n\n1. word\n2. number of samples (in which \u003e=10k junctions were found) in which word\n    appears\n\n### gi2015.sample_count_submission_date_overlap_geq_20.tsv\n#### Tab-separated fields\n\n1. count of samples in which a given junction was found\n2. 
count of projects in which a given junction was found\n3. earliest known discovery date (in units of days after February 27, 2009)\n    -- this is the earliest known submission date of a sample associated with a\n    junction\n\nAbove, each junction is covered by at least 20 reads per sample.\n\n### gi2015.[type].stats.tsv, where [type] is in {project, sample}\n#### Tab-separated fields\n\n1. [type] count\n2. Number of junctions found in \u003e= field 1 [type]s\n3. Number of annotated junctions found in \u003e= field 1 [type]s\n4. Number of exonskips found in \u003e= field 1 [type]s (exon skip: both 5' and 3'\n    splice sites are annotated, but not in the same exon-exon junction)\n5. Number of altstartends found in \u003e= field 1 [type]s (altstartend: either 5'\n    or 3' splice site is annotated, but not both)\n6. Number of novel junctions found in \u003e= field 1 [type]s (novel: both 5' and \n    3' splice sites are unannotated)\n7. Number of GT-AG junctions found in \u003e= field 1 [type]s\n8. Number of annotated GT-AG junctions found in \u003e= field 1 [type]s\n9. Number of GC-AG junctions found in \u003e= field 1 [type]s\n10. Number of annotated GC-AG junctions found in \u003e= field 1 [type]s\n11. Number of AT-AC junctions found in \u003e= field 1 [type]s\n12. Number of annotated AT-AC junctions found in \u003e= field 1 [type]s\n\n### gi2015.stats_by_sample.tsv\n#### Tab-separated fields\n\n1. sample index\n2. project accession number\n3. sample accession number\n4. experiment accession number\n5. run accession number\n6. junction count\n7. annotated junction count\n8. count of junctions overlapped by at least 5 reads\n9. count of annotated junctions overlapped by at least 5 reads\n10. total overlap instances\n11. total annotated overlap instances\n\nMathematica 10 was used to make all plots. See the notebook `gi2015.nb`.\n\n## Recovering the junction list `all_SRA_introns.tsv.gz` used by `gi2015.py`\n\n1. 
An account with Amazon Web Services is required to recover our results. Get one [here](http://aws.amazon.com/).\n2. [Download](https://github.com/nellore/rail/raw/master/releases/install_rail-rna-0.1.7a) Rail-RNA v0.1.7a and follow the instructions at http://docs.rail.bio/installation/ to install it. Make sure to install and set up the Amazon Web Services (AWS) CLI as described there.\n3. Familiarize yourself with how Rail-RNA works by reviewing the [tutorial](http://docs.rail.bio/tutorial/).\n4. Download and install [PyPy 2.4](http://doc.pypy.org/en/latest/release-2.4.0.html).\n5. Clone this repo, `gi2015`.\n6. At the command line, enter\n\n        cd /path/to/gi2015/sra_runs\n; that is, change to the `sra_runs` subdirectory of your clone.\n7. Run\n\n        python create_runs.py --s3-bucket s3://[bucket] --region [AWS region] --c3-2xlarge-bid-price [lower price] --c3-8x-large-bid-price [higher price]\n, where `[bucket]` is some S3 bucket you own where results will be dumped, `[AWS region]` is an AWS region (e.g., \"us-east-1\"), and `[lower price]`/`[higher price]` is an appropriate bid price for a c3.2xlarge/c3.8xlarge instance, respectively.\n8. Several scripts will be (over)written; they will be named\n\n        sra_batch_X_sample_size_K_prep.sh\n        sra_batch_X_sample_size_K_itn.sh\n, for `X` an integer between `0` and `42` inclusive and some `K`. Each script corresponds to a different job flow to be run on [Amazon Elastic MapReduce](https://aws.amazon.com/elasticmapreduce/). The `prep` script downloads FASTQs from the [EMBL-EBI](https://www.ebi.ac.uk/) server and dumps preprocessed versions of them on S3. The `itn` script aligns the data to find junctions. For each `X`, execute `sh sra_batch_X_sample_size_K_prep.sh`, wait until the job flow is done, and then execute `sh sra_batch_X_sample_size_K_itn.sh`. The scripts that are overwritten are the scripts we ultimately ran. 
We fiddled with bid prices in individual scripts because the [spot market](https://aws.amazon.com/ec2/spot/) was unpredictable.\n9. Download the hg19 Bowtie index [here](ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/hg19.ebwt.zip) and unpack it.\n10. Run\n\n        sh retrieve_and_combine_results.sh [output] [bowtie1 idx] [bucket]\n, where `[output]` is some output directory on your local filesystem (20 GB required), `[bowtie1 idx]` is the basename of the Bowtie index you just downloaded, and `[bucket]` is the S3 bucket you specified in step 7. The file `all_SRA_introns.tsv.gz`, which is used by `gi2015.py`, will be written to `[output]`.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnellore%2Fgi2015","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnellore%2Fgi2015","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnellore%2Fgi2015/lists"}