{"id":19154489,"url":"https://github.com/bcgsc/rsempipeline","last_synced_at":"2025-02-22T21:23:33.859Z","repository":{"id":20687437,"uuid":"23970643","full_name":"bcgsc/rsempipeline","owner":"bcgsc","description":"A pipeline for running rsem analysis on thousands of samples","archived":false,"fork":false,"pushed_at":"2019-02-22T17:54:50.000Z","size":339,"stargazers_count":1,"open_issues_count":0,"forks_count":2,"subscribers_count":15,"default_branch":"master","last_synced_at":"2025-01-13T00:47:12.938Z","etag":null,"topics":["geo","ncbi","python","rsem","sra","transcript-quantification"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bcgsc.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-09-12T17:48:09.000Z","updated_at":"2019-07-02T18:17:26.000Z","dependencies_parsed_at":"2022-08-05T09:15:31.102Z","dependency_job_id":null,"html_url":"https://github.com/bcgsc/rsempipeline","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bcgsc%2Frsempipeline","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bcgsc%2Frsempipeline/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bcgsc%2Frsempipeline/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bcgsc%2Frsempipeline/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bcgsc","download_url":"https://codeload.github.com/bcgsc/rsempipeline/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"htt
ps://github.com","kind":"github","repositories_count":240238250,"owners_count":19769898,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["geo","ncbi","python","rsem","sra","transcript-quantification"],"created_at":"2024-11-09T08:27:01.696Z","updated_at":"2025-02-22T21:23:33.834Z","avatar_url":"https://github.com/bcgsc.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"|build| |cov|\n\nrsempipeline\n========================\n\nrsempipeline is a pipeline for analyzing `GEO\n\u003chttp://www.ncbi.nlm.nih.gov/geo/\u003e`_ data using `RSEM\n\u003chttp://deweylab.biostat.wisc.edu/rsem/\u003e`_. The typical analysis process is as\nfollows.\n\nThe input to the pipeline comes mainly from two sources:\n\n- soft files for all Series (aka GSE)\n- a GSE_species_GSM.csv file that contains a list of all samples of interest\n  (aka GSM) to be processed\n\nThere are three steps included in this pipeline:\n\n1. Download the sra files for all GSMs from the `GEO\n   \u003chttp://www.ncbi.nlm.nih.gov/geo/\u003e`_ website using ascp from `Aspera\n   \u003chttp://downloads.asperasoft.com/\u003e`_ or `wget\n   \u003chttp://www.gnu.org/software/wget/\u003e`_ (in case ascp fails). ascp and\n   wget use different URLs that point to copies of the same file.\n\n2. Convert the sra files to fastq.gz files using fastq-dump from `SRA Toolkit\n   \u003chttp://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software\u003e`_.\n\n3. 
Run rsem-calculate-expression from the `RSEM\n   \u003chttp://deweylab.biostat.wisc.edu/rsem/\u003e`_ package on all fastq.gz files\n   for all GSMs.\n\nThe pipeline is designed to run the first two steps (computationally cheap) on\nthe localhost. Step 3 (computationally expensive) is run on an HPC cluster\n(e.g. genesis, westgrid cluster).\n\nTypically, about 100 GSEs and a few thousand GSMs are picked by our\ncollaborators and grouped into a batch. Steps 1 and 2 are done in a\nsub-batch-by-sub-batch fashion where all GSMs of a sub-batch are processed in\nparallel until finished. Each sub-batch of GSMs is selected based on their\nfile sizes (estimated from sizes of the sra files and their resultant fastq.gz\nfiles) and how much disk space is available on the localhost, as specified in a\nconfiguration file (``rp_config.yml``). At the end of the second step, a\nsubmission script will be generated for each GSM, and at Step 3 a new job will\nbe submitted to the cluster for processing the GSM using RSEM. A control\nmechanism has also been implemented to avoid overuse of the cluster resources\nsuch as compute nodes and disk space. The first two steps are run by the\ncommand ``rp-run``, while the generation of the submission script and job\nsubmission are handled by the command ``rp-transfer``.\n\n..\n   It will create all folders for all GSMs according to a designated structure,\n   i.e. ``\u003cGSE\u003e/\u003cSpecies\u003e/\u003cGSM\u003e``, and then fetch information of the sra files for\n   each GSM from `NCBI FTP server \u003cftp://ftp-trace.ncbi.nlm.nih.gov/\u003e`_ \"NCBI FTP\n   server\"), and then save it to a file named `sras_info.yaml` in each GSM\n   directory. The fetching process will take a while depending on how many GSMs to\n   be processed.\n\n..\n   3. 
It will filter the samples generated from Step 1 and generate a sublist of\n   samples that will be processed right away based on the sizes of sra files and\n   estimated fastq.gz files (~1.5x) as well as the sizes available to use as\n   specified in the ``rp_config.yml`` (mainly ``LOCAL_MAX_USAGE``,\n   ``LOCAL_MIN_FREE``). Processed files will be saved to a file named\n   ``sra2fastqed_GSMs.txt``.\n\n..\n\nFor installation and usage instructions, please refer to ``INSTALL.rst`` and\n``USAGE.rst``.\n\nIf you find any bugs or have questions or comments, please contact Zhuyi Xue\n(zxue@bcgsc.ca).\n\n\n\n.. |build| image:: https://travis-ci.org/bcgsc/rsempipeline.svg?branch=master\n    :alt: Build Status\n    :target: https://travis-ci.org/bcgsc/rsempipeline\n    \n.. |cov| image:: https://coveralls.io/repos/bcgsc/rsempipeline/badge.svg?branch=master\u0026service=github\n    :alt: Coverage Status\n    :target: https://coveralls.io/github/bcgsc/rsempipeline?branch=master\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbcgsc%2Frsempipeline","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbcgsc%2Frsempipeline","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbcgsc%2Frsempipeline/lists"}