{"id":17278225,"url":"https://github.com/saketkc/pysradb","last_synced_at":"2025-05-14T22:09:44.018Z","repository":{"id":33612802,"uuid":"159590788","full_name":"saketkc/pysradb","owner":"saketkc","description":"Package for fetching metadata and downloading data from SRA/ENA/GEO","archived":false,"fork":false,"pushed_at":"2025-02-17T12:10:07.000Z","size":7437,"stargazers_count":327,"open_issues_count":33,"forks_count":64,"subscribers_count":11,"default_branch":"develop","last_synced_at":"2025-05-14T22:09:38.028Z","etag":null,"topics":["bioinformatics","bioinformatics-pipeline","ena","ncbi-sra","ncbi-sra-archive","sra","sratoolkit"],"latest_commit_sha":null,"homepage":"https://saketkc.github.io/pysradb","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/saketkc.png","metadata":{"files":{"readme":"README.md","changelog":"HISTORY.md","contributing":"CONTRIBUTING.md","funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":"AUTHORS.md","dei":null,"publiccode":null,"codemeta":null,"zenodo":null},"funding":{"github":["saketkc"]}},"created_at":"2018-11-29T01:44:29.000Z","updated_at":"2025-05-11T16:32:43.000Z","dependencies_parsed_at":"2024-10-15T09:11:16.356Z","dependency_job_id":"361a36cc-c601-4823-88fc-672f4dd1009c","html_url":"https://github.com/saketkc/pysradb","commit_stats":{"total_commits":643,"total_committers":14,"mean_commits":45.92857142857143,"dds":0.09797822706065318,"last_synced_commit":"85abf88c636b79c2b22b380cb15829de257ed693"},"previous_names":[],"tags_count":40,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/saketkc%2Fpysradb","tags_url":
"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/saketkc%2Fpysradb/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/saketkc%2Fpysradb/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/saketkc%2Fpysradb/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/saketkc","download_url":"https://codeload.github.com/saketkc/pysradb/tar.gz/refs/heads/develop","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254235701,"owners_count":22036964,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","bioinformatics-pipeline","ena","ncbi-sra","ncbi-sra-archive","sra","sratoolkit"],"created_at":"2024-10-15T09:11:07.893Z","updated_at":"2025-05-14T22:09:39.006Z","avatar_url":"https://github.com/saketkc.png","language":"Python","funding_links":["https://github.com/sponsors/saketkc"],"categories":[],"sub_categories":[],"readme":"# A Python package for retrieving metadata from 
SRA/ENA/GEO\n\n[![image](https://img.shields.io/pypi/v/pysradb.svg?style=flat-square)](https://pypi.python.org/pypi/pysradb)\n[![image](https://anaconda.org/bioconda/pysradb/badges/version.svg)](https://anaconda.org/bioconda/pysradb/badges/version.svg)\n[![image](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat-square)](http://bioconda.github.io/recipes/pysradb/README.html)\n[![image](https://static.pepy.tech/personalized-badge/pysradb?period=month\u0026units=international_system\u0026left_color=black\u0026right_color=brightgreen\u0026left_text=Downloads/month)](https://pepy.tech/project/pysradb)\n[![image](https://anaconda.org/bioconda/pysradb/badges/downloads.svg)](https://anaconda.org/bioconda/pysradb)\n[![image](https://zenodo.org/badge/159590788.svg)](https://zenodo.org/badge/latestdoi/159590788)\n[![image](https://github.com/saketkc/pysradb/workflows/push/badge.svg)](https://github.com/saketkc/pysradb/actions)\n\n## Documentation\n\n\u003chttps://saketkc.github.io/pysradb\u003e\n\n## CLI Usage\n\n`pysradb` supports command line usage. 
See\n[CLI](https://saket-choudhary.me/pysradb/cmdline.html) instructions or\n[quickstart\nguide](https://www.saket-choudhary.me/pysradb/quickstart.html).\n\n    $ pysradb\n     usage: pysradb [-h] [--version] [--citation]\n                    {metadata,download,search,gse-to-gsm,gse-to-srp,gsm-to-gse,gsm-to-srp,gsm-to-srr,gsm-to-srs,gsm-to-srx,srp-to-gse,srp-to-srr,srp-to-srs,srp-to-srx,srr-to-gsm,srr-to-srp,srr-to-srs,srr-to-srx,srs-to-gsm,srs-to-srx,srx-to-srp,srx-to-srr,srx-to-srs}\n                    ...\n\n     pysradb: Query NGS metadata and data from NCBI Sequence Read Archive.\n     version: 2.0.1\n     Citation: 10.12688/f1000research.18676.1\n\n     optional arguments:\n       -h, --help            show this help message and exit\n       --version             show program's version number and exit\n       --citation            how to cite\n\n     subcommands:\n       {metadata,download,search,gse-to-gsm,gse-to-srp,gsm-to-gse,gsm-to-srp,gsm-to-srr,gsm-to-srs,gsm-to-srx,srp-to-gse,srp-to-srr,srp-to-srs,srp-to-srx,srr-to-gsm,srr-to-srp,srr-to-srs,srr-to-srx,srs-to-gsm,srs-to-srx,srx-to-srp,srx-to-srr,srx-to-srs}\n         metadata            Fetch metadata for SRA project (SRPnnnn)\n         download            Download SRA project (SRPnnnn)\n         search              Search SRA for matching text\n         gse-to-gsm          Get GSM for a GSE\n         gse-to-srp          Get SRP for a GSE\n         gsm-to-gse          Get GSE for a GSM\n         gsm-to-srp          Get SRP for a GSM\n         gsm-to-srr          Get SRR for a GSM\n         gsm-to-srs          Get SRS for a GSM\n         gsm-to-srx          Get SRX for a GSM\n         srp-to-gse          Get GSE for a SRP\n         srp-to-srr          Get SRR for a SRP\n         srp-to-srs          Get SRS for a SRP\n         srp-to-srx          Get SRX for a SRP\n         srr-to-gsm          Get GSM for a SRR\n         srr-to-srp          Get SRP for a SRR\n         srr-to-srs          Get SRS for a 
SRR\n         srr-to-srx          Get SRX for a SRR\n         srs-to-gsm          Get GSM for a SRS\n         srs-to-srx          Get SRX for a SRS\n         srx-to-srp          Get SRP for a SRX\n         srx-to-srr          Get SRR for a SRX\n         srx-to-srs          Get SRS for a SRX\n\n## Quickstart\n\nA Google Colaboratory version of the most used commands is available in\nthis [Colab\nNotebook](https://colab.research.google.com/drive/1C60V-jkcNZiaCra_V5iEyFs318jgVoUR).\nNote that this requires only an active internet connection (no\nadditional downloads are made).\n\nThe following notebooks document the features of\n`pysradb`:\n\n1.  [Python\n    API](https://colab.research.google.com/github/saketkc/pysradb/blob/master/notebooks/01.Python-API_demo.ipynb)\n2.  [Downloading datasets from SRA - command\n    line](https://colab.research.google.com/github/saketkc/pysradb/blob/master/notebooks/02.Commandline_download.ipynb)\n3.  [Downloading multiple datasets in parallel - Python\n    API](https://colab.research.google.com/github/saketkc/pysradb/blob/master/notebooks/03.ParallelDownload.ipynb)\n4.  [Converting SRA-to-fastq - command line (requires\n    conda)](https://colab.research.google.com/github/saketkc/pysradb/blob/master/notebooks/04.SRA_to_fastq_conda.ipynb)\n5.  [Downloading subsets of a project - Python\n    API](https://colab.research.google.com/github/saketkc/pysradb/blob/master/notebooks/05.Downloading_subsets_of_a_project.ipynb)\n6.  [Download\n    BAMs](https://colab.research.google.com/github/saketkc/pysradb/blob/master/notebooks/06.Download_BAMs.ipynb)\n7.  [Metadata for multiple\n    SRPs](https://colab.research.google.com/github/saketkc/pysradb/blob/master/notebooks/07.Multiple_SRPs.ipynb)\n8.  [Multithreaded fastq downloads using Aspera\n    Client](https://colab.research.google.com/github/saketkc/pysradb/blob/master/notebooks/08.pysradb_ascp_multithreaded.ipynb)\n9.  
[Searching\n    SRA/GEO/ENA](https://colab.research.google.com/github/saketkc/pysradb/blob/master/notebooks/09.Query_Search.ipynb)\n\n## Installation\n\nTo install the stable version using `pip`:\n\n```bash\npip install pysradb\n```\n\nAlternatively, if you use conda:\n\n```bash\nconda install -c bioconda pysradb\n```\n\nThis step will install all the dependencies. If you have an existing\nenvironment with a lot of pre-installed packages, conda might be\n[slow](https://github.com/bioconda/bioconda-recipes/issues/13774).\nPlease consider creating a new environment for `pysradb`:\n\n```bash\nconda create -c bioconda -n pysradb python=3.10 pysradb\n```\n\n### Dependencies\n\n    pandas\n    requests\n    tqdm\n    xmltodict\n\n### Installing pysradb in development mode\n\n    git clone https://github.com/saketkc/pysradb.git\n    cd pysradb \u0026\u0026 pip install -r requirements.txt\n    pip install -e .\n\n## Using pysradb\n\n### Obtaining SRA metadata\n\n    $ pysradb metadata SRP000941 | head\n\n    study_accession experiment_accession experiment_title                                                                                                                 experiment_desc                                                                                                                  organism_taxid  organism_name library_strategy library_source  library_selection sample_accession sample_title instrument                    total_spots total_size    run_accession run_total_spots run_total_bases\n    SRP000941       SRX056722                                                                         Reference Epigenome: ChIP-Seq Analysis of H3K27ac in hESC H1 Cells                                                               Reference Epigenome: ChIP-Seq Analysis of H3K27ac in hESC H1 Cells  9606            Homo sapiens       ChIP-Seq           GENOMIC    ChIP            SRS184466                              Illumina HiSeq 2000    26900401     531654480   
SRR179707     26900401         807012030\n    SRP000941       SRX027889                                                                            Reference Epigenome: ChIP-Seq Analysis of H2AK5ac in hESC Cells                                                                  Reference Epigenome: ChIP-Seq Analysis of H2AK5ac in hESC Cells  9606            Homo sapiens       ChIP-Seq           GENOMIC    ChIP            SRS116481                      Illumina Genome Analyzer II    37528590     779578968   SRR067978     37528590        1351029240\n    SRP000941       SRX027888                                                                                     Reference Epigenome: ChIP-Seq Input from hESC H1 Cells                                                                           Reference Epigenome: ChIP-Seq Input from hESC H1 Cells  9606            Homo sapiens       ChIP-Seq           GENOMIC  RANDOM            SRS116483                      Illumina Genome Analyzer II    13603127    3232309537   SRR067977     13603127         489712572\n    SRP000941       SRX027887                                                                                     Reference Epigenome: ChIP-Seq Input from hESC H1 Cells                                                                           Reference Epigenome: ChIP-Seq Input from hESC H1 Cells  9606            Homo sapiens       ChIP-Seq           GENOMIC  RANDOM            SRS116562                      Illumina Genome Analyzer II    22430523     506327844   SRR067976     22430523         807498828\n    SRP000941       SRX027886                                                                                     Reference Epigenome: ChIP-Seq Input from hESC H1 Cells                                                                           Reference Epigenome: ChIP-Seq Input from hESC H1 Cells  9606            Homo sapiens       ChIP-Seq           GENOMIC  RANDOM            SRS116560                      Illumina Genome 
Analyzer II    15342951     301720436   SRR067975     15342951         552346236\n    SRP000941       SRX027885                                                                                     Reference Epigenome: ChIP-Seq Input from hESC H1 Cells                                                                           Reference Epigenome: ChIP-Seq Input from hESC H1 Cells  9606            Homo sapiens       ChIP-Seq           GENOMIC  RANDOM            SRS116482                      Illumina Genome Analyzer II    39725232     851429082   SRR067974     39725232        1430108352\n    SRP000941       SRX027884                                                                                     Reference Epigenome: ChIP-Seq Input from hESC H1 Cells                                                                           Reference Epigenome: ChIP-Seq Input from hESC H1 Cells  9606            Homo sapiens       ChIP-Seq           GENOMIC  RANDOM            SRS116481                      Illumina Genome Analyzer II    32633277     544478483   SRR067973     32633277        1174797972\n    SRP000941       SRX027883                                                                                     Reference Epigenome: ChIP-Seq Input from hESC H1 Cells                                                                           Reference Epigenome: ChIP-Seq Input from hESC H1 Cells  9606            Homo sapiens       ChIP-Seq           GENOMIC  RANDOM            SRS004118                      Illumina Genome Analyzer II    22150965    3262293717   SRR067972      9357767         336879612\n    SRP000941       SRX027883                                                                                     Reference Epigenome: ChIP-Seq Input from hESC H1 Cells                                                                           Reference Epigenome: ChIP-Seq Input from hESC H1 Cells  9606            Homo sapiens       ChIP-Seq           GENOMIC  RANDOM            SRS004118  
                    Illumina Genome Analyzer II    22150965    3262293717   SRR067971     12793198         460555128\n\n### Obtaining detailed SRA metadata\n\n    $ pysradb metadata SRP075720 --detailed | head\n\n    study_accession experiment_accession experiment_title                                  experiment_desc                                   organism_taxid  organism_name library_strategy library_source  library_selection sample_accession sample_title instrument           total_spots total_size run_accession run_total_spots run_total_bases\n    SRP075720       SRX1800476            GSM2177569: Kcng4_2la_H9; Mus musculus; RNA-Seq   GSM2177569: Kcng4_2la_H9; Mus musculus; RNA-Seq  10090           Mus musculus  RNA-Seq          TRANSCRIPTOMIC  cDNA              SRS1467643                    Illumina HiSeq 2500  2547148      97658407  SRR3587912    2547148         127357400\n    SRP075720       SRX1800475            GSM2177568: Kcng4_2la_H8; Mus musculus; RNA-Seq   GSM2177568: Kcng4_2la_H8; Mus musculus; RNA-Seq  10090           Mus musculus  RNA-Seq          TRANSCRIPTOMIC  cDNA              SRS1467642                    Illumina HiSeq 2500  2676053     101904264  SRR3587911    2676053         133802650\n    SRP075720       SRX1800474            GSM2177567: Kcng4_2la_H7; Mus musculus; RNA-Seq   GSM2177567: Kcng4_2la_H7; Mus musculus; RNA-Seq  10090           Mus musculus  RNA-Seq          TRANSCRIPTOMIC  cDNA              SRS1467641                    Illumina HiSeq 2500  1603567      61729014  SRR3587910    1603567          80178350\n    SRP075720       SRX1800473            GSM2177566: Kcng4_2la_H6; Mus musculus; RNA-Seq   GSM2177566: Kcng4_2la_H6; Mus musculus; RNA-Seq  10090           Mus musculus  RNA-Seq          TRANSCRIPTOMIC  cDNA              SRS1467640                    Illumina HiSeq 2500  2498920      94977329  SRR3587909    2498920         124946000\n    SRP075720       SRX1800472            GSM2177565: Kcng4_2la_H5; Mus musculus; RNA-Seq   
GSM2177565: Kcng4_2la_H5; Mus musculus; RNA-Seq  10090           Mus musculus  RNA-Seq          TRANSCRIPTOMIC  cDNA              SRS1467639                    Illumina HiSeq 2500  2226670      83473957  SRR3587908    2226670         111333500\n    SRP075720       SRX1800471            GSM2177564: Kcng4_2la_H4; Mus musculus; RNA-Seq   GSM2177564: Kcng4_2la_H4; Mus musculus; RNA-Seq  10090           Mus musculus  RNA-Seq          TRANSCRIPTOMIC  cDNA              SRS1467638                    Illumina HiSeq 2500  2269546      87486278  SRR3587907    2269546         113477300\n    SRP075720       SRX1800470            GSM2177563: Kcng4_2la_H3; Mus musculus; RNA-Seq   GSM2177563: Kcng4_2la_H3; Mus musculus; RNA-Seq  10090           Mus musculus  RNA-Seq          TRANSCRIPTOMIC  cDNA              SRS1467636                    Illumina HiSeq 2500  2333284      88669838  SRR3587906    2333284         116664200\n    SRP075720       SRX1800469            GSM2177562: Kcng4_2la_H2; Mus musculus; RNA-Seq   GSM2177562: Kcng4_2la_H2; Mus musculus; RNA-Seq  10090           Mus musculus  RNA-Seq          TRANSCRIPTOMIC  cDNA              SRS1467637                    Illumina HiSeq 2500  2071159      79689296  SRR3587905    2071159         103557950\n    SRP075720       SRX1800468            GSM2177561: Kcng4_2la_H1; Mus musculus; RNA-Seq   GSM2177561: Kcng4_2la_H1; Mus musculus; RNA-Seq  10090           Mus musculus  RNA-Seq          TRANSCRIPTOMIC  cDNA              SRS1467635                    Illumina HiSeq 2500  2321657      89307894  SRR3587904    2321657         116082850\n\n### Converting SRP to GSE\n\n    $ pysradb srp-to-gse SRP075720\n\n    study_accession study_alias\n    SRP075720       GSE81903\n\n### Converting GSM to SRP\n\n    $ pysradb gsm-to-srp GSM2177186\n\n    experiment_alias study_accession\n    GSM2177186       SRP075720\n\n### Converting GSM to GSE\n\n    $ pysradb gsm-to-gse GSM2177186\n\n    experiment_alias study_alias\n    GSM2177186       
GSE81903\n\n### Converting GSM to SRX\n\n    $ pysradb gsm-to-srx GSM2177186\n\n    experiment_alias experiment_accession\n    GSM2177186       SRX1800089\n\n### Converting GSM to SRR\n\n    $ pysradb gsm-to-srr GSM2177186\n\n    experiment_alias run_accession\n    GSM2177186       SRR3587529\n\n### Downloading supplementary files from GEO\n\n    $ pysradb download -g GSE161707\n\n### Downloading an entire SRA/ENA project (multithreaded)\n\n`pysradb` makes it easy to download datasets from SRA in parallel.\nTo download a project using 8 threads:\n\n    $ pysradb download -y -t 8 --out-dir ./pysradb_downloads -p SRP063852\n\nDownloads are organized by `SRP/SRX/SRR`, mimicking the hierarchy of SRA\nprojects.\n\n### Downloading only certain samples of interest\n\n    $ pysradb metadata SRP000941 --detailed | grep 'study\\|RNA-Seq' | pysradb download\n\nThis will download all `RNA-Seq` samples from this project.\n\n### Ultrafast fastq downloads\n\nWith\n[aspera-client](https://downloads.asperasoft.com/en/downloads/8?list)\ninstalled, `pysradb` can perform ultrafast downloads.\n\nTo download all original fastqs with `aspera-client`\ninstalled, using 8 threads:\n\n    $ pysradb download -t 8 --use_ascp -p SRP002605\n\nRefer to the notebook for [(shallow) time\nbenchmarks](https://colab.research.google.com/github/saketkc/pysradb/blob/master/notebooks/08.pysradb_ascp_multithreaded.ipynb).\n\n## Publication\n\n\u003e [pysradb: A Python package to query next-generation sequencing\n\u003e metadata and data from NCBI Sequence Read\n\u003e Archive](https://f1000research.com/articles/8-532/v1)\n\u003e\n\u003e Presentation slides from BOSC (ISMB-ECCB) 2019:\n\u003e \u003chttps://f1000research.com/slides/8-1183\u003e\n\n## Citation\n\nChoudhary, Saket. \"pysradb: A Python Package to Query next-Generation\nSequencing Metadata and Data from NCBI Sequence Read Archive.\"\nF1000Research, vol. 8, F1000 (Faculty of 1000 Ltd), Apr. 2019, p. 
532\n(\u003chttps://f1000research.com/articles/8-532/v1\u003e)\n\n    @article{Choudhary2019,\n    doi = {10.12688/f1000research.18676.1},\n    url = {https://doi.org/10.12688/f1000research.18676.1},\n    year = {2019},\n    month = apr,\n    publisher = {F1000 (Faculty of 1000 Ltd)},\n    volume = {8},\n    pages = {532},\n    author = {Saket Choudhary},\n    title = {pysradb: A {P}ython package to query next-generation sequencing metadata and data from {NCBI} {S}equence {R}ead {A}rchive},\n    journal = {F1000Research}\n    }\n\nZenodo archive: \u003chttps://zenodo.org/badge/latestdoi/159590788\u003e\n\nZenodo DOI: 10.5281/zenodo.2306881\n\n## Questions?\n\nOpen an [issue](https://github.com/saketkc/pysradb/issues) or join our\n[Slack\nChannel](https://join.slack.com/t/pysradb/shared_invite/zt-f01jndpy-KflPu3Be5Aq3FzRh5wj1Ug).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsaketkc%2Fpysradb","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsaketkc%2Fpysradb","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsaketkc%2Fpysradb/lists"}