https://github.com/louiejtaylor/grabseqs
A utility for easy downloading of reads from next-gen sequencing repositories like NCBI SRA
- Host: GitHub
- URL: https://github.com/louiejtaylor/grabseqs
- Owner: louiejtaylor
- License: mit
- Created: 2018-10-23T12:44:49.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2025-08-18T19:07:09.000Z (11 months ago)
- Last Synced: 2025-09-28T03:04:49.302Z (9 months ago)
- Topics: bioinformatics, conda, metagenomics, ncbi-sra, ngs, python, sra
- Language: Python
- Homepage:
- Size: 261 KB
- Stars: 107
- Watchers: 6
- Forks: 14
- Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE
# grabseqs
Utility for simplifying bulk downloads of data from next-generation sequencing repositories, such as [NCBI SRA](https://www.ncbi.nlm.nih.gov/sra/) and [MG-RAST](http://www.mg-rast.org/).
**Note:** [iMicrobe](https://www.imicrobe.us/) is currently not supported; we are working to remedy this (as of 2025/08/14).
## Install
Install grabseqs and all dependencies [via conda](https://conda.io/projects/conda/en/latest/user-guide/getting-started.html):

    conda install grabseqs -c louiejtaylor -c bioconda -c conda-forge
Or with pip (and install the non-Python [dependencies](https://github.com/louiejtaylor/grabseqs#dependencies) yourself):

    pip install grabseqs
**Note:** If you're using SRA data, after you've installed sra-tools, run `vdb-config -i` and turn off local file caching unless you want extra copies of the downloaded sequences taking up space ([read more here](https://github.com/ncbi/sra-tools/wiki/Toolkit-Configuration)).
## Quick start
Download all samples from a single SRA Project:

    grabseqs sra SRP#######
Or any combination of projects (S/ERP), runs (S/ERR), and BioProjects (PRJNA):

    grabseqs sra SRR######## ERP####### PRJNA######## ERR########
If you'd like to do a dry run and just get a list of samples that will be downloaded, pass `-l`:

    grabseqs sra -l SRP########
Similar syntax works for MG-RAST:

    grabseqs mgrast mgp##### mgm#######
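If you keep a mixed list of accessions, you may want to sort them by repository before calling grabseqs. The sketch below is a hypothetical helper (not part of grabseqs) that groups IDs using only the prefixes shown in the examples above; the accession numbers are made-up placeholders, and other valid prefixes may exist that this pattern does not cover.

```python
import re

# Prefixes taken from the examples in this README. The helper names and the
# exact patterns are assumptions for illustration, not grabseqs internals.
PATTERNS = {
    "sra": re.compile(r"^(SRP|ERP|SRR|ERR|PRJNA)\d+$"),
    "mgrast": re.compile(r"^(mgp|mgm)\d+(\.\d+)?$"),
}


def group_by_repo(accessions):
    """Return {subcommand: [ids]}; raise ValueError for unrecognized IDs."""
    groups = {}
    for acc in accessions:
        for repo, pattern in PATTERNS.items():
            if pattern.match(acc):
                groups.setdefault(repo, []).append(acc)
                break
        else:
            raise ValueError(f"unrecognized accession: {acc}")
    return groups


def build_commands(accessions):
    """Build one argv list per repository subcommand."""
    return [["grabseqs", repo, *ids]
            for repo, ids in group_by_repo(accessions).items()]


# Placeholder accessions, for illustration only.
for cmd in build_commands(["SRP000001", "mgp12345", "ERR000002"]):
    print(" ".join(cmd))
```

Each resulting list can be handed to `subprocess.run(cmd, check=True)` once grabseqs is installed.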
## Detailed usage
See the [grabseqs FAQ](https://github.com/louiejtaylor/grabseqs/blob/master/faq/faq.md) for detailed troubleshooting tips.
Fun options:

    grabseqs sra -t 10 -m metadata.csv -o proj/ -r 3 SRP#######
(translation: use 10 threads, save metadata to `proj/metadata.csv`, download to the dir `proj/`, retry failed downloads 3x, get all samples from SRP#######)
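If you drive grabseqs from a Python script, the same options can be assembled programmatically. This is a minimal sketch: `sra_command` and `metadata_path` are hypothetical helpers, and `SRP0000001` is a placeholder accession. Only the flags documented in this README are used.

```python
import os


def sra_command(ids, threads=None, metadata=None, outdir=None,
                retries=None, list_only=False):
    """Build an argv list for `grabseqs sra` (hypothetical helper)."""
    cmd = ["grabseqs", "sra"]
    if threads is not None:
        cmd += ["-t", str(threads)]
    if metadata is not None:
        cmd += ["-m", metadata]
    if outdir is not None:
        cmd += ["-o", outdir]
    if retries is not None:
        cmd += ["-r", str(retries)]
    if list_only:
        cmd.append("-l")  # dry run: list samples without downloading
    return cmd + list(ids)


def metadata_path(outdir, metadata):
    """The metadata filename is interpreted relative to the output dir."""
    return os.path.join(outdir, metadata)


# SRP0000001 is a placeholder accession, not a real project.
cmd = sra_command(["SRP0000001"], threads=10, metadata="metadata.csv",
                  outdir="proj/", retries=3)
print(" ".join(cmd))
print(metadata_path("proj/", "metadata.csv"))
```

To actually run the download, pass the list to `subprocess.run(cmd, check=True)`; setting `list_only=True` adds `-l` for a dry run.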
If you'd like to do a dry run and only get a list of samples that will be downloaded, pass `-l`:

    grabseqs sra -l SRP########
If you'd like to pass your own arguments to `fasterq-dump` to get data in a slightly different format, you can do so like this:

    grabseqs sra SRP####### -r 0 --custom_fqdump_args="--split-spot --progress"
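The shell quoting above delivers the extra arguments to grabseqs as one string. When building the command from Python, a sketch like the following keeps them together in a single argv element; `with_custom_fqdump_args` is a hypothetical helper and `SRP0000001` is a placeholder accession.

```python
import shlex


def with_custom_fqdump_args(base_cmd, extra_args):
    """Append extra fasterq-dump args as ONE argv element (hypothetical helper)."""
    joined = shlex.join(extra_args)  # e.g. "--split-spot --progress"
    return list(base_cmd) + [f"--custom_fqdump_args={joined}"]


# SRP0000001 is a placeholder accession, not a real project.
cmd = with_custom_fqdump_args(
    ["grabseqs", "sra", "SRP0000001", "-r", "0"],
    ["--split-spot", "--progress"],
)

# shlex.join quotes the final element so it can be pasted into a shell.
print(shlex.join(cmd))
```

Because the command is a list, no shell is involved when it is passed to `subprocess.run`, so the space inside the last element needs no further escaping.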
Full usage:

    grabseqs sra [-h] [-m METADATA] [-o OUTDIR] [-r RETRIES] [-t THREADS]
                 [-f] [-l] [--no_parsing] [--parse_run_ids]
                 [--use_fastq_dump]
                 id [id ...]

    positional arguments:
      id                One or more BioProject, ERR/SRR or ERP/SRP number(s)

    optional arguments:
      -h, --help        show this help message and exit
      -m METADATA       filename in which to save SRA metadata (.csv format,
                        relative to OUTDIR)
      -o OUTDIR         directory in which to save output. created if it doesn't
                        exist
      -r RETRIES        number of times to retry download
      -t THREADS        threads to use (for fasterq-dump/pigz)
      -f                force re-download of files
      -l                list (but do not download) samples to be grabbed
      --parse_run_ids   parse SRR/ERR identifers (do not pass straight to
                        fasterq-dump)
      --custom_fqdump_args CUSTOM_FQD_ARGS
                        "string" containing args to pass to fastq-dump
      --use_fastq_dump  use legacy fastq-dump instead of fasterq-dump (no
                        multithreaded downloading)
Downloads .fastq.gz files to `OUTDIR` (or the working directory if not specified). If the `-m` flag is passed, saves metadata to `OUTDIR` with filename `METADATA` in csv format.
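After a run, you may want to sanity-check what landed in `OUTDIR`. The sketch below is an illustrative helper, not part of grabseqs: it decompresses every `.fastq.gz` file to confirm it is intact and, if a metadata file was requested, returns its header row. No particular metadata column names are assumed.

```python
import csv
import gzip
import zlib
from pathlib import Path


def gzip_ok(path):
    """Return True if the whole file decompresses without error."""
    try:
        with gzip.open(path, "rb") as handle:
            while handle.read(1 << 20):  # stream in 1 MiB chunks
                pass
        return True
    except (OSError, EOFError, zlib.error):
        return False


def summarize_outdir(outdir, metadata=None):
    """Return ({filename: is_valid_gzip}, metadata_header_columns)."""
    outdir = Path(outdir)
    report = {p.name: gzip_ok(p) for p in sorted(outdir.glob("*.fastq.gz"))}
    columns = []
    if metadata is not None:
        # The metadata filename is relative to the output directory.
        with open(outdir / metadata, newline="") as handle:
            columns = next(csv.reader(handle), [])
    return report, columns
```

For example, `summarize_outdir("proj/", "metadata.csv")` would inspect the output of the "fun options" command shown earlier.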
Similar options are available for downloading from MG-RAST:

    grabseqs mgrast [-h] [-m METADATA] [-o OUTDIR] [-r RETRIES]
                    [-t THREADS] [-f] [-l]
                    rastid [rastid ...]
## Troubleshooting
See the [grabseqs FAQ](https://github.com/louiejtaylor/grabseqs/blob/master/faq/faq.md) for detailed troubleshooting tips. If the FAQs don't fix your problem, feel free to [open an issue](https://github.com/louiejtaylor/grabseqs/issues)!
## Dependencies
- Python 3 (external packages required: requests, requests-html, pandas, fake-useragent)
- sra-tools>3.2
- pigz
- wget
If you use conda (on Linux), these will be installed for you!
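If you installed with pip, a quick check like the one below can tell you which non-Python tools are still missing from your `PATH`. It is only a sketch: the list of executable names is an assumption based on the dependencies above (`fasterq-dump` ships with sra-tools), and it does not check version numbers.

```python
import shutil

# Executable names assumed from the dependency list above; fasterq-dump is
# provided by sra-tools. Versions are NOT checked here.
REQUIRED_TOOLS = ("fasterq-dump", "pigz", "wget")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Missing from PATH:", ", ".join(missing))
else:
    print("All external tools found.")
```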
Grabseqs runs on Mac or Linux. We've tested on these specific OSes:
Linux (conda or pip):
- CentOS 6, 7, and 8
- Debian 9 and 10
- Ubuntu 16.04, 18.04, and 19.10
- Red Hat Enterprise 6, 7, and 8
- SUSE Enterprise 12 and 15
Mac (pip):
- macOS 10.14
Grabseqs has been tested and works with the following versions of the Python dependencies (though these are neither minimal nor pinned version numbers):
- requests 2.22.0
- pandas>2
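To compare your environment against the versions above, something like this sketch reports what is installed. The distribution names are taken from the dependency list; `installed_versions` is a hypothetical helper, and a value of `None` means the package was not found.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as listed under Dependencies.
PYTHON_DEPS = ("requests", "requests-html", "pandas", "fake-useragent")


def installed_versions(packages=PYTHON_DEPS):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions().items():
    print(f"{name}: {ver or 'not installed'}")
```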
## Citation
If you use grabseqs in your work, please cite:
Louis J Taylor, Arwa Abbas, Frederic D Bushman. "grabseqs: Simple downloading of reads and metadata from multiple next-generation sequencing data repositories." *Bioinformatics*, (2020), btaa167, https://doi.org/10.1093/bioinformatics/btaa167
Please also cite the researchers who generated the data (and the repository, if appropriate)!
------------
## Changelog
**1.0.0** (2025-08-14)
- Added a walk-through for adding a new repo using `template.py`
- Better handling for invalid SRA accession numbers
- Update endpoint for NCBI for SRA downloads
- Temporarily remove iMicrobe--needs rewrite to use a different tool
**0.7.0** (2020-01-29)
- Allow users to pass custom args to fast(er)q-dump
- Minor re-writes of download handling code for easier readability
**0.6.1** (2019-12-20)
- Validate compressed files (fix #8 and #34)
**0.6.0** (2019-12-12)
- Gracefully handle incomplete or missing dependencies
- Major rewrite of test suite
**0.5.2** (2019-12-05)
- Improvements to work with multiple versions of Python 3
**0.5.1** (2019-11-23)
- Hotfix handling outdated versions of sra-tools
**0.5.0** (2019-04-11)
- Metadata available for all sources in .csv format
## History
This project grew out of and incorporates code from [hisss](https://github.com/louiejtaylor/hisss); many thanks to [ArwaAbbas](https://github.com/ArwaAbbas) for helping make this work!