https://github.com/snitkin-lab-umich/genome_downloading_snakemake

# Snakemake pipeline to download genomes from NCBI using Biosample Accession Numbers and fasterq-dump

Written by Sophie Hoffman & Zena Lapp

## Useful links:
NCBI:
- [NCBI website](https://www.ncbi.nlm.nih.gov/)
- [Fasterq-dump info](https://github.com/ncbi/sra-tools/wiki/HowTo:-fasterq-dump)
- [SRA number info](https://www.ncbi.nlm.nih.gov/sra/docs/)
- [Biosample Accession Number info](https://www.ncbi.nlm.nih.gov/biosample/docs/submission/faq/)

[Snakemake setup](https://github.com/Snitkin-Lab-Umich/Snakemake_setup)

## Fasterq-dump and NCBI
### Preparation
NCBI API Key:
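- An API key is necessary if you are downloading a large number of genomes from NCBI. Without a key, NCBI limits E-utilities requests to 3 per second; with a key, the limit is 10 per second.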
- To get an API key, register for an NCBI account [here](https://www.ncbi.nlm.nih.gov/account/?back_url=https%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fmyncbi%2F). Go to the "Settings" page in your account, then click "Create an API Key" under "API Key Management".
- To use the API key, create an environment variable called `NCBI_API_KEY` in your `~/.bashrc` or conda environment (see Conda section).
```
export NCBI_API_KEY={your key}
```
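
Programs launched from your shell only see the key if the variable is exported to the environment. As a minimal sketch, assuming a POSIX shell, the check below reports whether a child process can see `NCBI_API_KEY`. The function name `check_ncbi_api_key` is a hypothetical helper and not part of this pipeline.

```shell
# Hypothetical helper: succeed only if NCBI_API_KEY is exported and non-empty.
check_ncbi_api_key() {
    # A child shell inherits exported variables only, so this also catches
    # a key that was assigned without "export".
    if sh -c '[ -n "${NCBI_API_KEY:-}" ]'; then
        echo "NCBI_API_KEY is set"
    else
        echo "NCBI_API_KEY is not set or not exported" >&2
        return 1
    fi
}
```

Run `check_ncbi_api_key` in a new terminal after editing `~/.bashrc` to confirm the key is picked up.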

Caching:
- The SRA toolkit caches data in your home directory by default, so if you are downloading a large amount of data, change the cache location to scratch instead.
- Information on how to do that through the SRA toolkit is [here](https://github.com/ncbi/sra-tools/wiki/03.-Quick-Toolkit-Configuration).
- Alternatively, you can go into your `~/.ncbi/user-settings.mkfg` file and edit the `/repository/user/default-path = "/home/uniquename/ncbi"` line by replacing `"/home/uniquename/ncbi"` with a path to your desired scratch location.
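
The manual edit described above can also be scripted. This is a minimal sketch, assuming the settings file contains a line of the form `/repository/user/default-path = "..."` as shown above. The function name `set_sra_cache_path` is hypothetical, and the new path must not contain `|` or `&`, which have special meaning in the `sed` expression.

```shell
# Hypothetical helper: rewrite the default-path line in an SRA toolkit settings file.
# Usage: set_sra_cache_path <settings file> <new cache directory>
set_sra_cache_path() {
    cfg=$1
    new_path=$2
    # Use "|" as the sed delimiter because the paths contain "/".
    # Write to a temporary file first so the original is only replaced on success.
    sed "s|^/repository/user/default-path = .*|/repository/user/default-path = \"${new_path}\"|" \
        "$cfg" > "${cfg}.tmp" && mv "${cfg}.tmp" "$cfg"
}
```

For example, `set_sra_cache_path ~/.ncbi/user-settings.mkfg {scratch path}` would update your own settings file; consider backing it up first.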

### Downloading genomes with fasterq-dump
- Fasterq-dump uses an SRA number to find a genome in the NCBI database and download it. The following command downloads a single genome from a single SRA number (`-O` specifies the output directory).
```
/nfs/esnitkin/bin_group/sratoolkit.2.9.1-1-centos_linux64/bin/fasterq-dump {SRA number} -O output_files
```
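
To download several genomes, the same command can be run in a loop. The sketch below assumes a plain-text file with one SRA number per line. The function name `download_sra_list` and the `FASTERQ_DUMP` variable are hypothetical and only illustrate the idea; this is not the pipeline's own implementation.

```shell
# Command used to run fasterq-dump; set FASTERQ_DUMP to the full path
# if fasterq-dump is not on your PATH.
FASTERQ_DUMP=${FASTERQ_DUMP:-fasterq-dump}

# Hypothetical helper: download every SRA number listed in a file (one per line).
# Usage: download_sra_list <file of SRA numbers> <output directory>
download_sra_list() {
    list=$1
    outdir=$2
    mkdir -p "$outdir"
    # The "|| [ -n ... ]" part handles a last line without a trailing newline.
    while IFS= read -r sra || [ -n "$sra" ]; do
        # Skip blank lines.
        if [ -z "$sra" ]; then
            continue
        fi
        # Redirect stdin so the command cannot consume the rest of the list.
        # Report a failed download and carry on with the next SRA number.
        "$FASTERQ_DUMP" "$sra" -O "$outdir" < /dev/null || echo "failed: $sra" >&2
    done < "$list"
}
```

Failed downloads are reported on standard error rather than stopping the loop, so one bad SRA number does not block the rest of the list.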

- I recommend making an alias in your `~/.bashrc` for fasterq-dump so that you don't need to use the file path every time.
- If you only have a Biosample Accession Number and not an SRA number, use the following command to find the SRA number associated with that Biosample number.
```
esearch -db sra -query {Biosample number}