https://github.com/snitkin-lab-umich/genome_downloading_snakemake
Snakemake pipeline to download genomes from NCBI using Biosample Accession Numbers and fasterq-dump
- Host: GitHub
- URL: https://github.com/snitkin-lab-umich/genome_downloading_snakemake
- Owner: Snitkin-Lab-Umich
- Created: 2020-07-06T13:17:20.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2020-07-22T20:16:55.000Z (almost 5 years ago)
- Last Synced: 2025-01-26T17:36:15.222Z (4 months ago)
- Language: R
- Size: 49.8 KB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Snakemake pipeline to download genomes from NCBI using Biosample Accession Numbers and fasterq-dump
Written by Sophie Hoffman & Zena Lapp
## Useful links:
NCBI:
- [NCBI website](https://www.ncbi.nlm.nih.gov/)
- [Fasterq-dump info](https://github.com/ncbi/sra-tools/wiki/HowTo:-fasterq-dump)
- [SRA number info](https://www.ncbi.nlm.nih.gov/sra/docs/)
- [Biosample Accession Number info](https://www.ncbi.nlm.nih.gov/biosample/docs/submission/faq/)
- [Snakemake setup](https://github.com/Snitkin-Lab-Umich/Snakemake_setup)
## Fasterq-dump and NCBI
### Preparation
NCBI API Key:
- An API key is necessary if you are downloading a large number of genomes from NCBI.
- To get an API key, register for an NCBI account [here](https://www.ncbi.nlm.nih.gov/account/?back_url=https%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fmyncbi%2F). Go to the "Settings" page in your account, then click "Create an API Key" under "API Key Management".
- To use the API key, create an environment variable called `NCBI_API_KEY` in your `~/.bashrc` or conda environment (see Conda section).
```
export NCBI_API_KEY={your key}
```

Caching:
- NCBI caches data in your home directory by default, so if you are downloading a large amount of data you'll want to change the cache location to scratch instead.
- Information on how to do that through the SRA toolkit is [here](https://github.com/ncbi/sra-tools/wiki/03.-Quick-Toolkit-Configuration).
- Alternatively, you can go into your `~/.ncbi/user-settings.mkfg` file and edit the `/repository/user/default-path = "/home/uniquename/ncbi"` line by replacing `"/home/uniquename/ncbi"` with a path to your desired scratch location.

### Downloading genomes with fasterq-dump
- Fasterq-dump uses an SRA number to find a genome in the NCBI database to download. The following command shows how to download a single genome from a single SRA number (`-O` specifies the output directory).
```
/nfs/esnitkin/bin_group/sratoolkit.2.9.1-1-centos_linux64/bin/fasterq-dump {SRA number} -O output_files
```
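To download more than one genome, the same command can be wrapped in a loop. The following is only a sketch, not part of the pipeline itself: the function name `download_sra_list`, the `FASTERQ_DUMP` variable, and the one-SRA-number-per-line input file are all assumptions made for illustration.

```shell
# Command used to run fasterq-dump (assumption: it is on your PATH).
# Set FASTERQ_DUMP to the full path if it is installed elsewhere.
FASTERQ_DUMP="${FASTERQ_DUMP:-fasterq-dump}"

# download_sra_list LIST_FILE OUT_DIR
# Hypothetical helper: runs fasterq-dump once for every non-empty line
# of LIST_FILE (one SRA number per line) and writes the output to OUT_DIR.
download_sra_list() {
    list_file="$1"
    out_dir="$2"
    # "|| [ -n "$sra" ]" also handles a last line without a trailing newline.
    while IFS= read -r sra || [ -n "$sra" ]; do
        # Skip blank lines.
        [ -z "$sra" ] && continue
        # Redirect stdin so the command cannot consume the rest of the list.
        "$FASTERQ_DUMP" "$sra" -O "$out_dir" < /dev/null
    done < "$list_file"
}
```

For example, `download_sra_list sra_numbers.txt output_files` would process every SRA number listed in a (hypothetical) file called `sra_numbers.txt`.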
- I recommend making an alias in your `~/.bashrc` for fasterq-dump so that you don't need to use the file path every time.
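A minimal sketch of such an alias, reusing the toolkit path from the command above. The variable name `SRATOOLKIT_BIN` is only illustrative, and the path needs to be adjusted if your SRA Toolkit is installed somewhere else.

```shell
# Directory containing the SRA Toolkit binaries (site-specific; change as needed).
SRATOOLKIT_BIN="/nfs/esnitkin/bin_group/sratoolkit.2.9.1-1-centos_linux64/bin"

# Add these lines to ~/.bashrc so that every new interactive shell
# can call fasterq-dump without typing the full path.
alias fasterq-dump="${SRATOOLKIT_BIN}/fasterq-dump"
```

Note that bash only expands aliases in interactive shells by default, so scripts and Snakemake rules should still call the full path (or have the toolkit directory on `PATH`).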
- If you only have a Biosample Accession Number and no SRA number, you can use the following command to find the SRA number associated with that Biosample number.
```
esearch -db sra -query {Biosample number}