{"id":13602923,"url":"https://github.com/pirovc/genome_updater","last_synced_at":"2026-02-28T20:02:25.622Z","repository":{"id":20814939,"uuid":"90990028","full_name":"pirovc/genome_updater","owner":"pirovc","description":"Bash script to download/update snapshots of files from NCBI genomes repository (refseq/genbank) with track of changes and without redundancy","archived":false,"fork":false,"pushed_at":"2025-11-14T16:25:38.000Z","size":1423,"stargazers_count":161,"open_issues_count":7,"forks_count":15,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-11-14T18:18:12.428Z","etag":null,"topics":["bash","bioinformatics","database","download","genbank","genome","genomes","genomics","ncbi","refseq","sequence"],"latest_commit_sha":null,"homepage":"","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pirovc.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2017-05-11T14:42:59.000Z","updated_at":"2025-10-20T11:55:41.000Z","dependencies_parsed_at":"2025-04-12T01:00:01.297Z","dependency_job_id":null,"html_url":"https://github.com/pirovc/genome_updater","commit_stats":null,"previous_names":[],"tags_count":20,"template":false,"template_full_name":null,"purl":"pkg:github/pirovc/genome_updater","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pirovc%2Fgenome_updater","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pirovc%2Fgenome_updater/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pirovc%2Fgenome_upda
ter/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pirovc%2Fgenome_updater/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pirovc","download_url":"https://codeload.github.com/pirovc/genome_updater/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pirovc%2Fgenome_updater/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29951070,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-28T18:42:55.706Z","status":"ssl_error","status_checked_at":"2026-02-28T18:42:48.811Z","response_time":90,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bash","bioinformatics","database","download","genbank","genome","genomes","genomics","ncbi","refseq","sequence"],"created_at":"2024-08-01T18:01:43.168Z","updated_at":"2026-02-28T20:02:25.611Z","avatar_url":"https://github.com/pirovc.png","language":"Shell","funding_links":[],"categories":["Shell","bioinformatics"],"sub_categories":[],"readme":"# genome_updater [![Build Status](https://app.travis-ci.com/pirovc/genome_updater.svg?branch=main)](https://app.travis-ci.com/pirovc/genome_updater) [![codecov](https://codecov.io/gh/pirovc/genome_updater/branch/master/graph/badge.svg)](https://codecov.io/gh/pirovc/genome_updater) [![Anaconda-Server 
Badge](https://anaconda.org/bioconda/genome_updater/badges/downloads.svg)](https://anaconda.org/bioconda/genome_updater)\n\ngenome_updater is a bash script that downloads and updates (non-redundant) snapshots of the NCBI Genomes repository (RefSeq/GenBank) [[1](https://ftp.ncbi.nlm.nih.gov/genomes/)] with advanced filters, detailed logs and reports, file integrity checks (MD5), NCBI taxonomy and GTDB [[2](https://gtdb.ecogenomic.org/)] integration and support for parallel [[3](https://doi.org/10.5281/zenodo.1146014)] downloads. genome_updater uses the [assembly_summary.txt](https://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt) to retrieve data.\n\n## Quick usage guide\n\n### Download \n\n```bash\nwget --quiet --show-progress https://raw.githubusercontent.com/pirovc/genome_updater/master/genome_updater.sh\nchmod +x genome_updater.sh\n```\n\n### Usage\n\nDownload genomic sequences of complete archaeal genomes from RefSeq:\n\n```bash\n./genome_updater.sh -o \"arc_refseq_cg\" -d \"refseq\" -g \"archaea\" -l \"complete genome\" -f \"genomic.fna.gz\" -t 12\n```\n\nSome days later, update the local repository to download newly added files:\n\n```bash\n./genome_updater.sh -o \"arc_refseq_cg\"\n```\n\n - `-t`: number of parallel downloads.\n - `-k`: performs a dry-run, showing how many files will be downloaded/updated.\n\n## Important parameters\n\nA list of all parameters can be found [here](#genome_updater--h).\n\n### Database/Organism/Taxa\n\n- `-d`: Database/repository\n  - Options: `refseq`, `genbank`\n- `-g`: Organism groups\n  - Options: `archaea`, `bacteria`, `fungi`, `human`, `invertebrate`, `metagenomes`, `other`, `plant`, `protozoa`, `vertebrate_mammalian`, `vertebrate_other`, `viral`\n- `-T`: Taxonomic groups, with optional negation using the `^` prefix\n  - Examples: `-T '562'`, `-T '543,^562'`, `-T 'f__Enterobacteriaceae,^s__Escherichia coli'` (with `-M gtdb`)\n\n### Output\n\n- `-o`: Output directory\n  - Every run generates a 
snapshot, which can be named using the `-b {snapshot}` option (a timestamp is used by default).\n  - Downloaded files are stored in a single folder (`{working_dir}/{snapshot}/files/`), but the NCBI FTP file structure can be enforced using the `-N` option (e.g. `{working_dir}/{snapshot}/files/GCF/019/968/985/`).\n- `-f`: File types. All file types are listed [here](https://ftp.ncbi.nlm.nih.gov/genomes/all/README.txt).\n  - Example: `-f 'genomic.fna.gz,assembly_report.txt'`. \n\n### Filters\n\n- `-c`: RefSeq category\n  - Options: `reference genome`, `na`\n- `-l`: Assembly level\n  - Options: `Complete Genome`, `Chromosome`, `Scaffold`, `Contig`\n- `-D`/`-E`: Start and end sequence release dates, respectively\n  - Example: `-D 20201231 -E 20251231`\n- `-F`: Custom filters for the [assembly_summary.txt](https://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt). Can be applied to a single column (e.g. `$4`) or to the whole line (`$0`). Uses [awk](https://www.gnu.org/software/gawk/manual/gawk.html) conditional syntax.\n  - Examples:\n    - Single: `-F '$14 == \"Full\"'`\n    - Multi:  `-F '($2 == \"PRJNA12377\" || $2 == \"PRJNA670754\") \u0026\u0026 $4 != \"Partial\"'`\n    - Regex:  `-F '$8 ~ /bacterium/'`\n    - Whole-line: `-F '$0 ~ \"plasmid\"'`\n\n### Taxonomy\n\n- `-A`: Limits the number of assemblies for a specific taxonomy rank. [More info](#top-assemblies).\n  - `-A 3` keeps 3 assemblies for each taxonomic leaf.\n  - `-A 'genus:3'` keeps 3 assemblies for each genus.\n- `-M`: Taxonomy\n  - Options: `ncbi` (default), `gtdb`\n  - The `-M gtdb` option enables GTDB compatibility, keeping only assemblies from the [most recent GTDB release](https://data.gtdb.aau.ecogenomic.org/releases/latest/). The taxonomy filter uses the GTDB format (e.g. 
`-T 's__Escherichia coli'`).\n  \n## Update details\n\nWhen updating an existing local repository:\n\n - Newly added sequences will be downloaded, creating a new version (`-b`, timestamp by default).\n - Removed or old sequences will be retained, but not transferred to the new version.\n - Repeated/unchanged files are linked to the new version.\n - Arguments can be added to or changed in the update. For example, use the command `./genome_updater.sh -o \"arc_refseq_cg\" -t 2` to specify a different number of threads, or use the command `./genome_updater.sh -o \"arc_refseq_cg\" -l \"\"` to remove the `complete genome` filter.\n - The file `history.tsv` will be created in the output folder (`-o`), tracking the versions and arguments used. Please note that boolean flags/arguments are not tracked (e.g. `-m`).\n\n## Installation\n\n### conda/mamba\n\n```bash\nconda install -c bioconda genome_updater \n```\n\n### direct file download\n\n```bash\nwget https://raw.githubusercontent.com/pirovc/genome_updater/master/genome_updater.sh\nchmod +x genome_updater.sh\n```\n\n- genome_updater is portable and depends on the GNU Core Utilities and a few additional tools (`awk` `bc` `find` `join` `md5sum` `parallel` `sed` `tar` `wget`/`curl`), which are commonly available and installed in most distributions. 
\n\n- If you are not sure whether you have them all, just run `genome_updater.sh` and it will tell you if something is missing (otherwise it will show the help page).\n\n### tests\n\nTo test if all genome_updater functions are running properly on your system:\n\n```bash\ngit clone --recurse-submodules https://github.com/pirovc/genome_updater.git\ncd genome_updater\ntests/test.sh\n```\n\n## Examples\n\n### Archaea, Bacteria, Fungi and Viral complete genome sequences (RefSeq)\n\n```bash\n# Download (-m to check integrity of downloaded files)\n./genome_updater.sh -d \"refseq\" -g \"archaea,bacteria,fungi,viral\" -f \"genomic.fna.gz\" -o \"arc_bac_fun_vir_refseq_cg\" -t 12 -m\n\n# Update (e.g. some days later)\n./genome_updater.sh -o \"arc_bac_fun_vir_refseq_cg\" -m\n```\n\n### All Riboviria RNA Viruses txid:2559587\n\n```bash\n# -t 12 for using 12 threads to download in parallel\n./genome_updater.sh -d \"refseq\" -T \"2559587\" -f \"genomic.fna.gz\" -o \"all_rna_virus\" -t 12 -m\n```\n\n### One genome assembly for each bacterial taxonomic leaf node\n\n```bash   \n./genome_updater.sh -d \"genbank\" -g \"bacteria\" -f \"genomic.fna.gz\" -o \"top1_bacteria_genbank\" -A 1 -t 12 -m \n```\n\n### One genome assembly for each bacterial species\n\n```bash   \n./genome_updater.sh -d \"genbank\" -g \"bacteria\" -f \"genomic.fna.gz\" -o \"top1species_bacteria_genbank\" -A \"species:1\" -t 12 -m \n```\n\n### All genomes for the latest GTDB release\n\n```bash \n./genome_updater.sh -d \"refseq,genbank\" -g \"archaea,bacteria\" -f \"genomic.fna.gz\" -o \"GTDB_complete\" -M \"gtdb\" -t 12 -m\n```\n\n### Two genome assemblies for every genus in GTDB\n\n```bash \n./genome_updater.sh -d \"refseq,genbank\" -g \"archaea,bacteria\" -f \"genomic.fna.gz\" -o \"GTDB_top2genus\" -M \"gtdb\" -A \"genus:2\" -t 12 -m\n```\n\n### All assemblies from a specific family in GTDB\n\n```bash \n./genome_updater.sh -d \"refseq,genbank\" -g \"archaea,bacteria\" -f \"genomic.fna.gz\" -o 
\"GTDB_family_Gastranaerophilaceae\" -M \"gtdb\" -T \"f__Gastranaerophilaceae\" -t 12 -m\n```\n\n### All assemblies from a specific family (excluding a genus) in GTDB\n\n```bash \n./genome_updater.sh -d \"refseq,genbank\" -g \"archaea,bacteria\" -f \"genomic.fna.gz\" -o \"GTDB_Mycobacteriaceae_minus_Mycobacterium\" -M \"gtdb\" -T \"f__Mycobacteriaceae,^g__Mycobacterium\" -t 12 -m\n```\n\n## Advanced examples\n\n### Download, change and update a repository\n\n```bash \n# Dry-run to check files available\n./genome_updater.sh -d \"refseq\" -g \"archaea,bacteria\" -l \"complete genome\" -f \"genomic.fna.gz\" -k\n\n# Download (-o output folder, -t threads, -m checking md5, -u extended assembly accession report)\n./genome_updater.sh -d \"refseq\" -g \"archaea,bacteria\" -l \"complete genome\" -f \"genomic.fna.gz\" -o \"arc_bac_refseq_cg\" -t 12 -u -m\n\n# Download additional .gbff files for the current snapshot (adding genomic.gbff.gz to -f, -i to just add files and not update)\n./genome_updater.sh -f \"genomic.fna.gz,genomic.gbff.gz\" -o \"arc_bac_refseq_cg\" -i\n\n# Some days later, just check for updates but do not update\n./genome_updater.sh -o \"arc_bac_refseq_cg\" -k\n\n# Perform update\n./genome_updater.sh -o \"arc_bac_refseq_cg\" -u -m\n```\n\n### Branch from base version with specific filters\n\n```bash \n# Download all bacterial RefSeq assemblies\n./genome_updater.sh -d \"refseq\" -g \"bacteria\" -f \"genomic.fna.gz\" -o \"bac_refseq\" -t 12 -m -b \"all\"\n\n# Branch the main files into two sub-versions (no new files will be downloaded or copied)\n./genome_updater.sh -o \"bac_refseq\" -B \"all\" -b \"complete\" -l \"complete genome\"\n./genome_updater.sh -o \"bac_refseq\" -B \"all\" -b \"reference\" -c \"reference genome\"\n```\n\n### Generate sequence reports and URLs\n\n```bash \n./genome_updater.sh -d \"refseq\" -g \"fungi\" -f \"assembly_report.txt\" -o \"fungi\" -t 12 -rpu\n```\n\n### Recovering genomic assemblies from an external 
assembly_summary.txt\n\n```bash \n./genome_updater.sh -e /my/path/assembly_summary.txt -f \"genomic.fna.gz\" -o \"recovered_sequences\"\n```\n\n### Use curl instead of wget, change the download timeout and retries, and increase batch retry attempts\n\n```bash \nretries=10 timeout=600 ./genome_updater.sh -g \"fungi\" -o fungi -t 12 -f \"genomic.fna.gz,assembly_report.txt\" -L curl -R 10\n```\n\n### Use a local taxdump file\n\n```bash \nnew_taxdump_file=\"my/local/new_taxdump.tar.gz\" ./genome_updater.sh -T 562 -o 562assemblies -t 12\n```\n\n- The [new_taxdump](https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/) format is required.\n\n### Alternative download URL\n\n```bash\n# NCBI\nncbi_base_url=\"https://ftp.ncbi.nih.gov/\" ./genome_updater.sh -d refseq -g bacteria\n\n# GTDB\ngtdb_base_url=\"https://data.gtdb.ecogenomic.org/releases/latest/\" ./genome_updater.sh -d refseq,genbank -g bacteria,archaea\n```\n\n## Reports\n\n### assembly accessions\n\nThe `-u` parameter activates the output of a list of updated assembly accessions for entries where all files have been successfully downloaded. The file `{timestamp}_assembly_accession.txt` contains the following tab-separated fields:\n\n    Added [A] or Removed [R], assembly accession, url\n\nExample:\n\n    A\tGCF_000146045.2\tftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64\n    A\tGCF_000002515.2\tftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/515/GCF_000002515.2_ASM251v1\n    R\tGCF_000091025.4\tftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/091/025/GCF_000091025.4_ASM9102v4\n\n### sequence accessions\n\nThe `-r` parameter activates the output of a list of updated sequence accessions for entries where all files have been successfully downloaded. This option is only available when the file types (`-f`) include `assembly_report.txt`. 
The file `{timestamp}_sequence_accession.txt` contains the following tab-separated fields:\n\n    Added [A] or Removed [R], assembly accession, genbank accession, refseq accession, sequence length, taxonomic id\n\nExample:\n\n    A\tGCA_000243255.1\tCM001436.1\tNZ_CM001436.1\t3200946\t937775\n    R\tGCA_000275865.1\tCM001555.1\tNZ_CM001555.1\t2475100\t28892\n\n- Note: if genome_updater breaks or does not finish completely, some files may be missing from the assembly and sequence accession reports.\n\n### URLs (and files)\n\nThe `-p` parameter activates the output of the lists of successfully downloaded and failed URLs to the files `{timestamp}_url_downloaded.txt` and `{timestamp}_url_failed.txt`, respectively. The failed list will only be complete if the command runs to completion without errors or interruptions.\n\nTo obtain the names of the successfully downloaded files (only the new files after an update), strip the URL path from this report or list the regular files in the version folder:\n\n```bash\nsed 's#.*/##' {timestamp}_url_downloaded.txt\n# or\nfind output_folder/version/files/ -type f\n```\n\n## Top assemblies\n\nThe `-A` option will select the 'best' assemblies for each taxonomic node (leaf or specific rank) according to four categories (A–D), in order of importance:\n\n    A) refseq Category: \n        1) reference genome\n        2) na\n    B) Assembly level:\n        3) Complete Genome\n        4) Chromosome\n        5) Scaffold\n        6) Contig\n    C) Relation to type material:\n        7) assembly from type material\n        8) assembly from synonym type material\n        9) assembly from pathotype material\n        10) assembly designated as neotype\n        11) assembly designated as reftype\n        12) ICTV species exemplar\n        13) ICTV additional isolate\n    D) Date:\n        14) Most recent first\n\n## `genome_updater -h`\n\n```\n\n┌─┐┌─┐┌┐┌┌─┐┌┬┐┌─┐    ┬ ┬┌─┐┌┬┐┌─┐┌┬┐┌─┐┬─┐\n│ ┬├┤ ││││ ││││├┤     │ │├─┘ ││├─┤ │ ├┤ ├┬┘\n└─┘└─┘┘└┘└─┘┴ ┴└─┘────└─┘┴  ─┴┘┴ ┴ ┴ └─┘┴└─\n                             
        v0.7.0 \n\nDatabase options:\n -d Database (comma-separated entries)\n        [genbank, refseq]\n\nOrganism options:\n -g Organism group(s) (comma-separated entries, empty for all)\n        [archaea, bacteria, fungi, human, invertebrate, metagenomes, \n        other, plant, protozoa, vertebrate_mammalian, vertebrate_other, viral]\n        Default: \"\"\n -T Taxonomic identifier(s) with optional negation using the ^ prefix (comma-separated entries, empty for all).\n        Example: \"543,^562\" (for -M ncbi) or \"f__Enterobacteriaceae,^s__Escherichia coli\" (for -M gtdb)\n        Default: \"\"\n\nFile options:\n -f file type(s) (comma-separated entries)\n        [genomic.fna.gz, assembly_report.txt, protein.faa.gz, genomic.gbff.gz]\n        More formats at https://ftp.ncbi.nlm.nih.gov/genomes/all/README.txt\n        Default: assembly_report.txt\n\nFilter options:\n -c refseq category (comma-separated entries, empty for all)\n        [reference genome, na]\n        Default: \"\"\n -l assembly level (comma-separated entries, empty for all)\n        [Complete Genome, Chromosome, Scaffold, Contig]\n        Default: \"\"\n -D Start date (\u003e=), based on the sequence release date. Format YYYYMMDD.\n        Default: \"\"\n -E End date (\u003c=), based on the sequence release date. Format YYYYMMDD.\n        Default: \"\"\n -F Custom filter for the assembly summary. \n        Examples:\n          Single: -F '$14 == \"Full\"'\n          Multi:  -F '($2 == \"PRJNA12377\" || $2 == \"PRJNA670754\") \u0026\u0026 $4 != \"Partial\"'\n          Regex:  -F '$8 ~ /bacterium/'\n          Whole-file: -F '$0 ~ \"plasmid\"'\n        Uses awk syntax: $ for column index, || \"or\", \u0026\u0026 \"and\", ! \"not\", parentheses for nesting. Case sensitive.\n        Columns info at https://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt\n        Default: \"\"\n\nTaxonomy options:\n -M Taxonomy. gtdb keeps only assemblies in the latest GTDB release. 
ncbi keeps only latest assemblies (version_status=latest). \n        [ncbi, gtdb]\n        Default: \"ncbi\"\n -A Keep a limited number of assemblies for each selected taxa (leaf nodes). 0 for all. \n        Selection by ranks are also supported with rank:number (e.g genus:3)\n        [species, genus, family, order, class, phylum, kingdom, superkingdom]\n        Selection order based on: RefSeq Category, Assembly level, Relation to type material, Date.\n        Default: 0\n -a Keep the current version of the taxonomy database in the output folder\n\nRun options:\n -o Output/Working directory \n        Default: ./tmp.XXXXXXXXXX\n -t Threads to parallelize download and some file operations\n        Default: 1\n -k Dry-run mode. No sequence data is downloaded or updated - just checks for available sequences and changes\n -i Fix only mode. Re-downloads incomplete or failed data from a previous run. Can also be used to change files (-f).\n -m Check MD5 of downloaded files\n\nReport options:\n -u Updated assembly accessions report\n        (Added/Removed, assembly accession, url)\n -r Updated sequence accessions report\n        (Added/Removed, assembly accession, genbank accession, refseq accession, sequence length, taxid)\n        Only available when file format assembly_report.txt is selected and successfully downloaded\n -p Reports URLs successfuly downloaded and failed (url_failed.txt url_downloaded.txt)\n\nMisc. options:\n -b Version label\n        Default: current timestamp (YYYY-MM-DD_HH-MM-SS)\n -e External \"assembly_summary.txt\" file to recover data from. Mutually exclusive with -d / -g \n        Default: \"\"\n -B Alternative version label to use as the current version. 
Mutually exclusive with -i.\n        Can be used to rollback to an older version or to create multiple branches from a base version.\n        Default: \"\"\n -R Number of attempts to retry to download files in batches \n        Default: 5\n -n Conditional exit status based on number of failures accepted, otherwise will Exit Code = 1.\n        Example: -n 10 will exit code 1 if 10 or more files failed to download\n        [integer for file number, float for percentage, 0 = off]\n        Default: 0\n -N Output files in folders like NCBI ftp structure (e.g. files/GCF/000/499/605/GCF_000499605.1_EMW001_assembly_report.txt)\n -L Downloader\n        [wget, curl]\n        Default: wget\n -x Allow the deletion of regular extra files (not symbolic links) found in the output folder\n -s Silent output\n -w Silent output with download progress only\n -V Verbose log\n -Z Print debug information and run in debug mode\n\n```\n\n## References:\n\n[1] https://ftp.ncbi.nlm.nih.gov/genomes/\n\n[2] https://gtdb.ecogenomic.org/\n\n[3] O. Tange (2018): GNU Parallel 2018, March 2018, https://doi.org/10.5281/zenodo.1146014.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpirovc%2Fgenome_updater","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpirovc%2Fgenome_updater","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpirovc%2Fgenome_updater/lists"}