{"id":46632387,"url":"https://github.com/glarue/jgi-query","last_synced_at":"2026-03-08T00:11:56.154Z","repository":{"id":35166193,"uuid":"39414544","full_name":"glarue/jgi-query","owner":"glarue","description":"A simple command-line tool to download data from Joint Genome Institute databases","archived":false,"fork":false,"pushed_at":"2023-01-19T19:18:36.000Z","size":225,"stargazers_count":43,"open_issues_count":0,"forks_count":16,"subscribers_count":4,"default_branch":"main","last_synced_at":"2023-10-26T08:51:33.466Z","etag":null,"topics":["bioinformatics","cli","genomes","genomics","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mpl-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/glarue.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-07-21T00:08:50.000Z","updated_at":"2023-10-22T09:26:34.000Z","dependencies_parsed_at":"2023-02-11T18:00:44.365Z","dependency_job_id":null,"html_url":"https://github.com/glarue/jgi-query","commit_stats":null,"previous_names":[],"tags_count":6,"template":null,"template_full_name":null,"purl":"pkg:github/glarue/jgi-query","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/glarue%2Fjgi-query","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/glarue%2Fjgi-query/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/glarue%2Fjgi-query/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/glarue%2Fjgi-query/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/glarue","download_url":"https://codeload.github.com/glarue/jgi-query/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/glarue%2Fjgi-query/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30238301,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-07T23:52:25.683Z","status":"ssl_error","status_checked_at":"2026-03-07T23:52:25.373Z","response_time":53,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","cli","genomes","genomics","python"],"created_at":"2026-03-08T00:11:55.511Z","updated_at":"2026-03-08T00:11:56.142Z","avatar_url":"https://github.com/glarue.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# jgi-query\n\nA command-line tool for querying and downloading from databases hosted by the [Joint Genome Institute (JGI)](https://jgi.doe.gov/). Useful for accessing JGI data from command-line-only resources such as remote servers, or as a lightweight alternative to JGI's other [GUI-based download tools](https://genome.jgi.doe.gov/portal/help/download.jsf).\n\n### Dependencies\n\n- A [user account with JGI](https://contacts.jgi.doe.gov/registration/new) (free)\n- [cURL](http://curl.haxx.se/), required by the JGI download API\n- [Python](https://www.python.org/downloads/) 3.x (current development) or 2.7.x (deprecated but provided -- now *significantly outdated*)\n\n### Installation\n\n1. Download `jgi-query.py`\n2. Ensure that you're running the correct version of Python with `python --version`. If this reports Python 2.x, run the script using `python3` instead of `python`\n3. From the command line, run the script with the command `python jgi-query.py` to show usage information and further instructions\n\n#### Usage information\n\n```\nusage: jgi-query.py [-h] [-x [XML]] [-c] [-s] [-f] [-u] [-n RETRY_N]\n                    [-l logfile] [-r REGEX] [-a]\n                    [organism_abbreviation]\n\nThis script will list and retrieve files from JGI using the curl API. It will\nreturn a list of all files available for download for a given query organism.\n\npositional arguments:\n  organism_abbreviation\n                        organism name formatted per JGI's abbreviation. For\n                        example, 'Nematostella vectensis' is abbreviated by\n                        JGI as 'Nemve1'. The appropriate abbreviation may be\n                        found by searching for the organism on JGI; the name\n                        used in the URL of the 'Info' page for that organism\n                        is the correct abbreviation. The full URL may also be\n                        used for this argument (default: None)\n\noptional arguments:\n  -h, --help            show this help message and exit\n  -x [XML], --xml [XML]\n                        specify a local xml file for the query instead of\n                        retrieving a new copy from JGI (default: None)\n  -c, --configure       initiate configuration dialog to overwrite existing\n                        user/password configuration (default: False)\n  -s, --syntax_help\n  -f, --filter_files    filter organism results by config categories instead\n                        of reporting all files listed by JGI for the query\n                        (work in progress) (default: False)\n  -u, --usage           print verbose usage information and exit (default:\n                        False)\n  -n RETRY_N, --retry_n RETRY_N\n                        number of times to retry downloading files with errors\n                        (0 to skip such files) (default: 4)\n  -l logfile, --load_failed logfile\n                        retry downloading from URLs listed in log file\n                        (default: None)\n  -r REGEX, --regex REGEX\n                        Regex pattern to use to auto-select and download files\n                        (no interactive prompt) (default: None)\n  -a, --all             Auto-select and download all files for query (no\n                        interactive prompt) (default: False)\n```\n\n### Author's note\n\nThis is a somewhat better-commented (emphasis on \"somewhat\") version of a script I wrote for grabbing various datasets using a headless Linux server. For a lot of my lab's bioinformatics work, we don't store/manipulate data on our local computers, and I was not able to find a good tool that allowed for convenient queries of the JGI database without additional software.\n\nJGI also no longer allows simple downloading of many of their datasets (via `wget`, for example), which is another reason behind the creation of this script.\n\nI highly encourage anyone with more advanced Python skills (read: almost everyone) to fork and submit pull requests.\n\n### General overview\n\nJGI uses a [cURL-based API](https://docs.google.com/document/d/1UXovE52y1ab8dZVa-LYNJtgUVgK55nHSQR3HQEJJ5-A/view) to provide information/download links to files in their database.\n\nIn brief, `jgi-query` begins by using cURL to grab an XML file for the query text. The XML file describes all of the available files and their parent categories. For example, the file for *Aureobasidium subglaciale* (JGI abbreviation \"Aurpu_var_sub1\") begins:\n\n![Aurpu_var_sub1_xml_example](http://i.imgur.com/4nImnxx.png)\n\n`jgi-query` will parse the XML file to find entries with a `filename` attribute and, depending on command-line arguments, a parent category from the list of categories in `jgi-query.config`. It then displays the available files with minimal metadata, and prompts the user to enter their selection.\n\n### File selection\n\nMain file categories in the report are numbered, as are files within each category. The selection syntax is `category_number`:`file_selection`, where `file_selection` is either a comma-separated list (e.g. `file1`, `file2`) or a contiguous range (e.g. `file1`-`file4`). For multiple parent categories and associated files, category/file list groupings are linked with semicolons (e.g. `category1`:`file1`,`file2`;`category2`:`file5`-`file8`).\n\n### Bulk file downloading\n\nAdditionally, there is a regex-based file selection option (enter \"r\" at the file selection prompt) which may be useful for selecting a large number of related files (see the [Python regex documentation](https://docs.python.org/3/library/re.html#re-syntax) for syntax information). For example, to retrieve all files with \"AllModels\" in their names, the regex to enter at the regex prompt would be `.*AllModels.*`.\n\n### Use in a larger pipeline\n\nFor programmatic use, `jgi-query` also has command-line arguments, `-a` and `-r`, that allow retrieval of either complete or regex-filtered datasets, respectively, while bypassing interactive prompts. For example, to retrieve all gzipped GFF3 files with \"FilteredModels1\" for _Schizophyllum commune_:\n\n`python3 jgi-query.py Schco3 -r 'FilteredModels1.*\\.gff3\\.gz$'`\n\n### Sample output for _Nematostella vectensis_ ('Nemve1')\n\n```\n➜ python3 jgi-query.py Nemve1                                  \nRetrieving information from JGI for query 'Nemve1' using command 'curl 'https://genome.jgi.doe.gov/ext-api/downloads/get-directory?organism=Nemve1' -L -b cookies \u003e Nemve1_jgi_index.xml'\n\n  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n                                 Dload  Upload   Total   Spent    Left  Speed\n100   379  100   379    0     0   1857      0 --:--:-- --:--:-- --:--:--  1857\n100  4350    0  4350    0     0   3958      0 --:--:--  0:00:01 --:--:-- 4248k\n\n\nQUERY RESULTS FOR 'Nemve1'\n\n======================= 1: All models, Filtered and Not ========================\nGenes:\n 1:[1] Nemve1.AllModels.gff.gz-----------------------------------[20 MB|03/2012]\nProteins:\n 1:[2] proteins.Nemve1AllModels.fasta.gz-------------------------[29 MB|03/2012]\nTranscripts:\n 1:[3] transcripts.Nemve1AllModels.fasta.gz----------------------[55 MB|03/2012]\n\n=================================== 2: Files ===================================\nAdditional Files:\n 2:[1] N.vectensis_ABAV.modified.scflds.p2g.gz------------------[261 KB|03/2012]\n 2:[2] Nemve1.FilteredModels1.txt.gz------------------------------[2 MB|03/2012]\n 2:[3] Nemve1.fasta.gz-------------------------------------------[81 MB|10/2005]\n 2:[4] Nemve_JGIest.fasta.gz-------------------------------------[30 MB|03/2012]\n 2:[5] Nemve_JGIestCL.fasta.gz------------------------------------[8 MB|03/2012]\n 2:[6] NvTRjug.fasta.gz-------------------------------------------[4 KB|03/2012]\n\n========================= 3: Filtered Models (\"best\") ==========================\nGenes:\n 3:[1] Nemve1.FilteredModels1.gff.gz------------------------------[3 MB|03/2012]\n 3:[2] Nvectensis_19_PAC2_0.GFF3.gz-------------------------------[2 MB|03/2012]\nProteins:\n 3:[3] proteins.Nemve1FilteredModels1.fasta.gz--------------------[5 MB|03/2012]\nTranscripts:\n 3:[4] transcripts.Nemve1FilteredModels1.fasta.gz-----------------[8 MB|03/2012]\n\nEnter file selection ('q' to quit, 'usage' to review syntax, 'a' for all, 'r' for regex-based filename matching):\n\u003e 2:3;3:1\nTotal download size for 2 files: 84.02 MB\nContinue? (y/n/[p]review files): y\nDownloading 'Nemve1.FilteredModels1.gff.gz' using command:\ncurl -m 120 'https://genome.jgi.doe.gov/portal/Nemve1/download/Nemve1.FilteredModels1.gff.gz' -b cookies \u003e Nemve1.FilteredModels1.gff.gz\n  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n                                 Dload  Upload   Total   Spent    Left  Speed\n100 3078k  100 3078k    0     0  4918k      0 --:--:-- --:--:-- --:--:-- 4918k\nDownloading 'Nemve1.fasta.gz' using command:\ncurl -m 120 'https://genome.jgi.doe.gov/portal/Nemve1/download/Nemve1.fasta.gz' -b cookies \u003e Nemve1.fasta.gz\n  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n                                 Dload  Upload   Total   Spent    Left  Speed\n100 81.0M  100 81.0M    0     0  5320k      0  0:00:15  0:00:15 --:--:-- 2881k\nFinished downloading 2 files.\nDecompress all downloaded files? (y/n/k=decompress and keep original): y\nFinished decompressing all files.\nKeep temporary files ('Nemve1_jgi_index.xml' and 'cookies')? (y/n): n\nRemoving temp files and exiting\n\n~ took 1m 17s \n➜ \n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fglarue%2Fjgi-query","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fglarue%2Fjgi-query","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fglarue%2Fjgi-query/lists"}