{"id":23787778,"url":"https://github.com/schultzm/entrez_direct_tut","last_synced_at":"2026-04-15T00:30:18.450Z","repository":{"id":201480951,"uuid":"175515248","full_name":"schultzm/entrez_direct_tut","owner":"schultzm","description":"Tutorial on using E-utilities","archived":false,"fork":false,"pushed_at":"2024-11-25T00:54:56.000Z","size":61,"stargazers_count":20,"open_issues_count":0,"forks_count":1,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-01-01T15:17:36.771Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/schultzm.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2019-03-13T23:39:20.000Z","updated_at":"2024-12-30T23:07:55.000Z","dependencies_parsed_at":null,"dependency_job_id":"3a6ce53f-8df7-4a23-83c0-84d0c49d4fed","html_url":"https://github.com/schultzm/entrez_direct_tut","commit_stats":null,"previous_names":["schultzm/entrez_direct_tut"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/schultzm%2Fentrez_direct_tut","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/schultzm%2Fentrez_direct_tut/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/schultzm%2Fentrez_direct_tut/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/schultzm%2Fentrez_direct_tut/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/schultzm","download_url":"https://codeload.github.com/schultzm/entrez_direct_tut/tar.gz/refs/h
eads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240004722,"owners_count":19732631,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-01-01T15:17:40.391Z","updated_at":"2026-04-15T00:30:18.327Z","avatar_url":"https://github.com/schultzm.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# Extended tutorial from NCBI\n\nFor an in-depth tutorial on the edirect software, refer to [https://www.ncbi.nlm.nih.gov/books/NBK179288/](https://www.ncbi.nlm.nih.gov/books/NBK179288/).\n\nFor an in-depth tutorial on the xtract software, refer to [https://dataguide.nlm.nih.gov/edirect/xtract.html](https://dataguide.nlm.nih.gov/edirect/xtract.html)\n\n# Custom entrez_direct tutorial\n\nThe tutorial below aims to give a basic overview of the Entrez Direct (edirect) Representational State Transfer (REST) Application Programming Interface (API).  This system, known as `edirect`, is used to access the National Center for Biotechnology Information (NCBI) Entrez database.  Entrez is a web-accessible molecular biology database that provides integrated access to nucleotide and protein sequence data, gene-centered and genomic mapping information, 3D structure data, PubMed MEDLINE, and more.\n[https://www.ncbi.nlm.nih.gov/Web/Search/entrezfs.html](https://www.ncbi.nlm.nih.gov/Web/Search/entrezfs.html)\n\n\n\n## Install software\n### edirect\n\nFirst up, we need to install the edirect suite of tools.  The software is written in the [Perl](https://www.perl.org/) programming language.  
[Instructions for installation](https://www.ncbi.nlm.nih.gov/books/NBK179288/) are copied below.  Paste this command into a terminal window and press Enter.\n\n\n```\nsh -c \"$(wget -q ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh -O -)\"\n```\n\nEither add the install directory to your `PATH`, or `mv` the folder to, e.g., `/usr/local/bin`.\n\nSuccessful results of any edirect query are returned to standard output (stdout) as human-readable text in [xml](https://www.sitepoint.com/really-good-introduction-xml/), [json](https://en.wikipedia.org/wiki/JSON) or [asn.1](https://www.ncbi.nlm.nih.gov/Structure/asn1.html) format.  Errors are returned to standard error (stderr).  We can use the tool `xtract` to parse xml output.\n\n### xtract\n\n`xtract` is installed as part of the edirect suite.\n\n## edirect Functions\n\nThe edirect functions allow you to query the NCBI Entrez database from the command line.  This approach is powerful in that customised queries can be automated and reproduced by writing them into scripts.\n\nThe following navigation functions support exploration within the Entrez databases:\n\n`esearch` performs a new Entrez search using terms in indexed fields.\n\n`elink` looks up neighbors (within a database) or links (between databases).\n\n`efilter` filters or restricts the results of a previous query.\n\nRecords can be retrieved in specified formats or as document summaries:\n\n`efetch` downloads records or reports in a designated format.\n\nDesired fields from XML results can be extracted without writing a program:\n\n`xtract` converts EDirect XML output into a table of data values.\n\nSeveral additional functions are also provided:\n\n`einfo` obtains information on indexed fields in an Entrez database.\n\n`epost` uploads unique identifiers (UIDs) or sequence accession numbers.\n\n`nquire` sends a URL request to a web page or CGI service.\n\nThe stdout of one function can be piped to the standard input (stdin) of another, allowing creativity on the part 
of the end user.  To get help on any function, run `functionname -help`, for example `efetch -help`.\n\n\n## Example 1: Get RefSeq assemblies from a BioProject\n\n### Check existence of BioProject and return a DocumentSummary of this record\nIn this example we will examine BioProject PRJNA429695.  We will build up a command that connects the stdout of `esearch` to the stdin of `elink`, whose stdout goes to the stdin of `efetch`, whose stdout is in turn passed to `xtract` to grab the desired fields.  Some knowledge of `bash` will help here.\n\nFirst, check that the target BioProject exists by searching for the PRJ accession using `esearch`:\n\n```\nesearch -db bioproject -query PRJNA429695\n```\n\nThe stdout shows a count of `1`, indicating a single hit, after `1` step:\n```\n\u003cENTREZ_DIRECT\u003e\n  \u003cDb\u003ebioproject\u003c/Db\u003e\n  \u003cWebEnv\u003eNCID_1_34836435_130.14.22.76_9001_1552959566_1290483952_0MetA0_S_MegaStore\u003c/WebEnv\u003e\n  \u003cQueryKey\u003e1\u003c/QueryKey\u003e\n  \u003cCount\u003e1\u003c/Count\u003e\n  \u003cStep\u003e1\u003c/Step\u003e\n\u003c/ENTREZ_DIRECT\u003e\n```\n\nTo fetch the document summary for this record, pipe the stdout of `esearch` to the stdin of `efetch` using the pipe character (`|`):\n```\nesearch -db bioproject -query PRJNA429695 | efetch -format docsum\n```\n\nThe result should be\n\n```\n\u003c?xml version=\"1.0\" encoding=\"UTF-8\" ?\u003e\n\u003c!DOCTYPE DocumentSummarySet PUBLIC \"-//NLM//DTD esummary bioproject 20140903//EN\" \"https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20140903/esummary_bioproject.dtd\"\u003e\n\n\u003cDocumentSummarySet status=\"OK\"\u003e\n\u003cDbBuild\u003eBuild190318-0700.1\u003c/DbBuild\u003e\n\n\u003cDocumentSummary\u003e\u003cId\u003e429695\u003c/Id\u003e\n\t\u003cTaxId\u003e0\u003c/TaxId\u003e\n\t\u003cProject_Id\u003e429695\u003c/Project_Id\u003e\n\t\u003cProject_Acc\u003ePRJNA429695\u003c/Project_Acc\u003e\n\t\u003cProject_Type\u003ePrimary 
submission\u003c/Project_Type\u003e\n\t\u003cProject_Data_Type\u003eGenome sequencing\u003c/Project_Data_Type\u003e\n\t\u003cSort_By_ProjectType\u003e78095\u003c/Sort_By_ProjectType\u003e\n\t\u003cSort_By_DataType\u003e99665\u003c/Sort_By_DataType\u003e\n\t\u003cSort_By_Organism\u003e311278\u003c/Sort_By_Organism\u003e\n\t\u003cProject_Subtype\u003e\u003c/Project_Subtype\u003e\n\t\u003cProject_Target_Scope\u003eMultispecies\u003c/Project_Target_Scope\u003e\n\t\u003cProject_Target_Material\u003eGenome\u003c/Project_Target_Material\u003e\n\t\u003cProject_Target_Capture\u003eWhole\u003c/Project_Target_Capture\u003e\n\t\u003cProject_MethodType\u003eSequencing\u003c/Project_MethodType\u003e\n\t\u003cProject_Method\u003e\u003c/Project_Method\u003e\n\t\u003cProject_Objectives_List\u003e\n\t\t\u003cProject_Objectives_Struct\u003e\n\t\t\t\u003cProject_ObjectivesType\u003eSequence\u003c/Project_ObjectivesType\u003e\n\t\t\t\u003cProject_Objectives\u003e\u003c/Project_Objectives\u003e\n\t\t\u003c/Project_Objectives_Struct\u003e\n\t\u003c/Project_Objectives_List\u003e\n\t\u003cRegistration_Date\u003e2018/01/12 00:00\u003c/Registration_Date\u003e\n\t\u003cProject_Name\u003e\u003c/Project_Name\u003e\n\t\u003cProject_Title\u003eStenotrophomonas maltophilia complex genospecies 1 and genospecies 2\u003c/Project_Title\u003e\n\t\u003cProject_Description\u003eComplete genome sequences of ten Mexican environmental isolates of the Stenotrophomonas maltophilia complex classified as genospecies 1 and genospecies 2 by multilocus sequence 
analysis.\u003c/Project_Description\u003e\n\t\u003cKeyword\u003e\u003c/Keyword\u003e\n\t\u003cRelevance_Agricultural\u003e\u003c/Relevance_Agricultural\u003e\n\t\u003cRelevance_Medical\u003e\u003c/Relevance_Medical\u003e\n\t\u003cRelevance_Industrial\u003e\u003c/Relevance_Industrial\u003e\n\t\u003cRelevance_Environmental\u003eyes\u003c/Relevance_Environmental\u003e\n\t\u003cRelevance_Evolution\u003e\u003c/Relevance_Evolution\u003e\n\t\u003cRelevance_Model\u003e\u003c/Relevance_Model\u003e\n\t\u003cRelevance_Other\u003e\u003c/Relevance_Other\u003e\n\t\u003cOrganism_Name\u003e\u003c/Organism_Name\u003e\n\t\u003cOrganism_Strain\u003e\u003c/Organism_Strain\u003e\n\t\u003cOrganism_Label\u003e\u003c/Organism_Label\u003e\n\t\u003cSequencing_Status\u003eChromosome(s)\u003c/Sequencing_Status\u003e\n\t\u003cSubmitter_Organization\u003eCentro de Ciencias Genomicas - UNAM\u003c/Submitter_Organization\u003e\n\t\u003cSubmitter_Organization_List\u003e\n\t\t\u003cstring\u003eCentro de Ciencias Genomicas - UNAM\u003c/string\u003e\n\t\u003c/Submitter_Organization_List\u003e\n\t\u003cSupergroup\u003e\u003c/Supergroup\u003e\n\u003c/DocumentSummary\u003e\n\n\u003c/DocumentSummarySet\u003e\n```\n\nOptionally, format the result as `json` (to be parsed using json parsers instead of the xml parsers described in this tutorial).\n\n```\nesearch -db bioproject -query PRJNA429695 | efetch -format docsum -mode json\n```\n\n### Find accession numbers of BioSamples in the BioProject\n\nNow that we know the BioProject accession is valid, let's get the BioSample accession numbers from the BioProject to work with in our downstream searches. 
To this end, we will link the results of the `esearch` on BioProject to the BioSample database using `elink`.\n\n```\nesearch -db bioproject -query PRJNA429695 | elink -target biosample\n```\n\nThe result shows a count of `10` hits, reached in `2` steps.\n\n```\n\u003cENTREZ_DIRECT\u003e\n  \u003cDb\u003ebiosample\u003c/Db\u003e\n  \u003cWebEnv\u003eNCID_1_34881429_130.14.18.97_9001_1552960223_1142337136_0MetA0_S_MegaStore\u003c/WebEnv\u003e\n  \u003cQueryKey\u003e3\u003c/QueryKey\u003e\n  \u003cCount\u003e10\u003c/Count\u003e\n  \u003cStep\u003e2\u003c/Step\u003e\n\u003c/ENTREZ_DIRECT\u003e\n```\n\nWe will now parse the document summaries of the above 10 hits to get the accessions using `xtract`.  Note: replace the `xtract.Linux` command with whatever is required to run `xtract` on your system.\n```\nesearch -db bioproject -query PRJNA429695 | elink -target biosample | efetch -format docsum | xtract.Linux -pattern DocumentSummary -block Accession -element Accession\n```\n\nThe result is\n```\nSAMN08357826\nSAMN08357825\nSAMN08357824\nSAMN08357823\nSAMN08357822\nSAMN08357821\nSAMN08357820\nSAMN08357819\nSAMN08357818\nSAMN08357817\n```\n\nGreat, we now have the BioSample accessions.  Our next problem is getting and keeping a record of which RefSeq assembly accessions and strains correspond to these BioSamples.  Let's store the BioSample accessions from the search in the variable BIOSAMPLES using bash command substitution (`$(dostuff)`).\n\n`BIOSAMPLES=$(esearch -db bioproject -query PRJNA429695 | elink -target biosample | efetch -format docsum | xtract.Linux -pattern DocumentSummary -block Accession -element Accession | xargs)`\n\nLook at the variable with `echo ${BIOSAMPLES}`.\n\nNow iterate through the variable BIOSAMPLES, querying Entrez for each BIOSAMPLE using the edirect functions.  
At each iteration, capture the DocumentSummary in the variable DOCSUM (so that we can extract from this variable rather than use the slower method of re-querying Entrez).  Using `xtract` we will parse DOCSUM and extract the `STRAIN` and `ASSEMBLY` info, storing the metadata in the file `MDATA`.\n\n```\nMDATA=\"mdata.tab\"\necho -e \"BioSample\\tStrain\\tAssembly\" \u003e\u003e ${MDATA} #Put a header in the file\nfor BIOSAMPLE in ${BIOSAMPLES[@]}\ndo\n    DOCSUM=$(esearch -db assembly -query ${BIOSAMPLE} | efetch -format docsum)\n    STRAIN=$(echo ${DOCSUM} | xtract.Linux -pattern DocumentSummary -block Infraspecie -element Sub_value)\n    ASSEMBLY=$(echo ${DOCSUM} | xtract.Linux -pattern DocumentSummary -block Synonym -element RefSeq)\n    echo -e ${BIOSAMPLE}'\\t'${STRAIN}'\\t'${ASSEMBLY} \u003e\u003e ${MDATA}\ndone\n```\n\nFinally, iterate through the lines of the MDATA file, `efetch` the GenBank record for each `ASSEMBLY`, and save each result in `[ASSEMBLY].gbk`.  This operation will be run as three parallel jobs using [GNU Parallel](https://www.gnu.org/software/parallel/).  
 \n\n```\n#Skip the header line of the metadata file\ntail -n +2 ${MDATA} | while read BS ST AS\ndo\n    echo \"esearch -db nucleotide -query ${AS} | efetch -format gbwithparts \u003e ${AS}.gbk\"\ndone | parallel -j 3 --bar {}\n```\n\nPutting it all together, we would run the following block of commands:\n\n```\nBIOSAMPLES=$(esearch -db bioproject -query PRJNA429695 | elink -target biosample | efetch -format docsum | xtract.Linux -pattern DocumentSummary -block Accession -element Accession | xargs)\nMDATA=\"mdata.tab\"\necho -e \"BioSample\\tStrain\\tAssembly\" \u003e\u003e ${MDATA}\nfor BIOSAMPLE in ${BIOSAMPLES[@]}\ndo\n    DOCSUM=$(esearch -db assembly -query ${BIOSAMPLE} | efetch -format docsum)\n    STRAIN=$(echo ${DOCSUM} | xtract.Linux -pattern DocumentSummary -block Infraspecie -element Sub_value)\n    ASSEMBLY=$(echo ${DOCSUM} | xtract.Linux -pattern DocumentSummary -block Synonym -element RefSeq)\n    echo -e ${BIOSAMPLE}'\\t'${STRAIN}'\\t'${ASSEMBLY} \u003e\u003e ${MDATA}\ndone\n\n#Skip the header line of the metadata file\ntail -n +2 ${MDATA} | while read BS ST AS\ndo\n    echo \"esearch -db nucleotide -query ${AS} | efetch -format gbwithparts \u003e ${AS}.gbk\"\ndone | parallel -j 3 --bar {}\n```\n\n## Example 2: Given a set of ENA sample accessions, get RefSeq assemblies from NCBI\n\nIn this example, we have only a list of ERS accessions from the ENA, but we want the GenBank-formatted RefSeq assemblies from NCBI.  
Here is one way to achieve this goal:\n\n```\n#ERS from the query table\narr=$(echo \"ERS381042\nERS380926\nERS381069\nERS380950\nERS381034\nERS381155\nERS381123\nERS381163\nERS381278\nERS380992\nERS381092\nERS380985\nERS381012\nERS381151\nERS380916\nERS381212\nERS381145\nERS380996\nERS380936\nERS380970\nERS381180\nERS381255\nERS381142\nERS380986\nERS380973\nERS381264\nERS381025\nERS381003\nERS380984\nERS380962\nERS380975\nERS381005\nERS380951\nERS381013\nERS381153\nERS381156\nERS381096\nERS380915\nERS380935\nERS381081\nERS381082\" | xargs);\n#BioProject number\nPRJ=\"PRJEB5065\"\n\nfor ERS in ${arr[@]}\ndo BIOSAMPLE=$(esearch -db biosample -query ${ERS} | efetch -format docsum | xtract.Linux -pattern DocumentSummary -block Accession -element Accession)\n#Assembly for BioSample\nASSEMBLY=$(esearch -db assembly -query ${BIOSAMPLE} | efetch -format docsum | xtract.Linux -pattern DocumentSummary -element AssemblyAccession)\necho -e ${ERS}'\\t'${BIOSAMPLE}'\\t'${ASSEMBLY}\ndone \u003e ${PRJ}.mdata.tab\n\ncat ${PRJ}.mdata.tab | while read ERSAMPLE BIOSAMPLE ASSEMBLY;\ndo echo \"esearch -db nucleotide -query ${ASSEMBLY} | efetch -format gbwithparts \u003e ${ERSAMPLE}.gbk\" #swap gbwithparts for fasta if you prefer a fasta file\ndone | parallel -j 3 --bar {}\n```\n\n\n## Example 3: Given an NCBI BioProject accession, get the read sets from NCBI SRA\n\n### Understanding SRA\n\n[help!](https://www.ncbi.nlm.nih.gov/books/NBK56913/)\n\n### The example  \n\nDownload read sets using `fastq-dump` (but consider using [ascp](https://www.ncbi.nlm.nih.gov/books/NBK158899/) since fastq-dump is slow)\n```\nMDATA=\"accessions_demo.txt\"\nesearch -db bioproject -query PRJNA383436 | elink -target biosample | efetch -format docsum | xtract.Linux -pattern DocumentSummary -block Ids -element Id -group SRA \u003e ${MDATA}\nSRSs=$(cat ${MDATA} | while read LINE\ndo\n  NCOL=$(echo ${LINE} | wc -w)\n  ACC=$(echo ${LINE} | cut -d ' ' -f ${NCOL})\n  echo ${ACC}\ndone | xargs)\n\nfor SRS in 
${SRSs[@]}\ndo\n  SRR=$(esearch -db SRA -query ${SRS} | efetch -format runinfo -mode xml | xtract.Linux -pattern Run -element Run)\n  echo \"fastq-dump --split-3 --gzip ${SRR}\"\ndone \u003e fastqdump.txt\nparallel -j 3 --bar {} :::: fastqdump.txt\n```\n\n## Example 3.1:  Given a BioProject accession, download certain parts of the read metadata\n\n```{bash}\nesearch -db bioproject -query PRJNA613958 | elink -db sra -target sra | efetch -format docsum | xtract.Linux -pattern DocumentSummary -element 'Bioproject,Biosample,Submitter@acc,Study@acc,Sample@acc,Run@acc,Experiment@acc,Platform@instrument_model'\n```\n\n## Example 3.2: Given a BioProject accession, download BioSample attributes\n\n```{bash}\nesearch -db bioproject -query PRJNA613958 | elink -target biosample | efetch -format docsum | xtract -pattern SampleData -element Attribute\n```\n\n## Example 4: Given some accessions from a browse of patricbrc.org, get the assemblies in GenBank format\n### What is PATRIC?\n\nSee [here](https://patricbrc.org)\n\n### The example  \n\nGrab the accessions from a downloaded metadata table for _E. coli_ ST405 and download only the chromosomes (i.e., only the first accession in each row).  
Do this using `gnu parallel`:\n\n```\nparallel --bar -j 3 \"esearch -db nucleotide -query {} | efetch -format gbwithparts \u003e ref_genomes/{}.gbk\" ::: $(echo \"CP021202,CP021203,CP021204,CP021205,CP021206\nCP023960,CP023959,CP023961,CP023957,CP023958\nCP027134,CP027130,CP027132,CP027131,CP027133,CP027129\nCP029579,CP029580,CP029581\nCP032261,CP032258,CP032259,CP032260,CP032262\" | cut -d ',' -f 1)\n```\n\n## Example 5: Given a species name, find all available SRR data\n\n`esearch -db sra -query 'Bifidobacterium longum' | efetch -format docsum | grep SRR | cut -d '\"' -f 2`\n\nOutput should look something like this (truncated):  \n\n```\nSRR8949028\nSRR8834650\nSRR8832718\nSRR4380096\nSRR4380095\nSRR4380094\n...\n```\n\n## Example 6: Given an SRA accession, get the download links for the reads in their submitted format (if submitted originally to NCBI)\n\n`efetch -db sra -id SRR14311695 | xtract -pattern Alternatives -block Alternatives -if Alternatives@url -ends-with '.gz' -element Alternatives@url`\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fschultzm%2Fentrez_direct_tut","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fschultzm%2Fentrez_direct_tut","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fschultzm%2Fentrez_direct_tut/lists"}