{"id":20425051,"url":"https://github.com/databio/igd","last_synced_at":"2025-04-12T18:54:30.416Z","repository":{"id":54453931,"uuid":"132895807","full_name":"databio/IGD","owner":"databio","description":"A high-performance search engine for large-scale genomic interval datasets","archived":false,"fork":false,"pushed_at":"2021-09-01T13:40:50.000Z","size":928,"stargazers_count":17,"open_issues_count":7,"forks_count":4,"subscribers_count":13,"default_branch":"master","last_synced_at":"2025-03-26T13:11:22.224Z","etag":null,"topics":["genomic-intervals"],"latest_commit_sha":null,"homepage":"http://dx.doi.org/10.1093/bioinformatics/btaa1062 ","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/databio.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-05-10T12:15:34.000Z","updated_at":"2023-05-15T15:37:22.000Z","dependencies_parsed_at":"2022-08-13T16:10:33.997Z","dependency_job_id":null,"html_url":"https://github.com/databio/IGD","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databio%2FIGD","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databio%2FIGD/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databio%2FIGD/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databio%2FIGD/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/databio","download_url":"https://codeload.github.com/databio/IGD/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind"
:"github","repositories_count":248618262,"owners_count":21134200,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["genomic-intervals"],"created_at":"2024-11-15T07:12:08.062Z","updated_at":"2025-04-12T18:54:30.393Z","avatar_url":"https://github.com/databio.png","language":"C","funding_links":[],"categories":[],"sub_categories":[],"readme":"# IGD: A high-performance search engine for large-scale genomic interval datasets\n\n## Summary\n\nDatabases of large-scale genome projects now contain thousands of genomic interval datasets. These data are a critical resource for understanding the function of DNA. However, our ability to examine and integrate interval data of this scale is limited. Here, we introduce the integrated genome database (IGD), a method and tool for searching genome interval datasets more than three orders of magnitude faster than existing approaches, while using only one hundredth of the memory. IGD uses a novel linear binning method that allows us to scale analysis to billions of genomic regions. \n\n## Citation\n\nIf you use IGD in your research, please cite:\n\n\nJianglin Feng, Nathan C Sheffield. IGD: high-performance search for large-scale genomic interval datasets. 
*Bioinformatics*, Volume 37, Issue 1, 1 January 2021, Pages 118–120, https://doi.org/10.1093/bioinformatics/btaa1062\n\nPreprint: https://www.biorxiv.org/content/10.1101/2020.06.08.139758v1\n\n\n## How to build iGD\n\nIf zlib is not already installed, install it:\n```\nsudo apt-get install libpng12-0\n```\nThen:\n```\ngit clone https://github.com/databio/iGD.git\ncd iGD\nmake\n```\nThe executable `igd` is built in the subfolder `bin`. Copy it to `/usr/local/bin` afterwards.\n\n## How to run iGD\n\n### 1. Create iGD database\n \n#### 1.1 Create iGD database from a genome data source folder\n```\nigd create \"/path/to/data_source_folder/*\" \"/path/to/igd_folder/\" \"databaseName\" [option]\n\nwhere:\n\n- \"/path/to/data_source_folder/\" is the path of the folder that contains the .bed.gz or .bed data files\n\n- \"/path/to/igd_folder/\" is the path to the output igd folder\n\n- \"databaseName\" is the name you give to the database, for example, \"roadmap\"\n\noption:\n\n-b: bin size as a power of 2 (default 14, which is 2^14 = 16384 bp)\n```\n#### 1.2 Create iGD database from a list of source files\n \n```\nigd create \"/path/to/source-list file\" \"/path/to/igd_folder/\" \"databaseName\" -f [option]\n\nwhere:\n\n- \"/path/to/source-list file\" is the path to the file that lists the source files\n\n- \"/path/to/igd_folder/\" is the path to the output igd folder\n\n- \"databaseName\" is the name you give to the database, for example, \"roadmap\"\n\noption:\n\n-b: bin size as a power of 2 (default 14, which is 2^14 = 16384 bp)\n```\n\n\n### 2. 
Search iGD for overlaps\n```\nigd search \"path/to/igd_data_file\" -q \"path/to/query_file\"\n\nwhere:\n\n- path/to/igd_data_file is the path to the igd data file\n\n- path/to/query_file is the path to the query file (.bed or .bed.gz)\n\nother options:\n\n-r \u003cchrN start end\u003e (a single query)\n\n-v \u003csignal value 0-1000\u003e (signal value \u003e v)\n\n-o \u003coutput file name\u003e\n\n-s (output Seqpare similarity)\n\n-f (output full overlaps, for -q and -r only)\n\n-m (hitsmap of igd datasets)\n\n```\n\nFor a detailed example, please check out the `vignettes`.\n\n## R-wrapper of IGD\n\n### 1. Create iGD database \n#### 1.1 From a genome data source folder\n```\n\u003e library(IGDr)\n\u003e createIGD(\"/path/to/data_source_folder/*\", \"/path/to/igd_folder/\", \"databaseName\", [option])\n\nwhere:\n\n- \"/path/to/data_source_folder/\" is the path of the folder that contains the .bed.gz or .bed data files\n\n- \"/path/to/igd_folder/\" is the path to the output igd folder\n\n- \"databaseName\" is the name you give to the database, for example, \"roadmap\"\n\noptions:\n\n-b: bin size in bp (default 16384)\n```\n#### 1.2 From a file that lists the genome data source files \n```\n\u003e library(IGDr)\n\u003e createIGD_f(\"/path/to/source-list file\", \"/path/to/igd_folder/\", \"databaseName\", [option])\n\nwhere:\n\n- \"/path/to/source-list file\" is the path to the file that lists the .bed.gz or .bed data files\n\n- \"/path/to/igd_folder/\" is the path to the output igd folder\n\n- \"databaseName\" is the name you give to the database, for example, \"roadmap\"\n\noptions:\n\n-b: bin size in bp (default 16384)\n```\n\n### 2. 
Search the igd database in R (an example using an existing igd file)\n\nSearch the igd database with a single query:\n```\n\u003e igd_file = \"igdr_b14/roadmap.igd\"\n\u003e library(IGDr)\n\u003e igd \u003c- IGDr::IGDr(igd_file)\n\u003e hits \u003c- search_1r(igd, \"chr6\", 1000000, 10000000)\n\u003e hits\n```\nSearch the igd database with n queries:\n```\n\u003e igd_file = \"igdr_b14/roadmap.igd\"\n\u003e library(IGDr)\n\u003e igd \u003c- IGDr::IGDr(igd_file)\n\u003e chrms = c(\"chr6\", \"chr1\", \"chr2\")\n\u003e starts = c(10000, 100000, 1000000)\n\u003e ends = c(100000, 1000000, 10000000)\n\u003e hits \u003c- search_nr(igd, 3, chrms, starts, ends)\n\u003e hits\n```\nSearch the igd database with a whole query file (`r10000.bed` in this example):\n```\n\u003e igd_file = \"igdr_b14/roadmap.igd\"\n\u003e query_file = \"r10000.bed\"\n\u003e library(bit64)\n\u003e library(IGDr)\n\u003e fi = IGDr::getFInfo(igd_file)\n\u003e hits = integer64(fi$nFiles)\n\u003e ret = IGDr::search_all(igd_file, query_file, hits)\n\u003e for(i in 1:fi$nFiles){\n  cat(i, \"\\t\", toString(ret[i]), \"\\t\", toString(fi$fInfo[i,2]), \"\\n\")\n  }\n\u003e\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatabio%2Figd","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdatabio%2Figd","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatabio%2Figd/lists"}