{"id":13752161,"url":"https://github.com/lh3/cgranges","last_synced_at":"2025-05-08T00:08:59.280Z","repository":{"id":34658549,"uuid":"181992573","full_name":"lh3/cgranges","owner":"lh3","description":"A C/C++ library for fast interval overlap queries (with a \"bedtools coverage\" example)","archived":false,"fork":false,"pushed_at":"2024-05-28T21:47:37.000Z","size":83,"stargazers_count":167,"open_issues_count":4,"forks_count":18,"subscribers_count":11,"default_branch":"master","last_synced_at":"2025-05-08T00:08:50.638Z","etag":null,"topics":["algorithm","bioinformatics","genomics"],"latest_commit_sha":null,"homepage":"","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lh3.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-04-18T01:11:13.000Z","updated_at":"2025-05-07T05:35:17.000Z","dependencies_parsed_at":"2022-08-08T01:16:08.623Z","dependency_job_id":"51e51c5c-49cc-4671-8f60-2db202d706ae","html_url":"https://github.com/lh3/cgranges","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fcgranges","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fcgranges/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fcgranges/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fcgranges/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lh3","download_url":"https://codeload
.github.com/lh3/cgranges/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252973690,"owners_count":21834108,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["algorithm","bioinformatics","genomics"],"created_at":"2024-08-03T09:01:00.602Z","updated_at":"2025-05-08T00:08:59.228Z","avatar_url":"https://github.com/lh3.png","language":"C","readme":"## Introduction\n\ncgranges is a small C library for genomic interval overlap queries: given a\ngenomic region *r* and a set of regions *R*, find all regions in *R* that\noverlap *r*. Although this library is based on the [interval tree][itree], a\nwell-known data structure, the core algorithm of cgranges is distinct from all\nexisting implementations to the best of our knowledge.  Specifically, the\ninterval tree in cgranges is implicitly encoded as a plain sorted array\n(similar to a [binary heap][bheap] but packed differently). Tree\ntraversal is achieved by jumping between array indices. This treatment makes\ncgranges very efficient and compact in memory. The core algorithm can be\nimplemented in ~50 lines of C++ code, much shorter than others as well. Please\nsee the code comments in [cpp/IITree.h](cpp/IITree.h) for details.\n\n## Usage\n\n### Test with BED coverage\n\nFor testing purposes, this repo implements the [bedtools coverage][bedcov] tool\nwith cgranges. The source code is located in the [test/](test) directory. 
You\ncan compile and run the test with:\n```sh\ncd test \u0026\u0026 make\n./bedcov-cr test1.bed test2.bed\n```\nThe first BED file is loaded into RAM and indexed. The depth and the breadth of\ncoverage of each region in the second file are computed by querying the\nindex of the first file.\n\nThe [test/](test) directory also contains a few other implementations based on\n[IntervalTree.h][ekg-itree] in C++, [quicksect][quicksect] in Cython and\n[ncls][ncls] in Cython. The table below shows timing and peak memory on two\ntest BEDs available on the release page. The first BED contains GenCode\nannotations with ~1.2 million lines, mixing all types of features. The second\ncontains ~10 million direct-RNA mappings. Time1a/Mem1a indexes the GenCode BED\ninto memory. Time1b adds whole chromosome intervals to the GenCode BED when\nindexing. Time2/Mem2 indexes the RNA-mapping BED into memory. Numbers are\naveraged over 5 runs.\n\n|Algo.   |Lang. |Cov|Program         |Time1a|Time1b|Mem1a   |Time2 |Mem2    |\n|:-------|:-----|:-:|:---------------|-----:|-----:|-------:|-----:|-------:|\n|IAITree |C     |Y  |cgranges        |9.0s  |13.9s |19.1MB  |4.6s  |138.4MB |\n|IAITree |C++   |Y  |cpp/iitree.h    |11.1s |24.5s |22.4MB  |5.8s  |160.4MB |\n|CITree  |C++   |Y  |IntervalTree.h  |17.4s |17.4s |27.2MB  |10.5s |179.5MB |\n|IAITree |C     |N  |cgranges        |7.6s  |13.0s |19.1MB  |4.1s  |138.4MB |\n|AIList  |C     |N  |3rd-party/AIList|7.9s  |8.1s  |14.4MB  |6.5s  |104.8MB |\n|NCList  |C     |N  |3rd-party/NCList|13.0s |13.4s |21.4MB  |10.6s |183.0MB |\n|AITree  |C     |N  |3rd-party/AITree|16.8s |18.4s |73.4MB  |27.3s |546.4MB |\n|IAITree |Cython|N  |cgranges        |56.6s |63.9s |23.4MB  |43.9s |143.1MB |\n|binning |C++   |Y  |bedtools        |201.9s|280.4s|478.5MB |149.1s|3438.1MB|\n\nHere, IAITree = implicit augmented interval tree, used by cgranges;\nCITree = centered interval tree, used by [Erik Garrison's\nIntervalTree][ekg-itree]; AIList = augmented interval list, by 
[Feng et\nal][ailist]; NCList = nested containment list, taken from [ncls][ncls] by Feng\net al; AITree = augmented interval tree, from [kerneltree][kerneltree].\n\"Cov\" indicates whether the program calculates breadth of coverage.\nComments:\n\n* AIList keeps start and end only. IAITree and CITree additionally store a\n  4-byte \"ID\" field per interval to reference the source of the interval. This is\n  partly why AIList uses the least memory.\n\n* IAITree is more sensitive to the worst case: the presence of an interval\n  spanning the whole chromosome.\n\n* IAITree uses an efficient radix sort. CITree uses std::sort from STL, which\n  is ok. AIList and NCList use qsort from libc, which is slow. Faster sorting\n  leads to faster indexing.\n\n* IAITree in C++ uses the same core algorithm as the C version, but limited by\n  its APIs, it wastes time on memory locality and management. CITree has a\n  similar issue.\n\n* Computing coverage is better done when the returned list of intervals is\n  sorted by start. IAITree returns a sorted list. CITree doesn't. Not sure about\n  others. Computing coverage takes a couple of seconds. Sorting will be slower.\n\n* Printing intervals also takes a noticeable fraction of time. A custom printf\n  equivalent would be faster.\n\n* IAITree+Cython is a wrapper around the C version of cgranges. Cython adds\n  significant overhead.\n\n* Bedtools is designed for a variety of applications in addition to computing\n  coverage. It may keep other information in its internal data structure. This\n  micro-benchmark may be unfair to bedtools.\n\n* In general, the performance is affected a lot by subtle implementation\n  details. CITree, IAITree, NCList and AIList are all broadly comparable in\n  performance. 
AITree is not recommended when indexed intervals are immutable.\n\n### Use cgranges as a C library\n\n```c\ncgranges_t *cr = cr_init(); // initialize a cgranges_t object\ncr_add(cr, \"chr1\", 20, 30, 0); // add a genomic interval\ncr_add(cr, \"chr2\", 10, 30, 1);\ncr_add(cr, \"chr1\", 10, 25, 2);\ncr_index(cr); // index\n\nint64_t i, n, *b = 0, max_b = 0;\nn = cr_overlap(cr, \"chr1\", 15, 22, \u0026b, \u0026max_b); // overlap query; output array b[] can be reused\nfor (i = 0; i \u003c n; ++i) // traverse overlapping intervals\n\tprintf(\"%d\\t%d\\t%d\\n\", cr_start(cr, b[i]), cr_end(cr, b[i]), cr_label(cr, b[i]));\nfree(b); // b[] is allocated by malloc() inside cr_overlap(), so needs to be freed with free()\n\ncr_destroy(cr);\n```\n\n### Use IITree as a C++ library\n\n```cpp\nIITree\u003cint, int\u003e tree;\ntree.add(12, 34, 0); // add an interval\ntree.add(0, 23, 1);\ntree.add(34, 56, 2);\ntree.index(); // index\nstd::vector\u003csize_t\u003e a;\ntree.overlap(22, 25, a); // retrieve overlaps\nfor (size_t i = 0; i \u003c a.size(); ++i)\n\tprintf(\"%d\\t%d\\t%d\\n\", tree.start(a[i]), tree.end(a[i]), tree.data(a[i]));\n```\n\n## Cite cgranges\n\nThis library is integrated into [bedtk][bedtk], which is published in:\n\u003e Li H and Rong J (2021) Bedtk: finding interval overlap with implicit interval tree.\n\u003e *Bioinformatics*, **37**:1315-1316\n\n[bedcov]: https://bedtools.readthedocs.io/en/latest/content/tools/coverage.html\n[ekg-itree]: https://github.com/ekg/intervaltree\n[quicksect]: https://github.com/brentp/quicksect\n[ncls]: https://github.com/hunt-genes/ncls\n[citree]: https://en.wikipedia.org/wiki/Interval_tree#Centered_interval_tree\n[itree]: https://en.wikipedia.org/wiki/Interval_tree\n[bheap]: https://en.wikipedia.org/wiki/Binary_heap\n[ailist]: https://www.biorxiv.org/content/10.1101/593657v1\n[kerneltree]: https://github.com/biocore-ntnu/kerneltree\n[bedtk]: https://github.com/lh3/bedtk\n","funding_links":[],"categories":["Ranked by starred 
repositories"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flh3%2Fcgranges","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flh3%2Fcgranges","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flh3%2Fcgranges/lists"}