{"id":21962516,"url":"https://github.com/biocpy/genomicranges","last_synced_at":"2025-08-21T09:32:36.408Z","repository":{"id":37883904,"uuid":"503531476","full_name":"BiocPy/GenomicRanges","owner":"BiocPy","description":"Container class to represent genomic locations and support genomic analysis","archived":false,"fork":false,"pushed_at":"2024-12-16T16:29:37.000Z","size":4698,"stargazers_count":17,"open_issues_count":14,"forks_count":4,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-12-16T17:39:35.803Z","etag":null,"topics":["genomicranges"],"latest_commit_sha":null,"homepage":"https://biocpy.github.io/GenomicRanges/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/BiocPy.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":"AUTHORS.md","dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-06-14T21:54:24.000Z","updated_at":"2024-10-23T19:00:57.000Z","dependencies_parsed_at":"2023-01-31T18:45:24.403Z","dependency_job_id":"a3e5b93d-40b7-4320-9254-fd124e2ac665","html_url":"https://github.com/BiocPy/GenomicRanges","commit_stats":null,"previous_names":[],"tags_count":54,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BiocPy%2FGenomicRanges","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BiocPy%2FGenomicRanges/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BiocPy%2FGenomicRanges/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BiocPy%2FGenomicRanges/manifests","o
wner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/BiocPy","download_url":"https://codeload.github.com/BiocPy/GenomicRanges/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":230507048,"owners_count":18236944,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["genomicranges"],"created_at":"2024-11-29T10:42:50.091Z","updated_at":"2025-08-21T09:32:36.401Z","avatar_url":"https://github.com/BiocPy.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![Project generated with PyScaffold](https://img.shields.io/badge/-PyScaffold-005CA0?logo=pyscaffold)](https://pyscaffold.org/)\n[![PyPI-Server](https://img.shields.io/pypi/v/GenomicRanges.svg)](https://pypi.org/project/GenomicRanges/)\n![Unit tests](https://github.com/BiocPy/GenomicRanges/actions/workflows/run-tests.yml/badge.svg)\n\n# GenomicRanges\n\nGenomicRanges provides container classes designed to represent genomic locations and support genomic analysis. It is similar to Bioconductor's [GenomicRanges](https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html).\n\nTo get started, install the package from [PyPI](https://pypi.org/project/genomicranges/)\n\n```shell\npip install genomicranges\n```\n\nSome of the methods like `read_ucsc` require optional packages to be installed, e.g. 
`joblib`, which can be installed with:\n\n```sh\npip install genomicranges[optional]\n```\n\n## `GenomicRanges`\n\n`GenomicRanges` is the base class to represent and operate over genomic regions and annotations.\n\n### From Bioinformatic file formats\n\n\u003e [!NOTE]\n\u003e When reading genomic formats, `ends` are expected to be inclusive to be consistent with Bioconductor representations (\u0026 gff). If they are not, we recommend subtracting 1 from the `ends`.\n\n#### From `biobear`\n\nAlthough the parsing capabilities in this package are limited, the [biobear](https://github.com/wheretrue/biobear) library is designed for reading and searching various bioinformatics file formats, including FASTA, FASTQ, VCF, BAM, and GFF, from local files or from an object store like S3. Users can easily convert these representations to `GenomicRanges` (or [read more here](https://www.wheretrue.dev/docs/exon/biobear/genomicranges-integration)):\n\n```python\nfrom genomicranges import GenomicRanges\nimport biobear as bb\n\nsession = bb.new_session()\n\ndf = session.read_gtf_file(\"path/to/test.gtf\").to_polars()\ndf = df.rename({\"seqname\": \"seqnames\", \"start\": \"starts\", \"end\": \"ends\"})\n\ngg = GenomicRanges.from_polars(df)\n\n# work with the GenomicRanges object\nprint(len(gg), len(df))\n```\n\n    ## output\n    ## 77 77\n\n#### UCSC or GTF file\n\nYou can easily download and parse genome annotations from UCSC or load a genome annotation from a GTF file:\n\n```python\nimport genomicranges\n\n# replace the path below with the location of your GTF file\ngr = genomicranges.read_gtf(\"path/to/test.gtf\")\n# OR\ngr = genomicranges.read_ucsc(genome=\"hg19\")\n\nprint(gr)\n```\n\n    ## output\n    ## GenomicRanges with 1760959 intervals \u0026 10 metadata columns.\n    ## ... 
truncating the console print ...\n\n### From `IRanges` (Preferred way)\n\nIf you have all the relevant information, you can construct a `GenomicRanges` object directly:\n\n```python\nfrom genomicranges import GenomicRanges\nfrom iranges import IRanges\nfrom biocframe import BiocFrame\nfrom random import random\n\ngr = GenomicRanges(\n    seqnames=[\n        \"chr1\",\n        \"chr2\",\n        \"chr3\",\n        \"chr2\",\n        \"chr3\",\n    ],\n    ranges=IRanges(start=[x for x in range(101, 106)], width=[11, 21, 25, 30, 5]),\n    strand=[\"*\", \"-\", \"*\", \"+\", \"-\"],\n    mcols=BiocFrame(\n        {\n            \"score\": range(0, 5),\n            \"GC\": [random() for _ in range(5)],\n        }\n    ),\n)\n\nprint(gr)\n```\n\n    ## output\n    GenomicRanges with 5 ranges and 5 metadata columns\n        seqnames    ranges           strand     score                  GC\n           \u003cstr\u003e \u003cIRanges\u003e \u003cndarray[int64]\u003e   \u003crange\u003e              \u003clist\u003e\n    [0]     chr1 101 - 111                * |       0  0.2593301003406461\n    [1]     chr2 102 - 122                - |       1  0.7207993213776644\n    [2]     chr3 103 - 127                * |       2 0.23391468067222065\n    [3]     chr2 104 - 133                + |       3  0.7671026589720187\n    [4]     chr3 105 - 109                - |       4 0.03355777784472458\n    ------\n    seqinfo(3 sequences): chr1 chr2 chr3\n\n### Pandas `DataFrame`\n\nA pandas `DataFrame` is a common representation for tabular data in Python. The `DataFrame` must contain the columns \"seqnames\", \"starts\", and \"ends\" to represent genomic intervals. 
Here's an example:\n\n```python\nfrom genomicranges import GenomicRanges\nimport pandas as pd\nfrom random import random\n\ndf = pd.DataFrame(\n    {\n        \"seqnames\": [\"chr1\", \"chr2\", \"chr1\", \"chr3\", \"chr2\"],\n        \"starts\": [101, 102, 103, 104, 109],\n        \"ends\": [112, 103, 128, 134, 111],\n        \"strand\": [\"*\", \"-\", \"*\", \"+\", \"-\"],\n        \"score\": range(0, 5),\n        \"GC\": [random() for _ in range(5)],\n    }\n)\n\ngr = GenomicRanges.from_pandas(df)\nprint(gr)\n```\n\n    ## output\n    GenomicRanges with 5 ranges and 5 metadata columns\n      seqnames    ranges           strand    score                  GC\n         \u003cstr\u003e \u003cIRanges\u003e \u003cndarray[int64]\u003e   \u003clist\u003e              \u003clist\u003e\n    0     chr1 101 - 111                * |      0  0.4862658925128007\n    1     chr2 102 - 102                - |      1 0.27948386889389953\n    2     chr1 103 - 127                * |      2  0.5162697718607901\n    3     chr3 104 - 133                + |      3  0.5979843806415466\n    4     chr2 109 - 110                - |      4 0.04740781186083798\n    ------\n    seqinfo(3 sequences): chr1 chr2 chr3\n\n### Polars `DataFrame`\n\nSimilarly, to initialize from a polars `DataFrame`:\n\n```python\nfrom genomicranges import GenomicRanges\nimport polars as pl\nfrom random import random\n\ndf = pl.DataFrame(\n    {\n        \"seqnames\": [\"chr1\", \"chr2\", \"chr1\", \"chr3\", \"chr2\"],\n        \"starts\": [101, 102, 103, 104, 109],\n        \"ends\": [112, 103, 128, 134, 111],\n        \"strand\": [\"*\", \"-\", \"*\", \"+\", \"-\"],\n        \"score\": range(0, 5),\n        \"GC\": [random() for _ in range(5)],\n    }\n)\n\ngr = GenomicRanges.from_polars(df)\nprint(gr)\n```\n\n    ## output\n    GenomicRanges with 5 ranges and 5 metadata columns\n      seqnames    ranges           strand    score                  GC\n         \u003cstr\u003e \u003cIRanges\u003e 
\u003cndarray[int64]\u003e   \u003clist\u003e              \u003clist\u003e\n    0     chr1 101 - 112                * |      0  0.4862658925128007\n    1     chr2 102 - 103                - |      1 0.27948386889389953\n    2     chr1 103 - 128                * |      2  0.5162697718607901\n    3     chr3 104 - 134                + |      3  0.5979843806415466\n    4     chr2 109 - 111                - |      4 0.04740781186083798\n    ------\n    seqinfo(3 sequences): chr1 chr2 chr3\n\n### Interval Operations\n\n`GenomicRanges` supports most [interval-based operations](https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html).\n\n```python\nimport genomicranges\nimport pandas as pd\nfrom genomicranges import GenomicRanges\n\nsubject = genomicranges.read_ucsc(genome=\"hg38\")\n\nquery = GenomicRanges.from_pandas(\n    pd.DataFrame(\n        {\n            \"seqnames\": [\"chr1\", \"chr2\", \"chr3\"],\n            \"starts\": [100, 115, 119],\n            \"ends\": [103, 116, 120],\n        }\n    )\n)\n\nhits = subject.nearest(query, ignore_strand=True, select=\"all\")\nprint(hits)\n```\n\n    ## output\n    BiocFrame with 3 rows and 2 columns\n            query_hits        self_hits\n        \u003cndarray[int32]\u003e \u003cndarray[int32]\u003e\n    [0]                0                0\n    [1]                1          1677082\n    [2]                2          1003411\n\n## `GenomicRangesList`\n\nAs the name suggests, a `GenomicRangesList` is a named, list-like object. Why is this class needed? A `GenomicRanges` object lets us specify multiple genomic elements, usually where genes start and end. Genes are themselves made of many sub-regions, e.g. exons. 
`GenomicRangesList` allows us to represent this nested structure.\n\n**Currently, this class is limited in functionality.**\n\nTo construct a `GenomicRangesList`:\n\n```python\nfrom genomicranges import GenomicRanges, GenomicRangesList\nfrom iranges import IRanges\nfrom biocframe import BiocFrame\n\ngr1 = GenomicRanges(\n    seqnames=[\"chr1\", \"chr2\", \"chr1\", \"chr3\"],\n    ranges=IRanges([1, 3, 2, 4], [10, 30, 50, 60]),\n    strand=[\"-\", \"+\", \"*\", \"+\"],\n    mcols=BiocFrame({\"score\": [1, 2, 3, 4]}),\n)\n\ngr2 = GenomicRanges(\n    seqnames=[\"chr2\", \"chr4\", \"chr5\"],\n    ranges=IRanges([3, 6, 4], [30, 50, 60]),\n    strand=[\"-\", \"+\", \"*\"],\n    mcols=BiocFrame({\"score\": [2, 3, 4]}),\n)\ngrl = GenomicRangesList(ranges=[gr1, gr2], names=[\"gene1\", \"gene2\"])\nprint(grl)\n```\n\n    ## output\n    GenomicRangesList with 2 ranges and 2 metadata columns\n\n    Name: gene1\n    GenomicRanges with 4 ranges and 4 metadata columns\n        seqnames    ranges           strand    score\n           \u003cstr\u003e \u003cIRanges\u003e \u003cndarray[int64]\u003e   \u003clist\u003e\n    [0]     chr1    1 - 10                - |      1\n    [1]     chr2    3 - 32                + |      2\n    [2]     chr1    2 - 51                * |      3\n    [3]     chr3    4 - 63                + |      4\n    ------\n    seqinfo(3 sequences): chr1 chr2 chr3\n\n    Name: gene2\n    GenomicRanges with 3 ranges and 3 metadata columns\n        seqnames    ranges           strand    score\n           \u003cstr\u003e \u003cIRanges\u003e \u003cndarray[int64]\u003e   \u003clist\u003e\n    [0]     chr2    3 - 32                - |      2\n    [1]     chr4    6 - 55                + |      3\n    [2]     chr5    4 - 63                * |      4\n    ------\n    seqinfo(3 sequences): chr2 chr4 chr5\n\n## Performance\n\nPerformance comparison between Python and R GenomicRanges implementations. 
The query dataset contains approximately 564,000 intervals, while the subject dataset contains approximately 71 million intervals.\n\n| Operation | Python/GenomicRanges | Python/GenomicRanges (5 threads) | R/GenomicRanges |\n|-----------|---------------------|-----------------------------------|-----------------|\n| Overlap | 3.02s | 2.13s | 4.40s |\n| Overlap (single chromosome) | 6.98s | 5.36s | 10.06s |\n| Nearest | 50.1s | 32.3s | 42.16s |\n| Nearest (single chromosome) | 15.5s | 11.4s | 11.01s |\n\n\u003e [!NOTE]\n\u003e The single chromosome benchmark ignores chromosome/sequence information and performs overlap operations solely on intervals.\n\nFor details, see the scripts in the [benchmark directory](./perf).\n\n## Further information\n\n- [Tutorial](https://biocpy.github.io/GenomicRanges/tutorial.html)\n- [API documentation](https://biocpy.github.io/GenomicRanges/api/modules.html)\n- [Bioc/GenomicRanges](https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html)\n\n\u003c!-- pyscaffold-notes --\u003e\n\n## Note\n\nThis project has been set up using PyScaffold 4.1.1. For details and usage\ninformation on PyScaffold see https://pyscaffold.org/.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbiocpy%2Fgenomicranges","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbiocpy%2Fgenomicranges","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbiocpy%2Fgenomicranges/lists"}