{"id":21962514,"url":"https://github.com/biocpy/biocframe","last_synced_at":"2025-04-23T21:24:29.574Z","repository":{"id":57745555,"uuid":"519037986","full_name":"BiocPy/BiocFrame","owner":"BiocPy","description":"Bioconductor-like data frames","archived":false,"fork":false,"pushed_at":"2025-04-21T16:28:25.000Z","size":3756,"stargazers_count":4,"open_issues_count":1,"forks_count":3,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-21T17:35:32.729Z","etag":null,"topics":["dataframe"],"latest_commit_sha":null,"homepage":"https://biocpy.github.io/BiocFrame/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/BiocPy.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":"AUTHORS.md","dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2022-07-29T01:02:26.000Z","updated_at":"2025-03-29T02:19:19.000Z","dependencies_parsed_at":"2023-10-13T08:31:45.888Z","dependency_job_id":"c00e99e8-b150-48c5-af8e-e2c5cae11e6b","html_url":"https://github.com/BiocPy/BiocFrame","commit_stats":{"total_commits":12,"total_committers":1,"mean_commits":12.0,"dds":0.0,"last_synced_commit":"57c494813dbb85da3c3dd411a1c6f86841decb3e"},"previous_names":[],"tags_count":57,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BiocPy%2FBiocFrame","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BiocPy%2FBiocFrame/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BiocPy%2FBiocFrame/releases","manifests_url":"https://repos.ecosyste.ms/api/
v1/hosts/GitHub/repositories/BiocPy%2FBiocFrame/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/BiocPy","download_url":"https://codeload.github.com/BiocPy/BiocFrame/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250515793,"owners_count":21443487,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataframe"],"created_at":"2024-11-29T10:42:47.872Z","updated_at":"2025-04-23T21:24:29.554Z","avatar_url":"https://github.com/BiocPy.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003c!-- These are examples of badges you might want to add to your README:\n     please update the URLs accordingly\n\n[![Built Status](https://api.cirrus-ci.com/github/\u003cUSER\u003e/BiocFrame.svg?branch=main)](https://cirrus-ci.com/github/\u003cUSER\u003e/BiocFrame)\n[![ReadTheDocs](https://readthedocs.org/projects/BiocFrame/badge/?version=latest)](https://BiocFrame.readthedocs.io/en/stable/)\n[![Coveralls](https://img.shields.io/coveralls/github/\u003cUSER\u003e/BiocFrame/main.svg)](https://coveralls.io/r/\u003cUSER\u003e/BiocFrame)\n[![PyPI-Server](https://img.shields.io/pypi/v/BiocFrame.svg)](https://pypi.org/project/BiocFrame/)\n[![Conda-Forge](https://img.shields.io/conda/vn/conda-forge/BiocFrame.svg)](https://anaconda.org/conda-forge/BiocFrame)\n[![Monthly 
Downloads](https://pepy.tech/badge/BiocFrame/month)](https://pepy.tech/project/BiocFrame)\n[![Twitter](https://img.shields.io/twitter/url/http/shields.io.svg?style=social\u0026label=Twitter)](https://twitter.com/BiocFrame)\n--\u003e\n\n[![Project generated with PyScaffold](https://img.shields.io/badge/-PyScaffold-005CA0?logo=pyscaffold)](https://pyscaffold.org/)\n[![PyPI-Server](https://img.shields.io/pypi/v/BiocFrame.svg)](https://pypi.org/project/BiocFrame/)\n![Unit tests](https://github.com/BiocPy/BiocFrame/actions/workflows/run-tests.yml/badge.svg)\n\n# Bioconductor-like data frames\n\n## Overview\n\nThis package implements the `BiocFrame` class, a Bioconductor-friendly alternative to Pandas `DataFrame`.\nThe main advantage is that the `BiocFrame` makes no assumption on the types of the columns -\nas long as an object has a length (`__len__`) and slicing methods (`__getitem__`), it can be used inside a `BiocFrame`.\nThis allows us to accept arbitrarily complex objects as columns, which is often the case in Bioconductor objects.\n\nTo get started, install the package from [PyPI](https://pypi.org/project/biocframe/):\n\n```shell\npip install biocframe\n\n# To install optional dependencies\npip install biocframe[optional]\n```\n\n## Construction\n\nTo construct a `BiocFrame` object, simply provide the data as a dictionary.\n\n```python\nfrom biocframe import BiocFrame\n\nobj = {\n    \"ensembl\": [\"ENS00001\", \"ENS00002\", \"ENS00003\"],\n    \"symbol\": [\"MAP1A\", \"BIN1\", \"ESR1\"],\n}\nbframe = BiocFrame(obj)\nprint(bframe)\n## BiocFrame with 3 rows and 2 columns\n##      ensembl symbol\n##       \u003clist\u003e \u003clist\u003e\n## [0] ENS00001  MAP1A\n## [1] ENS00002   BIN1\n## [2] ENS00003   ESR1\n```\n\nYou can specify complex objects as columns, as long as they have some \"length\" equal to the number of rows.\nFor example, we can nest a `BiocFrame` inside another `BiocFrame`:\n\n```python\nobj = {\n    \"ensembl\": [\"ENS00001\", \"ENS00002\", 
\"ENS00002\"],\n    \"symbol\": [\"MAP1A\", \"BIN1\", \"ESR1\"],\n    \"ranges\": BiocFrame({\n        \"chr\": [\"chr1\", \"chr2\", \"chr3\"],\n        \"start\": [1000, 1100, 5000],\n        \"end\": [1100, 4000, 5500]\n    }),\n}\n\nbframe2 = BiocFrame(obj, row_names=[\"row1\", \"row2\", \"row3\"])\nprint(bframe2)\n## BiocFrame with 3 rows and 3 columns\n##       ensembl symbol         ranges\n##        \u003clist\u003e \u003clist\u003e    \u003cBiocFrame\u003e\n## row1 ENS00001  MAP1A chr1:1000:1100\n## row2 ENS00002   BIN1 chr2:1100:4000\n## row3 ENS00002   ESR1 chr3:5000:5500\n```\n\n## Extracting data\n\nProperties can be accessed directly from the object:\n\n```python\nprint(bframe.shape)\n## (3, 2)\n\nprint(bframe.get_column_names())\n## ['ensembl', 'symbol']\n\nprint(bframe.column_names) # same as above\n## ['ensembl', 'symbol']\n```\n\nWe can fetch individual columns:\n\n```python\nbframe.get_column(\"ensembl\")\n## ['ENS00001', 'ENS00002', 'ENS00003']\n\nbframe[\"ensembl\"]\n## ['ENS00001', 'ENS00002', 'ENS00003']\n```\n\nAnd we can get individual rows as a dictionary:\n\n```python\nbframe.get_row(2)\n## {'ensembl': 'ENS00003', 'symbol': 'ESR1'}\n```\n\nTo extract a subset of the data in the `BiocFrame`, we use the subset (`[]`) operator.\nThis accepts different subsetting arguments like a boolean vector, a `slice` object, a sequence of indices, or row/column names.\n\n```python\nsliced = bframe[1:2, [True, False]]\nprint(sliced)\n## BiocFrame with 1 row and 1 column\n##      ensembl\n##       \u003clist\u003e\n## [0] ENS00002\n\nsliced = bframe[[0,2], [\"symbol\", \"ensembl\"]]\nprint(sliced)\n## BiocFrame with 2 rows and 2 columns\n##     symbol  ensembl\n##     \u003clist\u003e   \u003clist\u003e\n## [0]  MAP1A ENS00001\n## [1]   ESR1 ENS00003\n\n# Short-hand to get a single column:\nbframe[\"ensembl\"]\n## ['ENS00001', 'ENS00002', 'ENS00003']\n```\n\n## Setting data\n\n### Preferred approach\n\nTo set `BiocFrame` properties, we encourage a 
functional style of programming that avoids mutating the object.\nThis avoids inadvertent modification of `BiocFrame`s that are part of larger data structures.\n\n```python\nmodified = bframe.set_column_names([\"column1\", \"column2\"])\nprint(modified)\n## BiocFrame with 3 rows and 2 columns\n##      column1 column2\n##       \u003clist\u003e  \u003clist\u003e\n## [0] ENS00001   MAP1A\n## [1] ENS00002    BIN1\n## [2] ENS00003    ESR1\n\n# Original is unchanged:\nprint(bframe.get_column_names())\n## ['ensembl', 'symbol']\n```\n\nTo add new columns, or replace existing columns:\n\n```python\nmodified = bframe.set_column(\"symbol\", [\"A\", \"B\", \"C\"])\nprint(modified)\n## BiocFrame with 3 rows and 2 columns\n##      ensembl symbol\n##       \u003clist\u003e \u003clist\u003e\n## [0] ENS00001      A\n## [1] ENS00002      B\n## [2] ENS00003      C\n\nmodified = bframe.set_column(\"new_col_name\", range(2, 5))\nprint(modified)\n## BiocFrame with 3 rows and 3 columns\n##      ensembl symbol new_col_name\n##       \u003clist\u003e \u003clist\u003e      \u003crange\u003e\n## [0] ENS00001  MAP1A            2\n## [1] ENS00002   BIN1            3\n## [2] ENS00003   ESR1            4\n```\n\nChange the row or column names:\n\n```python\nmodified = bframe.\\\n    set_column_names([\"FOO\", \"BAR\"]).\\\n    set_row_names(['alpha', 'bravo', 'charlie'])\nprint(modified)\n## BiocFrame with 3 rows and 2 columns\n##              FOO    BAR\n##           \u003clist\u003e \u003clist\u003e\n##   alpha ENS00001  MAP1A\n##   bravo ENS00002   BIN1\n## charlie ENS00003   ESR1\n```\n\nWe also support Bioconductor's metadata concepts, either along the columns or for the entire object:\n\n```python\nmodified = bframe.\\\n    set_metadata({ \"author\": \"Jayaram Kancherla\" }).\\\n    set_column_data(BiocFrame({\"column_source\": [\"Ensembl\", \"HGNC\" ]}))\nprint(modified)\n## BiocFrame with 3 rows and 2 columns\n##      ensembl symbol\n##       \u003clist\u003e \u003clist\u003e\n## [0] 
ENS00001  MAP1A\n## [1] ENS00002   BIN1\n## [2] ENS00003   ESR1\n## ------\n## column_data(1): column_source\n## metadata(1): author\n```\n\n### The other way\n\nProperties can also be set by direct assignment for in-place modification.\nWe prefer not to do it this way as it can silently mutate ``BiocFrame`` instances inside other data structures.\nNonetheless:\n\n```python\ntestframe = BiocFrame({ \"A\": [1,2,3], \"B\": [4,5,6] })\ntestframe.column_names = [\"column1\", \"column2\" ]\nprint(testframe)\n## BiocFrame with 3 rows and 2 columns\n##     column1 column2\n##      \u003clist\u003e  \u003clist\u003e\n## [0]       1       4\n## [1]       2       5\n## [2]       3       6\n```\n\nSimilarly, we could set or replace columns directly:\n\n```python\ntestframe[\"column2\"] = [\"A\", \"B\", \"C\"]\ntestframe[1:3, [\"column1\",\"column2\"]] = BiocFrame({\"x\":[4, 5], \"y\":[\"E\", \"F\"]})\nprint(testframe)\n## BiocFrame with 3 rows and 2 columns\n##     column1 column2\n##      \u003clist\u003e  \u003clist\u003e\n## [0]       1       A\n## [1]       4       E\n## [2]       5       F\n```\n\nThese assignments are the same as calling the corresponding `set_*()` methods with `in_place = True`.\nIt is best to do this only if the `BiocFrame` object is not being used anywhere else;\notherwise, it is safer to just create a (shallow) copy via the default `in_place = False`.\n\n## Combining objects\n\n**BiocFrame** implements methods for the various `combine` generics from [**BiocUtils**](https://github.com/BiocPy/biocutils).\nSo, for example, to combine by row:\n\n```python\nimport biocutils\n\nbframe1 = BiocFrame(\n    {\n        \"odd\": [1, 3, 5, 7, 9],\n        \"even\": [0, 2, 4, 6, 8],\n    }\n)\n\nbframe2 = BiocFrame(\n    {\n        \"odd\": [11, 33, 55, 77, 99],\n        \"even\": [0, 22, 44, 66, 88],\n    }\n)\n\ncombined = biocutils.combine_rows(bframe1, bframe2)\nprint(combined)\n## BiocFrame with 10 rows and 2 columns\n##        odd   even\n##     \u003clist\u003e 
\u003clist\u003e\n## [0]      1      0\n## [1]      3      2\n## [2]      5      4\n## [3]      7      6\n## [4]      9      8\n## [5]     11      0\n## [6]     33     22\n## [7]     55     44\n## [8]     77     66\n## [9]     99     88\n```\n\nSimilarly, to combine by column:\n\n```python\nbframe3 = BiocFrame(\n    {\n        \"foo\": [\"A\", \"B\", \"C\", \"D\", \"E\"],\n        \"bar\": [True, False, True, False, True]\n    }\n)\n\ncombined = biocutils.combine_columns(bframe1, bframe3)\nprint(combined)\n## BiocFrame with 5 rows and 4 columns\n##        odd   even    foo    bar\n##     \u003clist\u003e \u003clist\u003e \u003clist\u003e \u003clist\u003e\n## [0]      1      0      A   True\n## [1]      3      2      B  False\n## [2]      5      4      C   True\n## [3]      7      6      D  False\n## [4]      9      8      E   True\n```\n\nBy default, both methods above assume that the number and identity of columns (for `combine_rows()`) or rows (for `combine_columns()`) are the same across objects.\nIf this is not the case, e.g., with different columns across objects, we can use **BiocFrame**'s `relaxed_combine_rows()` instead:\n\n```python\nfrom biocframe import relaxed_combine_rows\nmodified2 = bframe2.set_column(\"foo\", [\"A\", \"B\", \"C\", \"D\", \"E\"])\ncombined = relaxed_combine_rows(bframe1, modified2)\nprint(combined)\n## BiocFrame with 10 rows and 3 columns\n##        odd   even    foo\n##     \u003clist\u003e \u003clist\u003e \u003clist\u003e\n## [0]      1      0   None\n## [1]      3      2   None\n## [2]      5      4   None\n## [3]      7      6   None\n## [4]      9      8   None\n## [5]     11      0      A\n## [6]     33     22      B\n## [7]     55     44      C\n## [8]     77     66      D\n## [9]     99     88      E\n```\n\nSimilarly, if the rows are different, we can use **BiocFrame**'s `merge` function:\n\n```python\nfrom biocframe import merge\nmodified1 = bframe1.set_row_names([\"A\", \"B\", \"C\", \"D\", \"E\"])\nmodified3 = bframe3.set_row_names([\"C\", 
\"D\", \"E\", \"F\", \"G\"])\ncombined = merge([modified1, modified3], by=None, join=\"outer\")\nprint(combined)\n## BiocFrame with 7 rows and 4 columns\n##      odd   even    foo    bar\n##   \u003clist\u003e \u003clist\u003e \u003clist\u003e \u003clist\u003e\n## A      1      0   None   None\n## B      3      2   None   None\n## C      5      4      A   True\n## D      7      6      B  False\n## E      9      8      C   True\n## F   None   None      D  False\n## G   None   None      E   True\n```\n\n## Playing nice with pandas\n\n`BiocFrame` is intended for accurate representation of Bioconductor objects for interoperability with R.\nMost users will probably prefer to work with **pandas** `DataFrame` objects for their actual analyses.\nThis conversion is easily achieved:\n\n```python\nfrom biocframe import BiocFrame\nbframe = BiocFrame(\n    {\n        \"foo\": [\"A\", \"B\", \"C\", \"D\", \"E\"],\n        \"bar\": [True, False, True, False, True]\n    }\n)\n\npd = bframe.to_pandas()\nprint(pd)\n##   foo    bar\n## 0   A   True\n## 1   B  False\n## 2   C   True\n## 3   D  False\n## 4   E   True\n```\n\nConversion back to a ``BiocFrame`` is similarly easy:\n\n```python\nout = BiocFrame.from_pandas(pd)\nprint(out)\n## BiocFrame with 5 rows and 2 columns\n##      foo    bar\n##   \u003clist\u003e \u003clist\u003e\n## 0      A   True\n## 1      B  False\n## 2      C   True\n## 3      D  False\n## 4      E   True\n```\n\n## Further reading\n\nCheck out [the reference documentation](https://biocpy.github.io/BiocFrame/) for more details.\n\nAlso check out Bioconductor's [**S4Vectors**](https://bioconductor.org/packages/S4Vectors) package,\nwhich implements the `DFrame` class on which `BiocFrame` was based.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbiocpy%2Fbiocframe","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbiocpy%2Fbiocframe","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbiocpy%2Fbiocframe/lists"}