{"id":40509025,"url":"https://github.com/psmyth94/biosets","last_synced_at":"2026-01-20T19:50:46.660Z","repository":{"id":257823858,"uuid":"871346193","full_name":"psmyth94/biosets","owner":"psmyth94","description":"A bioinformatics extension of 🤗 Datasets library, built for ML applications on biological and omics data, offering easy integration of metadata and low-code data management tools.","archived":false,"fork":false,"pushed_at":"2024-11-16T15:42:50.000Z","size":286,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-01-03T15:18:30.137Z","etag":null,"topics":["big-data","bioinfo","classification","data-preprocessing","data-processing","data-science","datasets","genomics","high-performance","huggingface","machine-learning","metadata","omics","open-source","pandas","polars","proteomics","pyarrow","python","regression"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/psmyth94.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"docs/CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-10-11T19:08:31.000Z","updated_at":"2025-02-10T13:13:19.000Z","dependencies_parsed_at":"2024-10-28T14:49:38.558Z","dependency_job_id":"4023d70f-1bfa-4267-a433-c87fe9080e09","html_url":"https://github.com/psmyth94/biosets","commit_stats":null,"previous_names":["psmyth94/biosets"],"tags_count":6,"template":false,"template_full_name":null,"purl":"pkg:github/psmyth94/biosets","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repo
sitories/psmyth94%2Fbiosets","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/psmyth94%2Fbiosets/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/psmyth94%2Fbiosets/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/psmyth94%2Fbiosets/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/psmyth94","download_url":"https://codeload.github.com/psmyth94/biosets/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/psmyth94%2Fbiosets/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28611973,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-20T18:56:40.769Z","status":"ssl_error","status_checked_at":"2026-01-20T18:54:26.653Z","response_time":117,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["big-data","bioinfo","classification","data-preprocessing","data-processing","data-science","datasets","genomics","high-performance","huggingface","machine-learning","metadata","omics","open-source","pandas","polars","proteomics","pyarrow","python","regression"],"created_at":"2026-01-20T19:50:45.896Z","updated_at":"2026-01-20T19:50:46.654Z","avatar_url":"https://github.com/psmyth94.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n    $${\\Huge{\\textbf{\\textsf{\\color{#2E8B57}Bio\\color{#4682B4}sets}}}}$$\n    \u003cbr/\u003e\n    \u003cbr/\u003e\n\u003c/p\u003e \n\u003cp align=\"center\"\u003e\n    \u003ca href=\"https://github.com/psmyth94/biosets/actions/workflows/ci_cd_pipeline.yml?query=branch%3Amain\"\u003e\u003cimg alt=\"Build\" src=\"https://github.com/psmyth94/biosets/actions/workflows/ci_cd_pipeline.yml/badge.svg?branch=main\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://github.com/psmyth94/biosets/blob/main/LICENSE\"\u003e\u003cimg alt=\"GitHub\" src=\"https://img.shields.io/github/license/psmyth94/biosets.svg?color=blue\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://github.com/psmyth94/biosets/tree/main/docs\"\u003e\u003cimg alt=\"Documentation\" src=\"https://img.shields.io/website/http/github/psmyth94/biosets/tree/main/docs.svg?down_color=red\u0026down_message=offline\u0026up_message=online\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://github.com/psmyth94/biosets/releases\"\u003e\u003cimg 
alt=\"GitHub release\" src=\"https://img.shields.io/github/release/psmyth94/biosets.svg\"\u003e\u003c/a\u003e\n    \u003ca href=\"CODE_OF_CONDUCT.md\"\u003e\u003cimg alt=\"Contributor Covenant\" src=\"https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://zenodo.org/records/14028772\"\u003e\u003cimg src=\"https://zenodo.org/badge/DOI/10.5281/zenodo.14028772.svg\" alt=\"DOI\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n**Biosets** is a specialized library that extends 🤗 [Datasets](https://github.com/huggingface/datasets) for bioinformatics data, providing the following main features:\n\n- **Bioinformatics Specialization**: Streamlines data management specific to bioinformatics, such as handling samples, features, batches, and associated metadata.\n- **Automatic Column Detection**: Infers sample, batch, input features, and target columns, simplifying downstream preprocessing.\n- **Custom Data Classes**: Leverages specialized data classes (`ValueWithMetadata`, `Sample`, `Batch`, `RegressionTarget`, etc.) 
to manage metadata-rich bioinformatics data.\n- **Polars Integration**: Optional [Polars](https://github.com/pola-rs/polars) integration enables high-performance data manipulation, ideal for large datasets.\n- **Flexible Task Support**: Native support for binary classification, multiclass classification, multiclass-to-binary classification, and regression, adapting to diverse bioinformatics tasks.\n- **Integration with 🤗 Datasets**: `load_dataset` function supports loading various bioinformatics formats like CSV, JSON, NPZ, and more, including metadata integration.\n- **Arrow File Caching**: Uses [Apache Arrow](https://github.com/apache/arrow) for efficient on-disk caching, enabling fast access to large datasets without memory limitations.\n\nBiosets helps bioinformatics researchers focus on analysis rather than data handling, with seamless compatibility with 🤗 Datasets.\n\n## Installation\n\n### With pip\n\nYou can install **Biosets** from PyPI:\n\n```bash\npip install biosets\n```\n\n### With conda\n\nInstall **Biosets** via conda:\n\n```bash\nconda install -c patrico49 biosets\n```\n\n## Usage\n\n**Biosets** provides a straightforward API for handling bioinformatics datasets with integrated metadata management. 
Here's a quick example:\n\n```python\nfrom biosets import load_dataset\n\nbio_data = load_dataset(\n    data_files=\"data_with_samples.csv\",\n    sample_metadata_files=\"sample_metadata.csv\",\n    feature_metadata_files=\"feature_metadata.csv\",\n    target_column=\"metadata1\",\n    experiment_type=\"metagenomics\",\n    batch_column=\"batch\",\n    sample_column=\"sample\",\n    metadata_columns=[\"metadata1\", \"metadata2\"],\n    drop_samples=False\n)[\"train\"]\n```\n\nFor further details, check the [advanced usage documentation](./docs/DATA_LOADING.md).\n\n## Main Differences Between Biosets and 🤗 Datasets\n\n- **Bioinformatics Focus**: While 🤗 Datasets is a general-purpose library, Biosets is tailored for the bioinformatics domain.\n- **Seamless Metadata Integration**: Biosets is built for datasets with metadata dependencies, like sample and feature metadata.\n- **Automatic Column Detection**: Reduces preprocessing time with automatic inference of sample, batch, feature, and label columns.\n- **Specialized Data Classes**: Biosets introduces custom classes (e.g., `Sample`, `Batch`, `ValueWithMetadata`) to enable richer data representation.\n\n## Disclaimers\n\nBiosets may run Python code from custom `datasets` scripts to handle specific data formats. 
For security, users should:\n\n- Inspect dataset scripts prior to execution.\n- Use pinned versions for any repository dependencies.\n\nIf you manage a dataset and wish to update or remove it, please open a discussion or pull request on the Community tab of 🤗's datasets page.\n\n## BibTeX\n\nIf you'd like to cite **Biosets**, please use the following:\n\n```bibtex\n@misc{smyth2024biosets,\n    title = {psmyth94/biosets: 1.1.0},\n    author = {Patrick Smyth},\n    year = {2024},\n    url = {https://github.com/psmyth94/biosets},\n    note = {A library designed to support bioinformatics data with custom features, metadata integration, and compatibility with 🤗 Datasets.}\n}\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpsmyth94%2Fbiosets","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpsmyth94%2Fbiosets","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpsmyth94%2Fbiosets/lists"}