{"id":40518505,"url":"https://github.com/wheretrue/biobear","last_synced_at":"2026-01-20T21:00:55.513Z","repository":{"id":154011001,"uuid":"631317732","full_name":"wheretrue/biobear","owner":"wheretrue","description":"Work with bioinformatic files using Arrow, Polars, and/or DuckDB","archived":false,"fork":false,"pushed_at":"2025-03-10T21:20:41.000Z","size":2111,"stargazers_count":193,"open_issues_count":15,"forks_count":12,"subscribers_count":4,"default_branch":"main","last_synced_at":"2026-01-03T14:05:51.516Z","etag":null,"topics":["arrow","bioinformatics","biology","biopython","duckdb","polars","pyarrow","python","rust-bio","samtools"],"latest_commit_sha":null,"homepage":"https://www.wheretrue.dev/docs/exon/biobear/","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/wheretrue.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":"wheretrue"}},"created_at":"2023-04-22T16:30:38.000Z","updated_at":"2025-12-25T09:34:33.000Z","dependencies_parsed_at":"2023-10-14T18:37:55.347Z","dependency_job_id":"bae868db-a9b9-4e2d-85d2-d5c45c16a39a","html_url":"https://github.com/wheretrue/biobear","commit_stats":{"total_commits":170,"total_committers":1,"mean_commits":170.0,"dds":0.0,"last_synced_commit":"56a284c7800a1a18527328fdc576102813945f8c"},"previous_names":[],"tags_count":92,"template":false,"template_full_name":null,"purl":"pkg:github/wheretrue/biobear","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wheretrue%2Fbiobear","tags_url":"https://repos.ecosyste.ms/api/v1/ho
sts/GitHub/repositories/wheretrue%2Fbiobear/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wheretrue%2Fbiobear/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wheretrue%2Fbiobear/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/wheretrue","download_url":"https://codeload.github.com/wheretrue/biobear/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wheretrue%2Fbiobear/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28613659,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-20T18:56:40.769Z","status":"ssl_error","status_checked_at":"2026-01-20T18:54:26.653Z","response_time":117,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["arrow","bioinformatics","biology","biopython","duckdb","polars","pyarrow","python","rust-bio","samtools"],"created_at":"2026-01-20T21:00:27.379Z","updated_at":"2026-01-20T21:00:55.495Z","avatar_url":"https://github.com/wheretrue.png","language":"Rust","funding_links":["https://github.com/sponsors/wheretrue"],"categories":[],"sub_categories":[],"readme":"\u003ch1 align=\"center\"\u003e\n    \u003cimg src=\"https://raw.githubusercontent.com/wheretrue/biobear/main/.github/biobear.svg\" width=\"450px\" alt=\"biobear\" 
/\u003e\n\u003c/h1\u003e\n\nbiobear is a Python library designed for reading and searching bioinformatic file formats, using Rust as its backend and producing Arrow Batch Readers and other downstream formats (like polars or duckdb).\n\nThe python package has minimal dependencies and only requires Polars. Biobear can be used to read various bioinformatic file formats, including FASTA, FASTQ, VCF, BAM, and GFF locally or from an object store like S3. It can also query some indexed file formats locally like VCF and BAM.\n\n[![Release](https://github.com/wheretrue/biobear/actions/workflows/release.yml/badge.svg)](https://github.com/wheretrue/biobear/actions/workflows/release.yml)\n\nPlease see the [documentation] for information on how to get started using biobear.\n\n## Quickstart\n\nTo install biobear, run:\n\n```bash\npip install biobear\npip install polars # needed for `to_polars` method\n```\n\nCreate a file with some GFF data:\n\n```bash\necho \"chr1\\t.\\tgene\\t1\\t100\\t.\\t+\\t.\\tgene_id=1;gene_name=foo\" \u003e test.gff\necho \"chr1\\t.\\tgene\\t200\\t300\\t.\\t+\\t.\\tgene_id=2;gene_name=bar\" \u003e\u003e test.gff\n```\n\nThen you can use biobear to read a file:\n\n```python\nimport biobear as bb\n\nsession = bb.connect()\ndf = session.sql(\"\"\"\n    SELECT * FROM gff_scan('test.gff')\n\"\"\").to_polars()\n\nprint(df)\n```\n\nThis will print:\n\n```text\n┌─────────┬────────┬──────┬───────┬───┬───────┬────────┬───────┬───────────────────────────────────┐\n│ seqname ┆ source ┆ type ┆ start ┆ … ┆ score ┆ strand ┆ phase ┆ attributes                        │\n│ ---     ┆ ---    ┆ ---  ┆ ---   ┆   ┆ ---   ┆ ---    ┆ ---   ┆ ---                               │\n│ str     ┆ str    ┆ str  ┆ i64   ┆   ┆ f32   ┆ str    ┆ str   ┆ list[struct[2]]                   │\n╞═════════╪════════╪══════╪═══════╪═══╪═══════╪════════╪═══════╪═══════════════════════════════════╡\n│ chr1    ┆ .      
┆ gene ┆ 1     ┆ … ┆ null  ┆ +      ┆ null  ┆ [{\"gene_id\",\"1\"}, {\"gene_name\",\"… │\n│ chr1    ┆ .      ┆ gene ┆ 200   ┆ … ┆ null  ┆ +      ┆ null  ┆ [{\"gene_id\",\"2\"}, {\"gene_name\",\"… │\n└─────────┴────────┴──────┴───────┴───┴───────┴────────┴───────┴───────────────────────────────────┘\n```\n\n### Using a Session w/ Exon\n\nBioBear exposes a session object that can be used with [exon][] to work with files directly in SQL, then eventually convert them to a DataFrame if needed.\n\nSee the [BioBear Docs][documentation] for more information, but in short, you can use the session like this:\n\n```python\nimport biobear as bb\n\nsession = bb.connect()\n\nsession.sql(\"\"\"\nCREATE EXTERNAL TABLE gene_annotations_s3 STORED AS GFF LOCATION 's3://BUCKET/TenflaDSM28944/IMG_Data/Ga0451106_prodigal.gff'\n\"\"\")\n\ndf = session.sql(\"\"\"\n    SELECT * FROM gene_annotations_s3 WHERE score \u003e 50\n\"\"\").to_polars()\ndf.head()\n# shape: (5, 9)\n# ┌──────────────┬─────────────────┬──────┬───────┬───┬────────────┬────────┬───────┬───────────────────────────────────┐\n# │ seqname      ┆ source          ┆ type ┆ start ┆ … ┆ score      ┆ strand ┆ phase ┆ attributes                        │\n# │ ---          ┆ ---             ┆ ---  ┆ ---   ┆   ┆ ---        ┆ ---    ┆ ---   ┆ ---                               │\n# │ str          ┆ str             ┆ str  ┆ i64   ┆   ┆ f32        ┆ str    ┆ str   ┆ list[struct[2]]                   │\n# ╞══════════════╪═════════════════╪══════╪═══════╪═══╪════════════╪════════╪═══════╪═══════════════════════════════════╡\n# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS  ┆ 2     ┆ … ┆ 54.5       ┆ -      ┆ 0     ┆ [{\"ID\",[\"Ga0451106_01_2_238\"]}, … │\n# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS  ┆ 228   ┆ … ┆ 114.0      ┆ -      ┆ 0     ┆ [{\"ID\",[\"Ga0451106_01_228_941\"]}… │\n# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS  ┆ 1097  ┆ … ┆ 224.399994 ┆ +      ┆ 0     ┆ [{\"ID\",[\"Ga0451106_01_1097_2257\"… │\n# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS 
 ┆ 2261  ┆ … ┆ 237.699997 ┆ +      ┆ 0     ┆ [{\"ID\",[\"Ga0451106_01_2261_3787\"… │\n# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS  ┆ 3784  ┆ … ┆ 114.400002 ┆ +      ┆ 0     ┆ [{\"ID\",[\"Ga0451106_01_3784_4548\"… │\n# └──────────────┴─────────────────┴──────┴───────┴───┴────────────┴────────┴───────┴───────────────────────────────────┘\n```\n\n## Ecosystem\n\nBioBear aims to make it simple to move data to and from prominent Python data tools. Generally, if a tool can read Arrow or Polars data, it can read BioBear's output. To see examples of how to use BioBear with other tools, see:\n\n* [DuckDB][DuckDB Integration]\n* [GenomicRanges]\n* [DeltaLake]\n\n\n## Performance\n\nPlease see [exon][]'s performance metrics for thorough benchmarks, but in short, biobear is generally faster than other Python libraries for reading bioinformatic file formats.\n\nFor example, here are quick benchmarks for reading one FASTA file with 1 million records, and for reading 5 FASTA files with 1 million records each, from the local file system on an M1 MacBook Pro:\n\n| Library   | 1 file (s)         | 5 files (s)         |\n|-----------|--------------------|---------------------|\n| BioBear   | 4.605 s ±  0.166 s | 6.420 s ±  0.113 s  |\n| BioPython | 6.654 s ±  0.003 s | 34.254 s ±  0.053 s |\n\nThe larger difference with multiple files is due to biobear's ability to read multiple files in parallel.\n\n[exon]: https://github.com/wheretrue/exon/tree/main/exon-benchmarks\n[duckdb]: https://duckdb.org/\n[documentation]: https://www.wheretrue.dev/docs/exon/biobear/\n[DuckDB Integration]: https://www.wheretrue.dev/docs/exon/biobear/duckdb-integration\n[DeltaLake]: https://www.wheretrue.dev/docs/exon/biobear/delta-lake-integration/\n[GenomicRanges]: 
https://www.wheretrue.dev/docs/exon/biobear/genomicranges-integration\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwheretrue%2Fbiobear","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwheretrue%2Fbiobear","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwheretrue%2Fbiobear/lists"}