{"id":13754166,"url":"https://github.com/bigscience-workshop/biomedical","last_synced_at":"2025-05-15T09:07:54.264Z","repository":{"id":36964038,"uuid":"407183160","full_name":"bigscience-workshop/biomedical","owner":"bigscience-workshop","description":"Tools for curating biomedical training data for large-scale language modeling ","archived":false,"fork":false,"pushed_at":"2024-12-09T17:05:14.000Z","size":26747,"stargazers_count":477,"open_issues_count":179,"forks_count":116,"subscribers_count":30,"default_branch":"main","last_synced_at":"2025-05-10T05:42:14.220Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bigscience-workshop.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-09-16T13:50:24.000Z","updated_at":"2025-04-21T13:08:15.000Z","dependencies_parsed_at":"2024-01-17T12:33:16.041Z","dependency_job_id":"bc1823ec-15d4-4eca-8d94-52df97092c32","html_url":"https://github.com/bigscience-workshop/biomedical","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigscience-workshop%2Fbiomedical","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigscience-workshop%2Fbiomedical/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigscience-workshop%2Fbiomedical/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigscien
ce-workshop%2Fbiomedical/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bigscience-workshop","download_url":"https://codeload.github.com/bigscience-workshop/biomedical/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254310515,"owners_count":22049469,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T09:01:46.406Z","updated_at":"2025-05-15T09:07:49.255Z","avatar_url":"https://github.com/bigscience-workshop.png","language":"Python","readme":"# BigBIO: Biomedical Dataset Library\n\n`BigBIO` (BigScience Biomedical) is an open library of biomedical dataloaders built using Huggingface's (🤗) [`datasets` library](https://huggingface.co/docs/datasets/) for data-centric machine learning. \n\nOur goals include:\n\n- Lightweight, programmatic access to biomedical datasets at scale\n- Promoting reproducibility in data processing\n- Better documentation for dataset provenance, licensing, and other key attributes\n- Easier generation of meta-datasets for natural language prompting, multi-task learning\n\nCurrently `BigBIO` provides support for:\n\n- 126+ biomedical datasets\n- 10+ languages\n- 12 task categories\n- Harmonized dataset schemas by task type\n- Metadata on *licensing*, *coarse/fine-grained task types*, *domain*, and more!\n\n## How to Use `BigBIO`\n\nThe preferred way to use these datasets is to access them from the [Official `BigBIO` Hub](https://huggingface.co/bigbio). \n\n\nMinimally, ensure you have the `datasets` library installed. 
Preferably, install the requirements as follows:\n\n`pip install -r requirements.txt`.\n\n\u003cbr\u003e\n\nYou can access `BigBIO` datasets as follows:\n\n```python\nfrom datasets import load_dataset\ndata = load_dataset(\"bigbio/biosses\")\n```\n\nIn most cases, scripts load the original schema of the dataset by default. You can also access the `BigBIO` split that streamlines access to key information in datasets given a particular task. \n\n\u003cbr\u003e\n\nFor example, the `biosses` dataset follows a `pairs` based schema, where text-based inputs (sentences, paragraphs) are assigned a \"translated\" pair. \n\n```python\nfrom datasets import load_dataset\ndata = load_dataset(\"bigbio/biosses\", name=\"biosses_bigbio_pairs\")\n```\n\nGenerally, you can load your datasets as follows:\n\n```python\n# Load original schema\ndata = load_dataset(\"bigbio/\u003cyour_dataset\u003e\")\n\n# Load BigBIO schema\ndata = load_dataset(\"bigbio/\u003cyour_dataset_here\u003e\", name=\"\u003cyour_dataset\u003e_bigbio_\u003cschema_name\u003e\")\n```\n\nCheck the datacards on the Hub to see what splits are available to you. 
You can find more information about [schemas](task_schemas.md) in [Documentation](##Documentation) below.\n\n## Benchmark Support\n\n`BigBIO` includes support for almost all datasets included in other popular English biomedical benchmarks.\n\n| Task Type | Dataset       | [`BigBIO` (ours)](https://arxiv.org/abs/2206.15076) | [BLUE](https://arxiv.org/abs/1906.05474)  | [BLURB](https://microsoft.github.io/BLURB/) | [BoX](https://arxiv.org/abs/2204.07600) | DUA needed |\n|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|\n| NER       | BC2GM         | ✓          |   | ✓  | ✓       |             |\n| NER       | BC5-chem      | ✓          | ✓  | ✓  | ✓       |          |\n| NER       | BC5-disease   | ✓          | ✓  | ✓  | ✓       |          |\n| NER       | EBM PICO      | ✓          |   | ✓  |        |             |\n| NER       | JNLPBA        | ✓          |   | ✓  | ✓       |             |\n| NER       | NCBI-disease  | ✓          |   | ✓  | ✓       |          |\n| RE        | ChemProt      | ✓          | ✓  | ✓  | ✓       |          |\n| RE        | DDI           | ✓          | ✓  | ✓  | ✓       |          |\n| RE        | GAD           | ✓          |   | ✓  |        |             |\n| QA        | PubMedQA      | ✓          |   | ✓  |    ✓    |          |\n| QA        | BioASQ        | ✓          |   | ✓  |  ✓       | ✓         |\n| DC        | HoC           | ✓          | ✓  |   ✓  | ✓       |          |\n| STS       | BIOSSES       | ✓          | ✓  |   ✓  |        |          |\n| STS       | MedSTS        | *                | ✓  |   |        |   ✓          |\n| NER       | n2c2 2010     | ✓          | ✓  |   |  ✓      | ✓         |\n| NER       | ShARe/CLEF 2013   | *          | ✓  |   |        |   ✓          |\n| NLI       | MedNLI        | ✓          | ✓  |   |        |    ✓         | \n| NER        | n2c2 deid 2006  | ✓          |   |   | ✓       |    ✓           |\n| DC       | n2c2 RFHD 2014     | ✓       |   |   | ✓     
  |   ✓           |\n| NER       | AnatEM        | ✓          |   |   | ✓       |             |\n| NER       | BC4CHEMD      | ✓          |   |   | ✓       |             |\n| NER       | BioNLP09      | ✓          |   |   | ✓       |             |\n| NER       | BioNLP11EPI   | ✓          |   |   | ✓       |             |\n| NER       | BioNLP11ID    | ✓          |   |   | ✓       |             |\n| NER       | BioNLP13CG    | ✓          |   |   | ✓       |             |\n| NER       | BioNLP13GE    | ✓          |   |   | ✓       |             |\n| NER       | BioNLP13PC    | ✓          |   |   | ✓       |             |\n| NER       | CRAFT         | *                |   |   | ✓       |             |\n| NER       | Ex-PTM        | ✓          |   |   | ✓       |             |\n| NER       | Linnaeus      | ✓          |   |   | ✓       |             |\n| POS       | GENIA         | *                |   |   | ✓       |             |\n| SA        | Medical Drugs | ✓          |   |   | ✓       |  |\n| SR        | COVID         |          |   |   | private       |             |\n| SR        | Cooking       |          |   |   | private      |             |\n| SR        | HRT           |          |   |   | private      |             |\n| SR        | Accelerometer |          |   |   | private       |             |\n| SR        | Acromegaly    |          |   |   | private      |             |\n\n\\* denotes a dataset whose implementation is in progress\n\n## Documentation\n\n- [Task Schema Overview](task_schemas.md) is an in-depth explanation of the `BigBIO` schemas implemented.\n\n- [Streamlit Visualization Demo](https://github.com/bigscience-workshop/biomedical/tree/master/streamlit_demo)\n\n- [BigBIO Data Cards](https://github.com/bigscience-workshop/biomedical/tree/master/figures/data_card) report statistics for each dataset in the library.\n\n\n## Tutorials\n\nTBA - Links may not be applicable yet!\n\n- Tutorials\n  - [Materializing 
Meta-datasets](https://github.com/bigscience-workshop/biomedical/blob/master/notebooks/materializing_meta_datasets/materializing-meta-datasets.ipynb)   \n  - [Prompt Engineering and Evaluation](https://github.com/bigscience-workshop/biomedical/tree/master/notebooks/promptengineering)  \n  - [Prompt Engineering with BLOOM](notebooks/bloomprompting/bloompipeline.md)\n\n## Contributing\n\n`BigBIO` is an open source project - your involvement is warmly welcome! If you're excited to join us, we recommend the following steps:\n\n- Looking for ideas? See our [Volunteer Project Board](https://github.com/orgs/bigscience-workshop/projects/6) to see what we may need help with.\n\n- Have your own idea? Contact an admin by opening an [issue](https://github.com/bigscience-workshop/biomedical/issues/new?assignees=\u0026labels=\u0026template=add-dataset.md\u0026title=).\n\n- Implement your idea following the guidelines set by the [official contributing guide](CONTRIBUTING.md).\n\n- Wait for admin approval; review is iterative, and accepted contributions are merged into the main repository.\n\nCurrently, only admins merge accepted changes to the Hub.\n\nFeel free to join our [Discord](https://discord.com/invite/Cwf3nT3ajP)!\n\n## Citing\nIf you use BigBIO in your work, please cite:\n\n```\n@article{fries2022bigbio,\n\ttitle = {\n\t\tBigBIO: A Framework for Data-Centric Biomedical Natural Language\n\t\tProcessing\n\t},\n\tauthor = {\n\t\tFries, Jason Alan and Weber, Leon and Seelam, Natasha and Altay,\n\t\tGabriel and Datta, Debajyoti and Garda, Samuele and Kang, Myungsun\n\t\tand Su, Ruisi and Kusa, Wojciech and Cahyawijaya, Samuel and others\n\t},\n\tjournal = {arXiv preprint arXiv:2206.15076},\n\tyear = 2022\n}\n```\n\n## Acknowledgements\n\n`BigBIO` is an open source, community effort made possible through the efforts of many volunteers as part of BigScience and the [Biomedical 
Hackathon](HACKATHON.md).\n","funding_links":[],"categories":["A01_文本生成_文本对话"],"sub_categories":["大语言对话模型及数据"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbigscience-workshop%2Fbiomedical","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbigscience-workshop%2Fbiomedical","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbigscience-workshop%2Fbiomedical/lists"}