https://github.com/allenai/medicat
Dataset of medical images, captions, subfigure-subcaption annotations, and inline textual references
- Host: GitHub
- URL: https://github.com/allenai/medicat
- Owner: allenai
- License: apache-2.0
- Created: 2020-09-30T23:27:58.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2023-11-21T18:15:00.000Z (almost 2 years ago)
- Last Synced: 2024-11-04T11:38:49.146Z (12 months ago)
- Language: Python
- Size: 2.52 MB
- Stars: 124
- Watchers: 8
- Forks: 14
- Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-multimodal-in-medical-imaging - MedICaT
- awesome-latest-LLM - MedICaT
- Awesome-CLIP-in-Medical-Imaging - MedICaT
README
# MedICaT
MedICaT is a dataset of medical images, captions, subfigure-subcaption annotations, and inline textual references. Instructions for access are provided in the download section below.
Figures and captions are extracted from open access articles in PubMed Central, and the corresponding reference text is derived from [S2ORC](https://github.com/allenai/s2orc).
The dataset consists of:
* 217,060 figures from 131,410 open access papers
* 7,507 subcaption and subfigure annotations for 2,069 compound figures
* Inline references for ~25K figures in the [ROCO dataset](https://github.com/razorx89/roco-dataset)
A sample of the data is available in `sample/`.
An example data entry:
```
{
  "pdf_hash": "57c9ad0f4aab133f96d40992c46926fabc901ffa",
  "fig_key": "Figure1",
  "fig_uri": "2-Figure1-1.png",
  "s2_caption": "Figure 1. (A) Barium enema and (B) endoscopic image of the high-grade distal colonic obstruction caused by a 5-cm anastomotic stricture.",
  "s2orc_caption": "Figure 1. (A) Barium enema and (B) endoscopic image of the high-grade distal colonic obstruction caused by a 5-cm anastomotic stricture.",
  "s2orc_references": [
    "Computed tomography (CT) showed a distal large bowel obstruction, and a barium enema revealed a high-grade stenosis proximal to the anastomotic site in the recto-sigmoid region (Figure 1 ).",
    "Flexible sigmoidoscopy revealed a tight, fibrotic, benign-appearing anastomotic stricture 15 cm from the anal verge ( Figure 1) ."
  ],
  "radiology": false,
  "scope": true,
  "predicted_type": "Medical images",
  "oa_info": {
    "doi": "10.14309/crj.2014.54",
    "doi_url": "https://doi.org/10.14309/crj.2014.54",
    "oa": {
      "is_oa": true,
      "oa_status": "gold",
      "journal_is_oa": true,
      "journal_is_in_doaj": true,
      "license": "cc-by-nc-nd",
      "provenance": "unpaywall"
    }
  }
}
```
The corresponding figure is located at `figures/57c9ad0f4aab133f96d40992c46926fabc901ffa_2-Figure1-1.png` (`{pdf_hash}_{fig_uri}`).
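The naming rule can be sketched in Python. This is only an illustration: the helper name `figure_path` and the default `figures` directory argument are assumptions made for this example, not part of the released code.

```python
import json
from pathlib import Path


def figure_path(entry: dict, figures_dir: str = "figures") -> Path:
    """Build the image path for one entry as {pdf_hash}_{fig_uri}."""
    # figure_path is a hypothetical helper, not a function from the repository.
    return Path(figures_dir) / f"{entry['pdf_hash']}_{entry['fig_uri']}"


# A trimmed copy of the example entry above (only the fields needed here).
entry = json.loads("""
{
  "pdf_hash": "57c9ad0f4aab133f96d40992c46926fabc901ffa",
  "fig_key": "Figure1",
  "fig_uri": "2-Figure1-1.png"
}
""")

print(figure_path(entry).as_posix())
```

Pass a different `figures_dir` if the archive was extracted somewhere other than the working directory.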
### To download:
Figure, caption, and reference data (104 GB): https://ai2-s2-medicat.s3.us-west-2.amazonaws.com/2020-10-05/medicat_release.tar.gz
Subcaption/subfigure annotations (14 MB): https://ai2-s2-medicat.s3.us-west-2.amazonaws.com/2020-10-05/subcaptions_public.jsonl
Inline references for ROCO dataset (3 MB): https://ai2-s2-medicat.s3.us-west-2.amazonaws.com/2020-10-05/roco_references.zip
This data can only be used for research purposes. Please abide by the licenses for reuse.
### Code
Please see the `code` directory for the code associated with our paper. The `code/README.md` includes additional information about how you can use this code.
### To cite:
If using this dataset, please cite:
```
@inproceedings{subramanian-2020-medicat,
title={{MedICaT: A Dataset of Medical Images, Captions, and Textual References}},
author={Sanjay Subramanian and Lucy Lu Wang and Sachin Mehta and Ben Bogin and Madeleine van Zuylen and Sravanthi Parasa and Sameer Singh and Matt Gardner and Hannaneh Hajishirzi},
year={2020},
booktitle={Findings of EMNLP},
}
```
### License
Each source document in MedICaT is licensed differently. Articles included in MedICaT have open access licenses (see [CC](https://creativecommons.org/licenses/) and [UPW](https://support.unpaywall.org/support/solutions/folders/44000384007)) or are in the public domain. The license for each article is provided in the associated entry in the dataset. Please abide by these licenses when using the data. The MedICaT dataset is available for non-commercial use only.
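Because licenses vary per article, it can help to read the license off each entry before reuse. Here is a minimal sketch, assuming the license sits under `oa_info` → `oa` → `license` as in the example entry. Whether every entry carries these fields is an assumption, so missing values come back as `None`. The helper names `entry_license` and `group_by_license` are hypothetical.

```python
from typing import Dict, List, Optional


def entry_license(entry: dict) -> Optional[str]:
    """Return the license at oa_info -> oa -> license, or None if any level is missing."""
    oa = (entry.get("oa_info") or {}).get("oa") or {}
    return oa.get("license")


def group_by_license(entries: List[dict]) -> Dict[Optional[str], List[dict]]:
    """Bucket entries by license so each bucket can be handled under its own terms."""
    groups: Dict[Optional[str], List[dict]] = {}
    for entry in entries:
        groups.setdefault(entry_license(entry), []).append(entry)
    return groups


# The first entry mirrors the example entry's license fields; the second is hypothetical.
entries = [
    {"fig_uri": "2-Figure1-1.png", "oa_info": {"oa": {"license": "cc-by-nc-nd"}}},
    {"fig_uri": "hypothetical.png"},
]
groups = group_by_license(entries)
print({license_name: len(items) for license_name, items in groups.items()})
```

Entries grouped under `None` have no recorded license in this sketch and would need a manual check against the source article.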
## Contact information for questions
**Email:** `lucylw@uw.edu`