{"id":13416182,"url":"https://github.com/basveeling/pcam","last_synced_at":"2025-04-04T21:11:35.125Z","repository":{"id":33272045,"uuid":"135140095","full_name":"basveeling/pcam","owner":"basveeling","description":"The PatchCamelyon (PCam) deep learning classification benchmark.","archived":false,"fork":false,"pushed_at":"2024-01-31T14:06:01.000Z","size":527,"stargazers_count":474,"open_issues_count":8,"forks_count":107,"subscribers_count":17,"default_branch":"master","last_synced_at":"2024-10-30T00:36:18.623Z","etag":null,"topics":["benchmark","dataset","deep-learning","deep-learning-datasets","pathology"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/basveeling.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2018-05-28T09:30:52.000Z","updated_at":"2024-10-18T07:44:25.000Z","dependencies_parsed_at":"2024-04-19T03:35:27.710Z","dependency_job_id":null,"html_url":"https://github.com/basveeling/pcam","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/basveeling%2Fpcam","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/basveeling%2Fpcam/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/basveeling%2Fpcam/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/basveeling%2Fpcam/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/basveeling","download_url":"https://codeload.githu
b.com/basveeling/pcam/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247249536,"owners_count":20908212,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark","dataset","deep-learning","deep-learning-datasets","pathology"],"created_at":"2024-07-30T21:00:55.108Z","updated_at":"2025-04-04T21:11:35.108Z","avatar_url":"https://github.com/basveeling.png","language":"Python","funding_links":[],"categories":["Datasets","Imaging Data"],"sub_categories":["Breast Cancer","Pathology"],"readme":"# PatchCamelyon (PCam)\n_That which is measured, improves._ - Karl Pearson\n\nThe PatchCamelyon benchmark is a new and challenging image classification dataset. It consists of 327,680 color images (96 x 96px) extracted from histopathologic scans of lymph node sections. Each image is annotated with a binary label indicating the presence of metastatic tissue. PCam provides a new benchmark for machine learning models: bigger than CIFAR-10, smaller than ImageNet, trainable on a single GPU.\n\n![PCam example images. Green boxes indicate positive labels.](https://github.com/basveeling/pcam/blob/master/pcam.jpg)\n*Example images from PCam. 
Green boxes indicate tumor tissue in the center region, which dictates a positive label.*\n\n\u003cdetails\u003e\u003csummary\u003eTable of Contents\u003c/summary\u003e\u003cp\u003e\n\n* [Why PCam](#why-pcam)\n* [Download](#download)\n* [Details](#details)\n* [Usage and Tips](#usage-and-tips)\n* [Benchmark](#benchmark)\n* [Visualization](#visualization)\n* [Contributing](#contributing)\n* [Contact](#contact)\n* [Citing PCam](#citing-pcam)\n* [License](#license)\n\u003c/p\u003e\u003c/details\u003e\u003cp\u003e\u003c/p\u003e\n\n## Why PCam\nFundamental machine learning advancements are predominantly evaluated on straightforward natural-image classification datasets. Think MNIST, CIFAR, SVHN. Medical imaging is becoming one of the major applications of ML and we believe it deserves a spot on the list of _go-to_ ML datasets, both to challenge future work and to steer developments in directions that are beneficial for this domain.\n\nWe think PCam can play a role in this. It packs the clinically-relevant task of metastasis detection into a straightforward binary image classification task, akin to CIFAR-10 and MNIST. Models can easily be trained on a single GPU in a couple of hours, and achieve competitive scores in the Camelyon16 tasks of tumor detection and whole-slide image (WSI) diagnosis. Furthermore, the balance between task difficulty and tractability makes it a prime candidate for fundamental machine learning research on topics such as active learning, model uncertainty and explainability.\n\n\n## Download\nThe data is stored in gzipped HDF5 files and can be downloaded using the following links. Each set consists of a data and target file. An additional meta csv file is provided which describes which Camelyon16 slide each patch was extracted from, but this information is not used for training or evaluation on the benchmark. 
Please report any downloading problems via a github issue.\n\nDownload all at once from [Google Drive](https://drive.google.com/drive/folders/1gHou49cA1s5vua2V5L98Lt8TiWA3FrKB?usp=sharing).\n\n| Name  | Content | Size | Link | MD5 Checksum|\n| --- | --- |--- | --- |--- |\n| `camelyonpatch_level_2_split_train_x.h5.gz` | training images | 6.1 GB | [Download](https://drive.google.com/uc?export=download\u0026id=1Ka0XfEMiwgCYPdTI-vv6eUElOBnKFKQ2)|`1571f514728f59376b705fc836ff4b63`|\n| `camelyonpatch_level_2_split_train_y.h5.gz` | training labels | 21 KB | [Download](https://drive.google.com/uc?export=download\u0026id=1269yhu3pZDP8UYFQs-NYs3FPwuK-nGSG)|`35c2d7259d906cfc8143347bb8e05be7`|\n| `camelyonpatch_level_2_split_valid_x.h5.gz` | valid images | 0.8 GB | [Download](https://drive.google.com/uc?export=download\u0026id=1hgshYGWK8V-eGRy8LToWJJgDU_rXWVJ3)|`d8c2d60d490dbd479f8199bdfa0cf6ec`|\n| `camelyonpatch_level_2_split_valid_y.h5.gz` | valid labels | 3.0 KB | [Download](https://drive.google.com/uc?export=download\u0026id=1bH8ZRbhSVAhScTS0p9-ZzGnX91cHT3uO)|`60a7035772fbdb7f34eb86d4420cf66a`|\n| `camelyonpatch_level_2_split_test_x.h5.gz`  | test images  | 0.8 GB | [Download](https://drive.google.com/uc?export=download\u0026id=1qV65ZqZvWzuIVthK8eVDhIwrbnsJdbg_)|`d5b63470df7cfa627aeec8b9dc0c066e`|\n| `camelyonpatch_level_2_split_test_y.h5.gz`  | test labels  | 3.0 KB | [Download](https://drive.google.com/uc?export=download\u0026id=17BHrSrwWKjYsOgTMmoqrIjDy6Fa2o_gP)|`2b85f58b927af9964a4c15b8f7e8f179`|\n| `camelyonpatch_level_2_split_train_meta.csv` | training meta |  | [Download](https://drive.google.com/uc?export=download\u0026id=1XoaGG3ek26YLFvGzmkKeOz54INW0fruR)|`5a3dd671e465cfd74b5b822125e65b0a`|\n| `camelyonpatch_level_2_split_valid_meta.csv` | valid meta | | [Download](https://drive.google.com/uc?export=download\u0026id=16hJfGFCZEcvR3lr38v3XCaD5iH1Bnclg)|`3455fd69135b66734e1008f3af684566`|\n| `camelyonpatch_level_2_split_test_meta.csv`  | test meta |  | 
[Download](https://drive.google.com/uc?export=download\u0026id=19tj7fBlQQrd4DapCjhZrom_fA4QlHqN4)|`67589e00a4a37ec317f2d1932c7502ca`|\n\n#### Mirror Zenodo:\nhttps://zenodo.org/record/2546921\n\n#### Baidu AI Studio:\nhttps://aistudio.baidu.com/aistudio/datasetdetail/30060\n\n## Usage and Tips\n### Keras Example\n[General dataloader for Keras](https://github.com/basveeling/pcam/blob/master/keras_pcam/dataset/pcam.py)\n\n```python\nfrom keras.utils import HDF5Matrix\nfrom keras.preprocessing.image import ImageDataGenerator\n\n# Decompress the .h5.gz downloads first; HDF5Matrix reads plain HDF5 files.\nx_train = HDF5Matrix('camelyonpatch_level_2_split_train_x.h5', 'x')\ny_train = HDF5Matrix('camelyonpatch_level_2_split_train_y.h5', 'y')\n\nbatch_size = 32  # illustrative value; `model` is assumed to be a compiled Keras model\n\ndatagen = ImageDataGenerator(\n              preprocessing_function=lambda x: x/255.,\n              width_shift_range=4,  # randomly shift images horizontally\n              height_shift_range=4,  # randomly shift images vertically\n              horizontal_flip=True,  # randomly flip images horizontally\n              vertical_flip=True)  # randomly flip images vertically\n\nmodel.fit_generator(datagen.flow(x_train, y_train, batch_size=batch_size),\n                    steps_per_epoch=len(x_train) // batch_size,\n                    epochs=1024,\n                    )\n```\n\n## Details\n### Numbers\nThe dataset is divided into a training set of 262,144 (2^18) examples, and a validation and test set both of 32,768 (2^15) examples. There is no overlap in WSIs between the splits, and all splits have a 50/50 balance between positive and negative examples.\n\n### Labeling\nA positive label indicates that the center 32x32px region of a patch contains at least one pixel of tumor tissue. Tumor tissue in the outer region of the patch does not influence the label. This outer region is provided to enable the design of fully-convolutional models that do not use any zero-padding, to ensure consistent behavior when applied to a whole-slide image. 
This is, however, not a requirement for the PCam benchmark.\n\n### Patch selection\nPCam is derived from the Camelyon16 Challenge [2], which contains 400 H\\\u0026E stained WSIs of sentinel lymph node sections. The slides were acquired and digitized at 2 different centers using a 40x objective (resultant pixel resolution of 0.243 microns). We undersample this at 10x to increase the field of view.\nWe follow the train/test split from the Camelyon16 challenge [2], and further hold out 20% of the train WSIs for the validation set. To prevent selecting background patches, slides are converted to HSV and blurred, and patches are filtered out if the maximum pixel saturation lies below 0.07 (which was validated to not throw out tumor data in the training set).\nThe patch-based dataset is sampled by iteratively choosing a WSI and selecting a positive or negative patch with probability _p_. Patches are rejected following a stochastic hard-negative mining scheme with a small CNN, and _p_ is adjusted to retain a balance close to 50/50.\n\n### Statistics\n_Coming soon_\n\n## Contact\nFor problems and questions not fit for a GitHub issue, please email [Bas Veeling](mailto:basveeling+pcam@gmail.com).\n\n## Citing PCam\nIf you use PCam in a scientific publication, we would appreciate references to the following paper:\n\n\n**[1] B. S. Veeling, J. Linmans, J. Winkens, T. Cohen, M. Welling. \"Rotation Equivariant CNNs for Digital Pathology\". [arXiv:1806.03962](http://arxiv.org/abs/1806.03962)**\n\nA citation of the original Camelyon16 dataset paper is appreciated as well:\n\n**[2] Ehteshami Bejnordi et al. Diagnostic Assessment of Deep Learning Algorithms for Detection of Lymph Node Metastases in Women With Breast Cancer. JAMA: The Journal of the American Medical Association, 318(22), 2199–2210. 
[doi:10.1001/jama.2017.14585](https://doi.org/10.1001/jama.2017.14585)**\n\n\nBiblatex entry:\n```bibtex\n@ARTICLE{Veeling2018-qh,\n  title         = \"Rotation Equivariant {CNNs} for Digital Pathology\",\n  author        = \"Veeling, Bastiaan S and Linmans, Jasper and Winkens, Jim and\n                   Cohen, Taco and Welling, Max\",\n  month         =  jun,\n  year          =  2018,\n  archivePrefix = \"arXiv\",\n  primaryClass  = \"cs.CV\",\n  eprint        = \"1806.03962\"\n}\n```\n\n\u003c!-- [Who is citing PCam?](https://scholar.google.de/scholar?hl=en\u0026as_sdt=0%2C5\u0026q=pcam\u0026btnG=\u0026oq=fas) --\u003e\n\n\n## Benchmark\n| Name  | Reference | Augmentations | Acc | AUC|  NLL | FROC* |\n| --- | --- | --- | --- | --- | --- | --- |\n| GDensenet | [1] | Following Liu et al. | 89.8 | 96.3 |  0.260 |75.8 (64.3, 87.2)|\n| [Add yours](https://github.com/basveeling/pcam/edit/master/README.md) | |\n\n\\* Performance on Camelyon16 tumor detection task, not part of the PCam benchmark.\n\n\n## Contributing\nContributions with example scripts for other frameworks are welcome!\n\n## License\nThe data is provided under the [CC0 License](https://choosealicense.com/licenses/cc0-1.0/), following the license of Camelyon16.\n\nThe rest of this repository is under the [MIT License](https://choosealicense.com/licenses/mit/).\n\n## Acknowledgements\n* Babak Ehteshami Bejnordi, Geert Litjens, Jeroen van der Laak for their input on the configuration of this dataset.\n* README derived from [Fashion-MNIST](https://github.com/zalandoresearch/fashion-mnist).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbasveeling%2Fpcam","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbasveeling%2Fpcam","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbasveeling%2Fpcam/lists"}