{"id":29704575,"url":"https://github.com/adaptinfer/compbiodatasetsformachinelearning","last_synced_at":"2026-03-04T05:01:21.649Z","repository":{"id":114002712,"uuid":"71814342","full_name":"AdaptInfer/CompBioDatasetsForMachineLearning","owner":"AdaptInfer","description":"A Curated List of Computational Biology Datasets Suitable for Machine Learning","archived":false,"fork":false,"pushed_at":"2024-04-19T20:47:37.000Z","size":45,"stargazers_count":196,"open_issues_count":0,"forks_count":26,"subscribers_count":6,"default_branch":"master","last_synced_at":"2026-02-22T13:52:19.174Z","etag":null,"topics":["biomedical-data-science","computational-biology","computational-biology-datasets","curated-list","dataset","datasets","machine-learning"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AdaptInfer.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-10-24T17:33:59.000Z","updated_at":"2026-01-31T11:40:11.000Z","dependencies_parsed_at":"2023-12-20T11:43:16.945Z","dependency_job_id":"b40ea21d-f12d-4194-ad40-995b33beaaa0","html_url":"https://github.com/AdaptInfer/CompBioDatasetsForMachineLearning","commit_stats":null,"previous_names":["lengerichlab/compbiodatasetsformachinelearning","adaptinfer/compbiodatasetsformachinelearning"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/AdaptInfer/CompBioDatasetsForMachineLearning","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AdaptInfer%2FCompBioDatasetsForMachineLearning","tags_url":"https
://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AdaptInfer%2FCompBioDatasetsForMachineLearning/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AdaptInfer%2FCompBioDatasetsForMachineLearning/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AdaptInfer%2FCompBioDatasetsForMachineLearning/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AdaptInfer","download_url":"https://codeload.github.com/AdaptInfer/CompBioDatasetsForMachineLearning/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AdaptInfer%2FCompBioDatasetsForMachineLearning/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30071895,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-04T03:25:38.285Z","status":"ssl_error","status_checked_at":"2026-03-04T03:25:05.086Z","response_time":59,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["biomedical-data-science","computational-biology","computational-biology-datasets","curated-list","dataset","datasets","machine-learning"],"created_at":"2025-07-23T14:11:36.150Z","updated_at":"2026-03-04T05:01:20.929Z","avatar_url":"https://github.com/AdaptInfer.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# Computational Biology Datasets Suitable For Machine 
Learning\nThis is a curated list of computational biology datasets that have been pre-processed for machine learning.\nThis list is a work in progress, please submit a pull request for any dataset you would like to advertise!\n\n## Genotyping\n|Name | Description | Comments |\n|:-:|---|---|\n|[The Cancer Genome Atlas](https://cancergenome.nih.gov/)| Variety of Cancer Data  | most cancer types have 100-1000 samples  |\n|[NIH GDC](https://gdc-portal.nci.nih.gov/)| Cancer, many types of genomic data  |   |\n|[UK Biobank](http://www.ukbiobank.ac.uk/about-biobank-uk/) |   |   |\n|[European Genome-Phenome Archive](https://www.ebi.ac.uk/ega/datasets)| |  |\n|[METABRIC](http://www.cbioportal.org/study?id=brca_metabric#summary)| The genomic profiles (somatic mutations [targeted sequencing], copy number alterations, and gene expression) of 2509 breast cancers.| |\n|[HapMap](https://www.genome.gov/hapmap/)| | |\n|[23andMe](http://www.biorxiv.org/content/early/2017/04/19/127241)| 2280 Public Domain Curated Genotypes | |\n|[Mice](http://wp.cs.ucl.ac.uk/outbredmice/heterogeneous-stock-mice/) | SNPs, 2000+ samples | 4 generations. It might be possible to learn a family structure out of the data.  
|\n|[Arabidopsis](https://www.arabidopsis.org/download/) | SNPs, 100+ phenotypes | |\n\n## Promoter-Enhancer Pairs\n|Name | Description | Comments |\n|:-:|---|---|\n|[TargetFinder](https://github.com/shwhalen/targetfinder)|~100,000 DNA-DNA interaction pairs | |\n\n## Gene/Protein Expression\n|Name | Description | Comments |\n|:-:|---|---|\n|[GEO](http://www.ncbi.nlm.nih.gov/geo/) | Main place for NCBI data |  |\n|[ENCODE](http://www.encodeproject.org/) | Variety of assays to identify functional elements | |\n|[ArrayExpress](http://www.ebi.ac.uk/arrayexpress/) | DNA sequencing, gene/protein expression, epigenetics | |\n|[Cytometry\tContinuous](http://science.sciencemag.org/content/308/5721/523) | flow cytometry data of 11 proteins+phospholipids, Discretized and cleaned data available offline\t| Classical benchmark dataset for learning graphical models; contains known errors |\n|[Transcription factor binding](http://www.pnas.org/content/106/51/21521.abstract?tab=ds) |\tChIP-Seq data on 12 TFs |\t |\n|[GTEx](http://www.gtexportal.org/home/) | Landmark study for EQTL analysis | |\n|[PharmacoGenomics DB](https://www.pharmgkb.org/)\t|\t| |\n|[ProteomeXChange](http://www.proteomexchange.org/)| | |\n|[BeatAML](https://www.nature.com/articles/s41586-018-0623-z)| whole-exome sequencing, RNA sequencing and analyses of ex vivo drug sensitivity | 672 tumour specimens collected from 562 patients |\n\n## Single-cell Data\n|Name | Description | Comments |\n|:-:|---|---|\n|[Single-cell expression atlas](https://www.ebi.ac.uk/gxa/sc/) | | |\n|[scPerturb](https://www.nature.com/articles/s41592-023-02144-y) | single-cell perturbation-response datasets | harmonized and preprocessed across 44 original datasets |\n\n## Regulatory Networks\n|Name | Description | Comments |\n|:-:|---|---|\n|[TRRUST](http://www.grnpedia.org/trrust/)| manually curated database of human transcriptional regulatory network |  |\n|[Yeast Network](http://science.sciencemag.org/content/353/6306/aaf1420/tab-pdf)| 
23-million yeast 2-hybrid experiments to investigate genetic interactions |  |\n|[Perturb-Seq](http://www.sciencedirect.com/science/article/pii/S0092867416316105)| Integrated model of perturbations, single cell phenotypes, and epistatic interactions |  |\n|[KEGG Metabolic Regulatory Network (Undirected)](https://archive.ics.uci.edu/ml/datasets/KEGG+Metabolic+Reaction+Network+%28Undirected%29) | 65554 instances, 29 attributes each |  |\n|[KEGG Metabolic Regulatory Network (Directed)](https://archive.ics.uci.edu/ml/datasets/KEGG+Metabolic+Relation+Network+%28Directed%29) |53414 instances, 24 attributes each |  |\n\n## Images\n|Name | Description | Comments |\n|:-:|---|---|\n|[The Cancer Imaging Archive](http://www.cancerimagingarchive.net/)| Extracts the images from the TCGA data | |\n|[Multiple Myeloma DREAM Challenge](https://www.synapse.org/#!Synapse:syn6187098/wiki/401884)| Challenge to identify Multiple Myeloma Patients |  |\n|[Breast Cancer Wisconsin (Diagnostic) Data Set](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data)| Predict whether the cancer is benign or malignant | |\n|[DDSM](http://marathon.csee.usf.edu/Mammography/Database.html)|Mammogram Database | |\n|[Kaggle Soft Tissue Sarcomas](https://www.kaggle.com/4quant/soft-tissue-sarcoma)| Preprocessed subset of the TCIA study \"Soft Tissue Sarcoma\" | segmentation task |\n|[Kaggle Cervical Cancer Screening](https://www.kaggle.com/c/intel-mobileodt-cervical-cancer-screening)| Classify cervix type from images| |\n|[CAMELYON17](https://camelyon17.grand-challenge.org/) | Pathology challenge - automated detection and classification of breast cancer metastases in whole-slide images of histological lymph node sections| |\n|[Grand Challenges](https://grand-challenge.org/all_challenges/) | Datasets from biomedical image analysis competitions | |\n|[Breast Cancer MRI Dataset](https://sites.duke.edu/mazurowski/resources/breast-cancer-mri-dataset/) | Demographic, clinical, pathology, treatment, outcomes, and 
genomic data + MRI images | |\n\n## fMRI\n|Name | Description | Comments |\n|:-:|---|---|\n|[ENIGMA Cerebellum](https://my.vanderbilt.edu/enigmacerebellum/)| Goal: Examine the relationships between regional atrophy and motor and cognitive dysfunction | |\n|[Seizure Prediction](https://www.kaggle.com/c/melbourne-university-seizure-prediction/data) | Goal: Classify EEG time series into pre-seizure vs. interictal (i.e., not preceding a seizure). | |\n\n## Electronic Medical Records\n|Name | Description | Comments |\n|:-:|---|---|\n|[MIMIC](https://mimic.physionet.org/)| 59,000 EHRs |  |\n|[UCI Diabetes](https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008)| 130 US hospital data for 1999-2008| |\n|[i2b2](https://www.i2b2.org/NLP/DataSets/Main.php) | Clinical notes only, designed for NLP tasks | |\n|[PhysioNet](https://www.physionet.org/physiobank/database/) |  | |\n|[Metadata Acquired from Clinical Case Reports (MACCRs)](https://www.nature.com/articles/sdata2018258) | 3,100 curated clinical case reports spanning 15 disease groups and more than 750 reports of rare diseases | |\n|[eICU](https://www.nature.com/articles/sdata2018178)| 200k EHRs | |\n|[All of Us](https://databrowser.researchallofus.org/)| \u003e250k EHRs, some genomic data | |\n|[PMC-Patients](https://www.nature.com/articles/s41597-023-02814-8)| 167k patient summaries with 3.1 M patient-article relevance annotations and 293k patient-patient similarity annotations | |\n\n## Radiographs\n|Name | Description | Comments |\n|:-:|---|---|\n|[CheXPert](https://stanfordmlgroup.github.io/competitions/chexpert/) | 200k chest radiographs |  Competition and leaderboard associated |\n|[MIMIC-CXR](https://arxiv.org/abs/1901.07042) | ~400k chest x-rays, 14 labels | Data on PhysioNet |\n|[PadChest](http://bimcv.cipf.es/bimcv-projects/padchest/) | 160k chest x-rays, 174 different findings | |\n\n## Protein-Protein Interactions\n|Name | Description | Comments |\n|:-:|---|---|\n|[HINT 
(High-quality INTeractomes)](http://hint.yulab.org/) |  curated compilation of high-quality protein-protein interactions from 8 interactome resources | |\n\n## Longitudinal Studies\n|Name | Description | Comments |\n|:-:|---|---|\n|[National Population Health Survey](http://www.statcan.gc.ca/eng/survey/household/3225)| Longitudinal Survey that collects health information via surveys every two years. | |\n\n## Protein Structure\n|Name | Description | Comments |\n|:-:|---|---|\n|[ProteinNet](https://github.com/aqlaboratory/proteinnet) | Standardized dataset for learning protein structure. Includes sequences, structures, alignments, PSSMs, and standardized train/test/valid splits. | |\n\n## Natural Language Data\n|Name | Description | Comments |\n|:-:|---|---|\n|[BioASQ](http://www.bioasq.org/) | Abstracts of medical articles (from PubMed); ontologies of medical concepts. | Tasks: MLC, QA. |\n|[Cases](http://www.casesdatabase.com/) | Articles from medical case studies. | |\n|[UPMC Pathology](http://path.upmc.edu/cases.html) | UPMC Pathology case studies. | |\n\n## Therapeutics\n|Name | Description | Comments |\n|:-:|---|---|\n|[Therapeutic Data Commons](https://tdcommons.ai/)| Many preprocessed datasets for therapeutic discovery, including target discovery, activity modeling, efficacy and safety, and manufacturing. | Available as Python modules. 
|\n|[Cancer Omics Drug Experiment Response Dataset](https://github.com/PNNL-CompBio/coderdata)| Molecular datasets paired with corresponding drug sensitivity data | Seeks to standardize datasets of cancer drug responses into a [standard schema](https://github.com/PNNL-CompBio/coderdata/blob/main/schema/README.md) |\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fadaptinfer%2Fcompbiodatasetsformachinelearning","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fadaptinfer%2Fcompbiodatasetsformachinelearning","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fadaptinfer%2Fcompbiodatasetsformachinelearning/lists"}