{"id":28700529,"url":"https://github.com/deepgraphlearning/sccello","last_synced_at":"2025-06-14T11:08:24.280Z","repository":{"id":271203691,"uuid":"880418018","full_name":"DeepGraphLearning/scCello","owner":"DeepGraphLearning","description":"[NeurIPS 2024 Spotlight] A cell-ontology guided transcriptome foundation model","archived":false,"fork":false,"pushed_at":"2025-03-12T02:28:00.000Z","size":23774,"stargazers_count":12,"open_issues_count":2,"forks_count":1,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-03-12T03:27:57.709Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DeepGraphLearning.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-10-29T17:29:36.000Z","updated_at":"2025-03-12T02:28:03.000Z","dependencies_parsed_at":null,"dependency_job_id":"10adcb98-6150-45bc-a261-f07166707514","html_url":"https://github.com/DeepGraphLearning/scCello","commit_stats":null,"previous_names":["deepgraphlearning/sccello"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/DeepGraphLearning/scCello","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DeepGraphLearning%2FscCello","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DeepGraphLearning%2FscCello/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DeepGraphLearning%2FscCello/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DeepGraphLearning%2FscCel
lo/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/DeepGraphLearning","download_url":"https://codeload.github.com/DeepGraphLearning/scCello/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DeepGraphLearning%2FscCello/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259804865,"owners_count":22913903,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-06-14T11:08:16.050Z","updated_at":"2025-06-14T11:08:24.258Z","avatar_url":"https://github.com/DeepGraphLearning.png","language":"Python","readme":"\u003cdiv align=\"center\"\u003e\n\n# scCello: Cell-ontology Guided Transcriptome Foundation Model\n\n[![pytorch](https://img.shields.io/badge/PyTorch_2.5+-ee4c2c?logo=pytorch\u0026logoColor=white)](https://pytorch.org/get-started/locally/)\n[![scCello arxiv](http://img.shields.io/badge/arxiv-2408.12373-yellow.svg)](https://arxiv.org/abs/2408.12373)\n[![HuggingFace Hub](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-black)](https://huggingface.co/collections/katarinayuan/sccello-67a01b6841f3658ba443c58a)\n![license](https://img.shields.io/badge/License-MIT-green.svg?labelColor=gray)\n\n\u003c/div\u003e\n\nPyTorch implementation of [scCello], a cell-ontology guided transcriptome foundation model (TFM) for single cell RNA-seq data. 
Authored by [Xinyu Yuan] and [Zhihao Zhan].\n\n[Xinyu Yuan]: https://github.com/KatarinaYuan\n[Zhihao Zhan]: https://github.com/zhan8855\n[scCello]: https://github.com/DeepGraphLearning/scCello\n\n## Overview ##\n\nscCello enhances transcriptome foundation models (TFMs) by integrating cell ontology graphs into pre-training, addressing the limitation of treating cells as independent entities. By incorporating two cell-level objectives, **cell-type coherence loss** and **ontology alignment loss**, scCello demonstrates superior or competitive generalization and transferability compared with existing TFMs on biologically important tasks, including identifying novel cell types of unseen cells, predicting cell-type-specific marker genes, and predicting cancer drug responses.\n\nThis repository is based on PyTorch 2.0 and Python 3.9.\n\n![Main Method](asset/main_method_sccello.png)\n\nTable of contents:\n* [Features](#features)\n* [Updates](#updates)\n* [Installation](#installation)\n* [Download](#download)\n    * [Model checkpoints](#model-checkpoints)\n    * [Pre-training and downstream datasets](#pre-training-and-downstream-datasets)\n    * [Example h5ad data](#example-h5ad-data)\n* [Usage](#usage)\n    * [h5ad data format transformation](#h5ad-data-format-transformation)\n    * [Downstream generalization](#downstream-generalization)\n         * [Cell type clustering \u0026 batch integration](#cell-type-clustering--batch-integration)\n         * [Cell type classification](#cell-type-classification)\n         * [Novel cell type classification](#novel-cell-type-classification)\n    * [Downstream transferability](#downstream-transferability)\n         * [Marker gene prediction](#marker-gene-prediction)\n         * [Cancer drug response prediction](#cancer-drug-response-prediction)\n    * [Pre-training](#pre-training)\n* [Citation](#citation)\n\n## Features ##\n* **Cell-type Specific Learning**: Utilizes cell-type coherence loss to learn specific gene expression patterns relevant 
to each cell type.\n* **Ontology-aware Modeling**: Employs ontology alignment loss to understand and preserve the hierarchical relationships among different cell types.\n* **Large-scale Pre-training**: Trained on over 22 million cells from the CellxGene database, ensuring robust and generalizable models.\n* **Advanced Generalization and Transferability**: Demonstrates superior performance on various biologically significant tasks such as identifying novel cell types and predicting cell-type-specific marker genes.\n\n## Updates ##\n* **Feb 5th, 2025**: scCello code released!\n* **Oct 1st, 2024**: scCello was accepted at NeurIPS 2024!\n* **Aug 22nd, 2024**: scCello preprint released on arXiv!\n\n## Installation ##\n\nYou may install the dependencies via the following bash commands.\n\n```bash\nconda install pytorch==2.0.1 pytorch-cuda=11.7 -c pytorch -c nvidia\npip install transformers[torch]\npip install easydict\npip install psutil\npip install wandb\npip install pytz\npip install ipdb\npip install pandas\npip install datasets\npip install torchmetrics\npip install rdflib\npip install hickle\npip install anndata\npip install scikit-learn\npip install scanpy\npip install scib\nconda install -c conda-forge cupy\nconda install rapidsai::cuml\nconda install -c rapidsai -c conda-forge -c nvidia cugraph\n```\n\n\n## Download ##\n### Model Checkpoints ###\n\nQuick-start guide for loading the scCello checkpoint:\n* for zero-shot inference tasks\n```python\nfrom sccello.src.model_prototype_contrastive import PrototypeContrastiveForMaskedLM\n\nmodel = PrototypeContrastiveForMaskedLM.from_pretrained(\"katarinayuan/scCello-zeroshot\", output_hidden_states=True)\n```\n\n* for linear probing tasks (see details in sccello/script/run_cell_type_classification.py)\n```python\nfrom sccello.src.model_prototype_contrastive import PrototypeContrastiveForSequenceClassification\n\n# NUM_LABELS, training_cfg, and training_args are assumed to be defined as in the script referenced above\nmodel_kwargs = {\n    \"num_labels\": NUM_LABELS, # number of labels for classification\n    \"total_logging_steps\": 
training_cfg[\"logging_steps\"] * training_args.gradient_accumulation_steps,\n}\n\nmodel = PrototypeContrastiveForSequenceClassification.from_pretrained(\"katarinayuan/scCello-zeroshot\", **model_kwargs)\n```\n\n### Pre-training and Downstream Datasets ###\nFor downstream tasks, the in-distribution (ID) data $D^{id}$ and the out-of-distribution (OOD) data across cell types $\\{D_i^{ct}\\}_{i\\in\\{1,2\\}}$, tissues $\\{D_i^{ts}\\}_{i\\in\\{1,2\\}}$, and donors $\\{D_i^{dn}\\}_{i\\in\\{1,2\\}}$ can be loaded as follows (see App. B of the paper for data preprocessing details).\n\n\n```python\n# Note: some datasets are extremely large. To change the data caching directory\n# (the default is \"~/.cache/huggingface/datasets/\"), run this in your shell before starting Python:\n#   export HF_HOME=\"/path/to/another/directory/datasets\"\n\nfrom datasets import load_dataset\n\nfrom sccello.src.utils import data_loading\n\n# pre-training data \u0026 D^{id}\ntrain_dataset = load_dataset(\"katarinayuan/scCello_pretrain_unsplitted\")[\"train\"]\n# train_test_split returns a DatasetDict with \"train\" and \"test\" keys\nsplit = train_dataset.train_test_split(test_size=0.001, seed=237) # seed used in scCello\ntrain_dataset, indist_test_data = split[\"train\"], split[\"test\"]\n\n# D_1^{ct} \u0026 D_2^{ct}\nd1_ct, d2_ct = data_loading.get_fracdata(\"celltype\", \"frac100\", False, False)\n\n# D_1^{ts} \u0026 D_2^{ts}\nd1_ts, d2_ts = data_loading.get_fracdata(\"tissue\", \"frac100\", False, False)\n\n# D_1^{dn} \u0026 D_2^{dn}\nd1_dn, d2_dn = data_loading.get_fracdata(\"donor\", \"frac100\", False, False)\n\n```\n\n### Example h5ad data ###\nExample data for transforming the h5ad format to the Hugging Face format.\nTo build the pre-training and downstream datasets, we downloaded a series of human h5ad datasets from [CellxGene](https://chanzuckerberg.github.io/cellxgene-census/).\n```bash\npip install gdown\ncd ./data/example_h5ad/\ngdown https://drive.google.com/uc?id=1UsbkhmZwSDWTgY4die60fHvzL_FnXtWE\n```\n\n## Usage ##\nThe `sccello/script` folder contains all executable files.\n\nGeneral 
configurations:\n```bash\npretrained_ckpt=katarinayuan/scCello-zeroshot\noutput_dir=/home/xinyu402/single_cell_output/  # replace with your own output directory\nwandb_run_name=test\n```\n\n### h5ad Data Format Transformation ###\n\n```bash\npython ./sccello/script/run_data_transformation.py\n```\n\n### Downstream Generalization ###\n\n#### Cell Type Clustering \u0026 Batch Integration ####\n\n```bash\npython ./sccello/script/run_cell_type_clustering.py --pretrained_ckpt $pretrained_ckpt --wandb_run_name $wandb_run_name --output_dir $output_dir\n```\n\n\n#### Cell Type Classification ####\n```bash\n# Linear probing\ntraining_type=linear_probing\n# or train from scratch without loading the pre-trained model\n# training_type=from_scratch_linear\n\ntorchrun ./sccello/script/run_cell_type_classification.py --pretrained_ckpt $pretrained_ckpt --training_type $training_type --wandb_run_name $wandb_run_name --further_downsample 0.01 --output_dir $output_dir\n```\n\n#### Novel Cell Type Classification ####\n```bash\npython ./sccello/script/run_novel_cell_type_classification.py --pretrained_ckpt $pretrained_ckpt --wandb_run_name $wandb_run_name --indist_repr_path ./embedding_storage/cellreprs_indist_frac_celltype_data1.pkl --output_dir $output_dir\n```\n\n### Downstream Transferability ###\n#### Marker Gene Prediction ####\n```bash\npython ./sccello/script/run_marker_gene_prediction.py --pretrained_ckpt $pretrained_ckpt --wandb_run_name $wandb_run_name --output_dir $output_dir\n```\n\n#### Cancer Drug Response Prediction ####\n```bash\npython ./sccello/script/run_cancer_drug_response.py --pretrained_ckpt $pretrained_ckpt --wandb_run_name $wandb_run_name\n```\n\n### Pre-training ###\n```bash\npython -m torch.distributed.run --nproc_per_node=1 ./sccello/script/run_pretrain_prototype_contrastive.py --wandb_run_name pretrain_test\n```\n\n## Citation ##\n\nIf you find this codebase useful in your research, please cite the original paper.\n\nThe main scCello paper:\n\n```bibtex\n@inproceedings{yuancell,\n  title={Cell ontology guided transcriptome 
foundation model},\n  author={Yuan, Xinyu and Zhan, Zhihao and Zhang, Zuobai and Zhou, Manqi and Zhao, Jianan and Han, Boyu and Li, Yue and Tang, Jian},\n  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},\n  year={2024}\n}\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeepgraphlearning%2Fsccello","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdeepgraphlearning%2Fsccello","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeepgraphlearning%2Fsccello/lists"}