{"id":28676547,"url":"https://github.com/zjunlp/instructcell","last_synced_at":"2025-06-13T23:05:05.924Z","repository":{"id":271172948,"uuid":"851487520","full_name":"zjunlp/InstructCell","owner":"zjunlp","description":"A Multi-Modal AI Copilot for Single-Cell Analysis with Instruction Following","archived":false,"fork":false,"pushed_at":"2025-01-15T02:21:23.000Z","size":11244,"stargazers_count":19,"open_issues_count":0,"forks_count":0,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-01-15T04:02:44.396Z","etag":null,"topics":["ai","ai-copilot","ai-for-science","artificial-intelligence","cell","cell-type-annotation","chatbot","copilot","instruction-following","large-language-model","multimodal-large-language-models","natural-language-processing","single-cell-analysis"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zjunlp.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-09-03T07:27:04.000Z","updated_at":"2025-01-15T02:21:24.000Z","dependencies_parsed_at":"2025-01-06T03:34:31.499Z","dependency_job_id":null,"html_url":"https://github.com/zjunlp/InstructCell","commit_stats":null,"previous_names":["zjunlp/instructcell"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/zjunlp/InstructCell","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zjunlp%2FInstructCell","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zjunlp%2FInstructCell/tags","releases_url":"https://repos
.ecosyste.ms/api/v1/hosts/GitHub/repositories/zjunlp%2FInstructCell/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zjunlp%2FInstructCell/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zjunlp","download_url":"https://codeload.github.com/zjunlp/InstructCell/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zjunlp%2FInstructCell/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259732771,"owners_count":22903087,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","ai-copilot","ai-for-science","artificial-intelligence","cell","cell-type-annotation","chatbot","copilot","instruction-following","large-language-model","multimodal-large-language-models","natural-language-processing","single-cell-analysis"],"created_at":"2025-06-13T23:05:05.270Z","updated_at":"2025-06-13T23:05:05.884Z","avatar_url":"https://github.com/zjunlp.png","language":"Jupyter Notebook","readme":"\u003ch1 align=\"center\"\u003e 🎨 InstructCell \u003c/h1\u003e\n\u003ch3 align=\"center\"\u003e A Multi-Modal AI Copilot for Single-Cell Analysis with Instruction Following \u003c/h3\u003e\n\n[![Awesome](https://awesome.re/badge.svg)](https://github.com/zjunlp/InstructCell) \n[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)\n![](https://img.shields.io/github/last-commit/zjunlp/InstructCell?color=green) \n\n## Table of Contents\n- 🗞️ [Overview](#1)\n- 🗝️ [Quick start](#2)\n- 🚀 [How to run](#3)\n- 🌻 
[Acknowledgement](#4)\n- 🔖 [Citation](#5)\n\n\u003ch2 id=\"1\"\u003e🗞️ Overview\u003c/h2\u003e\n\n**InstructCell** is a multi-modal AI copilot that integrates natural language with single-cell RNA sequencing data, enabling researchers to perform tasks like cell type annotation, pseudo-cell generation, and drug sensitivity prediction through intuitive text commands. \nBy leveraging a specialized multi-modal architecture and our multi-modal single-cell instruction dataset, InstructCell reduces technical barriers and enhances accessibility for single-cell analysis.\n\n**InstructCell** has two versions:\n\n1. **Chat Version**: Supports generating both detailed textual answers and single-cell data, offering comprehensive and context-rich outputs.\n2. **Instruct Version**: Supports generating only the answer portion without additional explanatory text, providing concise and task-specific outputs.\n   \nBoth versions of the model are available for download from Hugging Face ([zjunlp/InstructCell-chat](https://huggingface.co/zjunlp/InstructCell-chat) and [zjunlp/InstructCell-instruct](https://huggingface.co/zjunlp/InstructCell-instruct)).\n\n\u003cimg width=\"1876\" alt=\"image\" src=\"https://github.com/user-attachments/assets/3fefe71c-3c00-4c21-b388-cf2300fb9f90\" /\u003e\n\n\n\u003ch2 id=\"2\"\u003e🗝️ Quick start\u003c/h2\u003e\n\n### 🪜 Requirements\n- Python 3.10 or above is recommended\n- CUDA 11.7 or above is recommended\n\nWe provide a simple example for quick reference, demonstrating a basic **cell type annotation** workflow. 
\n\nMake sure to specify the paths for `H5AD_PATH` and `GENE_VOCAB_PATH` appropriately:\n- `H5AD_PATH`: Path to your `.h5ad` single-cell data file (e.g., `H5AD_PATH = \"path/to/your/data.h5ad\"`).\n- `GENE_VOCAB_PATH`: Path to your gene vocabulary file (e.g., `GENE_VOCAB_PATH = \"path/to/your/gene_vocab.npy\"`).\n\n```python\nfrom mmllm.module import InstructCell\nimport anndata\nimport numpy as np\nfrom utils import unify_gene_features\n\n# Load the pre-trained InstructCell model from HuggingFace\nmodel = InstructCell.from_pretrained(\"zjunlp/InstructCell-chat\")\n\n# Load the single-cell data (H5AD format) and gene vocabulary file (numpy format)\nadata = anndata.read_h5ad(H5AD_PATH)\ngene_vocab = np.load(GENE_VOCAB_PATH)\nadata = unify_gene_features(adata, gene_vocab, force_gene_symbol_uppercase=False)\n\n# Select a random single-cell sample and extract its gene counts and metadata\nk = np.random.randint(0, len(adata)) \ngene_counts = adata[k, :].X.toarray()\nsc_metadata = adata[k, :].obs.iloc[0].to_dict()\n\n# Define the model prompt with placeholders for metadata and gene expression profile\nprompt = (\n    \"Can you help me annotate this single cell from a {species}? \" \n    \"It was sequenced using {sequencing_method} and is derived from {tissue}. \" \n    \"The gene expression profile is {input}. Thanks!\"\n)\n\n# Use the model to generate predictions\nfor key, value in model.predict(\n    prompt, \n    gene_counts=gene_counts, \n    sc_metadata=sc_metadata, \n    do_sample=True, \n    top_p=0.95,\n    top_k=50,\n    max_new_tokens=256,\n).items():\n    # Print each key-value pair\n    print(f\"{key}: {value}\")\n```\n\nFor more detailed explanations and additional examples, please refer to the Jupyter notebook [demo.ipynb](https://github.com/zjunlp/InstructCell/blob/main/demo.ipynb).\n  \n\u003ch2 id=\"3\"\u003e🚀 How to run\u003c/h2\u003e\n\nAssume your current directory path is `DIR_PATH`. 
\n\n### 🧫 Collecting Raw Single-Cell Datasets\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg width=\"500\" alt=\"image\" src=\"https://github.com/user-attachments/assets/b2002629-a2dc-4009-976e-f63fa6d4aec6\" /\u003e\n\u003c/div\u003e\n\nThe datasets used in the paper are all publicly available. \nDetailed instructions and dataset links are provided in the Jupyter notebooks: [`HumanUnified.ipynb`](https://github.com/zjunlp/InstructCell/blob/main/HumanUnified.ipynb) and [`MouseUnified.ipynb`](https://github.com/zjunlp/InstructCell/blob/main/MouseUnified.ipynb). Below is a summary of the datasets and their corresponding details:\n\n\n|Dataset|Species|Task|Data Repository|Download Link|\n|:-------:|:-------:|:----:|:---------------:|:-------------:|\n|Xin-2016|human|cell type annotation|GEO|https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE114297|\n|Segerstolpe-2016|human|cell type annotation|BioStudies|https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-5061|\n|He-2020|human|cell type annotation|GEO|https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE159929|\n|PBMC68K|human|conditional pseudo cell generation|Figshare|https://figshare.com/s/49b29cb24b27ec8b6d72|\n|GSE117872|human|drug sensitivity prediction|GEO|https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE117872|\n|GSE149383|human|drug sensitivity prediction|GitHub|https://github.com/OSU-BMBL/scDEAL|\n|Ma-2020|mouse|cell type annotation|GEO|https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE140203|\n|Bastidas-Ponce-2019|mouse|cell type annotation|GEO|https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE132188|\n|GSE110894|mouse|drug sensitivity prediction|GEO|https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE110894|\n|Mouse-Atlas|mouse|conditional pseudo cell generation|GEO|https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM4505404|\n\n🔗 Please Note:\n\nFor the **He-2020** dataset, the cell type annotation file is sourced from the GitHub repository 
[scRNA-AHCA](https://github.com/bei-lab/scRNA-AHCA/tree/master/Cell_barcode_and_corresponding_cell_types_of_AHCA) 👈. \n\n\n\n### ⚙️ Installation Guide\n\nFollow these steps to set up InstructCell:\n\n1. Clone the repository:\n```sh\ngit clone https://github.com/zjunlp/InstructCell.git\n```\n2. Set up a virtual environment and install the dependencies:\n```sh\nconda create -n instructcell python=3.10\nconda activate instructcell\ncd InstructCell\npip install -r requirements.txt\n```\n\n### 🌐 Downloading Pre-trained Language Models \nThe pre-trained language model used in this project is **T5-base**. You can download it from 🤗 [Hugging Face](https://huggingface.co/google-t5/t5-base) and place the corresponding model directory under `DIR_PATH`.\n\nAlternatively, you can use the provided script to automate the download process:\n```sh\npython download_script.py --repo_id google-t5/t5-base --parent_dir ..\n```\n\n### 🛠️ Single Cell Data Preprocessing\nNavigate to the parent directory `DIR_PATH` and organize your data by creating a main data folder and three task-specific subfolders:\n```sh\ncd ..\nmkdir data \ncd data\nmkdir cell_type_annotation \nmkdir drug_sensitivity_prediction \nmkdir conditional_pseudo_cell_generation\ncd ..\n```\n\nFor dataset preprocessing, refer to the previously mentioned Jupyter notebooks:\n- [HumanUnified.ipynb](https://github.com/zjunlp/InstructCell/blob/main/HumanUnified.ipynb) for human datasets.\n- [MouseUnified.ipynb](https://github.com/zjunlp/InstructCell/blob/main/MouseUnified.ipynb) for mouse datasets.\n\n\n\n\u003e [!NOTE]\n\u003e Matching orthologous genes between mouse and human is based on [pybiomart](https://github.com/jrderuiter/pybiomart/tree/develop) and [pyensembl](https://github.com/openvax/pyensembl). 
Before preprocessing mouse datasets, ensure the corresponding Ensembl data are downloaded by running:\n```sh\npyensembl install --release 100 --species mus_musculus\n```\n\nAfter completing the preprocessing steps, split each dataset and build a gene vocabulary using the following command:\n```sh\ncd InstructCell\npython preprocess.py --n_top_genes 3600 \n```\nTo customize the size of the gene vocabulary, adjust the `n_top_genes` parameter as needed. For instance, setting it to 2000 will generate a smaller vocabulary. At this point, two files, `gene_vocab.npy` and `choices.pkl`, are generated. The first file stores the selected genes, while the second holds the category labels for each classification dataset. The gene vocabulary and label set used in this project can both be found in this [folder](https://github.com/zjunlp/InstructCell/tree/main/exp_log).\n\n\n### 🧺 Instruction-Response Template Construction\n\nThe instruction-response templates used in the project are stored in this [folder](https://github.com/zjunlp/InstructCell/tree/main/exp_log/exp_templates).\n\n\u003cdiv align=\"center\"\u003e\n\n\u003cimg width=\"800\" alt=\"image\" src=\"https://github.com/user-attachments/assets/a58e5c62-c6dd-4fac-8677-c47c4cb7c093\" /\u003e\n\n\u003c/div\u003e\n\nThe construction of instruction-response templates is divided into four stages:\n1. **Motivation and personality generation**: In this stage, the large language model is prompted to generate potential motivations for each task and corresponding personalities. This step is implemented in the `data_synthesis.py` script.\n2. **Template synthesis via parallel API calls**: Multiple APIs are run in parallel to synthesize templates, with each API invoked a specified number of times per task. This process is also implemented in the `data_synthesis.py` script.\n3. **Merging synthesized templates**: The generated templates are consolidated into a unified collection using the `merge_templates.py` script.\n4. 
**Filtering and splitting templates**: Finally, the templates are filtered for quality and divided into specific datasets using the `split_templates.py` script.\n\n\nTo execute all four stages in sequence, use the `run_data_synthesis.sh` script:\n```sh\nbash run_data_synthesis.sh\n```\n\n\u003e [!NOTE]\n\u003e Before executing `run_data_synthesis.sh`, ensure the parameters in the script are configured correctly. Update the API keys and base URL as needed, specify the model for template synthesis (`model` in the script), and adjust the number of API calls per task (`num_templates_for_task` in the script).\n\n\n### 🚀 Training InstructCell \n\n\u003cdiv align=\"center\"\u003e\n     \u003cimg width=\"650\" alt=\"image\" src=\"https://github.com/user-attachments/assets/82ed82c4-5d9d-4e84-9ce2-dc11fc4e560e\" /\u003e\n\u003c/div\u003e\n\nTo train InstructCell, use the following command:\n```sh\ntorchrun --nproc_per_node=8 mm_train.py \\\n    --epochs 160 \\\n    --save_freq 20 \\\n    --batch_size 64 \\\n    --train_template_dir ../output/train_templates \\\n    --valid_template_dir ../output/valid_templates \\\n    --save_dir ../checkpoints \\\n    --best_model_dir ../trained_models \\\n    --train_no_extra_output_ratio 1.0 \\\n    --eval_no_extra_output_ratio 1.0\n```\n- To obtain the chat version of InstructCell, set both `train_no_extra_output_ratio` and `eval_no_extra_output_ratio` to 0. 
\n- To resume training from a specific checkpoint (`YOUR_CHECKPOINT_PATH`), include the flags `--resume True` and `--resume_path YOUR_CHECKPOINT_PATH`.\n- For training on a single task and dataset, modify the `TASKS` parameter in `metadata.py`, retain only one dataset directory in the corresponding task folder, and set `--unify_gene False`.\n- You can customize the architecture of InstructCell (e.g., the number of query tokens in Q-Former or the latent variable dimensions in the VAE) by modifying the `MODEL_PARAMETERS` in `metadata.py`.\n\n\n\n### 📑 Evaluation\nTo evaluate the performance of InstructCell on conditional pseudo-cell generation, run:\n```sh\npython evaluate.py \\\n    --best_model_path ../trained_models/best_mm_model.pkl \\\n    --task_type \"conditional pseudo cell generation\" \\\n    --template_dir_name ../output/test_templates \\\n    --no_extra_output_ratio 1.0 \n```\n- To evaluate InstructCell on other tasks, modify the `task_type` parameter to `\"cell type annotation\"` or `\"drug sensitivity prediction\"` accordingly.\n- To test InstructCell’s robustness to different task descriptions, add the flag `--evaluate_single_prompt True`. By default, 20 different task descriptions are used. To increase this number (e.g., to 40), include `--num_single_prompt 40`.\n- If you want to evaluate only test templates that contain options, add `--provide_choices True`. By default, all test templates are evaluated.\n- To evaluate the **chat** version of InstructCell, set the `no_extra_output_ratio` parameter to 0.0.  This will generate content formatted for xFinder’s JSON input requirements.  For detailed evaluation procedures using xFinder, please visit the [xFinder repository](https://github.com/IAAR-Shanghai/xFinder) 👈. 
Alternatively, you can refer to the [README](https://github.com/zjunlp/InstructCell/blob/main/xfinder/README.md) provided in [our repository](https://github.com/zjunlp/InstructCell/tree/main/xfinder) for additional guidance.\n\n\u003c!-- ## 🧬 Extracting Marker Genes --\u003e\n\n\u003c!-- ## 🌠 Visualization --\u003e \n\n\u003c!-- ## 🎬 Demo  --\u003e \n\n\u003ch2 id=\"4\"\u003e🌻 Acknowledgement\u003c/h2\u003e\n\nWe would like to express our sincere gratitude for the excellent works [ALBEF](https://github.com/salesforce/ALBEF) and [scvi-tools](https://github.com/scverse/scvi-tools).\n\n\n\u003ch2 id=\"5\"\u003e🔖 Citation\u003c/h2\u003e\n\nIf you use the code or data, please cite the following paper:\n\n\n```bibtex\n@article{fang2025instructcell,\n  title={A Multi-Modal AI Copilot for Single-Cell Analysis with Instruction Following},\n  author={Fang, Yin and Deng, Xinle and Liu, Kangwei and Zhang, Ningyu and Qian, Jingyang and Yang, Penghui and Fan, Xiaohui and Chen, Huajun},\n  journal={arXiv preprint arXiv:2501.08187},\n  year={2025}\n}\n```\n\n## ✨ Contributors\n\n\u003ca href=\"https://github.com/zjunlp/InstructCell/graphs/contributors\"\u003e\n  \u003cimg src=\"https://contrib.rocks/image?repo=zjunlp/InstructCell\" /\u003e\n\u003c/a\u003e\n\nWe will provide long-term maintenance to fix bugs and resolve issues, so if you encounter any problems, please open an issue.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzjunlp%2Finstructcell","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzjunlp%2Finstructcell","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzjunlp%2Finstructcell/lists"}