{"id":27647789,"url":"https://github.com/cafferychen777/mLLMCelltype","last_synced_at":"2025-04-24T02:01:46.645Z","repository":{"id":287538807,"uuid":"961664731","full_name":"cafferychen777/mLLMCelltype","owner":"cafferychen777","description":"An iterative multi-LLM consensus framework for accurate cell type annotation in single-cell RNA-seq data","archived":false,"fork":false,"pushed_at":"2025-04-19T14:56:36.000Z","size":8696,"stargazers_count":20,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-19T15:09:34.569Z","etag":null,"topics":["bioinformatics","cell-type-annotation","consensus-algorithm","large-language-models","llm","scanpy","seurat","single-cell"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cafferychen777.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-04-07T00:56:29.000Z","updated_at":"2025-04-19T14:56:40.000Z","dependencies_parsed_at":"2025-04-13T06:47:01.822Z","dependency_job_id":null,"html_url":"https://github.com/cafferychen777/mLLMCelltype","commit_stats":null,"previous_names":["cafferychen777/mllmcelltype"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cafferychen777%2FmLLMCelltype","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cafferychen777%2FmLLMCelltype/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cafferychen777%2FmLLMCelltype/
releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cafferychen777%2FmLLMCelltype/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cafferychen777","download_url":"https://codeload.github.com/cafferychen777/mLLMCelltype/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250137085,"owners_count":21380951,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","cell-type-annotation","consensus-algorithm","large-language-models","llm","scanpy","seurat","single-cell"],"created_at":"2025-04-24T02:01:03.102Z","updated_at":"2025-04-24T02:01:46.616Z","avatar_url":"https://github.com/cafferychen777.png","language":"Python","funding_links":[],"categories":["Software packages and methods","Software packages","Useful Links:","🔬 Domain-Specific Applications"],"sub_categories":["Single cell multi-omics","Cell type identification and classification","Data Analysis Pipeline","🧬 Biology \u0026 Medicine"],"readme":"\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"assets/mLLMCelltype_logo.png\" alt=\"mLLMCelltype Logo\" width=\"300\"/\u003e\n\u003c/div\u003e\n\n\u003cdiv align=\"center\"\u003e\n  \u003ca href=\"README_CN.md\"\u003e中文\u003c/a\u003e | \u003ca href=\"README_ES.md\"\u003eEspañol\u003c/a\u003e | \u003ca href=\"README_JP.md\"\u003e日本語\u003c/a\u003e | \u003ca href=\"README_DE.md\"\u003eDeutsch\u003c/a\u003e | \u003ca href=\"README_FR.md\"\u003eFrançais\u003c/a\u003e | \u003ca 
href=\"README_KR.md\"\u003e한국어\u003c/a\u003e\n\u003c/div\u003e\n\nmLLMCelltype is an iterative multi-LLM consensus framework for cell type annotation in single-cell RNA sequencing data. By leveraging the complementary strengths of multiple large language models (OpenAI GPT-4o/4.1, Anthropic Claude-3.7/3.5, Google Gemini-2.0, X.AI Grok-3, DeepSeek-V3, Alibaba Qwen2.5, Zhipu GLM-4, MiniMax, Stepfun, and OpenRouter), this framework significantly improves annotation accuracy while providing transparent uncertainty quantification.\n\n## Key Features\n\n- **Multi-LLM Consensus Architecture**: Harnesses collective intelligence from diverse LLMs to overcome single-model limitations and biases\n- **Structured Deliberation Process**: Enables LLMs to share reasoning, evaluate evidence, and refine annotations through multiple rounds of collaborative discussion\n- **Transparent Uncertainty Quantification**: Provides quantitative metrics (Consensus Proportion and Shannon Entropy) to identify ambiguous cell populations requiring expert review\n- **Hallucination Reduction**: Cross-model deliberation actively suppresses inaccurate or unsupported predictions through critical evaluation\n- **Robust to Input Noise**: Maintains high accuracy even with imperfect marker gene lists through collective error correction\n- **Hierarchical Annotation Support**: Optional extension for multi-resolution analysis with parent-child consistency\n- **No Reference Dataset Required**: Performs accurate annotation without pre-training or reference data\n- **Complete Reasoning Chains**: Documents the full deliberation process for transparent decision-making\n- **Seamless Integration**: Works directly with standard Scanpy/Seurat workflows and marker gene outputs\n- **Modular Design**: Easily incorporate new LLMs as they become available\n\n## Directory Structure\n\n- `R/`: R language interface and implementation\n- `python/`: Python interface and implementation\n\n## Installation\n\n### R 
Version\n\n```r\n# Install from GitHub\ndevtools::install_github(\"cafferychen777/mLLMCelltype\", subdir = \"R\")\n```\n\n### Python Version\n\n```bash\n# Install from PyPI\npip install mllmcelltype\n\n# Or install from GitHub\npip install git+https://github.com/cafferychen777/mLLMCelltype.git\n```\n\n### Supported Models\n\n- **OpenAI**: GPT-4.1/GPT-4.5/GPT-4o ([API Key](https://platform.openai.com/settings/organization/billing/overview))\n- **Anthropic**: Claude-3.7-Sonnet/Claude-3.5-Haiku ([API Key](https://console.anthropic.com/))\n- **Google**: Gemini-2.0-Pro/Gemini-2.0-Flash ([API Key](https://ai.google.dev/?authuser=2))\n- **Alibaba**: Qwen2.5-Max ([API Key](https://www.alibabacloud.com/en/product/modelstudio))\n- **DeepSeek**: DeepSeek-V3/DeepSeek-R1 ([API Key](https://platform.deepseek.com/usage))\n- **Minimax**: MiniMax-Text-01 ([API Key](https://intl.minimaxi.com/user-center/basic-information/interface-key))\n- **Stepfun**: Step-2-16K ([API Key](https://platform.stepfun.com/account-info))\n- **Zhipu**: GLM-4 ([API Key](https://bigmodel.cn/))\n- **X.AI**: Grok-3/Grok-3-mini ([API Key](https://accounts.x.ai/))\n- **OpenRouter**: Access to multiple models through a single API ([API Key](https://openrouter.ai/keys))\n  - Supports models from OpenAI, Anthropic, Meta, Google, Mistral, and more\n  - Format: 'provider/model-name' (e.g., 'openai/gpt-4o', 'anthropic/claude-3-opus')\n\n## Usage Examples\n\n### Python\n\n```python\nimport scanpy as sc\nimport pandas as pd\nfrom mllmcelltype import annotate_clusters, setup_logging, interactive_consensus_annotation\nimport os\n\n# Set up logging\nsetup_logging()\n\n# Load your data\nadata = sc.read_h5ad('your_data.h5ad')\n\n# Check if leiden clustering is already computed, if not, compute it\nif 'leiden' not in adata.obs.columns:\n    print(\"Computing leiden clustering...\")\n    # Ensure data is preprocessed (normalize, log-transform if needed)\n    if 'log1p' not in adata.uns:\n        sc.pp.normalize_total(adata, 
target_sum=1e4)\n        sc.pp.log1p(adata)\n    \n    # Calculate PCA if not already done\n    if 'X_pca' not in adata.obsm:\n        sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)\n        sc.pp.pca(adata, use_highly_variable=True)\n    \n    # Compute neighbors and leiden clustering\n    sc.pp.neighbors(adata, n_neighbors=10, n_pcs=30)\n    sc.tl.leiden(adata, resolution=0.8)\n    print(f\"Leiden clustering completed, found {len(adata.obs['leiden'].cat.categories)} clusters\")\n\n# Run differential expression analysis to get marker genes\nsc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon')\n\n# Extract marker genes for each cluster\nmarker_genes = {}\nfor i in range(len(adata.obs['leiden'].cat.categories)):\n    # Extract top 10 genes for each cluster\n    genes = [adata.uns['rank_genes_groups']['names'][str(i)][j] for j in range(10)]\n    marker_genes[str(i)] = genes\n\n# IMPORTANT: Ensure genes are represented as gene symbols (e.g., KCNJ8, PDGFRA) not as Ensembl IDs (e.g., ENSG00000176771)\n# If your AnnData object stores genes as Ensembl IDs, convert them to gene symbols first:\n# Example:\n# if 'Gene' in adata.var.columns:  # Check if gene symbols are available in the var dataframe\n#     gene_name_dict = dict(zip(adata.var_names, adata.var['Gene']))\n#     marker_genes = {cluster: [gene_name_dict.get(gene_id, gene_id) for gene_id in genes] \n#                    for cluster, genes in marker_genes.items()}\n\n# Set API keys for the providers you want to use\n# You need at least one API key for the models you plan to use\nos.environ[\"OPENAI_API_KEY\"] = \"your-openai-api-key\"      # Required for GPT models\nos.environ[\"ANTHROPIC_API_KEY\"] = \"your-anthropic-api-key\"  # Required for Claude models\nos.environ[\"GEMINI_API_KEY\"] = \"your-gemini-api-key\"      # Required for Gemini models\nos.environ[\"QWEN_API_KEY\"] = \"your-qwen-api-key\"        # Required for Qwen models\n# Additional optional models\n# 
os.environ[\"DEEPSEEK_API_KEY\"] = \"your-deepseek-api-key\"   # For DeepSeek models\n# os.environ[\"ZHIPU_API_KEY\"] = \"your-zhipu-api-key\"       # For GLM models\n# os.environ[\"STEPFUN_API_KEY\"] = \"your-stepfun-api-key\"    # For Step models\n# os.environ[\"MINIMAX_API_KEY\"] = \"your-minimax-api-key\"    # For MiniMax models\n\n# Run consensus annotation with multiple models\nconsensus_results = interactive_consensus_annotation(\n    marker_genes=marker_genes,\n    species=\"human\",\n    tissue=\"blood\",\n    models=[\"gpt-4o\", \"claude-3-7-sonnet-20250219\", \"gemini-1.5-pro\", \"qwen-max-2025-01-25\"],\n    consensus_threshold=1,  # Adjust threshold for consensus agreement\n    max_discussion_rounds=3   # Maximum rounds of discussion between models\n)\n\n# Access the final consensus annotations from the dictionary\nfinal_annotations = consensus_results[\"consensus\"]\n\n# Add consensus annotations to your AnnData object\nadata.obs['consensus_cell_type'] = adata.obs['leiden'].astype(str).map(final_annotations)\n\n# Add uncertainty metrics to your AnnData object\nadata.obs['consensus_proportion'] = adata.obs['leiden'].astype(str).map(consensus_results[\"consensus_proportion\"])\nadata.obs['entropy'] = adata.obs['leiden'].astype(str).map(consensus_results[\"entropy\"])\n\n# IMPORTANT: Ensure UMAP coordinates are calculated before visualization\n# If UMAP coordinates are not available in your AnnData object, compute them:\nif 'X_umap' not in adata.obsm:\n    print(\"Computing UMAP coordinates...\")\n    # Make sure neighbors are computed first\n    if 'neighbors' not in adata.uns:\n        sc.pp.neighbors(adata, n_neighbors=10, n_pcs=30)\n    sc.tl.umap(adata)\n    print(\"UMAP coordinates computed\")\n\n# Visualize results with enhanced aesthetics\n# Basic visualization\nsc.pl.umap(adata, color='consensus_cell_type', legend_loc='right', frameon=True, title='mLLMCelltype Consensus Annotations')\n\n# More customized visualization\nimport matplotlib.pyplot 
as plt\n\n# Set figure size and style\nplt.rcParams['figure.figsize'] = (10, 8)\nplt.rcParams['font.size'] = 12\n\n# Create a more publication-ready UMAP\nfig, ax = plt.subplots(1, 1, figsize=(12, 10))\nsc.pl.umap(adata, color='consensus_cell_type', legend_loc='on data', \n         frameon=True, title='mLLMCelltype Consensus Annotations',\n         palette='tab20', size=50, legend_fontsize=12, \n         legend_fontoutline=2, ax=ax)\n\n# Visualize uncertainty metrics\n# show=False keeps both panels on the same figure until plt.show() is called\nfig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 7))\nsc.pl.umap(adata, color='consensus_proportion', ax=ax1, title='Consensus Proportion',\n         cmap='viridis', vmin=0, vmax=1, size=30, show=False)\nsc.pl.umap(adata, color='entropy', ax=ax2, title='Annotation Uncertainty (Shannon Entropy)',\n         cmap='magma', vmin=0, size=30, show=False)\nplt.tight_layout()\nplt.show()\n```\n\n### R\n\n```r\n# Load required packages\nlibrary(mLLMCelltype)\nlibrary(Seurat)\nlibrary(dplyr)\nlibrary(ggplot2)\nlibrary(cowplot) # Added for plot_grid\n\n# Load your preprocessed Seurat object\npbmc \u003c- readRDS(\"your_seurat_object.rds\")\n\n# If starting with raw data, perform preprocessing steps\n# pbmc \u003c- NormalizeData(pbmc)\n# pbmc \u003c- FindVariableFeatures(pbmc, selection.method = \"vst\", nfeatures = 2000)\n# pbmc \u003c- ScaleData(pbmc)\n# pbmc \u003c- RunPCA(pbmc)\n# pbmc \u003c- FindNeighbors(pbmc, dims = 1:10)\n# pbmc \u003c- FindClusters(pbmc, resolution = 0.5)\n# pbmc \u003c- RunUMAP(pbmc, dims = 1:10)\n\n# Find marker genes for each cluster\npbmc_markers \u003c- FindAllMarkers(pbmc,\n                            only.pos = TRUE,\n                            min.pct = 0.25,\n                            logfc.threshold = 0.25)\n\n# Set up cache directory to speed up processing\ncache_dir \u003c- \"./mllmcelltype_cache\"\ndir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)\n\n# Choose a model from any supported provider\n# Supported models include:\n# - OpenAI: 'gpt-4o', 'gpt-4o-mini', 'gpt-4.1', 'gpt-4.1-mini', 
'gpt-4.1-nano', 'gpt-4-turbo', 'gpt-3.5-turbo', 'o1', 'o1-mini', 'o1-preview', 'o1-pro'\n# - Anthropic: 'claude-3-7-sonnet-20250219', 'claude-3-5-sonnet-latest', 'claude-3-5-haiku-latest', 'claude-3-opus'\n# - DeepSeek: 'deepseek-chat', 'deepseek-reasoner'\n# - Google: 'gemini-2.5-pro', 'gemini-2.0-flash', 'gemini-2.0-flash-exp', 'gemini-1.5-pro', 'gemini-1.5-flash'\n# - Qwen: 'qwen-max-2025-01-25'\n# - Stepfun: 'step-2-mini', 'step-2-16k', 'step-1-8k'\n# - Zhipu: 'glm-4-plus', 'glm-3-turbo'\n# - MiniMax: 'minimax-text-01'\n# - Grok: 'grok-3', 'grok-3-latest', 'grok-3-fast', 'grok-3-fast-latest', 'grok-3-mini', 'grok-3-mini-latest', 'grok-3-mini-fast', 'grok-3-mini-fast-latest'\n# - OpenRouter: Access to models from multiple providers through a single API. Format: 'provider/model-name'\n#   - OpenAI models: 'openai/gpt-4o', 'openai/gpt-4o-mini', 'openai/gpt-4-turbo', 'openai/gpt-4', 'openai/gpt-3.5-turbo'\n#   - Anthropic models: 'anthropic/claude-3-7-sonnet-20250219', 'anthropic/claude-3-5-sonnet-latest', 'anthropic/claude-3-5-haiku-latest', 'anthropic/claude-3-opus'\n#   - Meta models: 'meta-llama/llama-3-70b-instruct', 'meta-llama/llama-3-8b-instruct', 'meta-llama/llama-2-70b-chat'\n#   - Google models: 'google/gemini-2.5-pro-preview-03-25', 'google/gemini-1.5-pro-latest', 'google/gemini-1.5-flash'\n#   - Mistral models: 'mistralai/mistral-large', 'mistralai/mistral-medium', 'mistralai/mistral-small'\n#   - Other models: 'microsoft/mai-ds-r1', 'perplexity/sonar-small-chat', 'cohere/command-r', 'deepseek/deepseek-chat', 'thudm/glm-z1-32b'\n\n# Run LLMCelltype annotation with multiple LLM models\nconsensus_results \u003c- interactive_consensus_annotation(\n  input = pbmc_markers,\n  tissue_name = \"human PBMC\",  # provide tissue context\n  models = c(\n    \"claude-3-7-sonnet-20250219\",  # Anthropic\n    \"gpt-4o\",                   # OpenAI\n    \"gemini-1.5-pro\",           # Google\n    \"qwen-max-2025-01-25\"       # Alibaba\n  ),\n  api_keys = list(\n    
anthropic = \"your-anthropic-key\",\n    openai = \"your-openai-key\",\n    gemini = \"your-google-key\",\n    qwen = \"your-qwen-key\"\n  ),\n  top_gene_count = 10,\n  controversy_threshold = 1.0,\n  entropy_threshold = 1.0,\n  cache_dir = cache_dir\n)\n\n# Print structure of results to understand the data\nprint(\"Available fields in consensus_results:\")\nprint(names(consensus_results))\n\n# Add annotations to Seurat object\n# Get cell type annotations from consensus_results$final_annotations\ncluster_to_celltype_map \u003c- consensus_results$final_annotations\n\n# Create new cell type identifier column\ncell_types \u003c- as.character(Idents(pbmc))\nfor (cluster_id in names(cluster_to_celltype_map)) {\n  cell_types[cell_types == cluster_id] \u003c- cluster_to_celltype_map[[cluster_id]]\n}\n\n# Add cell type annotations to Seurat object\npbmc$cell_type \u003c- cell_types\n\n# Add uncertainty metrics\n# Extract detailed consensus results containing metrics\nconsensus_details \u003c- consensus_results$initial_results$consensus_results\n\n# Create a data frame with metrics for each cluster\nuncertainty_metrics \u003c- data.frame(\n  cluster_id = names(consensus_details),\n  consensus_proportion = sapply(consensus_details, function(res) res$consensus_proportion),\n  entropy = sapply(consensus_details, function(res) res$entropy)\n)\n\n# Add uncertainty metrics for each cell\n# Cluster identity of every cell, used to look up the per-cluster metrics\ncurrent_clusters \u003c- as.character(Idents(pbmc))\npbmc$consensus_proportion \u003c- uncertainty_metrics$consensus_proportion[match(current_clusters, uncertainty_metrics$cluster_id)]\npbmc$entropy \u003c- uncertainty_metrics$entropy[match(current_clusters, uncertainty_metrics$cluster_id)]\n\n# Save results for future use\nsaveRDS(consensus_results, \"pbmc_mLLMCelltype_results.rds\")\nsaveRDS(pbmc, \"pbmc_annotated.rds\")\n\n# Visualize results with SCpubr for publication-ready plots\nif (!requireNamespace(\"SCpubr\", quietly = TRUE)) {\n  remotes::install_github(\"enblacar/SCpubr\")\n}\nlibrary(SCpubr)\nlibrary(viridis)  # For color palettes\n\n# 
Basic UMAP visualization with default settings\npdf(\"pbmc_basic_annotations.pdf\", width=8, height=6)\nSCpubr::do_DimPlot(sample = pbmc,\n                  group.by = \"cell_type\",\n                  label = TRUE,\n                  legend.position = \"right\") +\n  ggtitle(\"mLLMCelltype Consensus Annotations\")\ndev.off()\n\n# More customized visualization with enhanced styling\npdf(\"pbmc_custom_annotations.pdf\", width=8, height=6)\nSCpubr::do_DimPlot(sample = pbmc,\n                  group.by = \"cell_type\",\n                  label = TRUE,\n                  label.box = TRUE,\n                  legend.position = \"right\",\n                  pt.size = 1.0,\n                  border.size = 1,\n                  font.size = 12) +\n  ggtitle(\"mLLMCelltype Consensus Annotations\") +\n  theme(plot.title = element_text(hjust = 0.5))\ndev.off()\n\n# Visualize uncertainty metrics with enhanced SCpubr plots\n# Get cell types and create a named color palette\ncell_types \u003c- unique(pbmc$cell_type)\ncolor_palette \u003c- viridis::viridis(length(cell_types))\nnames(color_palette) \u003c- cell_types\n\n# Cell type annotations with SCpubr\np1 \u003c- SCpubr::do_DimPlot(sample = pbmc,\n                  group.by = \"cell_type\",\n                  label = TRUE,\n                  legend.position = \"bottom\",  # Place legend at the bottom\n                  pt.size = 1.0,\n                  label.size = 4,  # Smaller label font size\n                  label.box = TRUE,  # Add background box to labels for better readability\n                  repel = TRUE,  # Make labels repel each other to avoid overlap\n                  colors.use = color_palette,\n                  plot.title = \"Cell Type\") +\n      theme(plot.title = element_text(hjust = 0.5, margin = margin(b = 15, t = 10)),\n            legend.text = element_text(size = 8),\n            legend.key.size = unit(0.3, \"cm\"),\n            plot.margin = unit(c(0.8, 0.8, 0.8, 0.8), \"cm\"))\n\n# Consensus 
proportion feature plot with SCpubr\np2 \u003c- SCpubr::do_FeaturePlot(sample = pbmc,\n                       features = \"consensus_proportion\",\n                       order = TRUE,\n                       pt.size = 1.0,\n                       enforce_symmetry = FALSE,\n                       legend.title = \"Consensus\",\n                       plot.title = \"Consensus Proportion\",\n                       sequential.palette = \"YlGnBu\",  # Yellow-Green-Blue gradient, following Nature Methods standards\n                       sequential.direction = 1,  # Light to dark direction\n                       min.cutoff = min(pbmc$consensus_proportion),  # Set minimum value\n                       max.cutoff = max(pbmc$consensus_proportion),  # Set maximum value\n                       na.value = \"lightgrey\") +  # Color for missing values\n      theme(plot.title = element_text(hjust = 0.5, margin = margin(b = 15, t = 10)),\n            plot.margin = unit(c(0.8, 0.8, 0.8, 0.8), \"cm\"))\n\n# Shannon entropy feature plot with SCpubr\np3 \u003c- SCpubr::do_FeaturePlot(sample = pbmc,\n                       features = \"entropy\",\n                       order = TRUE,\n                       pt.size = 1.0,\n                       enforce_symmetry = FALSE,\n                       legend.title = \"Entropy\",\n                       plot.title = \"Shannon Entropy\",\n                       sequential.palette = \"OrRd\",  # Orange-Red gradient, following Nature Methods standards\n                       sequential.direction = -1,  # Dark to light direction (reversed)\n                       min.cutoff = min(pbmc$entropy),  # Set minimum value\n                       max.cutoff = max(pbmc$entropy),  # Set maximum value\n                       na.value = \"lightgrey\") +  # Color for missing values\n      theme(plot.title = element_text(hjust = 0.5, margin = margin(b = 15, t = 10)),\n            plot.margin = unit(c(0.8, 0.8, 0.8, 0.8), \"cm\"))\n\n# Combine plots with equal 
widths\npdf(\"pbmc_uncertainty_metrics.pdf\", width=18, height=7)\ncombined_plot \u003c- cowplot::plot_grid(p1, p2, p3, ncol = 3, rel_widths = c(1.2, 1.2, 1.2))\nprint(combined_plot)\ndev.off()\n```\n\n### Using a Single LLM Model\n\nIf you only want to use a single LLM model instead of the consensus approach, use the `annotate_cell_types()` function. This is useful when you have access to only one API key or prefer a specific model:\n\n```r\n# Load required packages\nlibrary(mLLMCelltype)\nlibrary(Seurat)\n\n# Load your preprocessed Seurat object\npbmc \u003c- readRDS(\"your_seurat_object.rds\")\n\n# Find marker genes for each cluster\npbmc_markers \u003c- FindAllMarkers(pbmc,\n                            only.pos = TRUE,\n                            min.pct = 0.25,\n                            logfc.threshold = 0.25)\n\n# Choose a model from any supported provider\n# Supported models include:\n# - OpenAI: 'gpt-4o', 'gpt-4o-mini', 'gpt-4.1', 'gpt-4.1-mini', 'gpt-4.1-nano', 'gpt-4-turbo', 'gpt-3.5-turbo', 'o1', 'o1-mini', 'o1-preview', 'o1-pro'\n# - Anthropic: 'claude-3-7-sonnet-20250219', 'claude-3-5-sonnet-latest', 'claude-3-5-haiku-latest', 'claude-3-opus'\n# - DeepSeek: 'deepseek-chat', 'deepseek-reasoner'\n# - Google: 'gemini-2.5-pro', 'gemini-2.0-flash', 'gemini-2.0-flash-exp', 'gemini-1.5-pro', 'gemini-1.5-flash'\n# - Qwen: 'qwen-max-2025-01-25'\n# - Stepfun: 'step-2-mini', 'step-2-16k', 'step-1-8k'\n# - Zhipu: 'glm-4-plus', 'glm-3-turbo'\n# - MiniMax: 'minimax-text-01'\n# - Grok: 'grok-3', 'grok-3-latest', 'grok-3-fast', 'grok-3-fast-latest', 'grok-3-mini', 'grok-3-mini-latest', 'grok-3-mini-fast', 'grok-3-mini-fast-latest'\n# - OpenRouter: Access to models from multiple providers through a single API. 
Format: 'provider/model-name'\n#   - OpenAI models: 'openai/gpt-4o', 'openai/gpt-4o-mini', 'openai/gpt-4-turbo', 'openai/gpt-4', 'openai/gpt-3.5-turbo'\n#   - Anthropic models: 'anthropic/claude-3-7-sonnet-20250219', 'anthropic/claude-3-5-sonnet-latest', 'anthropic/claude-3-5-haiku-latest', 'anthropic/claude-3-opus'\n#   - Meta models: 'meta-llama/llama-3-70b-instruct', 'meta-llama/llama-3-8b-instruct', 'meta-llama/llama-2-70b-chat'\n#   - Google models: 'google/gemini-2.5-pro-preview-03-25', 'google/gemini-1.5-pro-latest', 'google/gemini-1.5-flash'\n#   - Mistral models: 'mistralai/mistral-large', 'mistralai/mistral-medium', 'mistralai/mistral-small'\n#   - Other models: 'microsoft/mai-ds-r1', 'perplexity/sonar-small-chat', 'cohere/command-r', 'deepseek/deepseek-chat', 'thudm/glm-z1-32b'\n\n# Run cell type annotation with a single LLM model\nsingle_model_results \u003c- annotate_cell_types(\n  input = pbmc_markers,\n  tissue_name = \"human PBMC\",  # provide tissue context\n  model = \"claude-3-7-sonnet-20250219\",  # specify a single model\n  api_key = \"your-anthropic-key\",  # provide the API key directly\n  top_gene_count = 10\n)\n\n# Print the results\nprint(single_model_results)\n\n# Add annotations to Seurat object\n# single_model_results is a character vector with one annotation per cluster\npbmc$cell_type \u003c- plyr::mapvalues(\n  x = as.character(Idents(pbmc)),\n  from = as.character(0:(length(single_model_results)-1)),\n  to = single_model_results\n)\n\n# Visualize results (ggplot2 is not attached in this block, so ggtitle is namespaced)\nDimPlot(pbmc, group.by = \"cell_type\", label = TRUE) +\n  ggplot2::ggtitle(\"Cell Types Annotated by Single LLM Model\")\n```\n\n#### Comparing Different Models\n\nYou can also compare annotations from different models by running `annotate_cell_types()` multiple times with different models:\n\n```r\n# Define models to test\nmodels_to_test \u003c- c(\n  \"claude-3-7-sonnet-20250219\",  # Anthropic\n  \"gpt-4o\",                      # OpenAI\n  \"gemini-1.5-pro\",              # Google\n  
\"qwen-max-2025-01-25\"          # Alibaba\n)\n\n# API keys for different providers\napi_keys \u003c- list(\n  anthropic = \"your-anthropic-key\",\n  openai = \"your-openai-key\",\n  gemini = \"your-gemini-key\",\n  qwen = \"your-qwen-key\"\n)\n\n# Test each model and store results\nresults \u003c- list()\nfor (model in models_to_test) {\n  provider \u003c- get_provider(model)\n  api_key \u003c- api_keys[[provider]]\n  \n  # Run annotation\n  results[[model]] \u003c- annotate_cell_types(\n    input = pbmc_markers,\n    tissue_name = \"human PBMC\",\n    model = model,\n    api_key = api_key,\n    top_gene_count = 10\n  )\n  \n  # Add to Seurat object\n  column_name \u003c- paste0(\"cell_type_\", gsub(\"[^a-zA-Z0-9]\", \"_\", model))\n  pbmc[[column_name]] \u003c- plyr::mapvalues(\n    x = as.character(Idents(pbmc)),\n    from = as.character(0:(length(results[[model]])-1)),\n    to = results[[model]]\n  )\n}\n```\n\n## Visualization Example\n\nBelow is an example of publication-ready visualization created with mLLMCelltype and SCpubr, showing cell type annotations alongside uncertainty metrics (Consensus Proportion and Shannon Entropy):\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"images/mLLMCelltype_visualization.png\" alt=\"mLLMCelltype Visualization\" width=\"900\"/\u003e\n\u003c/div\u003e\n\n*Figure: Left panel shows cell type annotations on UMAP projection. Middle panel displays the consensus proportion using a yellow-green-blue gradient (deeper blue indicates stronger agreement among LLMs). 
Right panel shows Shannon entropy using an orange-red gradient (deeper red indicates lower uncertainty, lighter orange indicates higher uncertainty).*\n\n## Citation\n\nIf you use mLLMCelltype in your research, please cite:\n\n```bibtex\n@article{Yang2025.04.10.647852,\n  author = {Yang, Chen and Zhang, Xianyang and Chen, Jun},\n  title = {Large Language Model Consensus Substantially Improves the Cell Type Annotation Accuracy for scRNA-seq Data},\n  elocation-id = {2025.04.10.647852},\n  year = {2025},\n  doi = {10.1101/2025.04.10.647852},\n  publisher = {Cold Spring Harbor Laboratory},\n  URL = {https://www.biorxiv.org/content/early/2025/04/17/2025.04.10.647852},\n  journal = {bioRxiv}\n}\n```\n\nYou can also cite this in plain text format:\n\nYang, C., Zhang, X., \u0026 Chen, J. (2025). Large Language Model Consensus Substantially Improves the Cell Type Annotation Accuracy for scRNA-seq Data. *bioRxiv*. https://doi.org/10.1101/2025.04.10.647852\n\n## Contributing\n\nWe welcome and appreciate contributions from the community! There are many ways you can contribute to mLLMCelltype:\n\n### Reporting Issues\n\nIf you encounter any bugs, have feature requests, or have questions about using mLLMCelltype, please [open an issue](https://github.com/cafferychen777/mLLMCelltype/issues) on our GitHub repository. When reporting bugs, please include:\n\n- A clear description of the problem\n- Steps to reproduce the issue\n- Expected vs. actual behavior\n- Your operating system and package version information\n- Any relevant code snippets or error messages\n\n### Pull Requests\n\nWe encourage you to contribute code improvements or new features through pull requests:\n\n1. Fork the repository\n2. Create a new branch for your feature (`git checkout -b feature/amazing-feature`)\n3. Commit your changes (`git commit -m 'Add some amazing feature'`)\n4. Push to the branch (`git push origin feature/amazing-feature`)\n5. 
Open a Pull Request\n\n### Areas for Contribution\n\nHere are some areas where contributions would be particularly valuable:\n\n- Adding support for new LLM models\n- Improving documentation and examples\n- Optimizing performance\n- Adding new visualization options\n- Extending functionality for specialized cell types or tissues\n- Translations of documentation into different languages\n\n### Code Style\n\nPlease follow the existing code style in the repository. For R code, we generally follow the [tidyverse style guide](https://style.tidyverse.org/). For Python code, we follow [PEP 8](https://www.python.org/dev/peps/pep-0008/).\n\nThank you for helping improve mLLMCelltype!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcafferychen777%2FmLLMCelltype","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcafferychen777%2FmLLMCelltype","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcafferychen777%2FmLLMCelltype/lists"}