{"id":30111003,"url":"https://github.com/astrazeneca/cellatria","last_synced_at":"2025-08-10T05:06:22.629Z","repository":{"id":307509006,"uuid":"1028549924","full_name":"AstraZeneca/cellatria","owner":"AstraZeneca","description":"An Agentic AI Framework for Ingestion and Standardization of Single-Cell RNA-seq Data Analysis","archived":false,"fork":false,"pushed_at":"2025-07-31T15:26:43.000Z","size":9250,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-07-31T18:25:25.004Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AstraZeneca.png","metadata":{"files":{"readme":"README.md","changelog":"NEWS.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":"AUTHORS.md","dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-07-29T17:38:16.000Z","updated_at":"2025-07-31T15:26:47.000Z","dependencies_parsed_at":"2025-07-31T18:33:30.345Z","dependency_job_id":"de9d6afa-4cb5-41c0-9924-e6a9121c503d","html_url":"https://github.com/AstraZeneca/cellatria","commit_stats":null,"previous_names":["astrazeneca/cellatria"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/AstraZeneca/cellatria","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AstraZeneca%2Fcellatria","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AstraZeneca%2Fcellatria/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AstraZeneca%2Fcellatria/releases","manifests_url":"https://repos.ecosyste.ms/api/v
1/hosts/GitHub/repositories/AstraZeneca%2Fcellatria/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AstraZeneca","download_url":"https://codeload.github.com/AstraZeneca/cellatria/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AstraZeneca%2Fcellatria/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":269677801,"owners_count":24457876,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-10T02:00:08.965Z","response_time":71,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-08-10T05:06:22.004Z","updated_at":"2025-08-10T05:06:22.575Z","avatar_url":"https://github.com/AstraZeneca.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003ca href=\"#\"\u003e\u003cimg src=\"https://img.shields.io/badge/made%20with-Python-830051?style=flat\u0026logo=python\u0026logoColor=white\" alt=\"Made with Python\"/\u003e\u003c/a\u003e\n  \u003ca href=\"#\"\u003e\u003cimg src=\"https://img.shields.io/badge/container-Docker-830051?style=flat\u0026logo=docker\u0026logoColor=white\" alt=\"Docker\"/\u003e\u003c/a\u003e\n  \u003ca href=\"#\"\u003e\u003cimg src=\"https://img.shields.io/badge/platform-GitHub-830051?style=flat\u0026logo=github\u0026logoColor=white\" alt=\"Platform\"/\u003e\u003c/a\u003e\n  \u003ca 
href=\"https://github.com/langchain-ai/langgraph\"\u003e\u003cimg src=\"https://img.shields.io/badge/built%20with-LangGraph-830051?style=flat\u0026logo=python\u0026logoColor=white\" alt=\"LangGraph\"/\u003e\u003c/a\u003e\n  \u003ca href=\"#\"\u003e\u003cimg src=\"https://img.shields.io/badge/agentic-AI%20Agent-830051?style=flat\u0026logo=robotframework\u0026logoColor=white\" alt=\"Agentic\"/\u003e\u003c/a\u003e\n  \u003cbr\u003e\n  \u003ca href=\"#\"\u003e\u003cimg src=\"http://www.repostatus.org/badges/latest/active.svg\" alt=\"Project Status\"/\u003e\u003c/a\u003e\n  \u003ca href=\"#\"\u003e\u003cimg src=\"https://img.shields.io/badge/lifecycle-Stable-brightgreen.svg\" alt=\"Lifecycle\"/\u003e\u003c/a\u003e\n  \u003ca href=\"#\"\u003e\u003cimg src=\"https://img.shields.io/badge/docs-latest-brightgreen?style=flat\" alt=\"Docs\"/\u003e\u003c/a\u003e\n  \u003ca href=\"#\"\u003e\u003cimg src=\"https://img.shields.io/badge/contributions-welcome-brightgreen.svg?style=flat\" alt=\"Contributions welcome\"/\u003e\u003c/a\u003e\n  \u003cimg src=\"https://img.shields.io/badge/Maturity%20Level-ML--0-brightgreen\" alt=\"Maturity level-0\"/\u003e\n  \u003cbr\u003e\n  \u003ca href=\"https://github.com/AstraZeneca/cellatria/actions/workflows/docker.yml\"\u003e\u003cimg src=\"https://github.com/AstraZeneca/cellatria/actions/workflows/docker.yml/badge.svg?branch=main\" alt=\"cellatria image\"/\u003e\u003c/a\u003e\n  \u003cbr\u003e\n\u003c/p\u003e\n\n\u003c!-- Version Banner --\u003e\n\u003cp align=\"center\" width=\"100%\"\u003e  \n  \u003cimg width=\"15%\" src=\"https://img.shields.io/badge/release-v1.0.0-4444AA.svg?style=for-the-badge\" alt=\"Release v1.0.0\"/\u003e\n\u003c/p\u003e\n\n\u003c!-- CellAtria Infographic --\u003e\n\u003cp align=\"center\" width=\"100%\"\u003e\n  \u003cimg width=\"100%\" src=\"cellatria_git_logo.png\"\u003e \n\u003c/p\u003e\n\n---\n\n## ✨ Introduction\n\u003cdetails\u003e\n\u003cbr\u003e\n\n\n**CellAtria** is an agentic AI system that enables 
**full-lifecycle, document-to-analysis automation** in single-cell research. It integrates natural language interaction with a robust, graph-based, multi-actor execution framework. The system orchestrates diverse tasks, ranging from literature parsing and metadata extraction to dataset retrieval and downstream scRNA-seq analysis via the co-developed [**CellExpress**](#cellexpress) pipeline.\n\n\u003e Through its comprehensive interface, **CellAtria** empowers users to engage with a language model augmented by task-specific tools. This eliminates the need for manual command-line operations, accelerating data onboarding and the reuse of public single-cell resources.\n\n\u003cp align=\"center\" width=\"100%\"\u003e\n  \u003cimg width=\"55%\" src=\"cellatria_git_fig1.png\"\u003e \n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n  \u003csmall\u003e\u003cem\u003e\u003cstrong\u003eLanguage model-mediated orchestration of toolchains\u003c/strong\u003e. Upon receiving a user prompt, the CellAtria interface transfers the request to the LLM agent, which interprets intent and autonomously invokes relevant tools. 
Outputs are returned through the interface, completing a full cycle of context-aware execution.\u003c/em\u003e\u003c/small\u003e\n\u003c/p\u003e\n\n\n\u003c/details\u003e\n\n---\n\n## 💡 Key Features\n\u003cdetails\u003e\n\u003cbr\u003e\n\n- **Flexible Input**: Accepts primary research articles as **PDFs** or **URLs** for seamless integration.\n- **Automated Metadata Extraction**: Extracts structured metadata, including sample annotations, organism, tissue type, and GEO (Gene Expression Omnibus) accession identifiers.\n- **Intelligent Data Retrieval**: Resolves and organizes GEO datasets by accessing both **GSE (study-level)** and **GSM (sample-level)** records, ensuring structured and comprehensive data retrieval.\n- **Integrated Analysis Pipeline**: Orchestrates full pipeline configuration and launches [**CellExpress**](#cellexpress), a containerized framework for standardized scRNA-seq analysis, ensuring reproducible results.\n- **Enhanced User Control**: Enables metadata editing, secure file transfers, and direct file system management within the agent session.\n- **Modular \u0026 Reusable Architecture**: Composes all core actions into reusable, graph-based tools that serve as callable agent nodes, fostering extensibility.\n\n\u003e Additional details on the underlying toolkits can be found in the [toolkit reference](https://github.com/AstraZeneca/cellatria/blob/main/agent/toolkit.md)\n\n\u003c/details\u003e\n\n---\n\n## 🚀  Getting Started\n\u003cdetails\u003e\n\n### (1) Prerequisites\n\n- **Docker**: Install [Docker](https://docs.docker.com/get-docker/) and ensure the Docker daemon is running.\n- **Environment Configuration**: Provide a `.env` file with credentials and parameters (see [LLM Configuration](#env_setup) section below).\n\n---\n\n### (2) Docker Images\n\nThe **CellAtria** repository includes a GitHub Actions workflow that builds and publishes a preconfigured Docker image to the [GitHub Container 
Registry](https://github.com/AstraZeneca/cellatria/pkgs/container/cellatria).\n\nPull the latest **CellAtria** Docker image using:\n\n```bash\n# Run this command in your terminal\ndocker pull ghcr.io/astrazeneca/cellatria:v1.0.0\n```\n\n\u003e This image contains all dependencies needed to run the **CellAtria** agent in a consistent environment.\n\n---\n\n### (3)  Launching Agent\nStart the agent with the following command (replace paths with your actual directories):\n\n```bash\n# Run this command in your terminal\ndocker run -it --rm \\\n  -p 7860:7860 \\\n  -v /path/to/your/project/directory:/data \\\n  -v /path/to/your/env/directory:/envdir \\\n  ghcr.io/astrazeneca/cellatria:v1.0.0 cellatria \\\n  --env_path /envdir\n```\n\nCommand Breakdown:\n\n- `-p 7860:7860`: Exposes the agent user interface (UI) on port 7860.\n- `-v /path/to/your/project/directory:/data`: Mounts your project directory into the container.\n- `-v /path/to/your/env/directory:/envdir`: Mounts the directory containing your `.env` file for configuration (see [LLM Configuration](#env_setup) section below).\n- `ghcr.io/astrazeneca/cellatria:v1.0.0 cellatria`: Specifies the Docker image and the entrypoint command to launch the app inside the container.\n- `--env_path /envdir`: Tells the agent where to find the `.env` file for provider setup.\n\n\u003e macOS users with Apple Silicon (M1/M2): You may encounter a warning due to platform mismatch. To ensure compatibility, add `--platform=linux/amd64` when running the container (i.e., `docker run --platform=linux/amd64 -it --rm`). \n\nOnce launched, the agent will initialize and provide a local URL for interaction. Simply open the link printed in your terminal to begin using CellAtria through your browser.\n\n---\n\n**Mounting a Working Directory:**\n\nWhen running the container, any host directory you want the container to access must be explicitly mounted using Docker’s `-v` (volume) flag. 
The container can only see and interact with the directories you specify at runtime.\n\nFor example, the following flag:\n\n```bash\n-v /absolute/path/on/host:/data\n```\n\nmakes the contents of `/absolute/path/on/host` on your host machine available inside the container at `/data`.\n\n\u003e If you set a working directory inside the container (e.g., `my_project`), make sure to reference it using the container’s path, for instance `/data/my_project`. Attempting to access files or directories outside the mounted path from within the container will fail, as they are not visible to the container’s filesystem.\n\n\u003c/details\u003e\n\n---\n\n\u003ca name=\"env_setup\"\u003e\u003c/a\u003e\n## 🛠️ LLM Configuration\n\n\u003cdetails\u003e\n\n### Quick Start\n\nCellAtria requires a `.env` file to configure access to your chosen LLM provider. Download the template [`.env`](https://github.com/AstraZeneca/cellatria/blob/main/.env) and fill in the necessary credentials and parameters. Ensure the directory containing the `.env` file is mounted into the container.\n\n### Supported LLM Backends\n\n- `azure`: Azure OpenAI (enterprise-grade access to GPT models)\n- `openai`: Standard OpenAI API (e.g., GPT-4, GPT-3.5)\n- `anthropic`: Claude models via the Anthropic API\n- `google`: Gemini models via Google Cloud / Vertex AI\n- `local`: Offline models (e.g., Llama.cpp, Ollama, Hugging Face)\n\n\u003e Set the `PROVIDER` variable in your `.env` file to one of the supported values above. Only one provider can be active at a time. \n\n\u003e You only need to configure the block for the provider you're using. The rest can remain commented out.\n\n\u003c/details\u003e\n\n---\n\n\u003ca name=\"cellexpress\"\u003e\u003c/a\u003e\n## 🚂 CellExpress Engine\n\u003cdetails\u003e\n\u003cbr\u003e\n\n**CellExpress** is a companion pipeline embedded within the **CellAtria** framework. 
It delivers a reproducible and automated workflow for processing single-cell RNA-seq datasets (scRNA-seq) - from raw count matrices to comprehensive cell type annotations and report generation.\n\n\u003e Designed to lower bioinformatics barriers, **CellExpress** implements a comprehensive set of state-of-the-art, Scanpy-based processing stages, including quality control (performed globally or per sample), data transformation (including normalization, highly variable gene selection, and scaling), dimensionality reduction (UMAP and t-SNE), graph-based clustering, and marker gene identification. Additional tools are integrated to support advanced analysis tasks, including doublet detection, batch correction, and automated cell type annotation using both tissue-agnostic and tissue-specific models. All analytical steps are executed sequentially under centralized control, with parameters fully configurable via a comprehensive input schema. \n\n---\n\n### Run CellExpress in Standalone Mode\n\n**CellExpress** is a fully standalone pipeline for comprehensive scRNA-seq data analysis. 
It can be orchestrated either through an agentic system - as incorporated into the **CellAtria** framework - or via direct command-line execution.\n\nTo execute the CellExpress pipeline directly using Docker, use the following command:\n\n```bash\n# Run this command in your terminal\ndocker run -it --rm \\\n  -v /path/to/your/local/data:/data \\\n  ghcr.io/astrazeneca/cellatria:v1.0.0 cellexpress \\\n    --input /data \\\n    --project your_project_name \\\n    --species your_species \\\n    --tissue your_tissue \\\n    --disease your_disease \\\n    [--additional options...]\n```\n\nCommand Breakdown:\n\n- `-v /path/to/your/local/data:/data`: Mounts your project directory into the container.\n- `ghcr.io/astrazeneca/cellatria:v1.0.0 cellexpress`: Specifies the Docker image and the entrypoint command to launch **CellExpress** inside the container.\n- `[--additional options...]`: Additional arguments to configure the pipeline.\n\n\u003e macOS users with Apple Silicon (M1/M2): You may encounter a warning due to platform mismatch. To ensure compatibility, add `--platform=linux/amd64` when running the container (i.e., `docker run --platform=linux/amd64 -it --rm`). \n\nFor full details, usage instructions, and configuration options, refer to the [CellExpress README](https://github.com/AstraZeneca/cellatria/blob/main/cellexpress/README.md).\n\n\u003c/details\u003e\n\n---\n\n## 🛠️ Computing Environment\n\n\u003cdetails\u003e\n\u003cbr\u003e\n\nThe `Dockerfile` defines the dedicated computing environment for executing **CellAtria** and the co-developed **CellExpress** pipeline in a consistent and reproducible manner. \nIt includes all required Python and R dependencies, along with support for HTML reporting and visualization. \nBuilt on an Ubuntu-based system, the environment also provides essential system-level packages to support end-to-end \npipeline execution. 
\n\n\u003c/details\u003e\n\n---\n\n## 🧠 Usage Intuition\n\u003cdetails\u003e\n\u003cbr\u003e\n\nWhile **CellAtria** supports flexible, user-driven interactions, its functionality is governed by an underlying **execution narrative** — a structured flow of modular actions that define how tasks are interpreted, routed, and executed. Users may invoke any module independently; however, for optimal results and seamless orchestration, we recommend following the intended workflow trajectory below.\n\n**CellAtria's internal logic integrates the following key stages:**\n\n1.  **Document Parsing** - Extracts structured metadata from narrative-formatted scientific documents (article URL or PDF).\n2.  **Accession Resolution** - Identifies relevant GEO (Gene Expression Omnibus) accession IDs from parsed metadata.\n3.  **Dataset Retrieval** - Downloads datasets directly from public repositories.\n4.  **File \u0026 Data Organization** - Structures downloaded content into a consistent directory schema for analysis.\n5.  **Pipeline Configuration** - Prepares **CellExpress** arguments and environmental parameters for execution.\n6.  **CellExpress Execution** - Launches the standardized single-cell analysis pipeline in a detached mode. \n\n\u003e This modular, agent-guided framework allows users to begin at any point while preserving logical consistency across steps.\n\n\u003c/details\u003e\n\n---\n\n## 📖 Related Publication\n\u003cdetails\u003e\n\u003cbr\u003e\n\nIf you use this repository, please cite:\n\n\u003e Nima Nouri, et al. (2025). An Agentic AI Framework for Ingestion and Standardization of Single-Cell RNA-seq Data Analysis. *bioRxiv*. 
https://doi.org/10.1101/2025.07.31.667880\n\n```\n@article{nouri2025agentic,\n  title={An Agentic AI Framework for Ingestion and Standardization of Single-Cell RNA-seq Data Analysis},\n  author={Nouri, Nima and Artzi, Ronen and Savova, Virginia},\n  journal={bioRxiv},\n  year={2025},\n  publisher={Cold Spring Harbor Laboratory}\n}\n```\n\n\u003c/details\u003e\n\n---\n\n## 📬 Contact\n\u003cdetails\u003e\n\u003cbr\u003e\n\n| Role         | Name               | Contact                                     |\n|--------------|--------------------|---------------------------------------------|\n| Author/Maintainer   | Nima Nouri         | [nima.nouri@astrazeneca.com](mailto:nima.nouri@astrazeneca.com) | \n\n\u003c/details\u003e\n\n---","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fastrazeneca%2Fcellatria","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fastrazeneca%2Fcellatria","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fastrazeneca%2Fcellatria/lists"}