https://github.com/biomed-ai/sango

The official implementation for "SANGO".

bioinformatics cell-type-annotation cell-type-classification cell-type-identification sequence single-cell supervised-classification-methods


Host: GitHub
URL: https://github.com/biomed-ai/sango
Owner: biomed-AI
Created: 2023-09-25T03:17:17.000Z (almost 3 years ago)
Default Branch: main
Last Pushed: 2024-02-03T02:37:31.000Z (over 2 years ago)
Last Synced: 2024-05-06T00:02:48.486Z (about 2 years ago)
Topics: bioinformatics, cell-type-annotation, cell-type-classification, cell-type-identification, sequence, single-cell, supervised-classification-methods
Language: Jupyter Notebook
Homepage:
Size: 18.2 MB
Stars: 6
Watchers: 2
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: Readme.md


![](figures/model.png)

We propose SANGO, a novel method for accurate single-cell annotation that integrates the genome sequences around accessibility peaks in scATAC-seq data.



# SANGO

The official implementation for "**SANGO**".

**Table of Contents**

* [Datasets](#Datasets)

* [Installation](#Installation)

* [Usage](#Usage)

* [Tutorial](#Tutorial)

* [Citation](#Citation)

## Datasets

The datasets used in this work are available on [Synapse](https://www.synapse.org/#!Synapse:syn52559388/files/).

## Installation

To reproduce **SANGO**, we suggest first creating a conda environment:

~~~shell

conda create -n SANGO python=3.8

conda activate SANGO

~~~

Then install the required packages:

~~~shell

pip install -r requirements.txt

~~~

Finally, install [PyG](https://pytorch-geometric.readthedocs.io/en/latest/install/installation.html) for your CUDA version. For example, with torch-1.13.1+cu117 (Ubuntu 20.04.4 LTS):

~~~shell

pip install torch_geometric

pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-1.13.1+cu117.html

~~~
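After installation, a quick check can confirm that the environment is complete. The following is a minimal sketch, not part of SANGO: the helper name `check_packages` is hypothetical, and the module list assumes the usual import names of PyTorch and the PyG packages installed above. It only tests whether each module can be located, without importing it.

```python
import importlib.util
from typing import Dict, Iterable

# Assumed import names for PyTorch and the PyG packages installed above.
PYG_MODULES = (
    "torch",
    "torch_geometric",
    "pyg_lib",
    "torch_scatter",
    "torch_sparse",
    "torch_cluster",
    "torch_spline_conv",
)


def check_packages(names: Iterable[str] = PYG_MODULES) -> Dict[str, bool]:
    """Map each module name to True if it can be found in this environment."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_packages().items():
        print(f"{name:20s} {'ok' if found else 'MISSING'}")
```

Any module reported as `MISSING` usually means the corresponding `pip install` step did not match the installed torch/CUDA combination.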

## Usage

### Data preprocessing

To run **SANGO**, first create an AnnData object (saved as an h5ad file) from the raw data.

The h5ad file should have cells as `obs` and peaks as `var`. `var` needs at least three columns, `chr`, `start`, and `end`, which give the genomic region of each peak. `obs` needs two columns: `Batch`, which distinguishes reference data from query data, and `CellType` (reference data), which holds the true label of each cell.

Note that we filter out peaks accessible in fewer than 1% of cells for optimal performance.
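The requirements above can be sketched as two small checks. This is an illustrative sketch under stated assumptions, not the preprocessing code from the tutorials: the helper names `missing_columns` and `filter_rare_peaks` are hypothetical, the count matrix is a dense NumPy array for simplicity (real scATAC-seq matrices are typically sparse), and a peak counts as "accessible" in a cell when its count is greater than zero.

```python
import numpy as np
import pandas as pd

REQUIRED_VAR = ("chr", "start", "end")  # per-peak genomic coordinates
REQUIRED_OBS = ("Batch", "CellType")    # per-cell batch and true label


def missing_columns(obs: pd.DataFrame, var: pd.DataFrame) -> list:
    """List the required columns that are absent from obs or var."""
    missing = [f"obs['{c}']" for c in REQUIRED_OBS if c not in obs.columns]
    missing += [f"var['{c}']" for c in REQUIRED_VAR if c not in var.columns]
    return missing


def filter_rare_peaks(counts: np.ndarray, var: pd.DataFrame, min_frac: float = 0.01):
    """Drop peaks (columns) accessible in fewer than min_frac of cells."""
    frac_accessible = (counts > 0).mean(axis=0)  # fraction of cells per peak
    keep = frac_accessible >= min_frac
    return counts[:, keep], var.loc[keep]
```

The filtered matrix, `obs`, and `var` could then be wrapped with `anndata.AnnData(X=..., obs=..., var=...)` and written to an h5ad file.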

### Stage 1: embeddings extraction

The preprocessed data are fed to CACNN together with a reference genome to extract embeddings that incorporate sequence information:

~~~shell
# Stage 1: embeddings extraction
#   -i  input data (after data preprocessing)
#   -g  reference genome
#   -o  output path
cd SANGO/CACNN
python main.py -i ../../preprocessed_data/reference_query_example.h5ad \
               -g mm9 \
               -o ../../output/reference_query_example
~~~

Running the above command will generate three output files in the output path:

* `CACNN_train.log`: recording logs during training

* `CACNN_best_model.pt`: storing the model weights with the best AUC score during training

* `CACNN_output.h5ad`: an anndata file storing the embedding extracted by CACNN.
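For scripted runs, Stage 1 can also be launched from Python. The sketch below is a hypothetical wrapper, not part of SANGO: it uses only the `-i`, `-g`, and `-o` flags shown above, assumes the repository layout implied by `cd SANGO/CACNN`, and resolves paths to absolute form so they stay valid after changing the working directory.

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def build_stage1_command(h5ad: str, genome: str, out_dir: str) -> List[str]:
    """Assemble the CACNN command from the -i/-g/-o flags shown above."""
    return [sys.executable, "main.py", "-i", h5ad, "-g", genome, "-o", out_dir]


def run_stage1(h5ad: str, genome: str, out_dir: str,
               cacnn_dir: str = "SANGO/CACNN") -> Path:
    """Run Stage 1 inside cacnn_dir and return the expected embedding file."""
    h5ad_abs = str(Path(h5ad).resolve())
    out_abs = Path(out_dir).resolve()
    cmd = build_stage1_command(h5ad_abs, genome, str(out_abs))
    subprocess.run(cmd, cwd=cacnn_dir, check=True)  # raises on a non-zero exit
    return out_abs / "CACNN_output.h5ad"
```

The returned path points at `CACNN_output.h5ad`, which is the input to Stage 2.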

### Stage 2: cell type prediction

~~~shell
# Stage 2: cell type prediction
#   --data_dir  input data (the CACNN output from Stage 1)
cd ../GraphTransformer
python main.py  --data_dir ../../output/reference_query_example/CACNN_output.h5ad \
                --train_name_list reference --test_name query \
                --save_path ../../output \
                --save_name reference_query_example
~~~

Running the above command will generate the following output files in the output path:

* `model.pkl`: storing the model weights with the best valid loss during training.

* `embedding.h5ad`: an anndata file storing the embedding extracted by GraphTransformer. The predicted labels are saved in `.obs['Pred']`.
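When the query cells also carry true labels, the predictions can be scored directly from `obs`. This is a minimal sketch under assumptions that the source does not confirm: that `embedding.h5ad` keeps the `Batch` and `CellType` columns next to `Pred`, and that query cells are marked `query` in `Batch` (matching `--test_name`). The helper name `query_accuracy` is hypothetical.

```python
import pandas as pd


def query_accuracy(obs: pd.DataFrame, query_batch: str = "query") -> float:
    """Fraction of query cells whose `Pred` equals `CellType`."""
    query = obs[obs["Batch"] == query_batch]
    if query.empty:
        raise ValueError(f"no cells with Batch == {query_batch!r}")
    # Compare as strings so categorical and plain columns behave the same.
    matches = query["Pred"].astype(str) == query["CellType"].astype(str)
    return float(matches.mean())
```

With real output, `obs` would come from `anndata.read_h5ad("embedding.h5ad").obs`.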

## Tutorial

### Tutorial 1: Cell annotations within samples (LargeIntestineB_LargeIntestineA)

1. Install the required environment according to [Installation](#Installation).

2. Create a `data` folder in the same directory as the 'SANGO' folder and download datasets from [LargeIntestineA_LargeIntestineB.h5ad](https://www.synapse.org/#!Synapse:syn52559388/files/).

3. Create a folder `genome` in the ./SANGO/CACNN/ directory and download [mm9.fa.h5](https://www.synapse.org/#!Synapse:syn52559388/files/).

4. For details on data preprocessing and training, run the tutorial notebook [LargeIntestineB_LargeIntestineA.ipynb](LargeIntestineB_LargeIntestineA.ipynb).
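The folder layout from steps 2 and 3 can be created with a few lines of Python. This is a sketch, assuming `root` is the directory that contains the `SANGO` folder; the helper name `prepare_layout` is hypothetical, and the dataset and genome files still have to be downloaded from Synapse manually.

```python
from pathlib import Path
from typing import Dict


def prepare_layout(root: str = ".") -> Dict[str, Path]:
    """Create the data and genome folders described in the tutorial steps."""
    base = Path(root)
    data_dir = base / "data"                          # sibling of the SANGO folder
    genome_dir = base / "SANGO" / "CACNN" / "genome"  # holds e.g. mm9.fa.h5
    for folder in (data_dir, genome_dir):
        folder.mkdir(parents=True, exist_ok=True)     # no error if already present
    return {"data": data_dir, "genome": genome_dir}
```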

### Tutorial 2: Cell annotations on datasets across platforms (MosP1_Cerebellum)

1. Install the required environment according to [Installation](#Installation).

2. Create a `data` folder in the same directory as the 'SANGO' folder and download datasets from [MosP1_Cerebellum.h5ad](https://www.synapse.org/#!Synapse:syn52559388/files/).

3. Create a folder `genome` in the ./SANGO/CACNN/ directory and download [mm10.fa.h5](https://www.synapse.org/#!Synapse:syn52559388/files/).

4. For details on data preprocessing and training, run the tutorial notebook [MosP1_Cerebellum.ipynb](MosP1_Cerebellum.ipynb).

### Tutorial 3: Cell annotations on datasets across tissues (BoneMarrowB_Liver)

1. Install the required environment according to [Installation](#Installation).

2. Create a `data` folder in the same directory as the 'SANGO' folder and download datasets from [BoneMarrowB_Liver.h5ad](https://www.synapse.org/#!Synapse:syn52559388/files/).

3. Create a folder `genome` in the ./SANGO/CACNN/ directory and download [mm9.fa.h5](https://www.synapse.org/#!Synapse:syn52559388/files/).

4. For details on data preprocessing and training, run the tutorial notebook [BoneMarrowB_Liver.ipynb](BoneMarrowB_Liver.ipynb).

### Tutorial 4: Multi-level cell type annotation and unknown cell type identification

1. Install the required environment according to [Installation](#Installation).

2. Create a `data` folder in the same directory as the 'SANGO' folder and download datasets from [BCC_TIL_atlas.h5ad, BCC_samples.zip, HHLA_atlas.h5ad](https://www.synapse.org/#!Synapse:syn52559388/files/).

3. Create a `genome` folder in the same directory as the 'SANGO' folder and download [GRCh38.primary_assembly.genome.fa.h5](https://www.synapse.org/#!Synapse:syn52559388/files/).

4. For details on data preprocessing and training, run the tutorial notebook [tumor_example.ipynb](tumor_example.ipynb).

## Citation

If you find our code useful, please consider citing our work:

~~~bibtex
@article{zengSANGO,
  title={Deciphering Cell Types by Integrating scATAC-seq Data with Genome Sequences},
  author={Yuansong Zeng and Mai Luo and Ningyuan Shangguan and Peiyu Shi and Junxi Feng and Jin Xu and Weijiang Yu and Yuedong Yang},
  journal={},
  year={2023},
}
~~~
