https://github.com/biocpy/scranpy

Single-cell data analysis, from C++ to Python
https://github.com/biocpy/scranpy

Last synced: 6 months ago
JSON representation

Single-cell data analysis, from C++ to Python

Host: GitHub
URL: https://github.com/biocpy/scranpy
Owner: BiocPy
License: mit
Created: 2023-06-28T21:53:51.000Z (almost 2 years ago)
Default Branch: master
Last Pushed: 2024-02-05T17:02:38.000Z (over 1 year ago)
Last Synced: 2024-07-07T18:23:02.599Z (10 months ago)
Language: Python
Homepage: https://biocpy.github.io/scranpy/
Size: 2.43 MB
Stars: 5
Watchers: 2
Forks: 0
Open Issues: 4
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE.txt
- Authors: AUTHORS.md

Awesome Lists containing this project

README

        

[![Project generated with PyScaffold](https://img.shields.io/badge/-PyScaffold-005CA0?logo=pyscaffold)](https://pyscaffold.org/)

[![PyPI-Server](https://img.shields.io/pypi/v/scranpy.svg)](https://pypi.org/project/scranpy/)

[![Downloads](https://static.pepy.tech/badge/scranpy/month)](https://pepy.tech/project/scranpy)

![Unit tests](https://github.com/BiocPy/scranpy/actions/workflows/pypi-test.yml/badge.svg)

# scran, in Python

## Overview

The **scranpy** package provides Python bindings to the single-cell analysis methods in [**libscran**](https://github.com/LTLA/libscran) and related C++ libraries.

It performs the standard steps in a typical single-cell analysis including quality control, normalization, feature selection, dimensionality reduction, clustering and marker detection.

**scranpy** makes heavy use of the [BiocPy](https://github.com/BiocPy) data structures in its user interface,

while it uses the [**mattress**](https://pypi.org/project/mattress) package to provide a C++ representation of the underlying matrix data.

This package is effectively a mirror of its counterparts in Javascript ([**scran.js**](https://npmjs.com/package/scran.js)) and R ([**scran.chan**](https://github.com/LTLA/scran.chan)),

which are based on the same underlying C++ libraries and concepts.

## Quick start

Let's load in the famous PBMC 4k dataset from 10X Genomics (available [here](https://github.com/kanaverse/random-test-files/releases/tag/10x-pbmc-v1.0.0)):

```python

import singlecellexperiment

sce = singlecellexperiment.read_tenx_h5("pbmc4k-tenx.h5")

```

Then we just need to call one of **scranpy**'s `analyze()` functions.

(We do have to tell it what the mitochondrial genes are, though.)

```python

import scranpy

options = scranpy.AnalyzeOptions()

options.per_cell_rna_qc_metrics_options.subsets = {

    "mito": scranpy.guess_mito_from_symbols(sce.row_data["name"], "mt-")

}

results = scranpy.analyze_sce(sce, options=options)

```

This will perform all of the usual steps for a routine single-cell analysis, 

as described in Bioconductor's [Orchestrating single cell analysis](https://bioconductor.org/books/OSCA) book.

It returns an object containing clusters, t-SNEs, UMAPs, marker genes, and so on:

```python

results.clusters

results.tsne

results.umap

results.rna_markers

```

We won't go over the theory here as it's explained more thoroughly in the book.

Check out the [reference documentation](https://biocpy.github.io/scranpy) for more details.

## Multiple batches

To demonstrate, let's grab two batches of PBMC datasets from 10X Genomics (again, available [here](https://github.com/kanaverse/random-test-files/releases/tag/10x-pbmc-v1.0.0)):

```python

import singlecellexperiment

sce3k = singlecellexperiment.read_tenx_h5("pbmc3k-tenx.h5")

sce4k = singlecellexperiment.read_tenx_h5("pbmc4k-tenx.h5")

```

They don't have the same features, so we'll just take the intersection of their Ensembl IDs before combining them:

```python

import biocutils

common = biocutils.intersect(sce3k.row_data["id"], sce4k.row_data["id"])

sce3k_common = sce3k[biocutils.match(common, sce3k.row_data["id"]), :]

sce4k_common = sce4k[biocutils.match(common, sce4k.row_data["id"]), :]

import scipy.sparse

combined = scipy.sparse.hstack((sce3k_common.assay(0), sce4k_common.assay(0)))

batch = ["3k"] * sce3k_common.shape[1] + ["4k"] * sce4k_common.shape[1]

```

We can now perform a batch-aware analysis:

```python

import scranpy

options = scranpy.AnalyzeOptions()

options.per_cell_rna_qc_metrics_options.subsets = {

    "mito": scranpy.guess_mito_from_symbols(sce3k_common.row_data["name"], "mt-")

}

options.miscellaneous_options.block = batch

results = scranpy.analyze(combined, options=options)

```

This yields mostly the same set of results as before, but with an extra MNN-corrected embedding for clustering, visualization, etc.

```python

results.mnn

```

## Multiple modalities

Let's grab a 10X Genomics immune profiling dataset (see [here](https://github.com/kanaverse/random-test-files/releases/download/10x-immune-v1.0.0/immune_3.0.0-tenx.h5)):

```python

import singlecellexperiment

sce = singlecellexperiment.read_tenx_h5("immune_3.0.0-tenx.h5")

```

We need to split it to genes and ADTs:

```python

is_gene = [x == "Gene Expression" for x in sce.row_data["feature_type"]]

gene_data = sce[is_gene,:]

is_adt = [x == "Antibody Capture" for x in sce.row_data["feature_type"]]

adt_data = sce[is_adt,:]

```

And now we can run the analysis:

```python

import scranpy

options = scranpy.AnalyzeOptions()

options.per_cell_rna_qc_metrics_options.subsets = {

    "mito": scranpy.guess_mito_from_symbols(gene_data.row_data["name"], "mt-")

}

options.per_cell_adt_qc_metrics_options.subsets = {

    "igg": [n.lower().startswith("igg") for n in adt_data.row_data["name"]]

}

results = scranpy.analyze_se(gene_data, adt_se = adt_data, options=options)

```

This returns ADT-specific results in the relevant fields, as well as a set of combined PCs for use in clustering, visualization, etc. 

```python

results.adt_size_factors

results.adt_markers

results.combined_pcs

```

## Customizing the analysis

Most parameters can be changed by setting the relevant fields in the `AnalyzeOptions` object.

For example, we can modify the number of neighbors and resolution used for graph-based clustering:

```python

options.build_snn_graph_options.num_neighbors = 10

options.miscellaneous_options.snn_graph_multilevel_resolution = 2

```

Or we can fiddle the the various dimensionality reduction parameters:

```python

options.run_pca_options.rank = 50

options.run_tsne_options.perplexity = 20

options.run_umap_options.min_dist = 0.5

```

The `AnalyzeOptions` has a few convenience methods to easily set the same parameter across multiple `*_options` attributes.

For example, to enable parallel processing in every step:

```python

options.set_threads(5)

```

Advanced users can even obtain the sequence of steps used internally by `analyze()` by calling it with `dry_run = True`:

```python

commands = scranpy.analyze(sce, dry_run = True)

print(commands)

## import scranpy

## import numpy

## 

## results = AnalyzeResults()

## ...

```

Users can then add, remove or replace steps as desired.

## Developer Notes

Steps to setup dependencies -

- initialize git submodules in `extern/libscran`.

- run `cmake .` inside the `extern/knncolle` to download the annoy library. a future version of this will use a cmake to setup the extern directory.

First one needs to build the extern library, this would generate a shared object file to `src/scranpy/core-[*].so`

```shell

python setup.py build_ext --inplace

```

For typical development workflows, run this for tests

```shell

python setup.py build_ext --inplace && tox

```

To rebuild the **ctypes** bindings [**cpptypes**](https://github.com/BiocPy/ctypes-wrapper):

```shell

cpptypes src/scranpy/lib --py src/scranpy/_cpphelpers.py --cpp src/scranpy/lib/bindings.cpp --dll _core

```

To rebuild the [dry run analysis source code](src/scranpy/analysis_dry.py):

```shell

./scripts/dryrun.py src/scranpy/analyze/live_analyze.py > src/scranpy/analyze/dry_analyze.py

```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/biocpy/scranpy

Awesome Lists containing this project

README