https://github.com/jeffreypullin/smash-fork
Fork of smashpy for research purposes
- Host: GitHub
- URL: https://github.com/jeffreypullin/smash-fork
- Owner: jeffreypullin
- License: mit
- Created: 2023-06-10T05:15:19.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2023-06-12T01:28:12.000Z (about 3 years ago)
- Last Synced: 2025-01-29T18:11:18.518Z (over 1 year ago)
- Language: Jupyter Notebook
- Size: 40.6 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# SMaSH framework
## Overview
The ```SMaSH``` (Scalable Marker gene Signal Hunter) framework is a general, scalable codebase for calculating marker genes from single-cell RNA-sequencing
data for a variety of different cell annotations as provided by the user, using supervised machine learning approaches. These annotations can be truly general:
they can be broad cell types/clusters, detailed sub-types of different broad clusters, cell organ of origin, whether the cell inhabits tumour tissue, surrounding
microenvironment, or healthy tissue, and more besides. ```SMaSH``` implements marker gene extraction using four different models (Random Forest, Balanced Random Forest, XGBoost,
and a deep neural network) and two different information gain metrics (Gini impurity for the ensemble learners, and Shapley value for the neural network). For some details
on the ```SMaSH``` implementation (see Figure below) please consult our pre-print: https://www.biorxiv.org/content/10.1101/2021.04.08.438978v1. ```SMaSH``` is integrated with the ```ScanPy``` framework, working directly from the ```AnnData```
object of RNA-sequencing counts and a vector of user-defined annotations for each cell according to the marker gene extraction problem.
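To make the Gini-based metric mentioned above more concrete, here is a minimal, standard-library sketch of how a single split on one gene reduces Gini impurity. It is an illustration only, not the ```SMaSH``` implementation: the function names, the threshold, and the toy expression values are all hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def impurity_decrease(expression, labels, threshold):
    """Weighted drop in Gini impurity when cells are split at `threshold`."""
    left = [lab for x, lab in zip(expression, labels) if x <= threshold]
    right = [lab for x, lab in zip(expression, labels) if x > threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(labels) - weighted


# Toy data (hypothetical): four cells carrying two annotations.
labels = ["A", "A", "B", "B"]
marker_like = [0.0, 0.1, 5.0, 6.0]    # separates A from B
uninformative = [0.0, 5.0, 0.1, 6.0]  # does not separate A from B

print(impurity_decrease(marker_like, labels, threshold=1.0))
print(impurity_decrease(uninformative, labels, threshold=1.0))
```

A gene whose expression cleanly separates the annotations yields a large impurity decrease, while an uninformative gene yields none. Tree ensembles aggregate such decreases over many splits and trees to rank genes.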

## Installation
```SMaSH``` is accessible on ```pypi``` (https://pypi.org/project/smashpy) and can be installed with ```pip```:
```
pip install smashpy
```
All package requirements and versions are listed in ```setup.py``` and are installed automatically with ```SMaSH```. We therefore recommend that the user
work from a fresh environment, such as one created with Anaconda:
```
# create the environment (python is listed so that pip is installed inside it)
conda create -n smash_env python
conda activate smash_env
pip install smashpy
```
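To confirm that the installation landed in the active environment, a small check such as the following may help. It is a sketch that relies only on the standard library; the helper name ```installed_version``` is hypothetical.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("smashpy")
print("smashpy version:", version if version is not None else "not installed")
```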
## Up and running with ```SMaSH``` !
The full ```SMaSH``` workflow runs as a sequence of function calls, covering data preparation, initial gene filtering with principal components analysis, one of the
```SMaSH``` models for gene importance calculation, and the final ranking and selection of genes from the initial ```AnnData``` object. For complete coverage of all models, we
have included several notebooks in this repository (see ```notebooks/```), where each folder corresponds to a different publicly available data-set and contains four notebooks,
one for each of the four ```SMaSH``` models used for the gene importance calculation. Let's consider the Paul15 data-set, available from ```ScanPy```:
```
import scanpy as sc
obj = sc.datasets.paul15()
```
This can then be analysed step-by-step with the ```SMaSH``` functions, starting from the instantiation of the ```SMaSH``` object:
```
import smashpy
sm = smashpy.smashpy()
```
Each step in the marker gene extraction chain (see Figure) can now be applied. For more details on each of these functions, see the examples provided in ```notebooks/``` and
Python's built-in ```help```, which shows the documentation of any ```SMaSH``` function ```func```:
```
# pass the function itself (no parentheses), so that it is documented rather than called
help(sm.func)
```
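If the name of the function is not known in advance, the public methods of an object can be listed first. The sketch below uses only the standard library; ```public_methods``` and the stand-in class ```Example``` are hypothetical and are not part of ```SMaSH```.

```python
import inspect


def public_methods(obj):
    """Names of public callable attributes of `obj`, sorted alphabetically."""
    return sorted(
        name
        for name, member in inspect.getmembers(obj, callable)
        if not name.startswith("_")
    )


class Example:
    """Stand-in class used only for this illustration."""

    def data_preparation(self):
        ...

    def run(self):
        ...


print(public_methods(Example()))
```

With the object created above, ```public_methods(sm)``` would list the available steps, and each name can then be passed to ```help``` via ```getattr(sm, name)```.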
Please note that the user-defined vector of annotations must contain one entry for each cell
and be stored so that it can be accessed directly from the ```AnnData``` input, i.e.
```
import numpy as np
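# my_annotations: your own sequence holding one label per cell, in the same order as the cells in obj
obj.obs["annotation"] = np.array(my_annotations)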
```
using the usual convention in ```ScanPy``` and ```AnnData```.
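Because the annotations must form a flat vector with exactly one label per cell, it can be worth validating them before attaching them to the object. The following is a minimal sketch; the helper ```as_annotation_vector``` and the toy labels are hypothetical.

```python
import numpy as np


def as_annotation_vector(annotations, n_cells):
    """Return annotations as a 1-D array, checking there is one label per cell."""
    arr = np.asarray(annotations)
    if arr.ndim != 1:
        raise ValueError(f"expected a 1-D vector, got shape {arr.shape}")
    if arr.shape[0] != n_cells:
        raise ValueError(f"expected {n_cells} labels, got {arr.shape[0]}")
    return arr


labels = ["T cell", "B cell", "T cell"]  # hypothetical labels for 3 cells
vec = as_annotation_vector(labels, n_cells=3)
print(vec.shape)
```

For an ```AnnData``` object, the number of cells is available as ```obj.n_obs```, so the result of ```as_annotation_vector(my_annotations, obj.n_obs)``` could be assigned to ```obj.obs["annotation"]```.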
Given ```obj```, an ```AnnData``` object of counts, and the additional user-defined set of annotations, we may now apply ```SMaSH``` step-by-step:
```
# Data preparation
sm.data_preparation(obj)
# Removing general genes
obj = sm.remove_general_genes(obj)
# Removing genes expressed in fewer than 30% of cells within groups
obj = sm.remove_features_pct(obj, group_by="annotation", pct=0.3)
# Removing genes expressed in more than 50% of cells in a given group when they are also expressed in more than 75% of cells within another group
obj = sm.remove_features_pct_2groups(obj, group_by="annotation", pct1=0.75, pct2=0.5)
# Inverse PCA to remove unimportant genes
obj = sm.scale_filter_features(obj, n_components=None, filter_expression=True)
# Run deep neural network to locate optimal markers for classification of cells according to the original user annotations
sm.DNN(obj, group_by="annotation", model=None, balance=True, verbose=True, save=False)
# Calculate the importance of each gene using the Shapley value
# and return the top 20 genes for each annotation (class) provided, as a final dictionary
selectedGenes, selectedGenes_dict = sm.run_shap(obj, group_by="annotation", model=None, verbose=True, pct=0.1, restrict_top=("local", 20))
```
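The percentage-based filtering step can be pictured with plain ```NumPy```. The sketch below shows one plausible reading of "expressed in at least a given fraction of cells within a group"; it is not the ```SMaSH``` code, and the function names, the rule for combining groups, and the toy matrix are assumptions made for illustration.

```python
import numpy as np


def fraction_expressing(counts, groups):
    """Per-group fraction of cells with a non-zero count for each gene.

    counts: (n_cells, n_genes) array; groups: length-n_cells label vector.
    Returns (sorted unique labels, array of shape (n_groups, n_genes)).
    """
    counts = np.asarray(counts)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    fractions = np.vstack(
        [(counts[groups == g] > 0).mean(axis=0) for g in labels]
    )
    return labels, fractions


def keep_genes(counts, groups, pct=0.3):
    """Boolean mask of genes expressed in at least `pct` of cells in some group."""
    _, fractions = fraction_expressing(counts, groups)
    return (fractions >= pct).any(axis=0)


# Toy matrix (hypothetical): 4 cells x 3 genes, two groups.
counts = np.array([
    [5, 0, 0],
    [3, 0, 1],
    [0, 0, 0],
    [0, 0, 0],
])
groups = ["A", "A", "B", "B"]
print(keep_genes(counts, groups, pct=0.6))
```

Here the first gene is detected in every cell of group A and is kept, the second is never detected and is dropped, and the third is detected in half of group A, so whether it survives depends on the chosen threshold.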
NB: Complete and up-to-date pipelines to follow step-by-step are in the ```updated notebooks``` folder.
## Contact
We're always happy to hear of any suggestions, issues, bug reports, and possible ideas for collaboration.
Simone Riva (University of Cambridge, and Wellcome Sanger Institute)
Mike Nelson (University of Cambridge, and EMBL-EBI)