{"id":17152777,"url":"https://github.com/jeffreypullin/smash-fork","last_synced_at":"2025-03-24T12:45:10.597Z","repository":{"id":174113943,"uuid":"651788305","full_name":"jeffreypullin/smash-fork","owner":"jeffreypullin","description":"Fork of smashpy for research purposes","archived":false,"fork":false,"pushed_at":"2023-06-12T01:28:12.000Z","size":42540,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-29T18:11:18.518Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jeffreypullin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-06-10T05:15:19.000Z","updated_at":"2023-06-10T05:16:02.000Z","dependencies_parsed_at":"2023-07-02T12:00:43.144Z","dependency_job_id":null,"html_url":"https://github.com/jeffreypullin/smash-fork","commit_stats":null,"previous_names":["jeffreypullin/smash-fork"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jeffreypullin%2Fsmash-fork","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jeffreypullin%2Fsmash-fork/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jeffreypullin%2Fsmash-fork/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jeffreypullin%2Fsmash-fork/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jeffreypullin","down
load_url":"https://codeload.github.com/jeffreypullin/smash-fork/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245275330,"owners_count":20588886,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-14T21:44:31.656Z","updated_at":"2025-03-24T12:45:10.573Z","avatar_url":"https://github.com/jeffreypullin.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# SMaSH framework\n\n## Overview \nThe ```SMaSH``` (Scalable Marker gene Signal Hunter) framework is a general, scalable codebase for calculating marker genes from single-cell RNA-sequencing\ndata for a variety of different cell annotations as provided by the user, using supervised machine learning approaches.  These annotations can be truly general:\nthey can be broad cell types/clusters, detailed sub-types of different broad clusters, cell organ of origin, whether the cell inhabits tumour tissue, surrounding\nmicroenvironment, or healthy tissue, and more besides. ```SMaSH``` implements marker gene extraction using four different models (Random Forest, Balanced Random Forest, XGBoost,\nand a deep neural network) and two different information gain metrics (Gini impurity for the ensemble learners, and Shapley value for the neural network). For some details\non the ```SMaSH``` implementation (see Figure below) please consult our pre-print: https://www.biorxiv.org/content/10.1101/2021.04.08.438978v1. 
```SMaSH``` is integrated with the ```ScanPy``` framework, working directly from the ```AnnData```\nobject of RNA-sequencing counts and a vector of user-defined annotations for each cell according to the marker gene extraction problem. \n\n\u003c!-- \u003cimg src=\"images/SMaSH_flowchart.png\"\u003e --\u003e\n\u003cimg src=\"images/SMaSH_framework.png\"\u003e\n\n## Installation\n```SMaSH``` is accessible on ```pypi``` (https://pypi.org/project/smashpy) and can be installed with ```pip```:\n\n```\npip install smashpy\n```\nAll package requirements and versions are summarised in ```setup.py``` and are automatically installed with ```SMaSH```. We therefore recommend that the user\nwork from a fresh environment, such as one created with Anaconda:\n\n``` \nconda create -n smash_env \nconda activate smash_env\npip install smashpy\n```\n\n## Up and running with ```SMaSH``` ! \nThe full ```SMaSH``` workflow is implemented as a sequence of functions, covering data preparation, initial gene filtering with principal components analysis, one of the\n```SMaSH``` models for gene importance calculation, and the final ranking and selection of all genes from the initial ```AnnData``` object. For complete coverage of all models, we \nhave included several notebooks in this repository (see ```notebooks/```), where each folder corresponds to a different publicly available data-set and contains four notebooks, \none for each of the four different ```SMaSH``` models for the gene importance calculation. Let's consider the Paul15 data-set, available from ```ScanPy```:\n\n```\nimport scanpy as sc\nobj = sc.datasets.paul15()\n```\n\nThis can then be analysed step-by-step with the ```SMaSH``` functions, starting from the instantiation of the ```SMaSH``` object:\n\n```\nimport smashpy\nsm = smashpy.smashpy()\n```\n\nEach step in the marker gene extraction chain (see Figure) can now be applied. 
For more details on each of these functions, see the examples provided in ```notebooks/``` and\nthe help service, where full details on the implementation and attributes of any ```SMaSH``` function ```func``` can be accessed with \n\n```\nhelp(sm.func)\n```\n\nPlease note that the user-defined vector of annotations must be added for each cell\nand stored as an object which can be accessed directly from the ```AnnData``` input, i.e. corresponding to\n\n```\nimport numpy as np\nobj.obs[\"annotation\"] = np.array([my_annotations])\n```\n\nusing the usual convention in ```ScanPy``` and ```AnnData```. \n\nGiven ```obj```, an ```AnnData``` object of counts, and the additional user-defined set of annotations, we may now apply ```SMaSH``` step-by-step:\n\n```\n# Data preparation\nsm.data_preparation(obj)\n\n# Removing general genes\nobj = sm.remove_general_genes(obj)\n\n# Removing genes expressed in less than 30% within groups\nobj = sm.remove_features_pct(obj, group_by=\"annotation\", pct=0.3)\n\n# Removing genes expressed in more than 50% in a given group where genes are expressed for more than 75% within a given group\nobj = sm.remove_features_pct_2groups(obj, group_by=\"annotation\", pct1=0.75, pct2=0.5)\n\n# Inverse PCA to remove unimportant genes\nobj = sm.scale_filter_features(obj, n_components=None, filter_expression=True)\n\n# Run deep neural network to locate optimal markers for classification of cells according to the original user annotations\nsm.DNN(obj, group_by=\"annotation\", model=None, balance=True, verbose=True, save=False)\n\n# Top 20 genes as a final dictionary, for each annotation (class) provided\n# Calculate the importances of each gene using the Shapley value\nselectedGenes, selectedGenes_dict = sm.run_shap(obj, group_by=\"annotation\", model=None, verbose=True, pct=0.1, restrict_top=(\"local\", 20))\n\n```\n\nNB: Complete and up-to-date pipelines to follow step-by-step are in the ```updated notebooks``` folder.\n\n\n## Contact\nWe're always happy 
to hear of any suggestions, issues, bug reports, and possible ideas for collaboration.\n\nSimone Riva \u003csimo.riva15@gmail.com\u003e, \u003csgr34@cam.ac.uk\u003e, \u003csr31@sanger.ac.uk\u003e (University of Cambridge, and Wellcome Sanger Institute) \n\nMike Nelson \u003cnelson@ebi.ac.uk\u003e (University of Cambridge, and EMBL-EBI)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjeffreypullin%2Fsmash-fork","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjeffreypullin%2Fsmash-fork","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjeffreypullin%2Fsmash-fork/lists"}