{"id":20456547,"url":"https://github.com/krassowski/gsea-api","last_synced_at":"2025-04-13T04:05:42.579Z","repository":{"id":35012354,"uuid":"188071398","full_name":"krassowski/gsea-api","owner":"krassowski","description":"Pandas API for multiple Gene Set Enrichment Analysis implementations in Python (GSEApy, cudaGSEA, GSEA)","archived":false,"fork":false,"pushed_at":"2023-03-31T15:31:22.000Z","size":166,"stargazers_count":14,"open_issues_count":3,"forks_count":2,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-13T04:05:21.384Z","etag":null,"topics":["bioinformatics","cuda","enrichment","gene-set-enrichment","gene-sets","gsea","pandas","pathway-analysis","python3","transcriptomics"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/krassowski.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2019-05-22T16:00:30.000Z","updated_at":"2025-03-01T23:13:29.000Z","dependencies_parsed_at":"2023-01-15T11:59:34.398Z","dependency_job_id":"6e9a612a-7a46-48ba-b633-1d2688b13704","html_url":"https://github.com/krassowski/gsea-api","commit_stats":{"total_commits":69,"total_committers":2,"mean_commits":34.5,"dds":"0.33333333333333337","last_synced_commit":"265ff71deb9b7188b70b9054d3bb455e355926bb"},"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/krassowski%2Fgsea-api","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/krassowski%2Fgsea-api/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/k
rassowski%2Fgsea-api/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/krassowski%2Fgsea-api/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/krassowski","download_url":"https://codeload.github.com/krassowski/gsea-api/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248661707,"owners_count":21141450,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","cuda","enrichment","gene-set-enrichment","gene-sets","gsea","pandas","pathway-analysis","python3","transcriptomics"],"created_at":"2024-11-15T11:23:02.542Z","updated_at":"2025-04-13T04:05:42.549Z","avatar_url":"https://github.com/krassowski.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# GSEA API for Pandas\n[![Build Status](https://travis-ci.com/krassowski/gsea-api.svg?branch=master)](https://travis-ci.com/krassowski/gsea-api)\n[![codecov](https://codecov.io/gh/krassowski/gsea-api/branch/master/graph/badge.svg)](https://codecov.io/gh/krassowski/gsea-api)\n[![MIT License](https://img.shields.io/badge/license-MIT-blue.svg?style=flat)](http://choosealicense.com/licenses/mit/)\n[![DOI](https://zenodo.org/badge/188071398.svg)](https://zenodo.org/badge/latestdoi/188071398)\n\nPandas API for Gene Set Enrichment Analysis in Python (GSEApy, cudaGSEA, GSEA)\n\n- aims to provide a unified API for various GSEA implementations; uses pandas DataFrames and a hierarchy of Pythonic classes.\n- file exports (exporting input for GSEA) use low-level numpy functions 
and are much faster than in pandas\n- aims to allow researchers to easily compare different implementations of GSEA, and to integrate those in projects which require high-performance GSEA (e.g. massive screening for drug-repositioning)\n- provides useful utilities for working with GMT files, or gene sets and pathways in general in Python\n\n\n## Installation\n\nTo install the API, use:\n\n```bash\npip3 install gsea_api\n```\n\nSee [below](#Installing-GSEA-implementations) for instructions on installing specific GSEA implementations.\n\n## Example usage\n\n```python\nfrom pandas import read_table\nfrom gsea_api.expression_set import ExpressionSet\nfrom gsea_api.gsea import GSEADesktop\nfrom gsea_api.molecular_signatures_db import GeneSets\n\nreactome_pathways = GeneSets.from_gmt('ReactomePathways.gmt')\n\ngsea = GSEADesktop()\n\ndesign = ['Disease', 'Disease', 'Disease', 'Control', 'Control', 'Control']\nmatrix = read_table('expression_data.tsv', index_col='Gene')\n\nresult = gsea.run(\n    # note: contrast() is not necessary in this simple case\n    ExpressionSet(matrix, design).contrast('Disease', 'Control'),\n    reactome_pathways,\n    metric='Signal2Noise',\n    permutations=1000\n)\n```\n\n\nWhere `expression_data.tsv` is in the following format:\n\n```\nGene\tPatient_1\tPatient_2\tPatient_3\tPatient_4\tPatient_5\tPatient_6\nTACC2\t0.2\t0.1\t0.4\t0.6\t0.7\t2.1\nTP53\t2.3\t0.2\t2.1\t2.0\t0.3\t0.6\n```\n\n### MSigDB integration\n\nThe [Molecular Signatures Database](https://www.gsea-msigdb.org/gsea/msigdb/index.jsp) (MSigDB) can be downloaded from the [Broad Institute GSEA website](https://www.gsea-msigdb.org/gsea/downloads.jsp). It provides expert-curated gene set collections, as well as curated subsets of pathway databases (Reactome, KEGG, Biocarta, Gene Ontology) trimmed to remove redundant, overlapping, and otherwise low-value terms (if needed).\n\nYou can download all the pathway collections at once (search for `ZIPped MSigDB` on the download page). 
After downloading and unzipping (e.g., to a local directory named `msigdb`), you can access the gene sets from MSigDB with:\n\n```python\nfrom gsea_api.molecular_signatures_db import MolecularSignaturesDatabase\n\nmsigdb = MolecularSignaturesDatabase('msigdb', version=7.1)\nmsigdb.gene_sets\n```\n\n`msigdb.gene_sets` returns a list of dictionaries describing auto-detected pathways:\n\n```python\n[\n    {'name': 'c1.all', 'id_type': 'symbols'},\n    {'name': 'c1.all', 'id_type': 'entrez'},\n    {'name': 'c2.cp.reactome', 'id_type': 'symbols'},\n    {'name': 'c2.cp.reactome', 'id_type': 'entrez'}\n    # etc.\n]\n```\n\nThe location on disk and the version are available in `msigdb.path` and `msigdb.version`.\n\n`msigdb.load` loads a specific collection into a `GeneSets` object:\n\n```python\n\u003e kegg_pathways = msigdb.load('c2.cp.kegg', 'symbols')\n\u003e print(kegg_pathways)\n\u003cGeneSets 'c2.cp.kegg' with 186 gene sets\u003e\n```\n\nThis object can be passed to any of the supported GSEA implementations; please see below for a detailed description of the `GeneSets` object.\n\n### `GeneSets` objects\n\n`GeneSets` represents a collection of sets of genes, where each set is represented as a `GeneSet` object.\n\nYou can check the number of sets contained within a collection with:\n\n```python\n\u003e len(kegg_pathways)\n186\n```\n\nThe gene sets are accessible through the `gene_sets` (tuple) and `gene_sets_by_name` (dict) properties:\n\n```python\n\u003e kegg_pathways.gene_sets[:2]\n(\u003cGeneSet 'KEGG_TIGHT_JUNCTION' with 132 genes\u003e, \u003cGeneSet 'KEGG_RNA_DEGRADATION' with 59 genes\u003e)\n\u003e kegg_pathways.gene_sets_by_name\n{\n    'KEGG_TIGHT_JUNCTION': \u003cGeneSet 'KEGG_TIGHT_JUNCTION' with 132 genes\u003e,\n    'KEGG_RNA_DEGRADATION': \u003cGeneSet 'KEGG_RNA_DEGRADATION' with 59 genes\u003e\n    # etc.\n}\n```\n\n#### Subsetting collections\n\nSometimes only a subset of genes is measured in an experiment. 
You can remove from the collection the gene sets which do not contain any of the measured genes:\n\n```python\n\u003e measured_genes = {'APOE', 'CYB5R1', 'FCER1G', 'PVR', 'HK2'}\n\u003e measured_subset = kegg_pathways.subset(measured_genes)\n\u003e print(measured_subset)\n\u003cGeneSets with 12 gene sets\u003e\n```\n\nThe skipped gene sets are accessible in `measured_subset.empty_gene_sets` for inspection.\n\n#### Trimming collections\n\n```python\n\u003e kegg_pathways.trim(min_genes=10, max_genes=20)\n\u003cGeneSets with 21 gene sets\u003e\n```\n\n#### Prettifying names\n\n```python\ndef prettify_kegg_name(gene_set):\n    return gene_set.name.replace('KEGG_', '').replace('_', ' ')\n\nkegg_pathways_pretty = kegg_pathways.format_names(prettify_kegg_name)\nkegg_pathways_pretty.gene_sets[:2]\n# (\u003cGeneSet 'TIGHT JUNCTION' with 132 genes\u003e, \u003cGeneSet 'RNA DEGRADATION' with 59 genes\u003e)\n```\n\nFor MSigDB 7.4+:\n\n```python\ndef pretty_reactome_name(gene_set):\n    return gene_set.metadata['DESCRIPTION_BRIEF']\n\nreactome_pathways_pretty = reactome_pathways.format_names(pretty_reactome_name)\nreactome_pathways_pretty.gene_sets[:2]\n#\n```\n\n#### Other properties\n\nOther properties and methods offered by `GeneSets` include:\n   - `all_genes`: returns a set of all genes which are covered by the gene sets in the collection\n   - `name`: the name of the collection\n   - `to_frame()`: returns a pandas `DataFrame` describing membership of the genes (gene sets = rows, genes = columns), which can be used for UpSet visualisation (e.g. 
with [ComplexUpset](https://github.com/krassowski/complex-upset))\n   - `to_gmt(path: str)`: exports the gene sets to a GMT (Gene Matrix Transposed) file\n\n## Installing GSEA implementations\n\nThe following GSEA implementations are supported:\n\n### GSEA from Broad Institute\n\nLog in or register on [the official GSEA website](http://software.broadinstitute.org/gsea/login.jsp) and download the `gsea_3.0.jar` file (or a newer version).\n\nProvide the location of the downloaded file to `GSEADesktop()` using the `gsea_jar_path` argument, e.g.:\n\n```python\ngsea = GSEADesktop(gsea_jar_path='downloads/gsea_3.0.jar')\n```\n\n### GSEApy\n\nTo use GSEApy, please install it with:\n\n```\npip3 install gseapy\n```\n\nUse it with:\n\n```python\nfrom gsea_api.gsea import GSEApy\n\ngsea = GSEApy()\n```\n\n### cudaGSEA\n\nPlease clone this [fork of cudaGSEA](https://github.com/krassowski/cudaGSEA) and compile the binary version:\n\n```bash\ngit clone https://github.com/krassowski/cudaGSEA\ncd cudaGSEA/cudaGSEA/src/\n# if on Ubuntu:\n# sudo apt install nvidia-cuda-toolkit\n# whereis nvcc\nexport CUDA_HOME=/usr\nexport R_INC=/usr/share/R/include\nexport RCPP_INC=/usr/local/lib/R/site-library/Rcpp/include\nmake cudaGSEA\n```\n\nDepending on your GPU and drivers, you may see an `Unsupported gpu architecture 'compute_20'` error; if so, edit the `Makefile` to remove `-gencode arch=compute_20,code=compute_20` (see [this Ask Ubuntu post](https://askubuntu.com/questions/960238/nvcc-fatal-unsupported-gpu-architecture-compute-20)).\n\nYou can also try to use [the original version](https://github.com/gravitino/cudaGSEA), which does not implement FDR calculations.\n\nUse it with:\n\n```python\nfrom gsea_api.gsea import cudaGSEA\n\n# CPU implementation can be used with use_cpu=True\ngsea = cudaGSEA(fdr='full', use_cpu=False, path='cudaGSEA/cudaGSEA/src/cudaGSEA')\n```\n\n## Citation\n\n[![DOI](https://zenodo.org/badge/188071398.svg)](https://zenodo.org/badge/latestdoi/188071398)\n\nPlease also cite the authors of 
the wrapped tools that you use.\n\n\n## References\n\nThe initial version of this code was written for a [Master's thesis project](https://github.com/krassowski/drug-disease-profile-matching) at Imperial College London.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkrassowski%2Fgsea-api","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkrassowski%2Fgsea-api","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkrassowski%2Fgsea-api/lists"}