{"id":32111252,"url":"https://github.com/na396/sgcp","last_synced_at":"2026-02-18T09:02:49.955Z","repository":{"id":61377208,"uuid":"540984270","full_name":"na396/SGCP","owner":"na396","description":"SGCP: a spectral self-learning method for clustering genes in co-expression networks","archived":false,"fork":false,"pushed_at":"2024-08-19T23:34:17.000Z","size":5464,"stargazers_count":2,"open_issues_count":1,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-01-12T18:46:19.245Z","etag":null,"topics":["bioinformatics","clustering","genecoexpressionnetwork","graphs","networkclustering","networks","self-training","semi-supervised-learning","unsupervised-learning"],"latest_commit_sha":null,"homepage":"","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/na396.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2022-09-24T22:29:36.000Z","updated_at":"2024-08-19T23:34:20.000Z","dependencies_parsed_at":"2023-01-30T05:01:15.764Z","dependency_job_id":"99f229d3-7dc1-4ac7-8d9d-31cfddb48407","html_url":"https://github.com/na396/SGCP","commit_stats":{"total_commits":180,"total_committers":3,"mean_commits":60.0,"dds":"0.37222222222222223","last_synced_commit":"b0f4c0a9425b06ecd8983c9e9db53353c4145e92"},"previous_names":["na396/sgcp"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/na396/SGCP","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/na396%2FSGCP
","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/na396%2FSGCP/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/na396%2FSGCP/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/na396%2FSGCP/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/na396","download_url":"https://codeload.github.com/na396/SGCP/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/na396%2FSGCP/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29574065,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-18T08:38:15.585Z","status":"ssl_error","status_checked_at":"2026-02-18T08:38:14.917Z","response_time":162,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","clustering","genecoexpressionnetwork","graphs","networkclustering","networks","self-training","semi-supervised-learning","unsupervised-learning"],"created_at":"2025-10-20T14:21:41.290Z","updated_at":"2026-02-18T09:02:49.947Z","avatar_url":"https://github.com/na396.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"# SGCP: a spectral self-learning method for clustering genes in co-expression networks, [link](https://link.springer.com/article/10.1186/s12859-024-05848-w)\n\n\n## SGCP Introduction\nThe Self-training 
Gene Clustering Pipeline (`SGCP`) is an innovative framework for constructing and analyzing gene co-expression networks. Its primary objective is to group genes with similar expression patterns into cohesive clusters, often referred to as modules. SGCP introduces several novel steps that enable the computation of highly enriched gene modules in an unsupervised manner. What sets SGCP apart from existing frameworks is its integration of a semi-supervised clustering approach, which leverages Gene Ontology (GO) information. This unique step significantly enhances the quality of the resulting modules, producing highly enriched and biologically relevant clusters.\n\n## SGCP Publication\n`SGCP` is available at [BMC Bioinformatics](https://link.springer.com/article/10.1186/s12859-024-05848-w). \n\n## SGCP Installation\nFor detailed instructions and steps, please refer to the `SGCP` manual on the\n[Bioconductor page](https://bioconductor.org/packages/release/bioc/html/SGCP.html). To install the latest version of `SGCP`, you can access the GitHub repository using the following command:\n```{r}\n#install.packages(\"devtools\")\n#devtools::install_github(\"na396/SGCP\")\n```\n## SGCP license\nGPL-3\n\n## SGCP encoding\nUTF-8\n\n\n## SGCP Input\n\n`SGCP` requires three main inputs: __expData__, __geneID__, and __annotation_db__.\n*    __expData__: This is a matrix or dataframe of size `m*n` where `m` represents the number of genes and `n` represents the number of samples. It can contain data from either DNA-microarray or RNA-seq experiments. Note that `SGCP` assumes that pre-processing steps, such as normalization and batch effect correction, have already been performed, as these are not handled by the pipeline.\n*    __geneID__: A vector of gene identifiers corresponding to the rows in __expData__.\n*    __annotation_db__: The name of a genome-wide annotation package for the organism of interest, used in the gene ontology (GO) enrichment step. 
The `annotation_db` package must be installed by the user prior to using `SGCP`.\n\nBelow are some commonly used `annotation_db` packages along with their corresponding gene identifiers for different organisms.\n\n|organism                     | annotation_db  | gene identifier         |\n|:----------------------------|:--------------:|:------------------------|\n|Homo sapiens (Hs)            | org.Hs.eg.db   | Entrez Gene identifiers |\n|Drosophila melanogaster (Dm) | org.Dm.eg.db   | Entrez Gene identifiers |\n|Rattus norvegicus (Rn)       | org.Rn.eg.db   | Entrez Gene identifiers |\n|Mus musculus (Mm)            | org.Mm.eg.db   | Entrez Gene identifiers |\n|Arabidopsis thaliana (At)    | org.At.tair.db | TAIR identifiers        |\n\nGene expression datasets for your analysis can be obtained from the [Gene Expression Omnibus](https://www.ncbi.nlm.nih.gov/geo/), a public repository of high-throughput gene expression data.\n\n\n### SGCP Input Cleaning\nIn `SGCP`, the following assumptions are made about the input genes:\n\n* Genes must have expression values available across all samples, with no missing values.\n* Genes must exhibit non-zero variance in expression across all samples.\n* Each gene must have exactly one unique identifier, specified by __geneID__.\n* Genes must be annotated with Gene Ontology (GO) terms.\n\n\n## SGCP Input Example\nHere, we give a brief example of the `SGCP` input. For this documentation, we use the gene expression dataset `GSE181225`. For more information, visit its [Gene Expression Omnibus page](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE181225). \n\nThroughout this section, several Bioconductor packages will be required. 
Make sure to install and load them as needed to follow the example.\n\n```{r}\nif (!require(\"BiocManager\", quietly = TRUE))\n    install.packages(\"BiocManager\")\n\nBiocManager::install(c(\"org.Hs.eg.db\", \"GEOquery\", \"AnnotationDbi\"))\n```\n\nFirst, set the working directory.\n```{r}\n# Display the current working directory\nprint(getwd())\n\n# If necessary, change the path below to the directory where the data files are stored.\n# \".\" means current directory. On Windows use a forward slash / instead of the usual \\.\nworkingDir = \".\"\nsetwd(workingDir)\n```\n\nNext, we need to download the gene expression file. The R package `GEOquery` is used to obtain gene expression data from the [Gene Expression Omnibus](https://www.ncbi.nlm.nih.gov/geo/). For detailed information on how to use `GEOquery`, refer to the [GEOquery guide](https://bioconductor.org/packages/release/bioc/vignettes/GEOquery/inst/doc/GEOquery.html). \n\n\nThe expression file for `GSE181225` is listed on its [Gene Expression Omnibus page](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE181225). On the page, the file `GSE181225_LNCaP_p57_VO_and_p57_PIM1_RNA_Seq_normalizedCounts.txt.gz` in the `Supplementary files` section contains the normalized gene expression data. You can download this supplementary file manually, or use `getGEOSuppFiles` as shown here, which saves it under the directory specified by `baseDir`.\n\n```{r}\n\nlibrary(GEOquery)\n\ngse = getGEOSuppFiles(\"GSE181225\", baseDir = getwd())\n\n```\nAfter downloading the file, you should find a new directory named `GSE181225`, which contains the gene expression file. To proceed, read the gene expression file into R. 
The file has the following structure:\n*   The `Symbol` column contains the gene symbols.\n*   The remaining four columns represent different samples.\n \n\n```{r}\ndf = read.delim(\"GSE181225/GSE181225_LNCaP_p57_VO_and_p57_PIM1_RNA_Seq_normalizedCounts.txt.gz\")\nhead(df)\n```\n\nNext, create the __expData__, __geneID__, and __annotation_db__.\n```{r}\ngeneID = df[,1]\n\nexpData = df[, 2:ncol(df)]\nrownames(expData) = geneID\n\nlibrary(org.Hs.eg.db)\n```\nTo map gene symbols to Entrez identifiers using the __annotation_db__, you can use the `select` function from the `AnnotationDbi` package. Here’s how you can do it in R:\n\n```{r}\nlibrary(AnnotationDbi)\n\ngenes = AnnotationDbi::select(org.Hs.eg.db, keys = rownames(expData), \n                      columns=c(\"ENTREZID\"), \n                      keytype=\"SYMBOL\")\n# initial dimension\nprint(dim(genes))\nhead(genes)\n```\nRemove genes with missing `SYMBOL` or `ENTREZID`.\n\n```{r}\ngenes = genes[!is.na(genes$SYMBOL), ]\ngenes = genes[!is.na(genes$ENTREZID), ]\n\n# dimension after dropping missing values\nprint(dim(genes))\nhead(genes)\n```\n \nRemove genes with duplicated `SYMBOL` or `ENTREZID`.\n```{r}\ngenes = genes[!duplicated(genes$SYMBOL),]\ngenes = genes[!duplicated(genes$ENTREZID), ]\n# dimension after dropping duplicates\nprint(dim(genes))\nprint(head(genes))\n```\n\nKeep only rows in __expData__ that have corresponding gene identifiers present in `genes`.\n\n```{r}\nexpData = data.frame(expData, SYMBOL = rownames(expData))\nexpData =  merge(expData, genes, by = \"SYMBOL\")\n```\n\nProduce __expData__, keeping only the sample columns and aligning `genes` with its rows.\n```{r}\nrownames(expData) = expData$ENTREZID\n# merge() sorts rows by SYMBOL, so reorder genes to match the rows of expData\ngenes = genes[match(rownames(expData), genes$ENTREZID), ]\n# keep only the sample columns (drop the SYMBOL and ENTREZID columns)\nexpData = expData[, !(colnames(expData) %in% c(\"SYMBOL\", \"ENTREZID\"))]\nprint(head(expData))\n```\n\nRemove genes with zero variance from __expData__.\n\n```{r}\n# Dropping zero variance genes\n\nvars = apply(expData, 1, var)\nzeroInd = which(vars == 0)\n\nif(length(zeroInd) != 0) {\n  print(paste0(\"number of zero variance genes \", length(zeroInd)))\n  expData = expData[-zeroInd, ]\n  genes = 
genes[-zeroInd, ]\n}\n\nprint(paste0(\"number of genes after dropping \", dim(genes)[1]))\n```\nRemove genes with no gene ontology mapping.\n\n```{r}\n## Remove genes with no GO mapping\n\nxx = as.list(org.Hs.egGO[genes$ENTREZID])\nhaveGO  = sapply(xx,\n                 function(x) {if (length(x) == 1 \u0026\u0026 is.na(x)) FALSE else TRUE })\nnumNoGO  = sum(!haveGO)\nif(numNoGO != 0){\n  print(paste0(\"number of genes with no GO mapping \", numNoGO))\n  expData = expData[haveGO, ]\n  genes = genes[haveGO, ]\n  \n}\nprint(paste0(\"number of genes after dropping \", dim(genes)[1]))\n```\nProduce the final __expData__, __geneID__, and __annotation_db__. Now, the input is ready for `SGCP`. Refer to the\n[SGCP Bioconductor page](https://bioconductor.org/packages/release/bioc/html/SGCP.html) to see how to use this input in `SGCP`. \n\n```{r}\nexpData = expData\nprint(head(expData))\n\ngeneID = genes$ENTREZID\nprint(head(geneID))\n\nannotation_db = \"org.Hs.eg.db\" \n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fna396%2Fsgcp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fna396%2Fsgcp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fna396%2Fsgcp/lists"}