{"id":28269685,"url":"https://github.com/m-jahn/r-tools","last_synced_at":"2025-09-07T21:41:09.815Z","repository":{"id":65674064,"uuid":"165861467","full_name":"m-jahn/R-tools","owner":"m-jahn","description":"often used R scripts for bio-informatics work and plotting","archived":false,"fork":false,"pushed_at":"2023-01-27T16:06:19.000Z","size":2314,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-08-24T23:33:18.291Z","etag":null,"topics":["goterm","helper-functions","mass-spectrometry","proteomics","r-programming"],"latest_commit_sha":null,"homepage":null,"language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/m-jahn.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-01-15T14:05:16.000Z","updated_at":"2023-06-29T22:33:56.000Z","dependencies_parsed_at":"2023-02-15T10:46:59.681Z","dependency_job_id":null,"html_url":"https://github.com/m-jahn/R-tools","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/m-jahn/R-tools","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/m-jahn%2FR-tools","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/m-jahn%2FR-tools/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/m-jahn%2FR-tools/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/m-jahn%2FR-tools/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/m-jahn","download_url":"https://codeload.github.com/m-jahn/R-tools/tar.gz/refs/heads/master","sbom_url
":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/m-jahn%2FR-tools/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":274101680,"owners_count":25222446,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-07T02:00:09.463Z","response_time":67,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["goterm","helper-functions","mass-spectrometry","proteomics","r-programming"],"created_at":"2025-05-20T15:15:11.697Z","updated_at":"2025-09-07T21:41:09.806Z","avatar_url":"https://github.com/m-jahn.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"Rtools\n================\nMichael Jahn,\n2022-10-17\n\n\u003c!-- badges start --\u003e\n\n[![R build\nstatus](https://github.com/m-jahn/R-tools/workflows/R-CMD-check/badge.svg)](https://github.com/m-jahn/R-tools/actions)\n[![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://github.com/m-jahn)\n![GitHub issues](https://img.shields.io/github/issues/m-jahn/R-tools)\n![GitHub last\ncommit](https://img.shields.io/github/last-commit/m-jahn/R-tools)\n![Platform](https://img.shields.io/badge/platform-all-green)\n\u003c!-- badges end --\u003e\n\n------------------------------------------------------------------------\n\nUtility functions for bioinformatics work\n\n## Description\n\nThis package contains utility functions or wrappers for 
bioinformatics\nwork. It is not intended to be a full-grown R package but is maintained\nas a package for the sake of accessibility and documentation. Feel free\nto copy, fork or source functions that you find useful.\n\n## Installation\n\nTo install the package directly from GitHub, use this function from\nthe `devtools` package in your R session:\n\n``` r\nrequire(devtools)\ndevtools::install_github(\"https://github.com/m-jahn/R-tools\")\n```\n\n## Proteomics functions\n\n### aggregate_pep\n\nAggregate peptide abundances to protein abundances.\n\nSimilar to the OpenMS module ProteinQuantifier, this function provides\ndifferent methods to aggregate peptide intensities to their parent\nproteins. It is mainly intended for use with (raw) Diffacto results,\na table of peptide intensities and covariation scores (weights) that can\nbe used to filter peptides before aggregating them up to protein\nabundances.\n\n``` r\n# load additional dependencies\nlibrary(Rtools)\n\n# generate data frame\ndf \u003c- data.frame(\n  protein = c(\"A\", \"B\", \"C\", \"C/D\", \"C/D/E\", \"E\", \"F\", \"G\"),\n  n_protein = c(1,1,1,2,3,1,1,1),\n  weight = rep(1,8),\n  peptide = letters[1:8],\n  ab1 = sample(1:100, 8),\n  ab2 = sample(1:100, 8),\n  ab3 = sample(1:100, 8)\n)\n\naggregate_pep(\n  data = df, \n  sample_cols = c(\"ab1\", \"ab2\", \"ab3\"),\n  protein_col = \"protein\",\n  peptide_col = \"peptide\",\n  n_protein_col = \"n_protein\",\n  split_ambiguous = TRUE,\n  split_char = \"/\",\n  method = \"sum\"\n)\n```\n\n    ## [1] \"ab1\" \"ab2\" \"ab3\"\n\n    ## # A tibble: 7 × 5\n    ##   protein n_peptides   ab1   ab2   ab3\n    ##   \u003cchr\u003e        \u003cint\u003e \u003cdbl\u003e \u003cdbl\u003e \u003cdbl\u003e\n    ## 1 A                1  31    14     9  \n    ## 2 B                1  79    25    93  \n    ## 3 C                3  80.3 158.   
87.3\n    ## 4 D                2  29.3  68.5  15.3\n    ## 5 E                2  64.3 120    44.3\n    ## 6 F                1  50    57   100  \n    ## 7 G                1  43    92    83\n\n### apply_norm\n\nApply normalization based on different published methods. This function\nis a wrapper applying different normalization functions from other\npackages, such as `limma`, `vsn` and `preprocessCore`. These are not\nimported automatically but have to be installed separately.\nAlternatively, it will apply any custom normalization function that is\npassed to the `norm_function` argument.\n\n``` r\ndf \u003c- data.frame(\n  protein = LETTERS[1:5],\n  cond1 = sample(1:100, 5),\n  cond2 = sample(1:100, 5),\n  cond3 = sample(1:100, 5)\n)\n\n# normalize protein abundance to obtain identical median;\n# function borrowed from limma::normalizeMedianValues()\nmedian_norm \u003c- function(x) {\n  cmed \u003c- log(apply(x, 2, median, na.rm = TRUE))\n  cmed \u003c- exp(cmed - mean(cmed))\n  t(t(x)/cmed)\n}\n\ndf_norm \u003c- apply_norm(\n  df, \n  norm_function = median_norm, \n  sample_cols = 2:ncol(df),\n  ref_cols = NULL\n)\n\n# the data after normalization\nprint(df_norm)\n```\n\n    ##   protein    cond1     cond2    cond3\n    ## 1       A 18.58728  39.23981 54.78766\n    ## 2       B 40.27244  83.71160 17.02860\n    ## 3       C 41.82138  18.31191 19.99009\n    ## 4       D 22.20147  23.54389 44.42243\n    ## 5       E 39.23981 107.25548 39.23981\n\n``` r\n# Has the normalization worked? 
We can compare column medians\n# for original and normalized data\napply(df[2:4], 2, median)\n```\n\n    ## cond1 cond2 cond3 \n    ##    76    15    53\n\n``` r\napply(df_norm[2:4], 2, median)\n```\n\n    ##    cond1    cond2    cond3 \n    ## 39.23981 39.23981 39.23981\n\n### fct_cluster\n\nCluster levels of a factor based on a response and a grouping variable.\nThe function changes the order of levels of a factor by clustering\nlevels according to similarity of a second response variable, and an\noptional third grouping variable.\n\n``` r\n# set seed to obtain same values\nset.seed(123)\n\n# a data frame with 5 observations for 5 different groups (A to E)\ndf \u003c- data.frame(\n  fc = factor(rep(letters[1:5], 5)),\n  group = rep(LETTERS[1:5], each = 5),\n  response = rnorm(25)\n)\n\n# levels in alphabetical order\nlevels(df$fc)\n```\n\n    ## [1] \"a\" \"b\" \"c\" \"d\" \"e\"\n\n``` r\n# reorder levels of \"fc\" by clustering values in \"response\" over \"group\"\nlevels(with(df, fct_cluster(fc, group, response)))\n```\n\n    ## [1] \"c\" \"a\" \"e\" \"b\" \"d\"\n\n``` r\n# also works with NA or infinite values;\n# infinite values are internally replaced with NA to allow clustering\ndf[c(1,6,7), \"response\"] \u003c- -Inf\nlevels(with(df, fct_cluster(fc, group, response)))\n```\n\n    ## [1] \"c\" \"a\" \"e\" \"b\" \"d\"\n\n``` r\n# missing combinations of variables are completed with NA internally\ndf \u003c- df[-c(1,6), ]\nlevels(with(df, fct_cluster(fc, group, response)))\n```\n\n    ## [1] \"c\" \"a\" \"e\" \"b\" \"d\"\n\n``` r\n# different order of factor level does not change result\ndf$fc \u003c- factor(df$fc, c(\"c\",\"b\",\"e\",\"d\", \"a\"))\nlevels(with(df, fct_cluster(fc, group, response)))\n```\n\n    ## [1] \"c\" \"a\" \"e\" \"b\" \"d\"\n\n### get_topgo\n\nConvenience wrapper for the TopGO package (Rahnenfuehrer et al.). This\nfunction carries out a TopGO gene ontology enrichment on a data set with\ncustom protein/gene IDs and GO terms. 
The function takes as main input a\ndata frame with three specific columns: cluster numbers, Gene IDs, and\nGO terms. Alternatively, these can also be supplied as three individual\nlists.\n\n``` r\n# The get_topgo function will require the TopGO package\n# as an additional dependency that is not automatically\n# attached with this package.\nlibrary(topGO)\n\n# a list of arbitrary GO terms\ngo_terms \u003c- c(\n  \"GO:0006412\", \"GO:0015979\", \"GO:0046148\", \"GO:1901566\", \"GO:0042777\", \"GO:0006614\",\n  \"GO:0016114\", \"GO:0006605\", \"GO:0090407\", \"GO:0031564\", \"GO:0032784\", \"GO:0052889\",\n  \"GO:0032787\", \"GO:0043953\", \"GO:0046394\", \"GO:0042168\", \"GO:0009124\", \"GO:0006090\",\n  \"GO:0016108\", \"GO:0016109\", \"GO:0016116\", \"GO:0016117\", \"GO:0065002\", \"GO:0006779\",\n  \"GO:0072330\", \"GO:0046390\", \"GO:0006754\", \"GO:0018298\", \"GO:0006782\", \"GO:0022618\",\n  \"GO:0042255\", \"GO:0046501\", \"GO:0070925\", \"GO:0071826\", \"GO:0006783\", \"GO:0009156\"\n)\n\n# construct a sample data set with 26  different genes in 2 different groups\n# and test which (randomly sampled) GO terms might be enriched in both groups.\n# We randomly sample 1 to 3 GO terms per gene. 
They need to be formatted as one\n# string of GO terms separated by \"; \".\n# set seed to obtain same values\nset.seed(123)\n\ndf \u003c- data.frame(\n  GeneID = LETTERS,\n  cluster = rep(c(1, 2), each = 13),\n  Gene.ontology.IDs = sapply(1:26,\n    function(x) paste(sample(go_terms, sample(1:3, 1)), collapse = \";\")\n  ),\n  stringsAsFactors = FALSE\n)\n\n# test if GO terms are enriched in group 1 against background\nget_topgo(df, selected.cluster = 1, topNodes = 5)\n```\n\n    ##        GO.ID                                   Term Annotated Significant\n    ## 1 GO:0044249          cellular biosynthetic process        16          11\n    ## 2 GO:0009058                   biosynthetic process        17          11\n    ## 3 GO:1901576 organic substance biosynthetic process        17          11\n    ## 4 GO:0018130       heterocycle biosynthetic process         8           6\n    ## 5 GO:0019438 aromatic compound biosynthetic process         8           6\n    ##   Expected classicFisher weightedFisher elimFisher              SigGenes\n    ## 1      8.0         0.021           0.45      0.021 A,B,C,D,E,F,I,J,K,L,M\n    ## 2      8.5         0.048           1.00      0.048 A,B,C,D,E,F,I,J,K,L,M\n    ## 3      8.5         0.048           0.30      0.048 A,B,C,D,E,F,I,J,K,L,M\n    ## 4      4.0         0.101           1.00      0.101           C,D,E,J,K,M\n    ## 5      4.0         0.101           1.00      0.101           C,D,E,J,K,M\n\n### silhouette_analysis\n\nWrapper function to perform silhouette analysis on different cluster\nnumbers. Silhouette analysis shows the clusters that have explanatory\npower. That includes clusters that are best separated from the\nneighbours, resulting in a higher average silhouette width (the decisive\nmetric to judge optimal cluster number). 
This function applies the\nsilhouette analysis iteratively for a vector of different cluster\nnumbers and stores results in a list.\n\n``` r\n# generate a random matrix that we use for clustering with the \n# format of 100 rows (e.g. determined gene expression) and 10 \n# columns (conditions)\nmat \u003c- matrix(rnorm(1000), ncol = 10)\n\n# we can perform clustering on this matrix using e.g. hclust:\n# there is clearly no good separation between different clusters of 'genes'\nclust \u003c- hclust(dist(mat))\nplot(clust)\n```\n\n![](vignettes/README_files/figure-gfm/unnamed-chunk-7-1.png)\u003c!-- --\u003e\n\n``` r\n# perform silhouette analysis for 2 to 10 different clusters\nsil_result \u003c- silhouette_analysis(mat, n_clusters = 2:10)\n\n# plot results\nprint(sil_result$plot_clusters, split = c(1,1,2,1), more = TRUE)\nprint(sil_result$plot_summary, split = c(2,1,2,1))\n```\n\n![](vignettes/README_files/figure-gfm/unnamed-chunk-7-2.png)\u003c!-- --\u003e\n\n### parse_kegg_brite\n\nParse KEGG BRITE XML files step-by-step. This script is a small utility\nto parse KEGG BRITE XML files and return a regular data frame instead.\nThe function takes no other argument than a data frame. Changes that need\nto be made to the KEGG XML file before applying the function,\ne.g. 
simply using a text editor:\n\n-   replace double spaces by tabs (`\\t`)\n-   remove the first lines until regular content begins\n-   possibly add some trailing tabs or commas to the end of the first line (4\n    to 5) so that `read.table` knows how many columns to expect\n-   read the raw data frame into R using\n    `read.table(\"/path/to/file\", fill = TRUE, sep = \"\\t\", row.names = NULL, stringsAsFactors = FALSE, quote = \"\")`\n\n## Growth models\n\n### baranyi_fun\n\nSimulate growth according to the Baranyi growth model.\n\n``` r\n# simulate growth according to the Baranyi growth model\n# for a growth period of 100 hours\nbiomass \u003c- baranyi_fun(\n  LOG10N0 = -1, LOG10Nmax = 1,\n  mumax = 0.1, lag = 10, t = 0:100)\n\n# plot time versus biomass\nplot(0:100, biomass)\n```\n\n![](vignettes/README_files/figure-gfm/unnamed-chunk-8-1.png)\u003c!-- --\u003e\n\n### gompertzm_fun\n\nSimulate growth according to the modified Gompertz growth model.\n\n``` r\n# simulate growth according to the modified Gompertz growth model\n# for a growth period of 100 hours\nbiomass \u003c- gompertzm_fun(\n  LOG10N0 = -1, LOG10Nmax = 1,\n  mumax = 0.1, lag = 10, t = 0:100)\n\n# plot time versus biomass\nplot(0:100, biomass)\n```\n\n![](vignettes/README_files/figure-gfm/unnamed-chunk-9-1.png)\u003c!-- --\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fm-jahn%2Fr-tools","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fm-jahn%2Fr-tools","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fm-jahn%2Fr-tools/lists"}