{"id":19644821,"url":"https://github.com/uscbiostats/geese","last_synced_at":"2025-02-26T23:42:36.839Z","repository":{"id":93429966,"uuid":"341327209","full_name":"USCbiostats/geese","owner":"USCbiostats","description":"GEne-functional Evolutionary model using SufficiEncy","archived":false,"fork":false,"pushed_at":"2023-10-17T21:01:09.000Z","size":3340,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-01-09T20:42:31.481Z","etag":null,"topics":["biostatistics","evolutionary-biology","gene-evolution","gene-functions","phylogenetics","rpackage","rstats"],"latest_commit_sha":null,"homepage":"https://uscbiostats.github.io/geese/","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/USCbiostats.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2021-02-22T20:24:04.000Z","updated_at":"2023-07-11T20:09:04.000Z","dependencies_parsed_at":"2023-10-05T07:38:00.765Z","dependency_job_id":"d911ddac-af7c-42e4-a9f8-5254c2299c4d","html_url":"https://github.com/USCbiostats/geese","commit_stats":{"total_commits":77,"total_committers":1,"mean_commits":77.0,"dds":0.0,"last_synced_commit":"3db7efbeab63b60c706d07c19db22c7c6482aa12"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/USCbiostats%2Fgeese","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/USCbiostats%2Fgeese/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/USCbiostats%2Fgeese/releases","manifests_url":"htt
ps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/USCbiostats%2Fgeese/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/USCbiostats","download_url":"https://codeload.github.com/USCbiostats/geese/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240952927,"owners_count":19884019,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["biostatistics","evolutionary-biology","gene-evolution","gene-functions","phylogenetics","rpackage","rstats"],"created_at":"2024-11-11T14:30:10.246Z","updated_at":"2025-02-26T23:42:36.813Z","avatar_url":"https://github.com/USCbiostats.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n\u003c!-- README.md is generated from README.Rmd. Please edit that file --\u003e\n\n\u003cdiv\u003e\n\n[![](https://raw.githubusercontent.com/USCbiostats/badges/master/tommy-image-badge.svg)](https://image.usc.edu)\n\nIntegrative Methods of Analysis for Genetic Epidemiology\n\n\u003c/div\u003e\n\n# geese: *GE*ne-functional *E*volution using *S*uffici*E*ncy \u003cimg src=\"man/figures/logo.svg\" align=\"right\" width=\"180px\"/\u003e\n\n\u003c!-- badges: start --\u003e\n\n[![R-CMD-check](https://github.com/USCbiostats/geese/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/USCbiostats/geese/actions/workflows/R-CMD-check.yaml)\n\u003c!-- badges: end --\u003e\n\nThis R package taps into statistical theory primarily developed in\nsocial networks. 
Using Exponential-Family Random Graph Models (ERGMs),\n`geese` provides a statistical framework for building Gene Functional\nEvolution Models using Sufficiency. For example, users can directly\nhypothesize whether Neofunctionalization or Subfunctionalization events\ntook place in a phylogeny, without having to estimate the full\nMarkov transition matrix that is usually required.\n\nGEESE is computationally efficient, with C++ under the hood, allowing\nthe analysis of either a single tree (a GEESE) or multiple trees\nsimultaneously (a pooled model) in a Flock.\n\nThis is a work in progress based on the theoretical work developed\nduring [George G. Vega Yon](https://ggv.cl)’s doctoral thesis.\n\n## Installation\n\n\u003c!-- You can install the released version of geese from [CRAN](https://CRAN.R-project.org) with: --\u003e\n\u003c!-- ``` r --\u003e\n\u003c!-- install.packages(\"geese\") --\u003e\n\u003c!-- ``` --\u003e\n\nYou can install the development version from [GitHub](https://github.com/) with:\n\n``` r\n# install.packages(\"devtools\")\ndevtools::install_github(\"USCbiostats/geese\")\n```\n\n# Examples\n\n## Simulating annotations (two different sets)\n\n``` r\nlibrary(geese)\n\n# Preparing data\nn \u003c- 100L\nannotations \u003c- replicate(n * 2 - 1, c(9, 9), simplify = FALSE)\n\n# Random tree\nset.seed(31)\ntree \u003c- aphylo::sim_tree(n)$edge - 1L\n\n# Sorting by the second column\ntree \u003c- tree[order(tree[, 2]), ]\n\nduplication \u003c- sample.int(\n  n = 2, size = n * 2 - 1, replace = TRUE, prob = c(.4, .6)\n  ) == 1\n\n# Reading the data in\namodel \u003c- new_geese(\n  annotations = annotations,\n  geneid = c(tree[, 2], n),\n  parent = c(tree[, 1], -1),\n  duplication = duplication\n  )\n\n# Preparing the model\nterm_gains(amodel, 0:1, duplication = 1)\nterm_loss(amodel, 0:1, duplication = 1)\nterm_gains(amodel, 0:1, duplication = 0)\nterm_loss(amodel, 0:1, duplication = 0)\nterm_maxfuns(amodel, 0, 1, duplication = 2)\ninit_model(amodel)\n#\u003e 
Initializing nodes in Geese (this could take a while)...\n#\u003e ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| done.\n\n# Testing\nparams \u003c- c(\n  # Gains dupl\n  2, 1.5,\n  # Loss dupl\n  -2, -1.5,\n  # Gains spe\n  -2, -1,\n  # Loss spe\n  -4, -4,\n  # Max funs\n  2, \n  # Root probabilities\n  -10, -10\n)\nnames(params) \u003c- c(\n  \"gain0 dupl\", \"gain1 dupl\",\n  \"loss0 dupl\", \"loss1 dupl\",\n  \"gain0 spe\", \"gain1 spe\",\n  \"loss0 spe\", \"loss1 spe\",\n  \"onefun\", \n  \"root0\", \"root1\"\n  )\n\nlikelihood(amodel, params*1) # Equals 1 because all annotations are missing\n#\u003e [1] 1\n\n# Simulating data\nfake1 \u003c- sim_geese(p = amodel, par = params, seed = 212)\nfake2 \u003c- sim_geese(p = amodel, par = params)\n\n# Removing interior node data\nis_interior \u003c- which(tree[,2] %in% tree[,1])\nis_leaf     \u003c- which(!tree[,2] %in% tree[,1])\n# for (i in is_interior) {\n#   fake1[[i]] \u003c- rep(9, 2)\n#   fake2[[i]] \u003c- rep(9, 2)\n# }\n```\n\nWe can now visualize either of the annotations using the\n[aphylo](https://github.com/USCbiostats/aphylo) package.\n\n``` r\nlibrary(aphylo)\n#\u003e Loading required package: ape\nap \u003c- aphylo_from_data_frame(\n  tree        = as.phylo(tree), \n  annotations = data.frame(\n    id = c(tree[, 2], n),\n    do.call(rbind, fake1)\n    )\n)\nplot(ap)\n```\n\n\u003cimg src=\"man/figures/README-viz-with-aphylo-1.png\"\nstyle=\"width:100.0%\" /\u003e\n\n## Model fitting MLE\n\n``` r\n# Creating the object\namodel \u003c- new_geese(\n  annotations = fake1,\n  geneid      = c(tree[, 2], n),\n  parent      = c(tree[, 1],-1),\n  duplication = duplication\n  )\n\n# Adding the model terms\nterm_gains(amodel, 0:1, duplication = 1)\nterm_loss(amodel, 0:1, duplication = 1)\nterm_gains(amodel, 0:1, duplication = 0)\nterm_loss(amodel, 0:1, duplication = 0)\nterm_maxfuns(amodel, 0, 1, duplication = 2)\ninit_model(amodel)\n#\u003e Initializing nodes in Geese (this could take a 
while)...\n#\u003e ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| done.\n\nprint(amodel)\n#\u003e GEESE\n#\u003e INFO ABOUT PHYLOGENY\n#\u003e # of functions           : 2\n#\u003e # of nodes [int; leaf]   : [99; 100]\n#\u003e # of ann. [zeros; ones]  : [83; 117]\n#\u003e # of events [dupl; spec] : [43; 56]\n#\u003e Largest polytomy         : 2\n#\u003e \n#\u003e INFO ABOUT THE SUPPORT\n#\u003e Num. of Arrays       : 396\n#\u003e Support size         : 8\n#\u003e Support size range   : [1, 1]\n#\u003e Transform. Fun.      : no\n#\u003e Model terms (9)    :\n#\u003e  - Gains 0 at duplication\n#\u003e  - Gains 1 at duplication\n#\u003e  - Loss 0 at duplication\n#\u003e  - Loss 1 at duplication\n#\u003e  - Gains 0 at speciation\n#\u003e  - Gains 1 at speciation\n#\u003e  - Loss 0 at speciation\n#\u003e  - Loss 1 at speciation\n#\u003e  - Genes with [0, 1] funs\n\n# Finding MLE\nans_mle \u003c- geese_mle(amodel, hessian = TRUE, ncores = 4)\nans_mle\n#\u003e $par\n#\u003e  [1]  2.327179  1.553591 -1.729575 -1.833682 -1.590516 -1.119200 -3.823851\n#\u003e  [8] -2.864298  1.982499 -1.465843  4.366549\n#\u003e \n#\u003e $value\n#\u003e [1] -109.7751\n#\u003e \n#\u003e $counts\n#\u003e function gradient \n#\u003e     1002       NA \n#\u003e \n#\u003e $convergence\n#\u003e [1] 1\n#\u003e \n#\u003e $message\n#\u003e NULL\n#\u003e \n#\u003e $hessian\n#\u003e               [,1]          [,2]          [,3]         [,4]          [,5]\n#\u003e  [1,] -4.206819071  0.5959524394  0.8862856191 -1.721987653 -1.503185e-01\n#\u003e  [2,]  0.595952439 -5.1501119636 -2.3668333888  2.589829846  2.739261e-02\n#\u003e  [3,]  0.886285619 -2.3668333888 -6.9892574608  1.273369396  9.894126e-03\n#\u003e  [4,] -1.721987653  2.5898298457  1.2733693957 -5.950797128 -3.604817e-02\n#\u003e  [5,] -0.150318497  0.0273926144  0.0098941264 -0.036048174 -1.372080e+00\n#\u003e  [6,]  0.020065546 -0.0867748664 -0.0605347044  0.373968106  4.557307e-02\n#\u003e  [7,]  0.238633328 
-0.0203662864 -0.2858568173  0.088855117 -5.867635e-02\n#\u003e  [8,] -0.169421696  0.5298915990  0.1330584389 -0.704884567  2.255319e-01\n#\u003e  [9,]  2.314439286  4.1601766227 -3.0270645492 -5.257577271  6.883251e-01\n#\u003e [10,] -0.020862576 -0.0004507292 -0.0234848407  0.008509284  1.834480e-02\n#\u003e [11,]  0.000175195 -0.0036292338 -0.0001882725 -0.003219606  2.817835e-05\n#\u003e                [,6]         [,7]          [,8]          [,9]         [,10]\n#\u003e  [1,]  0.0200655457  0.238633328 -0.1694216962  2.314439e+00 -2.086258e-02\n#\u003e  [2,] -0.0867748664 -0.020366286  0.5298915990  4.160177e+00 -4.507292e-04\n#\u003e  [3,] -0.0605347044 -0.285856817  0.1330584389 -3.027065e+00 -2.348484e-02\n#\u003e  [4,]  0.3739681063  0.088855117 -0.7048845667 -5.257577e+00  8.509284e-03\n#\u003e  [5,]  0.0455730742 -0.058676354  0.2255319007  6.883251e-01  1.834480e-02\n#\u003e  [6,] -1.7555584648  0.187628157  0.5698203367  1.306991e+00  2.208491e-04\n#\u003e  [7,]  0.1876281566 -1.111934470  0.0777368676 -1.058568e+00 -1.325888e-02\n#\u003e  [8,]  0.5698203367  0.077736868 -2.5204264773 -2.774906e+00  7.558960e-03\n#\u003e  [9,]  1.3069908000 -1.058567779 -2.7749056741 -1.941377e+01 -1.233878e-02\n#\u003e [10,]  0.0002208491 -0.013258884  0.0075589597 -1.233878e-02 -6.093654e-03\n#\u003e [11,] -0.0005919283 -0.000109976  0.0001258655  4.454019e-04 -3.267786e-05\n#\u003e               [,11]\n#\u003e  [1,]  1.751950e-04\n#\u003e  [2,] -3.629234e-03\n#\u003e  [3,] -1.882725e-04\n#\u003e  [4,] -3.219606e-03\n#\u003e  [5,]  2.817835e-05\n#\u003e  [6,] -5.919283e-04\n#\u003e  [7,] -1.099760e-04\n#\u003e  [8,]  1.258655e-04\n#\u003e  [9,]  4.454019e-04\n#\u003e [10,] -3.267786e-05\n#\u003e [11,] -9.352519e-04\n\n# Prob of each gene gaining a single function\ntransition_prob(\n  amodel,\n  params = rep(0, nterms(amodel) - nfuns(amodel)), \n  duplication = TRUE, state = c(FALSE, FALSE),\n  array = matrix(c(1, 0, 0, 1), ncol=2)\n)\n#\u003e [1] 0.0625\n```\n\n## 
Model fitting MCMC\n\n``` r\nset.seed(122)\nans_mcmc \u003c- geese_mcmc(\n  amodel,\n  nsteps  = 20000,\n  kernel  = fmcmc::kernel_ram(warmup = 5000), \n  prior   = function(p) c(\n      dlogis(\n        p,\n        scale = 4,\n        location = c(\n          rep(0, nterms(amodel) - nfuns(amodel)),\n          rep(-5, nfuns(amodel))\n          ),\n        log = TRUE\n        )\n  ), ncores = 2L)\n```\n\nWe can take a look at the results like this:\n\n\u003cimg src=\"man/figures/README-mcmc-analysis-1.png\"\nstyle=\"width:100.0%\" /\u003e\n\n\u003cimg src=\"man/figures/README-mcmc-analysis-2.png\"\nstyle=\"width:100.0%\" /\u003e\n\n    #\u003e \n    #\u003e Iterations = 15000:20000\n    #\u003e Thinning interval = 1 \n    #\u003e Number of chains = 1 \n    #\u003e Sample size per chain = 5001 \n    #\u003e \n    #\u003e 1. Empirical mean and standard deviation for each variable,\n    #\u003e    plus standard error of the mean:\n    #\u003e \n    #\u003e                            Mean     SD Naive SE Time-series SE\n    #\u003e Gains 0 at duplication   2.9015 0.8051 0.011385        0.09034\n    #\u003e Gains 1 at duplication   1.6914 0.5653 0.007994        0.04934\n    #\u003e Loss 0 at duplication   -2.0287 0.5349 0.007563        0.05280\n    #\u003e Loss 1 at duplication   -1.8866 0.6442 0.009110        0.08533\n    #\u003e Gains 0 at speciation  -12.1932 3.5435 0.050107        1.15176\n    #\u003e Gains 1 at speciation   -0.1454 0.6609 0.009345        0.06815\n    #\u003e Loss 0 at speciation    -2.9909 0.5184 0.007331        0.04458\n    #\u003e Loss 1 at speciation    -5.1655 1.9408 0.027444        0.39515\n    #\u003e Genes with [0, 1] funs   2.2578 0.4569 0.006461        0.06265\n    #\u003e Root 1                  -1.0470 3.0807 0.043564        0.94842\n    #\u003e Root 2                  -4.2756 4.2474 0.060061        1.59284\n    #\u003e \n    #\u003e 2. 
Quantiles for each variable:\n    #\u003e \n    #\u003e                            2.5%      25%      50%      75%   97.5%\n    #\u003e Gains 0 at duplication   1.4054   2.3030   2.8777   3.4337  4.5624\n    #\u003e Gains 1 at duplication   0.5451   1.3327   1.7001   2.0905  2.7559\n    #\u003e Loss 0 at duplication   -3.0657  -2.3764  -2.0460  -1.6762 -0.9765\n    #\u003e Loss 1 at duplication   -3.1944  -2.3389  -1.8797  -1.4119 -0.6868\n    #\u003e Gains 0 at speciation  -18.2113 -14.9130 -12.1597 -10.1648 -3.6030\n    #\u003e Gains 1 at speciation   -1.5472  -0.5998  -0.1365   0.3416  1.0736\n    #\u003e Loss 0 at speciation    -4.0181  -3.3470  -2.9738  -2.6539 -2.0354\n    #\u003e Loss 1 at speciation    -9.4815  -6.5157  -4.8115  -3.6121 -2.3045\n    #\u003e Genes with [0, 1] funs   1.4263   1.9483   2.2481   2.5599  3.2238\n    #\u003e Root 1                  -5.9435  -3.5719  -1.4757   1.4858  4.7924\n    #\u003e Root 2                 -14.2253  -5.9892  -3.8179  -1.5920  3.3555\n\n``` r\npar_estimates \u003c- colMeans(\n  window(ans_mcmc, start = end(ans_mcmc)*3/4)\n  )\nans_pred \u003c- predict_geese(\n  amodel, par_estimates,\n  leave_one_out = TRUE,\n  only_annotated = TRUE\n  ) |\u003e do.call(what = \"rbind\")\n\n# Preparing annotations\nann_obs \u003c- do.call(rbind, fake1)\n\n# AUC\n(ans \u003c- prediction_score(ans_pred, ann_obs))\n#\u003e Prediction score (H0: Observed = Random)\n#\u003e \n#\u003e  N obs.      
: 199\n#\u003e  alpha(0, 1) : 0.40, 0.60\n#\u003e  Observed    : 0.68 ***\n#\u003e  Random      : 0.52 \n#\u003e  P(\u003ct)       : 0.0000\n#\u003e --------------------------------------------------------------------------------\n#\u003e Values scaled to range between 0 and 1, 1 being best.\n#\u003e \n#\u003e Significance levels: *** p \u003c .01, ** p \u003c .05, * p \u003c .10\n#\u003e AUC 0.80.\n#\u003e MAE 0.32.\n\nplot(ans$auc, xlim = c(0,1), ylim = c(0,1))\n```\n\n\u003cimg src=\"man/figures/README-prediction-1.png\" style=\"width:100.0%\" /\u003e\n\n## Using a flock\n\nGEESE models can be grouped (pooled) into a flock.\n\n``` r\nflock \u003c- new_flock()\n\n# Adding first set of annotations\nadd_geese(\n  flock,\n  annotations = fake1,\n  geneid      = c(tree[, 2], n),\n  parent      = c(tree[, 1],-1),\n  duplication = duplication  \n)\n\n# Now the second set\nadd_geese(\n  flock,\n  annotations = fake2,\n  geneid      = c(tree[, 2], n),\n  parent      = c(tree[, 1],-1),\n  duplication = duplication  \n)\n\n# Persistence to preserve parent state\nterm_gains(flock, 0:1, duplication = 1)\nterm_loss(flock, 0:1, duplication = 1)\nterm_gains(flock, 0:1, duplication = 0)\nterm_loss(flock, 0:1, duplication = 0)\nterm_maxfuns(flock, 0, 1, duplication = 2)\n\n\n# We need to initialize to do all the accounting\ninit_model(flock)\n#\u003e Initializing nodes in Flock (this could take a while)...\n#\u003e ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| done.\n\nprint(flock)\n#\u003e FLOCK (GROUP OF GEESE)\n#\u003e INFO ABOUT THE PHYLOGENIES\n#\u003e # of phylogenies         : 2\n#\u003e # of functions           : 2\n#\u003e # of ann. [zeros; ones]  : [165; 235]\n#\u003e # of events [dupl; spec] : [86; 112]\n#\u003e Largest polytomy         : 2\n#\u003e \n#\u003e INFO ABOUT THE SUPPORT\n#\u003e Num. of Arrays       : 792\n#\u003e Support size         : 8\n#\u003e Support size range   : [1, 1]\n#\u003e Transform. Fun.      
: no\n#\u003e Model terms (9)    :\n#\u003e  - Gains 0 at duplication\n#\u003e  - Gains 1 at duplication\n#\u003e  - Loss 0 at duplication\n#\u003e  - Loss 1 at duplication\n#\u003e  - Gains 0 at speciation\n#\u003e  - Gains 1 at speciation\n#\u003e  - Loss 0 at speciation\n#\u003e  - Loss 1 at speciation\n#\u003e  - Genes with [0, 1] funs\n```\n\nThe same function, `geese_mcmc()`, fits the flock with MCMC:\n\n``` r\nset.seed(122)\nans_mcmc2 \u003c- geese_mcmc(\n  flock,\n  nsteps  = 20000,\n  kernel  = fmcmc::kernel_ram(warmup = 2000), \n  prior   = function(p) dlogis(p, scale = 2, log = TRUE),\n  ncores  = 2\n  )\n```\n\n``` r\nop \u003c- par(\n  mfrow = c(4, 2), #tcl=.5,\n  las=1, mar = c(3,3,1,0),\n  bty = \"n\", oma = rep(1,4)\n  )\nfor (i in 1:ncol(ans_mcmc2)) {\n  tmpx \u003c- window(ans_mcmc2, start = 10000)[,i,drop=FALSE]\n  \n  coda::traceplot(\n    tmpx, smooth = FALSE, ylim = c(-11,11), col = rgb(0, 128, 128, maxColorValue = 255), \n    main = names(params)[i]\n    )\n  abline(h = params[i], lty=3, lwd=4, col = \"red\")\n}\n```\n\n\u003cimg src=\"man/figures/README-viz-flock-1.png\" style=\"width:100.0%\" /\u003e\n\n``` r\npar(op)\n```\n\n\u003cimg src=\"man/figures/README-viz-flock-2.png\" style=\"width:100.0%\" /\u003e\n\n``` r\nsummary(window(ans_mcmc2, start = 10000))\n#\u003e \n#\u003e Iterations = 10000:20000\n#\u003e Thinning interval = 1 \n#\u003e Number of chains = 1 \n#\u003e Sample size per chain = 10001 \n#\u003e \n#\u003e 1. 
Empirical mean and standard deviation for each variable,\n#\u003e    plus standard error of the mean:\n#\u003e \n#\u003e                            Mean     SD Naive SE Time-series SE\n#\u003e Gains 0 at duplication  2.39204 0.4707 0.004707        0.03019\n#\u003e Gains 1 at duplication  1.85804 0.4925 0.004925        0.02789\n#\u003e Loss 0 at duplication  -2.15114 0.4451 0.004451        0.03310\n#\u003e Loss 1 at duplication  -1.50477 0.4427 0.004427        0.03176\n#\u003e Gains 0 at speciation  -4.10744 2.9954 0.029952        0.76564\n#\u003e Gains 1 at speciation  -0.84969 0.8242 0.008241        0.09520\n#\u003e Loss 0 at speciation   -3.16554 0.6535 0.006535        0.05307\n#\u003e Loss 1 at speciation   -4.88115 2.0161 0.020160        0.32971\n#\u003e Genes with [0, 1] funs  2.09933 0.3703 0.003702        0.02921\n#\u003e Root 1                  0.02501 2.6487 0.026486        0.45210\n#\u003e Root 2                 -1.07238 2.9197 0.029195        0.56841\n#\u003e \n#\u003e 2. Quantiles for each variable:\n#\u003e \n#\u003e                            2.5%    25%      50%     75%   97.5%\n#\u003e Gains 0 at duplication   1.5050  2.068  2.37614  2.7239  3.3368\n#\u003e Gains 1 at duplication   0.9237  1.511  1.84256  2.2029  2.8299\n#\u003e Loss 0 at duplication   -3.0413 -2.451 -2.14564 -1.8533 -1.2836\n#\u003e Loss 1 at duplication   -2.3961 -1.809 -1.51894 -1.1984 -0.6178\n#\u003e Gains 0 at speciation  -11.2547 -5.414 -2.91312 -1.9486 -0.9131\n#\u003e Gains 1 at speciation   -3.2320 -1.183 -0.72227 -0.3283  0.3280\n#\u003e Loss 0 at speciation    -4.7209 -3.510 -3.08984 -2.7347 -2.0557\n#\u003e Loss 1 at speciation   -10.5227 -5.326 -4.19469 -3.5823 -2.7532\n#\u003e Genes with [0, 1] funs   1.3738  1.842  2.07762  2.3515  2.8303\n#\u003e Root 1                  -4.7967 -1.873 -0.04377  1.5864  6.0565\n#\u003e Root 2                  -6.5355 -3.147 -1.08668  1.1586  4.6030\n```\n\nAre we doing better in AUCs?\n\n``` r\npar_estimates \u003c- colMeans(\n  
window(ans_mcmc2, start = end(ans_mcmc2)*3/4)\n  )\n\nans_pred \u003c- predict_flock(\n  flock, par_estimates,\n  leave_one_out = TRUE,\n  only_annotated = TRUE\n  ) |\u003e\n  lapply(do.call, what = \"rbind\") |\u003e\n  do.call(what = rbind)\n\n# Preparing annotations\nann_obs \u003c- rbind(\n  do.call(rbind, fake1),\n  do.call(rbind, fake2)\n)\n\n# AUC\n(ans \u003c- prediction_score(ans_pred, ann_obs))\n#\u003e Prediction score (H0: Observed = Random)\n#\u003e \n#\u003e  N obs.      : 398\n#\u003e  alpha(0, 1) : 0.42, 0.58\n#\u003e  Observed    : 0.72 ***\n#\u003e  Random      : 0.51 \n#\u003e  P(\u003ct)       : 0.0000\n#\u003e --------------------------------------------------------------------------------\n#\u003e Values scaled to range between 0 and 1, 1 being best.\n#\u003e \n#\u003e Significance levels: *** p \u003c .01, ** p \u003c .05, * p \u003c .10\n#\u003e AUC 0.86.\n#\u003e MAE 0.28.\nplot(ans$auc)\n```\n\n\u003cimg src=\"man/figures/README-unnamed-chunk-2-1.png\"\nstyle=\"width:100.0%\" /\u003e\n\n## Limiting the support\n\nIn this example, we use the function `rule_limit_changes()` to apply a\nconstraint to the support of the model. 
This takes the sixth term (position 5, since\nindexing is zero-based in C++), the overall-changes term, and restricts\nthe support to states with between 0 and 2 changes.\n\nThis should be useful when dealing with multiple functions or\n[polytomies](https://en.wikipedia.org/wiki/Polytomy).\n\n``` r\n# Creating the object\namodel_limited \u003c- new_geese(\n  annotations = fake1,\n  geneid      = c(tree[, 2], n),\n  parent      = c(tree[, 1],-1),\n  duplication = duplication\n  )\n\n# Adding the model terms\nterm_gains(amodel_limited, 0:1)\nterm_loss(amodel_limited, 0:1)\nterm_maxfuns(amodel_limited, 1, 1)\nterm_overall_changes(amodel_limited, TRUE)\n\n# Between 0 and 2 overall changes (term at position 5)\nrule_limit_changes(amodel_limited, 5, 0, 2)\n\n# We need to initialize to do all the accounting\ninit_model(amodel_limited)\n#\u003e Initializing nodes in Geese (this could take a while)...\n#\u003e ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| done.\n\n# Does limiting the support help?\nsupport_size(amodel_limited)\n#\u003e [1] 31\n```\n\nSince we added the constraint based on the term\n`term_overall_changes()`, we now need to fix its parameter at 0 (i.e.,\nno effect) when fitting the model with MCMC:\n\n``` r\nset.seed(122)\nans_mcmc2 \u003c- geese_mcmc(\n  amodel_limited,\n  nsteps  = 20000,\n  kernel  = fmcmc::kernel_ram(\n    warmup = 2000,\n    fixed  = c(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE)\n    ), \n  prior   = function(p) dlogis(p, scale = 2, log = TRUE)\n  )\n```\n\n\u003cimg src=\"man/figures/README-mcmc-analysis-limited-1.png\"\nstyle=\"width:100.0%\" /\u003e\n\n    #\u003e \n    #\u003e Iterations = 15000:20000\n    #\u003e Thinning interval = 1 \n    #\u003e Number of chains = 1 \n    #\u003e Sample size per chain = 5001 \n    #\u003e \n    #\u003e 1. 
Empirical mean and standard deviation for each variable,\n    #\u003e    plus standard error of the mean:\n    #\u003e \n    #\u003e                                           Mean     SD Naive SE Time-series SE\n    #\u003e Gains 0 at duplication                 1.06329 0.8555 0.012097        0.06474\n    #\u003e Gains 1 at duplication                 1.00857 0.7727 0.010927        0.04945\n    #\u003e Loss 0 at duplication                 -1.44630 0.7529 0.010647        0.05664\n    #\u003e Loss 1 at duplication                 -0.65287 0.7342 0.010383        0.04529\n    #\u003e Genes with [1, 1] funs at duplication  1.04183 0.3736 0.005283        0.02301\n    #\u003e Overall changes at duplication         0.00000 0.0000 0.000000        0.00000\n    #\u003e Root 1                                -0.05519 3.1452 0.044476        0.35121\n    #\u003e Root 2                                -0.20215 3.2415 0.045837        0.41755\n    #\u003e \n    #\u003e 2. Quantiles for each variable:\n    #\u003e \n    #\u003e                                          2.5%     25%      50%     75%    97.5%\n    #\u003e Gains 0 at duplication                -0.5104  0.5096  1.07974  1.5870  2.75348\n    #\u003e Gains 1 at duplication                -0.3511  0.4883  0.97593  1.4741  2.72087\n    #\u003e Loss 0 at duplication                 -3.0046 -1.9420 -1.39766 -0.9289 -0.05484\n    #\u003e Loss 1 at duplication                 -2.0463 -1.1631 -0.65509 -0.2187  0.87313\n    #\u003e Genes with [1, 1] funs at duplication  0.3743  0.7911  1.01242  1.2674  1.88310\n    #\u003e Overall changes at duplication         0.0000  0.0000  0.00000  0.0000  0.00000\n    #\u003e Root 1                                -6.4868 -2.1595  0.08435  2.1941  5.72248\n    #\u003e Root 2                                -6.6845 -2.0668 -0.14747  1.7791  6.08394\n\n# Code of Conduct\n\nPlease note that the aphylo2 project is released with a [Contributor\nCode 
of\nConduct](https://contributor-covenant.org/version/2/0/CODE_OF_CONDUCT.html).\nBy contributing to this project, you agree to abide by its terms.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fuscbiostats%2Fgeese","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fuscbiostats%2Fgeese","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fuscbiostats%2Fgeese/lists"}