{"id":24527282,"url":"https://github.com/biobakery/gaplac","last_synced_at":"2025-03-15T16:28:24.093Z","repository":{"id":40402760,"uuid":"251672563","full_name":"biobakery/GaPLAC","owner":"biobakery","description":"Gaussian Process models for the microbial masses","archived":false,"fork":false,"pushed_at":"2023-04-11T22:58:12.000Z","size":5219,"stargazers_count":0,"open_issues_count":4,"forks_count":0,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-01-22T06:17:49.693Z","etag":null,"topics":["bayesian-inference","julia"],"latest_commit_sha":null,"homepage":"","language":"Julia","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/biobakery.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-03-31T17:01:17.000Z","updated_at":"2021-12-10T21:01:50.000Z","dependencies_parsed_at":"2022-08-09T19:30:22.963Z","dependency_job_id":null,"html_url":"https://github.com/biobakery/GaPLAC","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/biobakery%2FGaPLAC","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/biobakery%2FGaPLAC/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/biobakery%2FGaPLAC/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/biobakery%2FGaPLAC/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/biobakery","download_url":"https://codeload.github.com/biobakery/GaPLAC/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_co
unt":243757516,"owners_count":20343249,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bayesian-inference","julia"],"created_at":"2025-01-22T06:17:53.817Z","updated_at":"2025-03-15T16:28:24.062Z","avatar_url":"https://github.com/biobakery.png","language":"Julia","funding_links":[],"categories":[],"sub_categories":[],"readme":"\r\n# Guide\r\n\r\n[![Build Status](https://github.com/biobakery/gptool.jl/workflows/CI/badge.svg)](https://github.com/biobakery/GaPLAC/actions)\r\n[![Stable](https://img.shields.io/badge/docs-stable-blue.svg)](https://kescobo.github.io/GaPLAC/stable)\r\n[![Dev](https://img.shields.io/badge/docs-dev-blue.svg)](https://kescobo.github.io/GaPLAC/dev)\r\n\r\nThis guide provides an overview of the basic workflow using GaPLAC. A complete command reference, as well as available covariance and likelihood functions, are provided below.\r\n\r\n## Installation\r\n\r\n1. Install [Julia](https://julialang.org/).\r\n2. Download GaPLAC's repository and unpack it somewhere.\r\n3. Open a console in GaPLAC's root folder and run\r\n   ```\r\n   $ julia --project=@. -e 'using Pkg; Pkg.instantiate()'\r\n   ```\r\n\r\n   This will install the required packages and may take a few minutes.\r\n4. On Mac/UNIX, to use the `./gaplac ...` form, you may need to run `chmod u+x ./gaplac`.\r\n\r\n## Generating some sample data\r\n\r\nGaPLAC has five main commands to work with: `sample`, `mcmc`, `select`, `predict`, and `fitplot`. We will first look at `sample`, which draws a sample from a Gaussian Process. 
This can be helpful to visualize the kinds of functions described by a particular GP, or to provide some sample data to test later functions.\r\n\r\nRun the following command from the GaPLAC root folder:\r\n\r\n```\r\n./gaplac sample \"y :~| SqExp(:x; l=1)\" --at \"x=-5:0.1:5\" --plot gp_sample.png\r\n```\r\n\r\nThis may take a few minutes the first time, since Julia must compile all the packages. It will produce a large amount of output to the console, and should also produce a plot in `gp_sample.png` which resembles:\r\n\r\n![Wavy line](img/guide1.png)\r\n\r\nIf instead you get an error mentioning missing dependencies, it means that step 3 of the Installation section was not successfully completed. Try following the Installation instructions again to resolve this.\r\n\r\nLet's look at each of the pieces of the command:\r\n\r\n- `\"y :~| SqExp(:x; l=1)\"`: This is the GP formula, much like a model formula in R. In this case, the output (`y`) is modeled as a GP with a Squared-Exponential covariance function (`SqExp`) with a lengthscale (`l`) of `1`. Note also the `:` in `:~|`. Normally, a data likelihood (described later) can be specified between the `:` and the `~`, but here we don't specify anything, and the GP will be modeled _without_ a likelihood. This effectively allows us to more directly observe the types of dynamics modeled by the Gaussian Process described in the formula.\r\n- `--at \"x=-5:0.1:5\"`: This tells GaPLAC what values of `x` to sample the GP at.\r\n- `--plot gp_sample.png`: Write a plot of the sampled dynamics to this file.\r\n\r\nTry changing the lengthscale of the `SqExp` term. How does this affect the function? 
Try adding other components to the formula (the full list is at the end of this document), such as an Ornstein-Uhlenbeck process (`OU(:x; l=1)`), or simply some `Noise`.\r\n\r\nNow let's generate a smaller set of data at some randomly chosen `x` coordinates, and store the results in a file instead of printing to stdout:\r\n\r\n```\r\n./gaplac sample \"y :~| SqExp(:x; l=1.5)\" --at \"x = rand(Uniform(-5,5), 50)\" --output data.tsv\r\n```\r\n\r\nLook at the contents of `data.tsv`. It should contain two columns, `x` and `y`, and the rows are not sorted in any way. We will use this data for the next command.\r\n\r\n\r\n## Fitting parameters\r\n\r\nWe are usually interested in the parameters of the covariance function which best fit some data. This is accomplished with the `mcmc` command in GaPLAC, which will produce an [MCMC chain](https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo) of samples from the posterior distribution of the model parameters.\r\n\r\nTry running the following command:\r\n\r\n```\r\n./gaplac mcmc \"y ~| SqExp(:x)\" --data data.tsv --output mcmc.tsv --samples 500 --infer x\r\n```\r\n\r\nFirst, note that the model formula omits the additional `:` before the `~`. This will therefore default to a Gaussian likelihood.\r\n\r\nNow examine the output file `mcmc.tsv`. Instead of showing a relationship between `x` and `y`, as the previous run of `sample` did, this file should contain several columns of parameter values, as well as the all-important final column containing the log of the unnormalized posterior density for the sample (think of this as something like the goodness of fit). In particular, take a look at the covariance function parameter `ℓ`, which we set to `1.5` in the model formula used to generate the data, but which we did not pass to the `mcmc` command. 
If all worked well, the mean of this parameter should converge to, and hover around, the true value of `1.5`.\r\n\r\n## TODO: Add `predict`\r\n\r\n## Comparing models\r\n\r\n`mcmc` fits a single model, but how do we know it's the right model? Maybe another model would be a better fit? To answer this question, we can run `mcmc` again with the other model. In this case, let's test a different kind of time-varying process called an Ornstein-Uhlenbeck (OU) process:\r\n\r\n```\r\n./gaplac mcmc \"y ~| OU(:x)\" --data data.tsv --output mcmc_ou.tsv --samples 500\r\n```\r\n\r\nThis will give us a second set of model fit results in a new `mcmc_ou.tsv` file. Now we can use the goodness-of-fit values in each of the files to determine which of the models we believe (hint: we generated data from a Squared-Exponential covariance function, and thus we expect the OU process will perform worse). Let's test that with the `select` command:\r\n\r\n```\r\n./gaplac select --chains mcmc.tsv mcmc_ou.tsv\r\n```\r\n```\r\n┌ Info: Log2 Bayes: 8.405\r\n│\r\n│   •  Log(pdf) - model 1: -81.29118\r\n│     \r\n│\r\n│   •  Log(pdf) - model 2: -89.69639\r\n│      9.972996e-28\r\n│\r\n└ Note - Positive values indicate more evidence for model 1\r\n```\r\n\r\nThis will compare the log posterior values stored in each of the MCMC chains, and summarize them as a [Bayes Factor](https://en.wikipedia.org/wiki/Bayes_factor), which is reported in log2 scale. 
Here, positive log2 Bayes Factors indicate that the first model (in this case the Squared-Exponential) should be preferred, while negative numbers indicate the opposite - that the second model should be preferred.\r\n\r\nYou may also compare different formula parameters on your initial data,\r\nrather than using the outputs from MCMC.\r\n\r\n```\r\n./gaplac -v select --formulae \"y ~| SqExp(:x, l=2)\" \"y ~| SqExp(:x, l=1)\" --data data.tsv\r\n```\r\n```\r\n[ Info: running 'select'\r\n┌ Info:\r\n│ Dict{String, Any} with 4 entries:\r\n│   \"plot\" =\u003e nothing\r\n│   \"formulae\" =\u003e Any[\"y ~| SqExp(:x, l=1.5)\", \"y ~| OU(:x, l=1.5)\"]\r\n│   \"data\" =\u003e \"data.tsv\"\r\n└   \"chains\" =\u003e Any[]\r\n┌ Info: Log2 Bayes: 4.44\r\n│\r\n│   •  Log(pdf) - model 1: -31.53397005887427\r\n│\r\n│   •  Log(pdf) - model 2: -35.97395926954643\r\n│\r\n└ Note - Positive values indicate more evidence for model 1\r\n```\r\n\r\n\r\n# Command references\r\n\r\nAvailable by running the script with `--help`.\r\n\r\n## Commands\r\n\r\n```sh\r\n./gaplac --help\r\nusage: main.jl [-v] [-q] [--debug] [--log LOG] [-h]\r\n               {mcmc|predict|sample|fitplot|select}\r\n\r\ncommands:\r\n  mcmc           Run MCMC to optimize hyperparameters\r\n  predict        Calculate the posterior of a GP given data # not yet implemented\r\n  sample         Sample the posterior of a GP\r\n  fitplot        Diagnostic plots showing the posteriors of different\r\n                 components of the GP # not yet implemented\r\n  select         Output model selection parameters; requires --mcmc\r\n                 and --mcmc2\r\n\r\noptional arguments:\r\n  -v, --verbose  Log level to @info\r\n  -q, --quiet    Log level to @warning\r\n  --debug        Log level to @debug\r\n  --log LOG      Log to a file as well as stdout\r\n  -h, --help     show this help message and exit\r\n```\r\n\r\n## Sample\r\n\r\n\r\n```sh\r\n./gaplac sample --help\r\nusage: main.jl sample --at AT [--plot PLOT] 
[-o OUTPUT] [-h] formula\r\n\r\npositional arguments:\r\n  formula              GP formula specification\r\n\r\noptional arguments:\r\n  --at AT              Range to sample at, eg 'x=-5:0.1:5\r\n  --plot PLOT          File to plot to\r\n  -o, --output OUTPUT  Table output of GP sample - must end with\r\n                       '.csv' or '.tsv'\r\n  -h, --help           show this help message and exit\r\n```\r\n\r\n## MCMC\r\n\r\n```sh\r\n./gaplac mcmc --help\r\nusage: main.jl mcmc -i DATA --infer INFER [INFER...] [-o OUTPUT]\r\n                    [--plot PLOT] [-h] formula\r\n\r\npositional arguments:\r\n  formula               GP formula specification\r\n\r\noptional arguments:\r\n  -i, --data DATA       Table input on which to run inference. Must\r\n                        contain columns that correspond to values in\r\n                        'formula'\r\n  --infer INFER [INFER...]\r\n                        Which model hyperparameter to infer. Specify\r\n                        variable names, the hyperparameter(s) will be\r\n                        determined based on kernel type (eg length\r\n                        scale for SqExp)\r\n  -o, --output OUTPUT   Table to output sampling chain\r\n  --plot PLOT           File to plot to\r\n  -h, --help            show this help message and exit\r\n```\r\n\r\n## Select\r\n\r\n```sh\r\n./gaplac select --help\r\nusage: main.jl select [--formulae FORMULAE FORMULAE]\r\n                      [--chains CHAINS CHAINS] [-i DATA] [--plot PLOT]\r\n                      [-h]\r\n\r\noptional arguments:\r\n  --formulae FORMULAE FORMULAE\r\n                        Compare 2 GP formula specifications, requires\r\n                        '--data' as well. Result will be logpdf of\r\n                        formula 2 - logpdf of formula 1. 
A positive\r\n                        value indicates more evidence for formula 2.\r\n  --chains CHAINS CHAINS\r\n                        Compare 2 sampling chains from 'mcmc' command.\r\n                        Result will be the log2 bayes factor. A\r\n                        positive value indicates more evidence for\r\n                        chain 1.\r\n  -i, --data DATA       Table input on which to run inference. Must\r\n                        contain columns that correspond to values in\r\n                        both 'formulae'\r\n  --plot PLOT           File to plot to\r\n  -h, --help            show this help message and exit\r\n```\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbiobakery%2Fgaplac","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbiobakery%2Fgaplac","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbiobakery%2Fgaplac/lists"}