Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/tyjo/luminate
Denoising longitudinal microbiome data.
https://github.com/tyjo/luminate
Last synced: 11 days ago
JSON representation
Denoising longitudinal microbiome data.
- Host: GitHub
- URL: https://github.com/tyjo/luminate
- Owner: tyjo
- Created: 2019-10-31T22:03:24.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2020-11-14T16:24:00.000Z (almost 4 years ago)
- Last Synced: 2024-08-01T16:18:38.926Z (3 months ago)
- Language: Python
- Homepage:
- Size: 264 KB
- Stars: 7
- Watchers: 1
- Forks: 2
- Open Issues: 4
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- awesome-rm-omics - **LUMINATE** - Joseph - [Efficient and Accurate Inference of Mixed Microbial Population Trajectories from Longitudinal Count Data](https://doi.org/10.1016/j.cels.2020.05.006) (Software packages and methods / Stochastic/Probabilistic Modeling)
README
# LUMINATE
LUMINATE (longitudinal microbiome inference and zero detection) includes four programs for inference in longitudinal microbiome datasets:* `estimate`: estimate relative abundances and posterior probabilities of biological zeros.
* `train`: estimate model parameters for compositional Lotka-Volterra (cLV).
* `predict` (experimental): predict longitudinal trajectories one time point at a time.
* `plot`: generate a stacked bar plot of relative abundances and external effects## Dependencies
Program dependencies are listed in `requirements.txt`. Install dependencies with `pip install -r requirements.txt`.## Quick Start
See `./run-examples.sh` to run each program on an example dataset.## Input
Each program takes as input an OTU Table (taxa by samples) with the following structure:
* The first row gives IDs for each longitudinal sequence.
* The second row gives sample times.
* The first column gives taxon names.See `datasets/bucci2016mdsine/cdiff-counts.csv` for an example.
A second optional table providing external perturbations can be specified. This is a .csv file where the columns are `sequenceID,eventID,startDay,endDay,magnitude(optional)`
The first column links an event to a longitudinal sample through `sequenceID`. This should correspond to an ID in the OTU Table. The second column gives a name for each event: parameters are inferred for each unique event name.
The start day and stop day give a range of time over which an event occurs (end point included). For example, these could be a range of time when a patient receives antibiotics. For one time events, set `startDay=stopDay.`
The `magnitude` column is used to estimate the parameters of cLV. If unspecified, events are treated like indicator variables: set to 1 during the time of an event and 0 otherwise.
## Worked Example
Here we run through a typical workflow for LUMINATE using the data in `bucci2016mdsine`. First, let's take a look at the data by running```
python main.py plot \
datasets/bucci2016mdsine/cdiff-counts.csv \
-e datasets/bucci2016mdsine/cdiff-events.csv \
-i datasets/bucci2016mdsine \
-o datasets/bucci2016mdsine```
This will produce one stacked bar plot per sequence of relative abundances over time, along with indicators for external events. The top 19 taxa each receive a unique color, while the remaining taxa are aggregated in a single component.Here is an example, using the first sequence in the C. diff dataset:
![Example plot](./datasets/bucci2016mdsine/cdiff-counts-1.png)
### Estimate
Next, we want to estimate relative abundances from sequencing counts, and save the results:```
python main.py estimate \
datasets/bucci2016mdsine/cdiff-counts.csv \
-e datasets/bucci2016mdsine/cdiff-events.csv \
-o datasets/bucci2016mdsine
```The second position argument specifies the path to the OTU table. The optional argument `-e` (`--effects`) specifies a filepath to the .csv of external events. Finally `-o` (`--outdir`) gives a directory to store results for downstream analysis.
Running `estimate` produces 3 files:
* `cdiff-counts-est.csv` : the estimated relative abundances
* `cdiff-counts-nonzero-posterior-prob.csv`: posterior probabilities of sampling zeros (i.e. a taxon is below the detection threshold as opposed to absent from the community)
* `P.pkl` : a temporary result used for downstream analysisLet's plot the estimated relative abundances to ensure nothing went wrong:
```
python main.py plot \
datasets/bucci2016mdsine/cdiff-counts-est.csv \
-e datasets/bucci2016mdsine/cdiff-events.csv \
-i datasets/bucci2016mdsine \
-o datasets/bucci2016mdsine```
![Estimated relative abundances](./datasets/bucci2016mdsine/cdiff-counts-est-1.png)
#### Note on data sparsity
The information available to reconstruct a trajectory is proportional to the amount of nonmissing (nonzero) data. For ultra-sparse taxa (>90% missing entries), there is little information its trajectory over time. We recommend filtering out such taxa as a pre-processing step.### Train
**Note:** Estimates of cLV parameters with LUMINATE is considered experimental. To train model parameters using pseudo-counts instead of denoised estimates, pass the --use-pseudo-counts flag instead.Next, we want to estimate the parameters of cLV using the estimated relative abundances:
```
python main.py train \
datasets/bucci2016mdsine/cdiff-counts-est.csv \
-e datasets/bucci2016mdsine/cdiff-events.csv \
-i datasets/bucci2016mdsine \
-o datasets/bucci2016mdsine \
-b 200
```The `-i` (`--indir`) flag tells LUMINATE to look in this directory for estimated relative abundances (`P.pkl`). If this is not found, then `estimate` will run by default. The `-o` (`--outdir`) flag tells LUMINATE to save model parameters to this folder. Finally, the `-b` (`--bootstrap`) flag estimates one-sided p-values for model parameters.
This outputs several files
```
A : the matrix of taxa interactions
g : relative growth rate vector
B : effects of external perturbation
```
The rows of each matrix, `A, g, B`, are coordinates under the additive log ratio transformation. Bootstrap one-sided p-values are in `A_pval`, `g_pval`, `B_pval`. These are the estimated probabilities that the nonzero coefficients estimated on the training data have the same sign. For example, if `A_{ij} = 1` the computed p-value is `Prob(A_{ij} > 0)`. Note, the minimum possible p-value is `1/b` so p-values below this threshold will appear as zeros.### Predict (Experimental)
Finally, if we want to predict trajectories one step at a time we can call```
python main.py predict \
datasets/bucci2016mdsine/cdiff-counts-est.csv \
-e datasets/bucci2016mdsine/cdiff-events.csv \
-i datasets/bucci2016mdsine \
-o datasets/bucci2016mdsine \
--one-step
```This produces an OTU table of predicted trajectories. Each trajectory is predicted one time point in the future.
Plotting the results:
```
python main.py plot \
datasets/bucci2016mdsine/cdiff-counts-est-pred.csv \
-e datasets/bucci2016mdsine/cdiff-events.csv \
-i datasets/bucci2016mdsine \
-o datasets/bucci2016mdsine```
![Estimated relative abundances](./datasets/bucci2016mdsine/cdiff-counts-est-pred-1.png)
To make a prediction from initial conditions, omit the `--one-step` flag:
```
python main.py predict \
datasets/bucci2016mdsine/cdiff-counts-est.csv \
-e datasets/bucci2016mdsine/cdiff-events.csv \
-i datasets/bucci2016mdsine \
-o datasets/bucci2016mdsine
```## Program Options
```
usage: main.py [-h] [-e EVENTS] [-o OUTDIR] [-i INDIR] [-b BOOTSTRAP] [-s]
command otu-tableTime-series modeling for the microbiome
positional arguments:
command Specify analysis to run. One of:
train,predict,estimate.
otu-table Filepath to OTU table csv.optional arguments:
-h, --help show this help message and exit
-e EVENTS, --events EVENTS
Filepath to table of external events.
-o OUTDIR, --outdir OUTDIR
Specify output directory to store results. Default is
current directory.
-i INDIR, --indir INDIR
Specify input directory to load previously computed
parameters: typically the OUTDIR from a previous run.
-b BOOTSTRAP, --bootstrap BOOTSTRAP
Perform bootstrap resampling to estimate one-sided
p-values of cLV coefficients. The argument specifies
the number of bootstrap replicates to perform. Will
produce a warning if the sample size is too small
(N<30).
-s, --one-step Perform one-step prediction instead of prediction from
initial conditions.
-p, --use-pseudo-count
Estimate relative abundances using pseudo-counts instead of denoising step.
```## Code to Replicate Simulation Experiments
The directory ```experiments``` contains simulation code to regenerate simulated datasets used for evaluation in the manuscript. The script ```simulate_glv_from_data.py``` generates the simulations used the main model comparison. The script ```simulate_glv_with_zeros.py``` generates the simulations used in the biological zero evaluation. Simulated datasets are output to ```experiments/datasets```.