https://github.com/brentp/clustermodel
fitting models to clustered correlated data
https://github.com/brentp/clustermodel
Last synced: 4 months ago
JSON representation
fitting models to clustered correlated data
- Host: GitHub
- URL: https://github.com/brentp/clustermodel
- Owner: brentp
- License: other
- Created: 2013-10-18T21:15:02.000Z (over 11 years ago)
- Default Branch: master
- Last Pushed: 2014-11-14T19:04:37.000Z (over 10 years ago)
- Last Synced: 2025-02-02T04:41:12.020Z (4 months ago)
- Language: Python
- Size: 3.23 MB
- Stars: 5
- Watchers: 5
- Forks: 1
- Open Issues: 4
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
clustermodel
============fitting models to clustered correlated data.
The Problem
===========You have methylation data (or other locally correlated data) with small,
subtle changes that can not be detected at the single-CpG level and you
want to detect regional changes. Here is an example of the subtle changes
across a region with disease in green vs control in blue (image created
by this software):
The Solution(s)
===============There are a number of methods for **combining** probes. The difficulty is that
we can't use a simple linear model because we have **correlated** data. The most
obvious approaches are:1) **bumphuting** from Irizarry's group: find regions where a covariate from a
model is consistently higher than by chance. In general "chance" is determined
by shuffling the residuals of a reduced model, and re-calculating simulated
betas. (they also implement smoothing and a number of nice additional features)2) **mixed-effect models**: fit a random intercept per CpG or per sample to
account for correlated data.3) **GEE models**: use generalized estimating equations. This allows us,
for example to fit an auto-regressive correlation structure, such as we
might expect to see between adjacent probes. This was implemented nicely
in `A-clust` (for exchange-able correlation structure only).4) **stouffer-liptak** calculate the p-values for adjacent probes
independently then combine the p-values, taking the correlation
among the probes into account.This package attempts to implement (some form of) all of those so that we can
model our data with any one of them with a small change of parameters. To do
so, it first **clusters the data** using `aclust`.Tutorial
========For the command-line, we assume the data is set up so that we have a file of
covariates with rows of samples and a file of methylation data that is of shape
**n_probes X n_samples**.For example, we may wish to use the stouffer-liptak correction on our data in
this repository to find regions where methylation is related to disease. We use the
model in R syntax: *methylation ~ disease* where the first covariate on the RHS is
always our covariate of interest (where we get the p-value).```Shell
python -m clustermodel 'methylation ~ disease' \
clustermodel/tests/example-covariates.txt \
clustermodel/tests/example-methylation.txt.gz \
--combine liptak \
--min-cluster-size 3 | head
```The output from that looks like this:
#chrom start end coef p n_probes model method
chr1 2043760 2043853 0.0560277777778 0.576888128824 3 methylation ~ disease liptak
chr1 2058229 2059086 -0.0795509259259 0.155520943237 6 methylation ~ disease liptak
chr1 2063798 2064441 -0.042287037037 0.508509656661 6 methylation ~ disease liptak
chr1 2066005 2066446 -0.0178055555556 0.102638244865 4 methylation ~ disease liptak
chr1 2081983 2082349 -0.0995648148148 0.103685852362 3 methylation ~ disease liptak
chr1 2084318 2084595 -0.0962291666667 0.119301491347 4 methylation ~ disease liptak
chr1 2100231 2100435 -0.0035 0.222498206941 3 methylation ~ disease liptak
chr1 2106221 2106620 -0.0490555555556 0.424566020253 4 methylation ~ disease liptak
chr1 2117424 2118131 -0.029962962963 0.421006269346 3 methylation ~ disease liptakwhere the `p` column is for the combined probes. We can repeat the same using a mixed model with
a random intercept by `CpG` (this covariate becomes available automatically if your data
follows our conventions (see example data for details).```Shell
python -m clustermodel 'methylation ~ disease + (1|CpG)' \
clustermodel/tests/example-covariates.txt \
clustermodel/tests/example-methylation.txt.gz \
--min-cluster-size 3 | head#chrom start end coef p n\_probes model method
chr1 2043760 2043853 0.0560277777778 0.385222268513 3 methylation ~ disease + (1|CpG) mixed-model
chr1 2058229 2059086 -0.0795509259259 0.018857108935 6 methylation ~ disease + (1|CpG) mixed-model
chr1 2063798 2064441 -0.042287037037 0.157777826794 6 methylation ~ disease + (1|CpG) mixed-model
chr1 2066005 2066446 -0.0178055555556 0.370739004967 4 methylation ~ disease + (1|CpG) mixed-model
chr1 2081983 2082349 -0.0995648148148 0.0460294040919 3 methylation ~ disease + (1|CpG) mixed-model
chr1 2084318 2084595 -0.0962291666667 0.0115980612503 4 methylation ~ disease + (1|CpG) mixed-model
chr1 2100231 2100435 -0.0035 0.8958980765 3 methylation ~ disease + (1|CpG) mixed-model
chr1 2106221 2106620 -0.0490555555556 0.14617070189 4 methylation ~ disease + (1|CpG) mixed-model
chr1 2117424 2118131 -0.029962962963 0.195491513628 3 methylation ~ disease + (1|CpG) mixed-model
```Note that we change the model to add the random intercept in `lme4` syntax.
For most cases, we have similar p-values (and identical coefficients) as we did
for the liptak method.clustering
----------If we run the same model with different clustering parameters, we may get different results:
```Shell
python -m clustermodel 'methylation ~ disease + (1|CpG)' \
clustermodel/tests/example-covariates.txt \
clustermodel/tests/example-methylation.txt.gz \
--min-cluster-size 2 \
--linkage single \
--rho-min 0.2 | head#chrom start end coef p n\_probes model method
chr1 2043438 2043853 0.0254555555556 0.536712997985 5 methylation ~ disease + (1|CpG) mixed-model
chr1 2046505 2046928 0.0907055555556 9.47135696294e-06 5 methylation ~ disease + (1|CpG) mixed-model
chr1 2058229 2059086 -0.0795509259259 0.018857108935 6 methylation ~ disease + (1|CpG) mixed-model
chr1 2060013 2060642 -0.0391574074074 0.165767587931 3 methylation ~ disease + (1|CpG) mixed-model
chr1 2063798 2064441 -0.042287037037 0.157777826794 6 methylation ~ disease + (1|CpG) mixed-model
chr1 2064764 2064819 0.0565277777778 0.107191645281 2 methylation ~ disease + (1|CpG) mixed-model
chr1 2065241 2065263 -0.0539583333333 0.163734202808 2 methylation ~ disease + (1|CpG) mixed-model
chr1 2066005 2066981 -0.0329768518519 0.0612285551598 6 methylation ~ disease + (1|CpG) mixed-model
chr1 2081983 2082522 -0.0603722222222 0.149962950932 5 methylation ~ disease + (1|CpG) mixed-model
```Existing Regions
================
We may have a list of regions from one study to compare to another study. We
can do this by specifying the --regions arg that gives a BED file of regions
to test. In this case, the clustering is not performed (so clustering
parameters are not used. Probes that fall into each region in the
--regions argument will be assumed as a cluster and the model(s) will be run
for that cluster.
Here is an example call:python -m clustermodel \
'methylation ~ disease + (1|CpG)' \
clustermodel/tests/example-covariates.txt \
clustermodel/tests/example-methylation.txt.gz \
--regions r.bedNote that the arguments are the same except for the --regions argument
giving the regions to test. This method can also work for e *X* pression.Assumptions
===========If you use the python as a module rather than from the command-line, these
assumptions do not need to be met, you'll just have to do some programming.The command-line assumes that your methylation data looks like this:
```Shell
$ zless clustermodel/tests/example-methylation.txt.gz | cut -f 1-6 | headprobe TF0 FF1 TM2 TF3 FM4
chr1:2041228 3.096 3.678 3.032 3.186 3.565
chr1:2041764 2.623 2.357 2.466 2.436 2.427
chr1:2043037 2.364 2.559 2.508 2.451 2.744
chr1:2043439 2.568 3.142 3.102 2.954 2.627
chr1:2043450 3.310 3.367 3.374 3.977 3.298
chr1:2043761 1.616 1.503 1.880 1.567 0.674
chr1:2043799 3.537 2.968 4.254 3.319 1.197
chr1:2043853 2.804 3.118 3.298 2.981 1.415
chr1:2045413 2.883 2.999 3.199 3.136 3.246
```
where rows are probes and columns are samples. The first column
should be in the form chrom:position and *must be sorted*
The columns of the methylation matrix must match the rows of the
covariates (here: `TF0`, `FF1`, `TM2`, `TF3`, `FM4`, ...):```Shell
$ head -6 clustermodel/tests/example-covariates.txtsample_id disease gender anumber
TF0 T F 0
FF1 F F 1
TM2 T M 2
TF3 T F 3
FM4 F M 4
```and the first column of the covariates file must match the first row
of the methylation data. (this should be pretty standard).
Any columns in the covariates file can be specified in the model.So main points:
1. data must be sorted
2. methylation is n_probes * n_samples (+1 for index)
a. first column is index of chrom:postion
b. column headers are sample-ids matching first column from covariates
3. covariates is n_samples * n_covariates (can have extra covariates even if
not used in model.
a. first column must match first row from methylation fileMethylation-eQTR
================
Methylation::expression Quantitative Trait Region. Generally, we perform a SNP-eQTL with
a model like:expression ~ genotype + age + gender ...
To find how gene expression is affected by genotype. We are always interested
in how methylation affects expression, but we often *lack power* to test each
*single methylation* probe against many expression probes. This software
enables testing the relation between expression and methylation in the context
of a mixed-effects model (or GEE, or liptak, or bumphunting).
The syntax looks like::python -m clustermodel \
'methylation ~ disease + gender + (1|id) + (1|CpG)' \
clustermodel/tests/example-covariates.txt \
clustermodel/tests/example-methylation.txt.gz \
--X clustermodel/tests/example-expression.txt.gz \
--X-locs clustermodel/tests/example-expression-probe-locs.bed.gz \
--X-dist 150000where the first 3 arguments are as before: the model, the covariates, and the
methylation data. The last 3 arguments are the expression info. The first
is the expression matrix with samples that match the covariates and the
methylation matrix. --X-locs is a BED file giving the location of each probe
and --X-dist gives the maximum distance between a methylation and expression
probe to be tested. --X-locs and --X-dist are optional, but performing all
expression::methylation comparisons is likely going to take a long time,
despite the automatic parallelization.The first few lines of output should look like::
#chrom start end coef p n_probes model method Xstart Xend Xstrand distance
chr1 2043438 2043450 -0.05407 0.6498 2 methylation ~ A_24_P49214 + disease + gender + (1 | id) + (1 | CpG) mixed-model 2116522 2116463 - 73072
chr1 2043438 2043450 -0.01590 0.7547 2 methylation ~ A_33_P3410123 + disease + gender + (1 | id) + (1 | CpG) mixed-model 1920067 1920008 - -123430
chr1 2043438 2043450 -0.01726 0.8491 2 methylation ~ A_23_P51187 + disease + gender + (1 | id) + (1 | CpG) mixed-model 2116703 2116762 + -73253
chr1 2043438 2043450 -0.12156 0.0962 2 methylation ~ A_33_P3359344 + disease + gender + (1 | id) + (1 | CpG) mixed-model 2125138 2125079 - 81688
chr1 2043438 2043450 0.030371 0.3701 2 methylation ~ A_33_P3410121 + disease + gender + (1 | id) + (1 | CpG) mixed-model 1915812 1915753 - -127685
chr1 2043438 2043450 -0.04481 0.2563 2 methylation ~ A_33_P3359354 + disease + gender + (1 | id) + (1 | CpG) mixed-model 2116762 2116703 - 73312
chr1 2043760 2043853 -0.05612 0.8589 3 methylation ~ A_24_P49214 + disease + gender + (1 | id) + (1 | CpG) mixed-model 2116522 2116463 - 72669
chr1 2043760 2043853 0.064542 0.6318 3 methylation ~ A_33_P3410123 + disease + gender + (1 | id) + (1 | CpG) mixed-model 1920067 1920008 - -123752
chr1 2043760 2043853 0.201022 0.4006 3 methylation ~ A_23_P51187 + disease + gender + (1 | id) + (1 | CpG) mixed-model 2116703 2116762 + -72850
Note that each probe is automatically inserted into the model. The distance
column is negative if the methylation region is upstream of the expression
probe and positive otherwise.See the example data in `clustermodel/tests/` for how to set up your own data.
Clustermodel With Bisulfite-Seq Data
====================================Most modes of use of `clustermodel` assume that the methylation data is a from a technology
that returns a proportion of methylated cells (450K, charm, etc), however, for sequencing
data, we know the proportion of methylation, but also the sequencing depth. To model this,
we use beta regression (similar to BiSeq from bioconductor). BiSeq and other
packages such as `bsseq` (also from bioconductor) use the sequencing depth as weights for
smoothing the methylation rate using loess and then model the smoothed methylation rates.
In `clustermodel`, we instead use the sequence depth to perform a weighted regression on
the original data such that **samples** with more supporting reads are weighted more heavily.
We then combine the p-values from each probe in a cluster (described below) using the z-score
method described in the BiSeq paper except that **sites** with more supporting reads are given
more weight.To find clusters used above, we utilize the method described in http://www.ncbi.nlm.nih.gov/pubmed/23990415
and utilized throughout `clustermodel`. Again, this differs from `BiSeq` and `BSSeq` which
use peak-finding to delineate regions after asigning per-CpG p-values. The advantage of our approach is
that all clusters are reported so users can choose their own cutoff and multiple-testing correction is
more straight-forward.An example invocation to use `clustermodel` would be:
```Shell
python -m clustermodel \
--png-path results/png/spaghetti \
--counts --min-cluster-size 5 \
--combine z-score --outlier-sds 3 \
--betareg --weights counts.txt.gz \
'methylation ~ case' \
covariates.txt \
methylation.txt.gz \
> results/pvals.bedWhere `counts.txt.gz` has the same shape as `methylation.txt.gz` and the former has the read-depths
while the latter has the proportion of methylation for rows of sites and columns of samples.
The columns of those files will correspond to the rows in covariates.txtThe `counts` and `methylation` can be created by sending a list of Bismark or bwa-meth.py output files (tabulated methylation)
to `scripts/meth-matrix.py`.```
INSTALLATION
============You should have all of the R packages in scripts/check-R.R installed and should
be able to run that script without problems. You will have to do in R:install.packages(devtools)
library(devtools)
install_github("brentp/clustermodelr")The rest can be installed with install.packages or biocLite().
Generating Correlated Data
==========================see simulate.py
Coming Soon
===========comparison