https://github.com/russHyde/coxpresdbr
Parses and summarises data from coexpresdb.jp
https://github.com/russHyde/coxpresdbr
Last synced: 4 months ago
JSON representation
Parses and summarises data from coexpresdb.jp
- Host: GitHub
- URL: https://github.com/russHyde/coxpresdbr
- Owner: russHyde
- License: gpl-3.0
- Created: 2018-02-27T19:11:44.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2020-03-12T11:25:36.000Z (about 5 years ago)
- Last Synced: 2024-10-31T22:32:24.332Z (5 months ago)
- Language: R
- Size: 171 KB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 5
-
Metadata Files:
- Readme: README.Rmd
- License: LICENSE
Awesome Lists containing this project
- jimsghstars - russHyde/coxpresdbr - Parses and summarises data from coexpresdb.jp (R)
README
---
output: github_document
-------
# coxpresdbr: a package for importing data from coxpresdb.jp and for comparing your own datasets against coxpresdb
----
[

](
https://travis-ci.org/russHyde/coxpresdbr
)[

](
https://codecov.io/gh/russHyde/coxpresdbr?branch=master
)----
[`coxpresdb.jp`](https://coxpresdb.jp/)
is a website / database that summarises the correlation between pairs of genes
across thousands of different gene-expression studies. As of 2019, it contains
downloadable coexpression data from humans and a range of model organisms
(rodents, nematodes, flies, yeast etc). In humans, for example, the dataset may
contain all correlations amongst ~ 20k genes, as derived from summarising ~
160k samples.This package provides some data-access and statistical tools for working with
`coxpresdb.jp` in _R_.Why?
Several tools exist for combining network analysis with gene expression data or
for finding some other higher-level context for changes observed in gene
expression data (_GSEA_ / _GSVA_). These may construct a network _de novo_ (eg,
_WGCNA_) or identify interesting subcomponents of an existing network after
superimposing some expression statistics upon that network (eg,
_jActiveModules_ in _Cytoscape_).To use _WGCNA_ requires tons of expression data. And no-one I know has that
many good quality samples._jActiveModules_ and related tools are great. But they don't provide much
statistical context for the submodules that they select. Would they find an
_interesting_ subnetwork for white-noise input? How many subnetworks are
considered before selecting the small number of optimal modules. And when
applied, how relevant is network construction to the results obtained: if there
are non-uniformities in data-acquisition (eg, literature searches) or sampling
across the network (eg, funding / publication biases, sub-genome-scale
studies), if technical issues influence network structure, and if the network
is constructed from a data source that may be only loosely-coupled to gene
expression (eg, protein-protein interaction data) could a good network analysis
tool still return meaningless results?As an alternative, to put gene expression data in some network context, one
could use coexpression databases. This provides a network structure upon which
gene expression data can be overlaid (so can still work when samples are
limited in your experiment), that is built from genome-scale studies and where
the approach is closely connected to gene-expression studies and where we can
perform both gene-focussed and network-wide analyses. (admittedly the tissue
types and cellular treatments used in the datasets from which coexpression
databases are constructed may still be biased).Here's what I might do:
- compute fold-changes and p-values from an RNA-Seq experiment
- for each significant gene, have a see what genes it tends to be coexpressed
with / correlated with
- have a look if those coexpression partners are also differentially expressed
... in the same direction ... and have shared ChIP marks.Let's formalise that.
## Importing data
### Data formats
`coxpresdbr` currently only works with downloaded copies of the `coxpresdb.jp`
databases. A given database comes as a compressed archive and within that
archive, a file is present for each gene. In recent memory there have been two
related file formats and two compression types (`.tar.bz2` and, more recently,
`.zip`).The first file format had three columns (tab-separated)
(`target_gene, mutual_rank, correlation_coefficient`)
but has been replaced by a two column (`target_gene, mutual_rank`) format.
Column names are not present and the `source_gene` (the name of the file) and
`target_gene` (contents of the first column) are numeric Entrez IDs.When working with the coxpresdb files, I tend to make a smaller archive
setting up the file for each source-gene to contain only the top 100 most
correlated genes.### The `CoxpresDbAccessor` Class
We first define an importer object that permits access to the gene-specific
files stored in the coxpresdb archive.```
my_importer <- CoxpresDbAccessor(path_to_coxpresdb_archive)
```### Importing the coexpression data & filtering
Then to pull out the top 10 coexpression partners for a given set of genes,
you can call:```
get_coex_partners(my_gene_set, importer = my_importer, n_partners = 10)
```This returns a data-frame with columns: (
`source_id, target_id, mutual_rank`; correlation coefs, if present in the
coxpresdb archive, are disregarded since they aren't present in all coxpresdb
releases)You can specify the universe of genes from which the top-N coexpression
partners are chosen by using the argument `universe=...` and you can
filter to ensure that the returned gene-pairs have a mutual-rank score below
some threshold using `mr_threshold`.## Gene-focussed statistical summaries
## Network-level analysis workflows