# `polyply` - an R package for manipulation of multiple data-frames in a single magrittr pipeline

[![Build Status](https://travis-ci.org/russHyde/polyply.svg?branch=master)](https://travis-ci.org/russHyde/polyply)
[![Coverage Status](https://img.shields.io/codecov/c/github/russHyde/polyply/master.svg)](https://codecov.io/github/russHyde/polyply?branch=master)

Contributors are more than welcome, but you've got to be nice: see the
code of conduct ([CONDUCT.md](CONDUCT.md)).

----

I've talked about these ideas on Twitter and Biostars recently.

`dplyr` really shines in the manipulation of single data-frames and has
functions for merging existing data-frames together, similar to how tables are
combined in relational databases. The `dplyr` syntax gets rather heavy-handed
when you need to both manipulate and merge more than one data-frame in a single
pipeline.

An example might be illustrative:

Suppose data-frames `A`, `B`, and `C` exist that contain information related to
a gene-expression experiment. Any number of other applications could have been
chosen.

`A` might contain annotation information for the genes that were studied (IDs
in external databases, gene lengths, etc.). `B` might contain information
about the experimental samples (did a sample receive a particular treatment;
where was it sourced; etc.). `C` might contain the expression level for each
gene in each sample.

Assume the three datasets are 'tidy'. There's a single row for each gene in
`A`, there's a single row for each sample in `B` and there's a single row for
each gene/sample combination in `C`.
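
To make this concrete, toy versions of these tables might look something like
the following (the column names are my choice here, matching the join keys
used in the pipelines below):

~~~~
library(tibble)

# toy data: two genes, two samples, and one expression value per
# gene/sample pair
A <- tribble(
  ~feature_id, ~gene_length,
  "gene1",     1200,
  "gene2",     3400
)

B <- tribble(
  ~sample_id, ~source,
  "s1",       "Glasgow",
  "s2",       "Edinburgh"
)

C <- tribble(
  ~feature_id, ~sample_id, ~expression,
  "gene1",     "s1",       5.2,
  "gene1",     "s2",       4.8,
  "gene2",     "s1",       9.1,
  "gene2",     "s2",       8.7
)
~~~~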

Suppose I want to extract the expression information for all patients sourced
from Glasgow, for genes that are longer than 2 kilobases. Then I want to pass
that expression data into ggplot2 and make a well-annotated scatter plot (the
annotations using information from both the gene-metadata `A` and the
biological-samples-metadata `B`).

There are many ways to do this using standard tidyverse approaches.

~~~~
# prefilter, constructing superfluous data-frames
glasgow_samples <- filter(B, source == "Glasgow")
long_genes <- filter(A, gene_length > 2000)
expression_data <- C %>%
  inner_join(glasgow_samples, by = "sample_id") %>%
  inner_join(long_genes, by = "feature_id")
expression_data %>%
  ggplot(...)
~~~~

~~~~
# filter within the join
C %>%
  inner_join(filter(B, source == "Glasgow"), by = "sample_id") %>%
  inner_join(filter(A, gene_length > 2000), by = "feature_id") %>%
  ggplot(...)
~~~~

~~~~
# post-filter, making a huge temporary data-frame
C %>%
  inner_join(B, by = "sample_id") %>%
  inner_join(A, by = "feature_id") %>%
  filter(source == "Glasgow" & gene_length > 2000) %>%
  ggplot(...)
~~~~

All of the above are perfectly valid approaches.

If you do them once.

And if your data-frames are sufficiently small.

But 'doing things just once' always seems like the exception rather than the
rule, and working with manageably-sized data-frames is another rarity.

So, to mitigate against duplication, which bits of the above code should be
abstracted away?

Since the 'datasets used' will change less rapidly than the 'questions asked',
I'm more likely to need to change the filters / selections / mutations that are
applied to the individual data-frames than I am to change the pipeline for
joining-together the different datasets. Hell, there's a logical connection
between the different data-frames that is unaffected by filtering any given
data-frame - so surely joining on that logical connection should be abstracted
out first.

Given all that, we want to define a single merging function that takes the
collection of data-frames (assume a list of data-frames for now) and returns
a single data-frame, ready for filtering / selection etc., or for use in
`ggplot`. This might look like:

~~~~
list(genes = A, samples = B, expressions = C) %>%
  my_merging_function() %>%
  filter(source == "Glasgow" & gene_length > 2000) %>%
  ggplot(...)
~~~~
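
For this example, `my_merging_function` could be as simple as folding the
metadata tables into the expression table (the join keys are hard-coded here
purely for illustration; a real implementation would need to infer them, or
be told them):

~~~~
library(dplyr)

# join each metadata table onto the expression table in turn
my_merging_function <- function(dfs) {
  dfs$expressions %>%
    inner_join(dfs$samples, by = "sample_id") %>%
    inner_join(dfs$genes, by = "feature_id")
}
~~~~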

Here, the merging function works similarly to a View in SQL: it's a virtual
specification of how the data-tables should be combined. But unlike in SQL,
R performs no query optimisation, so that pipeline would create the same huge
temporary inner-join data-frame described above. Given that, something like
the following would be more memory-efficient in R:

~~~~
list(
  genes = filter(A, ...),
  samples = filter(B, ...),
  expressions = C
) %>%
  my_merging_function() %>%
  ggplot(...)
~~~~

But but, what if you could do this:

~~~~
list(
  genes = A,
  samples = B,
  expressions = C
) %>%
  filter_a_specific_dataframe_within_that_collection(...) %>%
  filter_a_different_dataframe(...) %>%
  my_merging_function() %>%
  ggplot(...)
~~~~

... and have it behave identically to the two previous calls. This would
limit the memory burden and remove the need to create intermediate
data-frames, while still allowing you to store intermediate results whenever
that's useful.
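
As a sketch, such a targeted filter could be as simple as the following (the
helper name `filter_element` is just a placeholder):

~~~~
library(dplyr)

# filter one named data-frame within the list, leaving every other
# element untouched
filter_element <- function(dfs, element, ...) {
  dfs[[element]] <- filter(dfs[[element]], ...)
  dfs
}

# reads as a single pipeline, but only ever filters the small tables
list(genes = A, samples = B, expressions = C) %>%
  filter_element("genes", gene_length > 2000) %>%
  filter_element("samples", source == "Glasgow") %>%
  my_merging_function()
~~~~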

But but but, doesn't `tidygraph` do something similar already? In
`tidygraph`, a pair of data-frames (one for edges and one for nodes) is
stored inside a `tbl_graph` object, and you can mutate / filter / etc. each
of these data-frames independently. To indicate which of the tables you want
to work with, you use `activate()`. So we could generalise the idea from
`tidygraph` in a way that allows the following workflow:

~~~~
some_collection(genes = A, samples = B, expressions = C) %>%
  activate(genes) %>%
  filter(gene_length > 2000) %>%
  activate(samples) %>%
  filter(source == "Glasgow") %>%
  my_merging_function() %>%
  [... other filtering / mutation / selection steps ...] %>%
  [... downstream output maker ...]
~~~~
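
As a rough prototype of that idea, an activate-style API over a plain list
of data-frames might look like this (the class name and every detail below
are a sketch, not a finished design):

~~~~
library(dplyr)

# constructor: a named list of data-frames with a class attribute
some_collection <- function(...) {
  structure(list(...), class = "df_collection")
}

# record which data-frame subsequent verbs should act on
# (tidygraph exports its own activate(); this standalone version is a sketch)
activate <- function(x, what) {
  attr(x, "active") <- deparse(substitute(what))
  x
}

# make dplyr::filter() act on the active data-frame only
filter.df_collection <- function(.data, ...) {
  active <- attr(.data, "active")
  .data[[active]] <- filter(.data[[active]], ...)
  .data
}
~~~~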

Happy to take criticism of the idea.