https://github.com/tidyomics/tidybulk

Brings bulk and pseudobulk transcriptomics to the tidyverse
https://github.com/tidyomics/tidybulk
bioconductor bulk-transcriptional-analyses deseq2 differential-expression edger ensembl-ids entrez gene-symbols gsea mds-dimensions pca pipe redundancy tibble tidy tidy-data tidyverse transcriptomics transcripts tsne
Host: GitHub
URL: https://github.com/tidyomics/tidybulk
Owner: tidyomics
Created: 2019-08-28T23:46:30.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2025-11-12T01:06:31.000Z (6 months ago)
Last Synced: 2026-01-26T16:37:06.779Z (4 months ago)
Topics: bioconductor, bulk-transcriptional-analyses, deseq2, differential-expression, edger, ensembl-ids, entrez, gene-symbols, gsea, mds-dimensions, pca, pipe, redundancy, tibble, tidy, tidy-data, tidyverse, transcriptomics, transcripts, tsne
Language: R
Homepage: https://stemangiola.github.io/tidybulk/
Size: 115 MB
Stars: 176
Watchers: 4
Forks: 25
Open Issues: 17
Metadata Files:
- Readme: README.md

tidybulk: An R tidy framework for modular transcriptomic data analysis
================

[![Lifecycle:maturing](https://img.shields.io/badge/lifecycle-maturing-blue.svg)](https://www.tidyverse.org/lifecycle/#maturing)

[![R build status](https://github.com/stemangiola/tidybulk/workflows/R-CMD-check/badge.svg)](https://github.com/stemangiola/tidybulk/actions/)

[![Bioconductor status](https://bioconductor.org/shields/build/release/bioc/tidybulk.svg)](https://bioconductor.org/checkResults/release/bioc-LATEST/tidybulk/)

**tidybulk** is a powerful R package designed for modular transcriptomic data analysis that brings transcriptomics to the tidyverse.

## Why tidybulk?

Tidybulk provides a unified interface for comprehensive transcriptomic data analysis with seamless integration of SummarizedExperiment objects and tidyverse principles. It streamlines the entire workflow from raw data to biological insights.

## Functions/utilities available

### Abundance Normalization Functions

| Function                         | Description                             |
|----------------------------------|-----------------------------------------|
| `scale_abundance()`              | Scale abundance data                    |
| `quantile_normalise_abundance()` | Quantile normalization                  |
| `adjust_abundance()`             | Adjust abundance for unwanted variation |
| `fill_missing_abundance()`       | Fill missing abundance values           |
| `impute_missing_abundance()`     | Impute missing abundance values         |

### Filtering and Selection Functions

| Function              | Description                                         |
|-----------------------|-----------------------------------------------------|
| `identify_abundant()` | Identify abundant transcripts without removing them |
| `keep_abundant()`     | Keep abundant transcripts                           |
| `keep_variable()`     | Keep variable transcripts                           |
| `filterByExpr()`      | Filter by expression                                |

### Dimensionality Reduction Functions

| Function              | Description                              |
|-----------------------|------------------------------------------|
| `reduce_dimensions()` | Reduce dimensions with PCA/MDS/tSNE/UMAP |
| `rotate_dimensions()` | Rotate dimensions                        |
| `remove_redundancy()` | Remove redundant features                |

### Clustering Functions

| Function                  | Description                           |
|---------------------------|---------------------------------------|
| `cluster_elements()`      | Cluster elements with various methods |
| `kmeans clustering`       | K-means clustering                    |
| `SNN clustering`          | Shared nearest neighbor clustering    |
| `hierarchical clustering` | Hierarchical clustering               |
| `DBSCAN clustering`       | Density-based clustering              |

### Differential Analysis Functions

| Function                         | Description                                       |
|----------------------------------|---------------------------------------------------|
| `test_differential_expression()` | Test differential expression with various methods |

### Cellularity Analysis Functions

| Function                   | Description                                 |
|----------------------------|---------------------------------------------|
| `deconvolve_cellularity()` | Deconvolve cellularity with various methods |
| `cibersort()`              | CIBERSORT analysis                          |

### Gene Enrichment Functions

| Function                         | Description                  |
|----------------------------------|------------------------------|
| `test_gene_enrichment()`         | Test gene enrichment         |
| `test_gene_overrepresentation()` | Test gene overrepresentation |
| `test_gene_rank()`               | Test gene rank               |

### Utility Functions

| Function                                         | Description                         |
|--------------------------------------------------|-------------------------------------|
| `describe_transcript()`                          | Describe transcript characteristics |
| `get_bibliography()`                             | Get bibliography                    |
| `resolve_complete_confounders_of_non_interest()` | Resolve confounders                 |

### Validation and Utility Functions

| Function                      | Description                       |
|-------------------------------|-----------------------------------|
| `check_if_counts_is_na()`     | Check if counts contain NA values |
| `check_if_duplicated_genes()` | Check for duplicated genes        |
| `check_if_wrong_input()`      | Validate input data               |
| `log10_reverse_trans()`       | Log10 reverse transformation      |
| `logit_trans()`               | Logit transformation              |

All functions are directly compatible with `SummarizedExperiment` objects and follow tidyverse principles for seamless integration with the tidyverse ecosystem.

### Scientific Citation

Mangiola, Stefano, Ramyar Molania, Ruining Dong, Maria A. Doyle, and Anthony T. Papenfuss. 2021. “Tidybulk: An R tidy framework for modular transcriptomic data analysis.” Genome Biology 22 (42).

[Genome Biology - tidybulk: an R tidy framework for modular transcriptomic data analysis](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02233-7)

In this example we will use the `airway` dataset, a `SummarizedExperiment` object containing RNA-seq data from an experiment studying the effect of dexamethasone treatment on airway smooth muscle cells. This dataset is available in the [airway](https://bioconductor.org/packages/airway/) package.

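``` r
library(airway)
library(tidybulk)
# The examples below also use dplyr, tidyr and ggplot2
library(dplyr)
library(tidyr)
library(ggplot2)

data(airway)
```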

### Comparison of Differential Expression methods

``` r
# Standard differential expression analysis.
# Each call adds result columns that carry its own prefix.
airway = airway |>

  # Use QL method
  test_differential_expression(~ dex, method = "edgeR_quasi_likelihood", prefix = "ql__") |>

  # Use edger_robust_likelihood_ratio
  test_differential_expression(~ dex, method = "edger_robust_likelihood_ratio", prefix = "lr_robust__") |>

  # Use DESeq2 method
  test_differential_expression(~ dex, method = "DESeq2", prefix = "deseq2__") |>

  # Use limma_voom
  test_differential_expression(~ dex, method = "limma_voom", prefix = "voom__") |>

  # Use limma_voom_sample_weights
  test_differential_expression(~ dex, method = "limma_voom_sample_weights", prefix = "voom_weights__")
```

### Histograms of p-values across methods

Inspecting the histogram of raw p-values is a quick check of differential-expression results. When no gene is truly differentially expressed, the p-values follow a uniform U(0,1) distribution, so the histogram appears flat ([Source](https://bioconductor.org/help/course-materials/2014/useR2014/Workflows.html)). In a more realistic scenario where only a subset of genes changes, this uniform background is still present, but a clear spike emerges close to zero, created by the genuine signals.
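The shape described above can be reproduced with a small base-R simulation. This is only an illustrative sketch and not part of the tidybulk workflow: the gene numbers (`n_null`, `n_de`), the effect size of 3 and the use of a two-sided z-test are all assumptions chosen for illustration.

``` r
# Illustrative simulation (assumed numbers, not taken from the airway data)
set.seed(1)

n_null <- 9000  # genes with no true effect (assumption)
n_de   <- 1000  # genes with a true effect (assumption)

# Test statistics: standard normal under the null, shifted by 3 otherwise
z_null <- rnorm(n_null, mean = 0)
z_de   <- rnorm(n_de, mean = 3)

# Two-sided p-values from a z-test
p_null   <- 2 * pnorm(-abs(z_null))
p_de     <- 2 * pnorm(-abs(z_de))
p_values <- c(p_null, p_de)

# Flat background from the null genes plus a spike near zero from the signals
hist(p_values, breaks = seq(0, 1, by = 0.05),
     main = "Simulated p-values", xlab = "p-value")

# Fraction of p-values below 0.05 in each group
frac_null <- mean(p_null < 0.05)
frac_de   <- mean(p_de < 0.05)
```

Under these assumptions, about 5% of the null p-values fall below 0.05 by chance, whereas most p-values from genes with a true effect do.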

Because the modular `tidybulk` workflow can run several methods on the same object, we can easily compare the p-values across methods.

``` r
# `my_theme` is a ggplot2 theme object assumed to be defined earlier in the session
airway |>
  pivot_transcript() |>
  select(
    ql__PValue,
    lr_robust__PValue,
    voom__P.Value,
    voom_weights__P.Value,
    deseq2__pvalue
  ) |>
  pivot_longer(everything(), names_to = "method", values_to = "pvalue") |>
  ggplot(aes(x = pvalue, fill = method)) +
  geom_histogram(binwidth = 0.01) +
  facet_wrap(~method) +
  my_theme +
  labs(title = "Histogram of p-values across methods")
```

    ## Warning: Removed 30 rows containing non-finite outside the scale range

    ## (`stat_bin()`).

![](README_files/figure-gfm/differential-expression-pvalue-histograms-1.png)

### Compare Results Across Methods

``` r
# Summary statistics of the log fold changes per method.
# `matches()` takes a regular expression; `contains()` matches only literal strings,
# so the original pattern selected no columns.
airway |>
  pivot_transcript() |>
  select(matches("__(logFC|log2FoldChange)$")) |>
  summarise(across(everything(), list(
    min = \(x) min(x, na.rm = TRUE),
    median = \(x) median(x, na.rm = TRUE),
    max = \(x) max(x, na.rm = TRUE)
  )))
```

### Pairplot of p-values across methods (GGpairs)

``` r
library(GGally)

airway |>
  pivot_transcript() |>
  select(ql__PValue, lr_robust__PValue, voom__P.Value, voom_weights__P.Value, deseq2__pvalue) |>
  ggpairs(columns = 1:5) +
  scale_x_continuous(trans = tidybulk::log10_reverse_trans()) +
  scale_y_continuous(trans = tidybulk::log10_reverse_trans()) +
  my_theme +
  labs(title = "Pairplot of p-values across methods")
```

![](README_files/figure-gfm/differential-expression-pvalue-pairplot-1.png)

### Pairplot of effect sizes across methods (GGpairs)

``` r
library(GGally)

airway |>
  pivot_transcript() |>
  select(ql__logFC, lr_robust__logFC, voom__logFC, voom_weights__logFC, deseq2__log2FoldChange) |>
  ggpairs(columns = 1:5) +
  my_theme +
  labs(title = "Pairplot of effect sizes across methods")
```

![](README_files/figure-gfm/differential-expression-effectsize-pairplot-1.png)

### Quality Control 

It is important to check the quality of the fit. The fit object produced by each underlying method is stored in the metadata of the `airway` object, under `metadata(airway)$tidybulk`, and can be used for quality control.

#### For edgeR

Plot the biological coefficient of variation (BCV) trend. This plot is helpful in understanding the dispersion of the data.

``` r
library(edgeR)

metadata(airway)$tidybulk$edgeR_quasi_likelihood_object |>
  plotBCV()
```

![](README_files/figure-gfm/differential-expression-edgeR-object-1.png)
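As background for reading this plot: edgeR models counts with a negative binomial distribution whose variance is `mu + phi * mu^2`, where `mu` is the mean and `phi` the dispersion, and the BCV is the square root of `phi`. The squared coefficient of variation is therefore `1 / mu + phi`, so the BCV is the variability that remains once the mean is large. The base-R sketch below checks this relationship on simulated counts; the values of `mu` and `phi` are arbitrary assumptions, and the code does not use the fit object above.

``` r
# Simulated negative binomial counts (mu and phi are arbitrary assumptions)
set.seed(1)
mu  <- 100    # mean count
phi <- 0.16   # dispersion, so BCV = sqrt(phi) = 0.4

# rnbinom() parameterises the dispersion as size = 1 / phi
counts <- rnbinom(1e5, mu = mu, size = 1 / phi)

# Squared coefficient of variation: observed vs. expected 1 / mu + phi
cv2_observed <- var(counts) / mean(counts)^2
cv2_expected <- 1 / mu + phi

# Subtracting the Poisson part (1 / mean) and taking the root recovers the BCV
bcv_estimate <- sqrt(cv2_observed - 1 / mean(counts))

c(cv2_observed = cv2_observed, cv2_expected = cv2_expected, bcv_estimate = bcv_estimate)
```

With these settings the estimated BCV should land close to the assumed value of 0.4.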

Plot the log-fold change against the average expression (mean-difference plot).

``` r
library(edgeR)

metadata(airway)$tidybulk$edgeR_quasi_likelihood_fit |>
  plotMD()
```

![](README_files/figure-gfm/differential-expression-edgeR-fit-1.png)

#### For DESeq2

Plot the dispersion estimates against the mean of normalized counts.

``` r
library(DESeq2)

metadata(airway)$tidybulk$DESeq2_object |>
  plotDispEsts()
```

![](README_files/figure-gfm/differential-expression-DESeq2-object-1.png)

Plot the log-fold change against the mean of normalized counts (MA plot).

``` r
library(DESeq2)

metadata(airway)$tidybulk$DESeq2_object |>
  plotMA()
```

![](README_files/figure-gfm/differential-expression-DESeq2-fit-1.png)

### Volcano Plots for Each Method

Visualising the significance and effect size of the differential expression results as volcano plots shows that, for the same model and data, some methods yield much lower p-values than others.

``` r
# Create volcano plots
airway |>

  # Select the columns we want to plot
  pivot_transcript() |>
  select(
    .feature,
    ql__logFC, ql__PValue,
    lr_robust__logFC, lr_robust__PValue,
    voom__logFC, voom__P.Value,
    voom_weights__logFC, voom_weights__P.Value,
    deseq2__log2FoldChange, deseq2__pvalue
  ) |>

  # Pivot longer to get a tidy data frame
  pivot_longer(
    - .feature,
    names_to = c("method", "stat"),
    values_to = "value", names_sep = "__"
  ) |>

  # Harmonize column names
  mutate(stat = case_when(
    stat %in% c("logFC", "log2FoldChange") ~ "logFC",
    stat %in% c("PValue", "pvalue", "P.Value", "p.value") ~ "PValue"
  )) |>
  pivot_wider(names_from = "stat", values_from = "value") |>
  unnest(c(logFC, PValue)) |>

  # Plot
  ggplot(aes(x = logFC, y = PValue)) +
  geom_point(aes(color = PValue < 0.05, size = PValue < 0.05)) +
  scale_y_continuous(trans = tidybulk::log10_reverse_trans()) +
  scale_color_manual(values = c("TRUE" = "red", "FALSE" = "black")) +
  scale_size_manual(values = c("TRUE" = 0.5, "FALSE" = 0.1)) +
  facet_wrap(~method) +
  my_theme +
  labs(title = "Volcano Plots by Method")
```

    ## Warning: Removed 30 rows containing missing values or values outside the scale range

    ## (`geom_point()`).

![](README_files/figure-gfm/differential-expression-volcano-plots-1-1.png)
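The reshaping in the volcano plot code relies on the `prefix__statistic` naming of the result columns: `names_sep = "__"` splits each name into a method and a statistic, and the `case_when()` step maps the method-specific statistic names to common ones. The base-R sketch below mimics only that naming logic on the column names used above; the helper `harmonise_stat()` is hypothetical and written for illustration.

``` r
# Column names as selected in the volcano plot code above
columns <- c(
  "ql__logFC", "ql__PValue",
  "lr_robust__logFC", "lr_robust__PValue",
  "voom__logFC", "voom__P.Value",
  "voom_weights__logFC", "voom_weights__P.Value",
  "deseq2__log2FoldChange", "deseq2__pvalue"
)

# Split each name at the "__" separator into method and statistic
parts  <- strsplit(columns, "__", fixed = TRUE)
method <- vapply(parts, `[`, character(1), 1)
stat   <- vapply(parts, `[`, character(1), 2)

# Hypothetical helper: map method-specific statistic names to common ones
harmonise_stat <- function(stat) {
  ifelse(
    stat %in% c("logFC", "log2FoldChange"), "logFC",
    ifelse(stat %in% c("PValue", "pvalue", "P.Value", "p.value"), "PValue", NA_character_)
  )
}

name_map <- data.frame(column = columns, method = method, stat = harmonise_stat(stat))
name_map
```

Each method ends up with one `logFC` and one `PValue` entry, which is what lets `pivot_wider()` place both statistics side by side for every gene and method.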

Plotting independent y-axis scales for the p-values and effect sizes

allows us to compare the top genes across methods.

**Methods:**

- **edgeR quasi-likelihood:** Quasi-likelihood F-tests for differential

  expression

- **edgeR robust likelihood ratio:** Robust likelihood ratio tests

- **DESeq2:** Negative binomial distribution with dispersion estimation

- **limma-voom:** Linear modeling with empirical Bayes moderation

- **limma-voom with sample weights:** Enhanced voom with quality weights

**References:**

- Robinson et al. (2010) edgeR: a Bioconductor package for differential

  expression analysis

- Chen et al. (2016) From reads to genes to pathways: differential

  expression analysis of RNA-Seq experiments using Rsubread and the

  edgeR quasi-likelihood pipeline

- Love et al. (2014) Moderated estimation of fold change and dispersion

  for RNA-seq data with DESeq2

- Law et al. (2014) voom: precision weights unlock linear model analysis

  tools for RNA-seq read counts

- Liu et al. (2015) Why weight? Modelling sample and observational level

  variability improves power in RNA-seq analyses

## Automatic bibliography

`tidybulk` allows you to get the bibliography of all methods used in our workflow.

``` r
# Get bibliography of all methods used in our workflow
airway |> get_bibliography()
```

    ##  @Article{tidybulk,

    ##   title = {tidybulk: an R tidy framework for modular transcriptomic data analysis},

    ##   author = {Stefano Mangiola and Ramyar Molania and Ruining Dong and Maria A. Doyle & Anthony T. Papenfuss},

    ##   journal = {Genome Biology},

    ##   year = {2021},

    ##   volume = {22},

    ##   number = {42},

    ##   url = {https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02233-7},

    ##   }

    ## @article{wickham2019welcome,

    ##   title={Welcome to the Tidyverse},

    ##   author={Wickham, Hadley and Averick, Mara and Bryan, Jennifer and Chang, Winston and McGowan, Lucy D'Agostino and Francois, Romain and Grolemund, Garrett and Hayes, Alex and Henry, Lionel and Hester, Jim and others},

    ##   journal={Journal of Open Source Software},

    ##   volume={4},

    ##   number={43},

    ##   pages={1686},

    ##   year={2019}

    ##  }

    ## @article{robinson2010edger,

    ##   title={edgeR: a Bioconductor package for differential expression analysis of digital gene expression data},

    ##   author={Robinson, Mark D and McCarthy, Davis J and Smyth, Gordon K},

    ##   journal={Bioinformatics},

    ##   volume={26},

    ##   number={1},

    ##   pages={139--140},

    ##   year={2010},

    ##   publisher={Oxford University Press}

    ##  }

    ## @article{robinson2010scaling,

    ##   title={A scaling normalization method for differential expression analysis of RNA-seq data},

    ##   author={Robinson, Mark D and Oshlack, Alicia},

    ##   journal={Genome biology},

    ##   volume={11},

    ##   number={3},

    ##   pages={1--9},

    ##   year={2010},

    ##   publisher={BioMed Central}

    ##  }

    ## @incollection{smyth2005limma,

    ##   title={Limma: linear models for microarray data},

    ##   author={Smyth, Gordon K},

    ##   booktitle={Bioinformatics and computational biology solutions using R and Bioconductor},

    ##   pages={397--420},

    ##   year={2005},

    ##   publisher={Springer}

    ##  }

    ## @Manual{,

    ##     title = {R: A Language and Environment for Statistical Computing},

    ##     author = {{R Core Team}},

    ##     organization = {R Foundation for Statistical Computing},

    ##     address = {Vienna, Austria},

    ##     year = {2020},

    ##     url = {https://www.R-project.org/},

    ##   }

    ## @article{lund2012detecting,

    ##   title={Detecting differential expression in RNA-sequence data using quasi-likelihood with shrunken dispersion estimates},

    ##   author={Lund, Steven P and Nettleton, Dan and McCarthy, Davis J and Smyth, Gordon K},

    ##   journal={Statistical applications in genetics and molecular biology},

    ##   volume={11},

    ##   number={5},

    ##   year={2012},

    ##   publisher={De Gruyter}

    ##     }

    ## @article{zhou2014robustly,

    ##   title={Robustly detecting differential expression in RNA sequencing data using observation weights},

    ##   author={Zhou, Xiaobei and Lindsay, Helen and Robinson, Mark D},

    ##   journal={Nucleic acids research},

    ##   volume={42},

    ##   number={11},

    ##   pages={e91--e91},

    ##   year={2014},

    ##   publisher={Oxford University Press}

    ##  }

    ## @article{love2014moderated,

    ##   title={Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2},

    ##   author={Love, Michael I and Huber, Wolfgang and Anders, Simon},

    ##   journal={Genome biology},

    ##   volume={15},

    ##   number={12},

    ##   pages={550},

    ##   year={2014},

    ##   publisher={Springer}

    ##  }

    ## @article{law2014voom,

    ##   title={voom: Precision weights unlock linear model analysis tools for RNA-seq read counts},

    ##   author={Law, Charity W and Chen, Yunshun and Shi, Wei and Smyth, Gordon K},

    ##   journal={Genome biology},

    ##   volume={15},

    ##   number={2},

    ##   pages={R29},

    ##   year={2014},

    ##   publisher={Springer}

    ##     }

    ## @article{liu2015weight,

    ##   title={Why weight? Modelling sample and observational level variability improves power in RNA-seq analyses},

    ##   author={Liu, Ruijie and Holik, Aliaksei Z and Su, Shian and Jansz, Natasha and Chen, Kelan and Leong, Huei San and Blewitt, Marnie E and Asselin-Labat, Marie-Liesse and Smyth, Gordon K and Ritchie, Matthew E},

    ##   journal={Nucleic acids research},

    ##   volume={43},

    ##   number={15},

    ##   pages={e97--e97},

    ##   year={2015},

    ##   publisher={Oxford University Press}

    ##     }

    ## @article{leek2012sva,

    ##   title={The sva package for removing batch effects and other unwanted variation in high-throughput experiments},

    ##   author={Leek, Jeffrey T and Johnson, W Evan and Parker, Hilary S and Jaffe, Andrew E and Storey, John D},

    ##   journal={Bioinformatics},

    ##   volume={28},

    ##   number={6},

    ##   pages={882--883},

    ##   year={2012},

    ##   publisher={Oxford University Press}

    ##  }

    ## @article{newman2015robust,

    ##   title={Robust enumeration of cell subsets from tissue expression profiles},

    ##   author={Newman, Aaron M and Liu, Chih Long and Green, Michael R and Gentles, Andrew J and Feng, Weiguo and Xu, Yue and Hoang, Chuong D and Diehn, Maximilian and Alizadeh, Ash A},

    ##   journal={Nature methods},

    ##   volume={12},

    ##   number={5},

    ##   pages={453--457},

    ##   year={2015},

    ##   publisher={Nature Publishing Group}

    ##  }