Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.
Awesome Lists | Featured Topics | Projects
https://github.com/hilldr/tsai_lgr5_endoderm_2015

Analysis scripts pertaining to the manuscript "LGR4 and LGR5 function redundantly during human endoderm differentiation" by Yu-Hwai Tsai et. al 2015
https://github.com/hilldr/tsai_lgr5_endoderm_2015
Last synced: about 1 month ago
JSON representation
Analysis scripts pertaining to the manuscript "LGR4 and LGR5 function redundantly during human endoderm differentiation" by Yu-Hwai Tsai et. al 2015
Host: GitHub
URL: https://github.com/hilldr/tsai_lgr5_endoderm_2015
Owner: hilldr
Created: 2015-11-18T20:35:40.000Z (about 9 years ago)
Default Branch: master
Last Pushed: 2015-11-20T15:55:00.000Z (about 9 years ago)
Last Synced: 2023-02-28T08:41:11.861Z (almost 2 years ago)
Language: R
Size: 20.7 MB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project

README

        LGR4 and LGR5 function redundantly during human endoderm differentiation

========================================================================

Yu-Hwai Tsai¹, David R. Hill¹, Namit Kumar⁴, Sha Huang¹, Briana

R. Dye², Michael P. Verzi⁴ and Jason R. Spence^1,2,3,*

¹Division of Gastroenterology, Department of Internal Medicine

²Department of Cell and Developmental Biology

³Center for Organogenesis University of Michigan Medical School, Ann

Arbor, Michigan, USA, 48109

⁴Department of Genetics Rutgers, The State University of New Jersey,

Piscataway, New Jersey 08854

^*Author for correspondence:

[email protected]

@TheSpenceLab

Author Contributions: YHT and JRS conceived the study YHT, DRH, SH, BRD,

NK and MPV conducted experiments All authors analyzed data JRS wrote the

manuscript All authors edited and approved final manuscript

Summary

=======

The Lgr family of transmembrane proteins (Lgr4, 5, 6) act as functional

receptors for R-spondin proteins (Rspo 1, 2, 3, 4), and have been shown

to potentiate Wnt signaling in different contexts. Lgr5, arguably the

best characterized of the Lgr family members, was originally identified

as a marker of crypt base columnar (CBC) stem cells of the adult murine

intestine, and has subsequently been show to act as a stem cell marker

in many contexts. Moreover, Lgr members play functionally redundant

roles in CBC maintenance. Here, we show that LGR5 is part of the human

definitive endoderm (DE) “gene signature”, as LGR5 transcripts are

robustly induced when human pluripotent stem cells (hPSCs) are

differentiated into DE. Our results demonstrate that LGR4 and 5 function

redundantly during human endoderm induction. Consistent with data in

human DE, we observe Lgr5 reporter (eGFP) activity in the E8.5 mouse

endoderm, and show the ability to linage trace Lgr5-creER cells labeled

as early as E8.5 into the adult intestine. ChIPseq shows that LGR5 is

directly bound by *β*-catenin early during endoderm induction, and

functional experiments support a role for LGR proteins as potentiators

of WNT signaling during DE induction. Taken together, our results show

that LGR proteins are induced in DE, are functionally required for DE

induction, and that they function to potentiate WNT signaling during

this process.

Bioinformatics methods and results

==================================

Required hardware and software

------------------------------

### Hardware

The analysis machine has the following characteristics:

-   OS: Debian 7.9 wheezy

-   Kernel: x86~64~ Linux 3.2.0-4-amd64

-   Shell: bash 4.2.37

-   CPU: Intel Xeon CPU E5504 @ 2GHz

-   GPU: Quadro FX 380

-   RAM: 48340MB

### Software

The code is intended for the UNIX environment. The following software is

required to execute the analysis

-   FastQC v0.11.3

-   Bowtie 2 version 2.2.5

-   cufflinks v2.2.1

-   R version 3.2.2 (2015-08-14) -- "Fire Safety"

Scripts and github page prepared using GNU Emacs 24.5.1 with Org-mode

version 8.2.10

#### R package dependencies

``` {.r .rundoc-block rundoc-language="R" rundoc-session="*R*" rundoc-eval="yes" rundoc-exports="code" rundoc-tangle="./src/expression_analysis.R"}

# R package dependencies -------------------------------------------------------

# Comment this line out if your machine already has these packages installed

install.packages(c("matrixStats","ggplot2","MASS","scales"),repos='https://watson.nci.nih.gov/cran_mirror/')

```

Tuxedo suite pipeline script

----------------------------

### Header

``` {.bash .rundoc-block rundoc-language="sh" rundoc-export="code" rundoc-eval="no" rundoc-tangle="./src/tuxedo_complete.sh"}

#! /bin/bash

GENES=/data/genomes/hg19_genes_refFlat.gtf

GENOME=/data/genomes/hg19.fa

INDEX=/data/genomes/hg19

```

### FastQC processing

``` {.bash .rundoc-block rundoc-language="sh" rundoc-export="code" rundoc-eval="no" rundoc-tangle="./src/tuxedo_complete.sh"}

#FASTQC quality control report generator

mkdir ../DATA/QC

for file in ../DATA/FASTQ/*.fastq* # will output filename as "$file"

do

    FILENAME="$file"       

    fastqc --outdir=../DATA/QC $FILENAME

done

```

### Tophat/Bowtie processing

``` {.bash .rundoc-block rundoc-language="sh" rundoc-export="code" rundoc-eval="no" rundoc-tangle="./src/tuxedo_complete.sh"}

#! /bin/bash

mkdir ../DATA/BAM

for file in ../DATA/FASTQ/*.fastq*

do

    SHORTNAME=$(basename "$file")

    NAME2="${SHORTNAME##*/}"

    DIRNAME="${NAME2%.*}"

    tophat2 -p 8 --b2-very-sensitive --no-coverage-search --no-novel-juncs --GTF $GENES -o ../DATA/BAM/$DIRNAME $INDEX $file

done

```

### Cufflinks processing

``` {.bash .rundoc-block rundoc-language="sh" rundoc-export="code" rundoc-eval="no" rundoc-tangle="./src/tuxedo_complete.sh"}

#! /bin/bash

for d in ../DATA/BAM/*/

do

    FILENAME="$file"       #set variable FILENAME equal to file from line 1

    SHORTNAME=$(basename "$file")

    NAME2="${SHORTNAME##*/}"

    DIRNAME="${d}"

    cufflinks -p 8 -o $DIRNAME --multi-read-correct --compatible-hits-norm --upper-quartile-norm --GTF $GENES ${d}*hits.bam

done

```

#### Merged transcriptome setup file

``` {.bash .rundoc-block rundoc-language="sh" rundoc-export="code" rundoc-eval="no" rundoc-tangle="./src/gtf_assembly.txt"}

../DATA/BAM/Sample_ES1/transcripts.gtf

../DATA/BAM/Sample_ES2/transcripts.gtf

../DATA/BAM/Sample_ES3/transcripts.gtf

../DATA/BAM/Sample_DE1/transcripts.gtf

../DATA/BAM/Sample_DE2/transcripts.gtf

../DATA/BAM/Sample_DE3/transcripts.gtf

```

### Cuffmerge/Cuffquant processing

``` {.bash .rundoc-block rundoc-language="sh" rundoc-export="code" rundoc-eval="no" rundoc-tangle="./src/tuxedo_complete.sh"}

#! /bin/bash

cuffmerge -g $GENES -s $GENOME -p 8 -o ../DATA/merged_asm gtf_assembly.txt

for d in ../DATA/BAM/*/

do

    FILENAME="$file"       #set variable FILENAME equal to file from line 1

    SHORTNAME=$(basename "$file")

    NAME2="${SHORTNAME##*/}"

    DIRNAME="${d}"

    cuffquant -p 8 -o $DIRNAME --max-mle-iterations 100000 -v --multi-read-correct ../DATA/merged_asm/merged.gtf ${d}*hits.bam

done

```

### Cuffdiff/Cuffnorm processing

``` {.bash .rundoc-block rundoc-language="sh" rundoc-export="code" rundoc-eval="no" rundoc-tangle="./src/tuxedo_complete.sh"}

mkdir ../RESULTS

# CUFFNORM

cuffnorm -o ../RESULTS/normout -p 8 -L ES,DefEnd ../DATA/merged_asm/merged.gtf \

../DATA/BAM/Sample_ES1/abundances.cxb,../DATA/BAM/Sample_ES2/abundances.cxb,../DATA/BAM/Sample_ES3/abundances.cxb \

../DATA/BAM/Sample_DE1/abundances.cxb,../DATA/BAM/Sample_DE2/abundances.cxb,../DATA/BAM/Sample_DE3/abundances.cxb 

```

R script

--------

### Import data from Cuffnorm output

``` {.r .rundoc-block rundoc-language="R" rundoc-session="*R*" rundoc-eval="yes" rundoc-exports="code" rundoc-tangle="./src/expression_analysis.R"}

## FPKM matrix input -----------------------------------------------------------

data1 <- read.table("../RESULTS/normout/genes.count_table",header=TRUE,sep="\t", stringsAsFactors = FALSE)

attr.table <- read.table("../RESULTS/normout/genes.attr_table",header=TRUE,sep="\t",stringsAsFactors = FALSE)

data1$gene_short_name <- attr.table$gene_short_name

# write out complete gene expression matrix

write.csv(data1, file="../RESULTS/ES&DE_dataset_cufflinksFPKM.csv")

```

### Generate row statistics

``` {.r .rundoc-block rundoc-language="R" rundoc-session="*R*" rundoc-eval="yes" rundoc-exports="code" rundoc-tangle="./src/expression_analysis.R"}

# Generate row stats --------------------------------------------------  

# rename data1 as database to fork downstream additions

database <- data1

# generate group shortnames for selecting columns

ES <- grep("ES", colnames(database),ignore.case=F)

DE <- grep("DefEnd", colnames(database),ignore.case=F)

 library(matrixStats)

## calculate mean by treatment type

database$es_mean <- rowMeans(database[,ES], na.rm=T)

database$de_mean <- rowMeans(database[,DE], na.rm=T)

 # calculate log2 change

database$xlog2 <- log2(database$de_mean/database$es_mean)

 ## function to compare by row, returns t distribution

## The function is currently defined as

row.t <- function(mat1,mat2){

  mat1 <- as.matrix(mat1)

  mat2 <- as.matrix(mat2)

  n1<-dim(mat1)[2]

  n2<-dim(mat2)[2] 

  n<-n1+n2 

  m1<-rowMeans(mat1,na.rm=TRUE) 

  m2<-rowMeans(mat2,na.rm=TRUE) 

  v1<-rowVars(mat1,na.rm=TRUE) 

  v2<-rowVars(mat2,na.rm=TRUE) 

  vpool<-(n1-1)/(n-2)*v1 + (n2-1)/(n-2)*v2 

  tstat<-sqrt(n1*n2/n)*(m1-m2)/sqrt(vpool) 

  return(tstat)}

 # calculate t-distribution for hES vs. Definitive endoderm

database$tstat <- row.t(database[,ES],database[,DE])

 # express t-dist as p-value

database$p <- 2*pt(-abs(database$tstat),df=2)

 # calculate Bonferroni correction

database$Bonf_p <- p.adjust(database$p, method = 'bonferroni', n = length(database$p))

database <- database[order(-database$xlog2),]

# output matrix with stats

write.csv(database,file="../RESULTS/ES&DE_dataset_cufflinksFPKM_stats.csv")

#--------------------------------------------------------------------------------

# SET CRITERIA FOR INCLUSION

# Fmpk cut-off 

fco <- 0.4

# P-value cut-off

pco <- 0.05

#--------------------------------------------------------------------------------

#Count the number of reads with Fmpk > 2 in each group 

database$es_count <- rowSums(database[,ES] >= fco)

database$de_count <- rowSums(database[,DE] >= fco)

# generate an output (present = 1, not present = 0) for each group

database$es_present <- ifelse(database$es_count > 1, 1, 0)

database$de_present <- ifelse(database$de_count > 1, 1, 0)

```

### "Volcano" expression plot

``` {.r .rundoc-block rundoc-language="R" rundoc-session="*R*" rundoc-eval="yes" rundoc-exports="code" rundoc-tangle="./src/expression_analysis.R"}

  # Generate "Volcano" expression plot -------------------------------------------

  # open png device

  png(filename = "../RESULTS/volcano_plot.png",

      width = 1200,

      height = 1200,

      units = "px",

      pointsize =10,

      bg = "white")

  par(lwd=6,

      mgp=c(9,3,0),

      # c(bottom, left, top, right)

      mar=c(12,15,1,1),

      font.lab=2,

      font.axis=1,

      cex.lab=5,

      font.main=2,

      cex.axis=5,

      bg="white")

  # Axis Labels

  ylab.name <- expression(paste("-log"[10],"(p-value)"))

  main.lab <- ""

  xlab.name <- expression(paste("Expression ratio DE/ES (log"[2]," FPKM)"))

  # subset for plotting

  # apply P-value cut-off

  vplot <- subset(database, database$p <= pco)

  # Gene expression must be present in at least one condition

  vplot.1 <- subset(vplot[51:nrow(vplot),],

                    vplot$es_present ==1 | vplot$de_present == 1)

  vplot.2 <- subset(database, database$p > pco)

  # subset to genes expressed in both conditions

  db <- subset(database, database$xlog2 != Inf|database$xlog2 != NA)

  # subset top 20 up-regulated genes (formerly top 50)

  db.50 <- db[1:20,]

  # subset LGR5

  lgr5 <- db[24,]

  plot(vplot.1$xlog2,-log10(vplot.1$p),

       type="n",

       ylab=ylab.name,

       xlab=xlab.name,

       col=rgb(0,0,139,90,maxColorValue=255),

       ylim= c(0,3.5),cex=2,xlim= c(-10,12))

  grid(lwd=3)

  lines(vplot.1$xlog2,-log10(vplot.1$p),

        type="p",

        col=rgb(65,105,225,100,maxColorValue=255),

          ylim= c(0,5),

          cex=2,

          xlim= c(-10,12))

  lines(vplot.2$xlog2,-log10(vplot.2$p),

          type="p",

          col=rgb(112,128,144,100,maxColorValue=255),

          cex=2)

  # points for top 20 upregulated

  lines(db.50$xlog2,-log10(db.50$p),

          type="p",

          col=rgb(178,34,35,150,maxColorValue=255),

          cex=2,

          pch=21,

          bg="white")

  # pont for LGR5

  lines(lgr5$xlog2,-log10(lgr5$p),

          type="p",

          col=rgb(178,34,35,150,maxColorValue=255),

          cex=4,

          pch=21,

          bg="red")

  n <- nrow(vplot)

  # add arrow for LGR5

  arrows(lgr5$xlog2-0.5,-log10(lgr5$p)+0.1,

         x1 = lgr5$xlog2,y1 = -log10(lgr5$p),

         lwd =5, col= "blue",length =0.1)

  #Add gene name labels to top 30 upregulated genes

  text(lgr5$xlog2-0.5,-log10(lgr5$p)+0.1,

       labels = lgr5$gene_short_name,

       col = "black",cex = 4,font = 4,pos = 2)

  text(db.50$xlog2[1:20],(-log10(db.50$p)[1:20]),

       labels = db.50$gene_short_name[1:20],

       col="black",cex=2,font=2)

  # Close plotting device

  dev.off()

```

![Volcano plot of the log~2~-transformed FPKM ratio (Definitive

endoderm/Embryonic stem cell) and the -log~10~ p-value. Significant

changes in expression (p <0.05; two-tailed Student's *t*-test) are

indicated in blue. There are a total of 6,063 significantly different

transcripts out of a total 23,999 expressed transcripts (25.3%). The top

50 upregulated transcripts are highlighted in

red.](./RESULTS/volcano_plot.png)

Top 50 up-regulated genes (Supplemental Table 1)

------------------------------------------------

``` {.r .rundoc-block rundoc-language="R" rundoc-session="*R*" rundoc-eval="yes" rundoc-exports="code" rundoc-tangle="./src/expression_analysis.R"}

  # Supplemental Table 1 ---------------------------------------------------------

  db <- subset(database, database$xlog2 != Inf|database$xlog2 != NA)

  db.50 <- db[1:50,]

  write.csv(db.50,file="../RESULTS/top50upregulated_ES&DE.csv")

```

Wnt gene expression plot and Supplemental Table 2

-------------------------------------------------

``` {.r .rundoc-block rundoc-language="R" rundoc-session="*R*" rundoc-eval="yes" rundoc-exports="code" rundoc-tangle="./src/expression_analysis.R"}

      # Import Curated list of Wnt targets -------------------------------------------

      list <- read.csv("../DATA/Wnt_Gene_List.csv",header=FALSE)

      wnt.genes <- list[,1]

      # retrieve rows with gene names matching wnt,genes

      genes <- grep(paste(wnt.genes,collapse="|"),database$gene_short_name)

      wnt.expression <- as.data.frame(database[genes,])

      wnt.expression <- wnt.expression[order(wnt.expression$xlog2),]

      write.csv(wnt.expression, file = "../RESULTS/wnt_gene_expression.csv") # Supplemental Table 2

      wnt.expression <- subset(wnt.expression, wnt.expression$xlog2 != abs(Inf))

      # sort by absolute value of log2 expression ratio (biggest differences first)

      wnt.expression <- wnt.expression[order(-abs(wnt.expression$xlog2)),]

      # Plot expression in ggplot2 ---------------------------------------------------

      library(ggplot2)

      library(MASS)

      library(scales)

      library(grid)

      png(filename = "../RESULTS/wnt_scatter.png",

            width = 1200,

            height = 1200,

            units = "px",

            pointsize =10,

            bg = "white") 

      print(ggplot(data = wnt.expression) + 

                geom_abline(intercept = 0 , slope = 1,

                            color = "grey", size = 2, linetype = "dashed") +

                geom_point(shape = 21,alpha = I(0.95),

                           aes(de_mean, es_mean, size = -log(p,10),fill = xlog2)) +

                guides(fill = guide_colorbar(title = expression(paste("DE/ES (log"[2]," FPKM)"))),

                       size = guide_legend(title = expression(paste("-log"[10],"(p-value)")))) +

                scale_fill_gradient2(low = "blue", high = "red") +

                scale_x_log10(breaks = trans_breaks("log10", function(x) 10^x),

                              labels = trans_format("log10", math_format(10^.x))) +

                scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),

                              labels = trans_format("log10", math_format(10^.x))) +

                annotation_logticks(size = 3) +

                geom_text(data = wnt.expression[c(1:(grep("RSPO2",wnt.expression$gene_short_name)-1),(grep("RSPO2",wnt.expression$gene_short_name)+2):23),],

                          aes(de_mean,es_mean,label=gene_short_name),

                          size = 10, fontface = 2,

                          # nudge labels to improve visibility

                          vjust = ifelse(wnt.expression[c(1:(grep("RSPO2",wnt.expression$gene_short_name)-1),(grep("RSPO2",wnt.expression$gene_short_name)+2):23),]$gene_short_name == "TP53TG5" | wnt.expression[c(1:(grep("RSPO2",wnt.expression$gene_short_name)-1),(grep("RSPO2",wnt.expression$gene_short_name)+2):23),]$gene_short_name == "WIF1", -0.45, 0.5),

                          # move labels for 3 genes with 0 expression in DE out of view rather than plot names underneath y-axis (unreadable). Expression data available in Supplemental Table 2

                          hjust = ifelse(wnt.expression[c(1:(grep("RSPO2",wnt.expression$gene_short_name)-1),(grep("RSPO2",wnt.expression$gene_short_name)+2):23),]$de_mean < 1, 2, 0.5)) +

                # Offset RSPO2 to improve visibility

                geom_text(data = wnt.expression[grep("RSPO2",wnt.expression$gene_short_name),],

                          aes(de_mean,es_mean+7,label = gene_short_name),

                          size = 10,fontface=2) +

                # Offset RSPO1 to improve visibility

                geom_text(data = wnt.expression[grep("RSPO1",wnt.expression$gene_short_name),],

                          aes(de_mean+0.1,es_mean+2,label = gene_short_name),

                          size = 10,fontface=2) +

                scale_size(range=c(0.25,25)) +

                labs(x = "\nDefinitive Endoderm (Mean FPKM)",

                     y = "ES cells (Mean FPKM)\n") +

                coord_fixed(ratio = 1) +

                theme(panel.grid.minor = element_blank(),

                      panel.grid.major = element_blank(),

                      legend.position = "bottom",

                      legend.key.size = unit(2, "cm"),

                      legend.text = element_text(size = 20),

                      legend.title = element_text(size = 20, face =2),

                      axis.text = element_text(size=30, color = "black", face = 2),

                      axis.title = element_text(size = 35, face = 2),

                      panel.background = element_rect(fill = "grey85", color = "black"))

            ) # close print window

      dev.off()

```

![](./RESULTS/wnt_scatter.png)

Complete workflow (Sequence alignment --> Expression plots)

==============================================================

    #! /bin/bash

    # First execute tuxedo pipeline to generate FPKM data (this will take a long time)

    ./tuxedo_complete.sh

    # Then execute the R script

    Rscript expression_analysis.R