https://github.com/fpsom/ngs-data-integration
A pipeline for integrating downstream data analysis across NGS data technologies (RNA-Seq, WES, 450k, etc)
Last synced: 12 months ago
- Host: GitHub
- URL: https://github.com/fpsom/ngs-data-integration
- Owner: fpsom
- Created: 2016-05-17T08:56:55.000Z (about 10 years ago)
- Default Branch: master
- Last Pushed: 2017-09-04T14:58:04.000Z (almost 9 years ago)
- Last Synced: 2025-05-21T11:14:48.926Z (about 1 year ago)
- Language: R
- Size: 113 KB
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# NGS Data Integration
Data integration is a key objective in biomedical research, as it allows the identification of hidden relationships and correlations between heterogeneous biomolecular data. This repository is a data integration prototype written in R, currently supporting two technologies (the 450K methylation array and RNA sequencing).
## Input Data Sources
Currently the process supports two technologies: the 450K methylation array and RNA sequencing.
DNA methylation profiling was performed using the Infinium Human Methylation 450k array (Illumina), which interrogates 485,577 CpG sites, and the data were analyzed using RnBeads (an R package). RNA sequencing was performed on a NextSeq 500 (Illumina), and the data were analyzed using TopHat in a Unix environment. DNA methylation and expression levels were measured as β-values and Fragments Per Kilobase of transcript per Million mapped reads (FPKM), respectively.
### 450K methylation data
The sample data (`betas_450k_Meth.csv`) is located within the `sample_data` folder, in a `csv` format, containing the following columns:
`ID`, `Chromosome`, `Start`, `End` and `Strand`
The first column (`ID`) is the identifier assigned to each methylation site (CpG site). The next four columns (`Chromosome`, `Start`, `End` and `Strand`) define the exact chromosomal position of the site.
A snippet of the data is the following:
```
cg13869341,chr1,15865,15866,+
cg14008030,chr1,18827,18828,+
cg12045430,chr1,29407,29408,+
```
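As a minimal sketch, the three rows above can be read into a data frame with base R. The rows are inlined so the example runs without the file; whether the real `betas_450k_Meth.csv` has a header row or further columns is not stated here, so `header = FALSE` and the column names passed to `col.names` are assumptions that apply to this snippet only.

```r
# The three rows shown above, inlined so the example runs without the file.
# Assumption: no header row here; the real file may have one (and more columns).
csv_text <- "cg13869341,chr1,15865,15866,+
cg14008030,chr1,18827,18828,+
cg12045430,chr1,29407,29408,+"

betas <- read.csv(
  text = csv_text,
  header = FALSE,
  col.names = c("ID", "Chromosome", "Start", "End", "Strand"),
  stringsAsFactors = FALSE
)

# Each CpG site in the snippet spans two adjacent positions (End = Start + 1).
site_width <- betas$End - betas$Start

str(betas)
```

For the actual file, something like `read.csv("sample_data/betas_450k_Meth.csv")` would be the starting point, adjusted to the file's real header.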
### RNA-Seq data
The sample data (`gene_expr_RNA_Seq.csv`) is located within the `sample_data` folder, in a `csv` format, containing the following columns:
`LOC`, `Chromosome`, `Start`, `End`, `gene.features.locus`, `Genes` and `Strand`
The first column (`LOC`) is the identifier assigned to each gene by the Tuxedo protocol [1]. Columns `Chromosome`, `Start`, `End`, `gene.features.locus` and `Strand` define the exact chromosomal position of the locus, and column `Genes` contains the gene names that the particular locus has been annotated with.
A snippet of the data is the following:
```
XLOC_000001,chr1,11873,29370,chr1:11873-29370,DDX11L1,+
XLOC_000002,chr1,11873,29370,chr1:11873-29370,WASH7P,+
XLOC_000003,chr1,30365,30503,chr1:30365-30503,MIR1302-10,+
```
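The relationship between the columns can be checked on the snippet itself. The sketch below (base R, rows inlined, `header = FALSE` assumed for this snippet only) rebuilds `gene.features.locus` from `Chromosome`, `Start` and `End`, and counts how many `LOC` identifiers share each locus.

```r
# The three rows shown above, inlined so the example runs without the file.
csv_text <- "XLOC_000001,chr1,11873,29370,chr1:11873-29370,DDX11L1,+
XLOC_000002,chr1,11873,29370,chr1:11873-29370,WASH7P,+
XLOC_000003,chr1,30365,30503,chr1:30365-30503,MIR1302-10,+"

expr <- read.csv(
  text = csv_text,
  header = FALSE,
  col.names = c("LOC", "Chromosome", "Start", "End",
                "gene.features.locus", "Genes", "Strand"),
  stringsAsFactors = FALSE
)

# Rebuild the locus string and compare it with the stored column.
rebuilt  <- paste0(expr$Chromosome, ":", expr$Start, "-", expr$End)
locus_ok <- rebuilt == expr$gene.features.locus

# Distinct LOC identifiers can share the same coordinates.
loci_per_region <- table(expr$gene.features.locus)
print(loci_per_region)
```

In this snippet `XLOC_000001` and `XLOC_000002` occupy the same locus, so any overlap with that region is reported once per `LOC`.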
## Steps involved
The `R` script is built on the `GenomicRanges` package (`library(GenomicRanges)`).
### Stage 1: Find the overlap within the transcript
- **step A**. find the overlapping ranges between the CpG and the LOC
- **step B**. add a column to the betas file (`betas_m`) and to the expression file (`expr_m`) with the overlapping region in each case, and create a total matrix (`info.within.1.2`)
- **step C**. save the total matrix within the transcript
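The three steps above can be sketched with `GenomicRanges` as follows. This is an illustration and not the repository's script: the object names are borrowed from the step descriptions, the rows come from the two sample snippets, and ignoring strand during the overlap (`ignore.strand = TRUE`), the helper `to_granges` and the layout of the result table are all assumptions.

```r
suppressPackageStartupMessages(library(GenomicRanges))

# Rows copied from the two sample snippets.
betas_m <- data.frame(
  ID = c("cg13869341", "cg14008030", "cg12045430"),
  Chromosome = "chr1",
  Start = c(15865L, 18827L, 29407L),
  End = c(15866L, 18828L, 29408L),
  Strand = "+"
)
expr_m <- data.frame(
  LOC = c("XLOC_000001", "XLOC_000002", "XLOC_000003"),
  Chromosome = "chr1",
  Start = c(11873L, 11873L, 30365L),
  End = c(29370L, 29370L, 30503L),
  Strand = "+"
)

# Hypothetical helper: data frame -> GRanges.
to_granges <- function(df) {
  GRanges(seqnames = df$Chromosome,
          ranges = IRanges(start = df$Start, end = df$End),
          strand = df$Strand)
}
cpg_gr <- to_granges(betas_m)
loc_gr <- to_granges(expr_m)

# step A: overlapping ranges between CpG sites and LOCs.
# Assumption: strand is ignored for the overlap.
hits <- findOverlaps(cpg_gr, loc_gr, ignore.strand = TRUE)
q <- queryHits(hits)
s <- subjectHits(hits)

# step B: describe the overlapping region and build the combined table.
overlap <- paste0(betas_m$Chromosome[q], ":",
                  pmax(betas_m$Start[q], expr_m$Start[s]), "-",
                  pmin(betas_m$End[q], expr_m$End[s]))
info.within.1.2 <- data.frame(ID = betas_m$ID[q],
                              LOC = expr_m$LOC[s],
                              overlap = overlap)

# step C: save the table (a temporary file is used here).
out_file <- tempfile(fileext = ".csv")
write.csv(info.within.1.2, out_file, row.names = FALSE)
print(info.within.1.2)
```

With the sample rows, `cg12045430` lies outside every transcript, while the other two CpG sites each fall inside both `XLOC_000001` and `XLOC_000002`, which share coordinates.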
### Stage 2: Find the overlap within the TSS on the `+` strand
- **step A**. find the TSS of the 5'-3' transcript
- **step B**. find the overlapping ranges between the CpG and the LOC
- **step C**. add a column to the betas file (`betas_m`) and to the expression file (`expr_m`) with the overlapping region in each case, and create a total matrix (`info.tss.pos.1`)
- **step D**. save the total matrix for the TSS on the `+` strand
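A possible sketch of this stage on the sample rows is shown below. On the `+` strand the TSS is the `Start` coordinate. The size of the region around the TSS is not stated in this README, so the window used here (2000 bp upstream, 200 bp downstream, built with `promoters()`) is purely an assumption, as are the variable names other than `info.tss.pos.1`.

```r
suppressPackageStartupMessages(library(GenomicRanges))

# Rows copied from the two sample snippets; IDs are stored as metadata columns.
cpg_gr <- GRanges("chr1",
                  IRanges(start = c(15865L, 18827L, 29407L),
                          end = c(15866L, 18828L, 29408L)),
                  strand = "+",
                  ID = c("cg13869341", "cg14008030", "cg12045430"))
loc_gr <- GRanges("chr1",
                  IRanges(start = c(11873L, 11873L, 30365L),
                          end = c(29370L, 29370L, 30503L)),
                  strand = "+",
                  LOC = c("XLOC_000001", "XLOC_000002", "XLOC_000003"))

# step A: keep "+" strand transcripts; their TSS is the Start coordinate.
pos_gr  <- loc_gr[as.character(strand(loc_gr)) == "+"]
tss_pos <- start(pos_gr)

# Assumption: window of 2000 bp upstream and 200 bp downstream of the TSS.
tss_window <- promoters(pos_gr, upstream = 2000L, downstream = 200L)

# step B: overlaps between CpG sites and the TSS windows (strand ignored).
hits <- findOverlaps(cpg_gr, tss_window, ignore.strand = TRUE)
q <- queryHits(hits)
s <- subjectHits(hits)

# step C: combined table of CpG, LOC, TSS position and window.
window_label <- paste0(as.character(seqnames(tss_window)), ":",
                       start(tss_window), "-", end(tss_window))
info.tss.pos.1 <- data.frame(ID = cpg_gr$ID[q],
                             LOC = pos_gr$LOC[s],
                             TSS = tss_pos[s],
                             window = window_label[s])

# step D: save the table (a temporary file is used here).
out_file <- tempfile(fileext = ".csv")
write.csv(info.tss.pos.1, out_file, row.names = FALSE)
print(info.tss.pos.1)
```

Under this assumed window, only `cg12045430` is reported, because it lies about 1 kb upstream of the `XLOC_000003` start. A different window size would change which sites are reported.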
### Stage 3: Find the overlap within the TSS on the `-` strand
- **step A**. find the TSS of the 3'-5' transcript
- **step B**. find the overlapping ranges between the CpG and the LOC
- **step C**. add a column to the betas file (`betas_m`) and to the expression file (`expr_m`) with the overlapping region in each case, and create a total matrix (`info.tss.neg.1.2`)
- **step D**. save the total matrix for the TSS on the `-` strand
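On the `-` strand the TSS sits at the `End` coordinate, which is the main difference from the `+` strand case. The sample snippets contain no `-` strand rows, so the sketch below uses synthetic coordinates and identifiers (`LOC_demo`, `cg_demo_low`, `cg_demo_high`) invented for illustration. The window of 2000 bp upstream and 200 bp downstream is again an assumption.

```r
suppressPackageStartupMessages(library(GenomicRanges))

# Synthetic example, NOT taken from the sample data.
loc_gr <- GRanges("chr1", IRanges(start = 1000L, end = 5000L),
                  strand = "-", LOC = "LOC_demo")
cpg_gr <- GRanges("chr1",
                  IRanges(start = c(900L, 6000L), end = c(901L, 6001L)),
                  strand = "*",
                  ID = c("cg_demo_low", "cg_demo_high"))

# step A: keep "-" strand transcripts; resize() is strand-aware, so
# fix = "start" anchors at the 5' end, which is the End coordinate here.
neg_gr  <- loc_gr[as.character(strand(loc_gr)) == "-"]
tss_neg <- start(resize(neg_gr, width = 1, fix = "start"))

# Assumption: 2000 bp upstream and 200 bp downstream of the TSS.
# On the "-" strand, upstream means higher coordinates.
tss_window <- promoters(neg_gr, upstream = 2000L, downstream = 200L)

# step B: overlaps between CpG sites and the TSS window (strand ignored).
hits <- findOverlaps(cpg_gr, tss_window, ignore.strand = TRUE)
q <- queryHits(hits)
s <- subjectHits(hits)

# step C: combined table.
info.tss.neg.1.2 <- data.frame(ID = cpg_gr$ID[q],
                               LOC = neg_gr$LOC[s],
                               TSS = tss_neg[s])

# step D: save the table (a temporary file is used here).
out_file <- tempfile(fileext = ".csv")
write.csv(info.tss.neg.1.2, out_file, row.names = FALSE)
print(info.tss.neg.1.2)
```

The site near position 900 would be picked up if `Start` were mistakenly treated as the TSS. With strand-aware handling, only the site near position 6000 falls inside the window.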
## References
[1] Cole Trapnell, Adam Roberts, Loyal Goff, Geo Pertea, Daehwan Kim, David R Kelley, Harold Pimentel, Steven L. Salzberg, John L. Rinn & Lior Pachter, "_Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks_", Nature Protocols 7, 562–578 (2012), doi:10.1038/nprot.2012.016.
_Note: Testing data can be retrieved from the [genome/gms repository](https://github.com/genome/gms/wiki/HCC1395-WGS-Exome-RNA-Seq-Data)._