{"id":25520162,"url":"https://github.com/fpsom/ngs-data-integration","last_synced_at":"2025-07-02T17:07:18.810Z","repository":{"id":145089990,"uuid":"59006534","full_name":"fpsom/ngs-data-integration","owner":"fpsom","description":"A pipeline for integrating downstream data analysis across NGS data technologies (RNA-Seq, WES, 450k, etc)","archived":false,"fork":false,"pushed_at":"2017-09-04T14:58:04.000Z","size":116,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-05-21T11:14:48.926Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/fpsom.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-05-17T08:56:55.000Z","updated_at":"2017-08-18T08:56:12.000Z","dependencies_parsed_at":"2023-06-03T04:15:34.294Z","dependency_job_id":null,"html_url":"https://github.com/fpsom/ngs-data-integration","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/fpsom/ngs-data-integration","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fpsom%2Fngs-data-integration","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fpsom%2Fngs-data-integration/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fpsom%2Fngs-data-integration/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fpsom%2Fngs-data-integration/manifests","owner_url":"https:/
/repos.ecosyste.ms/api/v1/hosts/GitHub/owners/fpsom","download_url":"https://codeload.github.com/fpsom/ngs-data-integration/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fpsom%2Fngs-data-integration/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":263182184,"owners_count":23426633,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-02-19T17:47:24.300Z","updated_at":"2025-07-02T17:07:18.771Z","avatar_url":"https://github.com/fpsom.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"# NGS Data Integration\n\nData integration is a key objective in biomedical research, as it allows the identification of hidden relationships and correlations between heterogeneous biomolecular data. This is a data integration prototype written in R, currently supporting two NGS technologies (450K methylation array and RNA sequencing).\n\n## Input Data Sources\n\nCurrently, the process supports two technologies: the 450K methylation array and RNA sequencing.\n\nDNA methylation profiling was performed using the Infinium Human Methylation 450k array (Illumina), which interrogates 485,577 CpG sites, and the data were analyzed using RnBeads (an R package). RNA sequencing was performed on a NextSeq 500 (Illumina), and the data were analyzed using TopHat in a Unix environment. DNA methylation and expression levels were measured as beta values (β-values) and Fragments Per Kilobase of transcript per Million mapped reads (FPKM), respectively.
\n\n### 450K methylation data\n\nThe sample data (`betas_450k_Meth.csv`) is located in the `sample_data` folder, in `csv` format, and contains the following columns:\n\n`ID`, `Chromosome`, `Start`, `End` and `Strand`\n\nThe first column (`ID`) is the identifier assigned to each methylation site (CpG site); the next four columns (`Chromosome`, `Start`, `End` and `Strand`) define the exact chromosomal position of the site.\n\nA snippet of the data is the following:\n\n```\ncg13869341,chr1,15865,15866,+\ncg14008030,chr1,18827,18828,+\ncg12045430,chr1,29407,29408,+\n```\n\n### RNA-Seq data\n\nThe sample data (`gene_expr_RNA_Seq.csv`) is located in the `sample_data` folder, in `csv` format, and contains the following columns:\n\n`LOC`, `Chromosome`, `Start`, `End`, `gene.features.locus`, `Genes` and `Strand`\n\nThe first column (`LOC`) is the identifier assigned to each gene by the Tuxedo protocol [1]; columns `Chromosome`, `Start`, `End`, `gene.features.locus` and `Strand` define the exact chromosomal position of the locus, and column `Genes` contains the gene names with which that locus has been annotated.\n\nA snippet of the data is the following:\n\n```\nXLOC_000001,chr1,11873,29370,chr1:11873-29370,DDX11L1,+\nXLOC_000002,chr1,11873,29370,chr1:11873-29370,WASH7P,+\nXLOC_000003,chr1,30365,30503,chr1:30365-30503,MIR1302-10,+\n```\n\n## Steps involved\n\nThe `R` script is based on the `GenomicRanges` package (`library(GenomicRanges)`).\n\n### Stage 1: Find the overlap within the transcript\n\n- **step A**. find the overlapping ranges between the CpG and the LOC\n- **step B**. add a column to the betas file (`betas_m`) and to the expression file (`expr_m`) with the overlapping region in each case and create a total matrix (`info.within.1.2`)\n- **step C**. save the total matrix within the transcript\n\n### Stage 2: Find the overlap within the TSS and the `+` strand\n\n- **step A**. 
find the transcription start site (TSS) of the 5'-3' transcript\n- **step B**. find the overlapping ranges between the CpG and the LOC\n- **step C**. add a column to the betas file (`betas_m`) and to the expression file (`expr_m`) with the overlapping region in each case and create a total matrix (`info.tss.pos.1`)\n- **step D**. save the total matrix within the TSS and the `+` strand\n\n### Stage 3: Find the overlap within the TSS and the `-` strand\n\n- **step A**. find the TSS of the 3'-5' transcript\n- **step B**. find the overlapping ranges between the CpG and the LOC\n- **step C**. add a column to the betas file (`betas_m`) and to the expression file (`expr_m`) with the overlapping region in each case and create a total matrix (`info.tss.neg.1.2`)\n- **step D**. save the total matrix within the TSS and the `-` strand\n\n## References\n\n[1] Cole Trapnell, Adam Roberts, Loyal Goff, Geo Pertea, Daehwan Kim, David R. Kelley, Harold Pimentel, Steven L. Salzberg, John L. Rinn \u0026 Lior Pachter, \"_Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks_\", Nature Protocols 7, 562–578 (2012), doi:10.1038/nprot.2012.016.\n\n_Note: Testing data can be retrieved from the [genome/gms repository](https://github.com/genome/gms/wiki/HCC1395-WGS-Exome-RNA-Seq-Data)._\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffpsom%2Fngs-data-integration","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffpsom%2Fngs-data-integration","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffpsom%2Fngs-data-integration/lists"}