Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/asmagen/spagefinder
Computational approach to identify Survival associated Pairwise Gene Expression states.
https://github.com/asmagen/spagefinder
genetic genetic-interactions interaction survival-analysis tcga-data
Last synced: 2 months ago
JSON representation
Computational approach to identify Survival associated Pairwise Gene Expression states.
- Host: GitHub
- URL: https://github.com/asmagen/spagefinder
- Owner: asmagen
- License: mit
- Created: 2018-12-26T18:55:34.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2023-07-16T20:50:22.000Z (over 1 year ago)
- Last Synced: 2024-05-21T09:39:17.116Z (8 months ago)
- Topics: genetic, genetic-interactions, interaction, survival-analysis, tcga-data
- Language: R
- Homepage: http://dx.doi.org/10.1016/j.celrep.2019.06.067
- Size: 39.4 MB
- Stars: 8
- Watchers: 2
- Forks: 4
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
Awesome Lists containing this project
README
[![DOI](https://zenodo.org/badge/163208790.svg)](https://zenodo.org/badge/latestdoi/163208790)
# SPAGE-finder Manual (*[Magen](https://assafmagen.com) et al*)
This repository contains code and documentation for a multi-step computational pipeline to search for Survival associated Pairwise Gene Expression states finder (SPAGE-finder).
Previously, the repository was called EnGIne. The [paper](http://dx.doi.org/10.1016/j.celrep.2019.06.067) describing the project can be found at:
http://dx.doi.org/10.1016/j.celrep.2019.06.067The repository can be retrieved from GitHub via the command
*git clone https://github.com/asmagen/SPAGEfinder*This creates a new directory called SPAGEfinder where this README.md can be found. If one reads README.md as a text file, there will be asterisks around the variable names; do not copy the asterisks when copying a command to assign a value to a variable.
SPAGEfinder has two subdirectories:
1. data
2. R# Code files
There are seven code files:
- aggregateLogRankGene.cpp
- analyze.pairwise.significance.R
- calculate.base.cox.model.R
- calculate.candidates.cox.fdr.R
- main.script.functions.R
- main.script.R
- merge.pancancer.results.RSix of these are in R and one in C++.
The C++ code is compiled from within R. Therefore, one need not compile it before running the pipeline.## Input format
The input includes mRNA expression matrix and patient clinical-demographic information which are stored in a single data source object and corresponding file. The TCGA dataset (filtered to Census Cancer Genes) is provided here for quick analysis (Example dataset available: 'data/data.mRNA.RData').## Output format
Final SPAGEs list is generated in a matrix format where each row represents a SPAGE that is annotated by a quadruple *(x,y,bin,effect)*, where x and y are the two interacting genes, bin is a number indicating the bin annotation and effect annotated the significance level where the sign of the effect represents the direction of the interaction; Positive sign represents higher survival risk while negative sign represents lower survival risk.## Analysis setup
### UNIX commands
Connect to remote server [Example: `ssh USER_NAME@SERVER_ADDRESS`]
```git clone https://github.com/asmagen/SPAGEfinder.git```Request an interactiveexecution session if required by your system [Example: `sinteractive`]
Invoke R version 3.3.1 [Example (may be different across systems): `module purge; module add R/3.3.1; R`]
The following commands are set and run in the R environment.### Install R packages
Install the required R packages into the default location (no need to specify where to install, enter 'yes' to indicate installation to personal library if asked).
```install.packages(pkgs = c('Rcpp','RcppArmadillo','survival','rslurm','foreach','doMC','data.table','igraph','whisker','foreach'))```
Specify a repository of choice and verify successful installations by loading packages (Example: library('Rcpp')).
`source("https://bioconductor.org/biocLite.R"); biocLite("survcomp")`
Verify successful installation.### Define relevant analysis paths (in R)
```
r.package.path = 'USER_SET_PACKAGE_PATH' # Define path for the downloaded pipeline scripts and data (Example: '/USER/SPAGEfinder')
results.path = 'USER_SET_ANALYSIS_DIRECTORY_PATH' # Define path for analysis results set by user (Example: '/USER/analysis/TCGA_analysis')
```
### Assign values to additional analysis and slurm parametersThe suggested values shown below may be adjusted by the user as needed.
*p.val.quantile.threshold* = 0.8 # Log-Rank threshold (p value quantile)
The next 7 parameters may need to be adjusted based on the specification of your high-performance computing system. Run the following command to obtain the info about the available queues, memory, walltime and num.jobs (number of concurrent jobs) resources:
```
sacctmgr show qos format=name,MaxJobs,MaxWall,MaxTRESName MaxJobs MaxWall MaxTRES
---------- ------- ----------- -------------
normal
default 16 01:00:00 mem=4G
throughput 125 18:00:00 mem=36G
high_thro+ 300 08:00:00 mem=8G
large 5 11-00:00:00 mem=128G
xlarge 1 21-00:00:00 mem=512G
long 16 7-00:00:00 mem=12G
workstati+ 4 7-00:00:00 mem=48G
```Based on this information and the scope of the analysis (whole-genome or in this example case, only about 500 genes) you would define the following parameters:
```
queues = 'throughput' # SLURM HPC queue
num.jobs = 50 # Number of concurrent jobs
walltime = '1:00:00' # Time limit per job
memory = '4GB' # Memory allocation per job
```
And the following parameters specifically for the merge.pancancer.results function as it requires more memory than the usual:
```
large.queues = 'throughput' # SLURM HPC queue
large.walltime = '1:00:00' # Time limit per job
large.memory = '36GB' # Memory allocation per job
```
The appropriate parameters for whole genome analysis (analysis of about 20k genes) are:
*num.jobs* = 120, *walltime* = '8:00:00', *memory* = '8GB', *large.queues* = 'large', *large.walltime* = '5:00:00', *large.memory* = '120GB'Note that in some systems there is no need to specify the queues parameter. If the queues specification above results in an error, use *queues = NA* and *large.queues = NA* to let the system choose the appropriate queues by itself.
Continue the analysis by executing the commands in 'R/main.script.R'
### Creating new datasets for analysis
The function preprocess.genomic.data (r.package.path) can be used to process and perform binning of a 'dataset' object located at 'data/dataset.RData') and constructed in the following fields:
- *mRNA* - RSEM normalized mRNA measurements (rows corresponding to genes and columns to samples)
- *scna* - copy-number variation measurements (rows corresponding to genes and columns to samples)
- *samples* - sample IDs as factors
- *type* - cancer types as factors
- *sex* - sex annotation as (two) factors
- *race* - race annotation as factors
- *time* - patient survival as number of days to death
- *status* - patient death/alive status as 0 or 1, respectivelyThe annotation or format of *samples, type, sex and race* is not important as long as the variables are converted to factors.
### Potential problems
- The following error may rarely come up in the *get_slurm_output* function:
```'slurm_load_jobs error: Socket timed out on send/recv operation'```
The error does not reflect the failure of the analysis but only a failure of monitoring the job execution. Simply rerun the *get_slurm_output* function.
- The following error may come up due to insufficient resource allocation:
'The following files are missing: ... Check failed jobs error outputs'[Assaf Magen Ph.D.](https://assafmagen.com) and [Ask Mendel AI](https://askmendel.ai)