https://github.com/gersteinlab/svfx
Machine learning framework to quantify pathogenicity of structural variants
- Host: GitHub
- URL: https://github.com/gersteinlab/svfx
- Owner: gersteinlab
- License: MIT
- Created: 2019-08-08T17:27:26.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2020-12-03T12:54:38.000Z (over 4 years ago)
- Last Synced: 2025-04-03T04:41:24.686Z (about 2 months ago)
- Language: Python
- Size: 4.08 MB
- Stars: 12
- Watchers: 25
- Forks: 5
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE.md
README
# SVFX: Using Random Forests to Prioritize Structural Variants
SVFX is a machine-learning-based tool that assigns pathogenicity scores to large deletions and duplications.
## Setup
Install `conda` using these [directions](https://docs.conda.io/projects/conda/en/latest/user-guide/install/).
On macOS, you can create and activate the provided environment with

`conda env create -f SVFXenv.yml && conda activate SVFXenv`

Otherwise, run

`conda env create -f SVFXenv_all_platforms.yml && conda activate SVFXenv`

The SVFX pipeline requires Python 3.6+.
## Configuring the pipeline
The pipeline is built using [Snakemake](https://snakemake.readthedocs.io/); to configure elements of the pipeline, modify the `config.yaml` file in the root directory.

### Configurable elements
(An illustrative `config.yaml` sketch follows this list.)
**`run_id`**: A string to identify this run; this string will be prepended to the files generated by this run so that multiple runs can be executed without cleanup.

**`coordinates`**: `control` and `disease` should have the file paths for the control and disease coordinates, respectively.
**`bigwig_files`**: A hyphenated list of paths to BigWig files to be used in feature computation.

**`overlap_coordinate_files`**: A hyphenated list of paths to coordinate files for computing overlap as a feature (e.g. coordinates of certain genes).
**`normalized`**: Boolean flag for whether to length-normalize the feature values.
**`variant_type`**: What kind of variants are in the input files (either `DEL` or `DUP`)
**`ref_genome`**: Index file (`*.fai`) for a reference genome
**`randomized_num`**: If `normalized` is `True`, the number of pseudo-matrices of variants to normalize with.
**`output_matrix`**: `control` and `disease` should have the root output names for the control feature matrix and the disease feature matrix, respectively (these will be combined into a file with the root name `combined_matrix`).
**`columns_to_remove`**: File with indices for the columns in the feature matrix that should not be considered in the model (e.g. the indices of the feature coordinates). The default file should be fine for the matrices generated from the scripts we provide.

**`class_label_index`**: Column with the class label. Again, the default value should be fine for matrices generated from our scripts.
**`num_trees`**: Number of trees in the random forest
**`max_depth`**: The maximum depth of a tree in the random forest
**`min_samples_split`**: The minimum number of samples to split an internal node in a tree in the random forest
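For orientation, here is a minimal sketch of what a `config.yaml` for a deletion run might look like. All paths and values are illustrative, and the exact key layout (flat vs. nested) should be checked against the `config.yaml` shipped in the repository:

```yaml
run_id: "del_test"                        # prepended to all generated files
coordinates:
  control: "data/control_SVs.txt"         # illustrative path
  disease: "data/disease_SVs.txt"         # illustrative path
bigwig_files: "data/gcPercentage.bw"      # hyphenated list if more than one
overlap_coordinate_files: "data/gc19_pc.prom.nr.bed-data/ultra.conserved.hg19.bed"
normalized: True
variant_type: "DEL"
ref_genome: "data/hg19.fa.fai"
randomized_num: 1
output_matrix:
  control: "control_matrix"               # root name; combined into combined_matrix
  disease: "disease_matrix"
columns_to_remove: "remove_indices.txt"
class_label_index: 0
num_trees: 150
max_depth: 10                             # illustrative value
min_samples_split: 2
```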
## Running the entire pipeline
Simply run `snakemake --cores=8` from the root directory.
### Running parts of the pipeline
To just run a part of the pipeline, modify `config.yaml` and `Snakefile` with the configuration and input/output you want, then run `snakemake --cores=8 -R --until (rule name)`.
For example, if you have a list of coordinates and just want to generate the feature matrix for it (which is useful if you want to use a preexisting model to score additional variants), you can modify `Snakefile` to set the control or disease input to your file (for the other input, you can just put an arbitrary coordinates file, then delete the output matrix corresponding to that input). You can then run `snakemake --cores=8 -R --until generate_feature_matrix`, and the matrix will be generated.
## Using procedures outside of the pipeline
Usage of the individual programs in the pipeline is explained below.
### Necessary Python package installations
`pip3 install scikit-allel`
`pip3 install pyBigWig`
`pip3 install argparse`
`pip3 install scikit-learn`

### `generate_feature_matrix.py`
Usage: `python3 src/generate_feature_matrix.py -c [SV files in tab-separated coordinate format (each line should start with chr)] -b [Feature files in BigWig format] -g [Coordinates for interval overlap-based features] -o [File root for output] -f [Present if the class label is 1; omit otherwise] -t [SV type (e.g. DEL, DUP, INV)] -z [Present if the matrix should be Z-score normalized (the length feature will not be normalized)] -r [Number of shuffled coordinates to generate for the randomization process] -rg [Reference genome index file - necessary for the -r flag] -l [If present, length will be included as a feature (and will be ignored during normalization)]`
Sample command (run from the root directory): `python3 src/generate_feature_matrix.py -c data/sample_input_SV.txt -b data/gcPercentage.bw -g data/gc19_pc.prom.nr.bed data/gc19_pc.3utr.nr.bed data/gc19_pc.5utr.nr.bed data/gc19_pc.cds.nr.bed data/gc19_pc.ss.nr.bed data/sensitive.nc.bed data/ultra.conserved.hg19.bed data/wgEncodeBroadHmmGm12878HMM.Heterochrom.bed data/H1-ESC_Dixon2015-raw_TADs.bed -o sample_output_matrix -t DEL -z -r 1 -rg data/hg19.fa.fai`
Given an input list of structural variants, generates a matrix of feature values calculated from the given BigWig and coordinate-overlap files for input to a model.
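The usage string only stipulates tab-separated coordinates with each line starting with `chr`; assuming a chromosome/start/end layout (the exact columns expected should be checked against `data/sample_input_SV.txt`), an input file might look like:

```
chr1	1000000	1050000
chr2	500000	2300000
chrX	2500000	2600000
```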
### `rf_model.py`
Usage: `python3 src/rf_model.py -i [input feature matrix] -d [File of indices of features to ignore (one per line)] -t [Index of the class label in the feature matrix] -n [Number of trees in the forest] -m [Maximum depth of a tree in the forest] -s [Minimum number of samples required to split an internal node] -c [Number of pathogenic SVs in the matrix] -l [Total number of SVs in the matrix] -o [Output root filename]`
Sample command: `python3 src/rf_model.py -i input_del_matrix.ZscoreNormalized.training.tsv -d remove_indices.txt -t 0 -n 150 -c 3613 -l 7225 -o output_model_`
Given an input feature matrix, trains a set of 10 random forest classifiers on disjoint tenths of the dataset, then predicts class labels for each member of the dataset with the models that did not train on them. Saves the created models, as a Python list of trained sklearn models, to `[file root]_ten_models.pkl`. Saves the indices used to split the diseased data into tenths in `[file root]_disease_indices.pkl`, and those used to split the control data in `[file root]_kg_indices.pkl`.
Predictions on the training data (from the models that did not train on each specific SV) are output in dictionary format to `[file root]_predictions.pkl`. Finally, AU(ROC/PRC) values on these predictions are calculated and written to `[file root]_ROC_PRC.txt`.

Note: The input matrix must be formatted in such a way that all control-group rows follow all non-control rows (in other words, all rows with label 0 follow all rows with label 1). The easiest way to do this is to use `generate_feature_matrix.py` to generate the 0-label and 1-label matrices, then append the 0-label matrix (without its title row) to the 1-label matrix.
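To make the out-of-fold scheme concrete, here is a minimal sketch of one plausible reading (each forest trains on nine tenths of the rows and scores only the held-out tenth). The function name is hypothetical, and the shipped `rf_model.py`, which additionally splits the disease and control indices separately, is authoritative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_ten_models(X, y, n_trees=150, max_depth=None, min_samples_split=2):
    X, y = np.asarray(X), np.asarray(y)
    indices = np.arange(len(X))
    folds = np.array_split(indices, 10)          # ten disjoint tenths
    models, oof_scores = [], np.empty(len(X))
    for held_out in folds:
        train = np.setdiff1d(indices, held_out)  # the other nine tenths
        clf = RandomForestClassifier(n_estimators=n_trees,
                                     max_depth=max_depth,
                                     min_samples_split=min_samples_split)
        clf.fit(X[train], y[train])
        models.append(clf)
        # Score each SV only with a model that never trained on it.
        oof_scores[held_out] = clf.predict_proba(X[held_out])[:, 1]
    return models, oof_scores
```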
### `load_model.py`
Tests the model on a matrix. Note that the matrix must have the same format as the matrices used to train the model (every feature in the same order as in the training data, with no additional features). To generate a matrix from a list of coordinates, see [here](#running-parts-of-the-pipeline).
You will need to `pip3 install joblib` before this can be run. If any dependencies are still missing, install them with pip3.
Usage: `python3 src/load_model.py -i [input matrix] -d [File of columns to ignore for model training; the same as columns_to_remove in the config file. remove_inds.txt should be fine for matrices generated from our scripts] -m [The .pkl file with all 10 models; this is output by the Snakemake pipeline] -t [Index of the class label in the matrix - use 0 if the matrix was generated using our code] -o [Filename to write the predictions to]`
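As a rough illustration of consuming `[file root]_ten_models.pkl` (which, per the description above, holds a Python list of trained sklearn models), here is a hypothetical snippet. Averaging the ten per-model scores is an assumption; how `load_model.py` actually combines them should be checked against the script:

```python
import pickle
import numpy as np

def score_matrix(models_pkl_path, X):
    # The pipeline pickles a plain Python list of ten trained forests.
    with open(models_pkl_path, "rb") as f:
        models = pickle.load(f)
    # Assumption: average the per-model pathogenicity probabilities.
    per_model = [m.predict_proba(X)[:, 1] for m in models]
    return np.mean(per_model, axis=0)
```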
### `roc_prc_gen.py`
Usage: `python3 roc_prc_gen.py [input file] [output file]`
Given a tab-separated two-column input file of predicted scores and true labels (with a header, as in the following example):

`Predicted	True`
`0.755	1`
`...`
`0.211	0`

Outputs AU(ROC/PRC) information, along with a plot of the ROC, saved to the specified output file.
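For reference, the AU(ROC/PRC) numbers for such a file can be reproduced with scikit-learn; this illustrates the computation only and is not the contents of `roc_prc_gen.py` (the file name below is hypothetical):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Load the two-column "Predicted / True" file, skipping the header row.
data = np.loadtxt("predictions.tsv", skiprows=1)
scores, labels = data[:, 0], data[:, 1]
print("AUROC:", roc_auc_score(labels, scores))
print("AUPRC:", average_precision_score(labels, scores))
```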