# Computational Genomics Final Project

This repository contains a final project for COMS 4761 -- Computational
Genomics at Columbia University in the City of New York.

* GitHub repo: https://github.com/k3jph/coms4761
* Free software: MIT

Our project replicates aspects of
[GRADIS](https://github.com/MonaRazaghi/GRADIS), a method for the supervised
learning of gene regulatory networks based on graph distance profiles
of transcriptomics data. We have replicated the basic
analysis in R and used classifiers other than SVM. In addition, to
simplify some of the processing, we have exported the data from
Excel to CSV files.

## Features

All referenced files can be found in `src`, which is organized as follows:

* `data` folder containing *E. coli* and *S. cerevisiae* inputs in the format that GRADIS expects. It also includes the raw DREAM5 data for both organisms in `raw` and the preprocessing script that converts the raw data into those input files.
* `gnn` folder containing the GRGNN implementation from , as well as the output results used in our analysis. Includes a separate README file with quickstart instructions.
* `R` folder containing the majority of our GRADIS implementation scripts, as translated from the original MATLAB implementation here: .

## Quickstart

**Data Transformation**

The GRADIS scripts expect three input files:
* *Genes.csv* with the list of gene names for the organism
* *Network.csv* listing labeled TF-gene interactions (where 1 indicates a positive interaction)
* *Expression.txt* with collected gene expression levels. Each row represents a sample.
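
A quick sanity check of the three inputs can catch formatting problems before a long training run. The sketch below is only an illustration under assumed layouts, which this README does not specify: one gene name per line in *Genes.csv*, comma-separated `TF,gene,label` rows in *Network.csv*, and whitespace-delimited numeric values in *Expression.txt* with one column per gene. The function name `check_inputs` is hypothetical and is not part of the repository.

```python
import csv
from pathlib import Path


def check_inputs(directory):
    """Cross-check Genes.csv, Network.csv, and Expression.txt in `directory`.

    Assumed layouts (not confirmed by the repository):
      Genes.csv      - one gene name per line, no header
      Network.csv    - rows of TF,gene,label with no header
      Expression.txt - whitespace-delimited numbers, one row per sample,
                       one column per gene
    Returns a list of problem descriptions; an empty list means the files
    are mutually consistent under these assumptions.
    """
    directory = Path(directory)
    problems = []

    # Collect gene names, skipping blank lines.
    with open(directory / "Genes.csv", newline="") as handle:
        genes = [row[0].strip() for row in csv.reader(handle)
                 if row and row[0].strip()]
    known = set(genes)

    # Every TF and target gene in the network should be a known gene.
    with open(directory / "Network.csv", newline="") as handle:
        for line_no, row in enumerate(csv.reader(handle), start=1):
            if not row:
                continue
            if len(row) != 3:
                problems.append(f"Network.csv line {line_no}: expected 3 fields")
                continue
            tf, gene, _label = (field.strip() for field in row)
            for name in (tf, gene):
                if name not in known:
                    problems.append(
                        f"Network.csv line {line_no}: unknown gene {name}")

    # Every sample row should hold one numeric value per gene.
    with open(directory / "Expression.txt") as handle:
        for line_no, line in enumerate(handle, start=1):
            values = line.split()
            if not values:
                continue
            if len(values) != len(genes):
                problems.append(
                    f"Expression.txt line {line_no}: {len(values)} values "
                    f"for {len(genes)} genes")
                continue
            try:
                [float(value) for value in values]
            except ValueError:
                problems.append(
                    f"Expression.txt line {line_no}: non-numeric value")
    return problems
```

Calling `check_inputs("data")` and printing the returned list is enough for a manual check. If the real files carry headers or use a different delimiter, the parsing above needs to be adjusted accordingly.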

Raw DREAM5 data for *E. coli* (Network 3) and *S. cerevisiae* (Network 4) are available in `data/raw`, as pulled from . To transform these files into the format expected by our GRADIS implementation (discussed below), navigate to the `data` directory and run the following command:
```
python data_processing.py <network_id>
```
where `network_id` is 3 for *E. coli* and 4 for *S. cerevisiae*.

## Usage

### GNN

Run

    bash install.sh

to install the required software and libraries. [Node2vec](https://github.com/aditya-grover/node2vec) and [DGCNN](https://github.com/muhanzhang/pytorch_DGCNN) are included in the software folder.

Unzip the DREAM5 data:

    cd data/dream
    unzip dreamdata.zip
    cd ../../

(Optional) Preprocess the DREAM5 data:

    cd preprocessing
    python Preprocessing_DREAM5.py 3
    python Preprocessing_DREAM5.py 4

In this program, `data3` refers to the *E. coli* dataset and `data4` to the *S. cerevisiae* dataset.

To train and test on *E. coli* with hop 1 and embeddings, run:

    python Main_inductive_ensemble.py --traindata-name data3 --testdata-name data3 --hop 1 --use-embedding

To train and test on *S. cerevisiae* with hop 1 and embeddings, run:

    python Main_inductive_ensemble.py --traindata-name data4 --testdata-name data4 --hop 1 --use-embedding
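
When both organisms are evaluated repeatedly, the two invocations above can be generated from one place. The following is a minimal sketch and not part of the repository: the helper name `build_command` and the `DATASETS` mapping are hypothetical, while the script name and flags are copied from the commands above. By default it only prints the commands; the `--run` switch, introduced here purely for illustration, executes them and assumes the current directory contains `Main_inductive_ensemble.py`.

```python
import subprocess
import sys

# Dataset names as used by the GNN scripts:
# data3 is E. coli, data4 is S. cerevisiae.
DATASETS = {"E. coli": "data3", "S. cerevisiae": "data4"}


def build_command(dataset, hop=1, use_embedding=True):
    """Return the argv list for one train-and-test run on a single dataset."""
    command = [
        sys.executable,  # the interpreter running this sketch
        "Main_inductive_ensemble.py",
        "--traindata-name", dataset,
        "--testdata-name", dataset,
        "--hop", str(hop),
    ]
    if use_embedding:
        command.append("--use-embedding")
    return command


def main(argv):
    run = "--run" in argv  # hypothetical switch; without it, nothing executes
    for organism, dataset in DATASETS.items():
        command = build_command(dataset)
        print(f"{organism}: {' '.join(command)}")
        if run:
            # check=True stops at the first failing run.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    main(sys.argv[1:])
```

Keeping the dry run as the default makes it easy to inspect the exact commands before committing to a long training job.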

### R

To test three different classifiers for distinguishing
positive from negative interactions, for *E. coli* use:

    Rscript src/R/glmmer-ec.R

For *S. cerevisiae*, use:

    Rscript src/R/glmmer-sc.R

Both scripts include a vanilla GLM, random forest, and
naïve Bayes as classifiers. The code is built on top
of the R library [caret](https://topepo.github.io/caret/),
so switching classifiers is trivial.

To identify and train negatives over the *E. coli* data,
use:

    Rscript src/R/gradis-neg-ec.R

To identify and train negatives over the *S. cerevisiae* data,
use:

    Rscript src/R/gradis-neg-sc.R

Both scripts use random forest by default, but also
use caret, so switching is, again, trivial. However,
even fast-training models take multiple hours, and
random forest requires up to 24 hours on recent
Apple silicon.

## For more information

* "[Supervised learning of gene-regulatory networks based on graph distance profiles of transcriptomics data](https://www.nature.com/articles/s41540-020-0140-1)"
* "[Inductive inference of gene regulatory network using supervised and semi-supervised graph neural networks](https://www.sciencedirect.com/science/article/pii/S200103702030444X)"
* James P. Howard, II <>