https://github.com/soedinglab/tejaas

Tejaas - a tool for discovering trans-eQTLs
https://github.com/soedinglab/tejaas
eqtl eqtl-mapping gene-expression gtex reverse-regression trans-eqtl
Last synced: about 1 year ago
JSON representation
Tejaas - a tool for discovering trans-eQTLs
Host: GitHub
URL: https://github.com/soedinglab/tejaas
Owner: soedinglab
License: gpl-3.0
Created: 2017-10-13T09:11:45.000Z (over 8 years ago)
Default Branch: master
Last Pushed: 2022-10-14T18:42:25.000Z (over 3 years ago)
Last Synced: 2025-04-24T11:52:00.417Z (about 1 year ago)
Topics: eqtl, eqtl-mapping, gene-expression, gtex, reverse-regression, trans-eqtl
Language: C
Homepage: https://github.com/soedinglab/tejaas/wiki
Size: 36.8 MB
Stars: 11
Watchers: 4
Forks: 0
Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project

README

          # Tejaas: Discover trans-eQTLs

## Description

Tejaas is a command line tool to find trans-eQTLs from eQTL data.

It is released under the GNU General Public License version 3.

Tejaas is based on the hypothesis that a trans-eQTL should regulate the expression levels of multiple genes.

In brief, it implements two statistical methods to find trans-eQTLs:

- **RR-score (Reverse Regression)**: It performs a multiple linear regression with L₂-regularization 

using expression levels of all genes to explain the genotype of a candidate SNP.

In contrast to conventional methods, the direction of the regression is reversed, with the gene expressions as explanatory variables.

RR-score is a statistic which estimates whether more genes are required to explain the allele counts of a SNP than expected by chance.

- **JPA-score (Joint P-value Analysis)**: It evaluates the distribution of p-values of the pairwise linear association of a candidate SNP with all available gene expression levels. 

Any null SNP (no trans-effect) will have a uniform distribution of p-values,

while a trans-eQTL will be associated with more genes than expected by chance, leading to overdispersion near zero. 

The JPA-score is a statistic which estimates whether the distribution of p-values is significantly overdispersed near zero.

Additionally, it also implements a non-linear unsupervised confounder correction using k-nearest neighbors called **KNN correction**.

## Dependencies

- Python version 3.6 or higher

- C compiler

- Any linear algebra routine, e.g. Intel MKL, OpenBLAS, etc.

- Any flavor of Message Passing Interface (MPI), e.g. OpenMPI, MPICH, etc.

- Other Python libraries:

  - [NumPy](http://www.numpy.org/) / array operations

  - [SciPy](https://www.scipy.org/) / optimization and other special functions

  - [statsmodel](http://www.statsmodels.org/stable/index.html) / used for ECDF calculation in JPA-score

  - [Pygtrie](https://pypi.org/project/pygtrie/) / used for reading MAF file in RR-score / maf null

  - [mpi4py](https://mpi4py.readthedocs.io/en/stable/) / linked to MPI and MKL for python parallelization

  - [scikit-learn](https://scikit-learn.org/stable/index.html) / used for PCA decomposition in KNN correction

## Installation

See the wiki for detailed [installation instructions](https://github.com/soedinglab/tejaas/wiki/Installation). 

Here is a quick start guide:

### 1. Install dependencies

Installation of Tejaas depends on other dependencies. If you are using `conda`, then they can be installed using

```

conda install numpy scipy statsmodels scikit-learn

pip install mpi4py

```

### 2. Install Tejaas

2a. Tejaas can be installed directly from the PyPI repository:

```

pip install tejaas

```

2b. You can also download this repository and build Tejaas:

```

git clone git@github.com:soedinglab/tejaas.git

cd tejaas

pip install -e .

```

### 3. Run Tejaas!

```

tejaas [OPTIONS]

```

See below for valid options or try `tejaas --help`.

#### Run an example to check installation

An example script `example/run_example.sh` is provided to check the installation.

```

cd tejaas/example

./run_example.sh  

```

The script will download some example input files in `/data` and run Tejaas on `` cores.

The output will be created in `/data`. 

Note that the `data` subdirectory is automatically created within the `` specified by the user.

Check if the output is correct:

```

python compare_with_gold.py --outdir 

```

This will check if the output in `/data` 

(again note that `data` subdirectory is automatically appended and need not be specified by the user)

matches with the results provided in the `example/gold` subdirectory.

## Input Files

- Gene expression file

- Genotype file

  - VCF

  - Oxford

  - Dosage

- GENCODE file

- Population minor allele frequency

## Tejaas [OPTIONS]

Option                  | Argument | Description | Priority | Default value

:--- | :--- | :---        |:---      | :--

`--vcf`       | `FILEPATH` | Input genotype file in vcf.gz format | Required (vcf or oxf) | --

`--oxf`       | `FILEPATH` | Input genotype file in Oxford format | Required (vcf or oxf) | --

`--dosage`    |            | Flag for reading dosage files. The file is specified with the `--oxf` option, e.g. `--oxf FILEPATH --dosage` | Optional | `False` 

`--fam`       | `FILEPATH` | Input fam file for samples names of Oxford genotype | Optional | -- 

`--chrom`     | `INT`      | Chromosome number of the genotype file | Required | -- 

`--include-SNPs` | `START:END` | Colon-separated index of SNPs to be included | Optional | -- 

`--gx`        | `FILEPATH` | Input gene expression file for trans-eQTL discovery | Required | --

`--gxcorr`    | `FILEPATH` | Input gene expression file for target gene discovery | Optional | `--gx` file

`--gxfmt`     | `OPTION`   | Input gene expression file format (see format details below). Supported options: `gtex`, `cardiogenics`, `geuvadis` | Optional | `gtex`

`--gtf`       | `FILEPATH` | Input GTF file from GENCODE to read gene Ensembl IDs. Used for selecting biotypes and getting genomic locations. | Required | --

`--trim`      |            | Flag to trim version number from GENCODE Ensembl IDs | Optional | `False`

`--biotype`   | `OPTION`   | Which biotypes to select from the GTF file. Supported options: `protein_coding`, `lncRNA`. | Optional | `protein_coding lncRNA`

`--outprefix` | `STRING`   | Full path to output file names. The extensions are generated by Tejaas. | Optional | `out`  

`--method`    | `OPTION`   | Name of method to run. Supported options: `jpa` or `rr` | Optional | `rr`

`--null`      | `OPTION`   | Null model to use for RR-score. Supported options: `perm` or  `maf` | Optional | `perm` 

`--cismask`   |            | Flag to mask cis-Genes within a window  for each candidate SNP. Gene positions are obtained from the GENCODE annotation file.  | Optional | `False`

`--window`    | `FLOAT`    | Window (number of base pairs) used for masking cis genes | Optional | 1e6 

`--prior-sigma`| `FLOAT`   | Standard deviation of the normal prior for reverse multiple linear regression | Optional | 0.1

`--knn`       | `INT`      | Number of neighbours for KNN (use 0 if you do not want to use KNN) | Optional | 0

`--psnpthres` | `FLOAT`    | Target genes will be reported only for trans-eQTLs below this threshold p-value for RR/JPA-score | Optional | 0.0001

`--pgenethres`| `FLOAT`    | Target genes will be reported only if their association with trans-eQTLs are below this threshold p-value | Optional | 0.05

`--jpanull`   | `FILEPATH` | File containing list of null model JPA-scores | Optional | -- 

`--maf-file`  | `FILEPATH` | Read minor allele frequency (MAF) of SNPs from this file, e.g. to read population MAF for `maf` null (see documentation for file format) | Optional | -- 

`--shuffle`   |            | Flag to randomly shuffle the genotypes to obtain a null distribution | Optional | `False`

`--shuffle-with` | `FILEPATH` | Shuffle the genotypes in the same order of donor IDs specified in `FILEPATH` | Optional | --

## Usage Examples

1. For quick start or installation check, run Tejaas with all default options:

```

tejaas --vcf ${VCFFILE} --chrom ${CHRM} --gx ${GXFILE} --gtf ${GTFFILE} --cismask --outprefix ${OUTPREFIX}

```

This will create RR-scores at γ=0.1 and masking all genes within 1Mb of each SNP. The p-values will be computed from the permuted null model.

Default format for the gene expression is the same as the GTEx format, and default gtf file is the [GENCODE v26](https://www.gencodegenes.org/human/release_26.html) release.

For target gene discovery, it will use the same file as used for trans-eQTL discovery.

2. Example of running Tejaas RR-score.

We recommend using the `perm` null model for calcuting p-values from the RR-score

and a separate confounder-corrected gene expression file for target gene discovery.

In this example, RR-score is calculated for first 1000 SNPs excluding the first 20 `--include-SNPs 21:1000`.

KNN correction is performed with 20 nearest neighbors `--knn 20`.

All cis-genes within +2MB and -2Mb are masked during analysis `--cismask --window 2e6`.

RR-score calculation uses a prior normal distribution with standard deviation of 0.05 `--prior-sigma 0.05`.

The output reports target genes only for SNPs with p-value < 1e-6 `--psnpthres 0.000001`.

Here, `GXFILE` is the raw gene expression file, `GXCORRFILE` is the confounder-corrected gene expression file,

`VCFFILE` is the genotype file in `.vcf.gz` format and `GTFFILE` is the GENCODE annotation file.

```

mpirun -n 8 tejaas --vcf ${VCFFILE} --chrom ${CHRM} --include-SNPs 21:1000 --gx ${GXFILE} --gxcorr ${GXCORRFILE} \

                   --gxfmt gtex --gtf ${GTFFILE} --trim  --outprefix ${OUTPREFIX} \

                   --cismask --window 2e6 --psnpthres 0.000001 \

                   --knn 20 --method rr --null perm --prior-sigma 0.05

```

3. Example of running JPA-score with no KNN correction. 

Empirical p-values are calculated from the null scores loaded from `NULLFILE` specified by the `--jpanull` option.

If `NULLFILE` does not exist, then it will create `100000` null scores and write them in the `NULLFILE` before calculating JPA-scores.

If `--jpanull` option is not used, then p-values for the JPA-scores are calculated from an analytical construction of null model.

```

mpirun -n 8 tejaas --vcf ${VCFFILE} --chrom ${CHRM} --include-SNPs 1:100 \

                   --gx ${GXFILE} --gxfmt gtex --gtf ${GTFFILE} --outprefix ${OUTPREFIX} \

                   --knn 0 --method jpa --jpanull ${NULLFILE}

```

4. Example of parallelizing job submission.

```

NMAX=20000 # number of SNPs per job

for CHRM in $( seq 1 22 ); do

    VCFFILE="file_path_here_${CHRM}.vcf.gz"

    NTOT=$( calculate_no_of_SNPs_in_this_chromosome )

    NJOB=$( echo $(( (NTOT + NMAX - 1)/NMAX )) )

    for (( i=0; i < ${NJOB}; i++ )); do

        STARTSNP=$(( NMAX * i + 1 ))

        ENDSNP=$(( NMAX * (i + 1) ))

        if [ ${ENDSNP} -gt ${NTOT} ]; then

            ENDSNP=${NTOT}

        fi

        mpirun -n 8 tejaas --vcf ${VCFFILE} --chrom ${CHRM} --include-SNPs ${STARTSNP}:${ENDSNP} --gx ${GXFILE} --gxcorr ${GXCORRFILE} \

                           --gxfmt gtex --gtf ${GTFFILE} --trim  --outprefix ${OUTPREFIX} \

                           --cismask --psnpthres 0.000001 --knn 20 --method rr --null perm --prior-sigma 0.05

    done

done

```

## Contributors



  



## Version History

1.0.2 Bump release version
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/soedinglab/tejaas

Awesome Lists containing this project

README