https://github.com/soedinglab/tejaas

# Tejaas: Discover trans-eQTLs

## Description

Tejaas is a command-line tool to find trans-eQTLs from eQTL data.
It is released under the GNU General Public License version 3.

Tejaas is based on the hypothesis that a trans-eQTL should regulate the expression levels of multiple genes.
In brief, it implements two statistical methods to find trans-eQTLs:
- **RR-score (Reverse Regression)**: It performs a multiple linear regression with L2-regularization
using expression levels of all genes to explain the genotype of a candidate SNP.
In contrast to conventional methods, the direction of the regression is reversed, with the gene expressions as explanatory variables.
RR-score is a statistic which estimates whether more genes are required to explain the allele counts of a SNP than expected by chance.
- **JPA-score (Joint P-value Analysis)**: It evaluates the distribution of p-values of the pairwise linear association of a candidate SNP with all available gene expression levels.
Any null SNP (no trans-effect) will have a uniform distribution of p-values,
while a trans-eQTL will be associated with more genes than expected by chance, leading to overdispersion near zero.
The JPA-score is a statistic which estimates whether the distribution of p-values is significantly overdispersed near zero.
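
The reversed direction of the regression behind the RR-score can be illustrated with a minimal sketch. This is not the Tejaas implementation: the function names, the toy data and the penalty `lam` are assumptions made for illustration, and how `lam` would relate to the `--prior-sigma` option is not specified here. The sketch regresses the genotype of one SNP on the expression of all genes with an L2 penalty by solving (XᵀX + λI)β = Xᵀy, using only the Python standard library.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def reverse_ridge(genotype, expression, lam):
    """Regress a SNP genotype on gene expression with an L2 penalty.

    genotype:   centered allele counts, one value per sample
    expression: one list of centered gene expression values per sample
    lam:        hypothetical ridge penalty (illustrative only)
    Returns one regression coefficient per gene.
    """
    n_genes = len(expression[0])
    # X^T X + lam * I
    xtx = [[sum(s[i] * s[j] for s in expression) + (lam if i == j else 0.0)
            for j in range(n_genes)] for i in range(n_genes)]
    # X^T y
    xty = [sum(s[i] * y for s, y in zip(expression, genotype))
           for i in range(n_genes)]
    return solve_linear(xtx, xty)


# Toy data: 4 samples, 2 genes. Gene 0 tracks the genotype, gene 1 does not.
genotype = [-1.0, -1.0, 1.0, 1.0]
expression = [[-1.0, 1.0], [-1.0, -1.0], [1.0, 1.0], [1.0, -1.0]]
beta = reverse_ridge(genotype, expression, lam=1.0)
print(beta)  # the coefficient of gene 0 is shrunk towards zero, gene 1 stays at zero
```

In this toy case only gene 0 receives a non-zero coefficient. The RR-score itself is a statistic built on such a regression and is not computed by this sketch.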

Tejaas also implements a non-linear, unsupervised confounder correction based on k-nearest neighbors, called **KNN correction**.
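
The general idea of a nearest-neighbor correction can be sketched as follows. This is a simplified illustration under stated assumptions, not the Tejaas code: it assumes that each sample is corrected by subtracting the mean profile of its k nearest samples, it measures Euclidean distance directly on the expression values (the PCA step mentioned under Dependencies is skipped), and the function name `knn_correct` is hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_correct(expression, k):
    """Subtract from each sample the mean profile of its k nearest samples.

    expression: one list of gene expression values per sample
    k:          number of neighbours; 0 returns the data unchanged
    """
    if k == 0:
        return [sample[:] for sample in expression]
    corrected = []
    for i, sample in enumerate(expression):
        # Rank all other samples by their distance to the current sample
        others = [j for j in range(len(expression)) if j != i]
        others.sort(key=lambda j: squared_distance(sample, expression[j]))
        neighbours = others[:k]
        # Mean expression of the neighbours, gene by gene
        mean = [sum(expression[j][g] for j in neighbours) / len(neighbours)
                for g in range(len(sample))]
        corrected.append([x - m for x, m in zip(sample, mean)])
    return corrected


# Toy data: two pairs of samples separated by a large batch-like offset.
expression = [[1.0, 2.0], [1.5, 2.5], [5.0, 6.0], [5.5, 6.5]]
corrected = knn_correct(expression, k=1)
print(corrected)
```

After the correction, the offset between the two pairs of samples is gone and only the within-pair differences remain.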

## Dependencies

- Python version 3.6 or higher
- C compiler
- Any linear algebra routine, e.g. Intel MKL, OpenBLAS, etc.
- Any flavor of Message Passing Interface (MPI), e.g. OpenMPI, MPICH, etc.
- Other Python libraries:
  - [NumPy](http://www.numpy.org/): array operations
  - [SciPy](https://www.scipy.org/): optimization and other special functions
  - [statsmodels](http://www.statsmodels.org/stable/index.html): used for ECDF calculation in JPA-score
  - [Pygtrie](https://pypi.org/project/pygtrie/): used for reading the MAF file for the RR-score `maf` null
  - [mpi4py](https://mpi4py.readthedocs.io/en/stable/): linked to MPI and MKL for Python parallelization
  - [scikit-learn](https://scikit-learn.org/stable/index.html): used for PCA decomposition in KNN correction

## Installation

See the wiki for detailed [installation instructions](https://github.com/soedinglab/tejaas/wiki/Installation).
Here is a quick start guide:

### 1. Install dependencies
Tejaas requires the dependencies listed above. If you are using `conda`, they can be installed with:
```
conda install numpy scipy statsmodels scikit-learn
pip install mpi4py
```

### 2. Install Tejaas
2a. Tejaas can be installed directly from the PyPI repository:
```
pip install tejaas
```
2b. You can also download this repository and build Tejaas:
```
git clone git@github.com:soedinglab/tejaas.git
cd tejaas
pip install -e .
```

### 3. Run Tejaas!
```
tejaas [OPTIONS]
```
See below for valid options or try `tejaas --help`.

#### Run an example to check installation
An example script `example/run_example.sh` is provided to check the installation.
```
cd tejaas/example
./run_example.sh
```
The script downloads example input files into the `data` subdirectory of the output directory specified by the user and runs Tejaas on the specified number of cores.
The output is written to the same `data` subdirectory.
Note that the `data` subdirectory is created automatically within the directory specified by the user.
Check if the output is correct:
```
python compare_with_gold.py --outdir
```
This checks whether the output in the `data` subdirectory of the output directory
(again, note that the `data` subdirectory is appended automatically and need not be specified by the user)
matches the results provided in the `example/gold` subdirectory.

## Input Files
- Gene expression file
- Genotype file
  - VCF
  - Oxford
  - Dosage
- GENCODE file
- Population minor allele frequency

## Tejaas [OPTIONS]

Option                  | Argument | Description | Priority | Default value
:--- | :--- | :--- |:--- | :--
`--vcf` | `FILEPATH` | Input genotype file in vcf.gz format | Required (vcf or oxf) | --
`--oxf` | `FILEPATH` | Input genotype file in Oxford format | Required (vcf or oxf) | --
`--dosage` | | Flag for reading dosage files. The file is specified with the `--oxf` option, e.g. `--oxf FILEPATH --dosage` | Optional | `False`
`--fam` | `FILEPATH` | Input fam file with the sample names of the Oxford genotype | Optional | --
`--chrom` | `INT` | Chromosome number of the genotype file | Required | --
`--include-SNPs` | `START:END` | Colon-separated start and end indices of the SNPs to be included | Optional | --
`--gx` | `FILEPATH` | Input gene expression file for trans-eQTL discovery | Required | --
`--gxcorr` | `FILEPATH` | Input gene expression file for target gene discovery | Optional | `--gx` file
`--gxfmt` | `OPTION` | Input gene expression file format (see format details below). Supported options: `gtex`, `cardiogenics`, `geuvadis` | Optional | `gtex`
`--gtf` | `FILEPATH` | Input GTF file from GENCODE to read gene Ensembl IDs. Used for selecting biotypes and getting genomic locations. | Required | --
`--trim` | | Flag to trim version number from GENCODE Ensembl IDs | Optional | `False`
`--biotype` | `OPTION` | Which biotypes to select from the GTF file. Supported options: `protein_coding`, `lncRNA`. | Optional | `protein_coding lncRNA`
`--outprefix` | `STRING` | Full path to output file names. The extensions are generated by Tejaas. | Optional | `out`
`--method` | `OPTION` | Name of method to run. Supported options: `jpa` or `rr` | Optional | `rr`
`--null` | `OPTION` | Null model to use for RR-score. Supported options: `perm` or `maf` | Optional | `perm`
`--cismask` | | Flag to mask cis-genes within a window for each candidate SNP. Gene positions are obtained from the GENCODE annotation file. | Optional | `False`
`--window` | `FLOAT` | Window (number of base pairs) used for masking cis genes | Optional | 1e6
`--prior-sigma`| `FLOAT` | Standard deviation of the normal prior for reverse multiple linear regression | Optional | 0.1
`--knn` | `INT` | Number of neighbours for KNN (use 0 if you do not want to use KNN) | Optional | 0
`--psnpthres` | `FLOAT` | Target genes will be reported only for trans-eQTLs below this threshold p-value for RR/JPA-score | Optional | 0.0001
`--pgenethres`| `FLOAT` | Target genes will be reported only if the p-value of their association with the trans-eQTL is below this threshold | Optional | 0.05
`--jpanull` | `FILEPATH` | File containing list of null model JPA-scores | Optional | --
`--maf-file` | `FILEPATH` | Read minor allele frequency (MAF) of SNPs from this file, e.g. to read population MAF for `maf` null (see documentation for file format) | Optional | --
`--shuffle` | | Flag to randomly shuffle the genotypes to obtain a null distribution | Optional | `False`
`--shuffle-with` | `FILEPATH` | Shuffle the genotypes in the same order of donor IDs specified in `FILEPATH` | Optional | --
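
The effect of `--cismask` together with `--window` can be pictured with a small sketch. It is an illustration under assumptions, not the Tejaas implementation: the function name `cis_gene_mask` is hypothetical, genes are represented as `(chrom, start, end)` tuples, and a gene is treated as cis when it lies on the same chromosome as the SNP and overlaps the interval of ± `window` base pairs around the SNP position. The exact rule used by Tejaas may differ.

```python
def cis_gene_mask(snp_chrom, snp_pos, genes, window=1e6):
    """Return the indices of genes that would be masked for one SNP.

    genes:  list of (chrom, start, end) tuples, positions in base pairs
    window: masking distance in base pairs on either side of the SNP
    """
    lower, upper = snp_pos - window, snp_pos + window
    return [i for i, (chrom, start, end) in enumerate(genes)
            if chrom == snp_chrom and start <= upper and end >= lower]


# Toy annotation: two genes on chromosome 1 and one on chromosome 2.
genes = [(1, 500_000, 510_000), (1, 3_000_000, 3_050_000), (2, 500_000, 510_000)]

masked_default = cis_gene_mask(1, 1_000_000, genes)             # 1 Mb window
masked_wide = cis_gene_mask(1, 1_000_000, genes, window=2.5e6)  # wider window
print(masked_default, masked_wide)
```

With the default window only the first gene is masked. Widening the window also masks the second gene, while the gene on chromosome 2 is never masked for this SNP.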

## Usage Examples

1. For quick start or installation check, run Tejaas with all default options:
```
tejaas --vcf ${VCFFILE} --chrom ${CHRM} --gx ${GXFILE} --gtf ${GTFFILE} --cismask --outprefix ${OUTPREFIX}
```
This computes RR-scores with γ=0.1 (the default `--prior-sigma`) and masks all genes within 1 Mb of each SNP. The p-values are computed from the permuted null model.
The default gene expression format is the GTEx format, and the expected GTF file is the [GENCODE v26](https://www.gencodegenes.org/human/release_26.html) release.
For target gene discovery, Tejaas uses the same gene expression file as for trans-eQTL discovery.

2. Example of running Tejaas RR-score.
We recommend using the `perm` null model for calculating p-values from the RR-score
and a separate confounder-corrected gene expression file for target gene discovery.
In this example, the RR-score is calculated for the first 1000 SNPs, excluding the first 20 (`--include-SNPs 21:1000`).
KNN correction is performed with 20 nearest neighbors (`--knn 20`).
All cis-genes within ±2 Mb of each SNP are masked during analysis (`--cismask --window 2e6`).
The RR-score calculation uses a normal prior distribution with a standard deviation of 0.05 (`--prior-sigma 0.05`).
The output reports target genes only for SNPs with p-value < 1e-6 (`--psnpthres 0.000001`).
Here, `GXFILE` is the raw gene expression file, `GXCORRFILE` is the confounder-corrected gene expression file,
`VCFFILE` is the genotype file in `.vcf.gz` format and `GTFFILE` is the GENCODE annotation file.
```
mpirun -n 8 tejaas --vcf ${VCFFILE} --chrom ${CHRM} --include-SNPs 21:1000 --gx ${GXFILE} --gxcorr ${GXCORRFILE} \
--gxfmt gtex --gtf ${GTFFILE} --trim --outprefix ${OUTPREFIX} \
--cismask --window 2e6 --psnpthres 0.000001 \
--knn 20 --method rr --null perm --prior-sigma 0.05
```

3. Example of running JPA-score with no KNN correction.
Empirical p-values are calculated from the null scores loaded from `NULLFILE` specified by the `--jpanull` option.
If `NULLFILE` does not exist, then it will create `100000` null scores and write them in the `NULLFILE` before calculating JPA-scores.
If the `--jpanull` option is not used, then p-values for the JPA-scores are calculated from an analytical construction of the null model.
```
mpirun -n 8 tejaas --vcf ${VCFFILE} --chrom ${CHRM} --include-SNPs 1:100 \
--gx ${GXFILE} --gxfmt gtex --gtf ${GTFFILE} --outprefix ${OUTPREFIX} \
--knn 0 --method jpa --jpanull ${NULLFILE}
```
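
How an empirical p-value can be derived from a list of null scores is sketched below. This is a generic illustration, not the Tejaas code: the function name `empirical_pvalue` is hypothetical, larger scores are assumed to be more extreme, and the +1 pseudocount is a common convention that may not match what Tejaas does.

```python
def empirical_pvalue(score, null_scores):
    """Share of null scores at least as large as the observed score.

    A pseudocount of 1 is added to numerator and denominator so that
    the p-value is never exactly zero.
    """
    n_extreme = sum(1 for s in null_scores if s >= score)
    return (n_extreme + 1) / (len(null_scores) + 1)


# Toy null distribution with four scores and one observed score.
null_scores = [0.1, 0.5, 0.9, 1.3]
pval = empirical_pvalue(1.0, null_scores)
print(pval)  # one null score (1.3) is at least as large, so (1 + 1) / (4 + 1)
```

With `N` null scores, the smallest p-value this estimator can return is `1 / (N + 1)`, so the number of null scores limits the resolution of the p-values.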

4. Example of parallelizing job submission.
```
NMAX=20000 # maximum number of SNPs per job
for CHRM in $( seq 1 22 ); do
    VCFFILE="file_path_here_${CHRM}.vcf.gz"
    # Placeholder: replace with a command that counts the SNPs in ${VCFFILE}
    NTOT=$( calculate_no_of_SNPs_in_this_chromosome )
    NJOB=$(( (NTOT + NMAX - 1) / NMAX )) # number of jobs, rounded up
    for (( i=0; i < NJOB; i++ )); do
        STARTSNP=$(( NMAX * i + 1 ))
        ENDSNP=$(( NMAX * (i + 1) ))
        # The last job may contain fewer than NMAX SNPs
        if [ ${ENDSNP} -gt ${NTOT} ]; then
            ENDSNP=${NTOT}
        fi
        mpirun -n 8 tejaas --vcf ${VCFFILE} --chrom ${CHRM} --include-SNPs ${STARTSNP}:${ENDSNP} --gx ${GXFILE} --gxcorr ${GXCORRFILE} \
            --gxfmt gtex --gtf ${GTFFILE} --trim --outprefix ${OUTPREFIX} \
            --cismask --psnpthres 0.000001 --knn 20 --method rr --null perm --prior-sigma 0.05
    done
done
```
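
The chunking arithmetic of the shell loop above can also be written in Python, for example when generating job scripts. This is a minimal sketch: the function name `snp_ranges` is hypothetical, and it assumes that `--include-SNPs` takes 1-based, inclusive indices, as the usage examples suggest.

```python
def snp_ranges(n_total, n_max):
    """Yield 1-based, inclusive (start, end) SNP ranges of at most n_max SNPs."""
    for start in range(1, n_total + 1, n_max):
        yield start, min(start + n_max - 1, n_total)


# Toy example: 45000 SNPs on a chromosome, at most 20000 SNPs per job.
ranges = list(snp_ranges(45000, 20000))
options = ["--include-SNPs {}:{}".format(start, end) for start, end in ranges]
for option in options:
    print(option)
```

Each printed option can be passed to a separate `tejaas` call. The last range is shorter than the others whenever the number of SNPs is not a multiple of the job size.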

## Contributors


## Version History
1.0.2 Bump release version