https://github.com/yaacoo/multipgs_py
multiPGS_py is a fast, simple and low-memory python method to calculate polygenic scores (PGS/PRS)
- Host: GitHub
- URL: https://github.com/yaacoo/multipgs_py
- Owner: yaacoo
- License: MIT
- Created: 2024-10-21T21:07:37.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2024-10-24T21:50:03.000Z (about 2 months ago)
- Last Synced: 2024-10-26T00:36:37.259Z (about 2 months ago)
- Topics: bioinformatics, genomics-data, polygenic-scores, python3, risk-prediction
- Language: Python
- Homepage:
- Size: 27.3 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# multiPGS_py: Fast, simple and low-memory PGS scoring
multiPGS_py is a fast, simple, and low-memory Python method to calculate polygenic scores (PGS/PRS) from any PGS Catalog weights (pgscatalog.org) and single-sample VCF files.
## Motivation
Currently available tools for calculating PGS still require many manual steps (e.g., accounting for flips), demand a very strict input format, or accept only a cohort VCF rather than individual genomes (single-sample indexed VCF files).
This method can score up to 5 different PGS on an individual genome and can easily be applied to many VCF files using a simple bash script or as part of a workflow/job scheduler.
## Low-memory for efficient parallelization
The program reads both the PGS file and the VCF file one line at a time and keeps only a PGS dictionary in memory, which takes no more than 60-70 MB. This allows parallelization over many VCF files without running into memory issues.
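As a rough illustration of this streaming approach (a hypothetical sketch, not the repo's actual code), the weights file is read once into a small dict and the VCF is then scanned line by line against it. The column names below follow the PGS Catalog scoring-file format (`chr_name`, `chr_position`, `effect_allele`, `other_allele`, `effect_weight`):

```python
import gzip

def load_pgs_weights(pgs_path):
    """Stream a PGS Catalog weights file into a small dict keyed by (chrom, pos).

    Skips '#' metadata lines and the column-header row; assumes the first five
    tab-separated columns are chr_name, chr_position, effect_allele,
    other_allele, effect_weight.
    """
    weights = {}
    with gzip.open(pgs_path, "rt") as fh:
        for line in fh:
            if line.startswith("#") or line.startswith("chr_name"):
                continue
            chrom, pos, ea, oa, beta = line.rstrip("\n").split("\t")[:5]
            weights[(chrom, pos)] = (ea, oa, float(beta))
    return weights

def score(weights, variants):
    """Accumulate sum(dosage_i * beta_i) over streamed (chrom, pos, dosage) records."""
    total = 0.0
    for chrom, pos, dosage in variants:
        hit = weights.get((chrom, pos))
        if hit is not None:
            total += dosage * hit[2]
    return total
```

Only the dict lives in memory; each VCF record is scored and discarded, which is what keeps the per-process footprint small.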
## What we account for
1. Strand flips (when the DNA strand orientation differs between the reference and the sample)
2. Allele flips (including beta flip to -beta)
3. Strand + allele flips and ambiguous variants
4. After these adjustments, the calculation is simple:
```math
PGS_{individual} = \sum_{i=1}^{n} (dosage_i \times \beta_i)
```
## What you still need to check and verify
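One sanity check worth doing is unit-testing the flip handling described in the previous section. A hypothetical sketch of those checks (illustrative helper, not the repo's actual code), returning whether to use a variant and whether to flip beta to -beta:

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def resolve_alleles(pgs_ea, pgs_oa, vcf_ref, vcf_alt):
    """Return (use_variant, flip_beta) for one biallelic SNP."""
    # Ambiguous (palindromic) variants: A/T or C/G, strand cannot be determined
    if pgs_ea == COMPLEMENT[pgs_oa]:
        return False, False
    # Direct match: effect allele is ALT
    if (pgs_ea, pgs_oa) == (vcf_alt, vcf_ref):
        return True, False
    # Allele flip: effect allele is REF, so beta flips to -beta
    if (pgs_ea, pgs_oa) == (vcf_ref, vcf_alt):
        return True, True
    # Strand flip: compare the complemented alleles
    ce, co = COMPLEMENT[pgs_ea], COMPLEMENT[pgs_oa]
    if (ce, co) == (vcf_alt, vcf_ref):
        return True, False
    # Strand + allele flip
    if (ce, co) == (vcf_ref, vcf_alt):
        return True, True
    return False, False
```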
1. Filter/QC of the imputation quality, filter by max(genotype probability) if needed.
2. That you know the VCF genome build, and which column in the PGS file corresponds to that build. This should be in the PGS file header; edit line 10 of the Python file if needed.
3. That your VCF is bgzipped and indexed with tabix, with the index file in the same path and prefix (e.g., S001.vcf.gz + S001.vcf.gz.tbi).
## How to use
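If your VCF is not yet compressed and indexed, this can be done with htslib's `bgzip` and `tabix` (assuming both are on your PATH):

```shell
bgzip sample.vcf              # writes sample.vcf.gz
tabix -p vcf sample.vcf.gz    # writes the index sample.vcf.gz.tbi
```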
```
python multiPGS_py.py <vcf_file> <pgs_file_1> ... <pgs_file_5>
```
Example:
```
python multiPGS_py.py sample.vcf.gz PGS000001.txt.gz PGS000002.txt.gz
```
* Don't forget to check line 7 of the Python file to verify the correct column name for your genome build!
* Make sure the VCF has a dosage (DS) field.
* The output file is a single line text file (for each VCF) and can easily be concatenated.
* Tested against Plink2 and pgsc_calc, with similar results.
## Parallel processing of multiple genomes simultaneously (experimental)
Set the number of CPU cores available per VM (default n_cpu=20); genomes are processed in batches sized to the number of available cores.
After each batch is processed, the next batch is loaded.
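This batching scheme can be sketched with Python's `multiprocessing` (illustrative only; `score_vcf` is a hypothetical stand-in for the per-genome computation):

```python
from multiprocessing import Pool

def score_vcf(vcf_path):
    """Stand-in for the per-genome PGS computation (hypothetical)."""
    return vcf_path, len(vcf_path)  # placeholder "score"

def process_in_batches(vcf_paths, n_cpu=20):
    """Score genomes in batches of n_cpu; the next batch is only
    loaded after the current one finishes."""
    results = []
    for start in range(0, len(vcf_paths), n_cpu):
        batch = vcf_paths[start:start + n_cpu]
        with Pool(processes=min(n_cpu, len(batch))) as pool:
            results.extend(pool.map(score_vcf, batch))
    return results
```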
Using a cluster of 10 VMs, each with 20 CPU cores and 1 GB of RAM per CPU, you can expect to process 1,000 genomes in less than 45 minutes.
```
python parallel_pgs.py [<vcf_file> ...]
```
## Dependencies
* python 3.6.8
* pysam 0.16.0.1
* numpy 1.18.5
## Limitations of PGS to be aware of
1. A raw PGS value is not always meaningful for an individual, but its rank/percentile/Z-score within a population can indicate risk groups.
2. PGS are not always transferable between different population ancestries. Please read the paper that contributed the PGS to the PGS Catalog to understand its limitations.
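For the first point, turning a raw score into a population rank is straightforward; a toy numpy example using a simulated (deterministic) population distribution in place of real cohort scores:

```python
import numpy as np

# Simulated population PGS distribution (stand-in for real cohort scores)
population = np.linspace(-3.0, 3.0, 10001)
individual = 1.3  # this person's raw PGS

# Z-score and percentile of the individual within the population
z = (individual - population.mean()) / population.std()
percentile = (population < individual).mean() * 100
```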