https://github.com/veeara282/cs4775-structure-rolloff
STRUCTURE and ROLLOFF reimplementations for CS 4775 (Computational Genetics and Genomics)
- Host: GitHub
- URL: https://github.com/veeara282/cs4775-structure-rolloff
- Owner: veeara282
- License: mit
- Created: 2020-12-02T22:54:19.000Z (almost 5 years ago)
- Default Branch: main
- Last Pushed: 2020-12-21T23:32:15.000Z (almost 5 years ago)
- Last Synced: 2025-01-02T00:15:27.857Z (9 months ago)
- Topics: admixture, computational-genetics, genomics, population-genetics, population-structure
- Language: Python
- Homepage:
- Size: 3.55 MB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# STRUCTURE and ROLLOFF
STRUCTURE and ROLLOFF reimplementations for CS 4775 (Computational Genetics and Genomics) at Cornell
Original STRUCTURE paper: "[Inference of Population Structure Using Multilocus Genotype Data](https://www.genetics.org/content/155/2/945)"
Original ROLLOFF paper: "[The History of African Gene Flow into Southern Europeans, Levantines, and Jews](https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1001373#s4)"
## Installation
To install in a virtual environment:
```
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
```

## Usage
`structure.py` infers admixture proportions from a dataset of genetic variants in Variant Call Format (VCF) or Eigenstrat (`.phgeno`) format. It writes its output to an HDF5 file, which `visualize.py` and `rolloff.py` then read.
```
python structure.py [-h] [-k num_populations] [-o output.hdf5]
[--profile] [-d drop_frac] [-m num_burn_in_rounds]
[-s num_samples] [-c num_rounds_btwn_samples] data_file
```

Command-line arguments:
- `data_file`: the input file, either a VCF file or an Eigenstrat (`.phgeno`) file
- `-k`: the number of populations
- `-o` or `--out`: the HDF5 file to which the output will be written
- `-d` or `--drop-frac`: the fraction of loci to drop
- `-m` or `--burn-in`: the burn-in period
- `-s` or `--num-samples`: the number of samples to collect
- `-c` or `--sample-interval`: the number of rounds between samples

`structure.py` can also take a `.phgeno` data file as its primary argument.
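
To make the interplay of `-m`, `-s`, and `-c` concrete, here is a minimal sketch of one common MCMC thinning schedule: discard the burn-in rounds, then keep one sample every `sample_interval` rounds. Whether `structure.py` counts rounds exactly this way is an assumption, and the function `sampled_rounds` is illustrative only, not part of this repository.

```python
def sampled_rounds(burn_in, num_samples, sample_interval):
    """Return the 1-based round numbers at which a sample would be kept.

    Assumed convention (not confirmed by the repository): the first
    `burn_in` rounds are discarded, then every `sample_interval`-th round
    is kept until `num_samples` samples have been collected.
    """
    if burn_in < 0 or num_samples < 0 or sample_interval < 1:
        raise ValueError("burn_in and num_samples must be >= 0, sample_interval >= 1")
    return [burn_in + i * sample_interval for i in range(1, num_samples + 1)]


rounds = sampled_rounds(burn_in=100, num_samples=5, sample_interval=10)
print(rounds)      # [110, 120, 130, 140, 150]
print(rounds[-1])  # 150 rounds in total under this convention
```

Under this convention the total run length is `burn_in + num_samples * sample_interval`, which gives a rough way to budget running time before choosing the three values.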
`visualize.py` creates charts to display the output from `structure.py`. It takes one command-line argument, the location of the HDF5 file.
`rolloff.py` estimates the ROLLOFF statistic for two populations. It takes two command-line arguments:
```
python rolloff.py [-h] [--profile] [-m centimorgans] data_file.hdf5
```

- `data_file.hdf5`: the file generated by `structure.py`
- `-m` or `--min-bin-size`: the minimum bin size to use, in centimorgans

## Data files
The data files are compressed and stored in the `data/` folder. They are from the [1000 Genomes](https://www.internationalgenome.org/) repository. Use of these datasets is subject to [these terms](https://www.internationalgenome.org/IGSR_disclaimer).
- `data/1.1-200000.ALL.chr1_GRCh38.genotypes.20170504.vcf.gz` (the 200K dataset) contains variant calls from the first 200,000 positions in chromosome 1.
- `data/1.1-2500000.ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz` (the 2.5M dataset) contains variant calls from the first 2.5 million positions in chromosome 1.
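
Before running `structure.py` on one of these files, it can help to check how many samples and variant records it contains. The sketch below does this with only the Python standard library, relying on the standard VCF layout (meta lines starting with `##`, a `#CHROM` header row with nine fixed columns followed by one column per sample). The function `summarize_vcf` is a hypothetical helper, not part of this repository, and it does not reflect how `structure.py` itself parses its input.

```python
import gzip


def summarize_vcf(path):
    """Count samples and variant records in a gzip-compressed VCF file."""
    num_samples = 0
    num_variants = 0
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("##"):
                continue  # meta-information line
            if line.startswith("#CHROM"):
                # Header row: 9 fixed columns (CHROM..FORMAT), then one per sample.
                columns = line.rstrip("\n").split("\t")
                num_samples = max(len(columns) - 9, 0)
                continue
            if line.strip():
                num_variants += 1  # every remaining non-empty line is one record
    return num_samples, num_variants
```

For example, `summarize_vcf("data/1.1-200000.ALL.chr1_GRCh38.genotypes.20170504.vcf.gz")` returns a `(num_samples, num_variants)` tuple for the 200K dataset.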