https://github.com/illumina/platinumgenomes
The Platinum Genomes Truthset
https://github.com/illumina/platinumgenomes
indels snvs variant-analysis
Last synced: about 1 year ago
JSON representation
The Platinum Genomes Truthset
- Host: GitHub
- URL: https://github.com/illumina/platinumgenomes
- Owner: Illumina
- Created: 2017-08-29T13:23:11.000Z (almost 9 years ago)
- Default Branch: master
- Last Pushed: 2017-11-08T19:25:45.000Z (over 8 years ago)
- Last Synced: 2025-01-24T09:27:42.450Z (over 1 year ago)
- Topics: indels, snvs, variant-analysis
- Homepage: https://illumina.github.io/PlatinumGenomes
- Size: 8.79 KB
- Stars: 85
- Watchers: 10
- Forks: 9
- Open Issues: 8
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Platinum Genomes
This repo contains the Platinum Genomes small variant truthset for samples NA12878 (also known as hg001) and NA12877.
Platinum Genomes truthset variants were validated using haplotype inheritance information through a well studied
17-member pedigree (CEPH 1463).
## Truthsets
Truthsets are made up of a VCF of validated variant records and a BED file of confident regions. These
files aren't huge (00s of MB) but are too large to play nicely with git and github, here's a few ways to download:
### AWS CLI
Truthset files are stored in an AWS S3 bucket called `platinum-genomes`, one way to download is via the [AWS CLI](https://aws.amazon.com/cli/):
```bash
aws s3 cp s3://platinum-genomes/2017-1.0 pg2017 --recursive
```
To download without AWS credentials, add the `--no-sign-request` flag. You can also explore the bucket and download individual files with this [S3 bucket display](https://illumina.github.io/PlatinumGenomes/).
### wget
Alternatively, use `wget` or similar with the [file URIs in this repo](files/), e.g.:
```bash
wget -xi files/2017-1.0.files
```
You can then use the relevant md5 checksum in each release to validate data integrity.
Finally, truthset files can also be downloaded via [FTP](ftp://platgene_ro:''@ussd-ftp.illumina.com/), e.g.:
```sh
wget ftp://platgene_ro:''@ussd-ftp.illumina.com/2017-1.0/hg38/small_variants/NA12878/NA12878.vcf.gz
```
## Usage
To compare a VCF against these truthsets, we recommend using [hap.py](https://github.com/Illumina/hap.py) which
performs a sophisticated haplotype comparison rather than a simpler tool such as `bcftools isec`.
Applications wrapping hap.py and containing these truthsets are available on the following platforms:
* BaseSpace Sequence Hub ([Hap.py Benchmarking](https://basespace.illumina.com/apps/3067064/Hap-py-Benchmarking) and [VCAT](https://basespace.illumina.com/apps/2106105/Variant-Calling-Assessment-Tool))
* PrecisionFDA (GA4GH Benchmarking)
## Details
See the attached [wiki](../../wiki) for technical information.
## Raw data
Sequencing data for NA12878, NA12877 and samples NA12889-NA12892 (grandparents) are available through
the ENA:
* [ENA Study: PRJEB3381](https://www.ebi.ac.uk/ena/data/view/PRJEB3381)
BaseSpace users can access this data via a shared BaseSpace project:
* [BaseSpace project share](https://basespace.illumina.com/s/2K7LqNG7Mt1h)
Sequencing data for the remaining pedigree members is not consented for public release and so is
made available through the dbGaP database:
* [dbGaP: phs001224.v1.p1](https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001224.v1.p1)
## Issues
Please open an [issue](/../../issues/) for comments, issues and other feedback.
## Citation
For further information or to cite Platinum Genomes resources, see:
* Eberle, MA _et al._ (2017) A reference data set of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree. _Genome Research_, 27:157-164. doi:[10.1101/gr.210500.116](http://dx.doi.org/10.1101/gr.210500.116)