https://github.com/bluegranite/azure-synapse-vcf-analysis

Sample code for analyzing VCF files (converted to Parquet) in Azure Databricks and Synapse.
azure azure-databricks azure-synapse bioinformatics computational-biology databricks genomics glow parquet spark synapse vcf

# VCF Analysis in Azure Synapse
Sample code for analyzing VCF files in Azure Synapse after they have been converted to Parquet using [Glow](http://projectglow.io/).

Colby T. Ford, Ph.D.

## Pipeline

## Sample Code
1. Convert VCF files to Parquet: [ConvertVCFsToParquet.md](ConvertVCFsToParquet.md)
2. Create an external table over the VCF-based Parquet files in Azure Synapse: [CreateVCFTable.md](CreateVCFTable.md)
3. Sample SQL Queries: [SampleQueries.md](SampleQueries.md)
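
The three steps above (flatten the VCF records, expose them as a table, query them with SQL) can be sketched locally with only the Python standard library. This is a toy illustration and not the code in the linked files: plain Python parsing stands in for Glow, an in-memory SQLite table stands in for the Synapse external table, and the column names (`chrom`, `pos`, `variant_id`, and so on), the helper functions, and the three-line VCF snippet are all made up for the example.

```python
import sqlite3

# Made-up VCF snippet (NOT taken from 1000 Genomes); fixed columns only.
SAMPLE_VCF = """\
##fileformat=VCFv4.1
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
1\t1000\tdemo1\tA\tAC\t100\tPASS\tAF=0.42
1\t2000\t.\tT\tTA\t100\tPASS\tAF=0.44
2\t3000\t.\tC\tT,G\t100\tPASS\tAF=0.01,0.02
"""


def parse_vcf(text):
    """Yield one flat dict per VCF data line, skipping '#' header lines."""
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        chrom, pos, vid, ref, alt, qual, flt, info = line.split("\t")[:8]
        yield {
            "chrom": chrom,
            "pos": int(pos),
            "variant_id": None if vid == "." else vid,  # '.' means missing
            "ref": ref,
            "alt": alt,
            "qual": None if qual == "." else float(qual),
            "filters": flt,
            "info": info,
        }


def load_variants(conn, rows):
    """Create a hypothetical 'variants' table and insert the flattened rows."""
    conn.execute(
        "CREATE TABLE variants (chrom TEXT, pos INTEGER, variant_id TEXT, "
        "ref TEXT, alt TEXT, qual REAL, filters TEXT, info TEXT)"
    )
    conn.executemany(
        "INSERT INTO variants VALUES "
        "(:chrom, :pos, :variant_id, :ref, :alt, :qual, :filters, :info)",
        rows,
    )


def variants_per_chrom(conn):
    """Example query: number of variant records per chromosome."""
    return conn.execute(
        "SELECT chrom, COUNT(*) FROM variants GROUP BY chrom ORDER BY chrom"
    ).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_variants(conn, parse_vcf(SAMPLE_VCF))
    print(variants_per_chrom(conn))
```

At real scale the same shape applies, but the flattening runs in Spark and the table is defined over Parquet files in storage rather than loaded row by row; see the linked files for the actual steps.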

## Sample Data
The sample VCF data used in this demo is from the Phase 3 release of the [1000 Genomes Project](https://www.internationalgenome.org/data/).
This includes ~168 GB of VCF data, which can be downloaded from the project's [FTP site](ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/).

## BlueGranite Resources
- This repository accompanies the BlueGranite blog post: https://www.bluegranite.com/blog/query-millions-of-genomic-variants-in-minutes-with-azure-synapse
- Demo video on YouTube: [https://www.youtube.com/watch?v=4B-8cviFPYU](https://www.youtube.com/watch?v=4B-8cviFPYU)
- _Building a Genomics Data Lake in Azure_ eBook: https://www.bluegranite.com/genomics-data-lake-ebook
- BlueGranite Genomics Page: https://www.bluegranite.com/genomics