https://github.com/lifeomic/spark-vcf

Spark VCF data source implementation for Dataframes
https://github.com/lifeomic/spark-vcf

dataframe genomics genotype lifeomic spark spark-sql team-clinical-intelligence variants vcf vcf-files

Last synced: 24 days ago
JSON representation

Spark VCF data source implementation for Dataframes

Host: GitHub
URL: https://github.com/lifeomic/spark-vcf
Owner: lifeomic
License: mit
Created: 2017-11-05T20:08:41.000Z (almost 8 years ago)
Default Branch: master
Last Pushed: 2022-07-15T15:09:33.000Z (about 3 years ago)
Last Synced: 2025-04-05T21:51:10.615Z (6 months ago)
Topics: dataframe, genomics, genotype, lifeomic, spark, spark-sql, team-clinical-intelligence, variants, vcf, vcf-files
Language: Scala
Homepage:
Size: 314 KB
Stars: 14
Watchers: 23
Forks: 2
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

# spark-vcf
Spark VCF data source implementation in native spark.

# Introduction

Spark VCF allows you to natively load VCFs into an Apache Spark Dataframe/Dataset. To get started with Spark-VCF, you can
clone or download this repository, then run `mvn package` and use the jar. We are also now in Maven central.

Since spark-vcf is written specifically for Spark, there is less overhead and performance gains in many areas.

# Installation

Spark-vcf can be packaged from source or added as a dependency to your Maven based project.

To install spark vcf, add the following to your pom:

```xml

com.lifeomic
spark-vcf
0.3.0

```

For sbt:
```
libraryDependencies += "com.lifeomic" % "spark-vcf" % "0.3.0"
```

If you are using gradle, the dependency is:
```
compile group: 'com.lifeomic', name: 'spark-vcf', version: '0.3.0'
```

# Getting Started

Getting started with Spark VCF is as simple as:

```scala
val myVcf = spark.read
.format("com.lifeomic.variants")
.load("src/test/resources/example.vcf")
```

The schema contains the standard vcf columns and has the options to expand INFO and/or FORMAT columns. An example schema
from 1000 genomes is shown below:

```

There are options that you can use as well for the `Format` and `Info` columns. To return the format fields as a map,
instead of separate fields, you can set the `use.format.map` variable to `true`. This can be used to speed up the spark
job even more, as it doesn't have to read the header file for type and column information.

```scala
val mappedFormat = spark.read
.format("com.lifeomic.variants")
.option("use.format.map", "true")
.load("src/test/resources/example.vcf")
```

You can also stringly type the formats as well by setting `use.format.type` to false.

One more note worth mentioning: while the core of spark-vcf is written as a Spark data source, it is still advisable to use
the BGZFEnhancedGzipCodec from Hadoop-BAM for splitting bgzip files, so that Spark can properly partition the files. For example:

```scala
val sparkConf = new SparkConf()
.setAppName("testing")
.setMaster("local[8]")
.set("spark.hadoop.io.compression.codecs", "org.seqdoop.hadoop_bam.util.BGZFEnhancedGzipCodec")
```

# TODO
* Provide performance benchmarks compared to other libraries
* Get Travis CI set up

# License

The MIT License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/lifeomic/spark-vcf

Awesome Lists containing this project

README