https://github.com/biojava/biojava-spark

:collision: Algorithms that are built around BioJava and run on Apache Spark

Host: GitHub
URL: https://github.com/biojava/biojava-spark
Owner: biojava
License: lgpl-2.1
Created: 2016-04-29T18:06:39.000Z (about 10 years ago)
Default Branch: master
Last Pushed: 2022-01-10T21:08:11.000Z (over 4 years ago)
Last Synced: 2025-09-09T16:34:49.590Z (9 months ago)
Language: Java
Homepage:
Size: 64.7 MB
Stars: 8
Watchers: 10
Forks: 3
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# BioJava-Spark

Algorithms built around BioJava that run on Apache Spark.

[![Build Status](https://travis-ci.org/biojava/biojava-spark.svg?branch=master)](https://travis-ci.org/biojava/biojava-spark)

[![License](http://img.shields.io/badge/license-LGPL_2.1-blue.svg?style=flat)](https://github.com/biojava/biojava/blob/master/LICENSE)

[![Status](http://img.shields.io/badge/status-experimental-red.svg?style=flat)](https://github.com/biojava/biojava-spark)

[![Version](http://img.shields.io/badge/version-0.2.1-blue.svg?style=flat)](https://github.com/biojava/biojava-spark/)

## Starting up

### Initial instructions can be found in the mmtf-spark project

https://github.com/sbl-sdsc/mmtf-spark

### First, download and untar a Hadoop sequence file of the PDB (~7 GB download)

```bash
wget http://mmtf.rcsb.org/v1.0/hadoopfiles/full.tar
tar -xvf full.tar
```

Alternatively, download a reduced version containing only C-alpha atoms, phosphates, and ligands (~800 MB download):

```bash
wget http://mmtf.rcsb.org/v1.0/hadoopfiles/reduced.tar
tar -xvf reduced.tar
```

### Second, add the biojava-spark dependency to your pom

```xml
<dependency>
	<groupId>org.biojava</groupId>
	<artifactId>biojava-spark</artifactId>
	<version>0.2.1</version>
</dependency>
```

## Extra BioJava examples

### Do some simple quality filtering

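```java
// Quality thresholds: maximum resolution and maximum R-free
float maxResolution = 3.0f;
float maxRfree = 0.3f;
// Load the Hadoop sequence file and apply both filters
StructureDataRDD structureData = new StructureDataRDD("/path/to/file")
		.filterResolution(maxResolution)
		.filterRfree(maxRfree);
```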
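To make the filtering criteria concrete, here is a minimal plain-Java sketch of the same idea, without Spark. It is not the biojava-spark implementation: `QualityFilterSketch`, `EntryQuality`, and the sample IDs and values are hypothetical, and the sketch assumes the thresholds are inclusive upper bounds, which the README does not state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class QualityFilterSketch {
    // Hypothetical stand-in for one entry's quality metadata; not a biojava-spark class.
    static final class EntryQuality {
        final String id;
        final float resolution; // lower values mean a better-resolved structure
        final float rFree;      // lower values mean better agreement with the data

        EntryQuality(String id, float resolution, float rFree) {
            this.id = id;
            this.resolution = resolution;
            this.rFree = rFree;
        }
    }

    // Keep the IDs of entries whose resolution and R-free are both within the limits
    // (assumed inclusive).
    static List<String> filter(List<EntryQuality> entries, float maxResolution, float maxRfree) {
        List<String> kept = new ArrayList<>();
        for (EntryQuality e : entries) {
            if (e.resolution <= maxResolution && e.rFree <= maxRfree) {
                kept.add(e.id);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<EntryQuality> entries = Arrays.asList(
                new EntryQuality("AAAA", 1.8f, 0.22f),  // passes both limits
                new EntryQuality("BBBB", 3.5f, 0.25f),  // resolution too high
                new EntryQuality("CCCC", 2.4f, 0.35f)); // R-free too high
        System.out.println(filter(entries, 3.0f, 0.3f)); // prints [AAAA]
    }
}
```

Only the entry that satisfies both limits survives, which mirrors chaining the two filter calls on the RDD.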

### Summarising the elements in the PDB

```java
// Count the atoms in the filtered structures, grouped by element
Map elementCountMap = BiojavaSparkUtils.findAtoms(structureData).countByElement();
```
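The result of counting by element can be pictured as a map from element symbol to number of atoms. The following plain-Java sketch illustrates that shape on a small in-memory list. It is an assumption about what the summary looks like, not the library's code: `ElementCountSketch` and the sample data are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

class ElementCountSketch {
    // Group element symbols and count occurrences; TreeMap keeps the keys sorted.
    static Map<String, Long> countByElement(List<String> elements) {
        return elements.stream()
                .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Hypothetical element symbols, one per atom
        List<String> elements = Arrays.asList("C", "N", "C", "O", "C", "S", "O");
        System.out.println(countByElement(elements)); // prints {C=3, N=1, O=2, S=1}
    }
}
```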

### Finding inter-atomic contacts from the PDB

```java
// `cutoff` is the contact distance threshold; it must be defined before this call
Double mean = BiojavaSparkUtils.findContacts(structureData,
		new AtomSelectObject()
				.groupNameList(new String[] {"PRO","LYS"})
				.elementNameList(new String[] {"C"})
				.atomNameList(new String[] {"CA"}),
		cutoff)
		.getDistanceDistOfAtomInts("CA", "CA")
		.mean();
System.out.println("\nMean PRO-LYS CA-CA distance: " + mean);
```
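As a rough illustration of what a mean contact distance measures, the plain-Java sketch below averages the Euclidean distance over all atom pairs that fall within a cutoff. It is a brute-force sketch under stated assumptions, not how biojava-spark computes contacts: `ContactDistanceSketch`, the coordinates, the cutoff value, and the inclusive comparison are all hypothetical.

```java
class ContactDistanceSketch {
    // Euclidean distance between two 3D points given as {x, y, z}.
    static double distance(double[] a, double[] b) {
        double dx = a[0] - b[0];
        double dy = a[1] - b[1];
        double dz = a[2] - b[2];
        return Math.sqrt(dx * dx + dy * dy + dz * dz);
    }

    // Mean distance over all pairs (one atom from each set) within the cutoff.
    // Returns NaN when no pair is in contact.
    static double meanContactDistance(double[][] first, double[][] second, double cutoff) {
        double sum = 0.0;
        int contacts = 0;
        for (double[] a : first) {
            for (double[] b : second) {
                double d = distance(a, b);
                if (d <= cutoff) { // inclusive cutoff is an assumption
                    sum += d;
                    contacts++;
                }
            }
        }
        return contacts == 0 ? Double.NaN : sum / contacts;
    }

    public static void main(String[] args) {
        // Hypothetical CA coordinates for two residue types
        double[][] pro = {{0, 0, 0}, {10, 0, 0}};
        double[][] lys = {{3, 4, 0}, {10, 0, 4}};
        // Pairs within 6.0: distances 5.0 and 4.0, so the mean is 4.5
        System.out.println(meanContactDistance(pro, lys, 6.0)); // prints 4.5
    }
}
```

The nested loop is quadratic in the number of atoms, which is acceptable for a toy example but is exactly the kind of work that benefits from being distributed across a cluster.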
