https://github.com/biojava/biojava-spark
:collision: Algorithms that are built around BioJava and run on Apache Spark
- Host: GitHub
- URL: https://github.com/biojava/biojava-spark
- Owner: biojava
- License: lgpl-2.1
- Created: 2016-04-29T18:06:39.000Z (almost 10 years ago)
- Default Branch: master
- Last Pushed: 2022-01-10T21:08:11.000Z (about 4 years ago)
- Last Synced: 2025-09-09T16:34:49.590Z (6 months ago)
- Language: Java
- Size: 64.7 MB
- Stars: 8
- Watchers: 10
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# BioJava-Spark
Algorithms built around BioJava that run on Apache Spark.
## Starting up
### Some initial instructions can be found on the mmtf-spark project
https://github.com/sbl-sdsc/mmtf-spark
### First, download and untar a Hadoop sequence file of the PDB (~7 GB download)
```bash
wget http://mmtf.rcsb.org/v1.0/hadoopfiles/full.tar
tar -xvf full.tar
```
Or you can get a reduced version containing only C-alpha, phosphate, and ligand atoms (~800 MB download)
```bash
wget http://mmtf.rcsb.org/v1.0/hadoopfiles/reduced.tar
tar -xvf reduced.tar
```
### Second, add the biojava-spark dependency to your pom
```xml
<dependency>
    <groupId>org.biojava</groupId>
    <artifactId>biojava-spark</artifactId>
    <version>0.2.1</version>
</dependency>
```
## Extra BioJava examples
### Do some simple quality filtering
```java
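// Upper limits for the quality filters: resolution (in Angstroms) and R-free.
float maxResolution = 3.0f;
float maxRfree = 0.3f;
// Load the Hadoop sequence file and keep only structures within both limits.
StructureDataRDD structureData = new StructureDataRDD("/path/to/file")
        .filterResolution(maxResolution)
        .filterRfree(maxRfree);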
```
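To make the effect of the two filters concrete, here is a minimal plain-Java sketch of the same idea on an in-memory list. It does not use biojava-spark or Spark: the `QualityFilterSketch` class, its `Entry` type, and the entry values are hypothetical, and treating the limits as inclusive is an assumption rather than documented library behaviour.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: plain Java, not the biojava-spark API.
class QualityFilterSketch {

    // Hypothetical stand-in for the per-structure quality metadata.
    static final class Entry {
        final String id;
        final float resolution; // in Angstroms; lower is better
        final float rFree;      // dimensionless; lower is better

        Entry(String id, float resolution, float rFree) {
            this.id = id;
            this.resolution = resolution;
            this.rFree = rFree;
        }
    }

    // Keep entries at or below both limits (inclusive bounds are an assumption).
    static List<Entry> filter(List<Entry> entries, float maxResolution, float maxRfree) {
        List<Entry> kept = new ArrayList<>();
        for (Entry e : entries) {
            if (e.resolution <= maxResolution && e.rFree <= maxRfree) {
                kept.add(e);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Entry> entries = Arrays.asList(
                new Entry("entry-1", 1.8f, 0.22f),  // passes both limits
                new Entry("entry-2", 3.5f, 0.25f),  // resolution too poor
                new Entry("entry-3", 2.4f, 0.35f)); // R-free too high
        for (Entry e : filter(entries, 3.0f, 0.3f)) {
            System.out.println(e.id); // prints: entry-1
        }
    }
}
```

On the real data set the same predicate is applied across the whole RDD, so only structures passing both limits reach the later steps.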
### Summarising the elements in the PDB
```java
// Count how many atoms of each element occur across the loaded structures.
Map<String, Long> elementCountMap = BiojavaSparkUtils.findAtoms(structureData).countByElement();
```
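As a rough picture of what an element summary contains, the following plain-Java sketch counts element symbols in a small in-memory list. It is not the library's implementation: `ElementCountSketch` and its sample input are hypothetical, and the real call works on atoms distributed across a Spark cluster.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: plain Java, not the biojava-spark API.
class ElementCountSketch {

    // Count how often each element symbol occurs; TreeMap keeps keys sorted.
    static Map<String, Long> countByElement(List<String> elements) {
        Map<String, Long> counts = new TreeMap<>();
        for (String element : elements) {
            counts.merge(element, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up element symbols standing in for the atoms of a structure.
        List<String> elements = Arrays.asList("C", "N", "C", "O", "C", "O");
        System.out.println(countByElement(elements)); // prints: {C=3, N=1, O=2}
    }
}
```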
### Finding inter-atomic contacts from the PDB
```java
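// Distance cutoff below which two atoms count as a contact.
// The value and type here are assumed for illustration; choose your own.
double cutoff = 5.0;
// Select CA atoms of PRO and LYS groups, find contacts within the cutoff,
// then take the mean of the CA-CA distance distribution.
Double mean = BiojavaSparkUtils.findContacts(structureData,
        new AtomSelectObject()
                .groupNameList(new String[] {"PRO","LYS"})
                .elementNameList(new String[] {"C"})
                .atomNameList(new String[] {"CA"}),
        cutoff)
        .getDistanceDistOfAtomInts("CA", "CA")
        .mean();
System.out.println("\nMean PRO-LYS CA-CA distance: " + mean);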
```
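The quantity reported above can be sketched in plain Java as the mean distance over all atom pairs that lie closer than the cutoff. This is a minimal illustration, not the library's algorithm: `ContactDistanceSketch`, the brute-force pair loop, the strict `<` comparison, and the coordinates are all assumptions made for the example.

```java
// Illustrative only: plain Java, not the biojava-spark API.
class ContactDistanceSketch {

    // Euclidean distance between two 3D points given as {x, y, z}.
    static double distance(double[] a, double[] b) {
        double dx = a[0] - b[0];
        double dy = a[1] - b[1];
        double dz = a[2] - b[2];
        return Math.sqrt(dx * dx + dy * dy + dz * dz);
    }

    // Mean distance over all pairs (one atom from each set) closer than the cutoff.
    // Returns NaN when no pair is within the cutoff.
    static double meanContactDistance(double[][] first, double[][] second, double cutoff) {
        double sum = 0.0;
        int contacts = 0;
        for (double[] a : first) {
            for (double[] b : second) {
                double d = distance(a, b);
                if (d < cutoff) { // strict comparison is an assumption
                    sum += d;
                    contacts++;
                }
            }
        }
        return contacts == 0 ? Double.NaN : sum / contacts;
    }

    public static void main(String[] args) {
        // Made-up CA coordinates standing in for one PRO and three LYS residues.
        double[][] proCa = {{0, 0, 0}};
        double[][] lysCa = {{3, 4, 0}, {0, 0, 6}, {10, 0, 0}};
        // Distances are 5, 6 and 10; with a cutoff of 7 only the first two count.
        System.out.println(meanContactDistance(proCa, lysCa, 7.0)); // prints: 5.5
    }
}
```

A brute-force loop like this is quadratic in the number of atoms, which is fine for a sketch but not for the full PDB; that scale is the reason to run the real calculation on Spark.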