BioJava-Spark

Algorithms that are built around BioJava and run on Apache Spark.


Starting up

Some initial instructions can be found in the mmtf-spark project:

https://github.com/sbl-sdsc/mmtf-spark

First, download and untar a Hadoop sequence file of the PDB (~7 GB download):

wget http://mmtf.rcsb.org/v1.0/hadoopfiles/full.tar
tar -xvf full.tar

Alternatively, download a reduced version containing only C-alpha, phosphate and ligand atoms (~800 MB download):

wget http://mmtf.rcsb.org/v1.0/hadoopfiles/reduced.tar
tar -xvf reduced.tar

Second, add the biojava-spark dependency to your pom.xml:

<dependency>
	<groupId>org.biojava</groupId>
	<artifactId>biojava-spark</artifactId>
	<version>0.2.1</version>
</dependency>

Extra BioJava examples

Do some simple quality filtering

float maxResolution = 3.0f;
float maxRfree = 0.3f;
StructureDataRDD structureData = new StructureDataRDD("/path/to/file")
			.filterResolution(maxResolution)
			.filterRfree(maxRfree);
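
To make the filtering step concrete, here is a minimal standalone sketch in plain Java of what a resolution and R-free cut does to a set of entries. It does not use biojava-spark: the `Entry` class, the entry names and their values are made up for illustration, and the strict less-than comparison is an assumption about how the thresholds are applied.

```java
import java.util.ArrayList;
import java.util.List;

public class QualityFilterSketch {

    /** Hypothetical per-structure summary; not a biojava-spark type. */
    static final class Entry {
        final String id;
        final float resolution; // in Angstrom
        final float rFree;

        Entry(String id, float resolution, float rFree) {
            this.id = id;
            this.resolution = resolution;
            this.rFree = rFree;
        }
    }

    /** Returns the ids of entries whose resolution and R-free are both below the limits. */
    static List<String> filter(List<Entry> entries, float maxResolution, float maxRfree) {
        List<String> kept = new ArrayList<>();
        for (Entry e : entries) {
            // Assumption: thresholds are exclusive upper bounds.
            if (e.resolution < maxResolution && e.rFree < maxRfree) {
                kept.add(e.id);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up entries, not real PDB records.
        List<Entry> entries = new ArrayList<>();
        entries.add(new Entry("good", 2.0f, 0.25f));
        entries.add(new Entry("lowres", 3.5f, 0.25f));
        entries.add(new Entry("highrfree", 2.0f, 0.35f));
        System.out.println(filter(entries, 3.0f, 0.3f));
    }
}
```

With the thresholds used above (3.0 and 0.3), only the first made-up entry passes both cuts.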

Summarising the elements in the PDB

Map<String, Long> elementCountMap = BiojavaSparkUtils.findAtoms(structureData).countByElement();
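
The shape of the result, a map from element symbol to occurrence count, can be illustrated with a small standalone sketch using Java streams. This is not the biojava-spark implementation; the class name, the `countByElement` helper here and the input list are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class ElementCountSketch {

    /** Counts how often each element symbol occurs; keys are sorted alphabetically. */
    static Map<String, Long> countByElement(List<String> elements) {
        return elements.stream()
                .collect(Collectors.groupingBy(e -> e, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Made-up list of element symbols, one per atom.
        List<String> elements = Arrays.asList("C", "N", "O", "C", "C", "S");
        System.out.println(countByElement(elements)); // prints {C=3, N=1, O=1, S=1}
    }
}
```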

Finding inter-atomic contacts from the PDB

// cutoff (presumably the contact distance threshold) is not defined in this
// snippet; declare it before this call.
Double mean = BiojavaSparkUtils.findContacts(structureData,
		new AtomSelectObject()
				.groupNameList(new String[] {"PRO","LYS"})
				.elementNameList(new String[] {"C"})
				.atomNameList(new String[] {"CA"}),
		cutoff)
		.getDistanceDistOfAtomInts("CA", "CA")
		.mean();
System.out.println("\nMean PRO-LYS CA-CA distance: " + mean);
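
As a rough illustration of the quantity computed above, the following standalone sketch takes two sets of 3D coordinates, keeps the pairs that lie within a cutoff, and averages their distances. It is a brute-force example in plain Java, not the library's algorithm; the class and method names, the coordinates, and the inclusive treatment of the cutoff are all assumptions.

```java
public class ContactDistanceSketch {

    /** Euclidean distance between two points given as {x, y, z}. */
    static double distance(double[] a, double[] b) {
        double dx = a[0] - b[0];
        double dy = a[1] - b[1];
        double dz = a[2] - b[2];
        return Math.sqrt(dx * dx + dy * dy + dz * dz);
    }

    /**
     * Mean distance over all pairs (one point from each group) that lie
     * within the cutoff. Returns NaN when no pair is within the cutoff.
     */
    static double meanContactDistance(double[][] groupA, double[][] groupB, double cutoff) {
        double sum = 0.0;
        int count = 0;
        for (double[] a : groupA) {
            for (double[] b : groupB) {
                double d = distance(a, b);
                // Assumption: a pair at exactly the cutoff distance counts as a contact.
                if (d <= cutoff) {
                    sum += d;
                    count++;
                }
            }
        }
        return count == 0 ? Double.NaN : sum / count;
    }

    public static void main(String[] args) {
        // Made-up coordinates, not taken from any structure.
        double[][] groupA = { {0, 0, 0} };
        double[][] groupB = { {3, 4, 0}, {0, 0, 2}, {10, 0, 0} };
        System.out.println(meanContactDistance(groupA, groupB, 6.0));
    }
}
```

In this example the pairs at distances 5 and 2 fall within the cutoff of 6, while the pair at distance 10 is excluded.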
