https://github.com/drylikov/spark_cassandra_connector

If you write a Spark application that needs access to Cassandra, this library is for you.

# Spark Cassandra Connector

## Important notice: Do NOT use the GitHub issue tracker!
We are going to disable it, and issues created there will no longer be accessible.

## Lightning-fast cluster computing with Spark and Cassandra

This library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and
execute arbitrary CQL queries in your Spark applications.

## Features

- Compatible with Apache Cassandra version 2.0 or higher and DataStax Enterprise 4.5 (see table below)
- Compatible with Apache Spark 1.0 through 1.2 (see table below)
- Compatible with Scala 2.10 and 2.11
- Exposes Cassandra tables as Spark RDDs
- Maps table rows to CassandraRow objects or tuples
- Offers customizable object mapper for mapping rows to objects of user-defined classes
- Saves RDDs back to Cassandra by implicit `saveToCassandra` call
- Joins with a subset of Cassandra data using the `joinWithCassandraTable` call
- Partitions RDDs according to Cassandra replication using the `repartitionByCassandraReplica` call
- Converts data types between Cassandra and Scala
- Supports all Cassandra data types including collections
- Filters rows on the server side via the CQL `WHERE` clause
- Allows for execution of arbitrary CQL statements
- Plays nice with Cassandra Virtual Nodes
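
The read and write features listed above can be sketched roughly as follows. This is a minimal illustration rather than a definitive recipe: the keyspace `test`, the table `kv`, the host `127.0.0.1`, the `local[2]` master, and the object name `ConnectorSketch` are all assumptions chosen for the example, and it expects a Cassandra node to be reachable at that address.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._                      // adds cassandraTable and saveToCassandra
import com.datastax.spark.connector.cql.CassandraConnector

// Hypothetical demo object; the name, keyspace, table, and host are assumptions.
object ConnectorSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true)
      .setAppName("connector-sketch")
      .setMaster("local[2]")                                // assumed: Spark in local mode
      .set("spark.cassandra.connection.host", "127.0.0.1")  // assumed: local Cassandra node
    val sc = new SparkContext(conf)

    // Execute arbitrary CQL statements to prepare an example schema.
    CassandraConnector(conf).withSessionDo { session =>
      session.execute(
        "CREATE KEYSPACE IF NOT EXISTS test WITH replication = " +
          "{'class': 'SimpleStrategy', 'replication_factor': 1}")
      session.execute(
        "CREATE TABLE IF NOT EXISTS test.kv (key text PRIMARY KEY, value int)")
    }

    // Save an RDD of tuples to the table, naming the target columns explicitly.
    sc.parallelize(Seq(("key1", 1), ("key2", 2)))
      .saveToCassandra("test", "kv", SomeColumns("key", "value"))

    // Expose the table as an RDD of CassandraRow objects and read typed values.
    val rows = sc.cassandraTable("test", "kv")
    rows.collect().foreach { row =>
      println(row.getString("key") + " -> " + row.getInt("value"))
    }

    // Map rows to tuples instead, selecting the columns in the order the tuple expects.
    val pairs = sc.cassandraTable[(String, Int)]("test", "kv").select("key", "value")
    println(pairs.count())

    sc.stop()
  }
}
```

The import of `com.datastax.spark.connector._` is what makes `cassandraTable` available on the `SparkContext` and `saveToCassandra` available on RDDs, which is why the feature list describes the save call as implicit.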

## Version Compatibility

The connector project has several branches, each of which maps to different supported versions of Spark and Cassandra. The compatibility table below shows the major.minor version ranges supported by each connector branch for Spark, Cassandra, and the Cassandra Java driver:

| Connector | Spark | Cassandra | Cassandra Java Driver |
| --------- | ------------- | --------- | --------------------- |
| 1.2 | 1.2 | 2.1, 2.0 | 2.1 |
| 1.1 | 1.1, 1.0 | 2.1, 2.0 | 2.1 |
| 1.0 | 1.0, 0.9 | 2.0 | 2.0 |

## Download
This project has been published to the Maven Central Repository.
To have SBT download the connector binaries, sources, and javadoc, add this to your project's
SBT configuration:

    libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.2.0"

To use the connector from Java, also add the Java API module:

    libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector-java" % "1.2.0"
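
For context, a complete `build.sbt` might look roughly like the sketch below. The project name, the Scala version, and the `provided` scope on Spark are assumptions for illustration; the Spark version is chosen to match the connector 1.2 row of the compatibility table.

```scala
// Hypothetical build.sbt; the project name and Scala version are assumptions.
name := "connector-sketch"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  // Spark itself; "provided" assumes the cluster supplies Spark at runtime.
  "org.apache.spark" %% "spark-core" % "1.2.0" % "provided",
  // The connector, matched to Spark 1.2 per the compatibility table.
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.2.0"
)
```

If the application is launched directly from SBT rather than submitted to a cluster, the `provided` scope would likely need to be dropped so that Spark is on the runtime classpath.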

## Documentation

- [Quick-start guide](doc/0_quick_start.md)
- [Connecting to Cassandra](doc/1_connecting.md)
- [Loading datasets from Cassandra](doc/2_loading.md)
- [Server-side data selection and filtering](doc/3_selection.md)
- [Working with user-defined case classes and tuples](doc/4_mapper.md)
- [Saving datasets to Cassandra](doc/5_saving.md)
- [Customizing the object mapping](doc/6_advanced_mapper.md)
- [Using Connector in Java](doc/7_java_api.md)
- [Spark Streaming with Cassandra](doc/8_streaming.md)
- [About The Demos](doc/9_demos.md)
- [The spark-cassandra-connector-embedded Artifact](doc/10_embedded.md)
- [Performance monitoring](doc/11_metrics.md)
- [Building And Artifacts](doc/12_building_and_artifacts.md)
- [The Spark Shell](doc/13_spark_shell.md)
- [Frequently Asked Questions](doc/FAQ.md)

## Contributing
To develop this project, we recommend using IntelliJ IDEA.
Make sure you have installed and enabled the Scala Plugin.
Open the project with IntelliJ IDEA and it will automatically create the project structure
from the provided SBT configuration.

Before contributing your changes to the project, please make sure that all unit tests and integration tests pass.
Don't forget to add an appropriate entry at the top of CHANGES.txt.
Finally, open a pull request on GitHub and await review.

If your pull request resolves an open issue, please add *Fixes \#xx* at the
end of each commit message (where *xx* is the issue number).

## Testing
To run unit and integration tests:

    ./sbt/sbt test
    ./sbt/sbt it:test

By default, integration tests start up a separate, single Cassandra instance and run Spark in local mode.
It is possible to run integration tests with your own Cassandra and/or Spark cluster.
First, prepare a jar with testing code:

    ./sbt/sbt test:package

Then copy the generated test jar to your Spark nodes and run:

    export IT_TEST_CASSANDRA_HOST=
    export IT_TEST_SPARK_MASTER=
    ./sbt/sbt it:test