Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/instaclustr/cassandra-sstable-generator

Tool for programmatic generation of Cassandra SSTable
https://github.com/instaclustr/cassandra-sstable-generator

bulk cassandra csv generation load netapp-public random sstable

Last synced: 2 months ago
JSON representation

Tool for programmatic generation of Cassandra SSTable

Lists

README

        

# Cassandra-SStable-Generator

_CLI tool for programmatic generation of Cassandra SSTables_

image:https://img.shields.io/maven-central/v/com.instaclustr/sstable-generator.svg?label=Maven%20Central[link=https://search.maven.org/search?q=g:%22com.instaclustr%22%20AND%20a:%22sstable-generator%22]
image:https://circleci.com/gh/instaclustr/cassandra-sstable-generator.svg?style=svg["Instaclustr",link="https://circleci.com/gh/instaclustr/cassandra-sstable-generator"]

- Website: https://www.instaclustr.com/
- Documentation: https://www.instaclustr.com/support/documentation/

This tool simply generates SSTables programmatically. It uses Cassandra's `CQLSSTableWriter`.
After the generation of SSTables is finished, you can load them by `sstableloader` tool as usual.

The project consists of these modules:

* api—impl is coded against this module
* impl—the implementation of your population logic, depends on `api`
* generator—the implementation of whole generator CLI application
* cassandra-3—the implementation of SSTable generator using internals of Cassandra 3 artifact
* cassandra- 4—the implementation of SSTable generator using internals of Cassandra 4 artifact

## Build

`mvn clean install` (or `mvn clean install -DskipTests`)

## Run

Let's guide you through an example. We want to generate a SSTable by Cassandra 3 API so we can load it
to Cassandra afterwards. The components you need to have on a class path are as follows:

* generator jar
* cassandra-3 module jar
* jar with the implementation of your generation logic

----
java \
-cp /path/to/impl-1.0.jar:/path/to/generator-1.0.jar:/path/to/cassandra-3.jar \
com.instaclustr.sstable.generator.LoaderApplication \
_command_ \
_arguments_
----

The concrete example of the invocation would be:

----
java \
-cp impl/target/sstable-generator-impl-1.0:generator/target/sstable-generator-1.0.jar:cassandra-3/target/cassandra-3-1.0.jar \
com.instaclustr.sstable.generator.LoaderApplication \
fixed \
--keyspace mykeyspace \
--table mytable \
--output-dir=/tmp/output \
--schema cassandra-3/src/test/resources/cassandra/cql/table.cql \
--threads 2
----

**Please be aware that you need to have all libraries of Apache Cassandra on the classpath as well. For
that reason, please use `./run.sh` script and modify it to suit your needs in order to generate SStables.**

No `command` executes default command—`help`:

----
Usage: [-V] COMMAND
-V, --version print version information and exit
Commands:
csv tool for bulk-loading of data from csv
random tool for bulk-loading of random data
fixed tool for bulk-loading of fixed data
----

### `random` Command

By this command, you are expected to provide data which represents a row in random fashion.

### `csv` Command

`csv` command has same arguments as `random` but `--file` is mandatory. There is supposed to be CSV file which
represents rows. Each row will be parsed into a list of strings passed to `RowMapper` implementation where you
have to map them to list of objects for a Cassandra INSERT statement as values.

### `fixed` Command

By `fixed` command, we will generate a SSTable by using the exact list of "rows" with columns. This
will be obvious from the documentation which follows.

## Row Generation

In order to generate data for all three cases above you have to implement interface
`com.instaclustr.cassandra.bulkloader.RowMapper` in `api` module. This implementation should
be placed in `impl` (or its equivalent) and it should be on a class path.

## RowMapper Interface

----
public interface RowMapper {

/**
* Maps list of strings from whatever input representing
* a row to list of objects to insert into Cassandra.
*


* This method e.g. called upon generation from CSV file.
*


* @param row where values are consisting of list of strings
* @return list of objects to put to insert statement
*/
List map(final List row);

/**
* Used when we do not want to generate data randomly but we have exact list of what to insert.
*
* @return list of rows to be created containing list of cells
*/
Stream> get();

/**
* Logically same as {@link #map(List)} but all data per row
* needs to be generated inside of the method. The number
* of items in the returned list has to match number of columns
* in a row. Each such object represents value which will be
* passed to Cassandra INSERT statement.
*


* This method is called repeatedly. Number of calls
* is equal to paramter `--numberOfRecords`.
*


* This method is called upon "random" generation.
* @return list of objects to put to insert statement
*/
List random();

/**
* @return string representation of INSERT INTO statement. Question marks in VALUES are not
* meant to be replaced.
*


* For example: 'INSERT INTO keyspace.table (field1, field2, field3) VALUES (?, ?, ?)'
*/
String insertStatement();
}
----

The implementation of `RowMapper` you are supposed to place on the class path would look like this:

----
public class RowMapper1 implements RowMapper {

public static final String KEYSPACE = "mykeyspace";
public static final String TABLE = "mytable";

public static final UUID UUID_1 = UUID.randomUUID();
public static final UUID UUID_2 = UUID.randomUUID();
public static final UUID UUID_3 = UUID.randomUUID();

@Override
public List map(final List row) {
return null;
}

@Override
public Stream> get() {
return Stream.of(
new ArrayList() {{
add(UUID_1);
add("John");
add("Doe");
}},
new ArrayList() {{
add(UUID_2);
add("Marry");
add("Poppins");
}},
new ArrayList() {{
add(UUID_3);
add("Jim");
add("Jack");
}});
}

@Override
public List random() {
return null;
}

@Override
public String insertStatement() {
return format("INSERT INTO %s.%s (id, name, surname) VALUES (?, ?, ?);", KEYSPACE, TABLE);
}
}
----

## SPI Mechanism

There is a Java SPI mechanism for implementation discovery, so it means that besides implementing API
you have to change the `impl/src/main/resources/META-INF/services/com.instaclustr.sstable.generator.RowMapper`
file containing FQCN of your implemenation on one line.

Once the `impl` jar is placed on the class path, it will be automatically discovered by the `generator` module so
you do not need to use any command-line arguments. Merely putting that JAR on the class path does the job.

The same mechanism works for `cassandra-3/4` jar. In case you want to generate jars by `CQLSSTableWriter`
for Cassandra 3, just put that jar on the class path. If you want to generate "Cassandra 4 SSTables", place the
respective `cassandra-4.jar` on the class path as shown above.

In practice this means that you need to compile only an `impl` module which contains one class so the compilation
and JAR building will take literally a few seconds (less than 1 sec here). The command line arguments for all will look
the same.

## Further Information
- See blog by Anup Shirolkar ["A Comprehensive Guide to Cassandra Architecture"](https://www.instaclustr.com/cassandra-architecture/)
- See blog by Anup Shirolkar ["Apache Cassandra Compaction Strategies
"](https://www.instaclustr.com/apache-cassandra-compaction/)
- Please see https://www.instaclustr.com/support/documentation/announcements/instaclustr-open-source-project-status/ for Instaclustr support status of this project