Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/instaclustr/cassandra-sstable-generator
Tool for programmatic generation of Cassandra SSTable
bulk cassandra csv generation load netapp-public random sstable
Last synced: about 1 month ago
- Host: GitHub
- URL: https://github.com/instaclustr/cassandra-sstable-generator
- Owner: instaclustr
- License: apache-2.0
- Created: 2020-02-24T20:39:21.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2022-12-14T15:04:07.000Z (almost 2 years ago)
- Last Synced: 2024-09-30T04:05:53.425Z (about 1 month ago)
- Topics: bulk, cassandra, csv, generation, load, netapp-public, random, sstable
- Language: Java
- Homepage: https://instaclustr.com
- Size: 93.8 KB
- Stars: 6
- Watchers: 5
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.adoc
- License: LICENSE
README
# Cassandra-SStable-Generator
_CLI tool for programmatic generation of Cassandra SSTables_
image:https://img.shields.io/maven-central/v/com.instaclustr/sstable-generator.svg?label=Maven%20Central[link=https://search.maven.org/search?q=g:%22com.instaclustr%22%20AND%20a:%22sstable-generator%22]
image:https://circleci.com/gh/instaclustr/cassandra-sstable-generator.svg?style=svg["Instaclustr",link="https://circleci.com/gh/instaclustr/cassandra-sstable-generator"]

- Website: https://www.instaclustr.com/
- Documentation: https://www.instaclustr.com/support/documentation/

This tool generates SSTables programmatically. It uses Cassandra's `CQLSSTableWriter`.
After the SSTables have been generated, you can load them with the `sstableloader` tool as usual.

The project consists of these modules:
* api: impl is coded against this module
* impl: the implementation of your population logic, depends on `api`
* generator: the implementation of the whole generator CLI application
* cassandra-3: the implementation of the SSTable generator using internals of the Cassandra 3 artifact
* cassandra-4: the implementation of the SSTable generator using internals of the Cassandra 4 artifact

## Build
`mvn clean install` (or `mvn clean install -DskipTests`)
## Run
Let's walk through an example. We want to generate an SSTable with the Cassandra 3 API so that we can load it
into Cassandra afterwards. You need the following components on the class path:

* generator jar
* cassandra-3 module jar
* jar with the implementation of your generation logic

----
java \
-cp /path/to/impl-1.0.jar:/path/to/generator-1.0.jar:/path/to/cassandra-3.jar \
com.instaclustr.sstable.generator.LoaderApplication \
_command_ \
_arguments_
----

A concrete example of the invocation would be:
----
java \
-cp impl/target/sstable-generator-impl-1.0.jar:generator/target/sstable-generator-1.0.jar:cassandra-3/target/cassandra-3-1.0.jar \
com.instaclustr.sstable.generator.LoaderApplication \
fixed \
--keyspace mykeyspace \
--table mytable \
--output-dir=/tmp/output \
--schema cassandra-3/src/test/resources/cassandra/cql/table.cql \
--threads 2
----

**Please be aware that you need to have all libraries of Apache Cassandra on the class path as well. For
that reason, please use the `./run.sh` script and modify it to suit your needs in order to generate SSTables.**

Running the tool without a `command` executes the default command, `help`:
----
Usage: [-V] COMMAND
-V, --version print version information and exit
Commands:
csv tool for bulk-loading of data from csv
random tool for bulk-loading of random data
fixed tool for bulk-loading of fixed data
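----

### `random` Command
With this command, you are expected to provide randomly generated data for each row: the `random()` method of your
`RowMapper` implementation is called once per record and returns the values of one row.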
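As a minimal sketch of what such row generation could look like, the stand-alone class below builds one random row per call. It is an illustration only, not code from the project: the class name `RandomRowSketch`, the method name `randomRow`, the value pools, and the column layout `(id, name, surname)` are all assumptions. In a real setup the body of `randomRow` would sit in the `random()` method of your `RowMapper` implementation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.UUID;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical stand-alone sketch; in the real project this logic would live
// in the random() method of your RowMapper implementation.
final class RandomRowSketch {

    // Illustrative value pools (assumptions, not part of the tool).
    private static final List<String> NAMES = Arrays.asList("John", "Mary", "Jim");
    private static final List<String> SURNAMES = Arrays.asList("Doe", "Poppins", "Jack");

    // Builds one row. The number and order of values must match the columns
    // of the INSERT statement, assumed here to be (id, name, surname).
    static List<Object> randomRow() {
        // ThreadLocalRandom avoids contention if rows are generated by several threads.
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        return Arrays.<Object>asList(
                UUID.randomUUID(),
                NAMES.get(rnd.nextInt(NAMES.size())),
                SURNAMES.get(rnd.nextInt(SURNAMES.size())));
    }

    public static void main(String[] args) {
        // Print a few generated rows to show the shape of the data.
        for (int i = 0; i < 3; i++) {
            System.out.println(randomRow());
        }
    }
}
```

Each call returns a fresh list, so nothing is shared between invocations. That seems a reasonable property to aim for when the generator runs with more than one thread.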
### `csv` Command
The `csv` command takes the same arguments as `random`, but `--file` is mandatory. It must point to a CSV file in which
each line represents a row. Each row is parsed into a list of strings and passed to your `RowMapper` implementation, where you
have to map the strings to a list of objects used as the values of a Cassandra INSERT statement.

### `fixed` Command
With the `fixed` command, an SSTable is generated from an exact list of "rows" with columns. The rows are returned
by the `get()` method of your `RowMapper` implementation, as described in the documentation which follows.

## Row Generation
To generate data for any of the three cases above, you have to implement the
`com.instaclustr.sstable.generator.RowMapper` interface from the `api` module. Place the implementation
in `impl` (or its equivalent) and make sure it is on the class path.

## RowMapper Interface
----
import java.util.List;
import java.util.stream.Stream;

public interface RowMapper {

    /**
     * Maps a list of strings representing a row, from whatever input,
     * to a list of objects to insert into Cassandra.
     *
     * This method is called, for example, upon generation from a CSV file.
     *
     * @param row row whose values are given as a list of strings
     * @return list of objects to put into the insert statement
     */
    List<Object> map(final List<String> row);

    /**
     * Used when we do not want to generate data randomly but have an exact list of what to insert.
     *
     * @return rows to be created, each containing a list of cells
     */
    Stream<List<Object>> get();

    /**
     * Logically the same as {@link #map(List)} but all data per row
     * needs to be generated inside of the method. The number
     * of items in the returned list has to match the number of columns
     * in a row. Each such object represents a value which will be
     * passed to the Cassandra INSERT statement.
     *
     * This method is called repeatedly. The number of calls
     * is equal to the parameter `--numberOfRecords`.
     *
     * This method is called upon "random" generation.
     *
     * @return list of objects to put into the insert statement
     */
    List<Object> random();

    /**
     * @return string representation of the INSERT INTO statement. Question marks in VALUES are not
     * meant to be replaced.
     *
     * For example: 'INSERT INTO keyspace.table (field1, field2, field3) VALUES (?, ?, ?)'
     */
    String insertStatement();
}
----

An implementation of `RowMapper` that you would place on the class path could look like this:
----
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import java.util.stream.Stream;

public class RowMapper1 implements RowMapper {

    public static final String KEYSPACE = "mykeyspace";
    public static final String TABLE = "mytable";

    public static final UUID UUID_1 = UUID.randomUUID();
    public static final UUID UUID_2 = UUID.randomUUID();
    public static final UUID UUID_3 = UUID.randomUUID();

    // Not implemented: this example only supplies fixed rows via get().
    @Override
    public List<Object> map(final List<String> row) {
        return null;
    }

    // The exact rows to write; values follow the column order of insertStatement().
    @Override
    public Stream<List<Object>> get() {
        return Stream.of(
            new ArrayList<Object>() {{
                add(UUID_1);
                add("John");
                add("Doe");
            }},
            new ArrayList<Object>() {{
                add(UUID_2);
                add("Mary");
                add("Poppins");
            }},
            new ArrayList<Object>() {{
                add(UUID_3);
                add("Jim");
                add("Jack");
            }});
    }

    // Not implemented: this example only supplies fixed rows via get().
    @Override
    public List<Object> random() {
        return null;
    }

    @Override
    public String insertStatement() {
        return String.format("INSERT INTO %s.%s (id, name, surname) VALUES (?, ?, ?);", KEYSPACE, TABLE);
    }
}
----

## SPI Mechanism
Implementations are discovered through the Java SPI mechanism. Besides implementing the API,
you therefore have to edit the `impl/src/main/resources/META-INF/services/com.instaclustr.sstable.generator.RowMapper`
file so that it contains the FQCN of your implementation on one line.

Once the `impl` jar is placed on the class path, it is discovered automatically by the `generator` module, so
you do not need any command-line arguments for it. Merely putting that JAR on the class path does the job.

The same mechanism works for the `cassandra-3` and `cassandra-4` jars. If you want to generate SSTables with the `CQLSSTableWriter`
of Cassandra 3, put the `cassandra-3` jar on the class path. If you want to generate "Cassandra 4 SSTables", place the
respective `cassandra-4.jar` on the class path instead, in the same position as shown above.

In practice this means that you only need to compile the `impl` module, which contains one class, so compilation
and JAR building take a few seconds at most (less than one second in the author's setup). The command-line arguments look
the same in all cases.

## Further Information
- See blog by Anup Shirolkar ["A Comprehensive Guide to Cassandra Architecture"](https://www.instaclustr.com/cassandra-architecture/)
- See blog by Anup Shirolkar ["Apache Cassandra Compaction Strategies"](https://www.instaclustr.com/apache-cassandra-compaction/)
- Please see https://www.instaclustr.com/support/documentation/announcements/instaclustr-open-source-project-status/ for Instaclustr support status of this project