https://github.com/smiklosovic/cassandra-bulkloader
CLI tool generating Cassandra SSTables
- Host: GitHub
- URL: https://github.com/smiklosovic/cassandra-bulkloader
- Owner: smiklosovic
- License: apache-2.0
- Created: 2019-09-15T17:20:56.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2022-02-26T00:46:28.000Z (almost 3 years ago)
- Last Synced: 2024-10-27T22:32:55.783Z (3 months ago)
- Topics: bulk, cassandra, cli, loader, sstable, sstables
- Language: Java
- Size: 23.4 KB
- Stars: 1
- Watchers: 2
- Forks: 1
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
# cassandra-bulkloader
CLI tool generating Cassandra SSTables

This tool generates SSTables programmatically. It uses Cassandra's `CQLSSTableWriter`.
After the SSTables have been generated, you can load them with the `sstableloader` tool as usual.

The project consists of three modules:
* api - the interface that `impl` is coded against
* impl - implementation of your population logic, depends on `api`
* loader - implementation of the whole loader CLI application, depends on `impl` and `api`.

## Build
`mvn clean install`
## Run
```
java \
-cp /path/to/impl-1.0.jar:/path/to/loader-1.0.jar \
com.instaclustr.cassandra.bulkloader.CLIApplication \
_command_ \
_arguments_
```

Running without a `command` executes the default command, `help`:
```
Usage: [-V] COMMAND
-V, --version print version information and exit
Commands:
csv tool for bulk-loading of data from csv
random tool for bulk-loading of random data
```

### `random` command
```
tool for bulk-loading of random data
-d, --output-dir=[DIRECTORY]
Destination where SSTables will be generated.
-k, --keyspace=[KEYSPACE] Keyspace for which SSTables will be generated.
-t, --table=[TABLE] Table for which SSTables will be generated.
-s, --schema=[PATH] Path to CQL schema where CREATE TABLE statement is
specified.
--sorted Whether input data are already sorted (in terms of
CQL)
--partitioner=
Paritioner used for SSTable generation, defaults to
'murmur'
--bufferSize=
How much data will be buffered before being written as
a new SSTable, in megabytes. Defaults to 128
--numberOfRecords=
Number of records to generate when using random
command
--threads= Number of threads to use for generation.
-f, --file= file to digest, irrelevant for random loader
-h, --help Show this help message and exit.
-V, --version Print version information and exit.
```

### `csv` command
The `csv` command takes the same arguments as `random`, but `--file` is mandatory. The file is expected to be a CSV file
in which each line represents a row. Each row is parsed into a list of strings and passed to your `RowMapper` implementation,
where you have to map them to a list of objects that are used as the values of the Cassandra INSERT statement.

## Row generation
In order to generate data (in the case of the `random` generator), you have to implement the interface
`com.instaclustr.cassandra.bulkloader.RowMapper` defined in the `api` module. This implementation should
be placed in the `impl` module.

## RowMapper interface
```
package com.instaclustr.cassandra.bulkloader;

import java.util.List;

public interface RowMapper {

    /**
     * Maps list of strings from whatever input representing
     * a row to list of objects to insert into Cassandra.
     *
     * @param row where values are consisting of list of strings
     * @return list of objects to put to insert statement
     */
    List<Object> map(final List<String> row);

    /**
     * Logically same as {@link #map(List)} but all data per row
     * needs to be generated inside of the method. The number
     * of items in the returned list has to match number of columns
     * in a row. Each such object represents value which will be
     * passed to Cassandra INSERT statement.
     *
     * This method is called repeatedly. Number of calls
     * is equal to parameter `--numberOfRecords`.
     *
     * @return list of objects to put to insert statement
     */
    List<Object> random();

    /**
     * @return string representation of INSERT INTO statement. Question marks in VALUES are not
     * meant to be replaced.
     *
     * For example: 'INSERT INTO keyspace.table ("field1", "field2", ...) VALUES (?, ?, ?)'
     */
    String insertStatement();
}
```
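
A minimal sketch of what an implementation in the `impl` module might look like. The `demo.users` table, its columns, and the `UserRowMapper` class are hypothetical, and the generic type parameters (`List<Object>`, `List<String>`) are assumed from the Javadoc above. The interface is redeclared locally only so that the example compiles on its own; a real implementation would import it from the `api` module.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.UUID;
import java.util.concurrent.ThreadLocalRandom;

// Local stand-in for com.instaclustr.cassandra.bulkloader.RowMapper,
// declared here only to keep the sketch self-contained.
interface RowMapper {
    List<Object> map(final List<String> row);
    List<Object> random();
    String insertStatement();
}

// Hypothetical mapper for an assumed table:
// CREATE TABLE demo.users (id uuid PRIMARY KEY, name text, age int)
class UserRowMapper implements RowMapper {

    @Override
    public List<Object> map(final List<String> row) {
        // CSV columns are assumed to arrive in the order: id, name, age
        final List<Object> values = new ArrayList<>();
        values.add(UUID.fromString(row.get(0).trim()));
        values.add(row.get(1).trim());
        values.add(Integer.parseInt(row.get(2).trim()));
        return values;
    }

    @Override
    public List<Object> random() {
        // One value per column, in the same order as in insertStatement()
        final ThreadLocalRandom rnd = ThreadLocalRandom.current();
        final List<Object> values = new ArrayList<>();
        values.add(UUID.randomUUID());
        values.add("user-" + rnd.nextInt(1_000_000));
        values.add(rnd.nextInt(18, 100));
        return values;
    }

    @Override
    public String insertStatement() {
        // The question marks stay as they are; they are bind markers
        return "INSERT INTO demo.users (id, name, age) VALUES (?, ?, ?)";
    }
}

class RowMapperExample {
    public static void main(String[] args) {
        final RowMapper mapper = new UserRowMapper();
        final List<String> csvRow =
            Arrays.asList("123e4567-e89b-12d3-a456-426614174000", "alice", "42");
        System.out.println(mapper.map(csvRow));
        System.out.println(mapper.random());
    }
}
```

The Java types of the returned objects (`UUID`, `String`, `Integer`) are chosen to line up with the assumed CQL column types (`uuid`, `text`, `int`); a different schema would need different conversions.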
## SPI mechanism
There is a Java SPI mechanism for implementation discovery, which means that besides implementing the API,
you have to edit the `impl/src/main/resources/META-INF/services/com.instaclustr.cassandra.bulkloader.RowMapper`
file so that it contains the FQCN of your implementation on one line.

Once the impl JAR is placed on the class path, it is automatically discovered by the `loader` module, so
you do not need to use any command-line arguments for it. Merely putting that JAR on the class path does the job.

In practice this means that you only need to compile the `impl` module, which contains one class, so compilation
and JAR building take just a few seconds (less than 1 second here). The command-line arguments stay
the same.
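
The discovery step can be sketched with the standard `java.util.ServiceLoader` API. This is an assumption about how such a lookup is typically done, not a copy of the `loader` module's code; the `RowMapperDiscovery` class and its methods are hypothetical, and the interface is redeclared locally so the example compiles on its own.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Optional;
import java.util.ServiceLoader;

// Local stand-in for com.instaclustr.cassandra.bulkloader.RowMapper,
// declared here only to keep the sketch self-contained.
interface RowMapper {
    List<Object> map(final List<String> row);
    List<Object> random();
    String insertStatement();
}

// Hypothetical helper illustrating SPI-based discovery.
class RowMapperDiscovery {

    // The provider-configuration file is named after the interface's FQCN.
    static String servicesFilePath() {
        return "META-INF/services/" + RowMapper.class.getName();
    }

    // Returns the first implementation registered on the class path, if any.
    static Optional<RowMapper> discover() {
        final ServiceLoader<RowMapper> loader = ServiceLoader.load(RowMapper.class);
        final Iterator<RowMapper> it = loader.iterator();
        return it.hasNext() ? Optional.of(it.next()) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println("Looking for providers listed in " + servicesFilePath());
        final Optional<RowMapper> mapper = discover();
        if (mapper.isPresent()) {
            System.out.println("Found " + mapper.get().getClass().getName());
        } else {
            System.out.println("No RowMapper implementation on the class path");
        }
    }
}
```

With no provider-configuration file on the class path the lookup returns an empty result, which is why the `META-INF/services` file has to name your implementation class.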