{"id":15056896,"url":"https://github.com/instaclustr/cassandra-sstable-generator","last_synced_at":"2026-01-11T17:57:26.584Z","repository":{"id":37743651,"uuid":"242838356","full_name":"instaclustr/cassandra-sstable-generator","owner":"instaclustr","description":"Tool for programmatic generation of Cassandra SSTable","archived":false,"fork":false,"pushed_at":"2022-12-14T15:04:07.000Z","size":96,"stargazers_count":6,"open_issues_count":0,"forks_count":1,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-03-24T06:11:27.707Z","etag":null,"topics":["bulk","cassandra","csv","generation","load","netapp-public","random","sstable"],"latest_commit_sha":null,"homepage":"https://instaclustr.com","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/instaclustr.png","metadata":{"files":{"readme":"README.adoc","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-02-24T20:39:21.000Z","updated_at":"2024-08-01T11:59:01.000Z","dependencies_parsed_at":"2023-01-28T23:40:11.482Z","dependency_job_id":null,"html_url":"https://github.com/instaclustr/cassandra-sstable-generator","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/instaclustr%2Fcassandra-sstable-generator","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/instaclustr%2Fcassandra-sstable-generator/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/instaclustr%2Fcassandra-sstable-generator/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/instaclustr%2Fcassandra-sstable-generator/mani
fests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/instaclustr","download_url":"https://codeload.github.com/instaclustr/cassandra-sstable-generator/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248161256,"owners_count":21057554,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bulk","cassandra","csv","generation","load","netapp-public","random","sstable"],"created_at":"2024-09-24T21:57:55.855Z","updated_at":"2026-01-11T17:57:26.538Z","avatar_url":"https://github.com/instaclustr.png","language":"Java","funding_links":[],"categories":["Packages"],"sub_categories":["Tools"],"readme":"# Cassandra-SStable-Generator\n\n_CLI tool for programmatic generation of Cassandra SSTables_\n\nimage:https://img.shields.io/maven-central/v/com.instaclustr/sstable-generator.svg?label=Maven%20Central[link=https://search.maven.org/search?q=g:%22com.instaclustr%22%20AND%20a:%22sstable-generator%22]\nimage:https://circleci.com/gh/instaclustr/cassandra-sstable-generator.svg?style=svg[\"Instaclustr\",link=\"https://circleci.com/gh/instaclustr/cassandra-sstable-generator\"]\n\n- Website: https://www.instaclustr.com/\n- Documentation: https://www.instaclustr.com/support/documentation/\n\nThis tool simply generates SSTables programmatically. 
It uses Cassandra's `CQLSSTableWriter`.\nAfter the generation of SSTables is finished, you can load them with the `sstableloader` tool as usual.\n\nThe project consists of these modules:\n\n* api - impl is coded against this module\n* impl - the implementation of your population logic, depends on `api`\n* generator - the implementation of the whole generator CLI application\n* cassandra-3 - the implementation of the SSTable generator using internals of the Cassandra 3 artifact\n* cassandra-4 - the implementation of the SSTable generator using internals of the Cassandra 4 artifact\n\n## Build\n\n`mvn clean install` (or `mvn clean install -DskipTests`)\n\n## Run\n\nLet's guide you through an example. We want to generate an SSTable with the Cassandra 3 API so we can load it\nto Cassandra afterwards. The components you need to have on the class path are as follows:\n\n* generator jar\n* cassandra-3 module jar\n* jar with the implementation of your generation logic\n\n----\njava \\\n  -cp /path/to/impl-1.0.jar:/path/to/generator-1.0.jar:/path/to/cassandra-3.jar \\\n  com.instaclustr.sstable.generator.LoaderApplication \\\n  _command_ \\\n  _arguments_\n----\n\nA concrete example of the invocation would be:\n\n----\njava \\\n    -cp impl/target/sstable-generator-impl-1.0:generator/target/sstable-generator-1.0.jar:cassandra-3/target/cassandra-3-1.0.jar \\\n    com.instaclustr.sstable.generator.LoaderApplication \\\n    fixed \\\n    --keyspace mykeyspace \\\n    --table mytable \\\n    --output-dir=/tmp/output \\\n    --schema cassandra-3/src/test/resources/cassandra/cql/table.cql \\\n    --threads 2\n----\n\n**Please be aware that you need to have all libraries of Apache Cassandra on the classpath as well. 
For\nthat reason, please use the `./run.sh` script and modify it to suit your needs in order to generate SSTables.**\n\nRunning without a `command` executes the default command, `help`:\n\n----\nUsage: \u003cmain class\u003e [-V] COMMAND\n  -V, --version   print version information and exit\nCommands:\n  csv     tool for bulk-loading of data from csv\n  random  tool for bulk-loading of random data\n  fixed   tool for bulk-loading of fixed data\n----\n\n### `random` Command\n\nWith this command, you are expected to provide data which represents a row in a random fashion.\n\n### `csv` Command\n\nThe `csv` command has the same arguments as `random`, but `--file` is mandatory. It is supposed to be a CSV file which\nrepresents rows. Each row will be parsed into a list of strings passed to the `RowMapper` implementation, where you\nhave to map them to a list of objects used as values for a Cassandra INSERT statement.\n\n### `fixed` Command\n\nWith the `fixed` command, we generate an SSTable from an exact list of \"rows\" with columns. This\nwill be obvious from the documentation which follows.\n\n## Row Generation\n\nIn order to generate data for all three cases above, you have to implement the interface\n`com.instaclustr.cassandra.bulkloader.RowMapper` in the `api` module. This implementation should\nbe placed in `impl` (or its equivalent) and it should be on the class path.\n\n## RowMapper Interface\n\n----\npublic interface RowMapper {\n\n    /**\n     * Maps list of strings from whatever input representing\n     * a row to list of objects to insert into Cassandra.\n     *\u003cp\u003e\n     * This method is e.g. 
called upon generation from CSV file.\n     *\u003cp\u003e\n     * @param row where values are consisting of list of strings\n     * @return list of objects to put to insert statement\n     */\n    List\u003cObject\u003e map(final List\u003cString\u003e row);\n\n    /**\n     * Used when we do not want to generate data randomly but we have exact list of what to insert.\n     *\n     * @return list of rows to be created containing list of cells\n     */\n    Stream\u003cList\u003cObject\u003e\u003e get();\n\n    /**\n     * Logically same as {@link #map(List)} but all data per row\n     * needs to be generated inside of the method. The number\n     * of items in the returned list has to match number of columns\n     * in a row. Each such object represents value which will be\n     * passed to Cassandra INSERT statement.\n     * \u003cp\u003e\n     * This method is called repeatedly. Number of calls\n     * is equal to parameter `--numberOfRecords`.\n     *\u003cp\u003e\n     * This method is called upon \"random\" generation.\n     * @return list of objects to put to insert statement\n     */\n    List\u003cObject\u003e random();\n\n    /**\n     * @return string representation of INSERT INTO statement. 
Question marks in VALUES are not\n     * meant to be replaced.\n     * \u003cp\u003e\n     * For example: 'INSERT INTO keyspace.table (field1, field2, field3) VALUES (?, ?, ?)'\n     */\n    String insertStatement();\n}\n----\n\nThe implementation of `RowMapper` you are supposed to place on the class path would look like this:\n\n----\npublic class RowMapper1 implements RowMapper {\n\n\n    public static final String KEYSPACE = \"mykeyspace\";\n    public static final String TABLE = \"mytable\";\n\n    public static final UUID UUID_1 = UUID.randomUUID();\n    public static final UUID UUID_2 = UUID.randomUUID();\n    public static final UUID UUID_3 = UUID.randomUUID();\n\n    @Override\n    public List\u003cObject\u003e map(final List\u003cString\u003e row) {\n        return null;\n    }\n\n    @Override\n    public Stream\u003cList\u003cObject\u003e\u003e get() {\n        return Stream.of(\n            new ArrayList\u003cObject\u003e() {{\n                add(UUID_1);\n                add(\"John\");\n                add(\"Doe\");\n            }},\n            new ArrayList\u003cObject\u003e() {{\n                add(UUID_2);\n                add(\"Marry\");\n                add(\"Poppins\");\n            }},\n            new ArrayList\u003cObject\u003e() {{\n                add(UUID_3);\n                add(\"Jim\");\n                add(\"Jack\");\n            }});\n    }\n\n    @Override\n    public List\u003cObject\u003e random() {\n        return null;\n    }\n\n    @Override\n    public String insertStatement() {\n        return format(\"INSERT INTO %s.%s (id, name, surname) VALUES (?, ?, ?);\", KEYSPACE, TABLE);\n    }\n}\n----\n\n## SPI Mechanism\n\nImplementations are discovered through the Java SPI mechanism, so besides implementing the API\nyou have to edit the `impl/src/main/resources/META-INF/services/com.instaclustr.sstable.generator.RowMapper`\nfile so that it contains the FQCN of your implementation on one line.\n\nOnce the `impl` jar is placed on the 
class path, it will be automatically discovered by the `generator` module so\nyou do not need to use any command-line arguments. Merely putting that JAR on the class path does the job.\n\nThe same mechanism works for the `cassandra-3/4` jar. In case you want to generate SSTables with `CQLSSTableWriter`\nfor Cassandra 3, just put that jar on the class path. If you want to generate \"Cassandra 4 SSTables\", place the\nrespective `cassandra-4.jar` on the class path as shown above.\n\nIn practice this means that you need to compile only an `impl` module which contains one class, so the compilation\nand JAR building will take literally a few seconds (less than 1 sec here). The command line arguments for all will look\nthe same.\n\n## Further Information\n- See blog by Anup Shirolkar [\"A Comprehensive Guide to Cassandra Architecture\"](https://www.instaclustr.com/cassandra-architecture/)\n- See blog by Anup Shirolkar [\"Apache Cassandra Compaction Strategies\"](https://www.instaclustr.com/apache-cassandra-compaction/)\n- Please see https://www.instaclustr.com/support/documentation/announcements/instaclustr-open-source-project-status/ for Instaclustr support status of this project\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Finstaclustr%2Fcassandra-sstable-generator","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Finstaclustr%2Fcassandra-sstable-generator","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Finstaclustr%2Fcassandra-sstable-generator/lists"}