{"id":20564437,"url":"https://github.com/tarantool/cartridge-spark","last_synced_at":"2025-04-14T15:13:07.853Z","repository":{"id":39615835,"uuid":"295963986","full_name":"tarantool/cartridge-spark","owner":"tarantool","description":"Tarantool connector for Apache Spark","archived":false,"fork":false,"pushed_at":"2024-02-07T21:25:58.000Z","size":218,"stargazers_count":4,"open_issues_count":8,"forks_count":2,"subscribers_count":16,"default_branch":"master","last_synced_at":"2025-04-14T15:12:57.181Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tarantool.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":"AUTHORS","dei":null}},"created_at":"2020-09-16T07:58:54.000Z","updated_at":"2024-01-29T19:50:12.000Z","dependencies_parsed_at":"2024-02-06T23:49:41.028Z","dependency_job_id":null,"html_url":"https://github.com/tarantool/cartridge-spark","commit_stats":null,"previous_names":[],"tags_count":11,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tarantool%2Fcartridge-spark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tarantool%2Fcartridge-spark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tarantool%2Fcartridge-spark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tarantool%2Fcartridge-spark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tarantool","download_url":"https://codeload.github.com/tarantool/cartridge-spark/tar.
gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248904637,"owners_count":21180835,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-16T04:26:37.554Z","updated_at":"2025-04-14T15:13:07.827Z","avatar_url":"https://github.com/tarantool.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![Build Status](https://github.com/tarantool/cartridge-spark/workflows/ubuntu-master/badge.svg)](https://github.com/tarantool/cartridge-spark/actions)\n[![CodeCov](https://codecov.io/gh/tarantool/cartridge-spark/branch/master/graph/badge.svg)](https://codecov.io/gh/tarantool/cartridge-spark)\n\n# spark-tarantool-connector\n\nApache Spark connector for Tarantool and Tarantool Cartridge\n\n## Building\n\nBuild the project using [sbt](https://www.scala-sbt.org/) (run the `sbt test` command).\n\n## Linking\n\nTo link against this library with Maven, use the following coordinates:\n\n```xml\n\u003cdependency\u003e\n  \u003cgroupId\u003eio.tarantool\u003c/groupId\u003e\n  \u003cartifactId\u003espark-tarantool-connector\u003c/artifactId\u003e\n  \u003cversion\u003e0.7.0\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\nor for `sbt`:\n\n```\nlibraryDependencies += \"io.tarantool\" %% \"spark-tarantool-connector\" % \"0.7.0\"\n```\n\n## Version Compatibility\n\n| Connector | Scala   | Apache Spark | Tarantool Server | Tarantool Cartridge Driver |\n| --------- | ------- |--------------| ---------------- |----------------------------|\n| 0.x.x     | 2.11.12 | 2.4  
        | 1.10.9+,  2.4+   | 0.10.1+                    |\n| 0.x.x     | 2.12.16 | 3.2          | 1.10.9+,  2.4+   | 0.10.1+                    |\n| 0.x.x     | 2.13.10 | 3.2          | 1.10.9+,  2.4+   | 0.10.1+                    |\n\n## Getting Started\n\n### Configuration properties\n\n| property-key                  | description                                                                                                    | default value  |\n|-------------------------------|----------------------------------------------------------------------------------------------------------------|----------------|\n| tarantool.hosts               | comma separated list of Tarantool hosts                                                                        | 127.0.0.1:3301 |\n| tarantool.username            | basic authentication user                                                                                      | guest          |\n| tarantool.password            | basic authentication password                                                                                  |                |\n| tarantool.connectTimeout      | server connect timeout, in milliseconds                                                                        | 1000           |\n| tarantool.readTimeout         | socket read timeout, in milliseconds                                                                           | 1000           |\n| tarantool.requestTimeout      | request completion timeout, in milliseconds                                                                    | 2000           |\n| tarantool.connections         | number of connections established with each host                                                               | 1              |\n| tarantool.cursorBatchSize     | default limit for prefetching tuples in RDD iterator                                                           | 1000           |\n| tarantool.retries.errorType   | configures automatic 
retry of requests to Tarantool cluster. Possible values: \"network\", \"none\"                | none           |\n| tarantool.retries.maxAttempts | maximum number of retry attempts for each request. Mandatory if errorType is set to \"network\"                  |                |\n| tarantool.retries.delay       | delay between subsequent retries of each request (in milliseconds). Mandatory if errorType is set to \"network\" |                |\n\n### Dataset API request options\n\n| property-key                  | description                                                                                                                                                                                    | default value |\n|-------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|\n| tarantool.space               | Tarantool space name. 
Mandatory option                                                                                                                                                         |               |\n| tarantool.batchSize           | limit of records to be read or written at once                                                                                                                                                 | 1000          |\n| tarantool.stopOnError         | stop writing immediately after a batch fails with an exception or not all tuples are written                                                                                                   | true          |\n| tarantool.rollbackOnError     | roll back all changes written in scope of the last batch to a replicaset where an exception occurred                                                                                           | true          |\n| tarantool.transformFieldNames | possible values: none (default), snake_case, lower_case, upper_case. Necessary if the field names in datasets built from Spark SQL queries do not correspond to the field names in Tarantool   | none          |\n\n#### Prerequisites\n\nThe Spark connector requires a deployed [Tarantool Cartridge](https://github.com/tarantool/cartridge) application with the [tarantool/crud](https://github.com/tarantool/crud) module installed. See the version compatibility table in the previous section.\n\nYou may also use an official [Tarantool Docker image](https://hub.docker.com/r/tarantool/tarantool/tags), but it requires configuring the Cartridge cluster, so it is recommended to take the example configuration [from the connector tests](https://github.com/tarantool/cartridge-spark/blob/master/src/test/resources/Dockerfile).\n\n#### Example\n\nUsing Scala:\n```scala\n    // 1. 
Set up the Spark session\n    val spark = SparkSession.builder()\n       .config(\"tarantool.hosts\", \"127.0.0.1:3301\")\n       .config(\"tarantool.username\", \"admin\")\n       .config(\"tarantool.password\", \"password\")\n       .getOrCreate()\n    \n    val sc = spark.sparkContext\n    \n    // 2. Load the whole space\n    val rdd: Array[TarantoolTuple] = sc.tarantoolSpace(\"test_space\").collect()\n\n    // 3. Filter using conditions\n    // This mapper will be used implicitly for tuple conversion\n    val mapper = DefaultMessagePackMapperFactory.getInstance().defaultComplexTypesMapper()\n    \n    val startTuple = new DefaultTarantoolTupleFactory(mapper).create(List(1).asJava)\n    val cond: Conditions = Conditions\n        .indexGreaterThan(\"id\", List(1).asJava)\n        .withLimit(2)\n        .startAfter(startTuple)\n    val tuples: Array[TarantoolTuple] = sc.tarantoolSpace(\"test_space\", cond).collect()\n\n    // 4. Load the whole space into a DataFrame\n    val df = spark.read\n      .format(\"org.apache.spark.sql.tarantool\")\n      .option(\"tarantool.space\", \"test_space\")\n      .load()\n    \n    // Space schema from Tarantool will be used for mapping the tuple fields\n    val tupleIDs: Array[Int] = df.select(\"id\").rdd.map(row =\u003e row.getInt(0)).collect()\n\n    // 5. Write a Dataset to a Tarantool space\n\n    // Convert objects to Rows\n    val rows = Seq(\n      Book(1, null, \"Don Quixote\", \"Miguel de Cervantes\", 1605),\n      Book(2, null, \"The Great Gatsby\", \"F. Scott Fitzgerald\", 1925),\n      Book(3, null, \"War and Peace\", \"Leo Tolstoy\", 1869)\n    ).map(obj =\u003e Row(obj.id, obj.bucketId, obj.bookName, obj.author, obj.year))\n\n    // Extract an object schema using built-in Encoders\n    val orderSchema = Encoders.product[Book].schema\n\n    // Populate the Dataset\n    val ds = spark.createDataFrame(rows, orderSchema)\n\n    // Write to the space. 
Different modes are supported\n    ds.write\n      .format(\"org.apache.spark.sql.tarantool\")\n      .mode(SaveMode.Overwrite)\n      .option(\"tarantool.space\", \"test_space\")\n      .save()\n```\n\nor Java:\n```java\n    // 1. Set up the Spark context\n    SparkConf conf = new SparkConf()\n        .set(\"tarantool.hosts\", \"127.0.0.1:3301\")\n        .set(\"tarantool.username\", \"admin\")\n        .set(\"tarantool.password\", \"password\");\n\n    JavaSparkContext jsc = new JavaSparkContext(conf);\n\n    // 2. Load all tuples from a space using custom tuple to POJO conversion\n    List\u003cBook\u003e tuples = TarantoolSpark.contextFunctions(jsc)\n        .tarantoolSpace(\"test_space\", Conditions.any(), t -\u003e {\n            Book book = new Book();\n            book.id = t.getInteger(\"id\");\n            book.name = t.getString(\"name\");\n            book.author = t.getString(\"author\");\n            book.year = t.getInteger(\"year\");\n            return book;\n        }, Book.class).collect();\n    \n    // 3. Load all tuples from a space into a Dataset\n    Dataset\u003cRow\u003e ds = spark().read()\n        .format(\"org.apache.spark.sql.tarantool\")\n        .option(\"tarantool.space\", \"test_space\")\n        .load();\n\n    ds.select(\"id\").rdd().toJavaRDD().map(row -\u003e row.get(0)).collect();\n    \n    // 4. 
Write a Dataset to a Tarantool space\n        \n    // Create the schema first\n    StructField[] structFields = new StructField[5];\n    structFields[0] = new StructField(\"id\", DataTypes.IntegerType, false, Metadata.empty());\n    structFields[1] = new StructField(\"bucket_id\", DataTypes.IntegerType, false, Metadata.empty());\n    structFields[2] = new StructField(\"book_name\", DataTypes.StringType, false, Metadata.empty());\n    structFields[3] = new StructField(\"author\", DataTypes.StringType, false, Metadata.empty());\n    structFields[4] = new StructField(\"year\", DataTypes.IntegerType, true, Metadata.empty());\n\n    StructType schema = new StructType(structFields);\n\n    // Populate the Dataset\n    List\u003cRow\u003e data = new ArrayList\u003c\u003e(3);\n    data.add(RowFactory.create(1, null, \"Don Quixote\", \"Miguel de Cervantes\", 1605));\n    data.add(RowFactory.create(2, null, \"The Great Gatsby\", \"F. Scott Fitzgerald\", 1925));\n    data.add(RowFactory.create(3, null, \"War and Peace\", \"Leo Tolstoy\", 1869));\n\n    Dataset\u003cRow\u003e ds = sqlContext.createDataFrame(data, schema);\n\n    // Write to the space. Different modes are supported\n    ds.write()\n        .format(\"org.apache.spark.sql.tarantool\")\n        .mode(SaveMode.Overwrite)\n        .option(\"tarantool.space\", \"test_space\")\n        .save();\n```\n\n## Supported DataSet write modes\n\nThe following table describes what happens when a DataSet is written in each mode.\nAll modes assume that every space used in the operation already exists; otherwise, an error is produced.\n\n| Mode          | How it works                                                                                   |\n|---------------|------------------------------------------------------------------------------------------------|\n| Append        | If a record with the given primary key exists, it will be replaced, and inserted otherwise.    
|\n| Overwrite     | The space will be truncated before writing the DataSet, and then the records will be inserted. |\n| ErrorIfExists | If the space is not empty, an error will be produced; otherwise, the records will be inserted. |\n| Ignore        | If the space is not empty, no records will be inserted and no errors will be produced.         |\n\n## Batch writing modes\n\nBatch operations are supported for more efficient writing of data into the Tarantool cluster. They are enabled by default,\nbut the error handling differs depending on the values of the options `rollbackOnError` and `stopOnError`. The first option\nis simply propagated to the [tarantool/crud](https://github.com/tarantool/crud) library methods and currently\nonly allows rolling back the last batch of changes on a single replicaset when an exception has occurred with a tuple from\nthis replicaset. The data successfully written to other replicasets in scope of the failed batch, and the data written\nin the previous batches, will remain in place. The second option is also propagated to the `tarantool/crud` library.\nIf it is set to `false`, the writing of batches will continue even in the case of errors. The list of errors will be\nreturned after an attempt has been made to write all the data to the cluster. This variant may be useful for the `Append`\nwrite mode only. If the `stopOnError` value is `true` (default), the batch writing will stop on the next batch after\na batch fails with an exception or not all tuples in the last batch were written.\n\n## Learn more\n\n- [Tarantool](https://www.tarantool.io/)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftarantool%2Fcartridge-spark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftarantool%2Fcartridge-spark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftarantool%2Fcartridge-spark/lists"}