{"id":13571313,"url":"https://github.com/streamnative/pulsar-spark","last_synced_at":"2026-02-06T04:00:35.179Z","repository":{"id":35445042,"uuid":"194587328","full_name":"streamnative/pulsar-spark","owner":"streamnative","description":"Spark Connector to read and write with Pulsar","archived":false,"fork":false,"pushed_at":"2025-12-05T18:49:52.000Z","size":764,"stargazers_count":116,"open_issues_count":14,"forks_count":52,"subscribers_count":30,"default_branch":"master","last_synced_at":"2026-01-02T16:46:36.337Z","etag":null,"topics":["apache-pulsar","apache-spark","batch-processing","data-processing","data-science","flink","spark","spark-sql","stream-processing","structured-streaming"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/streamnative.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2019-07-01T02:36:19.000Z","updated_at":"2025-12-05T18:41:58.000Z","dependencies_parsed_at":"2023-02-17T22:01:16.344Z","dependency_job_id":"d6c077f3-bb0b-4996-9aa6-6f53f371bc13","html_url":"https://github.com/streamnative/pulsar-spark","commit_stats":{"total_commits":190,"total_committers":23,"mean_commits":8.26086956521739,"dds":0.7315789473684211,"last_synced_commit":"c656f07a9fd02983d652b5a2e3436098cf8574c3"},"previous_names":[],"tags_count":43,"template":false,"template_full_name":null,"purl":"pkg:github/streamnative/pulsar-spark","repository_url":"https://repo
s.ecosyste.ms/api/v1/hosts/GitHub/repositories/streamnative%2Fpulsar-spark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/streamnative%2Fpulsar-spark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/streamnative%2Fpulsar-spark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/streamnative%2Fpulsar-spark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/streamnative","download_url":"https://codeload.github.com/streamnative/pulsar-spark/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/streamnative%2Fpulsar-spark/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29149573,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-06T02:39:25.012Z","status":"ssl_error","status_checked_at":"2026-02-06T02:37:22.784Z","response_time":59,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-pulsar","apache-spark","batch-processing","data-processing","data-science","flink","spark","spark-sql","stream-processing","structured-streaming"],"created_at":"2024-08-01T14:01:00.893Z","updated_at":"2026-02-06T04:00:35.173Z","avatar_url":"https://github.com/streamnative.png","language":"Scala","funding_links":[],"categories":["Scala","Data 
Processing","大数据"],"sub_categories":[],"readme":"# pulsar-spark\n\n[![Version](https://img.shields.io/github/release/streamnative/pulsar-spark/all.svg)](https://github.com/streamnative/pulsar-spark/releases)\n[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://www.apache.org/licenses/LICENSE-2.0)\n\nUnified data processing with [Apache Pulsar](https://pulsar.apache.org) and [Apache Spark](https://spark.apache.org).\n\n## Prerequisites\n\n- Java 17 or later\n- Spark 4.1.1 or later\n- Pulsar 3.0 or later\n\n## Version Compatibility Matrix\n\nThe following table shows the tested and supported version combinations:\n\n| Connector Version | Spark Version | Pulsar Client | Pulsar Service | Scala Version | Status |\n|-------------------|---------------|---------------|----------------|---------------|--------|\n| 4.1.1.x           | 4.1.1         | 4.0.5         | 3.0 - 4.x      | 2.13.17       | Current |\n| 4.0.1.x           | 4.0.1         | 4.0.5         | 3.0 - 4.x      | 2.13.16       | Stable |\n| 3.5.6.x           | 3.5.6         | 4.0.5         | 3.0 - 4.x      | 2.13          | Stable |\n| 3.5.2.x           | 3.5.2         | 4.0.5         | 3.0 - 4.x      | 2.13          | Stable |\n| 3.4.1.x           | 3.4.1         | 2.10.5        | 2.10 - 3.x     | 2.13          | Legacy |\n| 3.4.0.x           | 3.4.0         | 2.10.2        | 2.10 - 3.x     | 2.13          | Legacy |\n\n**Notes**:\n- **Connector Version**: Follows the Spark major.minor version pattern\n- **Pulsar Client**: The version of Pulsar client library bundled with the connector\n- **Pulsar Service**: Compatible Pulsar broker/service versions\n- Each connector version is built and tested against the specific Spark version listed\n- Pulsar client 4.x provides the best compatibility with Pulsar service 3.0+ clusters\n- For Pulsar 2.x clusters, use connector versions 3.4.x with Pulsar client 2.x\n\n## Preparations\n\n### Link\n\n#### Client library  \nFor Scala/Java 
applications using SBT/Maven project definitions, link your application with the following artifact:\n\n```\n    groupId = io.streamnative.connectors\n    artifactId = pulsar-spark-connector_{{SCALA_BINARY_VERSION}}\n    version = {{PULSAR_SPARK_VERSION}}\n```\n\n### Deploy\n\n#### Client library  \nAs with any Spark application, `spark-submit` is used to launch your application.  \n`pulsar-spark-connector_{{SCALA_BINARY_VERSION}}`\nand its dependencies can be directly added to `spark-submit` using `--packages`.  \n\nExample\n\n```\n$ ./bin/spark-submit \\\n  --packages io.streamnative.connectors:pulsar-spark-connector_{{SCALA_BINARY_VERSION}}:{{PULSAR_SPARK_VERSION}} \\\n  ...\n```\n\n#### CLI  \nFor experimenting in `spark-shell` (or `pyspark` for Python), you can also use `--packages` to add `pulsar-spark-connector_{{SCALA_BINARY_VERSION}}` and its dependencies directly.\n\nExample\n\n```\n$ ./bin/spark-shell \\\n  --packages io.streamnative.connectors:pulsar-spark-connector_{{SCALA_BINARY_VERSION}}:{{PULSAR_SPARK_VERSION}} \\\n  ...\n```\n\nWhen locating an artifact or library, the `--packages` option checks the following repositories in order:\n\n1. Local Maven repository\n\n2. Maven Central repository\n\n3. 
Other repositories specified by `--repositories`\n\nThe format for the coordinates should be `groupId:artifactId:version`.\n\nFor more information about **submitting applications with external dependencies**, see [Application Submission Guide](https://spark.apache.org/docs/latest/submitting-applications.html).\n\n## Usage\n\n### Read data from Pulsar\n\n#### Create a Pulsar source for streaming queries\nThe following examples are in Scala.\n```scala\n// Subscribe to 1 topic\nval df = spark\n  .readStream\n  .format(\"pulsar\")\n  .option(\"service.url\", \"pulsar://localhost:6650\")\n  .option(\"topic\", \"topic1\")\n  .load()\ndf.selectExpr(\"CAST(__key AS STRING)\", \"CAST(value AS STRING)\")\n  .as[(String, String)]\n\n// Subscribe to multiple topics\nval df = spark\n  .readStream\n  .format(\"pulsar\")\n  .option(\"service.url\", \"pulsar://localhost:6650\")\n  .option(\"topics\", \"topic1,topic2\")\n  .load()\ndf.selectExpr(\"CAST(__key AS STRING)\", \"CAST(value AS STRING)\")\n  .as[(String, String)]\n\n// Subscribe to a topic pattern\nval df = spark\n  .readStream\n  .format(\"pulsar\")\n  .option(\"service.url\", \"pulsar://localhost:6650\")\n  .option(\"topicsPattern\", \"topic.*\")\n  .load()\ndf.selectExpr(\"CAST(__key AS STRING)\", \"CAST(value AS STRING)\")\n  .as[(String, String)]\n```\n\n\u003e #### Tip\n\u003e For more information on how to use other language bindings for Spark Structured Streaming,\n\u003e see [Structured Streaming Programming Guide](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html).\n\n#### Create a Pulsar source for batch queries\nIf you have a use case that is better suited to batch processing,\nyou can create a Dataset/DataFrame for a defined range of offsets.\n\nThe following examples are in Scala.\n```scala\n\n// Subscribe to 1 topic defaults to the earliest and latest offsets\nval df = spark\n  .read\n  .format(\"pulsar\")\n  .option(\"service.url\", \"pulsar://localhost:6650\")\n  
.option(\"topic\", \"topic1\")\n  .load()\ndf.selectExpr(\"CAST(__key AS STRING)\", \"CAST(value AS STRING)\")\n  .as[(String, String)]\n\n// Subscribe to multiple topics, specifying explicit Pulsar offsets\nimport org.apache.spark.sql.pulsar.JsonUtils._\nval startingOffsets = topicOffsets(Map(\"topic1\" -\u003e messageId1, \"topic2\" -\u003e messageId2))\nval endingOffsets = topicOffsets(...)\nval df = spark\n  .read\n  .format(\"pulsar\")\n  .option(\"service.url\", \"pulsar://localhost:6650\")\n  .option(\"topics\", \"topic1,topic2\")\n  .option(\"startingOffsets\", startingOffsets)\n  .option(\"endingOffsets\", endingOffsets)\n  .load()\ndf.selectExpr(\"CAST(__key AS STRING)\", \"CAST(value AS STRING)\")\n  .as[(String, String)]\n\n// Subscribe to a pattern, at the earliest and latest offsets\nval df = spark\n  .read\n  .format(\"pulsar\")\n  .option(\"service.url\", \"pulsar://localhost:6650\")\n  .option(\"topicsPattern\", \"topic.*\")\n  .option(\"startingOffsets\", \"earliest\")\n  .option(\"endingOffsets\", \"latest\")\n  .load()\ndf.selectExpr(\"CAST(__key AS STRING)\", \"CAST(value AS STRING)\")\n  .as[(String, String)]\n```\n\n### Write data to Pulsar\n\nThe DataFrame written to Pulsar can have an arbitrary schema. Each record in the DataFrame is transformed into one message sent to Pulsar, and the DataFrame fields are divided into two groups: the `__key`, `__eventTime` and `__messageProperties` fields are encoded as metadata of the Pulsar message; all other fields are grouped, encoded using AVRO, and put in `value()`:\n```scala\nproducer.newMessage().key(__key).value(avro_encoded_fields).eventTime(__eventTime)\n```\n\n#### Create a Pulsar sink for streaming queries\nThe following examples are in Scala.\n```scala\n\n// Write key-value data from a DataFrame to a specific Pulsar topic specified in an option\nval ds = df\n  .selectExpr(\"CAST(__key AS STRING)\", \"CAST(value AS STRING)\")\n  .writeStream\n  .format(\"pulsar\")\n  .option(\"service.url\", 
\"pulsar://localhost:6650\")\n  .option(\"topic\", \"topic1\")\n  .start()\n\n// Write key-value data from a DataFrame to Pulsar using a topic specified in the data\nval ds = df\n  .selectExpr(\"__topic\", \"CAST(__key AS STRING)\", \"CAST(value AS STRING)\")\n  .writeStream\n  .format(\"pulsar\")\n  .option(\"service.url\", \"pulsar://localhost:6650\")\n  .start()\n```\n\n#### Write the output of batch queries to Pulsar\nThe following examples are in Scala.\n```scala\n\n// Write key-value data from a DataFrame to a specific Pulsar topic specified in an option\ndf.selectExpr(\"CAST(__key AS STRING)\", \"CAST(value AS STRING)\")\n  .write\n  .format(\"pulsar\")\n  .option(\"service.url\", \"pulsar://localhost:6650\")\n  .option(\"topic\", \"topic1\")\n  .save()\n\n// Write key-value data from a DataFrame to Pulsar using a topic specified in the data\ndf.selectExpr(\"__topic\", \"CAST(__key AS STRING)\", \"CAST(value AS STRING)\")\n  .write\n  .format(\"pulsar\")\n  .option(\"service.url\", \"pulsar://localhost:6650\")\n  .save()\n```\n\n#### Limitations\n\nCurrently, we provide at-least-once semantic. 
Consequently, when writing either streaming queries or batch queries to Pulsar, some records may be duplicated.\nA possible solution to remove duplicates when reading the written data could be to introduce a primary (unique) key that can be used to perform de-duplication when reading.\n\n\n## Configurations\n\n\u003ctable class=\"table\"\u003e\n\u003ctr\u003e\u003cth\u003eOption\u003c/th\u003e\u003cth\u003eValue\u003c/th\u003e\u003cth\u003eRequired\u003c/th\u003e\u003cth\u003eDefault\u003c/th\u003e\u003cth\u003eQueryType\u003c/th\u003e\u003cth\u003eDescription\u003c/th\u003e\u003c/tr\u003e\n\u003ctr\u003e\n  \u003ctd\u003e`service.url`\u003c/td\u003e\n  \u003ctd\u003eThe Pulsar `serviceUrl` String\u003c/td\u003e\n  \u003ctd\u003eYes\u003c/td\u003e\n  \u003ctd\u003eNone\u003c/td\u003e\n  \u003ctd\u003eStreaming and Batch\u003c/td\u003e\n  \u003ctd\u003eThe Pulsar `serviceUrl` configuration for Pulsar service. Example: \"pulsar://localhost:6650\".\u003c/td\u003e\n\u003c/tr\u003e\n\n\u003ctr\u003e\n  \u003ctd\u003e`admin.url`\u003c/td\u003e\n  \u003ctd\u003eA service HTTP URL of your Pulsar cluster\u003c/td\u003e\n  \u003ctd\u003eNo\u003c/td\u003e\n  \u003ctd\u003eNone\u003c/td\u003e\n  \u003ctd\u003eStreaming and Batch\u003c/td\u003e\n  \u003ctd\u003eThe Pulsar `serviceHttpUrl` configuration. Only needed when `maxBytesPerTrigger` is specified\u003c/td\u003e\n\u003c/tr\u003e\n\n\u003ctr\u003e\n  \u003ctd\u003e`maxBytesPerTrigger`\u003c/td\u003e\n  \u003ctd\u003eA long value in unit of number of bytes\u003c/td\u003e\n  \u003ctd\u003eNo\u003c/td\u003e\n  \u003ctd\u003eNone\u003c/td\u003e\n  \u003ctd\u003eStreaming and Batch\u003c/td\u003e\n  \u003ctd\u003eA soft limit of the maximum number of bytes we want to process per microbatch. 
If this is specified, `admin.url` also needs to be specified.\u003c/td\u003e\n\u003c/tr\u003e\n\n\u003ctr\u003e\n  \u003ctd\u003e`predefinedSubscription`\u003c/td\u003e\n  \u003ctd\u003eA Subscription name string\u003c/td\u003e\n  \u003ctd\u003eNo\u003c/td\u003e\n  \u003ctd\u003eNone\u003c/td\u003e\n  \u003ctd\u003eStreaming and Batch\u003c/td\u003e\n  \u003ctd\u003eThe predefined subscription name used by the connector to track spark application progress.\u003c/td\u003e\n\u003c/tr\u003e\n\n\u003ctr\u003e\n    \u003ctd\u003e`subscriptionPrefix`\u003c/td\u003e\n    \u003ctd\u003eA subscription prefix string\u003c/td\u003e\n    \u003ctd\u003eNo\u003c/td\u003e\n    \u003ctd\u003eNone\u003c/td\u003e\n    \u003ctd\u003eStreaming and Batch\u003c/td\u003e\n    \u003ctd\u003eA prefix used by the connector to generate a random subscription to track spark application progress.\u003c/td\u003e\n\u003c/tr\u003e\n\n\u003ctr\u003e\n  \u003ctd\u003e`topic`\u003c/td\u003e\n  \u003ctd\u003eA topic name string\u003c/td\u003e\n  \u003ctd\u003eYes\u003c/td\u003e\n  \u003ctd\u003eNone\u003c/td\u003e\n  \u003ctd\u003eStreaming and Batch\u003c/td\u003e\n  \u003ctd\u003eThe topic to be consumed.\n  Only one of `topic`, `topics` or `topicsPattern`\n  options can be specified for Pulsar source.\u003c/td\u003e\n\u003c/tr\u003e\n\n\u003ctr\u003e\n  \u003ctd\u003e`topics`\u003c/td\u003e\n  \u003ctd\u003eA comma-separated list of topics\u003c/td\u003e\n  \u003ctd\u003eYes\u003c/td\u003e\n  \u003ctd\u003eNone\u003c/td\u003e\n  \u003ctd\u003eStreaming and Batch\u003c/td\u003e \n  \u003ctd\u003eThe topic list to be consumed.\n  Only one of `topic`, `topics` or `topicsPattern`\n  options can be specified for Pulsar source.\u003c/td\u003e\n\u003c/tr\u003e\n\n\u003ctr\u003e\n  \u003ctd\u003e`topicsPattern`\u003c/td\u003e\n  \u003ctd\u003eA Java regex string\u003c/td\u003e\n  \u003ctd\u003eYes\u003c/td\u003e\n  \u003ctd\u003eNone\u003c/td\u003e\n  \u003ctd\u003eStreaming and Batch\u003c/td\u003e\n  
\u003ctd\u003eThe pattern used to subscribe to topic(s).\n  Only one of `topic`, `topics` or `topicsPattern`\n  options can be specified for Pulsar source.\u003c/td\u003e\n\u003c/tr\u003e\n\n\u003ctr\u003e\n  \u003ctd\u003e`pollTimeoutMs`\u003c/td\u003e\n  \u003ctd\u003eA number string, in milliseconds\u003c/td\u003e\n  \u003ctd\u003eNo\u003c/td\u003e\n  \u003ctd\u003e\"120000\"\u003c/td\u003e\n  \u003ctd\u003eStreaming and Batch\u003c/td\u003e\n  \u003ctd\u003eThe timeout for reading messages from Pulsar. Example: `6000`.\u003c/td\u003e\n\u003c/tr\u003e\n\n\u003ctr\u003e\n  \u003ctd\u003e`waitingForNonExistedTopic`\u003c/td\u003e\n  \u003ctd\u003eThe following are valid values: true or false\u003c/td\u003e\n  \u003ctd\u003eNo\u003c/td\u003e\n  \u003ctd\u003e\"false\"\u003c/td\u003e\n  \u003ctd\u003eStreaming and Batch\u003c/td\u003e\n  \u003ctd\u003eWhether the connector should wait until the desired topics are created. 
\n  By default, the connector does not wait for the topic.\u003c/td\u003e\n\u003c/tr\u003e\n\n\u003ctr\u003e\n  \u003ctd\u003e`startingOffsets`\u003c/td\u003e\n  \u003ctd\u003eThe following are valid values:\u003cbr\u003e\n\n  * \"earliest\" (streaming and batch queries)\u003cbr\u003e\n\n  * \"latest\" (streaming query)\u003cbr\u003e\n\n  * A JSON string\u003cbr\u003e\n\n    **Example**\u003cbr\u003e\n\n    \"\"\" {\"topic-1\":[8,11,16,101,24,1,32,1],\"topic-5\":[8,15,16,105,24,5,32,5]} \"\"\"\n  \u003c/td\u003e\n  \u003ctd\u003eNo\u003c/td\u003e\n  \u003ctd\u003e\n\n   * \"earliest\" (batch query)\u003cbr\u003e\n\n   * \"latest\" (streaming query)\u003c/td\u003e\n  \u003ctd\u003eStreaming and batch queries\u003c/td\u003e\n  \u003ctd\u003e\n\n  The `startingOffsets` option controls where a reader starts reading data.\n\n  * \"earliest\": when no valid offset exists, the reader reads all the data in the partition, starting from the very beginning.\u003cbr\u003e\n\n* \"latest\": when no valid offset exists, the reader reads from the newest records written after the 
reader starts running.\u003cbr\u003e\n\n* A JSON string: specifies a starting offset for each topic. \u003cbr\u003e\nYou can use `org.apache.spark.sql.pulsar.JsonUtils.topicOffsets(Map[String, MessageId])` to convert message offsets to a JSON string. \u003cbr\u003e\n\n**Note**: \u003cbr\u003e\n\n* For batch queries, \"latest\" is not allowed, whether specified implicitly or as `MessageId.latest ([8,-1,-1,-1,-1,-1,-1,-1,-1,127,16,-1,-1,-1,-1,-1,-1,-1,-1,127])` in JSON.\u003cbr\u003e\n\n* For streaming queries, \"latest\" only applies when a new query is started; a resumed query\n  always picks up from where it left off. Partitions newly discovered during a query start at\n  \"earliest\".\n  \u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n  \u003ctd\u003e`endingOffsets`\u003c/td\u003e\n  \u003ctd\u003eThe following are valid values:\u003cbr\u003e\n\n  * \"latest\" (batch query)\u003cbr\u003e\n\n  * A JSON string\u003cbr\u003e\n\n   **Example**\u003cbr\u003e\n\n   {\"topic-1\":[8,12,16,102,24,2,32,2],\"topic-5\":[8,16,16,106,24,6,32,6]}\n\n  \u003c/td\u003e\n  \u003ctd\u003eNo\u003c/td\u003e\n  \u003ctd\u003e\"latest\"\u003c/td\u003e\n  \u003ctd\u003eBatch query\u003c/td\u003e\n  \u003ctd\u003e\n\n  The `endingOffsets` option controls where a reader stops reading data.\n\n  * \"latest\": the reader stops reading data at the latest record.\n\n * A JSON string: specifies an ending offset for each topic.\u003cbr\u003e\n\n    **Note**: \u003cbr\u003e\n\n    `MessageId.earliest ([8,-1,-1,-1,-1,-1,-1,-1,-1,-1,1,16,-1,-1,-1,-1,-1,-1,-1,-1,-1,1])` is not allowed.\n  \u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n    \u003ctd\u003e`startingTime`\u003c/td\u003e\n    \u003ctd\u003e A number in milliseconds \u003c/td\u003e\n    \u003ctd\u003eNo\u003c/td\u003e\n    \u003ctd\u003eNone\u003c/td\u003e\n    \u003ctd\u003ebatch queries\u003c/td\u003e\n    \u003ctd\u003e\n       For batch queries, you can set a starting offset as a timestamp in milliseconds. 
\u003cbr\u003e\n       The target time of this option is the message publishTime. \u003cbr\u003e\n       Example: `1709254800000` (2024-03-01 01:00:00 UTC)\n    \u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n    \u003ctd\u003e`endingTime`\u003c/td\u003e\n    \u003ctd\u003e A number in milliseconds \u003c/td\u003e\n    \u003ctd\u003eNo\u003c/td\u003e\n    \u003ctd\u003eNone\u003c/td\u003e\n    \u003ctd\u003ebatch queries\u003c/td\u003e\n    \u003ctd\u003e\n       For batch queries, you can set an ending offset as a timestamp in milliseconds. \u003cbr\u003e\n       The target time of this option is the message publishTime. \u003cbr\u003e\n       Example: `1709258400000` (2024-03-01 02:00:00 UTC)\n    \u003c/td\u003e\n\u003c/tr\u003e\n\n\u003ctr\u003e\n  \u003ctd\u003e`failOnDataLoss`\u003c/td\u003e\n  \u003ctd\u003eThe following are valid values: true or false\u003c/td\u003e\n  \u003ctd\u003eNo\u003c/td\u003e\n  \u003ctd\u003etrue\u003c/td\u003e\n  \u003ctd\u003eStreaming query\u003c/td\u003e\n  \u003ctd\u003e\n\n  The `failOnDataLoss` option controls whether to fail a query when data is lost (for example, topics are deleted, or\n  messages are deleted because of retention policy).\u003cbr\u003e\n\n  Data-loss detection may raise false alarms. You can set this option to `false` if it does not work as expected. 
\u003cbr\u003e\n\n  A batch query always fails if it cannot read any data from the provided offsets due to data loss.\n  \u003c/td\u003e\n\u003c/tr\u003e\n\n\u003ctr\u003e\n  \u003ctd\u003e`allowDifferentTopicSchemas`\u003c/td\u003e\n  \u003ctd\u003e Boolean value \u003c/td\u003e\n  \u003ctd\u003eNo\u003c/td\u003e\n  \u003ctd\u003e`false`\u003c/td\u003e\n  \u003ctd\u003e Streaming query  \u003c/td\u003e\n  \u003ctd\u003eIf multiple topics with different schemas are read, \n  this parameter can be used to turn off automatic schema-based topic \n  value deserialization. 
\n  In that way, topics with different schemas can\n  be read in the same pipeline - which is then responsible\n  for deserializing the raw values based on some\n  schema. Since only the raw values are returned when\n  this is `true`, Pulsar topic schema(s) are not\n  taken into account during operation.\n  \u003c/td\u003e\n\u003c/tr\u003e\n\n\u003ctr\u003e\n  \u003ctd\u003e`pulsar.client.*`\u003c/td\u003e\n  \u003ctd\u003ePulsar Client configurations\u003c/td\u003e\n  \u003ctd\u003eNo\u003c/td\u003e\n  \u003ctd\u003eNone\u003c/td\u003e\n  \u003ctd\u003eStreaming and Batch\u003c/td\u003e\n  \u003ctd\u003eClient configurations. Example: \"pulsar.client.authPluginClassName\".\n\nPlease check [Pulsar Client Configuration](https://pulsar.apache.org/docs/2.11.x/client-libraries-java/#client) for more details \u003c/td\u003e\n\u003c/tr\u003e\n\n\u003ctr\u003e\n  \u003ctd\u003e`pulsar.admin.*`\u003c/td\u003e\n  \u003ctd\u003ePulsar Admin configurations\u003c/td\u003e\n  \u003ctd\u003eNo\u003c/td\u003e\n  \u003ctd\u003eNone\u003c/td\u003e\n  \u003ctd\u003eStreaming and Batch\u003c/td\u003e\n  \u003ctd\u003eAdmin configurations. Example: \"pulsar.admin.tlsAllowInsecureConnection\".\n\nPlease check [Pulsar Admin Configuration](https://pulsar.apache.org/docs/2.10.x/admin-api-overview/) for more details \u003c/td\u003e\n\u003c/tr\u003e\n\n\u003ctr\u003e\n  \u003ctd\u003e`pulsar.reader.*`\u003c/td\u003e\n  \u003ctd\u003ePulsar Reader configurations\u003c/td\u003e\n  \u003ctd\u003eNo\u003c/td\u003e\n  \u003ctd\u003eNone\u003c/td\u003e\n  \u003ctd\u003eStreaming and Batch\u003c/td\u003e\n  \u003ctd\u003eReader configurations. Example: \"pulsar.reader.subscriptionName\". 
\n\nPlease check [Pulsar Reader Configuration](https://pulsar.apache.org/docs/2.11.x/client-libraries-java/#configure-reader) for more details \u003c/td\u003e\n\u003c/tr\u003e\n\n\u003ctr\u003e\n  \u003ctd\u003e`pulsar.producer.*`\u003c/td\u003e\n  \u003ctd\u003ePulsar Producer configurations\u003c/td\u003e\n  \u003ctd\u003eNo\u003c/td\u003e\n  \u003ctd\u003eNone\u003c/td\u003e\n  \u003ctd\u003eStreaming and Batch\u003c/td\u003e\n  \u003ctd\u003eProducer configurations. Example: \"pulsar.producer.blockIfQueueFull\".\n\nPlease check [Pulsar Producer Configuration](https://pulsar.apache.org/docs/2.11.x/client-libraries-java/#configure-producer) for more details\u003c/td\u003e\n\u003c/tr\u003e\n\n\u003c/table\u003e\n\n### Authentication\nShould the Pulsar cluster require authentication, credentials can be set in the following way.\n\nThe following examples are in Scala.\n```scala\n// Secure connection with authentication, using the same credentials on the\n// Pulsar client and admin interface (if not given explicitly, the client configuration\n// is used for admin as well).\nval df = spark\n  .readStream\n  .format(\"pulsar\")\n  .option(\"service.url\", \"pulsar://localhost:6650\")\n  .option(\"pulsar.client.authPluginClassName\",\"org.apache.pulsar.client.impl.auth.AuthenticationToken\")\n  .option(\"pulsar.client.authParams\",\"token:\u003cvalid client JWT token\u003e\")\n  .option(\"topicsPattern\", \"sensitiveTopic\")\n  .load()\ndf.selectExpr(\"CAST(__key AS STRING)\", \"CAST(value AS STRING)\")\n  .as[(String, String)]\n\n// Secure connection with client TLS enabled.\n// Note that the certificate file has to be present at the specified\n// path on every machine of the cluster!\nval df = spark\n  .readStream\n  .format(\"pulsar\")\n  .option(\"service.url\", \"pulsar+ssl://localhost:6651\")\n  .option(\"pulsar.admin.authPluginClassName\",\"org.apache.pulsar.client.impl.auth.AuthenticationToken\")\n  .option(\"pulsar.admin.authParams\",\"token:\u003cvalid admin 
JWT token\u003e\")\n  .option(\"pulsar.client.authPluginClassName\",\"org.apache.pulsar.client.impl.auth.AuthenticationToken\")\n  .option(\"pulsar.client.authParams\",\"token:\u003cvalid client JWT token\u003e\")\n  .option(\"pulsar.client.tlsTrustCertsFilePath\",\"/path/to/tls/cert/cert.pem\")\n  .option(\"pulsar.client.tlsAllowInsecureConnection\",\"false\")\n  .option(\"pulsar.client.tlsHostnameVerificationenable\",\"true\")\n  .option(\"topicsPattern\", \"sensitiveTopic\")\n  .load()\ndf.selectExpr(\"CAST(__key AS STRING)\", \"CAST(value AS STRING)\")\n  .as[(String, String)]\n```\n\n## Schema of Pulsar source\n- For topics without schema or with primitive schema in Pulsar, messages' payload\nis loaded to a `value` column with the corresponding type with Pulsar schema.\n- For topics with Avro or JSON schema, their field names and field types are kept in the result rows.\n- If the `topicsPattern` matches for topics which have different schemas, then setting\n`allowDifferentTopicSchemas` to `true` will allow the connector to read this content in a\nraw form. In this case it is the responsibility of the pipeline to apply the schema\non this content, which is loaded to the `value` column. 
\n\nIn addition, each row in the source has the following metadata fields.\n\u003ctable class=\"table\"\u003e\n\u003ctr\u003e\u003cth\u003eColumn\u003c/th\u003e\u003cth\u003eType\u003c/th\u003e\u003c/tr\u003e\n\u003ctr\u003e\n  \u003ctd\u003e`__key`\u003c/td\u003e\n  \u003ctd\u003eBinary\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n  \u003ctd\u003e`__topic`\u003c/td\u003e\n  \u003ctd\u003eString\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n  \u003ctd\u003e`__messageId`\u003c/td\u003e\n  \u003ctd\u003eBinary\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n  \u003ctd\u003e`__publishTime`\u003c/td\u003e\n  \u003ctd\u003eTimestamp\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n  \u003ctd\u003e`__eventTime`\u003c/td\u003e\n  \u003ctd\u003eTimestamp\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n  \u003ctd\u003e`__messageProperties`\u003c/td\u003e\n  \u003ctd\u003eMap\u0026lt;String, String\u0026gt;\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/table\u003e\n\n### Example\n\nA Pulsar topic with the AVRO schema _s_ defined below:\n```scala\n  import org.apache.pulsar.client.api.Schema\n\n  case class Foo(i: Int, f: Float, bar: Bar)\n  case class Bar(b: Boolean, s: String)\n  // classOf[Foo] is the class of Foo itself; Foo.getClass would be the companion object's class\n  val s = Schema.AVRO(classOf[Foo])\n```\nhas the following schema as a DataFrame/Dataset in Spark:\n```\nroot\n |-- i: integer (nullable = false)\n |-- f: float (nullable = false)\n |-- bar: struct (nullable = true)\n |    |-- b: boolean (nullable = false)\n |    |-- s: string (nullable = true)\n |-- __key: binary (nullable = true)\n |-- __topic: string (nullable = true)\n |-- __messageId: binary (nullable = true)\n |-- __publishTime: timestamp (nullable = true)\n |-- __eventTime: timestamp (nullable = true)\n |-- __messageProperties: map (nullable = true)\n |    |-- key: string\n |    |-- value: string (valueContainsNull = true)\n ```\n\n For a Pulsar topic with `Schema.DOUBLE`, its schema as a DataFrame is:\n ```\n root\n |-- value: double (nullable = false)\n |-- __key: binary (nullable = true)\n |-- __topic: string (nullable = true)\n |-- __messageId: binary (nullable = true)\n |-- 
__publishTime: timestamp (nullable = true)\n |-- __eventTime: timestamp (nullable = true)\n |-- __messageProperties: map (nullable = true)\n |    |-- key: string\n |    |-- value: string (valueContainsNull = true)\n ```\n\n\n\n## Build Spark Pulsar Connector\nTo build the Spark Pulsar connector, which reads data from Pulsar and writes results to Pulsar, follow the steps below.\n\n1. Check out the source code.\n\n```bash\n$ git clone https://github.com/streamnative/pulsar-spark.git\n$ cd pulsar-spark\n```\n\n2. Install Docker.\n\n\u003e The pulsar-spark connector uses [Testcontainers](https://www.testcontainers.org/) for\n\u003e integration tests. To run the integration tests, make sure you\n\u003e have installed [Docker](https://docs.docker.com/docker-for-mac/install/).\n\n3. Set a Scala version.\n\u003e Change `scala.version` and `scala.binary.version` in `pom.xml`.\n\u003e #### Note\n\u003e The Scala version should be consistent with the Scala version of the Spark distribution you use.\n\n4. Build the project.\n\n```bash\n$ mvn clean install -DskipTests\n```\n\nIf you get the following error during compilation, try running Maven with Java 17:  \n```\n[ERROR] [Error] : Source option 6 is no longer supported. Use 7 or later.\n[ERROR] [Error] : Target option 6 is no longer supported. Use 7 or later.\n```\n\n5. 
Run the tests.\n\n```bash\n$ mvn clean install\n```\n\nNote: individual tests can be run by configuring `scalatest-maven-plugin` in the [usual ways](https://www.scalatest.org/user_guide/using_the_scalatest_maven_plugin):\n\n```bash\n$ mvn -Dsuites=org.apache.spark.sql.pulsar.CachedPulsarClientSuite clean install\n```\n\nThis can be handy if test execution is slow, or if you get a `java.io.IOException: Too many open files` exception during a full suite run.\n\nOnce the installation is finished, a fat jar is generated in both the local Maven repository and the `target` directory.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstreamnative%2Fpulsar-spark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fstreamnative%2Fpulsar-spark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstreamnative%2Fpulsar-spark/lists"}