{"id":15352498,"url":"https://github.com/yjshen/spark-connector-test","last_synced_at":"2025-04-15T05:52:30.560Z","repository":{"id":92469933,"uuid":"195768304","full_name":"yjshen/spark-connector-test","owner":"yjshen","description":"A tutorial on how to use pulsar-spark-connector","archived":false,"fork":false,"pushed_at":"2020-10-13T14:26:04.000Z","size":13,"stargazers_count":11,"open_issues_count":2,"forks_count":3,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-15T05:52:25.569Z","etag":null,"topics":["apache-pulsar","apache-spark","pulsar-spark-connector","sparksql","structured-streaming"],"latest_commit_sha":null,"homepage":"https://github.com/streamnative/pulsar-spark","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/yjshen.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-07-08T08:22:54.000Z","updated_at":"2024-12-24T06:47:59.000Z","dependencies_parsed_at":null,"dependency_job_id":"4d3baf8e-2e6d-4aa9-a384-2e49bb14b277","html_url":"https://github.com/yjshen/spark-connector-test","commit_stats":{"total_commits":5,"total_committers":1,"mean_commits":5.0,"dds":0.0,"last_synced_commit":"392ac2f6d342c7fb7f6f248a51d79c7f9c770a20"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yjshen%2Fspark-connector-test","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yjshen%2Fspark-connector-test/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositorie
s/yjshen%2Fspark-connector-test/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yjshen%2Fspark-connector-test/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/yjshen","download_url":"https://codeload.github.com/yjshen/spark-connector-test/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249016317,"owners_count":21198832,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-pulsar","apache-spark","pulsar-spark-connector","sparksql","structured-streaming"],"created_at":"2024-10-01T12:09:41.236Z","updated_at":"2025-04-15T05:52:30.541Z","avatar_url":"https://github.com/yjshen.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"# A step-by-step guide on how to use the Pulsar Spark Connector\n\nThe Pulsar Spark Connector was open-sourced on July 9, 2019. See the source code and user guide [here](https://github.com/streamnative/pulsar-spark).\n\n## Environment\n\nThe following example uses the Homebrew package manager to download and install software on macOS; you can choose another package manager based on *your own requirements* and operating system.\n1. Install Homebrew.\n```bash\n/usr/bin/ruby -e \"$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)\"\n```\n\n2. Install Java 8 or a higher version.\n\n    This example uses Homebrew to install JDK8.\n```bash\nbrew tap adoptopenjdk/openjdk\nbrew cask install adoptopenjdk8\n```\n\n3.  
Install Apache Spark 2.4.0 or higher.\n\n    From the official website [download](https://www.apache.org/dyn/closer.lua/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz) Spark 2.4.3 and decompress.\n```bash\ntar xvfz spark-2.4.3-bin-hadoop2.7.tgz\n```\n\n4. Download Apache Pulsar 2.4.0.\n\n    From the official website [download](https://pulsar.apache.org/en/download/) Pulsar 2.4.0.\n```bash\nwget https://archive.apache.org/dist/pulsar/pulsar-2.4.0/apache-pulsar-2.4.0-bin.tar.gz\ntar xvfz apache-pulsar-2.4.0-bin.tar.gz\n```\n\n5. Install Apache Maven.\n```bash\nbrew install maven\n```\n\n6. Set up the development environment.\n\n    This example creates a Maven project called connector-test.\n    \n  (1) Build a framework for a Scala project using _archetype_ provided by [Scala Maven Plugin](http://davidb.github.io/scala-maven-plugin/).\n```bash\nmvn archetype:generate\n```\nIn the list that appears, select the latest version of net.alchim31.maven:scala-archetype-simple, which is currently 1.7, and specify groupId, artifactId, and version for the new project.\n  This example uses:\n  ```text\n  groupId: com.example\n  artifactId: connector-test\n  version: 1.0-SNAPSHOT\n```\nAfter the above steps, a Maven Scala project framework is basically set up.\n\n (2) Introduce Spark, Pulsar Spark Connector dependencies in _pom.xml_ under the project root directory, and use _maven_shade_plugin_ for project packaging.\n \n    a. 
Define the version information of the dependent package.\n```xml\n  \u003cproperties\u003e\n        \u003cmaven.compiler.source\u003e1.8\u003c/maven.compiler.source\u003e\n        \u003cmaven.compiler.target\u003e1.8\u003c/maven.compiler.target\u003e\n        \u003cencoding\u003eUTF-8\u003c/encoding\u003e\n        \u003cscala.version\u003e2.11.12\u003c/scala.version\u003e\n        \u003cscala.compat.version\u003e2.11\u003c/scala.compat.version\u003e\n        \u003cspark.version\u003e2.4.3\u003c/spark.version\u003e\n        \u003cpulsar-spark-connector.version\u003e2.4.0\u003c/pulsar-spark-connector.version\u003e\n        \u003cspec2.version\u003e4.2.0\u003c/spec2.version\u003e\n        \u003cmaven-shade-plugin.version\u003e3.1.0\u003c/maven-shade-plugin.version\u003e\n  \u003c/properties\u003e\n```\n    b. Introduce Spark, Pulsar Spark Connector dependencies.\n```xml\n    \u003cdependency\u003e\n        \u003cgroupId\u003eorg.apache.spark\u003c/groupId\u003e\n        \u003cartifactId\u003espark-core_${scala.compat.version}\u003c/artifactId\u003e\n        \u003cversion\u003e${spark.version}\u003c/version\u003e\n        \u003cscope\u003eprovided\u003c/scope\u003e\n    \u003c/dependency\u003e\n    \u003cdependency\u003e\n        \u003cgroupId\u003eorg.apache.spark\u003c/groupId\u003e\n        \u003cartifactId\u003espark-sql_${scala.compat.version}\u003c/artifactId\u003e\n        \u003cversion\u003e${spark.version}\u003c/version\u003e\n        \u003cscope\u003eprovided\u003c/scope\u003e\n    \u003c/dependency\u003e\n    \u003cdependency\u003e\n        \u003cgroupId\u003eorg.apache.spark\u003c/groupId\u003e\n        \u003cartifactId\u003espark-catalyst_${scala.compat.version}\u003c/artifactId\u003e\n        \u003cversion\u003e${spark.version}\u003c/version\u003e\n        \u003cscope\u003eprovided\u003c/scope\u003e\n    \u003c/dependency\u003e\n    \u003cdependency\u003e\n        \u003cgroupId\u003eio.streamnative.connectors\u003c/groupId\u003e\n        
\u003cartifactId\u003epulsar-spark-connector_${scala.compat.version}\u003c/artifactId\u003e\n        \u003cversion\u003e${pulsar-spark-connector.version}\u003c/version\u003e\n    \u003c/dependency\u003e\n```\n\n    c. Add a Maven repository that contains _pulsar-spark-connector_.\n```xml\n    \u003crepositories\u003e\n      \u003crepository\u003e\n        \u003cid\u003ecentral\u003c/id\u003e\n        \u003clayout\u003edefault\u003c/layout\u003e\n        \u003curl\u003ehttps://repo1.maven.org/maven2\u003c/url\u003e\n      \u003c/repository\u003e\n      \u003crepository\u003e\n        \u003cid\u003ebintray-streamnative-maven\u003c/id\u003e\n        \u003cname\u003ebintray\u003c/name\u003e\n        \u003curl\u003ehttps://dl.bintray.com/streamnative/maven\u003c/url\u003e\n      \u003c/repository\u003e\n    \u003c/repositories\u003e\n```\n      d. Package the sample class with _pulsar-spark-connector_ using _maven_shade_plugin_.\n```xml\n    \u003cplugin\u003e\n          \u003c!-- Shade all the dependencies to avoid conflicts --\u003e\n          \u003cgroupId\u003eorg.apache.maven.plugins\u003c/groupId\u003e\n          \u003cartifactId\u003emaven-shade-plugin\u003c/artifactId\u003e\n          \u003cversion\u003e${maven-shade-plugin.version}\u003c/version\u003e\n          \u003cexecutions\u003e\n            \u003cexecution\u003e\n              \u003cphase\u003epackage\u003c/phase\u003e\n              \u003cgoals\u003e\n                \u003cgoal\u003eshade\u003c/goal\u003e\n              \u003c/goals\u003e\n              \u003cconfiguration\u003e\n                \u003ccreateDependencyReducedPom\u003etrue\u003c/createDependencyReducedPom\u003e\n                \u003cpromoteTransitiveDependencies\u003etrue\u003c/promoteTransitiveDependencies\u003e\n                \u003cminimizeJar\u003efalse\u003c/minimizeJar\u003e\n\n                \u003cartifactSet\u003e\n                  \u003cincludes\u003e\n                    
\u003cinclude\u003eio.streamnative.connectors:*\u003c/include\u003e\n                  \u003c/includes\u003e\n                \u003c/artifactSet\u003e\n                \u003cfilters\u003e\n                  \u003cfilter\u003e\n                    \u003cartifact\u003e*:*\u003c/artifact\u003e\n                    \u003cexcludes\u003e\n                      \u003cexclude\u003eMETA-INF/*.SF\u003c/exclude\u003e\n                      \u003cexclude\u003eMETA-INF/*.DSA\u003c/exclude\u003e\n                      \u003cexclude\u003eMETA-INF/*.RSA\u003c/exclude\u003e\n                    \u003c/excludes\u003e\n                  \u003c/filter\u003e\n                \u003c/filters\u003e\n                \u003ctransformers\u003e\n                  \u003ctransformer implementation=\"org.apache.maven.plugins.shade.resource.ServicesResourceTransformer\" /\u003e\n                  \u003ctransformer implementation=\"org.apache.maven.plugins.shade.resource.PluginXmlResourceTransformer\" /\u003e\n                \u003c/transformers\u003e\n              \u003c/configuration\u003e\n            \u003c/execution\u003e\n          \u003c/executions\u003e\n        \u003c/plugin\u003e\n```\n\n## Read from and write to Pulsar in Spark programs\n\nThe project in the example includes the following programs:\n1. Read the data from Pulsar (name the app _StreamRead_).\n2. Write the data to Pulsar (name the app _BatchWrite_).\n\n### Build a stream processing job to read data from Pulsar\n1. In _StreamRead_, create _SparkSession_.\n```scala\nval spark = SparkSession\n    .builder()\n    .appName(\"data-read\")\n    .config(\"spark.cores.max\", 2)\n    .getOrCreate()\n```\n2. 
In order to connect to Pulsar, you need to specify _service.url_ and _admin.url_ when building _DataFrame_ and specify the _topic_ to be read.\n```scala\nval ds = spark.readStream\n    .format(\"pulsar\")\n    .option(\"service.url\", \"pulsar://localhost:6650\")\n    .option(\"admin.url\", \"http://localhost:8088\")\n    .option(\"topic\", \"topic-test\")\n    .load()\nds.printSchema()  // print schema information of `topic-test`, as a validation step.\n```\n\n3. Output _ds_ to the console to start the job execution.\n```scala\nval query = ds.writeStream\n    .outputMode(\"append\")\n    .format(\"console\")\n    .start()\nquery.awaitTermination()\n```\n\n### Write data to Pulsar\n1. Similarly, in _BatchWrite_, first create _SparkSession_.\n```scala\nval spark = SparkSession\n    .builder()\n    .appName(\"data-sink\")\n    .config(\"spark.cores.max\", 2)\n    .getOrCreate()\n```\n2. Create a list of 1-10 and convert it to a Spark Dataset and write to Pulsar.\n```scala\nimport spark.implicits._\nspark.createDataset(1 to 10)\n    .write\n    .format(\"pulsar\")\n    .option(\"service.url\", \"pulsar://localhost:6650\")\n    .option(\"admin.url\", \"http://localhost:8088\")\n    .option(\"topic\", \"topic-test\")\n    .save()\n```\n\n### Running the program\nFirst configure and start the single-node cluster of Spark and Pulsar, then package the sample project, and submit two jobs through _spark-submit_ respectively, and finally observe the execution result of the program.\n1. Modify the log level of Spark (optional).\n```bash\n  cd ${spark.dir}/conf\n  cp log4j.properties.template log4j.properties\n```\n  In the text editor, change the log level to _WARN_ .\n```text\n  log4j.rootCategory=WARN, console\n```\n2. Start the Spark cluster.\n```bash\ncd ${spark.dir}\nsbin/start-all.sh\n```\n3. Modify the Pulsar WebService port to 8088 (edit `${pulsar.dir}/conf/standalone.conf`) to avoid conflicts with the Spark port.\n```text\nwebServicePort=8088\n```\n4. 
Start the Pulsar cluster.\n```bash\nbin/pulsar standalone\n```\n\n5. Package the sample project.\n```bash\ncd ${connector_test.dir}\nmvn package\n```\n\n6. Start _StreamRead_ to monitor data changes in _topic-test_.\n```bash\n${spark.dir}/bin/spark-submit --class com.example.StreamRead --master spark://localhost:7077 ${connector_test.dir}/target/connector-test-1.0-SNAPSHOT.jar\n```\n\n7. In another terminal window, start _BatchWrite_ to write the numbers 1 to 10 to _topic-test_ in a single batch.\n```bash\n${spark.dir}/bin/spark-submit --class com.example.BatchWrite --master spark://localhost:7077 ${connector_test.dir}/target/connector-test-1.0-SNAPSHOT.jar\n```\n\n8. At this point, you should see output similar to the following in the terminal where _StreamRead_ is running.\n\n  ```text\n  root\n   |-- value: integer (nullable = false)\n   |-- __key: binary (nullable = true)\n   |-- __topic: string (nullable = true)\n   |-- __messageId: binary (nullable = true)\n   |-- __publishTime: timestamp (nullable = true)\n   |-- __eventTime: timestamp (nullable = true)\n\n  Batch: 0\n  +-----+-----+-------+-----------+-------------+-----------+\n  |value|__key|__topic|__messageId|__publishTime|__eventTime|\n  +-----+-----+-------+-----------+-------------+-----------+\n  +-----+-----+-------+-----------+-------------+-----------+\n\n  Batch: 1\n  +-----+-----+--------------------+--------------------+--------------------+-----------+\n  |value|__key|             __topic|         __messageId|       __publishTime|__eventTime|\n  +-----+-----+--------------------+--------------------+--------------------+-----------+\n  |    6| null|persistent://publ...|[08 86 01 10 02 2...|2019-07-08 14:51:...|       null|\n  |    7| null|persistent://publ...|[08 86 01 10 02 2...|2019-07-08 14:51:...|       null|\n  |    8| null|persistent://publ...|[08 86 01 10 02 2...|2019-07-08 14:51:...|       null|\n  |    9| null|persistent://publ...|[08 86 01 10 02 2...|2019-07-08 14:51:...|       null|\n  |   10| 
null|persistent://publ...|[08 86 01 10 02 2...|2019-07-08 14:51:...|       null|\n  |    1| null|persistent://publ...|[08 86 01 10 03 2...|2019-07-08 14:51:...|       null|\n  |    2| null|persistent://publ...|[08 86 01 10 03 2...|2019-07-08 14:51:...|       null|\n  |    3| null|persistent://publ...|[08 86 01 10 03 2...|2019-07-08 14:51:...|       null|\n  |    4| null|persistent://publ...|[08 86 01 10 03 2...|2019-07-08 14:51:...|       null|\n  |    5| null|persistent://publ...|[08 86 01 10 03 2...|2019-07-08 14:51:...|       null|\n  +-----+-----+--------------------+--------------------+--------------------+-----------+\n  ```\n\nSo far, we have started single-node Pulsar and Spark clusters, built the sample project, and used the Pulsar Spark Connector to write data to Pulsar and read it back, with the final result shown in the Spark console output.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyjshen%2Fspark-connector-test","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fyjshen%2Fspark-connector-test","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyjshen%2Fspark-connector-test/lists"}