# A step-by-step guide on how to use the Pulsar Spark Connector
The Pulsar Spark Connector was open-sourced on July 9, 2019. See the source code and user guide [here](https://github.com/streamnative/pulsar-spark).
## Environment
The following example uses the Homebrew package manager to download and install software on macOS; you can choose another package manager based on your own requirements and operating system.
1. Install Homebrew.
```bash
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
```
2. Install Java 8 or a higher version.
This example uses Homebrew to install JDK8.
```bash
brew tap adoptopenjdk/openjdk
brew cask install adoptopenjdk8
```
3. Install Apache Spark 2.4.0 or higher.
Download [Spark 2.4.3](https://www.apache.org/dyn/closer.lua/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz) from the official website and decompress it.
```bash
tar xvfz spark-2.4.3-bin-hadoop2.7.tgz
```
4. Download Apache Pulsar 2.4.0.
Download [Pulsar 2.4.0](https://pulsar.apache.org/en/download/) from the official website.
```bash
wget https://archive.apache.org/dist/pulsar/pulsar-2.4.0/apache-pulsar-2.4.0-bin.tar.gz
tar xvfz apache-pulsar-2.4.0-bin.tar.gz
```
5. Install Apache Maven.
```bash
brew install maven
```
6. Set up the development environment.
This example creates a Maven project called connector-test.
(1) Create the skeleton of a Scala project using the _archetype_ provided by the [Scala Maven Plugin](http://davidb.github.io/scala-maven-plugin/).
```bash
mvn archetype:generate
```
In the list that appears, select the latest version of net.alchim31.maven:scala-archetype-simple (1.7 at the time of writing), and specify the groupId, artifactId, and version for the new project.
This example uses:
```text
groupId: com.example
artifactId: connector-test
version: 1.0-SNAPSHOT
```
After these steps, the skeleton of the Maven Scala project is in place.
(2) Introduce the Spark and Pulsar Spark Connector dependencies in _pom.xml_ under the project root directory, and use _maven-shade-plugin_ to package the project.
a. Define the versions of the dependencies as properties.
```xml
<properties>
  <maven.compiler.source>1.8</maven.compiler.source>
  <maven.compiler.target>1.8</maven.compiler.target>
  <encoding>UTF-8</encoding>
  <scala.version>2.11.12</scala.version>
  <scala.compat.version>2.11</scala.compat.version>
  <spark.version>2.4.3</spark.version>
  <pulsar-spark-connector.version>2.4.0</pulsar-spark-connector.version>
  <spec2.version>4.2.0</spec2.version>
  <maven-shade-plugin.version>3.1.0</maven-shade-plugin.version>
</properties>
```
b. Introduce the Spark and Pulsar Spark Connector dependencies.
```xml
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_${scala.compat.version}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_${scala.compat.version}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-catalyst_${scala.compat.version}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>io.streamnative.connectors</groupId>
    <artifactId>pulsar-spark-connector_${scala.compat.version}</artifactId>
    <version>${pulsar-spark-connector.version}</version>
  </dependency>
</dependencies>
```
c. Add a Maven repository that contains _pulsar-spark-connector_.
```xml
<repositories>
  <repository>
    <id>central</id>
    <name>default</name>
    <url>https://repo1.maven.org/maven2</url>
  </repository>
  <repository>
    <id>bintray-streamnative-maven</id>
    <name>bintray</name>
    <url>https://dl.bintray.com/streamnative/maven</url>
  </repository>
</repositories>
```
d. Package the sample class together with _pulsar-spark-connector_ using _maven-shade-plugin_.
```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>${maven-shade-plugin.version}</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <createDependencyReducedPom>true</createDependencyReducedPom>
        <promoteTransitiveDependencies>true</promoteTransitiveDependencies>
        <minimizeJar>false</minimizeJar>
        <artifactSet>
          <includes><include>io.streamnative.connectors:*</include></includes>
        </artifactSet>
        <filters>
          <filter>
            <artifact>*:*</artifact>
            <excludes>
              <exclude>META-INF/*.SF</exclude>
              <exclude>META-INF/*.DSA</exclude>
              <exclude>META-INF/*.RSA</exclude>
            </excludes>
          </filter>
        </filters>
      </configuration>
    </execution>
  </executions>
</plugin>
```
## Read from and write to Pulsar in Spark programs
The sample project includes the following two programs:
1. Read data from Pulsar (the app is named _StreamRead_).
2. Write data to Pulsar (the app is named _BatchWrite_).
### Build a stream processing job to read data from Pulsar
1. In _StreamRead_, create a _SparkSession_.
```scala
val spark = SparkSession
.builder()
.appName("data-read")
.config("spark.cores.max", 2)
.getOrCreate()
```
2. To connect to Pulsar, specify _service.url_ and _admin.url_ when building the _DataFrame_, and specify the _topic_ to read from.
```scala
val ds = spark.readStream
.format("pulsar")
.option("service.url", "pulsar://localhost:6650")
.option("admin.url", "http://localhost:8088")
.option("topic", "topic-test")
.load()
ds.printSchema() // print schema information of `topic-test`, as a validation step.
```
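The stream above only picks up messages published after the query starts. If you also want the messages already in the topic, the connector supports a `startingOffsets` option (`earliest` or `latest`), similar to Spark's Kafka source. A minimal sketch, assuming the same local standalone setup:
```scala
// Sketch: also consume messages already present in `topic-test`.
// "earliest" starts from the first available message; streams default to "latest".
val dsFromStart = spark.readStream
  .format("pulsar")
  .option("service.url", "pulsar://localhost:6650")
  .option("admin.url", "http://localhost:8088")
  .option("topic", "topic-test")
  .option("startingOffsets", "earliest")
  .load()
```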
3. Output _ds_ to the console to start the job execution.
```scala
val query = ds.writeStream
.outputMode("append")
.format("console")
.start()
query.awaitTermination()
```
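Putting the three fragments together, a complete _StreamRead_ might look like the following sketch; the `com.example` package and object name are assumptions chosen to match the `--class com.example.StreamRead` used in the submit step below.
```scala
package com.example

import org.apache.spark.sql.SparkSession

object StreamRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("data-read")
      .config("spark.cores.max", 2)
      .getOrCreate()

    // Connect to the local standalone Pulsar and subscribe to `topic-test`.
    val ds = spark.readStream
      .format("pulsar")
      .option("service.url", "pulsar://localhost:6650")
      .option("admin.url", "http://localhost:8088")
      .option("topic", "topic-test")
      .load()
    ds.printSchema() // validation: print the schema of `topic-test`

    // Print each micro-batch to the console until the job is stopped.
    val query = ds.writeStream
      .outputMode("append")
      .format("console")
      .start()
    query.awaitTermination()
  }
}
```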
### Write data to Pulsar
1. Similarly, in _BatchWrite_, first create a _SparkSession_.
```scala
val spark = SparkSession
.builder()
.appName("data-sink")
.config("spark.cores.max", 2)
.getOrCreate()
```
2. Create a list of the numbers 1 to 10, convert it to a Spark Dataset, and write it to Pulsar.
```scala
import spark.implicits._
spark.createDataset(1 to 10)
.write
.format("pulsar")
.option("service.url", "pulsar://localhost:6650")
.option("admin.url", "http://localhost:8088")
.option("topic", "topic-test")
.save()
```
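As with _StreamRead_, these fragments assemble into a complete _BatchWrite_; a sketch under the same assumption about the package name:
```scala
package com.example

import org.apache.spark.sql.SparkSession

object BatchWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("data-sink")
      .config("spark.cores.max", 2)
      .getOrCreate()

    import spark.implicits._
    // Write the numbers 1 to 10 to `topic-test` as a single batch.
    spark.createDataset(1 to 10)
      .write
      .format("pulsar")
      .option("service.url", "pulsar://localhost:6650")
      .option("admin.url", "http://localhost:8088")
      .option("topic", "topic-test")
      .save()

    spark.stop()
  }
}
```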
### Running the program
First configure and start single-node Spark and Pulsar clusters, then package the sample project, submit the two jobs with _spark-submit_, and finally observe the program's output.
1. Modify the log level of Spark (optional).
```bash
cd ${spark.dir}/conf
cp log4j.properties.template log4j.properties
```
In a text editor, change the log level to _WARN_.
```text
log4j.rootCategory=WARN, console
```
2. Start the Spark cluster.
```bash
cd ${spark.dir}
sbin/start-all.sh
```
3. Modify the Pulsar WebService port to 8088 (edit `${pulsar.dir}/conf/standalone.conf`) to avoid conflicts with the Spark port.
```text
webServicePort=8088
```
4. Start the Pulsar cluster.
```bash
cd ${pulsar.dir}
bin/pulsar standalone
```
5. Package the sample project.
```bash
cd ${connector_test.dir}
mvn package
```
6. Start _StreamRead_ to monitor data changes in _topic-test_.
```bash
${spark.dir}/bin/spark-submit --class com.example.StreamRead --master spark://localhost:7077 ${connector_test.dir}/target/connector-test-1.0-SNAPSHOT.jar
```
7. In another terminal window, start _BatchWrite_ to write the numbers 1 through 10 to _topic-test_ in a single batch.
```bash
${spark.dir}/bin/spark-submit --class com.example.BatchWrite --master spark://localhost:7077 ${connector_test.dir}/target/connector-test-1.0-SNAPSHOT.jar
```
8. At this point, you should see output similar to the following in the terminal running _StreamRead_.
```text
root
|-- value: integer (nullable = false)
|-- __key: binary (nullable = true)
|-- __topic: string (nullable = true)
|-- __messageId: binary (nullable = true)
|-- __publishTime: timestamp (nullable = true)
|-- __eventTime: timestamp (nullable = true)
Batch: 0
+-----+-----+-------+-----------+-------------+-----------+
|value|__key|__topic|__messageId|__publishTime|__eventTime|
+-----+-----+-------+-----------+-------------+-----------+
+-----+-----+-------+-----------+-------------+-----------+
Batch: 1
+-----+-----+--------------------+--------------------+--------------------+-----------+
|value|__key| __topic| __messageId| __publishTime|__eventTime|
+-----+-----+--------------------+--------------------+--------------------+-----------+
| 6| null|persistent://publ...|[08 86 01 10 02 2...|2019-07-08 14:51:...| null|
| 7| null|persistent://publ...|[08 86 01 10 02 2...|2019-07-08 14:51:...| null|
| 8| null|persistent://publ...|[08 86 01 10 02 2...|2019-07-08 14:51:...| null|
| 9| null|persistent://publ...|[08 86 01 10 02 2...|2019-07-08 14:51:...| null|
| 10| null|persistent://publ...|[08 86 01 10 02 2...|2019-07-08 14:51:...| null|
| 1| null|persistent://publ...|[08 86 01 10 03 2...|2019-07-08 14:51:...| null|
| 2| null|persistent://publ...|[08 86 01 10 03 2...|2019-07-08 14:51:...| null|
| 3| null|persistent://publ...|[08 86 01 10 03 2...|2019-07-08 14:51:...| null|
| 4| null|persistent://publ...|[08 86 01 10 03 2...|2019-07-08 14:51:...| null|
| 5| null|persistent://publ...|[08 86 01 10 03 2...|2019-07-08 14:51:...| null|
+-----+-----+--------------------+--------------------+--------------------+-----------+
```
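As the schema above shows, the connector exposes the message payload as `value` alongside metadata columns (`__key`, `__topic`, `__messageId`, `__publishTime`, `__eventTime`). If only the payload matters, you can project it out in _StreamRead_ before writing to the sink; a small sketch:
```scala
// Sketch: keep only the payload column of `ds` and drop Pulsar metadata.
val payload = ds.select("value")

val query = payload.writeStream
  .outputMode("append")
  .format("console")
  .start()
query.awaitTermination()
```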
So far, we've started standalone Pulsar and Spark clusters, built the skeleton of the sample project, and used the Pulsar Spark Connector to write data to Pulsar and read it back in Spark, observing the final result in the Spark console.