# Pulsar Flink Connector

The Pulsar Flink connector implements elastic data processing with [Apache Pulsar](https://pulsar.apache.org) and [Apache Flink](https://flink.apache.org).

For the Chinese version of this documentation, see [here](docs/README_CN.md).

# Prerequisites

- Java 8 or higher
- Flink 1.13.0 or higher
- Pulsar 2.8.0 or higher

# Basic information

This section describes basic information about the Pulsar Flink connector.

## Client

We changed our project [version definition](docs/connector-version-definition.md); the Flink & Pulsar support matrix is as follows.

| Flink version | Pulsar client version (or above) | Connector branch                                                                 |
|:--------------|:---------------------------------|:---------------------------------------------------------------------------------|
| 1.11.x        | 2.6.x                            | [`release-1.11`](https://github.com/streamnative/pulsar-flink/tree/release-1.11) |
| 1.12.x        | 2.7.x                            | [`release-1.12`](https://github.com/streamnative/pulsar-flink/tree/release-1.12) |
| 1.13.x        | 2.8.x                            | [`release-1.13`](https://github.com/streamnative/pulsar-flink/tree/release-1.13) |
| 1.14.x        | 2.9.x                            | [`release-1.14`](https://github.com/streamnative/pulsar-flink/tree/release-1.14) |

> **Note**  
> Since Flink's API has changed significantly across versions, we mainly develop new features for the latest released Flink version and only fix bugs for older releases.
>
> Releases prior to 1.10.x are no longer maintained. Users on an old Flink version are recommended to upgrade to 1.11 or later.

## Version definitions

Since the JAR package is published to Maven Central, you can use this connector with Maven, Gradle, or sbt. There are two connector artifacts: `pulsar-flink-connector_2.11` for Scala 2.11 and `pulsar-flink-connector_2.12` for Scala 2.12. This naming style is the same as Flink's. The project version has four parts: the first three are the underlying Flink version, and the last is the connector's patch version.

This version scheme makes it simple for users to choose the right connector. We do not shade `pulsar-client-all` into the distribution; instead, we declare it as a plain Maven dependency. You can override the dependent `pulsar-client-all` as long as its version is at least the one listed in the support matrix.

## Maven projects

For Maven projects, add the following dependency to your POM. `scala.binary.version` follows the Flink dependency style; you can define it in your POM's properties section.
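As an aside on the version definitions above, the four-part scheme can be decoded mechanically: the first three dot-separated parts are the underlying Flink version, and the last part is the connector patch. A minimal, illustrative sketch (the version string `1.13.1.4` is a hypothetical example, not a specific release):

```java
// Illustrative helper: splits a four-part connector version such as "1.13.1.4"
// into the underlying Flink version ("1.13.1") and the connector patch ("4").
public class ConnectorVersion {
    static String flinkVersion(String v) {
        return v.substring(0, v.lastIndexOf('.')); // first three parts
    }

    static String patchVersion(String v) {
        return v.substring(v.lastIndexOf('.') + 1); // last part
    }

    public static void main(String[] args) {
        String v = "1.13.1.4"; // hypothetical connector version
        System.out.println(flinkVersion(v) + " / patch " + patchVersion(v));
    }
}
```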
`${pulsar-flink-connector.version}` can be changed to your desired version, or defined in the POM properties section as well.

```xml
<dependency>
    <groupId>io.streamnative.connectors</groupId>
    <artifactId>pulsar-flink-connector_${scala.binary.version}</artifactId>
    <version>${pulsar-flink-connector.version}</version>
</dependency>
```

For Maven projects, you can use the following [shade](https://maven.apache.org/plugins/maven-shade-plugin/) plugin definition template to build an application JAR package that contains all the dependencies required for the client library and the Pulsar Flink connector.

```xml
<plugin>
  <!-- Shade all the dependencies to avoid conflicts -->
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>${maven-shade-plugin.version}</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <createDependencyReducedPom>true</createDependencyReducedPom>
        <promoteTransitiveDependencies>true</promoteTransitiveDependencies>
        <minimizeJar>false</minimizeJar>

        <artifactSet>
          <includes>
            <include>io.streamnative.connectors:*</include>
            <include>org.apache.pulsar:*</include>
            <!-- more libs to include here -->
          </includes>
        </artifactSet>
        <filters>
          <filter>
            <artifact>*:*</artifact>
            <excludes>
              <exclude>META-INF/*.SF</exclude>
              <exclude>META-INF/*.DSA</exclude>
              <exclude>META-INF/*.RSA</exclude>
            </excludes>
          </filter>
        </filters>
        <transformers>
          <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
          <transformer implementation="org.apache.maven.plugins.shade.resource.PluginXmlResourceTransformer" />
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>
```

## Gradle projects

For Gradle projects, make sure Maven Central is added to your `build.gradle`, as shown below.

```groovy
repositories {
    mavenCentral()
}
```

For Gradle projects, you can use the following [shadow](https://imperceptiblethoughts.com/shadow/) plugin definition template to build an application JAR package that contains all the dependencies required for the client library and the Pulsar Flink connector.

```groovy
buildscript {
     dependencies {
         classpath 'com.github.jengelman.gradle.plugins:shadow:6.0.0'
     }
}

apply plugin: 'com.github.johnrengelman.shadow'
apply plugin: 'java'
```

# Build Pulsar Flink connector

To build the Pulsar Flink connector for reading data from Pulsar or writing results to Pulsar, follow these steps.

1. Check out the source code.

    ```bash
    git clone https://github.com/streamnative/pulsar-flink.git
    cd pulsar-flink
    ```

2. Install Docker.

    The Pulsar Flink connector uses [Testcontainers](https://www.testcontainers.org/) for integration tests. To run the integration tests, make sure Docker is installed. For details about how to install Docker, see [here](https://docs.docker.com/docker-for-mac/install/).

3. Set the Java version.

    Modify `java.version` and `java.binary.version` in `pom.xml`.

    > **Note**  
    > Ensure that the Java version you set matches the Java version you use to build and run the Pulsar Flink connector.

4. Build the project.

    ```bash
    mvn clean install -DskipTests
    ```

5. Run the tests.

    ```bash
    mvn clean install
    ```

After the Pulsar Flink connector is installed, a JAR package that contains all the dependencies is generated in both the local Maven repository and the `target` directory.

# Deploy Pulsar Flink connector

This section describes how to deploy the Pulsar Flink connector.

## Client library

For any Flink application, use the `./bin/flink run` command to compile and start your application.

If you have already built a JAR package with dependencies using the above shade plugin, you can use the `--classpath` option to add your JAR package.

> **Note**  
> The path must be in a protocol format (such as `file://`) and must be accessible on all nodes.

**Example**

```
./bin/flink run -c com.example.entry.point.ClassName file://path/to/jars/your_fat_jar.jar
```

## Scala REPL

The Scala REPL is a tool (scala) for evaluating expressions in Scala. Use the `bin/start-scala-shell.sh` command to deploy the Pulsar Flink connector in the Scala client.
You can use the `--addclasspath` option to add the `pulsar-flink-connector_{{SCALA_BINARY_VERSION}}-{{PULSAR_FLINK_VERSION}}.jar` package.

**Example**

```
./bin/start-scala-shell.sh remote <hostname> <portnumber>
 --addclasspath pulsar-flink-connector_{{SCALA_BINARY_VERSION}}-{{PULSAR_FLINK_VERSION}}.jar
```

For more information on submitting applications through the CLI, see [Command-Line Interface](https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/cli.html).

## SQL client

The [SQL Client](https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/table/sqlClient.html) is used to write SQL queries for manipulating data in Pulsar. You can use the `--jar` option to add the `pulsar-flink-connector_{{SCALA_BINARY_VERSION}}-{{PULSAR_FLINK_VERSION}}.jar` package.

**Example**

```
./bin/sql-client.sh embedded --jar pulsar-flink-connector_{{SCALA_BINARY_VERSION}}-{{PULSAR_FLINK_VERSION}}.jar
```

> **Note**  
> If you put the JAR package of our connector under `$FLINK_HOME/lib`, do not use `--jar` again to specify the package of the connector.

To use the Pulsar catalog in the SQL client and have it registered automatically at startup, note that by default the SQL client reads its configuration from the `./conf/sql-client-defaults.yaml` environment file. You need to add the Pulsar catalog to the `catalogs` section of this YAML file, as shown below.

```yaml
catalogs:
  - name: pulsarcatalog
    type: pulsar
    default-database: tn/ns
    service-url: "pulsar://localhost:6650"
    admin-url: "http://localhost:8080"
    format: json
```

# Usage

This section describes how to use the Pulsar Flink connector in the stream environment and the table environment.

## Stream environment

This section describes how to use the Pulsar Flink connector in the stream environment.

### Source

In Pulsar Flink, the Pulsar consumer is called `FlinkPulsarSource<T>`.
It provides access to one or more Pulsar topics.

Its constructor has the following parameters.

- `serviceUrl` (service address) and `adminUrl` (administrative address): used to connect to the Pulsar instance.
- `PulsarDeserializationSchema<T>`: when the `FlinkPulsarSource` is used, you need to set the `PulsarDeserializationSchema<T>` parameter.
- `Properties`: used to configure the behavior of the Pulsar consumer, including the `topic`, `topics`, and `topicsPattern` options. One of `topic`, `topics`, or `topicsPattern` must be set to configure the topic(s) to be consumed. (**The `topics` parameter refers to multiple topics separated by a comma (`,`), and the `topicsPattern` parameter is a Java regular expression that matches a number of topics.**)
- `setStartFromLatest`, `setStartFromEarliest`, `setStartFromSpecificOffsets`, or `setStartFromSubscription`: these methods configure the consumption mode. When the `setStartFromSubscription` consumption mode is configured, checkpointing must be enabled.

**Example**

```java
StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
Properties props = new Properties();
props.setProperty("topic", "test-source-topic");
props.setProperty("partition.discovery.interval-millis", "5000");

FlinkPulsarSource<String> source = new FlinkPulsarSource<>(serviceUrl, adminUrl, PulsarDeserializationSchema.valueOnly(new SimpleStringSchema()), props);

// or setStartFromLatest, setStartFromSpecificOffsets, setStartFromSubscription
source.setStartFromEarliest();

DataStream<String> stream = see.addSource(source);

// chain operations on dataStream of String and sink the output
// end method chaining

see.execute();
```

### Sink

The Pulsar producer uses the `FlinkPulsarSink` instance.
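As a brief aside on the source options above: the `topics` value is a comma-separated list, and `topicsPattern` is a Java regular expression matched against topic names. A stdlib-only, illustrative sketch of both (the topic names are invented for the example):

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative only: shows how a comma-separated `topics` value and a
// `topicsPattern` Java regex relate to concrete topic names.
public class TopicOptions {
    public static void main(String[] args) {
        // `topics`: multiple topics separated by commas
        List<String> topics = Arrays.asList("topic-a,topic-b,topic-c".split(","));
        System.out.println(topics);

        // `topicsPattern`: a Java regular expression matching topic names
        Pattern p = Pattern.compile("persistent://public/default/topic-.*");
        System.out.println(p.matcher("persistent://public/default/topic-a").matches());
    }
}
```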
It allows writing record streams to one or more Pulsar topics.

**Example**

```java
PulsarSerializationSchema<Person> pulsarSerialization = new PulsarSerializationSchemaWrapper.Builder<>(JsonSer.of(Person.class))
    .usePojoMode(Person.class, RecordSchemaType.JSON)
    .setTopicExtractor(person -> null)
    .build();
FlinkPulsarSink<Person> sink = new FlinkPulsarSink(
    serviceUrl,
    adminUrl,
    Optional.of(topic), // mandatory target topic, or use `Optional.empty()` to sink to a different topic for each record
    props,
    pulsarSerialization
);

stream.addSink(sink);
```

### PulsarDeserializationSchema

PulsarDeserializationSchema is a connector-defined wrapper around Flink's DeserializationSchema that allows flexible manipulation of Pulsar messages.

PulsarDeserializationSchemaWrapper is a simple implementation of PulsarDeserializationSchema with two parameters: a Flink DeserializationSchema and information about the decoded message type.

```java
new PulsarDeserializationSchemaWrapper<>(new SimpleStringSchema(), DataTypes.STRING())
```

> **Note**  
> The `DataTypes` type comes from Flink's `table-common` module.

### PulsarSerializationSchema

PulsarSerializationSchema is a wrapper for Flink's SerializationSchema that provides more functionality. In most cases, users do not need to implement PulsarSerializationSchema themselves; PulsarSerializationSchemaWrapper is provided to wrap a Flink SerializationSchema into a PulsarSerializationSchema.

PulsarSerializationSchema uses the builder pattern, and you can call `setKeyExtractor` or `setTopicExtractor` to extract the key and customize the target topic of each message.

In particular, since Pulsar maintains its own Schema information internally, our messages must be able to export SchemaInfo when they are written to Pulsar.
The `useSpecialMode`, `useAtomicMode`, `usePojoMode`, and `useRowMode` methods help you quickly build the Schema information required by Pulsar. You must choose one of these four modes.

- SpecialMode: specify the `Schema<?>` directly. Ensure that this Schema is compatible with the configured Flink SerializationSchema.
- AtomicMode: for atomic types, pass an AtomicDataType such as `DataTypes.INT()`, which corresponds to `Schema<Integer>` in Pulsar.
- PojoMode: pass a custom class object and either a JSON or Avro Schema to specify how to build a composite type Schema, such as `usePojoMode(Person.class, RecordSchemaType.JSON)`.
- RowMode: in general, this is used by our internal `Table&SQL` API implementation.

### Fault tolerance

With Flink checkpointing enabled, `FlinkPulsarSink` can provide at-least-once and exactly-once delivery guarantees.

In addition to enabling checkpointing in Flink, you should also configure the `setLogFailuresOnly(boolean)` and `setFlushOnCheckpoint(boolean)` parameters.

> **Note**  
> `setFlushOnCheckpoint(boolean)`: by default, it is set to `true`. When it is enabled, pending writes to Pulsar are flushed in the checkpoint's snapshotState, which ensures that all records before the checkpoint are written to Pulsar.
> And the at-least-once setting must also be enabled.

## Table environment

The Pulsar Flink connector supports all Table features, as listed below.

- SQL and DDL
- Catalog

### SQL and DDL

The following section describes SQL configurations and DDL configurations.

#### SQL configurations

```sql
CREATE TABLE pulsar (
  `physical_1` STRING,
  `physical_2` INT,
  `eventTime` TIMESTAMP(3) METADATA,
  `properties` MAP<STRING, STRING> METADATA,
  `topic` STRING METADATA VIRTUAL,
  `sequenceId` BIGINT METADATA VIRTUAL,
  `key` STRING,
  `physical_3` BOOLEAN
) WITH (
  'connector' = 'pulsar',
  'topic' = 'persistent://public/default/topic82547611',
  'key.format' = 'raw',
  'key.fields' = 'key',
  'value.format' = 'avro',
  'service-url' = 'pulsar://localhost:6650',
  'admin-url' = 'http://localhost:8080',
  'scan.startup.mode' = 'earliest'
)

INSERT INTO pulsar
VALUES
 ('data 1', 1, TIMESTAMP '2020-03-08 13:12:11.123', MAP['k11', 'v11', 'k12', 'v12'], 'key1', TRUE),
 ('data 2', 2, TIMESTAMP '2020-03-09 13:12:11.123', MAP['k21', 'v21', 'k22', 'v22'], 'key2', FALSE),
 ('data 3', 3, TIMESTAMP '2020-03-10 13:12:11.123', MAP['k31', 'v31', 'k32', 'v32'], 'key3', TRUE)

SELECT * FROM pulsar
```

SQL supports configuring physical columns, computed columns, watermarks, METADATA columns, and other features.

#### DDL configurations

| Parameter                     | Default value | Description                                                  | Required |
| ----------------------------- | ------------- | ------------------------------------------------------------ | -------- |
| connector                     | null          | Set the connector type. Available options are `pulsar` and `upsert-pulsar`. | Yes      |
| topic                         | null          | Set the input or output topic; separate multiple topics with commas. Mutually exclusive with `topic-pattern`. | No       |
| topic-pattern                 | null          | Use a regular expression to match topics. Mutually exclusive with `topic`. | No       |
| service-url                   | null          | Set the Pulsar broker service address.                       | Yes      |
| admin-url                     | null          | Set the Pulsar administration service address.               | Yes      |
| scan.startup.mode             | latest        | Configure the source's startup mode. Available options are `earliest`, `latest`, `external-subscription`, and `specific-offsets`. | No       |
| scan.startup.specific-offsets | null          | Required when the `specific-offsets` mode is specified.      | No       |
| scan.startup.sub-name         | null          | Required when the `external-subscription` mode is specified. | No       |
| discovery topic interval      | null          | Set the time interval for partition discovery, in milliseconds. | No       |
| sink.message-router           | key-hash      | Set the routing method for writing messages to Pulsar partitions. Available options are `key-hash`, `round-robin`, and a custom `MessageRouter`. | No       |
| sink.semantic                 | at-least-once | The delivery guarantee under which the sink writes messages. Available options are `at-least-once`, `exactly-once`, and `none`. | No       |
| properties                    | empty         | Set Pulsar's optional configurations, in the format `properties.key='value'`. For details, see [Configuration parameters](#configuration-parameters). | No       |
| key.format                    | null          | Set the serialization format for Pulsar message keys. Options are no format, or optionally `raw`, `Avro`, `JSON`, etc. | No       |
| key.fields                    | null          | The SQL schema fields used when serializing the key; separate multiple fields with a comma (`,`). | No       |
| key.fields-prefix             | null          | Define a custom prefix for all fields in the key format to avoid name conflicts with fields in the value format. By default, the prefix is empty. If a custom prefix is defined, the table schema and `key.fields` use the prefixed names, while the prefix is removed and the non-prefixed names are used when constructing the key format's data type. | No       |
| format or value.format        | null          | Set the serialization format for Pulsar message values, supporting `JSON`, `Avro`, etc. For more information, see the Flink formats. | Yes      |
| value.fields-include          | ALL           | The policy for which fields the Pulsar message value contains. Available options are `ALL` and `EXCEPT_KEY`. | No       |

#### Metadata configurations

The METADATA flag is used to read and write metadata in Pulsar messages. The support list is as follows.

> **Note**  
> The R/W column defines whether a metadata field is readable (R) and/or writable (W). Read-only columns must be declared VIRTUAL to exclude them during an INSERT INTO operation.

| Key         | Data Type                                  | Description                            | R/W  |
| ----------- | ------------------------------------------ | -------------------------------------- | ---- |
| topic       | STRING NOT NULL                            | Topic name of the Pulsar message.      | R    |
| messageId   | BYTES NOT NULL                             | Message ID of the Pulsar message.      | R    |
| sequenceId  | BIGINT NOT NULL                            | Sequence ID of the Pulsar message.     | R    |
| publishTime | TIMESTAMP(3) WITH LOCAL TIME ZONE NOT NULL | Publishing time of the Pulsar message. | R    |
| eventTime   | TIMESTAMP(3) WITH LOCAL TIME ZONE NOT NULL | Generation time of the Pulsar message. | R/W  |
| properties  | MAP<STRING, STRING> NOT NULL               | Extension information of the Pulsar message. | R/W  |

### Catalog

Flink always searches for tables, views, and UDFs in the current catalog and database. To use the Pulsar catalog and treat the topics in Pulsar as tables in Flink, use the `pulsarcatalog` that has been defined in `./conf/sql-client-defaults.yaml`.

```java
tableEnv.useCatalog("pulsarcatalog")
tableEnv.useDatabase("public/default")
tableEnv.scan("topic0")
```

```SQL
Flink SQL> USE CATALOG pulsarcatalog;
Flink SQL> USE `public/default`;
Flink SQL> select * from topic0;
```

The following configuration is optional in the environment file, or it can be overridden in an SQL client session using the `SET` command.

| Option | Default | Description |
| --- | --- | --- |
| `default-database` | public/default | When using the Pulsar catalog, a topic in Pulsar is treated as a table in Flink. Therefore, `database` is another name for `tenant/namespace`. The database is the base path for table lookup or creation. |
| `table-default-partitions` | 5 | When using the Pulsar catalog, a topic in Pulsar is treated as a table in Flink. This option sets the number of partitions used when creating a topic. |

For more details, see [DDL configurations](#ddl-configurations).

> **Note**  
> In the catalog, you cannot delete a `tenant/namespace` or `topic`.

# Advanced features

This section describes advanced features supported by the Pulsar Flink connector.

## Pulsar primitive types

Pulsar provides some basic native types. You can use these native types in the following ways.

### Stream API environment

PulsarPrimitiveSchema is an implementation of the `PulsarDeserializationSchema` and `PulsarSerializationSchema` interfaces.

You can create the required instance in a similar way, for example `new PulsarPrimitiveSchema(String.class)`.

### Table environment

We have created a new Flink format component called `atomic` that you can use in SQL formats. In the source, it translates the Pulsar native type into a single column of RowData. In the sink, it translates the first column of RowData into the Pulsar native type and writes it to Pulsar.

## Upsert Pulsar

There is increasing demand for upsert-mode message queues, for three main reasons.

- Interpreting a Pulsar topic as a changelog stream, which interprets records with keys as upsert events.
- As part of a real-time pipeline, multiple streams are joined for enrichment and the results are stored in a Pulsar topic for further computation. However, the results may contain update events.
- As part of a real-time pipeline, a data stream is aggregated and the results are stored in a Pulsar topic for further computation. However, the results may contain update events.

Based on these requirements, we support Upsert Pulsar.
With this feature, users can read data from and write data to Pulsar topics in an upsert fashion.

In the SQL DDL definition, you can set the connector to `upsert-pulsar` to use the Upsert Pulsar connector.

**In terms of configuration, the primary key of the table must be specified, and `key.fields` and `key.fields-prefix` cannot be used.**

As a source, the Upsert Pulsar connector produces a changelog stream, where each data record represents an update or delete event. More precisely, the value in a data record is interpreted as an UPDATE of the last value for the same key, if that key exists (if the corresponding key does not exist, the UPDATE is treated as an INSERT). Using the table analogy, a data record in a changelog stream is interpreted as an UPSERT, also known as INSERT/UPDATE, because any existing row with the same key is overwritten. Also, a message with a null value is treated as a DELETE message.

As a sink, the Upsert Pulsar connector can consume a changelog stream. It writes INSERT/UPDATE_AFTER data as normal Pulsar messages, and writes DELETE data as Pulsar messages with a null value (indicating that the key of the message is deleted). Flink partitions the data based on the value of the primary key, so that messages with the same primary key are ordered; consequently, UPDATE/DELETE messages with the same primary key fall into the same partition.

## Key-Shared subscription mode

In some scenarios, users need strictly ordered messages to ensure correct business processing. Usually, with strictly order-preserving messages, only one consumer can consume messages at a time to guarantee the order. This results in a significant reduction in message throughput.
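The upsert interpretation described above (a record's value upserts the last value for its key; a null value deletes the key) can be modeled with plain Java collections. This is an illustrative sketch of the semantics, not the connector's implementation:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative model of upsert semantics: each record upserts by key,
// and a null value is interpreted as a DELETE of that key.
public class UpsertModel {
    static Map<String, String> materialize(List<Map.Entry<String, String>> changelog) {
        Map<String, String> table = new HashMap<>();
        for (Map.Entry<String, String> rec : changelog) {
            if (rec.getValue() == null) {
                table.remove(rec.getKey());              // DELETE
            } else {
                table.put(rec.getKey(), rec.getValue()); // INSERT or UPDATE (UPSERT)
            }
        }
        return table;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> changelog = Arrays.<Map.Entry<String, String>>asList(
                new SimpleEntry<>("k1", "v1"),
                new SimpleEntry<>("k2", "v2"),
                new SimpleEntry<>("k1", "v1b"), // UPDATE of k1
                new SimpleEntry<>("k2", null)); // DELETE of k2
        System.out.println(materialize(changelog)); // {k1=v1b}
    }
}
```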
Pulsar designed the Key-Shared subscription mode for such scenarios: keys are attached to messages, and messages with the same key hash are routed to the same consumer, which preserves message order while improving throughput.

The Pulsar Flink connector supports this feature as well. It can be enabled by configuring the `enable-key-hash-range=true` parameter. When enabled, the range of key hashes processed by each consumer is divided based on the parallelism of the task.

## Fault tolerance

Pulsar Flink connector 2.7.0 provides different semantics for source and sink.

### Source

For the Pulsar source, Pulsar Flink connector 2.7.0 provides exactly-once semantics.

### Sink

Pulsar Flink connector 2.4.12 only supports at-least-once semantics for the sink. Based on transaction support in Pulsar 2.7.0 and the Flink [`TwoPhaseCommitSinkFunction` API](https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/streaming/api/functions/sink/TwoPhaseCommitSinkFunction.html), Pulsar Flink connector 2.7.0 supports both exactly-once and at-least-once semantics for the sink. For more information, see [here](https://flink.apache.org/2021/01/07/pulsar-flink-connector-270.html).

Before setting exactly-once semantics for a sink, you need to make the following configuration changes.

1. In Pulsar, transaction-related functions are **disabled by default**. You need to set `transactionCoordinatorEnabled = true` in the configuration file (`conf/standalone.conf` or `conf/broker.conf`).

2. When creating a sink, set `PulsarSinkSemantic.EXACTLY_ONCE`.
The default value of `PulsarSinkSemantic` is `AT_LEAST_ONCE`.

    Example

    ```java
    SinkFunction<Integer> sink = new FlinkPulsarSink<>(
          adminUrl,
          Optional.of(topic),
          clientConfigurationData,
          new Properties(),
          new PulsarSerializationSchemaWrapper.Builder<>
                  ((SerializationSchema<Integer>) element -> Schema.INT32.encode(element))
                  .useAtomicMode(DataTypes.INT())
                  .build(),
          PulsarSinkSemantic.EXACTLY_ONCE
    );
    ```

    Additionally, you can set transaction-related configurations as below.

    | Parameter | Description | Default value |
    | --- | --- | --- |
    | `PulsarOptions.TRANSACTION_TIMEOUT` | Timeout for transactions in Pulsar. If the time is exceeded, the transaction operation fails. | 360000 ms |
    | `PulsarOptions.MAX_BLOCK_TIME_MS` | Maximum time to wait for a transaction to commit or abort. If the time is exceeded, the operator throws an exception. | 100000 ms |

    Alternatively, you can override these configurations in the `Properties` object and pass it into the `Sink` constructor.

## Configuration parameters

These parameters correspond to the `Properties` object in the `FlinkPulsarSource` and `FlinkPulsarSink` constructors in the Stream API, and to the `properties` configuration parameter in Table mode.

| Parameter | Default value | Description | Effective range |
| --------- | -------- | ---------------------- | ------------ |
| topic | null | A Pulsar topic. | source |
| topics | null | Multiple Pulsar topics, separated by commas. | source |
| topicspattern | null | Multiple Pulsar topics matched by a Java regular expression. | source |
| partition.discovery.interval-millis | -1 | Automatically discover added or removed topics, in milliseconds. If the value is -1, discovery is disabled. | source |
| clientcachesize | 100 | Set the number of cached Pulsar clients. | source, sink |
| auth-params | null | Set the authentication parameters for Pulsar clients. | source, sink |
| auth-plugin-classname | null | Set the authentication class name for Pulsar clients. | source, sink |
| flushoncheckpoint | true | Whether to flush messages to Pulsar topics on checkpoint. | sink |
| failonwrite | false | When a sink error occurs, whether to fail instead of continuing to acknowledge messages. | sink |
| polltimeoutms | 120000 | Set the timeout for waiting for the next message, in milliseconds. | source |
| pulsar.reader.fail-on-data-loss | true | When data is lost, the operation fails. | source |
| pulsar.reader.use-earliest-when-data-loss | false | When data is lost, reset the offset to earliest. | source |
| commitmaxretries | 3 | Set the maximum number of retries when committing an offset for Pulsar messages. | source |
| send-delay-millisecond | 0 | Message delivery delay in milliseconds. Only effective in the **Table API**; in the **Stream API**, use `PulsarSerializationSchema.setDeliverAtExtractor`. | sink |
| scan.startup.mode | null | Set where consumption starts: `earliest`, `latest`, or a specific position. It is a required parameter. | source |
| enable-key-hash-range | false | Enable the Key-Shared subscription mode. | source |
| pulsar.reader.* | | For details about Pulsar reader configurations, see [Pulsar reader](https://pulsar.apache.org/docs/en/client-libraries-java/#reader). | source |
| pulsar.reader.subscriptionRolePrefix | flink-pulsar- | When no subscriber is specified, the prefix used for the automatically created subscriber name. | source |
| pulsar.reader.receiverQueueSize | 1000 | Set the receive queue size. | source |
| pulsar.producer.* | | For details about Pulsar producer configurations, see [Pulsar producer](https://pulsar.apache.org/docs/en/client-libraries-java/#producer). | sink |
| pulsar.producer.sendTimeoutMs | 30000 | Set the timeout for sending a message, in milliseconds. | sink |
| pulsar.producer.blockIfQueueFull | false | When the producer's queue is full, block the send method instead of throwing an exception. | sink |

`pulsar.reader.*` and `pulsar.producer.*` specify more detailed configuration of Pulsar behavior. The asterisk (*) is replaced by a configuration name in Pulsar. For details, see [Pulsar reader](https://pulsar.apache.org/docs/en/client-libraries-java/#reader) and [Pulsar producer](https://pulsar.apache.org/docs/en/client-libraries-java/#producer).

In a DDL statement, a sample similar to the following is used.

```
'properties.pulsar.reader.subscriptionRolePrefix' = 'pulsar-flink-',
'properties.pulsar.producer.sendTimeoutMs' = '30000',
```

## Authentication configuration

For Pulsar instances configured with authentication, the Pulsar Flink connector can be configured in a similar way to the regular Pulsar client.

1. For `FlinkPulsarSource` and `FlinkPulsarSink` in the Java API, you can use one of the following ways to set up authentication.

   - Set the `Properties` parameter.

     ```java
     props.setProperty(PulsarOptions.AUTH_PLUGIN_CLASSNAME_KEY, "org.apache.pulsar.client.impl.auth.AuthenticationToken");
     props.setProperty(PulsarOptions.AUTH_PARAMS_KEY, "token:abcdefghijklmn");
     ```

   - Set the `ClientConfigurationData` parameter, which has a higher priority than the `Properties` parameter.

     ```java
     ClientConfigurationData conf = new ClientConfigurationData();
     conf.setServiceUrl(serviceUrl);
     conf.setAuthPluginClassName(className);
     conf.setAuthParams(params);
     ```

2. 
For the Table and SQL, you can use the following way to set up authentication.\n\n   ```sql\n   CREATE TABLE pulsar (\n                          `physical_1` STRING,\n                          `physical_2` INT,\n                          `eventTime` TIMESTAMP(3) METADATA,\n                          `properties` MAP\u003cSTRING, STRING\u003e METADATA ,\n                          `topic` STRING METADATA VIRTUAL,\n                          `sequenceId` BIGINT METADATA VIRTUAL,\n                          `key` STRING ,\n                          `physical_3` BOOLEAN\n   ) WITH (\n       'connector' = 'pulsar',\n       'topic' = 'persistent://public/default/topic82547611',\n       'key.format' = 'raw',\n       'key.fields' = 'key',\n       'value.format' = 'avro',\n       'service-url' = 'pulsar://localhost:6650',\n       'admin-url' = 'http://localhost:8080',\n       'scan.startup.mode' = 'earliest',\n       'properties.auth-plugin-classname' = 'org.apache.pulsar.client.impl.auth.AuthenticationToken',\n       'properties.auth-params' = 'token:xxxxxxxxxx',\n   )\n   ```\n\nFor details about authentication configuration, see [Pulsar Security](https://pulsar.apache.org/docs/en/security-overview/).\n\n## ProtoBuf\n\n\u003e **Note**\n\u003e\n\u003e Currently, ProtoBuf is an experimental feature.\n\nThis feature is based on this [PR](https://github.com/apache/flink/pull/14376) and is not merged yet. 
Therefore, it is temporarily placed in this repository as a source code for packaging and dependencies.\n\n**Example**\n\n```sql\ncreate table pulsar (\n                        a INT,\n                        b BIGINT,\n                        c BOOLEAN,\n                        d FLOAT,\n                        e DOUBLE,\n                        f VARCHAR(32),\n                        g BYTES,\n                        h VARCHAR(32),\n                        f_abc_7d INT,\n                        `eventTime` TIMESTAMP(3) METADATA,\n                        compute as a + 1,\n                        watermark for eventTime as eventTime\n                        ) with (\n                        'connector' = 'pulsar',\n                        'topic' = 'test-protobuf',\n                        'service-url' = 'pulsar://localhost:6650',\n                        'admin-url' = 'http://localhost:8080',\n                        'scan.startup.mode' = 'earliest',\n                        'format' = 'protobuf',\n                        'protobuf.message-class-name' = 'org.apache.flink.formats.protobuf.testproto.SimpleTest'\n                        )\n\nINSERT INTO pulsar VALUES (1,2,false,0.1,0.01,'haha', ENCODE('1', 'utf-8'), 'IMAGES',1, TIMESTAMP '2020-03-08 13:12:11.123');\n```\n\nThe `SimpleTest` class must be `GeneratedMessageV3`.\n\n","funding_links":[],"categories":["Java","Data Processing"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstreamnative%2Fpulsar-flink","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fstreamnative%2Fpulsar-flink","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstreamnative%2Fpulsar-flink/lists"}
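As a rough illustration of the key-hash range division mentioned in the Key-Shared section above, the sketch below splits a 16-bit hash space (0 to 65535, Pulsar's default Key_Shared hash range) evenly across a given parallelism. This is a simplified, hypothetical sketch of the idea, not the connector's actual assignment code.

```java
import java.util.ArrayList;
import java.util.List;

public class KeyHashRanges {

    // Pulsar's Key_Shared hash space is assumed to be [0, 65536).
    static final int HASH_RANGE_SIZE = 65536;

    // Split the hash space into `parallelism` contiguous ranges,
    // one per consumer (illustrative only; the last range absorbs
    // any remainder so the whole space is covered).
    static List<int[]> split(int parallelism) {
        List<int[]> ranges = new ArrayList<>();
        int size = HASH_RANGE_SIZE / parallelism;
        for (int i = 0; i < parallelism; i++) {
            int start = i * size;
            int end = (i == parallelism - 1) ? HASH_RANGE_SIZE - 1 : start + size - 1;
            ranges.add(new int[]{start, end});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // With parallelism 4, each consumer owns a quarter of the hash space.
        for (int[] r : split(4)) {
            System.out.println(r[0] + "-" + r[1]);
        }
    }
}
```

A message's key is hashed into this space, and the consumer whose range contains the hash processes it, which is why messages with the same key keep their relative order.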