https://github.com/streamthoughts/kafka-connect-transform-grok
Grok Expression Transform for Kafka Connect.
https://github.com/streamthoughts/kafka-connect-transform-grok
apache-kafka grok-parser grok-patterns kafka-connect
Last synced: 4 months ago
JSON representation
Grok Expression Transform for Kafka Connect.
- Host: GitHub
- URL: https://github.com/streamthoughts/kafka-connect-transform-grok
- Owner: streamthoughts
- License: apache-2.0
- Created: 2020-12-22T15:02:50.000Z (over 5 years ago)
- Default Branch: main
- Last Pushed: 2021-02-08T15:29:56.000Z (about 5 years ago)
- Last Synced: 2025-07-19T18:06:30.310Z (9 months ago)
- Topics: apache-kafka, grok-parser, grok-patterns, kafka-connect
- Language: Java
- Homepage:
- Size: 74.2 KB
- Stars: 16
- Watchers: 1
- Forks: 5
- Open Issues: 2
-
Metadata Files:
- Readme: README.adoc
- License: LICENSE
Awesome Lists containing this project
README
= Kafka Connect Grok Transformation
image:https://img.shields.io/badge/License-Apache%202.0-blue.svg[https://github.com/streamthoughts/kafka-connect-transform-grok/blob/master/LICENSE]
image:https://img.shields.io/github/issues-raw/streamthoughts/kafka-connect-transform-grok[GitHub issues]
image:https://img.shields.io/github/stars/streamthoughts/kafka-connect-transform-grok?style=social[GitHub Repo stars]
== Description
The Apache Kafka® SMT `io.streamthoughts.kafka.connect.transform.Grok` allows parsing unstructured data text into a data `Struct` from the entire key or value using
a Grok expression.
== Installation
Grok SMT can be installed using the https://docs.confluent.io/current/confluent-hub/client.html[Confluent Hub Client] with:
1. Download the distribution ZIP file for the latest available version.
+
[source, bash]
----
export VERSION=1.1.0
export GITHUB_HUB_REPO=https://github.com/streamthoughts/kafka-connect-transform-grok
$ curl -sSL $GITHUB_HUB_REPO/releases/download/v$VERSION/streamthoughts-kafka-connect-transform-grok-$VERSION.zip
----
+
2. Use the `confluent-hub` CLI for installing it.
+
[source, bash]
----
$ confluent-hub install streamthoughts-kafka-connect-transform-grok-$VERSION.zip
----
Alternatively, you can extract the ZIP file into one of the directories that is listed on the `plugin.path` worker configuration property.
== Grok Basics
The syntax for a grok pattern is `%{SYNTAX:SEMANTIC}` or `%{SYNTAX:SEMANTIC:TYPE}`
The `SYNTAX` is the name of the pattern that should match the input text value.
The `SEMANTIC` is the field name for the data field that will contain the piece of text being matched.
The `TYPE` is the target type to which the data field must be converted.
Supported types are: ::
* `SHORT`
* `INTEGER`
* `LONG`
* `FLOAT`
* `DOUBLE`
* `BOOLEAN`
* `STRING`
The Kafka Connect Grok Transformer ships with a lot of reusable grok patterns. See the complete list of https://github.com/streamthoughts/kafka-connect-transform-grok/tree/main/src/main/resources/patterns[patterns].
== Debugging Grok Expression
You can build and debug your patterns using the useful online tools: http://grokdebug.herokuapp.com/[Grok Debug] and http://grokconstructor.appspot.com/[Grok Constructor].
== Regular Expressions
Grok sits on top of regular expressions, so any regular expressions are valid in grok as well.
The Grok SMT uses the regular expression https://github.com/jruby/joni[Joni] library which is the Java port of https://github.com/kkos/oniguruma[Oniguruma] regexp library used by the http://www.elasticsearch.org/overview/[Elastic stack] (i.e Logstash).
== Custom Patterns
Sometimes, the patterns provided by the Grok SMT will not be sufficient to match your data.
Therefore, you have a few options to define custom patterns.
Option #1::
You can use the https://github.com/kkos/oniguruma[Oniguruma] syntax for _named capture_ which allows you to match a piece of text and capture it as a field:
[source]
----
(?the regex pattern)
----
For example, if you need to capture parts of an email we can you the following pattern :
[source]
----
(?(?^[A-Z0-9._%+-]+)(?@[A-Z0-9.-]+\.[A-Z]{2,6})$)
----
Configuration::
[source, properties]
----
transforms=Grok
transforms.Grok.type=io.streamthoughts.kafka.connect.transform.Grok$Value
transforms.Grok.pattern=(?(?[A-Za-z0-9._%+-]+)@(?[A-Za-z0-9.-]+\\.[A-Za-z]{2,6}))
----
_Note: The pattern `EMAILADDRESS` is already provided by the Grok SMT._
Option #2::
You can create a custom patterns file that will loaded the first time the Grok SMT is used :
For example, defining the pattern needed to parse NGINX access logs:::
* Create a directory (e.g. `grok-patterns`) with a file in it called `nginx`.
* Then, write the pattern you need in that file as: ``.
[source, bash]
----
$ mkdir ./grok-patterns
$ cat < ./grok-patterns/nginx
NGINX_ACCESS %{IPORHOST:remote_addr} - %{USERNAME:remote_user} \[%{HTTPDATE:time_local}\] \"%{DATA:request}\" %{INT:status} %{NUMBER:bytes_sent} \"%{DATA:http_referer}\" \"%{DATA:http_user_agent}\"
EOF
----
Configuration::
[source, properties]
----
transforms=Grok
transforms.Grok.type=io.streamthoughts.kafka.connect.transform.Grok$Value
transforms.Grok.pattern=%{NGINX_ACCESS}
transforms.Grok.patternsDir=/tmp/grok-patterns
----
== Grok Configuration
[%header,format=csv]
|===
Property,Description,Type,Importance, Default
`breakOnFirstPattern`, If true break on the first successful matching. Otherwise, the transformation will try all configured grok patterns, `boolean`, `true`
`pattern`, The grok expression to match and extract named captures (i.e data fields) with., `string`, High, -
`patterns.`, An ordered list of grok expression to match and extract named captures (i.e data fields) with., `string`, High, -
`patternDefinitions`, Custom pattern definitions, `list`, Low, -
`patternsDir`, List of user-defined pattern directories, `list`, Low, -
`namedCapturesOnly`, If true then only store named captures from grok, `boolean`, Medium, `true`
|===
== 💡 Contributions
Any feedback, bug reports and PRs are greatly appreciated!
* Source Code: https://github.com/streamthoughts/kafka-connect-transform-grok[https://github.com/streamthoughts/kafka-connect-transform-grok]
* Issue Tracker: https://github.com/streamthoughts/kafka-connect-transform-grok/issues[https://github.com/streamthoughts/kafka-connect-transform-grok/issues]
* Releases: https://github.com/streamthoughts/kafka-connect-transform-grok/releases[https://github.com/streamthoughts/kafka-connect-transform-grok/releases]
== About
Originally, most of the source code used by the Apache Kafka® SMT `io.streamthoughts.kafka.connect.transform.Grok` was developed within the https://github.com/streamthoughts/kafka-connect-file-pulse[Kafka Connect File Pulse] connector plugin.
== Licence
Copyright 2020-2021 StreamThoughts.
Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0[http://www.apache.org/licenses/LICENSE-2.0]
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.