https://github.com/sanjuthomas/kafka-connect-marklogic

Kafka Sink Connect MarkLogic
https://github.com/sanjuthomas/kafka-connect-marklogic

apache java kafka kafka-connect kafka-connector marklogic rest

Last synced: 2 months ago
JSON representation

Kafka Sink Connect MarkLogic

Host: GitHub
URL: https://github.com/sanjuthomas/kafka-connect-marklogic
Owner: sanjuthomas
License: mit
Created: 2017-03-18T22:08:45.000Z (over 8 years ago)
Default Branch: master
Last Pushed: 2024-06-30T23:58:18.000Z (about 1 year ago)
Last Synced: 2025-05-03T21:11:53.990Z (2 months ago)
Topics: apache, java, kafka, kafka-connect, kafka-connector, marklogic, rest
Language: Java
Homepage: http://sanjuthomas.com
Size: 8.86 MB
Stars: 19
Watchers: 7
Forks: 4
Open Issues: 5
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE

Awesome Lists containing this project

README

# Overview

Kafka Connect MarkLogic is a sink-only connector that pulls messages from Kafka and stores them in MarkLogic as JSON documents. Consider using the Kafka MarkLogic Sync Connector for the following use cases.

1. Near real-time ingestion requirements.
2. Regulate the traffic towards MarkLogic.
3. Requirement to maintain the order of the messages.
4. A Kafka ecosystem exists.
5. Replay messages using an offset.

## High-Level Architecture Diagram

![Kafka Connect MarkLogic](KafkaConnectMarkLogic.png)

This project is about the component marked in green. Refer [here](http://confluent.io/product/connectors) to see the available list of source connectors.

## Prerequisites

[Apache ZooKeeper](https://zookeeper.apache.org) and [Apache Kafka](https://kafka.apache.org) are installed and running on your machine. Please refer to respective sites to download and start ZooKeeper and Kafka.

## What is MarkLogic?
The MarkLogic is a multi-model schemaless NoSQL database to store, manage, and search JSON, XML, and RDF triples. For more details, please refer to MarkLogic's official [website](marklogic.com).

## What is Apache Kafka?
Apache Kafka is an open-source stream processing platform written in Scala and Java by the Apache Software Foundation. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. For more details, please refer to the [kafka home page](https://kafka.apache.org/).

## Implementation details

To send data to MarkLogic, this connector uses MarkLogic REST API. By default, the /v1/documents endpoint at port 8000 is used. You may change that in the marklogic-sink.properties file. You may use your own REST/Service extension instead of the out-of-the-box document API to transform anything on the way in.

To listen to multiple topics, please add the topic name in the marklogic-sink.properties file. The current implementation will take the topic name and use it as the MarkLogic document collection name.

## Data Mapping

MarkLogic is a schemaless document store/NoSQL database. Since we are working with plain JSON data, we don't need a schema to serialize and deserialize the messages.

**For stand-alone mode**, please copy ```kafka_home/config/connect-standalone.properties``` to create ```kafka_home/config/marklogic-connect-standalone.properties``` file. Open ```kafka_home/config/marklogic-connect-standalone.properties``` and set the following properties to false.

```
key.converter.schemas.enable=false
value.converter.schemas.enable=false
```

**For distributed mode**, please copy ```kafka_home/config/connect-distributed.properties``` to create ```kafka_home/config/marklogic-connect-distributed.properties``` file. Open ```kafka_home/config/marklogic-connect-distributed.properties``` and set the following properties to false.

```
key.converter.schemas.enable=false
value.converter.schemas.enable=false
```

In distributed mode, if you run more than one worker per host, the ```rest.port``` settings must have different values for each instance. By default REST interface is available at 8083.

## How do you deploy the connector in Kafka?

This is a maven project. To create an [uber](https://maven.apache.org/plugins/maven-shade-plugin/index.html) jar, execute the following maven goals.

```mvn clean compile package shade:shade install```

Copy the artifact ```kafka-connect-marklogic-1.0.jar``` to kakfa_home/lib folder.

Copy the [marklogic-sink.properties](https://github.com/sanjuthomas/kafka-connect-marklogic/blob/master/config/marklogic-sink.properties) file into kafka_home/config folder. Update the content of the property file according to your environment.

Alternatively, you may keep the ```kafka-connect-marklogic-1.0.jar``` in another directory and export that directory into Kafka class path before starting the connector.

## How to start the connector in stand-alone mode?

Open a shell prompt, move to kafka_home, and execute the following.

```
bin/connect-standalone.sh config/marklogic-connect-standalone.properties config/marklogic-sink.properties
```

## How to start the connector in distributed mode?

Open a shell prompt, move to kafka_home, and execute the following.

```
bin/connect-distributed.sh config/marklogic-connect-distributed.properties config/marklogic-sink.properties
```

## How to produce some messages?

Some examples of message producers are available [here](https://github.com/sanjuthomas/marklogic-kafka-sample-client).

## Contact

Create an issue in the GitHub.

## Contribute

Please refer to [CONTRIBUTING.md](https://github.com/sanjuthomas/kafka-connect-marklogic/blob/master/CONTRIBUTING.md)

## License
The project is licensed under the MIT license.

## Looking for a connector that supports XML messages?

Drop me a line - [email protected]

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/sanjuthomas/kafka-connect-marklogic

Awesome Lists containing this project

README