https://github.com/benfradet/spark-kafka-writer
Write your Spark data to Kafka seamlessly
- Host: GitHub
- URL: https://github.com/benfradet/spark-kafka-writer
- Owner: BenFradet
- License: apache-2.0
- Created: 2016-07-20T20:07:13.000Z (almost 9 years ago)
- Default Branch: master
- Last Pushed: 2024-07-10T13:26:00.000Z (12 months ago)
- Last Synced: 2025-04-06T23:17:49.558Z (3 months ago)
- Topics: kafka, spark
- Language: Scala
- Homepage: https://benfradet.github.io/spark-kafka-writer
- Size: 1.36 MB
- Stars: 174
- Watchers: 12
- Forks: 65
- Open Issues: 20
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# spark-kafka-writer
[Build Status](https://travis-ci.org/BenFradet/spark-kafka-writer)
[Code Coverage](https://codecov.io/gh/BenFradet/spark-kafka-writer)
[Gitter](https://gitter.im/BenFradet/spark-kafka-writer?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)
[Maven Central](https://maven-badges.herokuapp.com/maven-central/com.github.benfradet/spark-kafka-writer_2.12)
[Waffle](https://waffle.io/BenFradet/spark-kafka-writer)

Write your Spark data to Kafka seamlessly
## Installation
spark-kafka-writer is available on Maven Central with the following coordinates, depending on whether
you're using Kafka 0.8 or 0.10 and on your version of Spark:

| | Kafka 0.8 | Kafka 0.10 |
|:-:|:-:|:-:|
| **Spark 2.4.X** | :x: | `"com.github.benfradet" %% "spark-kafka-writer" % "0.5.0"` |
| **Spark 2.2.X** | :x: | `"com.github.benfradet" %% "spark-kafka-writer" % "0.4.0"` |
| **Spark 2.1.X** | `"com.github.benfradet" %% "spark-kafka-0-8-writer" % "0.3.0"` | `"com.github.benfradet" %% "spark-kafka-0-10-writer" % "0.3.0"` |
| **Spark 2.0.X** | `"com.github.benfradet" %% "spark-kafka-0-8-writer" % "0.2.0"` | `"com.github.benfradet" %% "spark-kafka-0-10-writer" % "0.2.0"` |
| **Spark 1.6.X** | `"com.github.benfradet" %% "spark-kafka-writer" % "0.1.0"` | :x: |
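For example, to pull in the Kafka 0.10 writer for Spark 2.4.X with sbt (adapt the coordinate to your row in the table above):

```scala
// build.sbt
libraryDependencies += "com.github.benfradet" %% "spark-kafka-writer" % "0.5.0"
```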
## Usage

### Without callbacks
- if you want to save an `RDD` to Kafka:
```scala
import com.github.benfradet.spark.kafka.writer._
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.StringSerializer
import org.apache.spark.rdd.RDD

val topic = "my-topic"
val producerConfig = Map(
  "bootstrap.servers" -> "127.0.0.1:9092",
  "key.serializer" -> classOf[StringSerializer].getName,
  "value.serializer" -> classOf[StringSerializer].getName
)

val rdd: RDD[String] = ...
rdd.writeToKafka(
producerConfig,
s => new ProducerRecord[String, String](topic, s)
)
```

- if you want to save a `DStream` to Kafka:
```scala
import com.github.benfradet.spark.kafka.writer._
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.StringSerializer
import org.apache.spark.streaming.dstream.DStream

val dStream: DStream[String] = ...
dStream.writeToKafka(
producerConfig,
s => new ProducerRecord[String, String](topic, s)
)
```

- if you want to save a `Dataset` to Kafka:
```scala
import com.github.benfradet.spark.kafka.writer._
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.StringSerializer
import org.apache.spark.sql.Dataset

case class Foo(a: Int, b: String)
val dataset: Dataset[Foo] = ...
dataset.writeToKafka(
producerConfig,
foo => new ProducerRecord[String, String](topic, foo.toString)
)
```

- if you want to write a `DataFrame` to Kafka:
```scala
import com.github.benfradet.spark.kafka.writer._
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.StringSerializer
import org.apache.spark.sql.DataFrame

val dataFrame: DataFrame = ...
dataFrame.writeToKafka(
producerConfig,
row => new ProducerRecord[String, String](topic, row.toString)
)
```
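Note that in all the examples above, records are written with a value only. `ProducerRecord` also has constructors taking a key (and, optionally, a partition), so keyed writes need no extra machinery. A minimal sketch, keying each record by its first character (a purely illustrative choice):

```scala
// a sketch: key each record so Kafka partitions by key
// (the first-character key is purely illustrative)
rdd.writeToKafka(
  producerConfig,
  s => new ProducerRecord[String, String](topic, s.take(1), s)
)
```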
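Similarly, if plain `Row#toString` output is not what you want on the Kafka side, one option is to go through `DataFrame#toJSON`, which yields a `Dataset[String]` of JSON documents that can be written as above:

```scala
// a sketch: serialize each row to a JSON string before writing
dataFrame.toJSON.writeToKafka(
  producerConfig,
  json => new ProducerRecord[String, String](topic, json)
)
```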
### With callbacks

It is also possible to provide a `Callback` from the Kafka producer API that will be
triggered after each write; this parameter defaults to `None`.

The `Callback` must implement the `onCompletion` method, whose `Exception`
parameter will be `null` if the write was successful.

Any `Callback` implementation needs to be serializable to be used in Spark.
For example, if you want to use a `Callback` when saving an `RDD` to Kafka:
```scala
// replace by kafka08 if you're using Kafka 0.8
import com.github.benfradet.spark.kafka010.writer._
import org.apache.kafka.clients.producer.{Callback, ProducerRecord, RecordMetadata}
import org.apache.spark.rdd.RDD

@transient lazy val log = org.apache.log4j.Logger.getLogger("spark-kafka-writer")
val rdd: RDD[String] = ...
rdd.writeToKafka(
producerConfig,
s => new ProducerRecord[String, String](topic, s),
Some(new Callback with Serializable {
override def onCompletion(metadata: RecordMetadata, e: Exception): Unit = {
if (Option(e).isDefined) {
log.warn("error sending message", e)
} else {
log.info(s"write succeeded! offset: ${metadata.offset()}")
}
}
})
)
```
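Because the callback is shipped to the executors, it can also be convenient to define it once as a serializable object rather than inline. A minimal sketch reusing the logging idea above (the `LoggingCallback` name is ours, not part of the library):

```scala
import org.apache.kafka.clients.producer.{Callback, ProducerRecord, RecordMetadata}

// a reusable, serializable callback (illustrative, not part of the library)
object LoggingCallback extends Callback with Serializable {
  @transient lazy val log = org.apache.log4j.Logger.getLogger("spark-kafka-writer")
  override def onCompletion(metadata: RecordMetadata, e: Exception): Unit =
    if (e != null) log.warn("error sending message", e)
    else log.info(s"write succeeded! offset: ${metadata.offset()}")
}

rdd.writeToKafka(
  producerConfig,
  s => new ProducerRecord[String, String](topic, s),
  Some(LoggingCallback)
)
```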
Check out [the Kafka documentation](http://kafka.apache.org/0102/javadoc/org/apache/kafka/clients/producer/KafkaProducer.html#send(org.apache.kafka.clients.producer.ProducerRecord,%20org.apache.kafka.clients.producer.Callback))
to learn more about callbacks.

### Java usage
It's also possible to use the library from Java; for example, to write a `DStream` to Kafka:
```java
// Define a serializable Function1 separately
abstract class SerializableFunc1<T, R> extends AbstractFunction1<T, R> implements Serializable {}

Map<String, Object> producerConfig = new HashMap<>();
producerConfig.put("bootstrap.servers", "localhost:9092");
producerConfig.put("key.serializer", StringSerializer.class);
producerConfig.put("value.serializer", StringSerializer.class);

KafkaWriter<String> kafkaWriter = new DStreamKafkaWriter<>(javaDStream.dstream(),
  scala.reflect.ClassTag$.MODULE$.apply(String.class));
kafkaWriter.writeToKafka(producerConfig.asScala,
  new SerializableFunc1<String, ProducerRecord<String, String>>() {
    @Override
    public ProducerRecord<String, String> apply(final String s) {
return new ProducerRecord<>(topic, s);
}
},
//new Some<>((metadata, exception) -> {}), // with callback, define your lambda here.
Option.empty() // or without callback.
);
```

However, [#59](https://github.com/benfradet/spark-kafka-writer/issues/59) will provide a better Java API.
## Scaladoc
You can find the full scaladoc at https://benfradet.github.io/spark-kafka-writer.
## Credit
The original code was written by [Hari Shreedharan](https://github.com/harishreedharan).