https://github.com/vinted/flink-big-query-connector

Flink connector for BigQuery
https://github.com/vinted/flink-big-query-connector

bigquery flink flink-connector flink-connector-bigquery streaming

Last synced: 2 months ago
JSON representation

Flink connector for BigQuery

Host: GitHub
URL: https://github.com/vinted/flink-big-query-connector
Owner: vinted
License: mit
Created: 2023-07-24T13:27:10.000Z (almost 3 years ago)
Default Branch: main
Last Pushed: 2023-11-29T11:12:52.000Z (over 2 years ago)
Last Synced: 2023-11-29T16:54:49.519Z (over 2 years ago)
Topics: bigquery, flink, flink-connector, flink-connector-bigquery, streaming
Language: Java
Homepage:
Size: 155 KB
Stars: 10
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

          # Flink BigQuery Connector ![Build](https://github.com/vinted/flink-big-query-connector/actions/workflows/gradle.yml/badge.svg) [![](https://jitpack.io/v/com.vinted/flink-big-query-connector.svg)](https://jitpack.io/#com.vinted/flink-big-query-connector)

This project provides a BigQuery sink that allows writing data with exactly-once or at-least guarantees.

## Usage

There are builder classes to simplify constructing a BigQuery sink. The code snippet below shows an example of building a BigQuery sink in Java:

```java

var credentials = new JsonCredentialsProvider("key");

var clientProvider = new BigQueryProtoClientProvider(credentials,

    WriterSettings.newBuilder()

                 .build()

);

var bigQuerySink = BigQueryStreamSink.newBuilder()

    .withClientProvider(clientProvider)

    .withDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)

    .withRowValueSerializer(new NoOpRowSerializer<>())

    .build();

```

Async connector for at least once delivery

```java

var credentials = new JsonCredentialsProvider("key");

var clientProvider = new AsyncClientProvider(credentials,

    WriterSettings.newBuilder()

                 .build()

);

var sink = AsyncBigQuerySink.builder()

        .setRowSerializer(new NoOpRowSerializer<>())

        .setClientProvider(clientProvider)

        .setMaxBatchSize(30)

        .setMaxBufferedRequests(10)

        .setMaxBatchSizeInBytes(10000)

        .setMaxInFlightRequests(4)

        .setMaxRecordSizeInBytes(10000)

        .build();

```

The sink takes in a batch of records. Batching happens outside the sink by opening a window. Batched records need to implement the BigQueryRecord interface.

```java

var trigger = BatchTrigger.builder()

    .withCount(100)

    .withTimeout(Duration.ofSeconds(1))

    .withSizeInMb(1)

    .withResetTimerOnNewRecord(true)

    .build();

var processor = new BigQueryStreamProcessor()

    .withDeliveryGuarantee(DeliveryGuarantee.AT_LEAST_ONCE)

    .build();

source.key(s -> s)

    .window(GlobalWindows.create())

    .trigger(trigger)

    .process(processor);

```

To write to BigQuery, you need to:

- Define credentials

- Create a client provider

- Batch records

- Create a value serializer

- Sink to BigQuery

# Credentials

There are two types of credentials:

- Loading from a file

```java

new FileCredentialsProvider("/path/to/file")

```

- Passing as a JSON string

```java

new JsonCredentialsProvider("key")

```

# Types of Streams

BigQuery supports two types of data formats: json and proto. When creating a stream, you can choose these types by creating the appropriate client and using the builder methods.

- JSON

```java

var clientProvider = new BigQueryJsonClientProvider(credentials,

    WriterSettings.newBuilder()

                 .build()

);

var bigQuerySink = BigQueryStreamSink.newBuilder()

```

- Proto

```java

var clientProvider = new BigQueryProtoClientProvider(credentials,

    WriterSettings.newBuilder()

                 .build()

);

var bigQuerySink = BigQueryStreamSink.newBuilder();

```

# Exactly once

It utilizes a [buffered stream](https://cloud.google.com/bigquery/docs/write-api#buffered_type), managed by the BigQueryStreamProcessor, to assign and process data batches. If a stream is inactive or closed, a new stream is created automatically. The BigQuery sink writer appends and flushes data to the latest offset upon checkpoint commit.

# At least once

Data is written to the [default stream](https://cloud.google.com/bigquery/docs/write-api#default_stream) and handled by the BigQueryStreamProcessor, which batches and sends rows to the sink for processing.

# Serializers

For the proto stream, you need to implement `ProtoValueSerializer`, and for the JSON stream, you need to implement `JsonRowValueSerializer`.

# Metrics

  

    

      Scope

      Metrics

      Description

      Type

    

  

  

    

        Stream

        stream_offset

        Current offset for the stream. When using at least once, the offset is always 0

        Gauge

    

    

        batch_count

        Number of records in the appended batch

        Gauge

    

    

        batch_size_mb

        Appended batch size in mb

        Gauge

    

    

        split_batch_count

        Number of times the batch hit the BigQuery limit and was split into two parts

        Gauge

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/vinted/flink-big-query-connector

Awesome Lists containing this project

README