https://github.com/vinted/flink-big-query-connector
Flink connector for BigQuery
bigquery flink flink-connector flink-connector-bigquery streaming
- Host: GitHub
- URL: https://github.com/vinted/flink-big-query-connector
- Owner: vinted
- License: mit
- Created: 2023-07-24T13:27:10.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2023-11-29T11:12:52.000Z (over 2 years ago)
- Last Synced: 2023-11-29T16:54:49.519Z (over 2 years ago)
- Topics: bigquery, flink, flink-connector, flink-connector-bigquery, streaming
- Language: Java
- Homepage:
- Size: 155 KB
- Stars: 10
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Flink BigQuery Connector [JitPack](https://jitpack.io/#com.vinted/flink-big-query-connector)
This project provides a BigQuery sink for Flink that writes data with exactly-once or at-least-once delivery guarantees.
## Usage
There are builder classes to simplify constructing a BigQuery sink. The code snippet below shows an example of building a BigQuery sink in Java:
```java
// Credentials passed as a JSON string
var credentials = new JsonCredentialsProvider("key");
// Client provider for the proto data format
var clientProvider = new BigQueryProtoClientProvider(credentials,
        WriterSettings.newBuilder()
                .build()
);
// Sink with exactly-once delivery
var bigQuerySink = BigQueryStreamSink.newBuilder()
        .withClientProvider(clientProvider)
        .withDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
        .withRowValueSerializer(new NoOpRowSerializer<>())
        .build();
```
An async connector is available for at-least-once delivery:
```java
var credentials = new JsonCredentialsProvider("key");
var clientProvider = new AsyncClientProvider(credentials,
        WriterSettings.newBuilder()
                .build()
);
var sink = AsyncBigQuerySink.builder()
        .setRowSerializer(new NoOpRowSerializer<>())
        .setClientProvider(clientProvider)
        .setMaxBatchSize(30)
        .setMaxBufferedRequests(10)
        .setMaxBatchSizeInBytes(10000)
        .setMaxInFlightRequests(4)
        .setMaxRecordSizeInBytes(10000)
        .build();
```
The sink takes in a batch of records. Batching happens outside the sink by opening a window. Batched records need to implement the BigQueryRecord interface.
```java
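// Close a batch after 100 records, 1 second, or 1 MB, whichever comes first
var trigger = BatchTrigger.builder()
        .withCount(100)
        .withTimeout(Duration.ofSeconds(1))
        .withSizeInMb(1)
        .withResetTimerOnNewRecord(true)
        .build();
var processor = new BigQueryStreamProcessor()
        .withDeliveryGuarantee(DeliveryGuarantee.AT_LEAST_ONCE)
        .build();
// Key the stream, collect records in a global window, and let the trigger decide when to emit a batch
source.keyBy(s -> s)
        .window(GlobalWindows.create())
        .trigger(trigger)
        .process(processor);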
```
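To make the batching rules concrete, here is a minimal, self-contained sketch of a count / size / timeout policy in plain Java. It is not the connector's `BatchTrigger`: the class `BatchPolicySketch`, its methods, and the interpretation that the timer restarts on every new record are assumptions made for illustration only.

```java
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical toy batcher: closes a batch on a count, byte-size, or idle-timeout limit. */
final class BatchPolicySketch {
    private final int maxCount;
    private final long maxBytes;
    private final long timeoutMillis;
    private final List<String> buffer = new ArrayList<>();
    private long bufferedBytes = 0;
    private long lastRecordMillis = 0;

    BatchPolicySketch(int maxCount, long maxBytes, Duration timeout) {
        this.maxCount = maxCount;
        this.maxBytes = maxBytes;
        this.timeoutMillis = timeout.toMillis();
    }

    /** Adds a record; returns the finished batch if a limit was hit, otherwise an empty list. */
    List<String> add(String record, long nowMillis) {
        buffer.add(record);
        bufferedBytes += record.getBytes(StandardCharsets.UTF_8).length;
        // Assumption: each new record restarts the timeout clock.
        lastRecordMillis = nowMillis;
        if (buffer.size() >= maxCount || bufferedBytes >= maxBytes) {
            return drain();
        }
        return List.of();
    }

    /** Called periodically; emits the pending batch once no record arrived for the timeout. */
    List<String> onTimer(long nowMillis) {
        if (!buffer.isEmpty() && nowMillis - lastRecordMillis >= timeoutMillis) {
            return drain();
        }
        return List.of();
    }

    /** Hands out the buffered records and resets the counters. */
    private List<String> drain() {
        List<String> batch = new ArrayList<>(buffer);
        buffer.clear();
        bufferedBytes = 0;
        return batch;
    }
}
```

Time is passed in explicitly so the policy can be exercised without a running Flink job; in a real pipeline the window trigger plays this role.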
To write to BigQuery, you need to:
- Define credentials
- Create a client provider
- Batch records
- Create a value serializer
- Sink to BigQuery
# Credentials
There are two types of credentials:
- Loading from a file
```java
new FileCredentialsProvider("/path/to/file")
```
- Passing as a JSON string
```java
new JsonCredentialsProvider("key")
```
# Types of Streams
BigQuery supports two data formats: JSON and proto. When creating a stream, you choose the format by creating the matching client provider and passing it to the sink builder.
- JSON
```java
var clientProvider = new BigQueryJsonClientProvider(credentials,
        WriterSettings.newBuilder()
                .build()
);
var bigQuerySink = BigQueryStreamSink.newBuilder();
```
- Proto
```java
var clientProvider = new BigQueryProtoClientProvider(credentials,
        WriterSettings.newBuilder()
                .build()
);
var bigQuerySink = BigQueryStreamSink.newBuilder();
```
# Exactly once
Exactly-once delivery uses a [buffered stream](https://cloud.google.com/bigquery/docs/write-api#buffered_type), managed by the BigQueryStreamProcessor, to assign and process data batches. If a stream is inactive or closed, a new stream is created automatically. The BigQuery sink writer appends data and, when a checkpoint is committed, flushes the stream up to the latest offset.
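The append-then-flush idea can be illustrated with a toy in-memory model. This is a sketch under stated assumptions, not the connector's code and not the BigQuery client API: `BufferedStreamSketch` and its methods are hypothetical, and the flush offset is treated as an exclusive end offset for simplicity.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical toy model of a buffered stream: appended rows stay invisible until flushed. */
final class BufferedStreamSketch {
    private final List<String> rows = new ArrayList<>();
    private long flushedOffset = 0; // rows in [0, flushedOffset) are visible to readers

    /** Offset the next append must use. */
    long nextOffset() {
        return rows.size();
    }

    /**
     * Appends a batch at the expected offset. A replayed batch carries a stale offset
     * and is rejected, which is what prevents duplicates after a restart.
     */
    boolean append(long expectedOffset, List<String> batch) {
        if (expectedOffset != rows.size()) {
            return false;
        }
        rows.addAll(batch);
        return true;
    }

    /** Makes rows before endOffsetExclusive visible; intended to run on checkpoint commit. */
    void flush(long endOffsetExclusive) {
        if (endOffsetExclusive > rows.size()) {
            throw new IllegalArgumentException("cannot flush beyond appended rows");
        }
        flushedOffset = Math.max(flushedOffset, endOffsetExclusive);
    }

    /** Rows a reader would see. */
    List<String> visibleRows() {
        return List.copyOf(rows.subList(0, (int) flushedOffset));
    }
}
```

In this model a batch that is appended twice with the same offset is stored once, and nothing becomes visible until `flush` is called, mirroring the checkpoint-driven commit described above.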
# At least once
Data is written to the [default stream](https://cloud.google.com/bigquery/docs/write-api#default_stream) and handled by the BigQueryStreamProcessor, which batches and sends rows to the sink for processing.
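Why this mode is only at-least-once can be shown with a small simulation. The sketch below is an illustration with hypothetical names (`AtLeastOnceSketch`, `DefaultStream`, `sendWithRetry`), not the connector's retry logic: rows are committed as soon as they are appended, so a retry after a lost acknowledgement writes them again.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical simulation of retries against a stream without offsets. */
final class AtLeastOnceSketch {

    /** Toy stand-in for a default stream: every append is committed immediately. */
    static final class DefaultStream {
        final List<String> committed = new ArrayList<>();
        private boolean dropNextAck;

        DefaultStream(boolean dropFirstAck) {
            this.dropNextAck = dropFirstAck;
        }

        void append(List<String> batch) {
            committed.addAll(batch); // the rows land first
            if (dropNextAck) {       // then the acknowledgement is lost once
                dropNextAck = false;
                throw new IllegalStateException("ack lost");
            }
        }
    }

    /** Retries until an append is acknowledged; returns the number of attempts used. */
    static int sendWithRetry(DefaultStream stream, List<String> batch, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                stream.append(batch);
                return attempt;
            } catch (IllegalStateException e) {
                // The sender cannot tell whether the rows landed, so it sends them again.
            }
        }
        throw new IllegalStateException("gave up after " + maxAttempts + " attempts");
    }
}
```

With a dropped first acknowledgement, a one-row batch ends up committed twice. Downstream consumers of an at-least-once table therefore need to tolerate or deduplicate repeated rows.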
# Serializers
For the proto stream, you need to implement `ProtoValueSerializer`, and for the JSON stream, you need to implement `JsonRowValueSerializer`.
# Metrics
| Scope | Metrics | Description | Type |
|-------|---------|-------------|------|
| Stream | stream_offset | Current offset for the stream. When using at least once, the offset is always 0 | Gauge |
| | batch_count | Number of records in the appended batch | Gauge |
| | batch_size_mb | Appended batch size in MB | Gauge |
| | split_batch_count | Number of times the batch hit the BigQuery limit and was split into two parts | Gauge |
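
The splitting behind `split_batch_count` can be pictured as halving a batch that exceeds a size limit. The following is a minimal sketch under assumptions: `BatchSplitSketch` is a hypothetical helper, the size is measured as UTF-8 bytes of string rows, and the recursive halving is an illustrative choice rather than the connector's documented algorithm.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that halves a batch until every part fits a byte limit. */
final class BatchSplitSketch {

    /** Total UTF-8 size of all rows in the batch. */
    static long sizeInBytes(List<String> batch) {
        long total = 0;
        for (String row : batch) {
            total += row.getBytes(StandardCharsets.UTF_8).length;
        }
        return total;
    }

    /** Splits the batch into parts of at most maxBytes; a single oversized row is kept as-is. */
    static List<List<String>> splitToFit(List<String> batch, long maxBytes) {
        List<List<String>> parts = new ArrayList<>();
        split(batch, maxBytes, parts);
        return parts;
    }

    private static void split(List<String> batch, long maxBytes, List<List<String>> parts) {
        if (batch.size() <= 1 || sizeInBytes(batch) <= maxBytes) {
            if (!batch.isEmpty()) {
                parts.add(List.copyOf(batch));
            }
            return;
        }
        // Too large: cut in the middle and check each half again.
        int mid = batch.size() / 2;
        split(batch.subList(0, mid), maxBytes, parts);
        split(batch.subList(mid, batch.size()), maxBytes, parts);
    }
}
```

Each time a batch is cut in the middle corresponds to one increment of a counter like `split_batch_count` in this model.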