# Conduit's connector for S3 and Apache Iceberg

The S3-Apache Iceberg connector is a [Conduit](https://github.com/ConduitIO/conduit) destination connector
that writes data to S3 in the [Apache Iceberg](https://iceberg.apache.org/) format.

## Prerequisites
* JDK 17
* Currently, only Unix-like OSes are supported.

## How to build
First, build the connector by running `scripts/dist.sh`. This creates a new directory, `dist`, whose
contents need to be copied into the Conduit connectors directory (which, by default, is `connectors`).
Read more about installing Conduit connectors
[here](https://conduit.io/docs/connectors/installing).
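
For illustration, a minimal sketch of the build and install steps, assuming Conduit runs from the current directory and uses the default `connectors` directory (paths are illustrative):

```sh
# Build the connector; the distribution is written to ./dist
scripts/dist.sh

# Copy the distribution into Conduit's connectors directory
# (default location shown; adjust the path if Conduit is configured differently)
cp -r dist/* connectors/
```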

## Destination
This destination connector pushes data from upstream resources to an Apache Iceberg table in S3 via Conduit.

For inserts, the record is appended to the table, which writes an Iceberg-formatted data file to S3. For deletes,
the `Record.Key` field of the OpenCDC record is used as the condition for deleting the matching record. Finally, for
updates, the record is first deleted according to its `Record.Key` value, and the updated version is then inserted.

## Configuration

| Field JSON Name | Description | Required | Default Value | Example |
|--------------------------|-------------|----------|---------------|---------|
| `catalog.name` | Catalog name | true | - | "exampleCatalog" |
| `namespace` | Namespace | true | - | "exampleNamespace" |
| `table.name` | Table name | true | - | "exampleTable" |
| `s3.endpoint` | S3 endpoint URL | true | - | "https://s3.example.com" |
| `s3.accessKeyId` | S3 Access Key ID | true | - | "exampleAccessKeyId" |
| `s3.secretAccessKey` | S3 Secret Access Key | true | - | "exampleSecretKey" |
| `s3.region` | S3 region | true | - | "us-east-1" |
| `catalog.catalog-impl` | Catalog implementation to be used. Possible values: `"org.apache.iceberg.hadoop.HadoopCatalog"`, `"org.apache.iceberg.jdbc.JdbcCatalog"`, `"org.apache.iceberg.rest.RESTCatalog"` | true | - | `"org.apache.iceberg.rest.RESTCatalog"` |
| `catalog.{propertyName}` | Set a catalog property with the name `propertyName` | true | - | {"catalog.uri": "http://localhost:8181"} |

## Example pipeline configuration file
```yaml
---
version: 2.0
pipelines:
  - id: postgres-to-s3-iceberg
    status: running
    description: Postgres to S3 Iceberg
    connectors:
      - id: pg-source
        type: source
        plugin: "builtin:postgres"
        name: source1
        settings:
          url: "postgresql://username:password@localhost/dbname?sslmode=disable"
          key: id
          table: "iceberg_input"
          snapshotMode: "never"
          cdcMode: "logrepl"
      - id: s3-iceberg-destination
        type: destination
        plugin: standalone:s3-iceberg
        name: destination1
        settings:
          namespace: "conduit"
          table.name: "test_table"
          catalog.name: "demo"
          catalog.catalog-impl: "org.apache.iceberg.rest.RESTCatalog"
          catalog.uri: "http://localhost:8181"
          s3.endpoint: "http://localhost:9000"
          s3.accessKeyId: "admin"
          s3.secretAccessKey: "password"
          s3.region: "us-east-1"
```
Check the [Pipeline Configuration Files Docs](https://github.com/ConduitIO/conduit/blob/main/docs/pipeline_configuration_files.md)
for more details on how to run this configuration file.
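
As a sketch, assuming Conduit's default behavior of loading pipeline configuration files from the `pipelines` directory next to the binary (the file name is illustrative):

```sh
# Save the configuration above as a YAML file in Conduit's pipelines directory
mkdir -p pipelines
cp postgres-to-s3-iceberg.yaml pipelines/

# Start Conduit; pipeline configuration files in ./pipelines are loaded on startup
./conduit
```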

## Known Limitations
On rare occasions, when updating a record (which consists of first deleting it and then inserting the updated version),
an error may occur between the delete and the insert. In that case an error is returned, and the record might be lost,
because the delete has already been applied. Check your table's snapshots to recover the data if this happens.