https://github.com/treeverse/lakefs-iceberg

A custom Iceberg catalog implementation for lakeFS

## lakeFS Iceberg Catalog

lakeFS enriches your Iceberg tables with Git capabilities: create a branch and make your changes in isolation, without affecting other team members.

See the instructions below on how to use it, and check out the integration in action in the [lakeFS samples repository](https://github.com/treeverse/lakeFS-samples/).

### Install

Use the following Maven dependency to install the lakeFS custom catalog:

```xml
<dependency>
  <groupId>io.lakefs</groupId>
  <artifactId>lakefs-iceberg</artifactId>
  <version>0.1.4</version>
</dependency>
```

### Configure

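Configure the lakeFS custom catalog in Spark by setting the following properties (`conf` here is assumed to be a `SparkConf`). They register a catalog named `lakefs` and point its warehouse at the lakeFS repository `example-repo`:

```scala
// Register a Spark catalog named "lakefs" that is backed by the lakeFS Iceberg catalog
conf.set("spark.sql.catalog.lakefs", "org.apache.iceberg.spark.SparkCatalog")
conf.set("spark.sql.catalog.lakefs.catalog-impl", "io.lakefs.iceberg.LakeFSCatalog")
conf.set("spark.sql.catalog.lakefs.warehouse", "lakefs://example-repo")
```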

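You will also need to configure the S3A Hadoop FileSystem to interact with lakeFS. Replace the example credentials and endpoint with those of your own lakeFS installation. When these options are set on a `SparkConf`, they need the `spark.hadoop.` prefix so that Spark forwards them to the Hadoop configuration; drop the prefix if you set them directly on a Hadoop `Configuration`:

```scala
// lakeFS credentials and endpoint (example values)
conf.set("spark.hadoop.fs.s3a.access.key", "AKIAlakefs12345EXAMPLE")
conf.set("spark.hadoop.fs.s3a.secret.key", "abc/lakefs/1234567bPxRfiCYEXAMPLEKEY")
conf.set("spark.hadoop.fs.s3a.endpoint", "https://example-org.us-east-1.lakefscloud.io")
conf.set("spark.hadoop.fs.s3a.path.style.access", "true")
```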
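
The same properties can also be passed on the command line with Spark's `--conf key=value` flag, where Hadoop options take the `spark.hadoop.` prefix. The sketch below only prints the flags so they can be reviewed first. `lakefs_conf_flags` is a hypothetical helper name, and the credentials are deliberately left out so they do not end up in shell history.

```shell
#!/bin/sh
# Hypothetical helper, not part of lakeFS: prints one "--conf key=value" flag
# per line for use with spark-sql, spark-shell or spark-submit.
lakefs_conf_flags() {
  repo="$1"       # lakeFS repository name, e.g. example-repo
  endpoint="$2"   # lakeFS endpoint URL
  for kv in \
    "spark.sql.catalog.lakefs=org.apache.iceberg.spark.SparkCatalog" \
    "spark.sql.catalog.lakefs.catalog-impl=io.lakefs.iceberg.LakeFSCatalog" \
    "spark.sql.catalog.lakefs.warehouse=lakefs://$repo" \
    "spark.hadoop.fs.s3a.endpoint=$endpoint" \
    "spark.hadoop.fs.s3a.path.style.access=true"
  do
    printf '%s %s\n' --conf "$kv"
  done
}

lakefs_conf_flags example-repo https://example-org.us-east-1.lakefscloud.io
```

The access key and secret key would still need to be supplied, for example through the `conf.set` calls shown earlier.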

### Create a table

To create a table on your main branch and add the two rows that the queries below expect, run:

```sql
CREATE TABLE lakefs.main.table1 (id int, data string);
INSERT INTO lakefs.main.table1 VALUES (1, 'data1'), (2, 'data2');
```
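
In the identifier `lakefs.main.table1`, `lakefs` is the catalog name, `main` is the lakeFS branch, and `table1` is the table. As a rough sketch of that naming scheme, the helpers below build the Spark identifier and the matching lakeFS branch URI for any branch. The function names `table_ident` and `branch_uri` are made up for illustration, and the catalog name `lakefs` is assumed to match the configuration in this README.

```shell
#!/bin/sh
# Hypothetical helpers, not part of lakeFS or lakectl.

# Spark SQL identifier: lakefs.<branch>.<table>
table_ident() {
  branch="$1"
  table="$2"
  printf 'lakefs.%s.%s\n' "$branch" "$table"
}

# lakeFS URI of a branch: lakefs://<repo>/<branch>
branch_uri() {
  repo="$1"
  branch="$2"
  printf 'lakefs://%s/%s\n' "$repo" "$branch"
}

table_ident main table1        # prints lakefs.main.table1
branch_uri example-repo main   # prints lakefs://example-repo/main
```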

### Create a branch

First, commit the new table to the main branch:

```shell
lakectl commit lakefs://example-repo/main -m "my first iceberg commit"
```

Then create a `dev` branch from `main`:

```shell
lakectl branch create lakefs://example-repo/dev -s lakefs://example-repo/main
```
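
The commit and branch commands can be wrapped in a small script when the repository or branch names vary. The sketch below is a dry run: it only prints the two `lakectl` invocations used in this section instead of executing them. `print_branch_cmds` is a hypothetical helper name, and the commit message is assumed to contain no double quotes.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the lakectl commands instead of running them.
print_branch_cmds() {
  repo="$1"      # lakeFS repository, e.g. example-repo
  src="$2"       # branch to commit and to branch from, e.g. main
  new="$3"       # branch to create, e.g. dev
  message="$4"   # commit message (assumed to contain no double quotes)
  printf 'lakectl commit lakefs://%s/%s -m "%s"\n' "$repo" "$src" "$message"
  printf 'lakectl branch create lakefs://%s/%s -s lakefs://%s/%s\n' \
    "$repo" "$new" "$repo" "$src"
}

print_branch_cmds example-repo main dev "my first iceberg commit"
```

Once the printed commands look right, they can be piped to `sh` to run them.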

### Make changes on the branch

We can now make changes on the `dev` branch by addressing the table through it:

```sql
INSERT INTO lakefs.dev.table1 VALUES (3, 'data3');
```

### Query the table

If we query the table on the branch, we will see the data we inserted:

```sql
SELECT * FROM lakefs.dev.table1;
```

Results in:
```
+----+-------+
| id | data  |
+----+-------+
| 1  | data1 |
| 2  | data2 |
| 3  | data3 |
+----+-------+
```

However, if we query the table on the main branch, we will not see the new changes:

```sql
SELECT * FROM lakefs.main.table1;
```

Results in:
```
+----+-------+
| id | data  |
+----+-------+
| 1  | data1 |
| 2  | data2 |
+----+-------+
```
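
To list the rows that differ between two branches rather than comparing the result sets by eye, one option is Spark SQL's `EXCEPT`. The sketch below prints such a query. `branch_diff_sql` is a hypothetical helper name, and the catalog is assumed to be named `lakefs` as configured above.

```shell
#!/bin/sh
# Hypothetical helper: prints a query that returns the rows present on one
# branch but absent from another.
branch_diff_sql() {
  table="$1"     # table name, e.g. table1
  changed="$2"   # branch holding the new rows, e.g. dev
  base="$3"      # branch to compare against, e.g. main
  printf 'SELECT * FROM lakefs.%s.%s EXCEPT SELECT * FROM lakefs.%s.%s;\n' \
    "$changed" "$table" "$base" "$table"
}

branch_diff_sql table1 dev main
```

With the data in this walkthrough, the generated query should return only the row `(3, 'data3')`.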