https://github.com/abc3/ex_iceberg_port

Elixir bindings for Apache Iceberg via Apache Spark
https://github.com/abc3/ex_iceberg_port

elixir iceberg scala spark

Last synced: about 1 month ago
JSON representation

Elixir bindings for Apache Iceberg via Apache Spark

Host: GitHub
URL: https://github.com/abc3/ex_iceberg_port
Owner: abc3
Created: 2025-01-14T16:20:28.000Z (9 months ago)
Default Branch: main
Last Pushed: 2025-05-24T09:22:16.000Z (5 months ago)
Last Synced: 2025-05-24T10:27:15.267Z (5 months ago)
Topics: elixir, iceberg, scala, spark
Language: Elixir
Homepage:
Size: 24.4 KB
Stars: 1
Watchers: 1
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

          # ExIcebergPort

Apache Iceberg port for Elixir applications, providing SQL interface through Apache Spark

## Requirements

- Java 11 (OpenJDK 11 recommended)

- Apache Spark 3.4.1

- Scala 2.13.12

- SBT (Scala Build Tool)

Before running the application, ensure you have Java 11 set up correctly:

```bash

export JAVA_HOME=/path/to/your/java11

export PATH="$JAVA_HOME/bin:$PATH"

```

You can verify your Java version with:

```bash

java -version

```

## Building

Before using the library, you need to build the JVM component:

```bash

make jvm

```

## Development

To start the application in development mode with IEx shell:

```bash

make dev

```

## Usage

### Start the Port

To start the port, use the following command:

```elixir

ExIcebergPort.start_link(

  warehouse_path: "LOCAL_PATH",

  catalog_name: "local",

)

```

### Create table

To create a table, use the following command:

```elixir

ExIcebergPort.query("CREATE TABLE IF NOT EXISTS local.db.my_table (id INT, name STRING, age INT) USING iceberg")

```

### Insert data

To insert data, use the following command:

```elixir

ExIcebergPort.query("insert into local.db.my_table values (1, 'John', 30), (2, 'Jane', 25), (3, 'Bob', 35)")

```

### Select rows

To select rows, use the following command:

```elixir

ExIcebergPort.query("SELECT * FROM local.db.my_table")

```

### Table Maintenance Operations

#### List Snapshots

View all snapshots (versions) of a table:

```elixir

ExIcebergPort.query("SELECT * FROM local.db.my_table.snapshots")

```

#### Compact Table Files

To optimize table performance by rewriting and compacting the table data:

```elixir

ExIcebergPort.query("""

  INSERT OVERWRITE TABLE local.db.my_table

  SELECT * FROM local.db.my_table

""")

```

This operation will rewrite the table data, which helps to:

- Compact small files into larger ones

- Remove deleted records

- Optimize the table's physical layout

#### Expire Old Snapshots

Remove old snapshots to free up storage space. This operation is safe as it only removes snapshots that are no longer needed for time travel queries:

```elixir

ExIcebergPort.query("""

  CALL catalog_name.system.expire_snapshots(

    table => 'local.db.my_table',

    older_than => TIMESTAMP '2025-05-24 00:00:00'

  )

""")

```

#### Remove Orphan Files

After expiring snapshots, you can physically delete unreferenced files to reclaim storage space. This operation should be run after `expire_snapshots`:

```elixir

ExIcebergPort.query("""

  CALL catalog_name.system.remove_orphan_files(

    table => 'local.db.my_table',

    older_than => TIMESTAMP '2025-05-24 00:00:00'

  )

""")

```

Note: Always ensure you have a backup before running maintenance operations, and verify the `older_than` timestamp carefully to avoid removing data that might still be needed.

## Catalog Support

| Catalog Type   | Status | Description                                    |

| -------------- | ------ | ---------------------------------------------- |

| Local Catalog  | ✅     | Hadoop-based local filesystem catalog          |

| AWS S3 Catalog | 🔄     | Store tables in S3 buckets (Coming soon)       |

| REST Catalog   | 🔄     | Use Iceberg REST catalog service (Coming soon) |

### DataFrame Operations

Create and write a DataFrame to an Iceberg table:

```elixir

iex> ExIcebergPort.dummy_df

{:ok,

 %ExIcebergPort.Result{

   columns: ["id", "name", "age"],

   rows: [],

   num_rows: 2,

   exec_time_ms: 207

 }}

```

### SQL Queries

Query data from Iceberg tables using SQL:

```elixir

iex> ExIcebergPort.query("select * from local.db.my_table")

{:ok,

 %ExIcebergPort.Result{

   sql: "select * from local.db.my_table",

   columns: ["id", "name", "age"],

   rows: [[1, "John_529", 18], [2, "Jane_595", 81]],

   num_rows: 2,

   exec_time_ms: 84

 }}

```

The result includes:

- `columns`: List of column names

- `rows`: List of data rows

- `num_rows`: Number of rows returned

- `exec_time_ms`: Query execution time in milliseconds

- `sql`: The SQL query that was executed (for SQL queries only)

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/abc3/ex_iceberg_port

Awesome Lists containing this project

README