https://github.com/polardb/duckdb-paimon

DuckDB extension for accessing Apache Paimon. 🦆

datalake duckdb paimon

Last synced: 2 months ago

Host: GitHub
URL: https://github.com/polardb/duckdb-paimon
Owner: polardb
License: apache-2.0
Created: 2026-03-02T09:33:29.000Z (4 months ago)
Default Branch: main
Last Pushed: 2026-04-15T10:03:03.000Z (2 months ago)
Last Synced: 2026-04-15T12:07:08.199Z (2 months ago)
Topics: datalake, duckdb, paimon
Language: C++
Homepage:
Size: 252 KB
Stars: 24
Watchers: 0
Forks: 1
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
- Notice: NOTICE
- Agents: AGENTS.md

# DuckDB Paimon Extension 🦆

This extension enables [DuckDB](https://duckdb.org/) to read and query [Apache Paimon](https://paimon.apache.org/) format data directly — no ETL pipelines, no Flink/Spark clusters required. Just open a DuckDB shell and run SQL against your Paimon tables.

Like other DuckDB extensions, duckdb-paimon brings DuckDB's powerful local analytics to the Paimon data lake ecosystem.

## About Apache Paimon

[Apache Paimon](https://paimon.apache.org/) is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations. It innovatively combines lake format and LSM structure, bringing realtime streaming updates into the lake architecture.

## Implementation

This extension is built on top of [paimon-cpp](https://github.com/alibaba/paimon-cpp), an open-source C++ library that provides native access to Paimon format data. It is the first library that brings native Paimon read/write capabilities to the C++ ecosystem.

### Technical Highlights

- **Zero JVM dependency** — No Java runtime required. Pure C++ implementation means minimal memory footprint and instant startup.
- **Apache Arrow data exchange** — Data flows between paimon-cpp and DuckDB via Apache Arrow, the industry standard for columnar in-memory data, enabling zero-copy transfers with no serialization overhead.
- **Parallel scan architecture** — Paimon tables are split into independent Splits, and DuckDB's multi-threaded execution engine reads them in parallel to fully utilize multi-core CPUs.
- **Secure credential management** — OSS credentials are managed through DuckDB's native Secret Manager with scope isolation and automatic key redaction.
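
To make the parallel scan idea concrete, here is a toy Python model. It is not the code that paimon-cpp or DuckDB runs: `Split`, `read_split`, and `parallel_scan` are hypothetical names, the rows live in memory rather than in data files, and a thread pool stands in for DuckDB's execution threads. The point is only that splits share no state, so they can be read concurrently and stitched back together.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    """Hypothetical stand-in for one independently readable chunk of a table."""
    split_id: int
    rows: tuple  # held in memory here; a real split would reference data files


def read_split(split):
    """Read one split in isolation; it touches no shared state."""
    return list(split.rows)


def parallel_scan(splits, max_workers=4):
    """Read all splits on a thread pool and concatenate them in split order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order even if splits finish out of order
        per_split = list(pool.map(read_split, splits))
    return [row for rows in per_split for row in rows]


splits = [
    Split(0, (("Alice", 1), ("Bob", 1))),
    Split(1, (("David", 2), ("Eve", 2))),
    Split(2, (("Grace", 3),)),
]
rows = parallel_scan(splits)
print(len(rows))
```

Because each split is self-contained, adding workers changes only how fast the scan finishes, not what it returns.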

## Features

- Read Paimon table data (local and remote OSS)
- Projection pushdown optimization
- Multiple file format support (Parquet data files, ORC manifest files)
- Catalog ATTACH support
- DuckDB Secret-based OSS credential management
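
As a rough illustration of what projection pushdown buys, the Python sketch below models a columnar table in which each column is read separately. `ColumnarTable` and its `columns_read` log are hypothetical and exist only for this example; the extension's actual read path goes through paimon-cpp and is not shown here.

```python
class ColumnarTable:
    """Toy columnar table: every column is stored and read on its own."""

    def __init__(self, columns):
        self._columns = columns   # column name -> list of values
        self.columns_read = []    # log of the columns actually touched

    def scan(self, projection=None):
        """Return rows holding only the projected columns.

        projection=None reads every column, as SELECT * would.
        """
        names = list(self._columns) if projection is None else list(projection)
        data = []
        for name in names:
            self.columns_read.append(name)
            data.append(self._columns[name])
        return list(zip(*data))


table = ColumnarTable({
    "f0": ["Alice", "Bob", "Cathy"],
    "f1": [1, 1, 1],
    "f2": [0, 1, 2],
    "f3": [11.0, 12.1, 13.2],
})
projected = table.scan(["f0", "f3"])  # comparable to SELECT f0, f3
print(table.columns_read)
```

Only `f0` and `f3` are touched; `f1` and `f2` are never read, which is the saving that pushing the projection down into the scan is meant to deliver.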

## Use Cases

### Lightweight Ad-hoc Queries on Realtime Lakehouses

Data is written into Paimon by Flink in real time. Analysts can query it directly on OSS using DuckDB + duckdb-paimon — **no compute cluster needed**, reducing query latency from minutes to seconds.

### Data Validation & Quality Checks

Use DuckDB in CI/CD pipelines to run data quality assertions on Paimon tables, verifying that Flink job outputs meet expectations. Lightweight, fast, and dependency-free.
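
A CI check along these lines could be sketched in Python with only the standard library. Everything here is illustrative: `run_query` and `check_quality` are hypothetical helpers, the rules (non-NULL `f0`, positive `f3`, unique `(f1, f2)`) are made-up examples based on the sample table shown later in this README, and `run_query` assumes the DuckDB shell's `-json` and `-c` options and a non-empty result.

```python
import json
import subprocess

# Shell built by this repository (see "Running the Extension").
DUCKDB_BIN = "./build/release/duckdb"


def run_query(sql, duckdb_bin=DUCKDB_BIN):
    """Run SQL in the DuckDB shell and return rows as a list of dicts.

    Assumes -json (JSON output) and -c (run one command, then exit).
    """
    result = subprocess.run(
        [duckdb_bin, "-json", "-c", sql],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout) if result.stdout.strip() else []


def check_quality(rows):
    """Return a list of violations; an empty list means the table passed."""
    problems = []
    seen = set()
    for row in rows:
        if row["f0"] is None:
            problems.append("f0 is NULL")
        if row["f3"] is None or row["f3"] <= 0:
            problems.append(f"f3 not positive for {row['f0']}")
        key = (row["f1"], row["f2"])
        if key in seen:
            problems.append(f"duplicate (f1, f2) = {key}")
        seen.add(key)
    return problems


# In CI, with the shell built, the rows would come from the table itself:
# rows = run_query("SELECT * FROM paimon_scan('./data/testdb.db/testtbl')")
sample = [
    {"f0": "Alice", "f1": 1, "f2": 0, "f3": 11.0},
    {"f0": "Bob", "f1": 1, "f2": 1, "f3": 12.1},
]
violations = check_quality(sample)
print(violations)
```

A pipeline step would fail the build when `violations` is non-empty, for example by raising an exception or exiting with a non-zero status.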

### Data Exploration & Debugging

Data engineers developing Flink jobs can instantly inspect the current state of Paimon tables using DuckDB Shell, quickly locating data issues — far more efficient than launching a Flink SQL Client.

### Cross-format Federated Queries

DuckDB natively supports Parquet, CSV, JSON, Iceberg, and more. Combined with duckdb-paimon, you can JOIN Paimon tables with other data sources without any data movement:

```sql
-- Join a Paimon orders table with a local CSV dimension table
SELECT o.order_id, o.amount, c.customer_name
FROM paimon_scan('oss://...', 'db', 'orders') o
JOIN read_csv('customers.csv') c ON o.customer_id = c.id;
```

## Getting Started

Clone the repository:

```shell
git clone --recurse-submodules https://github.com/polardb/duckdb-paimon.git
cd duckdb-paimon
```

The `--recurse-submodules` flag also pulls the DuckDB and paimon-cpp submodules, both of which are required to build the extension.

### Building

```shell
GEN=ninja make
```

### Running the Extension

To run the extension code, simply start the shell with `./build/release/duckdb`. This shell will have the extension pre-loaded.

Now we can use the features from the extension directly in DuckDB:

#### Query Local Paimon Tables

```sql
SELECT * FROM paimon_scan('./data/testdb.db/testtbl');
┌─────────┬───────┬───────┬────────┐
│   f0    │  f1   │  f2   │   f3   │
│ varchar │ int32 │ int32 │ double │
├─────────┼───────┼───────┼────────┤
│ Alice   │     1 │     0 │   11.0 │
│ Bob     │     1 │     1 │   12.1 │
│ Cathy   │     1 │     2 │   13.2 │
│ David   │     2 │     0 │   21.0 │
│ Eve     │     2 │     1 │   22.1 │
│ Frank   │     2 │     2 │   23.2 │
│ Grace   │     3 │     0 │   31.0 │
│ Henry   │     3 │     1 │   32.1 │
│ Iris    │     3 │     2 │   33.2 │
└─────────┴───────┴───────┴────────┘
```

#### Query Remote OSS Paimon Tables

```sql
-- Configure OSS credentials
CREATE SECRET my_oss (
    TYPE paimon,
    key_id 'your-access-key-id',
    secret 'your-access-key-secret',
    endpoint 'oss-cn-hangzhou.aliyuncs.com'
);

-- Query Paimon tables on OSS
SELECT * FROM paimon_scan('oss://your-bucket/warehouse', 'your_db', 'your_table');
```
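
The two examples above call `paimon_scan` in two shapes: a single table path, and a warehouse path plus database and table names. The local path `./data/testdb.db/testtbl` matches the Paimon filesystem-catalog layout `<warehouse>/<database>.db/<table>`. The Python helper below shows that mapping; `table_path` is a hypothetical name, and whether the extension treats the two call shapes as interchangeable is an assumption, not something this README states.

```python
def table_path(warehouse, database, table):
    """Build <warehouse>/<database>.db/<table> (Paimon filesystem-catalog layout)."""
    return f"{warehouse.rstrip('/')}/{database}.db/{table}"


local = table_path("./data", "testdb", "testtbl")
print(local)
```

For the local example this reproduces the path used in the single-argument call.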

#### Attach as Catalog

```sql
ATTACH 'oss://my-bucket/warehouse' AS paimon_lake (TYPE paimon);

SHOW ALL TABLES;
DESCRIBE paimon_lake.sales_db.orders;
```

### Running the Tests

```shell
make test
```

## Related Projects

- **[Apache Paimon](https://paimon.apache.org/)** — Realtime lakehouse format
- **[paimon-cpp](https://github.com/alibaba/paimon-cpp)** — Native C++ library for Paimon (underlying dependency)
- **[DuckDB](https://duckdb.org/)** — Embeddable OLAP database
- **[duckdb-iceberg](https://github.com/duckdb/duckdb_iceberg)** — DuckDB's official Iceberg extension

## Join the Community

We welcome contributions and discussions! If you have questions or ideas, or want to connect with other users and developers, join our community by clicking [here](https://qr.dingtalk.com/action/joingroup?code=v1,k1,xL7wNtAi3J83o8gW/R+2vl0twZAzSwohxbXwCwQG6v8=&_dt_no_comment=1&origin=11) or scanning the QR code below:

DingTalk Group QR Code
