https://github.com/delta-io/delta-sharing

An open protocol for secure data sharing

big-data data-sharing delta-lake pandas spark

Host: GitHub
URL: https://github.com/delta-io/delta-sharing
Owner: delta-io
License: apache-2.0
Created: 2021-04-08T22:58:25.000Z (about 4 years ago)
Default Branch: main
Last Pushed: 2025-04-25T23:15:28.000Z (about 2 months ago)
Last Synced: 2025-04-28T14:09:35.284Z (about 2 months ago)
Topics: big-data, data-sharing, delta-lake, pandas, spark
Language: Scala
Homepage: https://delta.io/sharing
Size: 2.75 MB
Stars: 830
Watchers: 28
Forks: 194
Open Issues: 97
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE.txt

Awesome Lists containing this project

awesome-starred - delta-io/delta-sharing - An open protocol for secure data sharing (others)
awesome-lakehouse - Delta Sharing - Open protocol for secure data sharing. (📂 Additional Sections / 3. Open-source Projects)
awesome-iceberg - Delta Sharing - Open protocol for secure data sharing. (📂 Additional Sections / 3. Open-source Projects)

README

# Delta Sharing: An Open Protocol for Secure Data Sharing

[![Build and Test](https://github.com/delta-io/delta-sharing/actions/workflows/build-and-test.yml/badge.svg)](https://github.com/delta-io/delta-sharing/actions/workflows/build-and-test.yml)
[![License](https://img.shields.io/badge/license-Apache%202-brightgreen.svg)](https://github.com/delta-io/delta-sharing/blob/main/LICENSE.txt)
[![PyPI](https://img.shields.io/pypi/v/delta-sharing.svg)](https://pypi.org/project/delta-sharing/)

[Delta Sharing](https://delta.io/sharing) is an open protocol for secure real-time exchange of large datasets, which enables organizations to share data in real time regardless of which computing platforms they use. It is a simple [REST protocol](PROTOCOL.md) that securely shares access to part of a cloud dataset and leverages modern cloud storage systems, such as S3, ADLS, or GCS, to reliably transfer data.

With Delta Sharing, a user accessing shared data can directly connect to it through pandas, Tableau, Apache Spark, Rust, or other systems that support the open protocol, without having to deploy a specific compute platform first. Data providers can share a dataset once to reach a broad range of consumers, while consumers can begin using the data in minutes.

This repo includes the following components:

- Delta Sharing [protocol specification](PROTOCOL.md).
- Python Connector: A Python library that implements the Delta Sharing Protocol to read shared tables as [pandas](https://pandas.pydata.org/) or [Apache Spark](http://spark.apache.org/) DataFrames.
- [Apache Spark](http://spark.apache.org/) Connector: An Apache Spark connector that implements the Delta Sharing Protocol to read shared tables from a Delta Sharing Server. The tables can then be accessed in SQL, Python, Java, Scala, or R.
- Delta Sharing Server: A reference implementation server for the Delta Sharing Protocol for development purposes. Users can deploy this server to share existing tables in Delta Lake and Apache Parquet format on modern cloud storage systems.

# Python Connector

The Delta Sharing Python Connector is a Python library that implements the [Delta Sharing Protocol](PROTOCOL.md) to read tables from a Delta Sharing Server. You can load shared tables as a [pandas](https://pandas.pydata.org/) DataFrame, or as an [Apache Spark](http://spark.apache.org/) DataFrame if running in PySpark with the Apache Spark Connector installed.

## System Requirements

- Python 3.8+ for delta-sharing version 1.1+; Python 3.6+ for older versions.
- On Linux, glibc version >= 2.31 (needed for automatic installation of the delta-kernel-rust-sharing-wrapper package; see the next section for details).

## Installation

```
pip3 install delta-sharing
```

If you are using [Databricks Runtime](https://docs.databricks.com/runtime/dbr.html), you can follow [Databricks Libraries doc](https://docs.databricks.com/libraries/index.html) to install the library on your clusters.

If this fails because of an issue downloading delta-kernel-rust-sharing-wrapper, try the following:
- Check python3 version >= 3.8
- Upgrade your pip3 to the latest version
- Check the linux glibc version >= 2.31
- [Install Rust](https://www.rust-lang.org/tools/install)

If you cannot upgrade glibc or PyPI does not have a pre-built wheel for delta-kernel-rust-sharing-wrapper for your environment, pip will have to build the package from source, which requires Rust to be installed.
See https://pypi.org/project/delta-kernel-rust-sharing-wrapper/0.2.1/#files for environments that have a pre-built wheel.
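
The first and third checks in the troubleshooting list can be scripted. The following is a minimal sketch using only the Python standard library; the helper names `parse_version` and `check_environment` are hypothetical and not part of delta-sharing, and the thresholds are simply the ones stated in this README.

```python
import platform
import sys
from itertools import takewhile

MIN_PYTHON = (3, 8)   # from the System Requirements section
MIN_GLIBC = (2, 31)   # from the System Requirements section


def parse_version(text):
    """Turn a dotted version string such as '2.31' into a tuple of ints."""
    parts = []
    for piece in text.split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check_environment():
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append("Python %d.%d is older than 3.8" % sys.version_info[:2])
    if sys.platform.startswith("linux"):
        # libc_ver() returns ('', '') when it cannot identify the C library.
        lib, version = platform.libc_ver()
        if lib == "glibc" and version and parse_version(version) < MIN_GLIBC:
            problems.append("glibc %s is older than 2.31" % version)
    return problems
```

Calling `check_environment()` before `pip3 install delta-sharing` gives a quick hint as to whether pip is likely to need a source build. It does not check whether Rust is installed.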

Alternatively, you can install an older version of the delta-sharing package that does not bundle delta-kernel-rust-sharing-wrapper:
```
pip3 install delta-sharing==1.0.5
```

You can also install the delta-kernel-rust-sharing-wrapper package manually:
```
cd [delta-sharing-root]/python/delta-kernel-rust-sharing-wrapper
python3 -m venv .venv
source .venv/bin/activate
pip3 install maturin
maturin develop
```

## Accessing Shared Data

The connector accesses shared tables based on [profile files](PROTOCOL.md#profile-file-format), which are JSON files containing a user's credentials to access a Delta Sharing Server. We have several ways to get started:

- Download the profile file to access an open, example Delta Sharing Server that we're hosting [here](https://databricks-datasets-oregon.s3-us-west-2.amazonaws.com/delta-sharing/share/open-datasets.share). You can try the connectors with this sample data.
- Start your own [Delta Sharing Server](#delta-sharing-reference-server) and create your own profile file following [profile file format](PROTOCOL.md#profile-file-format) to connect to this server.
- Download a profile file from your data provider.
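
If you write your own profile file, a small script can create and sanity-check it. This is a minimal sketch that assumes the bearer-token layout (`shareCredentialsVersion`, `endpoint`, `bearerToken`, optional `expirationTime`); treat [PROTOCOL.md](PROTOCOL.md#profile-file-format) as the authority on field names. The helpers `write_profile` and `load_profile` are hypothetical and not part of the connector.

```python
import json

# Fields assumed to be required for a bearer-token profile (see PROTOCOL.md).
REQUIRED_FIELDS = ("shareCredentialsVersion", "endpoint", "bearerToken")


def write_profile(path, endpoint, bearer_token, expiration_time=None):
    """Write a profile file to `path` and return the dict that was written."""
    profile = {
        "shareCredentialsVersion": 1,
        "endpoint": endpoint,
        "bearerToken": bearer_token,
    }
    if expiration_time is not None:
        profile["expirationTime"] = expiration_time
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(profile, handle, indent=2)
    return profile


def load_profile(path):
    """Read a profile file and fail early if an expected field is missing."""
    with open(path, encoding="utf-8") as handle:
        profile = json.load(handle)
    missing = [name for name in REQUIRED_FIELDS if name not in profile]
    if missing:
        raise ValueError("profile file is missing fields: " + ", ".join(missing))
    return profile
```

For example, `write_profile("my.share", "https://sharing.example.com/delta-sharing/", "<token>")` writes a file for a placeholder endpoint and token. Because the file holds a credential, keep it out of version control.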

## Quick Start

After you save the profile file, you can use it in the connector to access shared tables.

```python
import delta_sharing

# Point to the profile file. It can be a file on the local file system or a file on remote storage.
profile_file = "<profile-file-path>"

# Create a SharingClient.
client = delta_sharing.SharingClient(profile_file)

# List all shared tables.
client.list_all_tables()

# Create a URL to access a shared table.
# A table path is the profile file path followed by `#` and the fully qualified name of a table
# (`<share-name>.<schema-name>.<table-name>`).
table_url = profile_file + "#<share-name>.<schema-name>.<table-name>"

# Fetch 10 rows from a table and convert it to a Pandas DataFrame. This can be used to read sample data
# from a table that cannot fit in the memory.
delta_sharing.load_as_pandas(table_url, limit=10)

# Load a table as a Pandas DataFrame. This can be used to process tables that can fit in the memory.
delta_sharing.load_as_pandas(table_url)

# Load a table as a Pandas DataFrame explicitly using Delta Format
delta_sharing.load_as_pandas(table_url, use_delta_format=True)

# Load a table as a Pandas DataFrame explicitly using jsonPredicateHints
hintOnHireDate = '''{
  "op": "equal",
  "children": [
    {"op": "column", "name": "hireDate", "valueType": "date"},
    {"op": "literal", "value": "2021-04-29", "valueType": "date"}
  ]
}'''
delta_sharing.load_as_pandas(table_url, jsonPredicateHints=hintOnHireDate)

# If the code is running with PySpark, you can use `load_as_spark` to load the table as a Spark DataFrame.
delta_sharing.load_as_spark(table_url)
```
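
Hand-written predicate JSON is easy to get wrong, so it can help to build the hint from small functions. This is a minimal sketch that covers only the `equal`, `column`, and `literal` operators used in the quick start; the helper names are hypothetical and not part of delta-sharing.

```python
import json


def column(name, value_type):
    """Reference a table column by name."""
    return {"op": "column", "name": name, "valueType": value_type}


def literal(value, value_type):
    """Wrap a constant; values are passed as strings, as in the quick start."""
    return {"op": "literal", "value": value, "valueType": value_type}


def equal(name, value, value_type):
    """Build an equality predicate between a column and a literal."""
    return {"op": "equal", "children": [column(name, value_type), literal(value, value_type)]}


def to_hint(predicate):
    """Serialize a predicate tree to the JSON string passed as jsonPredicateHints."""
    return json.dumps(predicate)


# Same predicate as the hand-written hint in the quick start.
hint_on_hire_date = to_hint(equal("hireDate", "2021-04-29", "date"))
```

The resulting string can be passed as `jsonPredicateHints=hint_on_hire_date` to `delta_sharing.load_as_pandas`.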

If the table supports history sharing (`tableConfig.cdfEnabled=true` in the OSS Delta Sharing Server), the connector can query table changes.
```python
# Load table changes from version 0 to version 5, as a Pandas DataFrame.
delta_sharing.load_table_changes_as_pandas(table_url, starting_version=0, ending_version=5)

# Load table changes from version 0 to version 5 as a Pandas DataFrame, explicitly using Delta Format.
delta_sharing.load_table_changes_as_pandas(table_url, starting_version=0, ending_version=5, use_delta_format=True)

# If the code is running with PySpark, you can load table changes as Spark DataFrame.
delta_sharing.load_table_changes_as_spark(table_url, starting_version=0, ending_version=5)
```

You can try this by running our [examples](examples/README.md) with the open, example Delta Sharing Server.

### Details on Profile Paths

- The profile file path for `SharingClient` and `load_as_pandas` can be any URL supported by [FSSPEC](https://filesystem-spec.readthedocs.io/en/latest/index.html) (such as `s3a://my_bucket/my/profile/file`). If you are using [Databricks File System](https://docs.databricks.com/data/databricks-file-system.html), you can also [preface the path with `/dbfs/`](https://docs.databricks.com/data/databricks-file-system.html#dbfs-and-local-driver-node-paths) to access the profile file as if it were a local file.
- The profile file path for `load_as_spark` can be any URL supported by Hadoop FileSystem (such as `s3a://my_bucket/my/profile/file`).
- A table path is the profile file path followed by `#` and the fully qualified name of a table (`<share-name>.<schema-name>.<table-name>`).
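
The table path rule can be captured in two small helpers. This is a minimal sketch, assuming the fully qualified name is three dot-separated parts (share, schema, table) and that none of them contains a dot or `#`; `TableRef`, `build_table_url`, and `parse_table_url` are hypothetical names, not connector APIs.

```python
from typing import NamedTuple


class TableRef(NamedTuple):
    profile: str
    share: str
    schema: str
    table: str


def build_table_url(profile, share, schema, table):
    """Join a profile path and a fully qualified table name with '#'."""
    return f"{profile}#{share}.{schema}.{table}"


def parse_table_url(url):
    """Split a table URL back into its profile path and name parts."""
    # Split on the last '#' so the profile path itself may contain '#'.
    profile, sep, qualified_name = url.rpartition("#")
    if not sep or not profile:
        raise ValueError("expected '<profile>#<share>.<schema>.<table>'")
    parts = qualified_name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError("expected three non-empty, dot-separated name parts")
    return TableRef(profile, *parts)
```

Validating the URL up front turns a typo in the table name into an immediate `ValueError` instead of a failed request to the sharing server.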

# Apache Spark Connector

The Apache Spark Connector implements the [Delta Sharing Protocol](PROTOCOL.md) to read shared tables from a Delta Sharing Server. It can be used in SQL, Python, Java, Scala and R.

## System Requirements

- Java 8+
- Scala 2.12.x
- Apache Spark 3+ or [Databricks Runtime](https://docs.databricks.com/runtime/dbr.html) 9+

## Accessing Shared Data

The connector loads user credentials from profile files. Please see [Accessing Shared Data](#accessing-shared-data) to download a profile file for our example server or for your own data sharing server.

## Configuring Apache Spark

You can set up Apache Spark to load the Delta Sharing connector in the following two ways:

- Run interactively: Start the Spark shell (Scala or Python) with the Delta Sharing connector and run the code snippets interactively in the shell.
- Run as a project: Set up a Maven or SBT project (Scala or Java) with the Delta Sharing connector, copy the code snippets into a source file, and run the project.

If you are using [Databricks Runtime](https://docs.databricks.com/runtime/dbr.html), you can skip this section and follow [Databricks Libraries doc](https://docs.databricks.com/libraries/index.html) to install the connector on your clusters.

### Set up an interactive shell

To use the Delta Sharing connector interactively within Spark's Scala or Python shell, launch the shell as follows.

#### PySpark shell

```
pyspark --packages io.delta:delta-sharing-spark_2.12:3.1.0
```

#### Scala Shell

```
bin/spark-shell --packages io.delta:delta-sharing-spark_2.12:3.1.0
```
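
When a script creates its own session instead of using a shell, the same package coordinate can be supplied through the standard `spark.jars.packages` setting. This is a minimal sketch, assuming PySpark is installed; `connector_coordinate` and `build_session` are hypothetical helper names, and the default versions are the ones used in the shell commands above.

```python
def connector_coordinate(scala_version="2.12", connector_version="3.1.0"):
    """Build the Maven coordinate for the Delta Sharing Spark connector."""
    return f"io.delta:delta-sharing-spark_{scala_version}:{connector_version}"


def build_session(app_name="delta-sharing-demo"):
    """Create (or reuse) a SparkSession that downloads the connector at startup."""
    # Imported lazily so the coordinate helper works without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder.appName(app_name)
        .config("spark.jars.packages", connector_coordinate())
        .getOrCreate()
    )
```

The setting only takes effect when the session is first created, so call `build_session()` before any other code starts a Spark session in the same process.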

### Set up a standalone project

If you want to build a Java/Scala project using Delta Sharing connector from Maven Central Repository, you can use the following Maven coordinates.

#### Maven

You can include the Delta Sharing connector in your Maven project by adding it as a dependency in your POM file. The Delta Sharing connector is compiled with Scala 2.12.

```xml
<dependency>
  <groupId>io.delta</groupId>
  <artifactId>delta-sharing-spark_2.12</artifactId>
  <version>3.1.0</version>
</dependency>
```

#### SBT

You can include the Delta Sharing connector in your SBT project by adding the following line to your `build.sbt` file:

```scala
libraryDependencies += "io.delta" %% "delta-sharing-spark" % "3.1.0"
```

## Quick Start

After you save the profile file and launch Spark with the connector library, you can access shared tables using any language.

### SQL
```sql
-- A table path is the profile file path followed by `#` and the fully qualified name
-- of a table (`<share-name>.<schema-name>.<table-name>`).
CREATE TABLE mytable USING deltaSharing LOCATION '<profile-file-path>#<share-name>.<schema-name>.<table-name>';
SELECT * FROM mytable;
```

### Python

```python
# A table path is the profile file path followed by `#` and the fully qualified name
# of a table (`<share-name>.<schema-name>.<table-name>`).
table_path = "<profile-file-path>#<share-name>.<schema-name>.<table-name>"
df = spark.read.format("deltaSharing").load(table_path)
```

### Scala

```scala
// A table path is the profile file path followed by `#` and the fully qualified name
// of a table (`<share-name>.<schema-name>.<table-name>`).
val tablePath = "<profile-file-path>#<share-name>.<schema-name>.<table-name>"
val df = spark.read.format("deltaSharing").load(tablePath)
```

### Java

```java
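// A table path is the profile file path followed by `#` and the fully qualified name
// of a table (`<share-name>.<schema-name>.<table-name>`).
String tablePath = "<profile-file-path>#<share-name>.<schema-name>.<table-name>";
// In Java, `read` is a method call; Dataset and Row come from org.apache.spark.sql.
Dataset<Row> df = spark.read().format("deltaSharing").load(tablePath);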
```

### R
```r
# A table path is the profile file path followed by `#` and the fully qualified name
# of a table (`<share-name>.<schema-name>.<table-name>`).
table_path <- "<profile-file-path>#<share-name>.<schema-name>.<table-name>"
df <- read.df(table_path, "deltaSharing")
```

You can try this by running our [examples](examples/README.md) with the open, example Delta Sharing Server.

### CDF
Starting from release 0.5.0, querying [Change Data Feed](https://docs.databricks.com/delta/delta-change-data-feed.html) (CDF) is supported with Delta Sharing.
Once the provider turns on CDF on the original Delta table and shares it through Delta Sharing, the recipient can query
the CDF of a Delta Sharing table in the same way as the CDF of a Delta table.
```scala
val tablePath = "<profile-file-path>#<share-name>.<schema-name>.<table-name>"
val df = spark.read.format("deltaSharing")
  .option("readChangeFeed", "true")
  .option("startingVersion", "3")
  .load(tablePath)
```
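
The same read can be written in PySpark. This is a minimal sketch, assuming a running `spark` session with the connector loaded. `change_feed_options` and `read_changes` are hypothetical helper names, and the `endingVersion` option is an assumption here: it mirrors the `ending_version` argument of the Python connector and the Delta Lake CDF option of the same name, but is not shown in the Scala example above.

```python
def change_feed_options(starting_version, ending_version=None):
    """Build the reader options for a Change Data Feed query."""
    options = {"readChangeFeed": "true", "startingVersion": str(starting_version)}
    if ending_version is not None:
        if ending_version < starting_version:
            raise ValueError("ending_version must be >= starting_version")
        # Assumption: endingVersion is accepted like in Delta Lake CDF reads.
        options["endingVersion"] = str(ending_version)
    return options


def read_changes(spark, table_path, starting_version, ending_version=None):
    """Return a DataFrame of row-level changes for a shared table."""
    return (
        spark.read.format("deltaSharing")
        .options(**change_feed_options(starting_version, ending_version))
        .load(table_path)
    )
```

Keeping the option dictionary separate from the read makes the version-range check testable without a Spark session.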

### Streaming
Starting from release 0.6.0, a Delta Sharing table can be used as a data source for [Spark Structured Streaming](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html).
Once the provider shares a table with history, the recipient can perform a streaming query on the table.

Note: `Trigger.AvailableNow` is not supported in Delta Sharing streaming because it was introduced in Spark 3.3.0, while Delta Sharing still uses Spark 3.1.1.
```scala
val tablePath = "<profile-file-path>#<share-name>.<schema-name>.<table-name>"
val df = spark.readStream.format("deltaSharing")
  .option("startingVersion", "1")
  .option("skipChangeCommits", "true")
  .load(tablePath)
```
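
The Scala snippet only defines the source; a streaming query also needs a sink and a checkpoint location before it runs. This is a minimal PySpark sketch, assuming a running `spark` session with the connector loaded. `streaming_options` and `start_console_stream` are hypothetical helper names, and the console sink is chosen only for illustration.

```python
def streaming_options(starting_version=1, skip_change_commits=True):
    """Build the source options used in the Scala example above."""
    return {
        "startingVersion": str(starting_version),
        "skipChangeCommits": "true" if skip_change_commits else "false",
    }


def start_console_stream(spark, table_path, checkpoint_dir, starting_version=1):
    """Start a streaming query that prints new rows of a shared table."""
    source = (
        spark.readStream.format("deltaSharing")
        .options(**streaming_options(starting_version))
        .load(table_path)
    )
    # The checkpoint directory lets a restarted query resume where it stopped.
    return (
        source.writeStream.format("console")
        .option("checkpointLocation", checkpoint_dir)
        .start()
    )
```

`start_console_stream` returns a `StreamingQuery`; call `awaitTermination()` on it to keep a script alive while the stream runs.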

### Table paths

- A profile file path can be any URL supported by Hadoop FileSystem (such as `s3a://my_bucket/my/profile/file`).
- A table path is the profile file path followed by `#` and the fully qualified name of a table (`<share-name>.<schema-name>.<table-name>`).

# The Community

| Connector | Link | Status | Supported Features |
|---|---|---|---|
| Power BI | Databricks owned | Released | QueryTableVersion, QueryTableMetadata, QueryTableLatestSnapshot |
| Clojure | [amperity/delta-sharing-client-clj](https://github.com/amperity/delta-sharing-client-clj) | Released | QueryTableVersion, QueryTableMetadata, QueryTableLatestSnapshot, QueryTableChanges(CDF), Time Travel Queries, Query Changes between Versions, Delta Format Queries, Limit and Predicate Pushdown |

Node.js

[goodwillpunning/nodejs-sharing-client](https://github.com/goodwillpunning/nodejs-sharing-client)