https://github.com/treeverse/lakefs-spark-extensions
Spark SQL extensions for lakeFS
- Host: GitHub
- URL: https://github.com/treeverse/lakefs-spark-extensions
- Owner: treeverse
- Created: 2023-08-20T09:42:09.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2023-10-24T08:02:34.000Z (over 2 years ago)
- Last Synced: 2025-02-27T17:30:51.801Z (over 1 year ago)
- Language: Scala
- Size: 27.3 KB
- Stars: 2
- Watchers: 7
- Forks: 0
- Open Issues: 0
# Spark SQL Extensions for lakeFS
## Usage
To use the Spark extensions, you need to _load_ them and then
_add_ them to Spark.
From Maven Repository: _wait until we upload an initial version; until then,
see [Development](#development)._
Add:
```
--conf spark.sql.extensions=io.lakefs.iceberg.extension.LakeFSSparkSessionExtensions \
--packages io.lakefs:spark-extensions:
```
to your `spark-*` command-line, or add this package to your
`spark.jars.packages` configuration.
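If you set Spark options programmatically rather than on the command line, the same two settings can be collected in one place. The following is a minimal sketch, not part of this project: `lakefs_extension_conf`, `as_cli_args`, and the `version` parameter are hypothetical names, and the version has to be supplied by the caller because this README does not name a release.

```python
# Hypothetical helpers (not part of lakefs-spark-extensions): they collect the
# two Spark settings described above so they can be reused in different places.

EXTENSIONS_CLASS = "io.lakefs.iceberg.extension.LakeFSSparkSessionExtensions"
PACKAGE_COORDINATE = "io.lakefs:spark-extensions"  # group:artifact from this README


def lakefs_extension_conf(version: str) -> dict:
    """Return the Spark configuration entries that load and add the extensions.

    `version` is an input because this README does not state a released version.
    """
    if not version:
        raise ValueError("a package version is required")
    return {
        "spark.sql.extensions": EXTENSIONS_CLASS,
        "spark.jars.packages": f"{PACKAGE_COORDINATE}:{version}",
    }


def as_cli_args(conf: dict) -> list:
    """Render the entries as `--conf key=value` arguments for a spark-* command."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args
```

In PySpark, each entry of the returned dict could be passed to `SparkSession.builder.config(key, value)`; for `spark-submit`, `as_cli_args` produces the equivalent `--conf` arguments.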
### Development
Run `sbt package`, then add the following to your `spark-*` command line:
```
--conf spark.sql.extensions=io.lakefs.iceberg.extension.LakeFSSparkSessionExtensions \
--jars ./target/scala-2.12/lakefs-spark-extensions_2.12-0.1.0-SNAPSHOT.jar
```
## Available extensions
### Data diff
`refs_data_diff` is a Spark SQL table-valued function. The expression
```sql
refs_data_diff(PREFIX, FROM_SCHEMA, TO_SCHEMA, TABLE)
```
yields a relation that compares the "from" table `PREFIX.FROM_SCHEMA.TABLE`
with the "to" table `PREFIX.TO_SCHEMA.TABLE`. Rows that appear in "to" but not
in "from" are _added_ and appear with `lakefs_change='+'`; rows that appear in
"from" but not in "to" are _deleted_ and appear with `lakefs_change='-'`.
For instance,
```sql
SELECT lakefs_change, Player, COUNT(*) FROM refs_data_diff('lakefs', 'main~', 'main', 'db.allstar_games')
GROUP BY lakefs_change, Player;
```
uses lakeFS Iceberg support to compute how many rows were changed for each
player in the last commit. Here `main~` refers to the parent of the latest
commit on `main`.
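To make the `+` and `-` markers concrete, here is a small model of the comparison in plain Python. It is an illustration only and does not use the extension. It assumes rows are compared as whole tuples with set semantics (duplicates ignored), which may differ from how the extension handles duplicate rows. The function name `data_diff` and the sample rows are made up for this sketch.

```python
# Illustration only: models the '+' / '-' semantics with plain Python sets.
# Assumption: rows are compared as whole tuples and duplicates are ignored;
# the actual extension may treat duplicate rows differently.

def data_diff(from_rows, to_rows):
    """Return (lakefs_change, row) pairs: '+' for added, '-' for deleted."""
    from_set, to_set = set(from_rows), set(to_rows)
    added = [("+", row) for row in sorted(to_set - from_set)]
    deleted = [("-", row) for row in sorted(from_set - to_set)]
    return added + deleted


# Made-up sample data: (player, games) rows on the "from" and "to" refs.
from_table = [("Alice", 10), ("Bob", 7)]
to_table = [("Alice", 12), ("Bob", 7), ("Carol", 3)]

for change, row in data_diff(from_table, to_table):
    print(change, row)
```

Under this model, an updated row (Alice here) shows up twice: once as `-` with its old values and once as `+` with its new values. Unchanged rows (Bob) do not appear at all.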
Internally this relation is exactly a `SELECT` expression. For instance,
you can set up a view with it:
```sql
CREATE TEMPORARY VIEW diff_allstar_games_main_last_commit AS
SELECT * FROM refs_data_diff('lakefs', 'main~', 'main', 'db.allstar_games');
```
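When the same comparison is needed for different refs or tables, the query text can be generated instead of written by hand. The sketch below is a hypothetical helper, not part of this project: only the function name `refs_data_diff` and its argument order come from this README. It rejects values containing quotes or backslashes instead of trying to escape them, since escaping rules are not covered here.

```python
# Hypothetical helper: builds the SQL text for a refs_data_diff query.
# Only refs_data_diff and its argument order come from this README.

def sql_literal(value: str) -> str:
    """Wrap a value in single quotes, refusing characters that need escaping."""
    if "'" in value or "\\" in value:
        raise ValueError(f"unsupported character in {value!r}")
    return f"'{value}'"


def refs_data_diff_sql(prefix, from_schema, to_schema, table, columns="*"):
    """Return a SELECT over refs_data_diff(PREFIX, FROM_SCHEMA, TO_SCHEMA, TABLE)."""
    args = ", ".join(
        sql_literal(arg) for arg in (prefix, from_schema, to_schema, table)
    )
    return f"SELECT {columns} FROM refs_data_diff({args})"


query = refs_data_diff_sql("lakefs", "main~", "main", "db.allstar_games")
print(query)
```

In a PySpark session that has the extensions loaded, the resulting string could be passed to `spark.sql(query)`.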