Demos for Nessie. Nessie provides Git-like capabilities for your Data Lake.
https://github.com/projectnessie/nessie-demos

# Nessie Binder Demos

These demos run on Binder and can be launched from:

* [Spark and Iceberg](https://mybinder.org/v2/gh/projectnessie/nessie-demos/main?labpath=notebooks%2Fnessie-iceberg-demo-nba.ipynb)
* [Flink and Iceberg](https://mybinder.org/v2/gh/projectnessie/nessie-demos/main?labpath=notebooks%2Fnessie-iceberg-flink-demo-nba.ipynb)
* [Hive and Iceberg](https://mybinder.org/v2/gh/projectnessie/nessie-demos/main?labpath=notebooks%2Fnessie-iceberg-hive-demo-nba.ipynb)
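
Each link follows the same pattern: the mybinder.org base URL for the repository and branch, plus a `labpath` query parameter holding the URL-encoded notebook path. A minimal sketch of how such a link could be assembled (the `binder_url` helper is hypothetical and not part of this repo):

```python
from urllib.parse import quote

BINDER_BASE = "https://mybinder.org/v2/gh"


def binder_url(repo: str, ref: str, notebook_path: str) -> str:
    """Build a Binder launch link that opens a notebook in JupyterLab."""
    # safe="" makes quote() encode "/" as %2F, matching the links above.
    return f"{BINDER_BASE}/{repo}/{ref}?labpath={quote(notebook_path, safe='')}"


url = binder_url(
    "projectnessie/nessie-demos",
    "main",
    "notebooks/nessie-iceberg-demo-nba.ipynb",
)
print(url)
```

Swapping in another notebook path from the `notebooks` folder should yield the corresponding launch link.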

The demos are automatically rebuilt every time we push to main. They are unit tested with the `testbook` library to ensure
the results stay correct as the underlying libraries evolve.

## Upgrade instructions

Because the Binder setup and the unit tests are configured separately, there is no single place to update all versions.
Some versions have to be updated in multiple places:

### Nessie

The Nessie version is set for Binder in `docker/binder/requirements_base.txt`. The demos currently use Nessie `0.74.x`.

### Iceberg

We currently use Iceberg `1.4.2`. It is specified in both Iceberg notebooks as well as in `docker/utils/__init__.py`.
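
Since the Iceberg version lives in several files, it is easy to miss one during an upgrade. The sketch below shows one way to list files that still mention the old version string. The `find_version_mentions` helper is hypothetical and not part of this repo. It does a plain substring search and assumes the version appears literally in the file text, so unrelated hits such as `11.4.2` would also be reported.

```python
from pathlib import Path


def find_version_mentions(root, version, suffixes=(".py", ".ipynb", ".txt")):
    """Return sorted relative paths under root whose text contains version."""
    root = Path(root)
    hits = []
    for path in root.rglob("*"):
        # Only look at regular files with one of the chosen extensions.
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        if version in text:
            hits.append(path.relative_to(root).as_posix())
    return sorted(hits)
```

Calling `find_version_mentions(".", "1.4.2")` from the repository root before and after an upgrade gives a quick list of files left to review.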

### Spark

The Spark version only has to be updated in `docker/binder/requirements.txt`. Iceberg currently supports Spark 3.2, 3.3, 3.4 and 3.5; the demos use Spark 3.2.

### Flink

The Flink version is set for Binder in `docker/binder/requirements_flink.txt`. We currently use `1.17.1`.

### Hadoop

The Hadoop libraries are used by Flink and are currently specified only in `docker/utils/__init__.py`. We use `2.10.1` with Flink and Hive.

### Hive

The Hive version currently in use is `2.3.9`, which supports Hadoop `2.10.1`. To update it, change the version
in `docker/utils/__init__.py` only.

## Binder

[Binder](https://mybinder.org) is a customizable platform for running Jupyter notebooks and
more (see their website). Binder generates a Dockerfile and image based on the settings in the
source GitHub repository (other sources are possible). Both Ubuntu and Python packages can be
pre-installed into the Docker image generated by Binder.

Binder also lets a user start a notebook simply by clicking a link.

## Development
For development, make sure the following are installed:
- Python 3.10+
- pre-commit

After installing pre-commit, run `pre-commit install` to set up the hooks locally, since this repo
runs several scripts at the pre-commit stage.

To run the notebook unit tests, run the following commands in the `notebooks` folder:
1. `python -m pip install -r requirements_dev.txt`
2. `tox`

Running the unit tests takes time, since all the binary files (Hive, Flink, etc.) have to be downloaded
before the tests run.