https://github.com/swallez/elasticsearch-arrow-experiments

Experiments on exposing Elasticsearch as Apache Arrow

Host: GitHub
URL: https://github.com/swallez/elasticsearch-arrow-experiments
Owner: swallez
License: apache-2.0
Created: 2023-09-26T20:51:10.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2024-01-22T18:56:39.000Z (10 months ago)
Last Synced: 2024-04-24T15:09:22.443Z (7 months ago)
Language: Java
Size: 144 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# elasticsearch-arrow-experiments
Experiments on exposing Elasticsearch as Apache Arrow

You need Elasticsearch with ES|QL to run this experiment. At the time of writing it's only available in 8.11 snapshots:

```
docker run -it --name es-esql -p 9200:9200 docker.elastic.co/elasticsearch/elasticsearch:8.11.0-SNAPSHOT
```

Copy `config-example.ini` to `config.ini` and paste in the password displayed at the end of Elasticsearch's initialization sequence.

The examples need some test data. To seed the Elasticsearch server with random test data, run `./gradlew ingest-data`.

To start the Arrow Flight / ES|QL bridge, run `./gradlew flight-server`.

**Java example:** run `./gradlew flight-client`

**Python examples:** the `python` directory has examples with both Pandas and Polars. They also illustrate two different authentication methods: one using Arrow Flight's native authentication, and one using a middleware that sets the HTTP `Authorization` header. The bridge server understands both.
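As an illustration of the middleware approach, here is a minimal sketch with `pyarrow.flight`. It is not the code from the `python` directory. The `Basic` scheme, the helper names `basic_auth_header` and `connect`, and the `location` argument are assumptions made for this example; check `config.ini` and the bridge server for the actual address and credentials.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict[str, str]:
    """Build an Authorization header (assumption: the bridge accepts HTTP Basic)."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    # gRPC metadata keys are lowercase.
    return {"authorization": f"Basic {token}"}


def connect(location: str, username: str, password: str):
    """Return a Flight client whose calls all carry the Authorization header."""
    import pyarrow.flight as flight  # imported here so the helper above has no dependency

    class AuthHeaderMiddleware(flight.ClientMiddleware):
        def sending_headers(self):
            # Called by Flight before each request is sent.
            return basic_auth_header(username, password)

    class AuthHeaderFactory(flight.ClientMiddlewareFactory):
        def start_call(self, info):
            # One middleware instance per call.
            return AuthHeaderMiddleware()

    return flight.FlightClient(location, middleware=[AuthHeaderFactory()])
```

Calling `connect(...)` with the bridge's gRPC address and the Elasticsearch credentials would give a client on which every call carries the header, which is the general idea behind the middleware variant described above.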

**Rust example:** the `rust` directory has an example with Arrow.

## ES|QL serialization benchmarks

In the Java project, the `benchmarks` package contains benchmarks that compare serializing ES|QL results to JSON, CBOR and Arrow.

**Size benchmark**: CBOR and Arrow produce payloads of comparable size. In Arrow, number columns are more compact, but strings take a bit more space (the size of each string is always stored as a 32-bit integer).
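The string overhead can be estimated with a back-of-the-envelope sketch. This is not the repository's benchmark code: it only counts the per-string length bookkeeping, assuming CBOR's variable-length text-string prefix (RFC 8949) and Arrow's `Utf8` layout with one 32-bit offset per value plus one trailing offset. It ignores the CBOR array header, Arrow's validity bitmap, buffer padding and IPC metadata. The function names are hypothetical.

```python
def cbor_text_header_size(byte_length: int) -> int:
    """Bytes CBOR spends on the length prefix of a text string."""
    if byte_length < 24:
        return 1  # the length fits in the initial byte
    if byte_length < 2**8:
        return 2
    if byte_length < 2**16:
        return 3
    if byte_length < 2**32:
        return 5
    return 9


def string_column_sizes(values: list[str]) -> tuple[int, int]:
    """Return (cbor_bytes, arrow_bytes) for a column of strings, bookkeeping included."""
    lengths = [len(v.encode("utf-8")) for v in values]
    data = sum(lengths)
    cbor = data + sum(cbor_text_header_size(n) for n in lengths)
    arrow = data + 4 * (len(values) + 1)  # int32 offsets: one per value, plus one
    return cbor, arrow
```

For short strings, `string_column_sizes(["a", "bb", "ccc"])` gives 9 bytes for CBOR against 22 for Arrow under these assumptions. The gap per value stays at roughly three bytes and becomes negligible as strings get longer.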

**Memory benchmark**: TODO. Arrow libraries are just dataframe wrappers around the byte buffer, so unlike CBOR, Arrow has zero deserialization cost.
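The "wrapper around the byte buffer" idea can be illustrated with the Python standard library alone. This sketch does not use Arrow and is not a benchmark. It only shows the principle, assuming a payload that is already a contiguous run of native-endian 64-bit integers: the typed view reads values in place, with no parsing step and no copy.

```python
import struct

# Stand-in for a received payload: four 64-bit integers in native byte order.
payload = struct.pack("4q", 10, 20, 30, 40)

# Wrap the bytes as a typed column. No value is parsed or copied here.
column = memoryview(payload).cast("q")

# The view still points at the original buffer.
shares_buffer = column.obj is payload

# Values are read straight from the underlying bytes on access.
third_value = column[2]
```

A format like CBOR, by contrast, has to be walked item by item to materialize each value before it can be used.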

**CPU benchmark**: TODO. Since Arrow has no deserialization cost, we can reasonably expect CPU usage to be minimal. Arrow libraries also come with their own optimized computation kernels, something a user would have to bring or write themselves with CBOR.