https://github.com/swallez/elasticsearch-arrow-experiments
Experiments on exposing Elasticsearch as Apache Arrow
Last synced: 26 days ago
- Host: GitHub
- URL: https://github.com/swallez/elasticsearch-arrow-experiments
- Owner: swallez
- License: apache-2.0
- Created: 2023-09-26T20:51:10.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-01-22T18:56:39.000Z (10 months ago)
- Last Synced: 2024-04-24T15:09:22.443Z (7 months ago)
- Language: Java
- Size: 144 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# elasticsearch-arrow-experiments
Experiments on exposing Elasticsearch as Apache Arrow.

You need Elasticsearch with ES|QL to run this experiment. At the time of writing, it is only available in 8.11 snapshots:
```
docker run -it --name es-esql -p 9200:9200 docker.elastic.co/elasticsearch/elasticsearch:8.11.0-SNAPSHOT
```

Copy `config-example.ini` to `config.ini` and paste into it the password displayed at the end of Elasticsearch's initialization sequence.
The examples need some test data. To seed the Elasticsearch server with random test data, run `./gradlew ingest-data`.
To start the Arrow Flight / ES|QL bridge, run `./gradlew flight-server`.
**Java example:** run `./gradlew flight-client`
**Python examples:** the `python` directory has examples with both Pandas and Polars. They also illustrate two different authentication methods: one using Arrow Flight's native authentication, and one using a middleware that sets the HTTP `Authorization` header. The bridge server understands both.
**Rust example:** the `rust` directory has an example with Arrow.
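
The header-based authentication mentioned for the Python examples can be sketched roughly as follows. This is not the code from the `python` directory. It assumes the bridge accepts HTTP Basic credentials, and the names `basic_auth_header` and `connect` are hypothetical. Only the `pyarrow.flight` middleware hooks (`ClientMiddlewareFactory.start_call`, `ClientMiddleware.sending_headers`) are real library API.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Basic `Authorization` header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def connect(location: str, username: str, password: str):
    """Return a Flight client that attaches the header to every call.

    Assumption: the server expects Basic credentials in `authorization`.
    """
    # Imported lazily so the helper above works without pyarrow installed.
    import pyarrow.flight as flight

    header = basic_auth_header(username, password)

    class AuthHeaderMiddleware(flight.ClientMiddleware):
        def sending_headers(self):
            # gRPC metadata keys must be lowercase ASCII.
            return {"authorization": header}

    class AuthHeaderFactory(flight.ClientMiddlewareFactory):
        def start_call(self, info):
            # Called once per RPC; returns the middleware for that call.
            return AuthHeaderMiddleware()

    return flight.FlightClient(location, middleware=[AuthHeaderFactory()])
```

A call such as `connect("grpc://localhost:<port>", "elastic", password)` would then send the header on each request. The location, user name and port depend on how the bridge is configured and are placeholders here.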
## ESQL serialization benchmarks
In the Java project, the `benchmarks` package contains benchmarks to compare ESQL serialization to JSON, CBOR and Arrow.
**Size benchmark**: CBOR and Arrow produce payloads of comparable size. In Arrow, number columns are more compact, but strings take a bit more space, because each string's size is always encoded as a 32-bit integer.
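
To make the size argument concrete, here is a back-of-the-envelope sketch in Python. It is not the code of the `benchmarks` package, and all function names are hypothetical. It assumes numbers are doubles encoded as CBOR double-precision floats, strings are definite-length CBOR text strings, and Arrow uses its default string layout with 32-bit offsets. Arrow validity bitmaps, buffer padding and message metadata are ignored.

```python
def cbor_head_size(n: int) -> int:
    """Bytes taken by a CBOR head carrying the unsigned argument n."""
    if n < 24:
        return 1          # argument fits in the initial byte
    if n < 2**8:
        return 2
    if n < 2**16:
        return 3
    if n < 2**32:
        return 5
    return 9


def cbor_strings_size(values) -> int:
    """Each text string costs a variable-length head plus its UTF-8 bytes."""
    total = 0
    for v in values:
        n = len(v.encode("utf-8"))
        total += cbor_head_size(n) + n
    return total


def arrow_strings_size(values) -> int:
    """Offsets buffer of (n + 1) int32 values plus the concatenated UTF-8 data."""
    data = sum(len(v.encode("utf-8")) for v in values)
    return 4 * (len(values) + 1) + data


def cbor_doubles_size(count: int) -> int:
    """A CBOR double is one initial byte followed by 8 payload bytes."""
    return 9 * count


def arrow_doubles_size(count: int) -> int:
    """An Arrow float64 column stores 8 bytes per value."""
    return 8 * count
```

For the three short strings `["alpha", "beta", "gamma"]`, this estimate gives 17 bytes for CBOR and 30 for Arrow, while three doubles take 27 bytes in CBOR and 24 in Arrow. The fixed 4-byte offsets weigh most on short strings and matter less as strings get longer.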
**Memory benchmark**: TODO. Arrow libraries are just dataframe wrappers around the byte buffer, so unlike CBOR, Arrow has zero deserialization cost.
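
The "wrapper around the byte buffer" idea can be illustrated with the Python standard library alone. This sketch does not use Arrow. It assumes a column of 64-bit integers stored back to back in native byte order, which mimics the data buffer of an Arrow primitive column, and the variable names are made up for the illustration.

```python
import struct

# Four int64 values laid out contiguously, as received from the wire.
buffer = bytearray(struct.pack("4q", 10, 20, 30, 40))

# Reinterpret the bytes as int64 values without copying or decoding them.
column = memoryview(buffer).cast("q")

first_three = list(column[:3])   # values are read straight from the buffer

# Writing to the underlying buffer is visible through the view,
# which shows that no separate deserialized copy exists.
struct.pack_into("q", buffer, 8, 99)
second_value = column[1]
```

With a format such as CBOR, each value would instead have to be decoded from its variable-length encoding into a new in-memory object before it could be used.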
**CPU benchmark**: TODO. Since Arrow has no deserialization cost, we can reasonably expect CPU usage to be minimal. Arrow libraries also come with their own optimized computation kernels, something a user would have to bring or write themselves with CBOR.