https://github.com/duneanalytics/arrow_struct
- Host: GitHub
- URL: https://github.com/duneanalytics/arrow_struct
- Owner: duneanalytics
- Created: 2024-09-04T16:39:42.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-12-14T20:48:47.000Z (over 1 year ago)
- Last Synced: 2025-06-10T10:53:12.522Z (about 1 year ago)
- Language: Rust
- Size: 37.1 KB
- Stars: 2
- Watchers: 8
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# TODO
* Benchmark
* serde_arrow
* arrow2-construct
* Configurable column cases with attributes
* Pick a better name
* Add more convenient interface for converting record batches
# Usage
## RecordBatch vs. StructArray
## Option vs non-Option
Unless you fully trust your data, prefer `Option` for all struct fields except nested structs (i.e., `struct Struct { field: Option<i32> }` over `struct Struct { field: i32 }`).
Arrow does not enforce not-null constraints in RecordBatches: the schema can declare a column non-nullable while the data still contains nulls.
We panic if we encounter a null value for a field that is not an `Option`.
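
The difference can be illustrated with a minimal, std-only sketch. `Int32Column`, `get_opt`, and `get_required` are hypothetical names made up for this example; they are not part of arrow or of this crate, and only model the idea of a values buffer with a separate validity mask.

```rust
/// Hypothetical stand-in for an Arrow Int32 column (NOT the arrow or
/// arrow_struct API): a values buffer plus a validity mask.
struct Int32Column {
    values: Vec<i32>,
    validity: Vec<bool>,   // true = value present, false = null
    schema_nullable: bool, // what the schema claims; nothing checks it against `validity`
}

impl Int32Column {
    /// Reading into an `Option<i32>` field: a null simply becomes `None`.
    fn get_opt(&self, row: usize) -> Option<i32> {
        if self.validity[row] {
            Some(self.values[row])
        } else {
            None
        }
    }

    /// Reading into a plain `i32` field: a null has no valid representation,
    /// so this sketch panics.
    fn get_required(&self, row: usize) -> i32 {
        self.get_opt(row).expect("null value in a non-Option field")
    }
}

fn main() {
    // The schema claims "not null", but row 1 is null anyway.
    let col = Int32Column {
        values: vec![1, 0, 3],
        validity: vec![true, false, true],
        schema_nullable: false,
    };
    assert!(!col.schema_nullable);

    // Option fields absorb the unexpected null.
    let rows: Vec<Option<i32>> = (0..3).map(|i| col.get_opt(i)).collect();
    assert_eq!(rows, vec![Some(1), None, Some(3)]);

    // A non-Option field panics on the same row.
    let result = std::panic::catch_unwind(|| col.get_required(1));
    assert!(result.is_err());
}
```

In this sketch the schema flag and the validity mask are independent, which is why a non-`Option` field cannot be made safe by inspecting the schema alone.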
# Performance tips for deserialization
## Zero-copy
Where possible, prefer references for non-primitive types (i.e., `&str` instead of `String`, `&[u8]` instead of `Bytes`).
This avoids cloning the underlying data.
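
As a rough illustration of why borrowing helps, here is a std-only sketch. `StrColumn`, `BorrowedRow`, `OwnedRow`, and the helper functions are hypothetical names for this example and do not reflect this crate's API; the column is modeled as one contiguous buffer plus offsets.

```rust
/// Hypothetical stand-in for an Arrow string column (NOT the arrow or
/// arrow_struct API): one contiguous UTF-8 buffer plus offsets.
struct StrColumn {
    data: String,
    offsets: Vec<usize>, // row i spans data[offsets[i]..offsets[i + 1]]
}

impl StrColumn {
    fn len(&self) -> usize {
        self.offsets.len().saturating_sub(1)
    }

    /// Returns a slice that borrows from the column buffer; nothing is copied.
    fn get(&self, row: usize) -> &str {
        &self.data[self.offsets[row]..self.offsets[row + 1]]
    }
}

/// Row type with a borrowed field: tied to the column's lifetime, no copy.
struct BorrowedRow<'a> {
    name: &'a str,
}

/// Row type with an owned field: every row copies its bytes into a new String.
struct OwnedRow {
    name: String,
}

fn borrowed_rows(col: &StrColumn) -> Vec<BorrowedRow<'_>> {
    (0..col.len()).map(|i| BorrowedRow { name: col.get(i) }).collect()
}

fn owned_rows(col: &StrColumn) -> Vec<OwnedRow> {
    (0..col.len())
        .map(|i| OwnedRow { name: col.get(i).to_string() })
        .collect()
}

/// True if `s` lies inside the column's own buffer.
fn points_into(col: &StrColumn, s: &str) -> bool {
    let start = col.data.as_ptr() as usize;
    let end = start + col.data.len();
    let p = s.as_ptr() as usize;
    p >= start && p + s.len() <= end
}

fn main() {
    let col = StrColumn {
        data: "alicebob".to_string(),
        offsets: vec![0, 5, 8],
    };
    let borrowed = borrowed_rows(&col);
    let owned = owned_rows(&col);

    assert_eq!(borrowed[1].name, "bob");
    assert_eq!(owned[1].name, "bob");

    // The borrowed field points into the column buffer; the owned one is a copy.
    assert!(points_into(&col, borrowed[1].name));
    assert!(!points_into(&col, &owned[1].name));
}
```

The trade-off is that borrowed rows cannot outlive the column they were read from.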
## Avoid Arrow lists
Where possible, avoid Arrow lists.
Even though we are careful when deserializing lists, we still create a vector for every row with a non-null list.
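
The per-row cost can be sketched with std only. `ListColumn`, `read_lists`, and `vectors_created` are hypothetical names for this example, not this crate's implementation; the column is modeled as a flat values buffer with offsets and a validity mask.

```rust
/// Hypothetical stand-in for an Arrow list-of-int32 column (NOT the arrow or
/// arrow_struct API).
struct ListColumn {
    values: Vec<i32>,    // all elements of all rows, back to back
    offsets: Vec<usize>, // row i spans values[offsets[i]..offsets[i + 1]]
    validity: Vec<bool>, // false = the whole list is null for that row
}

/// Builds one row-oriented value per row. Every non-null row gets its own
/// `Vec`, even though the elements already sit in one shared buffer.
fn read_lists(col: &ListColumn) -> Vec<Option<Vec<i32>>> {
    (0..col.validity.len())
        .map(|row| {
            if col.validity[row] {
                Some(col.values[col.offsets[row]..col.offsets[row + 1]].to_vec())
            } else {
                None
            }
        })
        .collect()
}

/// Counts how many per-row vectors were built.
fn vectors_created(rows: &[Option<Vec<i32>>]) -> usize {
    rows.iter().filter(|r| r.is_some()).count()
}

fn main() {
    let col = ListColumn {
        values: vec![1, 2, 3, 4],
        offsets: vec![0, 2, 2, 4],
        validity: vec![true, false, true],
    };
    let rows = read_lists(&col);
    assert_eq!(rows, vec![Some(vec![1, 2]), None, Some(vec![3, 4])]);

    // Two non-null rows, so two vectors were built.
    assert_eq!(vectors_created(&rows), 2);
}
```

The number of vectors grows with the number of non-null rows, so the cost is most visible on large batches of short lists.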