Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/adamreeve/dataset-metadata-benchmark
- Host: GitHub
- URL: https://github.com/adamreeve/dataset-metadata-benchmark
- Owner: adamreeve
- Created: 2024-06-05T12:55:47.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2024-06-06T08:41:37.000Z (5 months ago)
- Last Synced: 2024-10-15T18:50:53.177Z (28 days ago)
- Language: Python
- Size: 46.9 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Arrow Dataset Benchmarks
This repository contains code for benchmarking the use of Parquet `_metadata` files when reading from an Arrow Dataset.
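The repository's scripts are not reproduced in this README; as a minimal sketch, the comparison is between letting `pyarrow.dataset` discover the Parquet files by scanning a directory (reading each file's footer) and constructing the Dataset from a consolidated `_metadata` file. The `data` path and `id` column below are illustrative assumptions, not taken from the scripts:

```python
import pyarrow.dataset as ds

# Baseline: discover fragments by listing the directory and reading each
# Parquet file's footer individually.
baseline = ds.dataset("data", format="parquet")

# _metadata based: construct the dataset from the consolidated footer
# metadata, avoiding the per-file footer reads at discovery time.
with_metadata = ds.parquet_dataset("data/_metadata")

# The same filtered read can then be run against either dataset.
table = with_metadata.to_table(filter=ds.field("id") == 42)
```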
## Local file system benchmarks
Generate data:
```
python write.py
```
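`write.py` itself is not shown here. The sketch below only illustrates the standard `pyarrow` `metadata_collector` recipe for producing a multi-file Parquet dataset together with its consolidated `_metadata` file; the paths, schema and sizes are assumptions:

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

rng = np.random.default_rng(0)
metadata_collector = []

# Write several Parquet files into the dataset directory, collecting each
# written file's footer metadata as we go.
for _ in range(10):
    batch = pa.table({
        "id": np.arange(100_000),
        "value": rng.random(100_000),
    })
    pq.write_to_dataset(batch, "data", metadata_collector=metadata_collector)

# Consolidate the collected footers into a single _metadata file that
# pyarrow.dataset.parquet_dataset() can later read.
pq.write_metadata(batch.schema, "data/_metadata",
                  metadata_collector=metadata_collector)
```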
Clear file system caches, then run the read query:
```
sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches'
python read_benchmark.py
```

Run the read query using the `_metadata` file-based Dataset:
```
sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches'
python read_benchmark.py -m
```
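`read_benchmark.py` is likewise not shown; a read benchmark of this kind essentially just times dataset construction plus the filtered read. A rough sketch, where the meaning of the `-m` flag and the filter column are assumptions:

```python
import sys
import time
import pyarrow.dataset as ds

use_metadata = "-m" in sys.argv  # assumed meaning of the -m flag

start = time.perf_counter()
if use_metadata:
    # Build the dataset from the consolidated _metadata file.
    dataset = ds.parquet_dataset("data/_metadata")
else:
    # Baseline: scan the directory and read each file's footer.
    dataset = ds.dataset("data", format="parquet")
table = dataset.to_table(filter=ds.field("id") == 42)
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"{elapsed_ms:.0f} ms, {table.num_rows} rows")
```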
## Minio (S3 API) benchmarks

Run a minio service:
```
podman run -p 9000:9000 -p 9001:9001 quay.io/minio/minio server /data --console-address ":9001"
```
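The `--s3` code paths are not shown either; the sketch below illustrates one way to point `pyarrow` at the local MinIO endpoint and build both dataset variants over S3. The bucket name is an assumption, and `minioadmin`/`minioadmin` are MinIO's default root credentials:

```python
import pyarrow.dataset as ds
from pyarrow import fs

# Connect to the MinIO container started above over plain HTTP.
s3 = fs.S3FileSystem(
    access_key="minioadmin",
    secret_key="minioadmin",
    endpoint_override="localhost:9000",
    scheme="http",
)

# Baseline: list the bucket and read every file's footer over the network.
baseline = ds.dataset("benchmark-data/dataset", filesystem=s3, format="parquet")

# _metadata based: a single object read replaces the many per-file footer
# requests, which is presumably why the speed-up is larger over S3.
with_metadata = ds.parquet_dataset("benchmark-data/dataset/_metadata", filesystem=s3)
```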
Generate data:
```
python write.py --s3
```

Clear file system caches, then run the read query:
```
sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches'
python read_benchmark.py --s3
```

Run the read query using the `_metadata` file-based Dataset:
```
sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches'
python read_benchmark.py --s3 -m
```

## Results
| File system | Baseline query time (ms) | Query time with `_metadata` (ms) | Speed up |
| --- | --- | --- | --- |
| Local (SSD) | 176 | 93 | 1.9x |
| Minio | 786 | 121 | 6.5x |