Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/fifemon/dump-es-parquet
- Host: GitHub
- URL: https://github.com/fifemon/dump-es-parquet
- Owner: fifemon
- License: BSD-3-Clause
- Created: 2024-03-29T12:11:16.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2024-04-03T17:54:56.000Z (9 months ago)
- Last Synced: 2024-04-03T18:57:17.372Z (9 months ago)
- Language: Python
- Size: 11.7 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
Dump data from Elasticsearch or OpenSearch to Parquet files, one file per index.
A columnar DataFrame is built in memory using [Polars](https://docs.pola.rs/), then written out to Parquet with zstd compression.
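For a sense of the mechanism, here is a minimal sketch (not the project's actual source) of building a Polars DataFrame from a batch of documents and writing it out with zstd compression:

```python
import polars as pl

# Hypothetical batch of documents, as would come from search hits' _source
docs = [
    {"ts": "2024-03-29T12:00:00Z", "level": "info", "msg": "started"},
    {"ts": "2024-03-29T12:00:05Z", "level": "warn", "msg": "slow query"},
]

df = pl.DataFrame(docs)  # columnar frame held in memory
df.write_parquet("my-data.parquet", compression="zstd")
```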
Nested fields are represented as Structs, unless `--flatten` is provided, in which case fields are flattened into the top level by joining field names with underscores. Flattening is recommended when working with multiple indices that have dynamic mappings, since columns can then be merged across files; differing Structs cannot easily be merged.
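To illustrate the difference, a small sketch (the `flatten` helper below is hypothetical, not the tool's code) comparing the default Struct representation with underscore-joined top-level columns:

```python
import polars as pl

def flatten(record: dict, prefix: str = "") -> dict:
    """Hypothetical helper: join nested keys with underscores."""
    out = {}
    for key, value in record.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            out.update(flatten(value, name))
        else:
            out[name] = value
    return out

doc = {"host": {"name": "node1", "ip": "10.0.0.1"}, "status": 200}

# Default: nested fields become a single Struct column
print(pl.DataFrame([doc]).schema)           # host: Struct, status: Int64

# Flattened: nested fields become plain top-level columns
print(pl.DataFrame([flatten(doc)]).schema)  # host_name, host_ip, status
```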
# Requirements
Developed with Python 3.11 using:
- opensearch-py==2.4.2
- polars==0.20.15
- requests==2.31.0

## Nix (recommended)
A `flake.nix` is included - run `nix develop` to enter a shell with all dependencies.
With `direnv` installed, run `direnv allow` to have it load the environment for you when you enter the directory.

## Pip
```sh
pip install -r requirements.txt
```
# Usage
This will read all records from the `my-data` index, in batches of 500, and write them to a Parquet file named `my-data.parquet`:
```sh
dump-es-parquet --es https://example.com:9200 my-data
```
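Conceptually, this corresponds to a scrolled search over the index; a sketch using `opensearch-py`'s scan helper (illustrative, not the tool's source):

```python
import polars as pl
from opensearchpy import OpenSearch, helpers

client = OpenSearch("https://example.com:9200")

# scan() pages through all matching documents, 500 per request
hits = helpers.scan(client, index="my-data", size=500)
df = pl.DataFrame([hit["_source"] for hit in hits])
df.write_parquet("my-data.parquet", compression="zstd")
```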
You can also dump all indices matching a pattern; each index will get its own file:
```sh
dump-es-parquet --es https://example.com:9200 'my-data-*'
```
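Equivalently in Python, one might resolve the pattern and loop over the matches (a sketch; the index-listing call is standard `opensearch-py`, but this is not the project's code):

```python
import polars as pl
from opensearchpy import OpenSearch, helpers

client = OpenSearch("https://example.com:9200")

# indices.get() expands the wildcard to concrete index names
for index in client.indices.get("my-data-*"):
    hits = helpers.scan(client, index=index, size=500)
    pl.DataFrame([hit["_source"] for hit in hits]).write_parquet(
        f"{index}.parquet", compression="zstd"
    )
```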
If you then want to analyze the data in DuckDB, for instance:
```sql
CREATE TABLE mydata AS SELECT * FROM read_parquet('my-data-*.parquet', union_by_name=true);
```
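The `union_by_name=true` option aligns columns by name across files, which matters when flattened schemas differ from index to index. If you prefer to stay in Python, the same read works through DuckDB's Python API (a sketch; `duckdb` is not among the project's stated requirements):

```python
import duckdb

# Align differing per-file schemas by column name, then hand off to Polars
df = duckdb.read_parquet("my-data-*.parquet", union_by_name=True).pl()
print(df.schema)
```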