# dump-es-parquet

https://github.com/fifemon/dump-es-parquet

Dump data from Elasticsearch or OpenSearch to parquet files, one file per index.

A columnar dataframe is built in memory using [Polars](https://docs.pola.rs/), then written out to parquet with zstd compression.

Nested fields are represented as Structs, unless `--flatten` is provided, in which case nested fields are flattened into the top level by joining field names with underscores. Flattening is recommended when working with multiple indices that have dynamic mappings, since columns can then be merged across files; differing structs cannot easily be merged.
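
To illustrate the difference, here is a small Polars sketch of the two representations (this is not the tool's actual code, and the document and field names are made up):

```python
import polars as pl

# A document with a nested "host" object, as it might come back from Elasticsearch.
doc = {"host": {"name": "node1", "ip": "10.0.0.1"}, "status": 200}

# Default behaviour: nested objects become Struct columns.
df = pl.DataFrame([doc])
# df.schema -> "host" is a Struct column with fields "name" and "ip"; "status" is Int64

# With --flatten, nested fields end up as top-level columns joined with underscores,
# roughly equivalent to unnesting the struct and prefixing the inner field names:
flat = df.unnest("host").rename({"name": "host_name", "ip": "host_ip"})
# flat.schema -> "host_name", "host_ip", "status"
```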

# Requirements

Developed with Python 3.12, using:

- opensearch-py==2.8.0
- polars==1.21.0
- requests==2.32.3

## Nix (recommended)

A `flake.nix` is included - run `nix develop` to enter a shell with all dependencies.
With `direnv` installed, run `direnv allow` to have it load the environment for you when you enter the directory.

## Pip

```
pip install -r requirements.txt
```

# Usage

```
usage: dump-es-parquet [-h] [--es ES] [--size SIZE] [--timeout TIMEOUT] [--flatten] [--query QUERY] [--max-partition-size-mb MAX_PARTITION_SIZE_MB] [--debug] [--quiet] index

Dump documents to parquet files

positional arguments:
  index                 source index pattern

options:
  -h, --help            show this help message and exit
  --es ES               source cluster address
  --size SIZE           Record batch size (default 500)
  --timeout TIMEOUT     Elasticsearch read timeout in seconds (default 60)
  --flatten             Flatten nested data into top level, otherwise use structs
  --query QUERY         Query string to filter results
  --max-partition-size-mb MAX_PARTITION_SIZE_MB
                        Maximum in-memory size of partition dataframe in megabytes (default 1000). Note that the file size will be smaller due to compression
  --debug               Enable debug logging
  --quiet               Disable most logging (ignored if --debug specified)
```

# Examples

This will read all records from the `my-data` index, in batches of 500, and write them to a parquet file named `my-data.parquet`:

```
dump-es-parquet --es https://example.com:9200 my-data
```

You can also dump all indices matching a pattern; each index will get its own file:

```
dump-es-parquet --es https://example.com:9200 'my-data-*'
```

If you then want to analyze the data in DuckDB, for instance:

```sql
CREATE TABLE mydata AS SELECT * FROM read_parquet('my-data-*.parquet', union_by_name=true);
```
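
A similar union-by-name read is possible directly in Polars via a diagonal concatenation. This is a minimal sketch (not part of the tool), and it assumes the per-index files share compatible column dtypes, which `--flatten` helps ensure:

```python
import glob

import polars as pl

# Read each per-index file and align columns by name, filling missing columns with
# nulls - the Polars analogue of DuckDB's union_by_name.
frames = [pl.read_parquet(path) for path in glob.glob("my-data-*.parquet")]
combined = pl.concat(frames, how="diagonal")
print(combined.shape)
```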