{"id":20358878,"url":"https://github.com/fifemon/dump-es-parquet","last_synced_at":"2026-05-09T08:10:00.093Z","repository":{"id":231367335,"uuid":"779259501","full_name":"fifemon/dump-es-parquet","owner":"fifemon","description":null,"archived":false,"fork":false,"pushed_at":"2025-06-11T15:29:24.000Z","size":23,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-06-11T16:59:04.356Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/fifemon.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-03-29T12:11:16.000Z","updated_at":"2025-06-11T15:29:27.000Z","dependencies_parsed_at":"2024-04-03T18:57:19.627Z","dependency_job_id":"736a6bff-4415-4acb-af9b-d674bafd03ab","html_url":"https://github.com/fifemon/dump-es-parquet","commit_stats":null,"previous_names":["fifemon/dump-es-parquet"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/fifemon/dump-es-parquet","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fifemon%2Fdump-es-parquet","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fifemon%2Fdump-es-parquet/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fifemon%2Fdump-es-parquet/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fifemon%2Fdump-es-parquet/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/G
itHub/owners/fifemon","download_url":"https://codeload.github.com/fifemon/dump-es-parquet/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fifemon%2Fdump-es-parquet/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":27394207,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-11-30T02:00:05.582Z","response_time":55,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-14T23:29:16.562Z","updated_at":"2026-05-09T08:10:00.081Z","avatar_url":"https://github.com/fifemon.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"Dump data from Elasticsearch or Opensearch to parquet, json, or csv files, or directly to stdout.\nFiles are named the same as the index, with a partition number added in case of large datasets, and an \nappropriate extension.\n\nThere are two modes of operation, depending on the output requested:\n\n## parquet, ndjson, or csv \n\nA columnar dataframe is built in memory using [Polars](https://docs.pola.rs/), \nthen written out to parquet with zstd compression, ndjson, or csv (compression not yet supported) as appropriate. 
For large\ndatasets (controlled with the `--max-partition-mb` flag) multiple files will be output, \nwith an incremental partition number appended to the index name.\n\nNested fields are represented as Structs, unless `--flatten` is provided, in which case fields are flattened into the top level by combining field names with underscores. Flattening is recommended when working with multiple indices that have dynamic mapping, as columns can then be merged across files - different structs cannot easily be merged. Flattening is also required to output to CSV.\n\n## stdout or jsonl\n\nRecords are dumped in JSON, one record per line, to stdout or a file as they are received, with up to `--max-partition-rows` (default 1 000 000) records per file.\n\n# Requirements\n\nDeveloped with Python 3.12 with:\n\n- opensearch-py==2.8.0\n- polars==1.36.0\n- requests==2.32.5\n\n## Nix (recommended)\n\nA `flake.nix` is included - run `nix develop` to enter a shell with all dependencies. \nWith `direnv` installed run `direnv allow` to have it load the environment for you when you enter the directory.\n\n## Pip\n\n    pip install -r requirements.txt\n\n# Usage\n\n```\nusage: dump-es-parquet [-h] [--es ES] [--cert CERT] [--key KEY] [--no-verify-certs] [--capath CAPATH]\n                       [--size SIZE] [--sort SORT] [--timeout TIMEOUT]\n                       [--output {parquet,ndjson,csv,jsonl,stdout}] [--flatten] [--query QUERY]\n                       [--fields FIELDS] [--max-partition-rows MAX_PARTITION_ROWS]\n                       [--max-partition-mb MAX_PARTITION_MB] [--no-partition] [--debug] [--quiet]\n                       index\n\nDump documents from Elasticsearch or OpenSearch to stdout or files.\n\nBehavior varies with output format:\n\n    parquet: builds a polars dataframe in-memory, accumulating records until the dataframe reaches the\n             specified max partition size, at which point it is written to a parquet file with the index name\n             and 
partition number, which is omitted if the entire results fit into a single partition.\n    ndjson:  same as parquet, but written to newline-delimited json files instead.\n    csv:     same as parquet, but written to csv files instead.\n    stdout:  outputs raw records in JSON format to stdout. Does not attempt to build a dataframe,\n             so will work even if the source data has problematic/inconsistent types.\n    jsonl:   same as stdout, but outputs records to files, one per request batch.\n\npositional arguments:\n  index                 source index pattern\n\noptional arguments:\n  -h, --help            show this help message and exit\n  --es ES               source cluster address\n  --cert CERT           Client x509 certificate\n  --key KEY             Client x509 key\n  --no-verify-certs     Do not verify x509 certificates\n  --capath CAPATH       Path to CA certificates\n  --size SIZE           Record batch size (default 500)\n  --sort SORT           Comma-separated list of field:direction pairs\n  --timeout TIMEOUT     Elasticsearch read timeout in seconds (default 60)\n  --output {parquet,ndjson,csv,jsonl,stdout}\n                        output format\n  --flatten             Flatten nested data into top level, otherwise use structs\n  --query QUERY         Query string to filter results\n  --fields FIELDS       Comma-separated list of fields to include in the output. Wildcards are supported.\n                        Defaults to all fields.\n  --max-partition-rows MAX_PARTITION_ROWS\n                        Maximum rows in partition\n  --max-partition-mb MAX_PARTITION_MB\n                        Maximum in-memory size of partition dataframe in megabytes (default 1000). 
Note that\n                        the file size will be smaller due to compression\n  --no-partition        Do not partition into files no matter how big the dataset is\n  --debug               Enable debug logging\n  --quiet               Disable most logging (ignored if --debug specified)\n```\n\n# Examples\n\nThis will read all records from the `my-data` index, in batches of 500, and write them to a parquet file named `my-data.parquet`:\n\n    dump-es-parquet --es https://example.com:9200 my-data\n\nYou can also dump all indices matching a pattern; each index will get its own file:\n\n    dump-es-parquet --es https://example.com:9200 'my-data-*'\n\nIf you then want to analyze the data in DuckDB, for instance:\n\n```sql\nCREATE TABLE mydata AS SELECT * FROM read_parquet('my-data-*.parquet', union_by_name=true);\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffifemon%2Fdump-es-parquet","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffifemon%2Fdump-es-parquet","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffifemon%2Fdump-es-parquet/lists"}