https://github.com/miku/esdump
Stream documents from elasticsearch with scroll (and HTTP GET only)
code4lib command-line-tool elasticsearch
Last synced: 19 days ago
- Host: GitHub
- URL: https://github.com/miku/esdump
- Owner: miku
- License: mit
- Created: 2020-04-09T11:43:04.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2024-06-28T17:59:08.000Z (6 months ago)
- Last Synced: 2024-11-10T05:32:05.682Z (about 1 month ago)
- Topics: code4lib, command-line-tool, elasticsearch
- Language: Go
- Homepage:
- Size: 245 KB
- Stars: 9
- Watchers: 4
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-library - esdump - Export data from elasticsearch. (Uncategorized / Uncategorized)
README
# esdump
Stream docs from Elasticsearch to stdout for ad-hoc data mangling using the
[Scroll
API](https://www.elastic.co/guide/en/elasticsearch/guide/master/scroll.html#scroll).
Just like [solrdump](https://github.com/ubleipzig/solrdump), but for
[elasticsearch](https://elastic.co/). Since esdump 0.1.11, the default operator can be set explicitly (`-op`), and its default changed from `OR` to `AND`.

Libraries can use both GET and POST requests to issue scroll requests.

* [elasticsearch-py](https://github.com/elastic/elasticsearch-py/blob/c0767a9569a719dcb15adec91a88afc32b27b1b0/elasticsearch/client/__init__.py#L1300-L1323) uses POST
* [esapi](https://github.com/elastic/go-elasticsearch/blob/6f36a473b19f05f20933da8f59347b308ab46594/esapi/api.scroll.go#L65) uses GET

This tool uses HTTP GET only, and does not clear scrolls (which would probably
use
[DELETE](https://github.com/elastic/go-elasticsearch/blob/6f36a473b19f05f20933da8f59347b308ab46594/esapi/api.clear_scroll.go#L60)),
so it also works with read-only servers that only allow GET.

## Install
```
$ go install github.com/miku/esdump/cmd/esdump@latest
```

Or via a [release](https://github.com/miku/esdump/releases).

## Usage
```
esdump uses the elasticsearch scroll API to stream documents to stdout.

Originally written to extract samples from https://search.fatcat.wiki (a
scholarly communications preservation and discovery project).

$ esdump -s https://search.fatcat.wiki -i fatcat_release -q 'web archiving'

Usage of ./esdump:
  -i string
        index name (default "fatcat_release")
  -ids string
        a path to a file with one id per line to fetch
  -l int
        limit number of documents fetched, zero means no limit
  -mq string
        path to file, one lucene query per line
  -op string
        default operator for query string queries (default "AND")
  -q string
        lucene syntax query to run, example: 'affiliation:"alberta"' (default "*")
  -s string
        elasticsearch server (default "https://search.fatcat.wiki")
  -scroll string
        context timeout (default "5m")
  -size int
        batch size (default 1000)
  -v    show version
  -verbose
        be verbose
```

## Performance data point(s)

```
925636 docs in 4m47.460217252s (3220 docs/s)
```

## TODO

* [ ] move to [`search_after`](https://www.elastic.co/guide/en/elasticsearch/reference/current/scroll-api.html)