https://github.com/federicotdn/inelastic


# inelastic
[![Build Status](https://travis-ci.org/federicotdn/inelastic.svg)](https://travis-ci.org/federicotdn/inelastic)
[![Version](https://img.shields.io/pypi/v/inelastic.svg?style=flat)](https://pypi.python.org/pypi/inelastic)
![](https://img.shields.io/badge/python-3-blue.svg)
![](https://img.shields.io/badge/code%20style-black-000000.svg)

Print an Elasticsearch inverted index as a CSV table or JSON object.

`inelastic` builds an approximation of what an [inverted index](https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up) would look like for a particular index and document field, using the [Multi termvectors API](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-multi-termvectors.html) on all stored documents.
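
As a rough illustration of the idea (not `inelastic`'s actual implementation), the sketch below folds a hand-written, Multi termvectors-shaped response into an inverted index. The `build_inverted_index` helper and the sample data are hypothetical; only the `docs` / `term_vectors` / `terms` / `term_freq` layout follows the response format documented for that API.

```python
from collections import defaultdict


def build_inverted_index(mtv_response, field):
    """Fold a Multi termvectors-style response into {term: {"freq", "doc_ids"}}."""
    index = defaultdict(lambda: {"freq": 0, "doc_ids": []})
    for doc in mtv_response["docs"]:
        # Documents without the field simply contribute no terms.
        terms = doc.get("term_vectors", {}).get(field, {}).get("terms", {})
        for term, info in terms.items():
            index[term]["freq"] += info["term_freq"]
            index[term]["doc_ids"].append(doc["_id"])
    return dict(index)


# Hand-written sample (assumed data) for two documents whose "content" field is
# "Adding some more tweets." and "Adding more and more tweets."
sample_response = {
    "docs": [
        {"_id": "4", "term_vectors": {"content": {"terms": {
            "adding": {"term_freq": 1}, "some": {"term_freq": 1},
            "more": {"term_freq": 1}, "tweets": {"term_freq": 1}}}}},
        {"_id": "5", "term_vectors": {"content": {"terms": {
            "adding": {"term_freq": 1}, "and": {"term_freq": 1},
            "more": {"term_freq": 2}, "tweets": {"term_freq": 1}}}}},
    ]
}

inverted = build_inverted_index(sample_response, "content")
for term in sorted(inverted):
    entry = inverted[term]
    # Columns: term, total frequency, document count, document IDs.
    print(term, entry["freq"], len(entry["doc_ids"]), *entry["doc_ids"])
```

Each term maps to its total frequency and the list of documents containing it, which is the same information the tool prints per row.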

## Installation
To install `inelastic`, run the following command:
```bash
$ pip3 install --upgrade inelastic
```

`inelastic` currently only supports Elasticsearch versions 6.X and 7.X.

## Example

Given the following index:
```
PUT /tweets
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text"
      }
    }
  }
}
```

with the following documents:
```
POST /tweets/_bulk
{ "index": { "_id": 1 }}
{ "content": "This is my first tweet." }
{ "index": { "_id": 2 }}
{ "content": "Most Elasticsearch examples use tweets." }
{ "index": { "_id": 3 }}
{ "content": "This is an example." }
{ "index": { "_id": 4 }}
{ "content": "Adding some more tweets." }
{ "index": { "_id": 5 }}
{ "content": "Adding more and more tweets." }
```

`inelastic` can be used as follows (combined with the `column` command):

```bash
$ inelastic -i tweets -f content | column -t -s ,
```

This would output:
```
term           freq  doc_count  d0  d1  d2
adding         2     2          4   5
an             1     1          3
and            1     1          5
elasticsearch  1     1          2
example        1     1          3
examples       1     1          2
first          1     1          1
is             2     2          1   3
more           3     2          4   5
most           1     1          2
my             1     1          1
some           1     1          4
this           2     2          1   3
tweet          1     1          1
tweets         3     3          2   4   5
use            1     1          2
```

The `freq` field specifies the total number of times the term appears across all documents, and the `doc_count` field specifies how many documents contain the term at least once. The `d0`, `d1`, ... fields list the IDs of the documents containing the term.
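
For downstream processing, the CSV output can be read back into Python. The sketch below is one possible approach using only the standard library; `parse_inverted_index_csv` is a hypothetical helper, and the raw CSV layout (a header row followed by rows with a variable number of document ID cells) is an assumption inferred from the table above.

```python
import csv
import io


def parse_inverted_index_csv(text):
    """Parse CSV text with columns term, freq, doc_count, d0, d1, ... into dicts."""
    rows = csv.reader(io.StringIO(text))
    next(rows)  # Skip the header row.
    entries = []
    for row in rows:
        if not row:
            continue  # Ignore blank lines.
        term, freq, doc_count, *doc_ids = row
        entries.append({
            "term": term,
            "freq": int(freq),
            "doc_count": int(doc_count),
            # Drop empty cells in case shorter rows are padded with commas.
            "doc_ids": [d for d in doc_ids if d],
        })
    return entries


# A few rows from the example above, written by hand as raw CSV (assumed layout).
sample = """term,freq,doc_count,d0,d1,d2
adding,2,2,4,5
more,3,2,4,5
tweets,3,3,2,4,5
"""

for entry in parse_inverted_index_csv(sample):
    # doc_count should match the number of listed document IDs.
    assert entry["doc_count"] == len(entry["doc_ids"])
    print(entry)
```

Note that `freq` can exceed `doc_count`, as with `more`, which appears three times but in only two documents.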

The chosen document field's type must be `text` or `keyword`.

## Usage
These are the arguments `inelastic` accepts:
- `-i` (`--index`): Index name (**required**).
- `-f` (`--field`): Document field name from which to generate the inverted index (**required**).
- `-l` (`--id-field`): Document field to use as ID when printing results (*default: _id*).
- `-o` (`--output`): Output format, `json` or `csv` (*default: csv*).
- `-p` (`--port`): Elasticsearch host port (*default: 9200*).
- `-e` (`--host`): Elasticsearch host address (*default: localhost*).
- `-q` (`--query`): Elasticsearch DSL JSON query to use when fetching documents (*default: None*).
- `-d` (`--doctype`): Document type (*default: _doc*) (**Elasticsearch 6.X only**).
- `-v` (`--verbose`): Print debug information (*default: False*).
- `-h` (`--help`): Show help and exit.
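
When calling the command from another program, passing a JSON query through a shell can be error-prone because of quoting. The sketch below assembles an argument list from the flags listed above; `build_inelastic_command` is a hypothetical helper, and the exact query shape that `-q` expects is an assumption, so check `inelastic --help` before relying on it.

```python
import json
import shlex


def build_inelastic_command(index, field, output="csv", query=None,
                            host="localhost", port=9200):
    """Build an argument list for the `inelastic` CLI from the flags listed above."""
    args = ["inelastic", "-i", index, "-f", field, "-o", output,
            "-e", host, "-p", str(port)]
    if query is not None:
        # Serialize the query so it reaches the CLI as a single JSON argument.
        args += ["-q", json.dumps(query)]
    return args


# Assumption: -q accepts a query clause shaped like this one.
command = build_inelastic_command(
    "tweets", "content", output="json", query={"match": {"content": "tweets"}}
)
print(shlex.join(command))  # Shell-quoted form of the same arguments.
```

Passing the list directly to `subprocess.run(command, capture_output=True, text=True)` avoids shell quoting altogether.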

## Scripting
The `inelastic` module exposes the `InvertedIndex` class, which can be used in custom Python scripts:
```python
from inelastic import InvertedIndex
from elasticsearch import Elasticsearch  # Only with ES 7.X
# from elasticsearch6 import Elasticsearch  # Use this import instead with ES 6.X

# Connect to the default local Elasticsearch instance
es = Elasticsearch()
ii = InvertedIndex(search_size=250, scroll_time='10s')

# Read the 'content' field of every document in the 'tweets' index
n_docs, errors = ii.read_index(es, 'tweets', 'content')

print('# docs: {}, # errors: {}'.format(n_docs, errors))

for entry in ii.to_list():
    print(entry)
```

When run, the previous script will output:
```
# docs: 5, # errors: 0
('adding', )
('an', )
('and', )
('elasticsearch', )
('example', )
('examples', )
('first', )
('is', )
('more', )
('most', )
('my', )
('some', )
('this', )
('tweet', )
('tweets', )
('use', )
```