https://github.com/federicotdn/inelastic

Print an Elasticsearch inverted index as a CSV table or JSON object.

csv elastic elasticsearch index inverted json search



Host: GitHub
URL: https://github.com/federicotdn/inelastic
Owner: federicotdn
License: apache-2.0
Archived: true
Created: 2018-08-09T23:19:27.000Z (almost 8 years ago)
Default Branch: master
Last Pushed: 2024-03-20T17:40:51.000Z (about 2 years ago)
Last Synced: 2025-11-27T18:27:22.973Z (7 months ago)
Topics: csv, elastic, elasticsearch, index, inverted, json, search
Language: Python
Homepage:
Size: 37.1 KB
Stars: 11
Watchers: 2
Forks: 4
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE


# inelastic

[![Build Status](https://travis-ci.org/federicotdn/inelastic.svg)](https://travis-ci.org/federicotdn/inelastic)

[![Version](https://img.shields.io/pypi/v/inelastic.svg?style=flat)](https://pypi.python.org/pypi/inelastic)

![](https://img.shields.io/badge/python-3-blue.svg)

![](https://img.shields.io/badge/code%20style-black-000000.svg)

Print an Elasticsearch inverted index as a CSV table or JSON object.

`inelastic` builds an approximation of what an [inverted index](https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up) would look like for a particular index and document field, using the [Multi termvectors API](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-multi-termvectors.html) on all stored documents.
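As a rough illustration of the idea (not `inelastic`'s actual implementation), the sketch below aggregates per-document term vectors into a term-to-documents mapping. The `build_inverted_index` helper and the hand-written `sample_response` are hypothetical; the sample only mimics the `docs` / `term_vectors` / `terms` / `term_freq` layout of a Multi termvectors response and omits its other keys.

```python
from collections import defaultdict


def build_inverted_index(mtermvectors_response, field):
    """Aggregate per-document term vectors into term -> statistics.

    Hypothetical helper for illustration; it only reads the keys
    "docs", "_id", "term_vectors", "terms" and "term_freq".
    """
    index = defaultdict(lambda: {"freq": 0, "doc_ids": []})
    for doc in mtermvectors_response.get("docs", []):
        # A document that lacks the field has no term vectors for it.
        terms = doc.get("term_vectors", {}).get(field, {}).get("terms", {})
        for term, info in terms.items():
            entry = index[term]
            entry["freq"] += info["term_freq"]   # occurrences across all docs
            entry["doc_ids"].append(doc["_id"])  # one entry per containing doc
    return dict(index)


# Hand-written sample shaped like a Multi termvectors response (assumption).
sample_response = {
    "docs": [
        {"_id": "4", "term_vectors": {"content": {"terms": {
            "adding": {"term_freq": 1},
            "some": {"term_freq": 1},
            "more": {"term_freq": 1},
            "tweets": {"term_freq": 1},
        }}}},
        {"_id": "5", "term_vectors": {"content": {"terms": {
            "adding": {"term_freq": 1},
            "more": {"term_freq": 2},
            "and": {"term_freq": 1},
            "tweets": {"term_freq": 1},
        }}}},
    ]
}

index = build_inverted_index(sample_response, "content")
for term in sorted(index):
    entry = index[term]
    print(term, entry["freq"], len(entry["doc_ids"]), entry["doc_ids"])
```

Each term accumulates its total frequency and the list of documents that contain it, which is the information an inverted index stores per term.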

## Installation

To install `inelastic`, run the following command:

```bash
$ pip3 install --upgrade inelastic
```

`inelastic` currently only supports Elasticsearch versions 6.X and 7.X.

## Example

Given the following index:

```
PUT /tweets
{
    "mappings": {
        "properties": {
            "content": {
                "type": "text"
            }
        }
    }
}
```

with the following documents:

```
POST /tweets/_bulk
{ "index": { "_id": 1 }}
{ "content": "This is my first tweet." }
{ "index": { "_id": 2 }}
{ "content": "Most Elasticsearch examples use tweets." }
{ "index": { "_id": 3 }}
{ "content": "This is an example." }
{ "index": { "_id": 4 }}
{ "content": "Adding some more tweets." }
{ "index": { "_id": 5 }}
{ "content": "Adding more and more tweets." }
```

`inelastic` could be used as follows (combined with the `column` command):

```bash
$ inelastic -i tweets -f content | column -t -s ,
```

Which would output:

```
term           freq  doc_count  d0  d1  d2
adding         2     2          4   5
an             1     1          3
and            1     1          5
elasticsearch  1     1          2
example        1     1          3
examples       1     1          2
first          1     1          1
is             2     2          1   3
more           3     2          4   5
most           1     1          2
my             1     1          1
some           1     1          4
this           2     2          1   3
tweet          1     1          1
tweets         3     3          2   4   5
use            1     1          2
```

The `freq` field specifies the total number of times the term appears across all documents, and the `doc_count` field specifies how many documents contain the term at least once. The `d0`, `d1`... fields list the IDs of the documents containing the term.
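To work with these fields programmatically, the CSV output can be loaded with Python's standard `csv` module. The following is a minimal sketch: `parse_inelastic_csv` is a hypothetical helper, and the sample text assumes the raw output is comma-separated with the same columns as the table above. Whether short rows are padded with empty cells is not stated here, so the sketch accepts both forms.

```python
import csv
import io


def parse_inelastic_csv(text):
    """Parse rows of the form term,freq,doc_count,d0,d1,... into a dict.

    Hypothetical helper; the column layout is assumed from the table above.
    """
    reader = csv.reader(io.StringIO(text))
    next(reader, None)  # skip the header row
    index = {}
    for row in reader:
        if not row:
            continue  # ignore blank lines
        term, freq, doc_count, *doc_ids = row
        index[term] = {
            "freq": int(freq),
            "doc_count": int(doc_count),
            # Drop empty cells in case short rows are padded.
            "doc_ids": [doc_id for doc_id in doc_ids if doc_id],
        }
    return index


# Sample rows copied from the table above (raw CSV form is an assumption).
sample = (
    "term,freq,doc_count,d0,d1,d2\n"
    "adding,2,2,4,5\n"
    "more,3,2,4,5,\n"
    "tweets,3,3,2,4,5\n"
)

parsed = parse_inelastic_csv(sample)
for term, entry in parsed.items():
    print(term, entry)
```

Note that `more` has a `freq` of 3 but a `doc_count` of 2, because it appears twice in document 5.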

The chosen document field's type must be `text` or `keyword`.

## Usage

These are the arguments `inelastic` accepts:

- `-i` (`--index`): Index name (**required**).

- `-f` (`--field`): Document field name from which to generate inverted index (**required**).

- `-l` (`--id-field`): Document field to use as ID when printing results (*default: _id*).

- `-o` (`--output`): Output format, `json` or `csv` (*default: csv*).

- `-p` (`--port`): Elasticsearch host port (*default: 9200*).

- `-e` (`--host`): Elasticsearch host address (*default: localhost*).

- `-q` (`--query`): Elasticsearch DSL JSON query to use when fetching documents (*default: None*).

- `-d` (`--doctype`): Document type (*default: _doc*) (**Elasticsearch 6.X only**).

- `-v` (`--verbose`): Print debug information (*default: False*).

- `-h` (`--help`): Show help and exit.
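The arguments above can also be assembled from a Python script. This is only a sketch: `inelastic_command` and `run_inelastic` are hypothetical helpers, and they assume `inelastic` is installed and available on the `PATH`. Only flags listed above are used.

```python
import shlex
import subprocess


def inelastic_command(index, field, host="localhost", port=9200, output="csv"):
    """Build the argument list for an inelastic invocation (hypothetical helper)."""
    return [
        "inelastic",
        "-i", index,      # index name (required)
        "-f", field,      # document field (required)
        "-e", host,       # Elasticsearch host address
        "-p", str(port),  # Elasticsearch host port
        "-o", output,     # "csv" or "json"
    ]


def run_inelastic(index, field, **options):
    """Run inelastic and return its standard output as text.

    Assumes the inelastic executable is on the PATH and a cluster is reachable.
    """
    result = subprocess.run(
        inelastic_command(index, field, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout


# Show the command that would be executed, without running it.
print(shlex.join(inelastic_command("tweets", "content", output="json")))
```

Passing the arguments as a list avoids shell quoting issues with index or field names.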

## Scripting

The `inelastic` module exposes the `InvertedIndex` class, which can be used in custom Python scripts:

```python
from inelastic import InvertedIndex

# Import the client that matches the cluster version (keep only one):
from elasticsearch import Elasticsearch  # Only with ES 7.X
# from elasticsearch6 import Elasticsearch  # Only with ES 6.X

es = Elasticsearch()
ii = InvertedIndex(search_size=250, scroll_time='10s')

# Read the 'content' field of every document in the 'tweets' index.
n_docs, errors = ii.read_index(es, 'tweets', 'content')
print('# docs: {}, # errors: {}'.format(n_docs, errors))

for entry in ii.to_list():
    print(entry)
```

When run, the previous script will output:

```
# docs: 5, # errors: 0
('adding', )
('an', )
('and', )
('elasticsearch', )
('example', )
('examples', )
('first', )
('is', )
('more', )
('most', )
('my', )
('some', )
('this', )
('tweet', )
('tweets', )
('use', )
```
