https://github.com/slub/esfstats
A Python3 program that extracts some statistics regarding field coverage from an elasticsearch index
- Host: GitHub
- URL: https://github.com/slub/esfstats
- Owner: slub
- License: apache-2.0
- Created: 2018-03-16T12:37:04.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2021-04-21T15:26:48.000Z (over 3 years ago)
- Last Synced: 2024-04-14T22:49:12.609Z (9 months ago)
- Topics: cli, elasticsearch, python, statistics
- Language: Python
- Size: 31.3 KB
- Stars: 3
- Watchers: 14
- Forks: 3
- Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE.md
# esfstats - elasticsearch fields statistics
esfstats is a command-line tool (a Python3 program) that extracts field coverage statistics from an elasticsearch index.
## Usage
```
esfstats

required arguments:
  -index INDEX  elasticsearch index to use (default: None)
  -type TYPE    elasticsearch index (document) type to use (default: None)

optional arguments:
  -h, --help    show this help message and exit
  -host HOST    hostname or IP address of the elasticsearch instance to use (default: localhost)
  -port PORT    port of the elasticsearch instance to use (default: 9200)
  -marc         ignore MARC indicator, i.e., combine only MARC tag + MARC code (valid/applicable for input generated with help of xbib/marc (https://github.com/xbib/marc) or input MARC JSON records that follow this structure) (default: False)
  -csv-output   prints the output as pure CSV data (all values are quoted) (default: False)
```
* example:
```
esfstats -host [HOSTNAME OF YOUR ELASTICSEARCH INSTANCE] -index [YOUR ELASTICSEARCH INDEX] -type [DOCUMENT TYPE OF THE ELASTICSEARCH INDEX] > [OUTPUT STATISTICS DOCUMENT]
```
### Note
When using this command with the argument '-marc', the input JSON records need to be generated with the help of [xbib/marc](https://github.com/xbib/marc) (e.g. via [marc2jsonl](https://github.com/slub/marc2jsonl)), or they need to follow at least this structure (otherwise the results may be unexpected).
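The usage text and example above can also be scripted. The following is a minimal sketch, assuming `esfstats` is installed and on the `PATH`; the helper names `build_esfstats_command` and `run_esfstats` are hypothetical and not part of esfstats. It only uses the flags listed in the usage text.

```python
import shlex
import subprocess


def build_esfstats_command(index, doc_type, host="localhost", port=9200,
                           marc=False, csv_output=False):
    """Assemble the esfstats argument list from the options in the usage text.

    Hypothetical helper: the defaults mirror the ones shown in the help output.
    """
    command = ["esfstats", "-host", host, "-port", str(port),
               "-index", index, "-type", doc_type]
    if marc:
        command.append("-marc")
    if csv_output:
        command.append("-csv-output")
    return command


def run_esfstats(command, output_path):
    """Run esfstats and write its standard output to output_path.

    This is the scripted equivalent of the `> [OUTPUT STATISTICS DOCUMENT]`
    redirection in the example above.
    """
    with open(output_path, "w", encoding="utf-8") as output_file:
        subprocess.run(command, stdout=output_file, check=True)


if __name__ == "__main__":
    # "my-index" and "record" are placeholder values for illustration only.
    command = build_esfstats_command("my-index", "record", csv_output=True)
    print(shlex.join(command))
```

Building the command as a list (rather than a single shell string) avoids quoting problems when index or host names contain unusual characters.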
## Requirements
[elasticsearch-py](http://elasticsearch-py.rtfd.org/)
e.g.
```
apt-get install python-elasticsearch
```
## Run
* install elasticsearch-py
* clone this git repo or just download the esfstats.py file
* run ./esfstats.py
* for a hackish way to use esfstats system-wide, copy to /usr/local/bin
### Install system-wide via pip
* via pip:
```
sudo -H pip3 install --upgrade [ABSOLUTE PATH TO YOUR LOCAL GIT REPOSITORY OF ESFSTATS]
```
(which provides you with `esfstats` as a system-wide command-line command)
## Description
(of the column headers in the resulting statistics)
### ... in English
#### existing
* number of records that contain this field (path), i.e., field coverage
#### %
* ^ percentage of 'existing'
* (existing / Total Records * 100)
#### notexisting
* number of records that do not contain this field (path)
#### !%
* ^ percentage of 'notexisting'
* (notexisting / Total Records * 100)
#### occurrence
* total number of occurrences of this field (path) over all records, i.e., an indicator for fields where multiple values are allowed
#### unique (appr.)
* number of unique/distinct values of this field (path), i.e., cardinality
* note: this value is an approximation
#### field name
* the field (path) of this statistics line
### ... translated from German
Explanation of the column headers
#### existing
* indicates how many fields with this path exist
#### %
* existing as a percentage
* existing / Total Records * 100
#### notexisting
* indicates how many records do not have this path
#### !%
* notexisting as a percentage
* notexisting / Total Records * 100
#### occurrence
* indicates how many values exist for this path (multiple values per record)
#### unique (appr.)
* indicates how many unique values are found in this path
* note: this value is only calculated approximately, i.e., it may be inaccurate
#### field name
* the path to the analysed values
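
To make the column definitions above concrete, here is a small in-memory sketch that computes the same columns for a list of JSON-like records. It is an illustration only and not how esfstats itself works (esfstats queries an elasticsearch index); the function names `iter_field_values` and `field_statistics` are hypothetical, and `unique` is counted exactly here, whereas esfstats reports an approximated value.

```python
from collections import defaultdict


def iter_field_values(value, path=""):
    """Yield (field path, leaf value) pairs for a nested JSON-like record."""
    if isinstance(value, dict):
        for key, child in value.items():
            child_path = f"{path}.{key}" if path else key
            yield from iter_field_values(child, child_path)
    elif isinstance(value, list):
        # Items of a list share the path of the list (multi-valued field).
        for item in value:
            yield from iter_field_values(item, path)
    else:
        yield path, value


def field_statistics(records):
    """Compute the statistics columns per field path for a list of records."""
    total = len(records)
    existing = defaultdict(int)    # records that contain the path
    occurrence = defaultdict(int)  # values found under the path, over all records
    distinct = defaultdict(set)    # distinct values under the path (exact here)
    for record in records:
        seen = set()
        for path, leaf in iter_field_values(record):
            occurrence[path] += 1
            distinct[path].add(leaf)
            seen.add(path)
        for path in seen:
            existing[path] += 1
    stats = {}
    for path in sorted(existing):
        not_existing = total - existing[path]
        stats[path] = {
            "existing": existing[path],
            "%": existing[path] / total * 100,
            "notexisting": not_existing,
            "!%": not_existing / total * 100,
            "occurrence": occurrence[path],
            "unique": len(distinct[path]),
        }
    return stats


# Made-up sample records for illustration only.
records = [
    {"title": "A", "subject": ["x", "y"]},
    {"title": "B"},
    {"title": "A", "subject": ["x"]},
    {"id": 4},
]
for field_name, row in field_statistics(records).items():
    print(field_name, row)
```

In this sample, `subject` exists in 2 of 4 records (50 %) but occurs 3 times, which shows how `occurrence` exceeds `existing` for fields that allow multiple values.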