{"id":21660193,"url":"https://github.com/slub/esfstats","last_synced_at":"2025-09-01T16:37:06.089Z","repository":{"id":41281068,"uuid":"125514383","full_name":"slub/esfstats","owner":"slub","description":"A Python3 program that extracts some statistics regarding field coverage from an elasticsearch index","archived":false,"fork":false,"pushed_at":"2021-04-21T15:26:48.000Z","size":32,"stargazers_count":4,"open_issues_count":4,"forks_count":4,"subscribers_count":13,"default_branch":"master","last_synced_at":"2025-03-25T18:41:15.535Z","etag":null,"topics":["cli","elasticsearch","python","statistics"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/slub.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-03-16T12:37:04.000Z","updated_at":"2024-04-17T19:48:07.000Z","dependencies_parsed_at":"2022-09-21T01:00:50.161Z","dependency_job_id":null,"html_url":"https://github.com/slub/esfstats","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/slub%2Fesfstats","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/slub%2Fesfstats/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/slub%2Fesfstats/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/slub%2Fesfstats/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/slub","download_url":"https://codeload.github.com/slub/esfstats/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://
github.com","kind":"github","repositories_count":248493022,"owners_count":21113159,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cli","elasticsearch","python","statistics"],"created_at":"2024-11-25T09:32:28.964Z","updated_at":"2025-04-11T22:41:15.539Z","avatar_url":"https://github.com/slub.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cimg alt=\"EFRE-Lod logo\" src=\"https://raw.githubusercontent.com/slub/data.slub-dresden.de/master/assets/images/EFRE_EU_quer_2015_rgb_engl.svg\" width=\"300\" \u003e\n\n# esfstats - elasticsearch fields statistics\n\nesfstats is a commandline command (Python3 program) that extracts some statistics regarding field coverage from an elasticsearch index.\n\n## Usage\n\n```\nesfstats\n        required arguments:\n          -index INDEX  elasticsearch index to use (default: None)\n          -type TYPE    elasticsearch index (document) type to use (default: None)\n\n        optional arguments:\n          -h, --help    show this help message and exit\n          -host HOST    hostname or IP address of the elasticsearch instance to use (default: localhost)\n          -port PORT    port of the elasticsearch instance to use (default: 9200)\n          -marc         ignore MARC indicator, i.e., combine only MARC tag + MARC code (valid/applicable for input generated with help of xbib/marc (https://github.com/xbib/marc) or input MARC JSON records that follow this structure) (default: False)\n          -csv-output   prints the output as pure CSV data (all values are quoted)\n                        
(default: False)\n```\n\n* example:\n    ```\n    esfstats -host [HOSTNAME OF YOUR ELASTICSEARCH INSTANCE] -index [YOUR ELASTICSEARCH INDEX] -type [DOCUMENT TYPE OF THE ELASTICSEARCH INDEX] \u003e [OUTPUT STATISTICS DOCUMENT]\n    ```\n\n### Note\n\nWhen using this command with the '-marc' argument, the input JSON records need to be generated with the help of [xbib/marc](https://github.com/xbib/marc) (e.g. via [marc2jsonl](https://github.com/slub/marc2jsonl)), or they need to follow at least this structure (otherwise the results may be unexpected).\n\n## Requirements\n\n[elasticsearch-py](http://elasticsearch-py.rtfd.org/)\n\ne.g.\n```\napt-get install python-elasticsearch\n```\n\n## Run\n\n* install elasticsearch-py\n* clone this git repo or just download the esfstats.py file\n* run ./esfstats.py\n* for a hackish way to use esfstats system-wide, copy to /usr/local/bin\n\n### Install system-wide via pip\n\n* via pip:\n    ```\n    sudo -H pip3 install --upgrade [ABSOLUTE PATH TO YOUR LOCAL GIT REPOSITORY OF ESFSTATS]\n    ```\n    (which provides ```esfstats``` as a system-wide command)\n\n## Description\n\n(of the column headers of a resulting statistic)\n\n### ... in English\n\n#### existing\n* number of records that contain this field (path), i.e., field coverage\n\n#### %\n* ^ percentage of 'existing'\n* (existing / Total Records * 100)\n\n#### notexisting\n* number of records that do not contain this field (path)\n\n#### !%\n* ^ percentage of 'notexisting'\n* (notexisting / Total Records * 100)\n\n#### occurrence\n* total number of occurrences of this field (path) across all records, i.e., an indicator for fields where multiple values are allowed\n\n#### unique (appr.)\n* number of unique/distinct values of this field (path), i.e., cardinality\n* note: this value is an approximation\n\n#### field name\n* the field (path) of this statistic line\n\n### ... 
translated from German\n\nExplanation of the column headers\n\n#### existing\n* number of records in which this path exists\n\n#### %\n* 'existing' as a percentage\n* existing / Total Records * 100\n\n#### notexisting\n* number of records that do not have this path\n\n#### !%\n* 'notexisting' as a percentage\n* notexisting / Total Records * 100\n\n#### occurrence\n* number of values present for this path (multiple values per record are counted)\n\n#### unique (appr.)\n* number of unique values found in this path\n* note: this value is only approximated, i.e., it may be inaccurate\n\n#### field name\n* the path to the analysed values\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fslub%2Fesfstats","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fslub%2Fesfstats","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fslub%2Fesfstats/lists"}