https://github.com/o19s/elyzer
"Stop worrying about Elasticsearch analyzers", my therapist says
- Host: GitHub
- URL: https://github.com/o19s/elyzer
- Owner: o19s
- License: apache-2.0
- Created: 2015-09-17T02:36:43.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2021-06-01T21:49:27.000Z (almost 4 years ago)
- Last Synced: 2025-04-02T08:38:10.376Z (about 1 month ago)
- Topics: custom-analyzer, elasticsearch, elasticsearch-analyzers
- Language: Python
- Homepage:
- Size: 33.2 KB
- Stars: 155
- Watchers: 11
- Forks: 15
- Open Issues: 7
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
# Elyzer
See step-by-step how Elasticsearch custom analyzers decompose your text into tokens.
My therapist said this would be a good idea...
Doug: hi
Eliza: How are you today.. What would you like to discuss?
Doug: well I can't really see how my Elasticsearch analyzers work, on the inside
Eliza: How do you know you can't really see how your Elasticsearch analyzers work, on the inside?
Doug: cause I call _analyze, and it does the whole analyzer or just one step
Eliza: What other reasons might there be?
Doug: I guess nobody's gotten around to making a good debugger
Eliza: I'm not sure I understand you fully.
Doug: well just keep reading this README then...
Eliza: Come, come, elucidate your thoughts.

# Installation
Install through pip3 (supports ES 2.x & 5.x):

```
pip3 install elyzer
```
# Usage
Give Elyzer some text and have it analyzed, with the process broken down step-by-step on the command line.
Assuming I'd created an `english_bigrams` custom analyzer for the index `tmdb`, I would run the analyzer like so:
```
$ elyzer --es "http://localhost:9200" --index tmdb --analyzer english_bigrams "Mary had a little lamb"
TOKENIZER: standard
{1:Mary} {2:had} {3:a} {4:little} {5:lamb}
TOKEN_FILTER: standard
{1:Mary} {2:had} {3:a} {4:little} {5:lamb}
TOKEN_FILTER: lowercase
{1:mary} {2:had} {3:a} {4:little} {5:lamb}
TOKEN_FILTER: porter_stem
{1:mari} {2:had} {3:a} {4:littl} {5:lamb}
TOKEN_FILTER: bigram_filter
{1:mari had} {2:had a} {3:a littl} {4:littl lamb}
```

Output is each token, prefixed by its numerical position attribute in the token stream, at each step.
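Under the hood, this kind of step-by-step view can be produced by reading the analyzer definition from the index settings and replaying the chain through `_analyze`, one token filter at a time. A minimal sketch of that cumulative-filter idea (hypothetical helper name, not elyzer's actual code):

```python
def step_requests(analyzer, text):
    """Build one _analyze request body per pipeline step:
    char filters + tokenizer first, then one request per token
    filter, applied cumulatively."""
    base = {
        "char_filter": analyzer.get("char_filter", []),
        "tokenizer": analyzer["tokenizer"],
        "text": text,
    }
    filters = analyzer.get("filter", [])
    # Step 0: tokenizer only; step i: tokenizer plus the first i token filters.
    return [dict(base, filter=filters[:i]) for i in range(len(filters) + 1)]

english_bigrams = {
    "tokenizer": "standard",
    "filter": ["lowercase", "porter_stem", "bigram_filter"],
}
# Each body could be POSTed to http://localhost:9200/tmdb/_analyze;
# diffing consecutive responses shows what each filter did.
for body in step_requests(english_bigrams, "Mary had a little lamb"):
    print(body["filter"])
```

Diffing the responses between consecutive steps is what lets a tool attribute each change (lowercasing, stemming, shingling) to the filter that caused it.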
## Args
There are four required command line args:
- es: the Elasticsearch host (e.g. `http://localhost:9200`)
- index: name of the index where your custom analyzer can be found
- analyzer: name of your custom analyzer
- text: the text to analyze

# Shortcomings
aka "Areas for Improvement"
- Only works for custom analyzers right now (as it accesses the settings for your index)
- Attributes besides the token text and position would be handy

## Who?
Created by [OpenSource Connections](http://opensourceconnections.com)
## License
Released under [Apache 2](LICENSE.txt)