https://github.com/codelibs/esanpy
Python Text Analyzer based on Elasticsearch
https://github.com/codelibs/esanpy
analyzer python
Last synced: about 2 months ago
JSON representation
Python Text Analyzer based on Elasticsearch
- Host: GitHub
- URL: https://github.com/codelibs/esanpy
- Owner: codelibs
- License: apache-2.0
- Created: 2017-10-01T06:26:46.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2017-12-14T06:41:10.000Z (over 7 years ago)
- Last Synced: 2025-04-20T01:33:36.262Z (2 months ago)
- Topics: analyzer, python
- Language: Python
- Homepage:
- Size: 24.4 KB
- Stars: 9
- Watchers: 7
- Forks: 1
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Esanpy: Elasticsearch based Analyzer for Python [](https://travis-ci.org/codelibs/esanpy)
Esanpy is Python Text Analyzer based on Elasticsearch.
Using Elasticsearch, Esanpy provides powerful and fully-customizable text analysis.
Since Esanpy manages Elasticsearch instance internally, you DO NOT need to install/configure Elasticsearch.## Install Esanpy
$ pip install esanpy
If you want to install development version, run as below:
$ git clone https://github.com/codelibs/esanpy.git
$ cd esanpy
$ pip install .### Requirement
* Python 2.7 or 3.4-3.6
* Java 8 or above## Python
First of all, import esanpy module.
```
import esanpy
```### Start Server
To access to Elasticsearch, use `start_server` function.
This function downloads/configures embedded elasticsearch and plugins, and then start Elasticsearch instance.
The elasticsearch is saved in `~/.esanpy` directory.
If they are configured, this function just start elasticsearch instance.```
esanpy.start_server()
```### Analyze Text
Esanpy provides `analyzer` and `custom_analyzer` function.
```
tokens = esanpy.analyzer("This is a pen.")
# tokens = ["this", "is", "a", "pen"]
```To use other analyzer, set an analyzer name with `analyzer`.
```
tokens = esanpy.analyzer("今日の天気は晴れです。", analyzer="kuromoji")
````custom_analyzer` has `tokenizer`, `token_filter` and `char_filter` as arguments.
```
tokens = esanpy.custom_analyzer('this is a test',
tokenizer="keyword",
token_filter=["lowercase"],
char_filter=["html_strip"])
```For Elasticsearch Analyze API, see [Analyze](https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html).
### Stop Server
To stop Elasticsearch, use `stop_server()`.
```
esanpy.stop_server()
```## Command
Esanpy provides `esanpy` command.
```
$ esanpy --text "This is a pen."
this
is
a
pen
````esanpy` starts Elasticsearch if it does not run.
So, it takes time to start it, but it will be fast after that because Elasticsearch instance is reused.To change analyzer, use `--analyzer` option.
```
$ esanpy --text 今日の天気は晴れです。 --analyzer kuromoji
今日
天気
晴れ
````--stop` opition stops Elasticsearch instance on the command exit.
```
$ esanpy --text "This is a pen." --stop
```## Advance Usecases
### Register Analyzer
You can register own analyzers by `create_analysis`.
To register analyzers with `my_analyzers` namespace:```
esanpy.create_analysis('my_analyzers',
char_filter={
"mapping_ja_filter": {
"type": "mapping",
"mappings_path": mapping_file
}
},
tokenizer={
"kuromoji_user_dict": {
"type": "kuromoji_tokenizer",
"mode": "normal",
"user_dictionary": userdict_file,
"discard_punctuation": False
}
},
token_filter={
"ja_stopword": {
"type": "ja_stop",
"stopwords": [
"行く"
]
}
},
analyzer={
"kuromoji_analyzer": {
"type": "custom",
"char_filter": ["mapping_ja_filter"],
"tokenizer": "kuromoji_user_dict",
"filter": ["ja_stopword"]
}
}
)
```To use kuromoji_analyzer, invoke `analyzer` with a namespace and analyzer:
```
tokens = esanpy.analyzer('①東京スカイツリーに行く',
analyzer="kuromoji_analyzer",
namespace='my_analyzers')
# tokens = ['1', '東京スカイツリー', 'に']
```To delete namespace, use `delete_analysis`:
```
esanpy.delete_analysis('my_analyzers')
```For more information, see [Analysis](https://www.elastic.co/guide/en/elasticsearch/reference/5.6/analysis.html).
### Use Kuromoji Neologd
Installing analysis-kuromoji-neologd plugin, you can use Nelogd analyzer.
To install it, use `--plugin` option.```
$ esanpy --stop
$ esanpy --plugin org.codelibs:elasticsearch-analysis-kuromoji-neologd:5.6.1
```After installation, `kuromoji_neologd` analyzer is available.
```
$ esanpy --text 今日の天気は晴れです。 --analyzer kuromoji_neologd
今日の天気
晴れ
```### Uninstall Esanpy
To remove Esanpy, check/kill processes:
```
$ ps aux | grep esanpy
$ kill [above PIDs]
```and then remove `~/.esanpy` directory:
```
$ rm -rf ~/.esanpy
```