https://github.com/takuti/lucene-ja-userdict-neologd

Create Lucene user-defined dictionary with NEologd

kuromoji lucene mecab natural-language-processing neologd solr

Host: GitHub
URL: https://github.com/takuti/lucene-ja-userdict-neologd
Owner: takuti
License: apache-2.0
Created: 2018-07-06T00:12:38.000Z (almost 7 years ago)
Default Branch: master
Last Pushed: 2018-07-06T00:12:50.000Z (almost 7 years ago)
Last Synced: 2025-02-12T07:54:29.550Z (4 months ago)
Topics: kuromoji, lucene, mecab, natural-language-processing, neologd, solr
Language: Python
Size: 6.84 KB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

Apache Lucene Japanese User-Defined Dictionary with NEologd
===

This Python script lets you use [NEologd](https://github.com/neologd/mecab-ipadic-neologd), a neologism dictionary for [the MeCab morphological analyzer](http://taku910.github.io/mecab/), to improve Japanese tokenization in Apache Lucene.

[Lucene's custom dictionary](https://lucene.apache.org/core/7_4_0/analyzers-kuromoji/index.html) does not let us specify word scores, which are needed for more accurate tokenization. The script therefore aggressively filters out less important words from the original NEologd dictionary (e.g., non-noun words and short words that the standard tokenizer can probably tokenize already).
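The filtering idea can be sketched as follows. This is a minimal illustration, not the logic of `build.py` itself: it assumes the input rows follow the MeCab IPADIC column layout (surface form first, top-level part of speech in the fifth column), and the names `keep_entry`, `filter_entries`, and the `MIN_SURFACE_LENGTH` threshold are hypothetical. The sample rows are made up, with placeholder IDs and costs.

```python
import csv
import io

# Assumed MeCab IPADIC-style column layout:
# surface, left_id, right_id, cost, pos, pos1, pos2, pos3,
# conj_type, conj_form, base_form, reading, pronunciation
SURFACE_COL = 0
POS_COL = 4

# Hypothetical threshold; the real script may use a different rule.
MIN_SURFACE_LENGTH = 3


def keep_entry(row, min_length=MIN_SURFACE_LENGTH):
    """Return True for noun entries whose surface form is long enough."""
    if len(row) <= POS_COL:
        return False  # malformed row
    surface, pos = row[SURFACE_COL], row[POS_COL]
    return pos == "名詞" and len(surface) >= min_length


def filter_entries(lines, min_length=MIN_SURFACE_LENGTH):
    """Yield the surface form of every row that passes keep_entry."""
    for row in csv.reader(lines):
        if keep_entry(row, min_length):
            yield row[SURFACE_COL]


# Made-up rows for illustration; IDs and costs are placeholders.
sample = io.StringIO(
    "ハインリヒ・ベル,0,0,0,名詞,固有名詞,人名,一般,*,*,ハインリヒ・ベル,*,*\n"
    "東京,0,0,0,名詞,固有名詞,地域,一般,*,*,東京,*,*\n"
    "走り抜ける,0,0,0,動詞,自立,*,*,一段,基本形,走り抜ける,*,*\n"
)
kept = list(filter_entries(sample))
print(kept)
```

Here the two-character noun and the verb are dropped, and only the longer noun survives. Raising `min_length` makes the filter more aggressive, at the cost of discarding more legitimate entries.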

Note that the tokenizer implemented in Lucene is based on [Kuromoji](http://www.atilika.org/), and the latest version of Kuromoji already supports a custom dictionary format with word scores: [atilika/kuromoji#91](https://github.com/atilika/kuromoji/pull/91). Unfortunately, Lucene does not include that update at the moment.

## Usage

The following command downloads the latest version of NEologd and converts it into the Lucene user-defined dictionary format:

```sh
$ python build.py
```
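For intuition, the conversion into the four-column user dictionary layout (surface form, segmentation, reading, part of speech) might look roughly like the sketch below. It is an assumption about the approach rather than the actual code in `build.py`: the helper names `to_userdict_row` and `write_userdict` are hypothetical, and joining whitespace-separated parts with a hyphen is inferred from the generated file, where `ten create` appears with the segmentation `ten-create`.

```python
import csv
import io


def to_userdict_row(surface, pos="名詞"):
    """Build one four-column user dictionary row for a single-token entry."""
    # Keep the entry as one token: spaces in the segmentation column
    # are read as token boundaries by Lucene's user dictionary format.
    segmentation = "-".join(surface.split())
    reading = ""  # left empty, as in the generated file
    return [surface, segmentation, reading, pos]


def write_userdict(surfaces, out):
    """Write one CSV row per surface form to the text stream `out`."""
    writer = csv.writer(out, lineterminator="\n")
    for surface in surfaces:
        writer.writerow(to_userdict_row(surface))


buffer = io.StringIO()
write_userdict(["ten create", "佐竹笙悟"], buffer)
print(buffer.getvalue(), end="")
```

Using `csv.writer` rather than manual string joining means a surface form that contains a comma would be quoted instead of silently shifting the columns.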

When the script finishes, it creates `lucene-ja-userdict-neologd.csv.gz`. The CSV file contains custom entries for better Japanese tokenization, such as:

```
ten create,ten-create,,名詞
ハインリヒ・ベル,ハインリヒ・ベル,,名詞
佐竹笙悟,佐竹笙悟,,名詞
小貝澤,小貝澤,,名詞
神田達成,神田達成,,名詞
西村ツチカ,西村ツチカ,,名詞
愛と涙の蔵出し物語,愛と涙の蔵出し物語,,名詞
ヨンマンキュウセンヒャクエン,ヨンマンキュウセンヒャクエン,,名詞
mfブックス,mfブックス,,名詞
ハイイロガン,ハイイロガン,,名詞
...
```
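To inspect the generated file before handing it to Lucene, something like the following sketch could be used. It assumes the archive is UTF-8 encoded and that every row has exactly four columns, as in the excerpt above; `read_userdict` is a hypothetical helper, and the demo writes a small temporary file instead of reading the real output.

```python
import csv
import gzip
import os
import tempfile


def read_userdict(path):
    """Return a list of (surface, segmentation, reading, pos) tuples."""
    entries = []
    # Assumption: the gzip archive holds UTF-8 encoded CSV text.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
        for row in csv.reader(f):
            if len(row) != 4:
                raise ValueError(f"expected 4 columns, got {len(row)}: {row!r}")
            entries.append(tuple(row))
    return entries


# Demo on a tiny temporary file built from two rows of the excerpt above.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv.gz")
    with gzip.open(path, "wt", encoding="utf-8", newline="") as f:
        f.write("ten create,ten-create,,名詞\n佐竹笙悟,佐竹笙悟,,名詞\n")
    entries = read_userdict(path)

print(entries[0])
```

To check the real output, pass the path of `lucene-ja-userdict-neologd.csv.gz` to `read_userdict` instead of the temporary file.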
