https://github.com/wwwcojp/ja_sentence_segmenter

japanese sentence segmentation library for python
nlp python rule-based sentence-boundary-detection sentence-tokenizer

Host: GitHub
URL: https://github.com/wwwcojp/ja_sentence_segmenter
Owner: wwwcojp
License: mit
Created: 2019-12-15T13:51:31.000Z (over 6 years ago)
Default Branch: main
Last Pushed: 2023-04-03T04:10:03.000Z (about 3 years ago)
Last Synced: 2025-12-16T14:45:46.291Z (6 months ago)
Topics: nlp, python, rule-based, sentence-boundary-detection, sentence-tokenizer
Language: Python
Homepage: https://wwwcojp.github.io/ja_sentence_segmenter/ja_sentence_segmenter.html
Size: 156 KB
Stars: 73
Watchers: 1
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE

# ja_sentence_segmenter

Performs rule-based sentence segmentation on Japanese text.

## Getting Started

### Prerequisites

* Python 3.6+

### Installing

`pip install ja_sentence_segmenter`

### Usage

```Python
import functools

from ja_sentence_segmenter.common.pipeline import make_pipeline
from ja_sentence_segmenter.concatenate.simple_concatenator import concatenate_matching
from ja_sentence_segmenter.normalize.neologd_normalizer import normalize
from ja_sentence_segmenter.split.simple_splitter import split_newline, split_punctuation

# Split on the Japanese full stop and on "!" and "?".
split_punc2 = functools.partial(split_punctuation, punctuations=r"。!?")
# Join a line that ends with "の" to the line that follows it, keeping the "の".
concat_tail_no = functools.partial(concatenate_matching, former_matching_rule=r"^(?P<result>.+)(の)$", remove_former_matched=False)
# Stages run in the order given.
segmenter = make_pipeline(normalize, split_newline, concat_tail_no, split_punc2)

# Golden Rule: Simple period to end sentence #001 (from https://github.com/diasks2/pragmatic_segmenter/blob/master/spec/pragmatic_segmenter/languages/japanese_spec.rb#L6)
text1 = "これはペンです。それはマーカーです。"
print(list(segmenter(text1)))
```

```
> ["これはペンです。", "それはマーカーです。"]
```
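
To make the pipeline idea above more concrete, here is a minimal, standard-library-only sketch of the same three steps: split on newlines, rejoin a line that ends with "の" with the next one, then split on sentence-ending punctuation. Every name ending in `_sketch` is hypothetical and is not part of the `ja_sentence_segmenter` API; the sketch also skips normalization and does not treat quotes or brackets specially, so it should not be read as how the library is implemented.

```python
import functools
import re


def make_pipeline_sketch(*stages):
    """Chain stages left to right; each stage maps an iterable of str to an iterable of str."""
    def run(text):
        return functools.reduce(lambda chunks, stage: stage(chunks), stages, [text])
    return run


def split_newline_sketch(chunks):
    """Yield each non-empty line of every chunk."""
    for chunk in chunks:
        for line in chunk.splitlines():
            if line:
                yield line


def concat_tail_sketch(chunks, tail="の"):
    """Join a chunk that ends with `tail` to the chunk that follows it."""
    pending = ""
    for chunk in chunks:
        pending += chunk
        if not pending.endswith(tail):
            yield pending
            pending = ""
    if pending:
        yield pending


def split_punctuation_sketch(chunks, punctuations="。!?"):
    """Cut each chunk after every run of sentence-ending punctuation."""
    p = re.escape(punctuations)
    pattern = re.compile(f"[^{p}]*[{p}]+|[^{p}]+")
    for chunk in chunks:
        yield from pattern.findall(chunk)


segmenter_sketch = make_pipeline_sketch(
    split_newline_sketch, concat_tail_sketch, split_punctuation_sketch
)

# The line break after "の" is a wrap inside a sentence, not a sentence boundary.
print(list(segmenter_sketch("これはペンです。それは私の\nマーカーです。")))
# ['これはペンです。', 'それは私のマーカーです。']
```

Running the concatenation step before the punctuation split matters here: the wrapped line is repaired first, so the final split only sees complete sentences.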

## Versioning

We use SemVer for versioning. For the versions available, see the tags on this repository.

## Contributing

TODO

## License

MIT

## Acknowledgments

### Text normalization

The text normalization code is based on the following wiki page of [mecab-ipadic-NEologd](https://github.com/neologd/mecab-ipadic-neologd), with some modifications.

Thanks to hideaki-t and overlast, who provided the sample code.

https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja#python-written-by-hideaki-t--overlast

### Sentence segmentation rules

The sentence segmentation rules are based on the Japanese rules of [Pragmatic Segmenter](https://github.com/diasks2/pragmatic_segmenter).

https://github.com/diasks2/pragmatic_segmenter#golden-rules-japanese

In addition, this project's tests reuse the test data from the following test code.

https://github.com/diasks2/pragmatic_segmenter/blob/master/spec/pragmatic_segmenter/languages/japanese_spec.rb

Thanks to the author, Kevin S. Dias, and the [contributors](https://github.com/diasks2/pragmatic_segmenter/graphs/contributors).
