https://github.com/bryant/punkt

Unsupervised multilingual sentence segmentation.

Host: GitHub
URL: https://github.com/bryant/punkt
Owner: bryant
License: mit
Created: 2014-10-10T04:38:28.000Z (over 11 years ago)
Default Branch: master
Last Pushed: 2021-02-26T00:03:57.000Z (over 5 years ago)
Last Synced: 2025-12-08T11:17:11.595Z (6 months ago)
Language: Haskell
Size: 262 KB
Stars: 21
Watchers: 2
Forks: 5
Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE

README

punkt
=====

Multilingual unsupervised sentence tokenization with Punkt.

## Usage

Note that abbreviations are detected at run time without the aid of a pre-built
abbreviation list:

```haskell
import Data.Text (Text, pack)
import NLP.Punkt (split_sentences)

corpus :: Text
corpus = pack "Look, Ma! The quick brown Mr. T. rex swallowed the lazy dog. \
              \It really did!"

main :: IO ()
main = mapM_ print (split_sentences corpus)
```

yields:

```
"Look, Ma!"
"The quick brown Mr. T. rex swallowed the lazy dog."
"It really did!"
```
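To segment text from a file or pipe instead of a literal, something like the following sketch may work. It assumes `split_sentences` has the type `Text -> [Text]`, which is inferred from the example above and not taken from the library's documentation. `sentenceLines` is a hypothetical helper name and is not part of `NLP.Punkt`.

```haskell
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import NLP.Punkt (split_sentences)

-- Hypothetical helper (not part of NLP.Punkt): split the input into
-- sentences and join them back together, one sentence per line.
-- Assumes split_sentences :: Text -> [Text], as the example above suggests.
sentenceLines :: T.Text -> T.Text
sentenceLines = T.unlines . split_sentences

main :: IO ()
main = TIO.interact sentenceLines  -- read all of stdin, write to stdout
```

Compiled as, say, `segment`, it could be run as `segment < corpus.txt`. Because abbreviations are detected from the input itself, the whole text is passed to `split_sentences` in one call and not line by line.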

## References

Kiss, Tibor, and Jan Strunk. "Unsupervised multilingual sentence boundary
detection." Computational Linguistics 32.4 (2006): 485-525.

## TODO

- parallelize
- modularize tokenization
  - custom tokenization rules
- needs to go faster
