https://github.com/bryant/punkt
Unsupervised multilingual sentence segmentation.
- Host: GitHub
- URL: https://github.com/bryant/punkt
- Owner: bryant
- License: MIT
- Created: 2014-10-10T04:38:28.000Z (over 11 years ago)
- Default Branch: master
- Last Pushed: 2021-02-26T00:03:57.000Z (over 5 years ago)
- Last Synced: 2025-12-08T11:17:11.595Z (6 months ago)
- Language: Haskell
- Size: 262 KB
- Stars: 21
- Watchers: 2
- Forks: 5
- Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE
punkt
=====
Multilingual unsupervised sentence tokenization with Punkt.
## Usage
Abbreviations are detected at run time from the input text itself, without the
aid of a pre-built abbreviation list:
```haskell
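import Data.Text (Text, pack)
import NLP.Punkt (split_sentences)

-- Sample input; "Mr." and "T." should not be treated as sentence ends.
corpus :: Text
corpus = pack "Look, Ma! The quick brown Mr. T. rex swallowed the lazy dog. \
              \It really did!"

-- Print each detected sentence on its own line.
main :: IO ()
main = mapM_ print (split_sentences corpus)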
```
This yields:
```
"Look, Ma!"
"The quick brown Mr. T. rex swallowed the lazy dog."
"It really did!"
```
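The same function can be applied to larger inputs. Below is a minimal sketch that reads text from standard input and prints one numbered sentence per line. It assumes `split_sentences` has type `Text -> [Text]`, as the usage above suggests. The `numbered` helper is hypothetical and not part of this package.

```haskell
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import NLP.Punkt (split_sentences)

-- Hypothetical helper (not part of punkt): prefix each sentence with its
-- 1-based position in the list.
numbered :: [T.Text] -> [T.Text]
numbered = zipWith label [1 :: Int ..]
  where
    label i s = T.pack (show i ++ ". ") <> s

-- Read all of stdin, split it into sentences, and print them numbered.
main :: IO ()
main = do
    input <- TIO.getContents
    mapM_ TIO.putStrLn (numbered (split_sentences input))
```

Since abbreviations are learned from the input itself, passing the whole document in one call is likely to work better than splitting it paragraph by paragraph.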
## References
Kiss, Tibor, and Jan Strunk. "Unsupervised multilingual sentence boundary
detection." Computational Linguistics 32.4 (2006): 485-525.
## TODO
- parallelize
- modularize tokenization
- custom tokenization rules
- needs to go fasterer