https://github.com/uglytoad/pragmaticsegmenternet
Port of PragmaticSegmenter for sentence boundary detection
https://github.com/uglytoad/pragmaticsegmenternet
nlp segmentation sentence sentence-boundary-detection sentence-segmentation
Last synced: 6 months ago
JSON representation
Port of PragmaticSegmenter for sentence boundary detection
- Host: GitHub
- URL: https://github.com/uglytoad/pragmaticsegmenternet
- Owner: UglyToad
- License: other
- Created: 2018-09-12T18:26:07.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2021-09-21T11:18:13.000Z (about 4 years ago)
- Last Synced: 2025-03-25T05:41:34.629Z (6 months ago)
- Topics: nlp, segmentation, sentence, sentence-boundary-detection, sentence-segmentation
- Language: C#
- Size: 209 KB
- Stars: 35
- Watchers: 1
- Forks: 12
- Open Issues: 4
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# PragmaticSegmenterNet #
[](https://ci.appveyor.com/project/EliotJones/pragmaticsegmenternet)
This project is a direct port of [Pragmatic Segmenter](https://github.com/diasks2/pragmatic_segmenter) which provides rule-based sentence
boundary detection.## Usage ##
The ```Segmenter``` class provides the ```Segment``` method which in the simplest usage takes a string:
using PragmaticSegmenterNet;
IReadOnlyList result = Segmenter.Segment("One Sentence. And another sentence.");
// ["One Sentence.", "And another sentence."]IReadOnlyList result2 = Segmenter.Segment("Anything.", Language.Italian);
// ["Anything"]
The Segment method has a number of optional parameters:
IReadOnlyList Segment(string text, Language language = Language.English, bool cleanText = true, DocumentType documentType = DocumentType.Any)
+ Language - An enum representing the supported languages, the default is English, see the supported languages list below for the list of currently supported languages.
+ CleanText - A boolean indicating whether the input text should be cleaned prior to segmentation. Cleaning removes extra newlines and whitespace. Defaults to ```true```.
+ DocumentType - Used by the text cleaning process to determine which reformatting to apply. For PDFs this handles newlines in the middle of a sentence whereas for HTML documents this will handle HMTL tags. Defaults to any which does not apply any special formatting.## Languages ##
+ English = 0 (default)
+ Amharic = 1
+ Arabic = 2
+ Armenian = 3
+ Bulgarian = 4
+ Burmese = 5
+ Chinese = 6
+ Danish = 7
+ Dutch = 8
+ French = 9
+ German = 10
+ Greek = 11
+ Hindi = 12
+ Italian = 13
+ Japanese = 14
+ Kazakh = 15 (partial support, potentially only for the Cyrillic form of the alphabet)
+ Persian = 16
+ Polish = 17
+ Russian = 18
+ Spanish = 19
+ Urdu = 20## Credit ##
This project wouldn't be possible without the work done by [Pragmatic Segmenter](https://github.com/diasks2/pragmatic_segmenter) team.