https://github.com/ugermann/ssplit-cpp
Approximate reimplementation of the sentence splitter from the Moses toolkit.
https://github.com/ugermann/ssplit-cpp
Last synced: about 1 year ago
JSON representation
Approximate reimplementation of the sentence splitter from the Moses toolkit.
- Host: GitHub
- URL: https://github.com/ugermann/ssplit-cpp
- Owner: ugermann
- License: other
- Created: 2019-09-16T19:06:07.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2022-04-12T14:24:55.000Z (about 4 years ago)
- Last Synced: 2025-02-06T00:23:01.991Z (over 1 year ago)
- Language: C++
- Size: 214 KB
- Stars: 4
- Watchers: 3
- Forks: 9
- Open Issues: 4
-
Metadata Files:
- Readme: README.md
- License: LICENSE.md
Awesome Lists containing this project
README
# ssplit-cpp
This is an approximate reimplementation of the sentence splitter from the Moses toolkit.
- Currently doesn't support CJK character sets.
- requires the pcre2 libraries
On Ubuntu, `sudo apt-get install libpcre2` should do the trick
- The (non-breaking) prefix files included were originally copied from the Moses toolkit
but may have been edited there or here since they were copied.
## Build instructions
```
mkdir build
cd build
cmake ..
make -j
```
This produces an executable `ssplit`.
## Usage
### Command line:
run `ssplit -h` for usage instructions.
### In Code (Example)
```
#include "ssplit.h"
...
std::string prefix_file = "path/to/moses-style/prefix-file";
ug::ssplit::SentenceSplitter ssplit(prefix_file);
...
std::string chunk_of_text = "Sentence one. Sentence two.\nSentence three. Sentence four.";
ug::ssplit::splitmode mode = ug::ssplit::splitmode::one_paragraph_per_line;
ug::sssplit::SentenceStream sentence_stream(chunk_of_text, ssplit, mode);
std::string_view snt;
while(sentence_stream >> snt) { // false means end of chunk
if (snt.size() == 0) {
// empty string_view means end of paragraph except in one_sentence_per_line mode,
// which just returns one line (minus leading and training whitespace) at a time
// For one_paragraph_per_line each empty paragraph results in snt.size() == 0
// twice in a row: first the em
...
}
else { // this is the next non-empty paragraph
...
}
}
```