https://github.com/xamgore/segtok
A rule-based sentence segmenter (splitter) and a word tokenizer using orthographic features
- Host: GitHub
- URL: https://github.com/xamgore/segtok
- Owner: xamgore
- Created: 2025-01-08T19:39:43.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2025-02-26T14:33:16.000Z (about 1 year ago)
- Last Synced: 2025-04-13T13:07:30.887Z (about 1 year ago)
- Topics: nlp, segmenter, tokenizer
- Language: Rust
- Homepage:
- Size: 101 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 3
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
README
# segtok [crates.io](https://crates.io/crates/segtok) [docs.rs](https://docs.rs/segtok/)
Segtok is a fast, rule-based sentence segmentation and tokenization library for well-orthographed texts, particularly in
English, German, and Romance languages.
- Unicode support
- High precision for well-orthographed texts
- Minimal false positives
- Handles complex sentence boundaries
- Handles technical texts and URLs
It is lightweight, easy to customize, and integrates easily into Unix-based workflows. Segtok is ideal for
processing structured, regular texts where precision and speed are crucial.
Ported from the [python package](https://github.com/fnl/segtok) (no longer maintained),
fixing [a few bugs](https://github.com/fnl/segtok/issues/26) that remain open upstream. You may want to read about
[why segtok was made](https://github.com/xamgore/segtok/blob/master/README.md).
## Example
```rust
use segtok::{segmenter::*, tokenizer::*};

fn main() {
    let input = include_str!("../tests/test_google.txt");

    // Split the text into sentences, then tokenize each sentence,
    // expanding contractions ("don't" -> "do" + "n't") along the way.
    let sentences: Vec<Vec<_>> = split_multi(input, SegmentConfig::default())
        .into_iter()
        .map(|span| split_contractions(web_tokenizer(&span)).collect())
        .collect();
}
```
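To give a feel for the "rule-based segmentation using orthographic features" approach, here is a deliberately simplified sketch: it breaks after terminal punctuation only when the next non-space character is uppercase, and skips a tiny abbreviation list. This is an illustration of the general technique only, not segtok's actual rules (segtok's real rule set is far more extensive), and the `naive_split` function and its abbreviation list are made up for this example.

```rust
/// Toy rule-based sentence splitter: split after '.', '!' or '?' when the
/// following non-whitespace character is uppercase (an orthographic cue),
/// unless the preceding token is a known abbreviation.
fn naive_split(text: &str) -> Vec<String> {
    const ABBREVS: [&str; 4] = ["Dr", "Mr", "Mrs", "Prof"];
    let chars: Vec<char> = text.chars().collect();
    let mut sentences = Vec::new();
    let mut start = 0;
    for i in 0..chars.len() {
        if matches!(chars[i], '.' | '!' | '?') {
            // Token immediately before the punctuation mark.
            let prefix: String = chars[start..i].iter().collect();
            let prev = prefix.split_whitespace().last().unwrap_or("");
            let is_abbrev = chars[i] == '.' && ABBREVS.contains(&prev);
            // Orthographic cue: next visible character must be uppercase
            // (or we are at the end of the input).
            let next_upper = chars[i + 1..]
                .iter()
                .find(|c| !c.is_whitespace())
                .map_or(true, |c| c.is_uppercase());
            if !is_abbrev && next_upper {
                let sent: String = chars[start..=i].iter().collect();
                sentences.push(sent.trim().to_string());
                start = i + 1;
            }
        }
    }
    // Flush any trailing text without terminal punctuation.
    let tail: String = chars[start..].iter().collect();
    if !tail.trim().is_empty() {
        sentences.push(tail.trim().to_string());
    }
    sentences
}

fn main() {
    let sents = naive_split("Dr. Smith went home. He slept!");
    // The abbreviation rule keeps "Dr. Smith went home." in one piece.
    println!("{:?}", sents);
}
```

Real segmenters like segtok layer many more such orthographic rules (handling of ellipses, URLs, numbers, quotes) on top of this basic skeleton, which is what keeps false positives low on well-formed text.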