https://github.com/xamgore/segtok

A rule-based sentence segmenter (splitter) and a word tokenizer using orthographic features
nlp segmenter tokenizer

Host: GitHub
URL: https://github.com/xamgore/segtok
Owner: xamgore
Created: 2025-01-08T19:39:43.000Z (over 1 year ago)
Default Branch: master
Last Pushed: 2025-02-26T14:33:16.000Z (about 1 year ago)
Last Synced: 2025-04-13T13:07:30.887Z (about 1 year ago)
Topics: nlp, segmenter, tokenizer
Language: Rust
Homepage:
Size: 101 KB
Stars: 2
Watchers: 1
Forks: 0
Open Issues: 3
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md

README

# segtok [![](https://img.shields.io/crates/v/segtok.svg)](https://crates.io/crates/segtok) [![](https://docs.rs/segtok/badge.svg)](https://docs.rs/segtok/)

Segtok is a fast, rule-based sentence segmentation and tokenization library for well-orthographed texts, particularly in English, German, and Romance languages.

- Unicode support

- High precision for well-orthographed texts

- Minimal false positives

- Handles complex sentence boundaries

- Handles technical texts and URLs

Segtok is lightweight, customizable, and integrates easily into Unix-based workflows. It is best suited to structured, regular texts where precision and speed are crucial.

This crate is a port of the [Python package](https://github.com/fnl/segtok) (no longer maintained) and fixes [a few bugs](https://github.com/fnl/segtok/issues/26) that were left unfixed there. You may want to read about [why segtok was made](https://github.com/xamgore/segtok/blob/master/README.md).

## Example

```rust
use segtok::{segmenter::*, tokenizer::*};

fn main() {
  let input = include_str!("../tests/test_google.txt");

  // Split the text into sentences, then tokenize each sentence.
  let sentences: Vec<Vec<_>> = split_multi(input, SegmentConfig::default())
    .into_iter()
    .map(|span| split_contractions(web_tokenizer(&span)).collect())
    .collect();
}
```
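To give a feel for what "rule-based, using orthographic features" can mean, here is a minimal standalone sketch. It is not segtok's actual rule set or API: the `split_sentences` function and the `ABBREVIATIONS` list are hypothetical and exist only for illustration. The sketch treats `.`, `!`, or `?` as a sentence boundary only when it is followed by whitespace and an uppercase letter, and when the preceding word is not a listed abbreviation.

```rust
// Hypothetical abbreviation list for this toy example (not taken from segtok).
const ABBREVIATIONS: [&str; 4] = ["Dr", "Mr", "Mrs", "e.g"];

/// Toy rule-based sentence splitter; an illustration only, not segtok's rules.
fn split_sentences(text: &str) -> Vec<&str> {
    let mut sentences = Vec::new();
    let mut start = 0;
    let chars: Vec<(usize, char)> = text.char_indices().collect();

    for (i, &(pos, ch)) in chars.iter().enumerate() {
        if !matches!(ch, '.' | '!' | '?') {
            continue;
        }
        // Orthographic cue 1: whitespace directly after the terminal mark.
        let followed_by_space = chars
            .get(i + 1)
            .map_or(false, |&(_, c)| c.is_whitespace());
        // Orthographic cue 2: the next non-whitespace character is uppercase.
        let next_upper = chars[i + 1..]
            .iter()
            .map(|&(_, c)| c)
            .find(|c| !c.is_whitespace())
            .map_or(false, char::is_uppercase);
        // Lexical cue: a period after a known abbreviation is not a boundary.
        let word_before = text[start..pos]
            .rsplit(char::is_whitespace)
            .next()
            .unwrap_or("");
        let is_abbreviation = ch == '.' && ABBREVIATIONS.contains(&word_before);

        if followed_by_space && next_upper && !is_abbreviation {
            let end = pos + ch.len_utf8();
            sentences.push(text[start..end].trim());
            start = end;
        }
    }

    // Whatever remains after the last boundary is the final sentence.
    let rest = text[start..].trim();
    if !rest.is_empty() {
        sentences.push(rest);
    }
    sentences
}
```

With the input `"Dr. Smith arrived at 3.5 p.m. yesterday. He left early! Why?"`, this sketch returns three sentences: the period after `Dr` is skipped as an abbreviation, the one in `3.5` has no following whitespace, and the one after `p.m` is followed by a lowercase word. A real segmenter needs many more rules than this, which is the gap a library like segtok is meant to fill.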
