https://github.com/zjaume/splitters
A CLI for Rust SRX sentence segmentation rules as a Python package.
- Host: GitHub
- URL: https://github.com/zjaume/splitters
- Owner: ZJaume
- License: gpl-3.0
- Created: 2023-07-12T11:17:57.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-09-14T16:16:10.000Z (about 1 year ago)
- Last Synced: 2023-09-15T08:40:32.360Z (about 1 year ago)
- Topics: pypi, python, rust, sentence-segmentation, sentence-splitter, sentence-splitting, srx
- Language: Rust
- Homepage:
- Size: 68.4 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# splitte(.)rs
*There's still some work pending to make this usable.*

A CLI for the Rust [SRX](https://github.com/bminixhofer/srx) implementation, distributed as a Python package.
## Installation
Installing from source requires Rust's Cargo. Install it with your package manager or from https://rustup.rs/. Then clone the repo and install it like any other Python package:
```
git clone https://github.com/ZJaume/splitters
pip install ./splitters
```

## Usage
Example usage:
```bash
echo "Yes this is a sentence. Another one." | splitters -i /dev/stdin --output /dev/stdout
```
```
Yes this is a sentence.
Another one.
```

Full list of parameters:
```
splitters 0.1.0

USAGE:
    splitters [OPTIONS] --input --output

OPTIONS:
    -h, --help        Print help information
    -i, --input
    -l, --language    ISO-639-1, 2 char language code [default: en]
    -o, --output
    -s, --srxfile     [default: ]
    -v, --verbose
    -V, --version     Print version information
```

## Compatibility with Rust regex
Some regular expressions may fail to load because of syntax incompatibilities with the Rust regex engine.
To minimize this, the SRX rules bundled with this package have been partially fixed.
The `scripts/fix_regex.sh` script applies the following fixes:
- Escape the whitespace character at the beginning of ``. For some reason the Rust XML parser removes the space inside the rule for ` +`, so the repetition operator ends up missing its expression.
- Unescape the 'ظ' character for Farsi. Rust regex does not require it to be escaped.
- `\Q` and `\E` expressions are not supported, so they are removed and everything enclosed between them is escaped.
- Escape the dash before `\d` and `\p{...}`, which otherwise causes an invalid range literal.

To see the loading errors, run `splitters` with the `-v` option and use `-s` to provide one of the original SRX files in `data_orig` to see the errors that have been fixed.
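
To illustrate the `\Q`...`\E` and dash fixes listed above, here is a minimal Python sketch. It is not the code in `scripts/fix_regex.sh`: the function names are hypothetical, and Python's `re.escape` stands in for whatever escaping the script performs, so the exact set of escaped characters may differ from the bundled rules.

```python
import re

# Matches a \Q...\E span; an unterminated \Q runs to the end of the pattern.
_QUOTED_SPAN = re.compile(r"\\Q(.*?)(?:\\E|\Z)", re.DOTALL)

# Matches an unescaped dash directly followed by \d or \p{...}.
_DASH_BEFORE_CLASS = re.compile(r"(?<!\\)-(?=\\d|\\p\{)")


def expand_quoted_spans(pattern: str) -> str:
    # Hypothetical helper: drop the \Q and \E markers and escape
    # the literal text they enclosed.
    return _QUOTED_SPAN.sub(lambda m: re.escape(m.group(1)), pattern)


def escape_dash_before_class(pattern: str) -> str:
    # Hypothetical helper: turn "-\d" into "\-\d" so the dash is
    # treated as a literal instead of a range operator.
    return _DASH_BEFORE_CLASS.sub(lambda m: "\\-", pattern)


if __name__ == "__main__":
    print(expand_quoted_spans(r"\QMr.\E\s"))
    print(escape_dash_before_class(r"[a-z-\d]"))
```

The sketch works on a single pattern string. It does not handle every edge case, for example a dash preceded by an escaped backslash.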