https://github.com/nytud/quntoken
Hungarian tokenizer.
- Host: GitHub
- URL: https://github.com/nytud/quntoken
- Owner: nytud
- License: gpl-3.0
- Created: 2015-08-26T11:53:05.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2022-03-15T09:55:37.000Z (almost 3 years ago)
- Last Synced: 2024-05-22T01:21:18.881Z (8 months ago)
- Language: C++
- Homepage: https://pypi.org/project/quntoken/
- Size: 12.6 MB
- Stars: 14
- Watchers: 15
- Forks: 5
- Open Issues: 11
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
Awesome Lists containing this project
- awesome-hungarian-nlp - quntoken
README
# quntoken
New Hungarian tokenizer based on quex and huntoken.
This tool is also [integrated](https://github.com/dlt-rilmta/hunlp-GATE)
into the [e-magyar](http://www.e-magyar.hu) language processing system
under the name [emToken](http://e-magyar.hu/hu/textmodules/emtoken).

## Requirements
* OS: linux x86-64
* python 3.6+

Developer requirements:
* python 2.7 (for quex)
* g++ = 5

## Install
```sh
pip3 install quntoken
```

## Usage
### Command line
*quntoken* reads UTF-8 plain text from STDIN and writes to STDOUT.
The default (and recommended) output format is TSV with two columns:
the first contains the token, the second the whitespace sequence
that follows the token. Sentence boundaries are marked with empty lines.

Example: tokenizing the *input.txt* file and writing the TSV output to the *output.tsv* file.
```
quntoken <input.txt >output.tsv
```

Optional arguments:
```txt
  -h, --help            show this help message and exit
  -f {json,raw,spl,tsv,xml}, --form {json,raw,spl,tsv,xml}
                        Valid formats: json, tsv, xml and spl (sentence per
                        line, ignores mode). Default format: tsv.
  -m {sentence,token}, --mode {sentence,token}
                        Modes: sentence or token (does not apply for
                        form=spl). Default: token
  -c, --conll-text      Add CoNLL text metafield to contain the detokenized
                        sentence (only for mode == token and format == tsv).
                        Default: False
  -w, --word-break      Eliminate word break from end of lines.
  -v, --version         show program's version number and exit
```

### Python API
quntoken.**tokenize**(*inp=sys.stdin, form='tsv', mode='token',
word_break=False, conll_text=False*)
>Entry point, returns an iterator object. Parameters:
>
>- *inp*: Input iterator, default: *sys.stdin*.
>- *form*: Format of output. Valid formats: `'tsv'` (default), `'json'`, `'xml'`
>and `'spl'` (sentence per line, ignores `mode`).
>- *mode*: `'sentence'` (only sentence segmenting) or `'token'` (full
>tokenization - default, does not apply for `form=spl`).
>- *word_break*: If `True`, eliminates word break from end of lines. Default:
>`False`.
>- *conll_text*: If `True`, add CoNLL text metafield to contain the detokenized
>sentence (Only for mode == token and format == tsv). Default:
>`False`.

Example:
```py
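from quntoken import tokenize

# Tokenize a UTF-8 text file and print the TSV output as it is produced.
with open('input.txt', encoding='utf-8') as inp:
    for tok in tokenize(inp):
        print(tok, end='')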
```
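
The two-column TSV layout described under "Command line" (token, then the whitespace that follows it, with an empty line between sentences) can be read back into per-sentence lists with a few lines of Python. The following is a minimal sketch, not part of quntoken: `group_sentences` is a hypothetical helper, and the sample input is hand-written for illustration rather than real tokenizer output. How quntoken encodes the whitespace column is not specified here, so the sketch keeps that value as an opaque string.

```python
def group_sentences(chunks):
    """Group two-column TSV text into sentences.

    Each sentence is a list of (token, whitespace_after) pairs;
    an empty line closes the current sentence.
    """
    sentences, current = [], []
    # Join first so the sketch works whether the input arrives
    # line by line or in larger pieces.
    for line in ''.join(chunks).split('\n'):
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        # Split on the first tab only; the second column is kept as-is.
        token, _, ws_after = line.partition('\t')
        current.append((token, ws_after))
    if current:
        sentences.append(current)
    return sentences


# Hand-written sample in the layout described above (NOT real quntoken output).
sample = ['Hello\t \n', 'world\t\n', '.\t \n', '\n', 'Bye\t\n', '.\t\n']
sentences = group_sentences(sample)
print([[tok for tok, _ in sent] for sent in sentences])
```

Any iterable of strings can be passed in, for example the lines of a previously written *output.tsv* file. Real output may contain elements this sketch does not interpret, such as metadata lines added by `--conll-text`.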