https://github.com/nytud/quntoken

Hungarian tokenizer.

Host: GitHub
URL: https://github.com/nytud/quntoken
Owner: nytud
License: gpl-3.0
Created: 2015-08-26T11:53:05.000Z (almost 9 years ago)
Default Branch: master
Last Pushed: 2022-03-15T09:55:37.000Z (over 2 years ago)
Last Synced: 2024-04-14T09:02:35.370Z (2 months ago)
Language: C++
Homepage: https://pypi.org/project/quntoken/
Size: 12.6 MB
Stars: 14
Watchers: 15
Forks: 5
Open Issues: 11
Metadata Files:
- Readme: README.md
- License: LICENSE.txt

Lists

awesome-hungarian-nlp - quntoken

README

# quntoken

New Hungarian tokenizer based on quex and huntoken.

This tool is also [integrated](https://github.com/dlt-rilmta/hunlp-GATE) into the [e-magyar](http://www.e-magyar.hu) language processing system under the name [emToken](http://e-magyar.hu/hu/textmodules/emtoken).

## Requirements

* OS: linux x86-64

* python 3.6+

Developer requirements: 

* python 2.7 (for quex)

* g++ >= 5

## Install

```sh
pip3 install quntoken
```

## Usage

### Command line

*quntoken* reads plain UTF-8 text from STDIN and writes to STDOUT. The default (and recommended) output format is TSV with two columns: the first contains the token, the second contains the whitespace sequence that follows the token. Sentence boundaries are marked with empty lines.

Example: tokenize the file *input.txt* and write the TSV output to *output.tsv*.

```sh
quntoken < input.txt > output.tsv
```
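The two-column TSV layout described above can be read back with a few lines of Python. The following is a minimal sketch, not part of quntoken: the function name `read_tsv_sentences` and the sample data are hypothetical, and it assumes that each token sits on its own line, that the columns are separated by a single tab, and that the whitespace column never contains a raw line break.

```python
def read_tsv_sentences(lines):
    """Yield sentences as lists of (token, whitespace) pairs.

    Assumes one token per line, a tab between the two columns,
    and an empty line after each sentence.
    """
    sentence = []
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            # Empty line: sentence boundary.
            if sentence:
                yield sentence
                sentence = []
            continue
        # Split on the first tab only; the second column is kept verbatim.
        token, _, space = line.partition('\t')
        sentence.append((token, space))
    if sentence:
        # Flush a final sentence that is not followed by an empty line.
        yield sentence


# Hypothetical sample in the two-column layout (not real quntoken output).
sample = [
    'Szia\t" "\n',
    'világ\t""\n',
    '!\t" "\n',
    '\n',
    'Viszlát\t""\n',
    '.\t""\n',
]
sentences = list(read_tsv_sentences(sample))
```

For a real file, an open handle such as `open('output.tsv', encoding='utf-8')` can be passed in place of `sample`.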

Optional arguments:

```txt
  -h, --help            show this help message and exit
  -f {json,raw,spl,tsv,xml}, --form {json,raw,spl,tsv,xml}
                        Valid formats: json, tsv, xml and spl (sentence per
                        line, ignores mode). Default format: tsv.
  -m {sentence,token}, --mode {sentence,token}
                        Modes: sentence or token (does not apply for
                        form=spl). Default: token
  -c, --conll-text      Add CoNLL text metafield to contain the detokenized
                        sentence (only for mode == token and format == tsv).
                        Default: False
  -w, --word-break      Eliminate word break from end of lines.
  -v, --version         show program's version number and exit
```

### Python API

quntoken.**tokenize**(*inp=sys.stdin, form='tsv', mode='token', word_break=False, conll_text=False*)

>Entry point, returns an iterator object. Parameters:
>
>- *inp*: Input iterator, default: *sys.stdin*.
>- *form*: Format of output. Valid formats: `'tsv'` (default), `'json'`, `'xml'` and `'spl'` (sentence per line, ignores `mode`).
>- *mode*: `'sentence'` (only sentence segmenting) or `'token'` (full tokenization - default, does not apply for `form=spl`).
>- *word_break*: If `True`, eliminates word break from end of lines. Default: `False`.
>- *conll_text*: If `True`, add CoNLL text metafield to contain the detokenized sentence (only for mode == token and format == tsv). Default: `False`.

Example:

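```py
from quntoken import tokenize

# Read input.txt as UTF-8 and print each chunk yielded by the tokenizer.
with open('input.txt', encoding='utf-8') as inp:
    for tok in tokenize(inp):
        print(tok, end='')
```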
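Since *inp* is described as an input iterator, text that is already in memory can presumably be passed in through `io.StringIO`, which iterates line by line like an open file. The sketch below rests on that assumption: the helper `tokenize_string` is hypothetical, it takes the tokenizer as an argument so that it can be tried without quntoken installed, and `fake_tokenize` is only a stand-in that splits on whitespace. It also assumes that the tokenizer yields strings, as the `print(tok, end='')` example above suggests.

```python
import io


def tokenize_string(text, tokenize, **options):
    """Run a tokenize-style callable on an in-memory string and join its output."""
    # io.StringIO iterates line by line, like an open text file or sys.stdin.
    return ''.join(tokenize(io.StringIO(text), **options))


def fake_tokenize(inp, form='tsv', mode='token'):
    # Hypothetical stand-in, NOT quntoken: emits one whitespace-split word per line.
    for line in inp:
        for word in line.split():
            yield word + '\n'


result = tokenize_string('egy kettő\nhárom\n', fake_tokenize, form='tsv')
```

With quntoken installed, the real function would be passed instead, for example `tokenize_string(text, tokenize, form='spl')` after `from quntoken import tokenize`.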