https://github.com/zseder/huntoken

word and sentence tokenizer

# Hungarian (and a little bit English) raw text tokenisation

License: GNU LGPL

2003-2004 (c) Németh László

2013- (c) Zséder Attila

## Compile

~~~~
make
make install
~~~~

Requirements:
- a Unix environment (shell, Unix tools),
- the Flex lexical analyzer generator,
- the M4 macro processor.
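
Before running `make`, it can help to confirm that the build tools listed above are on the `PATH`. The following is a minimal sketch and not part of huntoken: the `missing_tools` helper is hypothetical, and it assumes the executables are named `flex`, `m4`, `make` and `sed`.

```shell
# Hypothetical helper (not shipped with huntoken): print every tool
# from the argument list that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Check the tools named in the requirements list (assumed executable names).
missing=$(missing_tools flex m4 make sed)
if [ -n "$missing" ]; then
    printf 'Missing build tools:\n%s\n' "$missing" >&2
fi
```

If nothing is reported on standard error, the prerequisites are probably in place and `make` can be tried.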

## Usage

Requirements:
- a Unix shell, or Cygwin on Windows,
- sed.

~~~~
huntoken < input_text > xml_output
~~~~
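
For scripted use, the call can be wrapped in a small function. This is only a sketch under one assumption that the README does not spell out: that `huntoken` reads raw text on standard input and writes XML to standard output. The `tokenize_file` name and its return codes are hypothetical.

```shell
# Hypothetical wrapper (not part of huntoken). Assumes huntoken reads
# raw text on standard input and writes XML to standard output.
tokenize_file() {
    input=$1
    output=$2
    # Fail early if the input file is missing or unreadable.
    if [ ! -r "$input" ]; then
        printf 'cannot read input file: %s\n' "$input" >&2
        return 1
    fi
    # Fail with the conventional "command not found" status.
    if ! command -v huntoken >/dev/null 2>&1; then
        printf 'huntoken not found on PATH\n' >&2
        return 127
    fi
    huntoken < "$input" > "$output"
}
```

A call would then look like `tokenize_file input.txt output.xml`, where both file names are placeholders.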

## Options

- -h, --help: show help
- -r: sentence boundary detection only
- -x: process without the hun_abbrev filter
- -b: break long sentences (needed when tokenising sentences longer than 4000 characters)
- -n: output without XML header and footer
- -e: tokenize English (use the English abbreviations)
- -v, --version: show version
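
The options can be combined in a script. The sketch below only builds and prints a command line; it does not run the tokenizer. The `huntoken_cmd` helper and its `lang`/`mode` arguments are hypothetical, and it assumes that flags may be passed as separate arguments. It always adds `-b`, on the assumption that breaking long sentences is a safe default for arbitrary input.

```shell
# Hypothetical helper (not part of huntoken): map two settings onto the
# documented flags and print the resulting command line.
huntoken_cmd() {
    lang=$1   # "hu" (default behaviour) or "en"
    mode=$2   # "tokens" (default behaviour) or "sentences"
    cmd="huntoken -b"            # -b: break long sentences
    if [ "$lang" = "en" ]; then
        cmd="$cmd -e"            # -e: English abbreviations
    fi
    if [ "$mode" = "sentences" ]; then
        cmd="$cmd -r"            # -r: sentence boundary detection only
    fi
    printf '%s\n' "$cmd"
}

huntoken_cmd en sentences   # prints: huntoken -b -e -r
```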

## Filters

See the flex sources and the huntoken shell script.

László Németh

Attila Zséder