https://github.com/zseder/huntoken
word and sentence tokenizer
- Host: GitHub
- URL: https://github.com/zseder/huntoken
- Owner: zseder
- License: lgpl-3.0
- Created: 2013-09-09T08:52:32.000Z (about 11 years ago)
- Default Branch: master
- Last Pushed: 2014-11-27T10:00:26.000Z (almost 10 years ago)
- Last Synced: 2024-04-20T09:33:09.541Z (7 months ago)
- Language: Shell
- Size: 707 KB
- Stars: 3
- Watchers: 2
- Forks: 3
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: COPYING
Awesome Lists containing this project
- awesome-hungarian-nlp - huntoken
README
# Hungarian (and a little bit English) raw text tokenisation
License: GNU LGPL
2003-2004 (c) Németh László
2013- (c) Zséder Attila
## Compile
~~~~
make
make install
~~~~

Need
- Unix environment (shell, Unix tools),
- Flex lexical analyzer generator,
- M4 macro processor.

## Usage
Need
- Unix shell, or CYGWIN on Windows
- sed

~~~~
huntoken xml_output
~~~~

## Options
- -h, --help: help
- -r: only sentence boundary detection
- -x: processing without hun_abbrev filter
- -b: break long sentences (needed for tokenising sentences longer than 4000 characters)
- -n: output without XML header and footer
- -e: tokenize English (use English abbreviations)
- -v, --version: version

## Filters
See the flex sources and the huntoken shell program.
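
As a rough illustration of how the options listed earlier can be combined, here is a minimal sketch of a wrapper. The `huntoken_flags` helper and its mode names are hypothetical and not part of huntoken; `input.txt` and `output.xml` are placeholder file names. The sketch also assumes that huntoken reads raw text on standard input and writes its XML to standard output, which this README does not state explicitly.

```shell
#!/bin/sh
# Hypothetical helper (not part of huntoken): maps readable mode names
# to the option flags documented in the Options section.
huntoken_flags() {
    flags=""
    for mode in "$@"; do
        case "$mode" in
            sentences)  flags="$flags -r" ;;  # only sentence boundary detection
            no-abbrev)  flags="$flags -x" ;;  # process without the hun_abbrev filter
            break-long) flags="$flags -b" ;;  # break long sentences
            no-header)  flags="$flags -n" ;;  # output without XML header and footer
            english)    flags="$flags -e" ;;  # tokenize English
            *) echo "unknown mode: $mode" >&2; return 1 ;;
        esac
    done
    # printf instead of echo, so a lone "-e" or "-n" is printed literally.
    printf '%s\n' "${flags# }"
}

# Example: English text, long sentences broken, no XML header or footer.
# Runs only if huntoken is installed and the placeholder input file exists.
if command -v huntoken >/dev/null 2>&1 && [ -r input.txt ]; then
    # Assumption: huntoken reads standard input and writes standard output.
    # $opts is left unquoted on purpose so it splits into separate options.
    opts=$(huntoken_flags english break-long no-header) &&
        huntoken $opts < input.txt > output.xml
fi
```

Calling `huntoken_flags sentences` prints `-r`, so the same helper can drive a sentence-splitting-only run. An unknown mode name makes it return a non-zero status, and the invocation is then skipped.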
László Németh
[email protected]

Attila Zséder
[email protected], [email protected]