https://github.com/zseder/huntoken

word and sentence tokenizer

Host: GitHub
URL: https://github.com/zseder/huntoken
Owner: zseder
License: lgpl-3.0
Created: 2013-09-09T08:52:32.000Z (almost 12 years ago)
Default Branch: master
Last Pushed: 2014-11-27T10:00:26.000Z (over 10 years ago)
Last Synced: 2024-08-03T16:08:58.809Z (11 months ago)
Language: Shell
Size: 707 KB
Stars: 3
Watchers: 2
Forks: 3
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: COPYING

Awesome Lists containing this project

awesome-hungarian-nlp - huntoken

README

# Hungarian (and a little bit English) raw text tokenisation

License: GNU LGPL

2003-2004 (c) Németh László

2013-     (c) Zséder Attila

## Compile

~~~~
make
make install
~~~~

Build requirements:

- Unix environment (shell, Unix tools),

- Flex lexical analyzer generator,

- M4 macro processor.

## Usage

Runtime requirements:

- Unix shell, or CYGWIN on Windows

- sed

~~~~
huntoken xml_output
~~~~
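The invocation above can be wrapped in a small helper. This is a minimal sketch, not part of huntoken: `tokenize_file` is a hypothetical function name, and it assumes huntoken acts as a filter that reads raw text on standard input and writes XML to standard output.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with huntoken).
# Assumption: huntoken reads raw text on stdin and writes XML to stdout.
tokenize_file() {
    src=$1   # path to the raw text input
    dst=$2   # path where the XML output is written

    # Fail early with a clear message if huntoken is not installed.
    if ! command -v huntoken >/dev/null 2>&1; then
        echo "huntoken not found in PATH; run 'make install' first" >&2
        return 127
    fi

    huntoken < "$src" > "$dst"
}

# Example call (file names are placeholders):
#   tokenize_file input.txt output.xml
```

Checking for the command first keeps the failure mode explicit instead of leaving an empty output file with no explanation.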

## Options

- -h, --help: show help

- -r: only sentence boundary detection

- -x: processing without hun_abbrev filter

- -b: break long sentences (needed to tokenise sentences longer than 4000 characters)

- -n: output without XML header and footer

- -e: tokenize English (use English abbreviations)

- -v, --version: show version
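As an illustration of how the options listed above might be combined, here is a hedged sketch. The wrapper names `split_sentences` and `tokenize_english` are hypothetical, and the sketch assumes that huntoken reads standard input, writes standard output, and accepts several single-letter options passed separately; none of this is confirmed by the README.

```shell
#!/bin/sh
# Hypothetical wrappers (not part of huntoken).
# Assumptions: huntoken filters stdin to stdout, and options can be
# given as separate arguments.

# Sentence boundary detection only (-r), without XML header and footer (-n).
split_sentences() {
    huntoken -r -n
}

# English abbreviations (-e), breaking very long sentences (-b).
tokenize_english() {
    huntoken -e -b
}

# Example calls (file names are placeholders):
#   split_sentences < hungarian.txt > sentences.xml
#   tokenize_english < english.txt > tokens.xml
```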

## Filters

See the flex sources and the huntoken shell script.

László Németh


Attila Zséder

