https://github.com/mgproduction/mgtagger
A small, generic, single-C-source-code POS tagger, featuring ngrams with most common word spice, with Viterbi-like code.
c part-of-speech-tagger postagger tagger
- Host: GitHub
- URL: https://github.com/mgproduction/mgtagger
- Owner: MGProduction
- License: apache-2.0
- Created: 2017-11-10T19:41:27.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2017-11-11T17:24:14.000Z (over 8 years ago)
- Last Synced: 2025-02-25T11:46:47.058Z (over 1 year ago)
- Topics: c, part-of-speech-tagger, postagger, tagger
- Language: C
- Homepage:
- Size: 7.68 MB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
mgtagger
=====
*mgtagger* is a small, generic, single-C-source-file POS tagger, featuring n-grams with most common word spice, with Viterbi-like code.
It can learn a language from CoNLL-U files or from inline-tagged files.
The source code in this repository is provided under the terms of the [Apache License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0.html).
## Information
*mgtagger* can learn the information it needs for POS tagging from an inline POS-tagged file (the/DT cat/NN is/VBZ on/IN the/DT table/NN) or from CoNLL-U files (in which case you can select which feature set to use, and you also get base forms in the output).
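
To make the two input formats concrete, here is a minimal sketch of how a single token could be read from each of them. It is not taken from the mgtagger sources: `split_tagged_token` and `conllu_field` are hypothetical helper names, and splitting at the last `/` is an assumption made so that words containing a slash still parse. The CoNLL-U column order (ID, FORM, LEMMA, UPOS, XPOS, FEATS, ...) is the one defined by the format itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of mgtagger's API.
   Splits an inline token "word/TAG" at the LAST '/', so that a word
   containing a slash (e.g. "1/2/CD") still yields word "1/2", tag "CD".
   Returns 0 on success, -1 if the word or tag part is missing or a
   buffer is too small. */
static int split_tagged_token(const char *tok,
                              char *word, size_t word_sz,
                              char *tag, size_t tag_sz)
{
    assert(tok != NULL && word != NULL && tag != NULL);
    const char *slash = strrchr(tok, '/');
    if (slash == NULL || slash == tok || slash[1] == '\0')
        return -1;
    size_t wlen = (size_t)(slash - tok);
    size_t tlen = strlen(slash + 1);
    if (wlen >= word_sz || tlen >= tag_sz)
        return -1;
    memcpy(word, tok, wlen);
    word[wlen] = '\0';
    memcpy(tag, slash + 1, tlen + 1);   /* copies the terminating '\0' too */
    return 0;
}

/* Hypothetical helper, not part of mgtagger's API.
   Copies the tab-separated field number idx (0-based) of a CoNLL-U
   token line into out. Useful columns: 1 = FORM, 2 = LEMMA (base form),
   3 = UPOS, 4 = XPOS, 5 = FEATS.
   Returns 0 on success, -1 if the field is missing or out is too small. */
static int conllu_field(const char *line, int idx, char *out, size_t out_sz)
{
    assert(line != NULL && out != NULL);
    const char *p = line;
    for (int i = 0; i < idx; i++) {
        p = strchr(p, '\t');
        if (p == NULL)
            return -1;
        p++;                            /* step past the tab */
    }
    size_t len = strcspn(p, "\t\r\n");  /* field ends at tab or end of line */
    if (len >= out_sz)
        return -1;
    memcpy(out, p, len);
    out[len] = '\0';
    return 0;
}
```

With these helpers, `split_tagged_token("cat/NN", ...)` gives the pair `cat` / `NN`, and `conllu_field(line, 2, ...)` returns the base form of a CoNLL-U token line. Which CoNLL-U column acts as the tag (UPOS, XPOS, or a combination with FEATS) corresponds to the "feature set" choice mentioned above.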
After the quick learning phase it generates (and can later load) a text .mg file containing the lexicon and the n-grams.
It works natively in *UTF-8*, but you can switch it to a codepage by changing the corresponding setting in the code.
To use it in your project, simply add *mgtagger_postag.c* + *mgtagger_private.h* / *mgtagger.h* to it.
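
For readers unfamiliar with how lexicon scores and n-gram scores combine into a tag sequence, the following is a textbook bigram Viterbi decoder over log-scores. It is only an illustration of the general technique, not mgtagger's actual "Viterbi-like" code: the function name `viterbi_bigram`, the fixed tag count `NT`, the length limit `MAXN` and the table layout are all assumptions made for this sketch.

```c
#include <assert.h>

#define NT   3    /* number of tags in this toy setup (assumption) */
#define MAXN 64   /* longest sentence the sketch handles (assumption) */

/* Hypothetical sketch, not mgtagger's implementation.
   emit[i][t]  : log-score of tag t for token i (the lexicon part)
   trans[p][t] : log-score of tag t following tag p (the bigram part)
   init[t]     : log-score of tag t at the start of the sentence
   Writes the best-scoring tag index for each token into out[0..n-1].
   Returns 0 on success, -1 if n is outside 1..MAXN. */
static int viterbi_bigram(int n, const double emit[][NT],
                          const double trans[NT][NT],
                          const double init[NT], int out[])
{
    double score[MAXN][NT];   /* best log-score of a path ending in tag t at token i */
    int back[MAXN][NT];       /* previous tag on that best path */

    assert(emit != 0 && trans != 0 && init != 0 && out != 0);
    if (n <= 0 || n > MAXN)
        return -1;

    for (int t = 0; t < NT; t++) {
        score[0][t] = init[t] + emit[0][t];
        back[0][t] = -1;
    }
    for (int i = 1; i < n; i++) {
        for (int t = 0; t < NT; t++) {
            int best = 0;     /* best predecessor tag for (i, t) */
            for (int p = 1; p < NT; p++)
                if (score[i - 1][p] + trans[p][t] >
                    score[i - 1][best] + trans[best][t])
                    best = p;
            score[i][t] = score[i - 1][best] + trans[best][t] + emit[i][t];
            back[i][t] = best;
        }
    }

    /* pick the best final tag, then follow the back-pointers */
    int last = 0;
    for (int t = 1; t < NT; t++)
        if (score[n - 1][t] > score[n - 1][last])
            last = t;
    for (int i = n - 1; i >= 0; i--) {
        out[i] = last;
        last = back[i][last];
    }
    return 0;
}
```

Working in log-scores turns products of probabilities into sums and avoids underflow on long sentences. Declare the tables passed in as `const double` arrays so they match the parameter types without a cast.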
At the moment *mgtagger* doesn't do tokenization (although it has a basic built-in tokenizer that may suit some languages, though certainly not Japanese, Chinese or Thai) - it just assigns a POS to tokens after its analysis.