https://github.com/fergusq/yajwiz
Klingon morphological analyzer and other NLP tools
- Host: GitHub
- URL: https://github.com/fergusq/yajwiz
- Owner: fergusq
- License: apache-2.0
- Created: 2020-10-14T21:40:25.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2024-04-21T18:26:31.000Z (about 2 years ago)
- Last Synced: 2025-07-28T15:15:53.946Z (11 months ago)
- Topics: klingon, morphological-analysis
- Language: Python
yajwI'
======
**yajwI'** is a Klingon NLP toolkit that includes basic tokenization, morphological analysis and POS tagging.
It relies heavily on the boQwI' dictionary.
Installation
------------
yajwI' requires Python 3.8 or newer.
It can be installed from PyPI::

    pip install yajwiz

Updating and using the boQwI' dictionary
----------------------------------------
When yajwI' is first imported, it will download a copy of the boQwI' dictionary.
After that, the dictionary is not refreshed automatically: call the ``update_dictionary()`` function whenever the dictionary needs to be updated.
The function checks for updates and installs them.
The downloaded dictionary can be accessed through the ``load_dictionary()`` function.
>>> import yajwiz
>>> yajwiz.update_dictionary()
>>> dictionary = yajwiz.load_dictionary()
>>> dictionary.version
'2021.03.18a'
Tokenization
------------
The library includes very simple tokenization.
>>> import yajwiz
>>> yajwiz.tokenize("Hegh neH chav qoH. qanchoHpa' qoH, Hegh qoH.")
[('WORD', 'Hegh'), ('SPACE', ' '), ('WORD', 'neH'), ('SPACE', ' '), ('WORD', 'chav'), ('SPACE', ' '), ('WORD', 'qoH'), ('PUNCT', '.'), ('SPACE', ' '), ('WORD', "qanchoHpa'"), ('SPACE', ' '), ('WORD', 'qoH'), ('PUNCT', ','), ('SPACE', ' '), ('WORD', 'Hegh'), ('SPACE', ' '), ('WORD', 'qoH'), ('PUNCT', '.')]
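
Because spaces and punctuation are kept as tokens, the token list can be filtered down to the words or joined back into the original text. The following is a minimal sketch that works on the output shown above, copied in as a literal so that it runs without yajwI' installed. ``words`` and ``detokenize`` are illustrative helper names, not part of the library.

.. code:: python

    # Token list copied from the tokenizer output shown above.
    tokens = [
        ("WORD", "Hegh"), ("SPACE", " "), ("WORD", "neH"), ("SPACE", " "),
        ("WORD", "chav"), ("SPACE", " "), ("WORD", "qoH"), ("PUNCT", "."),
        ("SPACE", " "), ("WORD", "qanchoHpa'"), ("SPACE", " "), ("WORD", "qoH"),
        ("PUNCT", ","), ("SPACE", " "), ("WORD", "Hegh"), ("SPACE", " "),
        ("WORD", "qoH"), ("PUNCT", "."),
    ]


    def words(tokens):
        """Return only the text of WORD tokens, in order (hypothetical helper)."""
        return [text for kind, text in tokens if kind == "WORD"]


    def detokenize(tokens):
        """Join all token texts back into a single string (hypothetical helper)."""
        return "".join(text for _, text in tokens)


    print(words(tokens))
    print(detokenize(tokens))
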
Morphological analysis
----------------------
The ``yajwiz.analyze`` function parses a word and returns a list of possible parses. Each parse is a dictionary that holds the lemma, the part of speech, the affixes and other information.
>>> yajwiz.analyze("yInwI'")
[{'BOQWIZ_ID': 'yIn:n',
'BOQWIZ_POS': 'n:klcp1',
'LEMMA': 'yIn',
'PARTS': ['yIn:n', "-wI':n"],
'POS': 'N',
'SUFFIX': {'N4': "-wI'"},
'UNGRAMMATICAL': 'ILLEGAL PLURAL OR POSSESSIVE SUFFIX',
'WORD': "yInwI'",
'XPOS': 'N',
'XPOS_GSUFF': 'N'},
{'BOQWIZ_ID': 'yIn:v',
'BOQWIZ_POS': 'v:t_c,klcp1',
'LEMMA': 'yIn',
'PARTS': ['yIn:v', "-wI':v"],
'POS': 'V',
'SUFFIX': {'V9': "-wI'"},
'WORD': "yInwI'",
'XPOS': 'VT',
'XPOS_GSUFF': "VT.wI'"}]
Currently the analyzer is very permissive and accepts wrong plural and possessive suffixes (e.g. **yInwI'** instead of **yInwIj**). It tries to flag errors of this kind with an ``'UNGRAMMATICAL'`` entry in the parse; in the example above, its value is the string ``'ILLEGAL PLURAL OR POSSESSIVE SUFFIX'``. It detects the following errors:
- Using **-pu'**, **-wI'**, **-lI'**, etc. when the noun is not a person noun
- Using **-Du'** when the noun is not a body part
- Using **-vIS** without using **-taH**
- Using **-lu'** with an illegal verb prefix
- Using intransitive verbs with prefixes indicating object
- Using **-ghach** without any other verb suffix
- Using aspect suffix with **-jaj**
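
Since flagged parses are still returned, a caller may want to separate them from the rest. The sketch below does this on two parses abridged from the ``yInwI'`` output above, so it runs without yajwI' installed. It assumes that any truthy value under ``'UNGRAMMATICAL'`` (a description string or ``True``) marks a flagged parse; ``grammatical`` and ``problems`` are illustrative names, not library functions.

.. code:: python

    # Two parses abridged from the output of yajwiz.analyze("yInwI'") above.
    analyses = [
        {"LEMMA": "yIn", "POS": "N", "XPOS": "N", "SUFFIX": {"N4": "-wI'"},
         "UNGRAMMATICAL": "ILLEGAL PLURAL OR POSSESSIVE SUFFIX"},
        {"LEMMA": "yIn", "POS": "V", "XPOS": "VT", "SUFFIX": {"V9": "-wI'"}},
    ]


    def grammatical(analyses):
        """Keep only the parses that carry no truthy 'UNGRAMMATICAL' entry."""
        return [a for a in analyses if not a.get("UNGRAMMATICAL")]


    def problems(analyses):
        """Collect the 'UNGRAMMATICAL' value of every flagged parse."""
        return [a["UNGRAMMATICAL"] for a in analyses if a.get("UNGRAMMATICAL")]


    print([a["XPOS"] for a in grammatical(analyses)])
    print(problems(analyses))
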
There is also a simpler function, ``yajwiz.split_to_morphemes``, which returns a set of tuples of strings (usually there will be only one tuple in the set):
>>> yajwiz.split_to_morphemes("yInwI'")
{('yIn', "-wI'")}
List of Parts of Speech
.......................
=====  ===========
XPOS   Explanation
=====  ===========
VS     Stative verb
VT     Transitive verb
VI     Intransitive verb
VA     Transitive and intransitive verb
V?     Verb with unknown transitivity
NL     Person noun
NB     Body part noun
PRON   Pronoun (including **'Iv** and **nuq**: it is a noun that can function as a copula)
NUM    Number
N      Other noun
ADV    Adverb
EXCL   Exclamation
CONJ   Conjunction
QUES   Question word (other than **'Iv** and **nuq**)
UNK    Unknown
=====  ===========

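
When post-processing analyzer output, it can be handy to have the table as a lookup. The following sketch transcribes it into a dictionary. ``XPOS_EXPLANATIONS``, ``explain`` and ``is_verb`` are illustrative names that do not exist in the library, and falling back to "Unknown" for unlisted tags is an assumption of this sketch.

.. code:: python

    # XPOS tags and explanations, transcribed from the table above.
    XPOS_EXPLANATIONS = {
        "VS": "Stative verb",
        "VT": "Transitive verb",
        "VI": "Intransitive verb",
        "VA": "Transitive and intransitive verb",
        "V?": "Verb with unknown transitivity",
        "NL": "Person noun",
        "NB": "Body part noun",
        "PRON": "Pronoun (including 'Iv and nuq)",
        "NUM": "Number",
        "N": "Other noun",
        "ADV": "Adverb",
        "EXCL": "Exclamation",
        "CONJ": "Conjunction",
        "QUES": "Question word (other than 'Iv and nuq)",
        "UNK": "Unknown",
    }

    # The verb tags of the table, listed explicitly.
    VERB_TAGS = {"VS", "VT", "VI", "VA", "V?"}


    def explain(xpos):
        """Return the explanation of a tag, or "Unknown" for unlisted tags."""
        return XPOS_EXPLANATIONS.get(xpos, XPOS_EXPLANATIONS["UNK"])


    def is_verb(xpos):
        """Tell whether a tag is one of the verb tags of the table."""
        return xpos in VERB_TAGS


    print(explain("VT"))
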
Grammar checker
---------------
yajwI' can be used to find common grammar errors. You can either use ``yajwiz.grammar_check`` from Python or the following command-line interface:
.. code::

    python -m yajwiz.grammar_check file.txt

CoNLL-U files and POS tagger
----------------------------
CoNLL-U is a popular file format for storing annotated linguistic data.
yajwI' can generate CoNLL-U files filled with morphological information (it does not support dependency parsing).
Below is an example script that first parses a text without a trained POS tagger,
then trains a POS tagger on the result, and finally parses the text again with the tagger and saves the result to a CoNLL-U file.
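.. code:: python

    import yajwiz

    with open("prose-corpus.txt", "r") as f:
        text = f.read()

    # First pass: analyze without a tagger (ambiguous words stay untagged).
    conllu = yajwiz.text_to_conllu(text)

    # Train a tagger on the output of the first pass.
    tagger = yajwiz.Tagger()
    tagger.train(yajwiz.conllu_to_tagged_list(conllu))

    # Second pass: analyze again, this time passing the trained tagger.
    conllu = yajwiz.text_to_conllu(text, tagger)

    with open("prose-corpus.conllu", "w") as f:
        f.write(conllu)
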
Without a trained POS tagger, ambiguous words will be left without a tag:
.. code::

    # Hegh neH chav qoH.
    1 Hegh _ _ _ _ _ _ _ _
    2 neH _ _ _ _ _ _ _ _
    3 chav _ _ _ _ _ _ _ _
    4 qoH qoH NOUN N _ _ _ _ _
    5 . . PUNCT PUNCT _ _ _ _ _
    # qanchoHpa' qoH, Hegh qoH.
    1 qanchoHpa' qan VERB V?.pa' Person=3|ObjPerson=3,0 _ _ _ SuffixV3=-choH|SuffixV9=-pa'
    2 qoH qoH NOUN N _ _ _ _ _
    3 , , PUNCT PUNCT _ _ _ _ _
    4 Hegh _ _ _ _ _ _ _ _
    5 qoH qoH NOUN N _ _ _ _ _
    6 . . PUNCT PUNCT _ _ _ _ _

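
To see which words were left untagged, the output can be scanned for token lines whose part-of-speech column is empty. In the CoNLL-U format, token lines have ten tab-separated fields (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), ``_`` marks an empty field and lines starting with ``#`` are comments. The sketch below relies on that layout and on a sample rebuilt from the first sentence above with tab separators, so it runs without yajwI' installed; ``parse_tokens`` and ``untagged_forms`` are illustrative names, not library functions.

.. code:: python

    # First sentence of the untagged output above, rebuilt with tab separators.
    SAMPLE = "\n".join([
        "# Hegh neH chav qoH.",
        "1\tHegh\t_\t_\t_\t_\t_\t_\t_\t_",
        "2\tneH\t_\t_\t_\t_\t_\t_\t_\t_",
        "3\tchav\t_\t_\t_\t_\t_\t_\t_\t_",
        "4\tqoH\tqoH\tNOUN\tN\t_\t_\t_\t_\t_",
        "5\t.\t.\tPUNCT\tPUNCT\t_\t_\t_\t_\t_",
    ])

    # The ten CoNLL-U columns, in order.
    FIELDS = ("id", "form", "lemma", "upos", "xpos",
              "feats", "head", "deprel", "deps", "misc")


    def parse_tokens(conllu):
        """Turn every token line into a dict keyed by column name."""
        tokens = []
        for line in conllu.splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and sentence comments
            tokens.append(dict(zip(FIELDS, line.split("\t"))))
        return tokens


    def untagged_forms(conllu):
        """Return the word forms whose UPOS column is empty ("_")."""
        return [t["form"] for t in parse_tokens(conllu) if t["upos"] == "_"]


    print(untagged_forms(SAMPLE))
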
Once trained, the tagger makes a "best guess" when deciding the POS of an ambiguous word:
.. code::

    # Hegh neH chav qoH.
    1 Hegh Hegh VERB VT Person=3|ObjPerson=3,0 _ _ _ _
    2 neH neH ADV ADV _ _ _ _ _
    3 chav chav VERB VT Person=3|ObjPerson=3,0 _ _ _ _
    4 qoH qoH NOUN N _ _ _ _ _
    5 . . PUNCT PUNCT _ _ _ _ _
    # qanchoHpa' qoH, Hegh qoH.
    1 qanchoHpa' qan VERB V?.pa' Person=3|ObjPerson=3,0 _ _ _ SuffixV3=-choH|SuffixV9=-pa'
    2 qoH qoH NOUN N _ _ _ _ _
    3 , , PUNCT PUNCT _ _ _ _ _
    4 Hegh Hegh VERB VT Person=3|ObjPerson=3,0 _ _ _ _
    5 qoH qoH NOUN N _ _ _ _ _
    6 . . PUNCT PUNCT _ _ _ _ _

In this example the tagger made a mistake: it classified the first **Hegh** as VT, although it should be N. I don't have a correctly tagged corpus, so evaluating the tagger is currently impossible. :(
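
If a hand-corrected corpus became available, one simple evaluation would be to compare the XPOS column of the tagger output with that of the corrected file. The sketch below only illustrates the idea. The "gold" fragment is hypothetical: it is the first sentence above with **Hegh** changed to N as described, and it assumes tab-separated CoNLL-U lines. ``xpos_column`` and ``xpos_accuracy`` are illustrative names, not library functions.

.. code:: python

    def xpos_column(conllu):
        """Return the XPOS field (fifth column) of every token line."""
        tags = []
        for line in conllu.splitlines():
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and sentence comments
            tags.append(line.split("\t")[4])
        return tags


    def xpos_accuracy(predicted, gold):
        """Share of tokens whose XPOS tag matches the gold tag."""
        predicted_tags = xpos_column(predicted)
        gold_tags = xpos_column(gold)
        if not gold_tags or len(predicted_tags) != len(gold_tags):
            raise ValueError("both inputs must contain the same tokens")
        matches = sum(p == g for p, g in zip(predicted_tags, gold_tags))
        return matches / len(gold_tags)


    # Tagger output for the first sentence above, rebuilt with tabs.
    predicted_lines = [
        "# Hegh neH chav qoH.",
        "1\tHegh\tHegh\tVERB\tVT\tPerson=3|ObjPerson=3,0\t_\t_\t_\t_",
        "2\tneH\tneH\tADV\tADV\t_\t_\t_\t_\t_",
        "3\tchav\tchav\tVERB\tVT\tPerson=3|ObjPerson=3,0\t_\t_\t_\t_",
        "4\tqoH\tqoH\tNOUN\tN\t_\t_\t_\t_\t_",
        "5\t.\t.\tPUNCT\tPUNCT\t_\t_\t_\t_\t_",
    ]

    # Hypothetical gold data: the same sentence with Hegh tagged as a noun.
    gold_lines = list(predicted_lines)
    gold_lines[1] = "1\tHegh\tHegh\tNOUN\tN\t_\t_\t_\t_\t_"

    PREDICTED = "\n".join(predicted_lines)
    GOLD = "\n".join(gold_lines)

    print(xpos_accuracy(PREDICTED, GOLD))  # 4 of 5 tokens match: 0.8

Punctuation is counted here like any other token; whether to exclude it is a design choice left open by this sketch.
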
Copyright
---------
yajwiz (c) 2020 Iikka Hauhio
This program uses the boQwI' dictionary (``data.json``), which is licensed under the Apache License 2.0.
The Python files are also licensed under the Apache License 2.0. See the LICENSE file for more details.