https://github.com/fergusq/yajwiz

Klingon morphological analyzer and other NLP tools
https://github.com/fergusq/yajwiz

klingon morphological-analysis

Last synced: 9 months ago
JSON representation

Klingon morphological analyzer and other NLP tools

Host: GitHub
URL: https://github.com/fergusq/yajwiz
Owner: fergusq
License: apache-2.0
Created: 2020-10-14T21:40:25.000Z (over 5 years ago)
Default Branch: master
Last Pushed: 2024-04-21T18:26:31.000Z (about 2 years ago)
Last Synced: 2025-07-28T15:15:53.946Z (11 months ago)
Topics: klingon, morphological-analysis
Language: Python
Homepage:
Size: 8 MB
Stars: 1
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.rst
- License: LICENSE

Awesome Lists containing this project

README

          yajwI'

======

**yajwI'** is a Klingon NLP toolkit that includes basic tokenization, morphological analysis and POS tagging.

It heavily uses the `boQwI' dictionary `_.

Installation

------------

yajwI' requires Python 3.8 or newer.

It can be installed from PyPI::

    pip install yajwiz

Updating and using the boQwI' dictionary

----------------------------------------

When yajwI' is first imported, it will download a copy of the boQwI' dictionary.

After this the ``update_dictionary()`` function must be called whenever the dictionary needs to be updated.

The function will check for updates and install them.

The downloaded dictionary can be accessed through the ``load_dictionary()`` function.

>>> import yajwiz

>>> yajwiz.update_dictionary()

>>> dictionary = yajwiz.load_dictionary()

>>> dictionary.version

'2021.03.18a'

Tokenization

------------

The library includes very simple tokenization.

>>> import yajwiz

>>> yajwiz.tokenize("Hegh neH chav qoH. qanchoHpa' qoH, Hegh qoH.")

[('WORD', 'Hegh'), ('SPACE', ' '), ('WORD', 'neH'), ('SPACE', ' '), ('WORD', 'chav'), ('SPACE', ' '), ('WORD', 'qoH'), ('PUNCT', '.'), ('SPACE', ' '), ('WORD', "qanchoHpa'"), ('SPACE', ' '), ('WORD', 'qoH'), ('PUNCT', ','), ('SPACE', ' '), ('WORD', 'Hegh'), ('SPACE', ' '), ('WORD', 'qoH'), ('PUNCT', '.')]

Morphological analysis

----------------------

The ``yajwiz.analyze`` function parses a word and returns a list of possible parses and a lot of extra information.

>>> yajwiz.analyze("yInwI'")

[{'BOQWIZ_ID': 'yIn:n',

  'BOQWIZ_POS': 'n:klcp1',

  'LEMMA': 'yIn',

  'PARTS': ['yIn:n', "-wI':n"],

  'POS': 'N',

  'SUFFIX': {'N4': "-wI'"},

  'UNGRAMMATICAL': 'ILLEGAL PLURAL OR POSSESSIVE SUFFIX',

  'WORD': "yInwI'",

  'XPOS': 'N',

  'XPOS_GSUFF': 'N'},

 {'BOQWIZ_ID': 'yIn:v',

  'BOQWIZ_POS': 'v:t_c,klcp1',

  'LEMMA': 'yIn',

  'PARTS': ['yIn:v', "-wI':v"],

  'POS': 'V',

  'SUFFIX': {'V9': "-wI'"},

  'WORD': "yInwI'",

  'XPOS': 'VT',

  'XPOS_GSUFF': "VT.wI'"}]

Currently the analyzer is very permissive and does allow using wrong plurals and possessive suffixes (eg. **yInwI'** instead of **yInwIj**). It will try to mark this kind of errors with ``'UNGRAMMATICAL': True``. It detects the following errors:

- Using **-pu'**, **-wI'**, **-lI'**, etc. when the noun is not a person noun

- Using **-Du'** when the noun is not a body part

- Using **-vIS** without using **-taH**

- Using **-lu'** with an illegal verb prefix

- Using intransitive verbs with prefixes indicating object

- Using **-ghach** without any other verb suffix

- Using aspect suffix with **-jaj**

There is also a simpler function ``yajwiz.split_to_morphemes``, that returns a set of tuples of strings (usually there will be only one tuple in the set):

>>> yajwiz.split_to_morphemes("yInwI'")

{('yIn', "-wI'")}

List of Parts of Speech

.......................

===== ===========

XPOS  Explanation

===== ===========

VS    Stative verb

VT    Transitive verb

VI    Intransitive verb

VA    Transitive and intransitive verb

V?    Verb with unknown transitivity

NL    Person noun

NB    Body part noun

PRON  Pronoun (including **'Iv** and **nuq**: it is a noun that can function as a copula)

NUM   Number

N     Other noun

ADV   Adverb

EXCL  Exclamation

CONJ  Conjunction

QUES  Question word (other than **'Iv** and **nuq**)

UNK   Unknown

===== ===========

Grammar checker

---------------

yajwI' can be used to find common grammar errors. You can either use the method ``yajwiz.grammar_check`` or the following command line interface:

.. code::

    python -m yajwiz.grammar_check file.txt

CONLL-U files and POS tagger

----------------------------

CONLL-U files are a popular data format for storing annotated linguistic data.

yajwI' can generate CONLL-U files filled with morphological information (it does not support dependency parsing).

Below is an example script that first parses a text without a trained POS tagger,

then trains a POS tagger with it and finally parses the text with the tagger and saves the result to a CONLL-U file.

.. code:: python

    import yajwiz

    with open("prose-corpus.txt", "r") as f:

        text = f.read()

    conllu = yajwiz.text_to_conllu(text)

    tagger = yajwiz.Tagger()

    tagger.train(yajwiz.conllu_to_tagged_list(conllu))

    conllu = yajwiz.text_to_conllu(text, tagger)

    with open("prose-corpus.conllu", "w") as f:

        f.write(conllu)

Without a trained POS tagger, ambiguous words will be left without a tag:

.. code::

    # Hegh neH chav qoH.

    1	Hegh	_	_	_	_	_	_	_	_

    2	neH	_	_	_	_	_	_	_	_

    3	chav	_	_	_	_	_	_	_	_

    4	qoH	qoH	NOUN	N	_	_	_	_	_

    5	.	.	PUNCT	PUNCT	_	_	_	_	_

    # qanchoHpa' qoH, Hegh qoH.

    1	qanchoHpa'	qan	VERB	V?.pa'	Person=3|ObjPerson=3,0	_	_	_	SuffixV3=-choH|SuffixV9=-pa'

    2	qoH	qoH	NOUN	N	_	_	_	_	_

    3	,	,	PUNCT	PUNCT	_	_	_	_	_

    4	Hegh	_	_	_	_	_	_	_	_

    5	qoH	qoH	NOUN	N	_	_	_	_	_

    6	.	.	PUNCT	PUNCT	_	_	_	_	_

After training the tagger, it will take the "best guess" when deciding the POS.

.. code::

    # Hegh neH chav qoH.

    1	Hegh	Hegh	VERB	VT	Person=3|ObjPerson=3,0	_	_	_	_

    2	neH	neH	ADV	ADV	_	_	_	_	_

    3	chav	chav	VERB	VT	Person=3|ObjPerson=3,0	_	_	_	_

    4	qoH	qoH	NOUN	N	_	_	_	_	_

    5	.	.	PUNCT	PUNCT	_	_	_	_	_

    # qanchoHpa' qoH, Hegh qoH.

    1	qanchoHpa'	qan	VERB	V?.pa'	Person=3|ObjPerson=3,0	_	_	_	SuffixV3=-choH|SuffixV9=-pa'

    2	qoH	qoH	NOUN	N	_	_	_	_	_

    3	,	,	PUNCT	PUNCT	_	_	_	_	_

    4	Hegh	Hegh	VERB	VT	Person=3|ObjPerson=3,0	_	_	_	_

    5	qoH	qoH	NOUN	N	_	_	_	_	_

    6	.	.	PUNCT	PUNCT	_	_	_	_	_

In this example the tagger made a mistake: it classified the first **Hegh** as VT, although it should be N. I don't have a correctly tagged corpus, so evaluating the tagger is currently impossible. :(

Copyright

---------

yajwiz (c) 2020 Iikka Hauhio

This program a uses the `boQwI' dictionary `_ (``data.json``) that is licensed under the Apache License 2.0.

The Python files are also licensed under the Apache License 2.0. See the LICENSE file for more details.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/fergusq/yajwiz

Awesome Lists containing this project

README