https://github.com/gtoffoli/spacy-cameltokenizer
Tokenizer extension for the Arabic language (MSA), integrating the Morphological Tokenizer of the camel_tools project (CAMeL Lab).
- Host: GitHub
- URL: https://github.com/gtoffoli/spacy-cameltokenizer
- Owner: gtoffoli
- License: MIT
- Created: 2024-02-14T22:21:50.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2024-08-29T10:41:19.000Z (5 months ago)
- Last Synced: 2024-08-29T16:51:57.384Z (5 months ago)
- Topics: arabic, nlp, spacy, spacy-pipeline, tokenizer, tools
- Language: Python
- Homepage:
- Size: 1.6 MB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
  - License: LICENSE
README
# spacy-cameltokenizer
***The cameltokenizer package***
*cameltokenizer* wraps part of the *camel_tools* library and extends it to perform *morphological tokenization* downstream of the standard *spaCy* tokenizer. Based on the kind of input it gets, *cameltokenizer* reconfigures itself to work either:
- as a complete tokenizer, inside the *training pipeline*, by subclassing the standard *Tokenizer* class (see the sketch after this list);
- as an extension of the *processing pipeline*.
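For the training-pipeline mode, spaCy v3 allows a custom tokenizer to be plugged in through its *tokenizers* registry. The following is a minimal sketch of that mechanism, under hypothetical names (*MorphTokenizer*, *morph_tokenizer.v1*); it is not the actual *cameltokenizer* code, which is written in Cython:

```python
from spacy.tokenizer import Tokenizer
from spacy.util import registry

class MorphTokenizer(Tokenizer):
    """Hypothetical subclass of spaCy's standard Tokenizer."""

    def __call__(self, text):
        doc = super().__call__(text)  # standard rule-based tokenization first
        # ... morphological splitting of clitics would be applied here ...
        return doc

@registry.tokenizers("morph_tokenizer.v1")  # hypothetical registry name
def create_morph_tokenizer():
    def make_tokenizer(nlp):
        return MorphTokenizer(nlp.vocab)
    return make_tokenizer
```

The registered name can then be referenced from the training config, in the `[nlp.tokenizer]` block, as `@tokenizers = "morph_tokenizer.v1"`.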
***The use context of cameltokenizer***

To support the *Arabic* language (MSA) inside the spaCy framework with a trained language model, which was still missing when *cameltokenizer* was developed, we devised a solution including two packages, besides the *spaCy* distribution:
- the *cameltokenizer* package, which implements a tokenizer extension; it is written in *Cython*, in order to interface with spaCy code that is itself also written in Cython;
- a package named *ar_core_news_md*, similar to other, similarly named language packages;

plus:
- some initialization code, which *registers* a pipeline component to be called first, so that it gets its input (a *Doc*) from the standard tokenizer (see the sketch after this list);
- a customization of the *punctuation* module and of `__init__.py` inside *spacy.lang.ar*.
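The registration step mentioned above can be illustrated with the standard spaCy v3 API; the component name and body below are hypothetical placeholders, not the actual *cameltokenizer* code:

```python
import spacy
from spacy.language import Language

@Language.component("camel_morph_splitter")  # hypothetical component name
def camel_morph_splitter(doc):
    # Receives the Doc produced by the standard tokenizer; the real
    # component would retokenize it morphologically (body omitted here).
    return doc

nlp = spacy.blank("ar")
nlp.add_pipe("camel_morph_splitter", first=True)  # called before all other components
```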
***How cameltokenizer works***

The model training is based on the *Universal Dependencies Arabic-PADT treebank*. We "clean" it beforehand by running the function *ar_padt_fix_conllu*, which is defined in the *utils* module; it
- removes the *Tatweel/Kashida* character (a purely typographic mark) and the *Superscript Alef* character from the sentences;
- removes the *vocalization diacritics* from the sentences and from the token text, to improve *token alignment* when running the spaCy *debug data* command; eventually, we removed them from the lemmas as well, obtaining some minor improvements (a character-cleaning sketch follows this list);
- "fixes" the splitting of a few composite *particles*, to avoid cases of "destructive" (non-conservative) tokenization;
- for the same reason, "fixes" the splitting of words starting with a *Lam-Lam* sequence, removing the intermediate *Alif* inserted by the annotator.
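The character-level part of this cleaning amounts to stripping a handful of Unicode code points. A minimal sketch, assuming plain strings as input; it is not the actual *ar_padt_fix_conllu* code, which operates on .conllu files and also handles the particle and *Lam-Lam* fixes:

```python
import re

TATWEEL = '\u0640'           # Tatweel/Kashida, a purely typographic character
SUPERSCRIPT_ALEF = '\u0670'  # Arabic letter Superscript Alef
DIACRITICS = re.compile('[\u064B-\u0652]')  # vocalization marks, Fathatan .. Sukun

def clean_arabic(text: str) -> str:
    """Strip typographic and vocalization marks from Arabic text."""
    text = text.replace(TATWEEL, '').replace(SUPERSCRIPT_ALEF, '')
    return DIACRITICS.sub('', text)

print(clean_arabic('كِتَابٌ'))  # -> 'كتاب' (diacritics removed)
```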
We think that, without the morphological tokenization, no parsing would be possible at all. For that task, we chose the *MorphologicalTokenizer* of *CamelTools*. It mainly looks for prefixed prepositions and suffixed pronouns, not for declension and conjugation affixes.
To its input and output we apply, in an admittedly dirty and inefficient way, a lot of small "fixes" aimed at reducing the number of *misaligned tokens*, that is, at better matching the tokenization done by the annotators of the training set (the .conllu files); possibly, these fixes result in some *overfitting*.
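For reference, this is roughly how the *CamelTools* morphological tokenizer can be driven directly; the *atbtok* scheme and the sample sentence are illustrative choices, and *cameltokenizer* wraps this machinery rather than using it verbatim:

```python
from camel_tools.disambig.mle import MLEDisambiguator
from camel_tools.tokenizers.morphological import MorphologicalTokenizer
from camel_tools.tokenizers.word import simple_word_tokenize

# A pretrained maximum-likelihood disambiguator drives the tokenizer.
mle = MLEDisambiguator.pretrained()
tokenizer = MorphologicalTokenizer(disambiguator=mle, scheme='atbtok', split=True)

words = simple_word_tokenize('والكتاب للطالب')  # "and the book is the student's"
print(tokenizer.tokenize(words))  # clitics (wa+, li+) split off as separate tokens
```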
***Some results***

The use of cameltokenizer has allowed us to drastically improve the results of the spaCy commands *debug data* and *train* over those obtained with the native tokenizer, even if the *accuracy* evaluated with the spaCy *benchmark* command is not very good:
- *debug data* - The percentage of *misaligned tokens* is slightly more than 1%; it was over 16% at the start, without the tokenizer extensions;
- *train* - The overall best score in training a *pipeline* with *tagger*, *trainable lemmatizer* and *parser* is 0.88; it was 0.66 at the start;
- *benchmark accuracy* - see the table below.

| Metric | Score |
|--------|-------|
| TOK    | 99.19 |
| TAG    | 88.63 |
| POS    | 94.09 |
| MORPH  | 88.83 |
| LEMMA  | 94.53 |
| UAS    | 80.24 |
| LAS    | 73.01 |
| SENT P | 60.60 |
| SENT R | 71.03 |
| SENT F | 65.40 |
| SPEED  | 1607  |

The performance above is worse than that of the spaCy language models for most European languages. We won't discuss here why dealing with the Arabic language is more complex; our intent is just to understand whether this performance is acceptable for some *text-analysis* tasks, namely tasks related to *linguistic education*.
***Some caveats***
- *cameltokenizer* is work in progress;
- the time performance is poor; this is mainly due to the heavy task carried out by the *MorphologicalTokenizer*;
- our code needs a lot of cleaning;
- currently, we manually create the package *ar_core_news_md* inside the *site-packages* directory of a Python *virtual environment* and put inside it a *symlink* to the *output/model-best* directory produced by the training pipeline, which contains the individual trained models (a sketch of this step follows below).

More information on the problems encountered and on the motivations of some choices can be found in the *discussion* https://github.com/explosion/spaCy/discussions/7146
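A minimal sketch of that manual packaging step; the paths are assumptions to be adapted to one's own virtual environment and training output:

```python
import os
import site

# Assumed locations; adjust to your own setup.
site_packages = site.getsitepackages()[0]
pkg_dir = os.path.join(site_packages, "ar_core_news_md")
os.makedirs(pkg_dir, exist_ok=True)

# Symlink the best model produced by the training pipeline into the package.
model_src = os.path.abspath(os.path.join("output", "model-best"))
os.symlink(model_src, os.path.join(pkg_dir, "model-best"), target_is_directory=True)
```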