https://github.com/gtoffoli/spacy-ar_core_news_md

Unofficial Arabic language model for spaCy

arabic-language camel nlp python spacy spacy-pipeline tokenizer

Host: GitHub
URL: https://github.com/gtoffoli/spacy-ar_core_news_md
Owner: gtoffoli
License: mit
Created: 2024-06-14T09:09:50.000Z (8 months ago)
Default Branch: main
Last Pushed: 2024-08-29T20:37:50.000Z (5 months ago)
Last Synced: 2024-09-30T05:40:51.544Z (4 months ago)
Topics: arabic-language, camel, nlp, python, spacy, spacy-pipeline, tokenizer
Language: Python
Homepage:
Size: 7.27 MB
Stars: 3
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# Unofficial Arabic pipeline for the *spaCy* framework

## About

Basic information on this release can be found in the README of the package https://github.com/gtoffoli/spacy-cameltokenizer, which constitutes a prerequisite, together with the *CAMeL Tools* library by *CAMeL-Lab* (https://github.com/CAMeL-Lab/camel_tools).

Further information on the problems encountered and on the motivations behind some choices can be found in the *discussion* https://github.com/explosion/spaCy/discussions/7146

## Installation

I assume that you work in a Python "virtual environment" (*venv*), where you may already have installed *spaCy*.

You also need a local *git* directory to clone 2 packages from *GitHub*:

```
git clone https://github.com/gtoffoli/spacy-cameltokenizer.git
git clone https://github.com/gtoffoli/spacy-ar_core_news_md.git
```

In the *site-packages* directory of your *venv*, create 2 *symbolic links*:

- `cameltokenizer`, linking to the `cameltokenizer` sub-directory of the local `spacy-cameltokenizer` repository;

- `ar_core_news_md`, linking to the `ar_core_news_md` sub-directory of the local `spacy-ar_core_news_md` repository.

Also in the *site-packages* directory, create the sub-directory `ar_core_news_md-1.1.0.dist-info` and copy into it the `METADATA` file from the top-level folder of the `spacy-ar_core_news_md` repository.
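
The manual linking steps above can also be scripted. What follows is only a minimal sketch, assuming that both repositories were cloned side by side under one directory; the function name `link_pipeline_packages` and its parameters are hypothetical, and on Windows creating symbolic links may require elevated privileges.

```python
import shutil
from pathlib import Path


def link_pipeline_packages(repos_root: Path, site_packages: Path) -> None:
    """Link both packages into site-packages and add the dist-info metadata.

    Hypothetical helper: `repos_root` is assumed to contain the two cloned
    repositories, `site_packages` is the site-packages directory of the venv.
    """
    links = {
        "cameltokenizer": repos_root / "spacy-cameltokenizer" / "cameltokenizer",
        "ar_core_news_md": repos_root / "spacy-ar_core_news_md" / "ar_core_news_md",
    }
    for name, target in links.items():
        link = site_packages / name
        # Leave an existing link or directory untouched.
        if not link.exists() and not link.is_symlink():
            link.symlink_to(target.resolve(), target_is_directory=True)

    # Create the dist-info directory and copy the METADATA file into it.
    dist_info = site_packages / "ar_core_news_md-1.1.0.dist-info"
    dist_info.mkdir(exist_ok=True)
    shutil.copy(repos_root / "spacy-ar_core_news_md" / "METADATA",
                dist_info / "METADATA")
```

Inside the active *venv*, the *site-packages* path can usually be obtained with `sysconfig.get_paths()["purelib"]`.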

Finally, install *spaCy* (if needed) and the *CAMeL Tools* library:

```
pip install spacy
pip install camel-tools
```
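
Before going further, it may help to check that everything installed or linked so far can actually be located by Python. The snippet below is a sketch, not part of the package: `missing_modules` and `REQUIRED` are hypothetical names, and the list assumes that the import name of the `camel-tools` distribution is `camel_tools`.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that Python cannot locate."""
    return [name for name in names if find_spec(name) is None]


# Import names assumed from the installation steps above.
REQUIRED = ["spacy", "camel_tools", "cameltokenizer", "ar_core_news_md"]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    print("missing:", ", ".join(missing) if missing else "none")
```

Run it with the interpreter of the *venv*; any name it reports points to a step that still needs to be completed.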

## spaCy customization

Replace 2 modules in the `spacy/lang/ar` subdirectory of the `spaCy` directory in *site-packages*, taking the new ones from the `spacy_lang_ar_custom` sub-directory of the local `spacy-ar_core_news_md` repository:

- `__init__.py`

- `punctuation.py`
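
The replacement of the two modules can be scripted as well. This is a minimal sketch under the same assumptions about the directory layout; `replace_ar_modules` is a hypothetical helper, and keeping a `.orig` copy of each original file is an addition of this sketch, not something the package requires.

```python
import shutil
from pathlib import Path

CUSTOM_MODULES = ("__init__.py", "punctuation.py")


def replace_ar_modules(repo_dir: Path, site_packages: Path) -> None:
    """Overwrite spaCy's Arabic language modules with the customized ones.

    Hypothetical helper: `repo_dir` is the local `spacy-ar_core_news_md`
    repository, `site_packages` is the site-packages directory of the venv.
    """
    source_dir = repo_dir / "spacy_lang_ar_custom"
    target_dir = site_packages / "spacy" / "lang" / "ar"
    for name in CUSTOM_MODULES:
        target = target_dir / name
        backup = target_dir / (name + ".orig")
        # Save the stock module once, so that it can be restored later.
        if target.exists() and not backup.exists():
            shutil.copy2(target, backup)
        shutil.copy2(source_dir / name, target)
```

Note that reinstalling or upgrading *spaCy* restores the stock modules, so the replacement has to be repeated afterwards.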

## Pipeline initialization

In a *settings* module of your application (in my case it is the `settings.py` of a *Django* app), put the following code:

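```
import spacy
from spacy.language import Language

from cameltokenizer import tokenizer

# Load the Arabic pipeline and build the CAMeL-based tokenizer on its vocabulary.
ar = spacy.load('ar_core_news_md')
cameltokenizer = tokenizer.CamelTokenizer(ar.vocab)

# Register the extra tokenization step as a named pipeline component.
@Language.component("cameltokenizer")
def tokenizer_extra_step(doc):
    return cameltokenizer(doc)

# Put the component first, so that it runs right after spaCy's own tokenizer.
ar.add_pipe("cameltokenizer", name="cameltokenizer", first=True)
```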