Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/mush42/libtashkeel
Add Arabic diacritics (tashkeel/harakat) using Rust/Python/C++/WASM and NLP models
arabic diacritics nlp tashkeel
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/mush42/libtashkeel
- Owner: mush42
- License: mit
- Created: 2023-03-14T21:47:51.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-10-29T17:40:01.000Z (4 months ago)
- Last Synced: 2024-10-29T18:59:05.558Z (4 months ago)
- Topics: arabic, diacritics, nlp, tashkeel
- Language: Rust
- Homepage: https://mush42.github.io/libtashkeel/
- Size: 96.8 MB
- Stars: 20
- Watchers: 4
- Forks: 2
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
# Libtashkeel
`Libtashkeel` is a cross-platform library for diacritic restoration of Arabic text.
`Libtashkeel` is written in Rust, and provides both a **standalone** linkable library and a command line tool.
The library uses models from [Hareef](https://github.com/mush42/hareef), trained mainly on Modern Standard Arabic (MSA) data.
## Getting Libtashkeel
You need to build the project yourself. See the **Building** section for a step-by-step guide.
## Usage
### Using the library
To use `Libtashkeel` from your C/C++ project, just include [libtashkeel.h](./libtashkeel/libtashkeel.h) and you are good to go.
The API consists of a single entry point for diacritizing a **UTF-8** encoded string. Please take a look at [ffi_usage_example.py](./ffi_usage_example.py) for sample usage.
### From Python
**Python** bindings are also provided.
After building the wheels (see the **Building** section), install the wheel using `pip`:
```bash
pip install ./target/wheels/pylibtashkeel*.whl
```

and then:
```python
>>> from pylibtashkeel import tashkeel
>>> tashkeel("إن روعة اللغة العربية لا تتبدى إلا لعشاقها")
'إِنَّ رَوْعَةَ اللُّغَةِ الْعَرَبِيَّةِ لَا تَتَبَدَّى إِلَّا لِعُشَّاقِهَا'
```

### Command line tool
`Libtashkeel` provides a standalone executable called **tashkeel** for diacritizing text from the command line.
```bash
$ tashkeel --help
Arabic-text diacritic restoration using neural networks

Usage: tashkeel [OPTIONS]

Options:
  -f, --input-file    Input file (default `stdin`)
  -o, --output-file   Output file (default `stdout`)
  -i, --interactive   Use interactive mode (useful for testing)
  -t, --taskeen       Use sukoon for case-ending diacritic if the model is uncertain
  -p, --prob          Taskeen threshold probability [default: 0.95]
  -x, --onnx          ONNX model (default: use bundled model if available)
  -h, --help          Print help
  -V, --version       Print version
```
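The `--taskeen` and `--prob` options describe a fallback for uncertain case endings. The following Python sketch illustrates one way such a rule could work. It assumes the model yields one diacritic and one confidence value per letter of a word. The function `apply_taskeen` and its inputs are hypothetical and are not part of the `Libtashkeel` API, and the actual implementation may differ.

```python
# Illustrative only: a hypothetical helper, not the Libtashkeel implementation.
SUKOON = "\u0652"  # ARABIC SUKUN


def apply_taskeen(letters, diacritics, probs, threshold=0.95):
    """Attach one diacritic to each letter of a single word.

    If the confidence for the final letter (the case ending) is below
    `threshold`, the predicted diacritic is replaced with sukoon.
    """
    out = []
    last = len(letters) - 1
    for i, (letter, mark, prob) in enumerate(zip(letters, diacritics, probs)):
        if i == last and prob < threshold:
            mark = SUKOON  # uncertain case ending: fall back to sukoon
        out.append(letter + mark)
    return "".join(out)


FATHA = "\u064E"  # ARABIC FATHA
word = ["ك", "ت", "ب"]

# Confident case ending: the predicted fatha is kept.
confident = apply_taskeen(word, [FATHA] * 3, [0.99, 0.98, 0.97])

# Uncertain case ending (0.60 < 0.95): the final fatha becomes sukoon.
uncertain = apply_taskeen(word, [FATHA] * 3, [0.99, 0.98, 0.60])
```

Under this reading, a higher `--prob` value makes the fallback apply more often, because fewer case-ending predictions clear the threshold.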
## Building
`Libtashkeel` is written in **Rust**, so [you need to install Rust first](https://www.rust-lang.org/tools/install).
To build the linkable library `libtashkeel` and the command line tool `tashkeel`, run the following command from the root of the repository:
```bash
$ cargo build --release
```

The built library and executable can then be found under the `target` directory.
To build the **Python** bindings as a wheel, you need to install [maturin](https://github.com/pyo3/maturin).
Run the following to build the wheel:
```bash
$ cd pylibtashkeel
$ python3 -m venv .venv
$ source .venv/bin/activate
$ pip install maturin
$ maturin build --release --strip -i .venv/bin/python
```

The built wheel can then be found under the `target/wheels` directory.
# Licence
Copyright (c) Musharraf Omer. This project is licenced under the terms of the MIT License.