Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/lgug2z/tashkil
A lightweight Rust library for removing Arabic diacritics
arabic dari diacritics diacritics-removal pashto persian rust text-processing urdu
- Host: GitHub
- URL: https://github.com/lgug2z/tashkil
- Owner: LGUG2Z
- License: mit
- Created: 2022-10-16T21:27:38.000Z (about 2 years ago)
- Default Branch: master
- Last Pushed: 2022-10-16T21:31:28.000Z (about 2 years ago)
- Last Synced: 2024-11-09T18:46:23.379Z (about 2 months ago)
- Topics: arabic, dari, diacritics, diacritics-removal, pashto, persian, rust, text-processing, urdu
- Language: Rust
- Homepage:
- Size: 2.93 KB
- Stars: 19
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Tashkil
A lightweight Rust library for removing Arabic diacritics (تَشْكِيل)

This library exposes a single function, `tashkil::remove()`, which removes from a `&str` all diacritics in the [Unicode specification for the Arabic alphabet and its variants](https://www.unicode.org/charts/PDF/U0600.pdf).
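To make the idea of diacritic removal concrete, here is a minimal standalone sketch that uses only the standard library. It is not the crate's implementation: the names `is_arabic_diacritic` and `remove_diacritics` are hypothetical, and the set of code points covered (U+064B to U+065F plus U+0670) is an assumption chosen for illustration, which may be narrower than what `tashkil::remove()` strips.

```rust
/// Hypothetical helper, not part of the crate: returns `true` for the
/// combining marks that this sketch treats as diacritics.
/// Assumption: only U+064B..=U+065F and U+0670 are covered here.
fn is_arabic_diacritic(c: char) -> bool {
    matches!(
        c,
        '\u{064B}'..='\u{065F}' // fathatan through wavy hamza below
        | '\u{0670}'            // superscript alef
    )
}

/// Hypothetical stand-in for a diacritic-stripping function: copies every
/// character of `input` except those flagged by `is_arabic_diacritic`.
fn remove_diacritics(input: &str) -> String {
    input
        .chars()
        .filter(|c| !is_arabic_diacritic(*c))
        .collect()
}
```

With this sketch, passing the vocalized word `تَشْكِيل` to `remove_diacritics` yields the bare letters `تشكيل`, while text without any of these marks comes back unchanged.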
It is my hope that this library can be used to improve search results in [Meilisearch](https://github.com/meilisearch/MeiliSearch/) for languages using the Arabic alphabet and its variants, similarly to how [`niqqud`](https://github.com/benny-n/niqqud) has been used to [improve search results for Hebrew](https://docs.meilisearch.com/learn/advanced/tokenization.html#deep-dive-the-meilisearch-tokenizer).