{"id":34854021,"url":"https://github.com/amirivojdan/shekar","last_synced_at":"2026-02-08T03:09:51.775Z","repository":{"id":268684484,"uuid":"902628968","full_name":"amirivojdan/shekar","owner":"amirivojdan","description":"Simplifying Persian NLP for Modern Applications","archived":false,"fork":false,"pushed_at":"2025-12-25T18:43:47.000Z","size":26099,"stargazers_count":59,"open_issues_count":0,"forks_count":4,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-01-28T01:37:39.925Z","etag":null,"topics":["embeddings","keyword-extraction","lemmatization","morphology","named-entity-recognition","natural-language-processing","ner","nlp","normalization","offensive-language-detection","part-of-speech-tagging","persian","persian-nlp","pos","sentiment-analysis","spell-checker","text-processing","wordcloud"],"latest_commit_sha":null,"homepage":"https://lib.shekar.io","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/amirivojdan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-12-13T00:20:54.000Z","updated_at":"2026-01-13T11:20:13.000Z","dependencies_parsed_at":null,"dependency_job_id":"77f84661-1c04-4851-b738-ba022c5c746f","html_url":"https://github.com/amirivojdan/shekar","commit_stats":null,"previous_names":["amirivojdan/shekar"],"tags_count":36,"template":false,"template_full_name":null,"purl":"pkg:github/amirivojdan/shekar","repository_url":"https://repos.ecosys
te.ms/api/v1/hosts/GitHub/repositories/amirivojdan%2Fshekar","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amirivojdan%2Fshekar/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amirivojdan%2Fshekar/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amirivojdan%2Fshekar/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/amirivojdan","download_url":"https://codeload.github.com/amirivojdan/shekar/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amirivojdan%2Fshekar/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29218701,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-08T02:25:35.815Z","status":"ssl_error","status_checked_at":"2026-02-08T02:24:27.970Z","response_time":57,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["embeddings","keyword-extraction","lemmatization","morphology","named-entity-recognition","natural-language-processing","ner","nlp","normalization","offensive-language-detection","part-of-speech-tagging","persian","persian-nlp","pos","sentiment-analysis","spell-checker","text-processing","wordcloud"],"created_at":"2025-12-25T19:57:18.015Z","updated_at":"2026-02-08T03:09:51.769Z","avatar_url":"https://github.com/amirivojdan.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n![Shekar](https://amirivojdan.io/wp-content/uploads/2025/01/shekar-lib.png)\n\n\u003cp align=\"center\"\u003e\n\u003ca href=\"https://pypi.python.org/pypi/shekar\" target=\"_blank\"\u003e\u003cimg alt=\"PyPI - Version\" src=\"https://img.shields.io/pypi/v/shekar?color=00A693\"\u003e\u003c/a\u003e\n\u003ca href=\"https://pypi.python.org/pypi/shekar\" target=\"_blank\"\u003e\u003cimg alt=\"GitHub Actions Workflow Status\" src=\"https://img.shields.io/github/actions/workflow/status/amirivojdan/shekar/test.yml?color=00A693\"\u003e\u003c/a\u003e\n\u003ca href=\"https://pypi.python.org/pypi/shekar\" target=\"_blank\"\u003e\u003cimg alt=\"Codecov\" src=\"https://img.shields.io/codecov/c/github/amirivojdan/shekar?color=00A693\"\u003e\u003c/a\u003e\n\u003ca href=\"https://pypi.python.org/pypi/shekar\" target=\"_blank\"\u003e\u003cimg alt=\"PyPI - License\" src=\"https://img.shields.io/pypi/l/shekar?color=00A693\"\u003e\u003c/a\u003e\n\u003ca href=\"https://pypi.python.org/pypi/shekar\" 
target=\"_blank\"\u003e\u003cimg alt=\"PyPI - Python Version\" src=\"https://img.shields.io/pypi/pyversions/shekar?color=00A693\"\u003e\u003c/a\u003e\n\u003ca href=\"https://doi.org/10.21105/joss.09128\" target=\"_blank\"\u003e\n\u003cimg alt=\"Static Badge\" src=\"https://img.shields.io/badge/JOSS-10.21105%2Fjoss.09128-00A693\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n    \u003cem\u003eSimplifying Persian NLP for Modern Applications\u003c/em\u003e\n\u003c/p\u003e\n\n**Shekar** is an open-source Python library for Persian natural language processing, inspired by the satirical story *[فارسی شکر است (Persian is Sugar)](https://fa.wikipedia.org/wiki/%D9%81%D8%A7%D8%B1%D8%B3%DB%8C_%D8%B4%DA%A9%D8%B1_%D8%A7%D8%B3%D8%AA)* by Mohammad Ali Jamalzadeh. Reflecting its emphasis on clear and accessible language, Shekar provides fast, modular tools for Persian text processing, including normalization, tokenization, POS tagging, NER, embeddings, and spell checking, enabling reproducible workflows for both research and production.\n\n## Why Shekar?\n\n- **Advanced text normalization**: Built for the complexity of Persian text.\n- **Blazing fast and production-ready**: Optimized for large-scale processing and real-time use.\n- **Modular and highly customizable**: Independent, composable components for flexible NLP pipelines.\n- **Lightweight and efficient**: Minimal dependencies and small models for fast CPU inference.  \n- **Reliable and well-tested**: Backed by **hundreds of test cases** with **95%+ code coverage**.\n\n## Installation\n\nYou can install Shekar with pip. 
By default, the `CPU` runtime of ONNX is included, which works on all platforms.\n\n### CPU Installation (All Platforms)\n\n\u003c!-- termynal --\u003e\n```bash\n$ pip install shekar\n```\nThis works on **Windows**, **Linux**, and **macOS** (including Apple Silicon M1/M2/M3).\n\n### GPU Acceleration (NVIDIA CUDA)\nIf you have an NVIDIA GPU and want hardware acceleration, you need to replace the CPU runtime with the GPU version.\n\n**Prerequisites**\n\n- NVIDIA GPU with CUDA support\n- Appropriate CUDA Toolkit installed\n- Compatible NVIDIA drivers\n\n\u003c!-- termynal --\u003e\n```bash\n$ pip install shekar \u0026\u0026 pip uninstall -y onnxruntime \u0026\u0026 pip install onnxruntime-gpu\n```\n\n## Preprocessing\n\n[![Notebook](https://img.shields.io/badge/Notebook-Jupyter-00A693.svg)](examples/preprocessing.ipynb)  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/amirivojdan/shekar/blob/main/examples/preprocessing.ipynb)\n\n### Normalizer\n\nThe built-in `Normalizer` class provides a ready-to-use, opinionated normalization pipeline for Persian text. It combines the most common and error-prone normalization steps into a single component, covering the majority of real-world use cases such as web text, social media, OCR output, and mixed informal–formal writing.\n\nMost importantly, the normalization rules in Shekar strictly follow the official guidelines of **[Academy of Persian Language and Literature](https://apll.ir/)** (فرهنگستان زبان و ادب فارسی). 
This makes the output suitable not only for NLP pipelines, but also for linguistically correct and publishable Persian text.\n\n```python\nfrom shekar import Normalizer\n\nnormalizer = Normalizer()\n\ntext = \"«فارسی شِکَر است» نام داستان ڪوتاه طنز    آمێزی از محمد علی جمالــــــــزاده ی گرامی می   باشد که در سال 1921 منتشر  شده است و آغاز   ڱر تحول بزرگی در ادَبێات معاصر ایران بۃ شمار میرود.\"\nprint(normalizer(text))\n\n# نرمال‌سازی نویسه‌های گفتاری و روزمره\ntext = normalizer(\"می دونی که نمیخاستم ناراحتت کنم.اما خونه هاشون خیلی گرون تر شده\")\nprint(text)\n\n# نرمال‌سازی واژه‌های مرکب و افعال پیشوندی! \ntext = normalizer(\"یک کار آفرین نمونه و سخت کوش ، پیروز مندانه از پس دشواری ها برخواهدآمد.\")\nprint(text) \n\n```\n\n```shell\n«فارسی شکر است» نام داستان کوتاه طنزآمیزی از محمد‌علی جمالزاده‌ی گرامی می‌باشد که در سال ۱۹۲۱ منتشر شده‌است و آغازگر تحول بزرگی در ادبیات معاصر ایران به شمار می‌رود.\n\nمی‌دونی که نمی‌خاستم ناراحتت کنم. اما خونه‌هاشون خیلی گرون‌تر شده\n\nیک کارآفرین نمونه و سخت‌کوش، پیروزمندانه از پس دشواری‌ها بر خواهد آمد.\n```\n\n### Customization\n\nShekar is built around a modular and composable preprocessing framework that allows fine-grained control over each step of text processing. Preprocessing is implemented as small, independent operators such as `filters`, `normalizers`, and `maskers`, which can be used on their own or combined into flexible pipelines.\n\nPipelines are constructed using the Pipeline abstraction and composed with the `|` operator, making preprocessing logic explicit, readable, and easy to customize. 
Any operator from the [full list of preprocessing components](https://lib.shekar.io/tutorials/preprocessing/)\n can be freely combined.\n\nFor example, the following pipeline is functionally equivalent to the default normalizer:\n\n```python\nfrom shekar.preprocessing import (\n    PunctuationNormalizer,\n    AlphabetNormalizer,\n    DigitNormalizer,\n    SpacingNormalizer,\n    RemoveDiacritics,\n    RepeatedLetterNormalizer,\n    ArabicUnicodeNormalizer,\n    YaNormalizer,\n)\n\nnormalizer = (\n            AlphabetNormalizer()\n            | ArabicUnicodeNormalizer()\n            | DigitNormalizer()\n            | PunctuationNormalizer()\n            | RemoveDiacritics()\n            | RepeatedLetterNormalizer()\n            | SpacingNormalizer()\n            | YaNormalizer(style=\"joda\")\n        )\n```\n\nOperators can also be composed for lightweight, task-specific preprocessing. For example, removing emojis and punctuation:\n\n```python\nfrom shekar.preprocessing import EmojiRemover, PunctuationRemover\n\ntext = \"ز ایران دلش یاد کرد و بسوخت! 🌍🇮🇷\"\npipeline = EmojiRemover() | PunctuationRemover()\noutput = pipeline(text)\nprint(output)\n```\n\n```shell\nز ایران دلش یاد کرد و بسوخت\n```\n\n## Tokenization\n\n### WordTokenizer\nThe WordTokenizer class in Shekar is a simple, rule-based tokenizer for Persian that splits text based on punctuation and whitespace using Unicode-aware regular expressions.\n\n```python\nfrom shekar import WordTokenizer\n\ntokenizer = WordTokenizer()\n\ntext = \"چه سیب‌های قشنگی! حیات نشئهٔ تنهایی است.\"\ntokens = list(tokenizer(text))\nprint(tokens)\n```\n\n```shell\n[\"چه\", \"سیب‌های\", \"قشنگی\", \"!\", \"حیات\", \"نشئهٔ\", \"تنهایی\", \"است\", \".\"]\n```\n\n### SentenceTokenizer\n\nThe `SentenceTokenizer` class is designed to split a given text into individual sentences. This class is particularly useful in natural language processing tasks where understanding the structure and meaning of sentences is important. 
The `SentenceTokenizer` class can handle various punctuation marks and language-specific rules to accurately identify sentence boundaries.\n\nBelow is an example of how to use the `SentenceTokenizer`:\n\n```python\nfrom shekar.tokenization import SentenceTokenizer\n\ntext = \"هدف ما کمک به یکدیگر است! ما می‌توانیم با هم کار کنیم.\"\ntokenizer = SentenceTokenizer()\nsentences = tokenizer(text)\n\nfor sentence in sentences:\n    print(sentence)\n```\n\n```output\nهدف ما کمک به یکدیگر است!\nما می‌توانیم با هم کار کنیم.\n```\n\n## Embeddings\n\n[![Notebook](https://img.shields.io/badge/Notebook-Jupyter-00A693.svg)](examples/embeddings.ipynb)  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/amirivojdan/shekar/blob/main/examples/embeddings.ipynb)\n\n**Shekar** offers two main embedding classes:\n\n- **`WordEmbedder`**: Provides static word embeddings using pre-trained FastText models.\n- **`ContextualEmbedder`**: Provides contextual embeddings using a fine-tuned ALBERT model.\n\nBoth classes share a consistent interface:\n\n- `embed(text)` returns a NumPy vector.\n- `transform(text)` is an alias for `embed(text)` to integrate with pipelines.\n\n### Word Embeddings\n\n`WordEmbedder` supports two static FastText models:\n\n- **`fasttext-d100`**: A 100-dimensional CBOW model trained on [Persian Wikipedia](https://huggingface.co/datasets/codersan/Persian-Wikipedia-Corpus).\n- **`fasttext-d300`**: A 300-dimensional CBOW model trained on the large-scale [Naab dataset](https://huggingface.co/datasets/SLPL/naab).\n\n\n```python\nfrom shekar.embeddings import WordEmbedder\n\nembedder = WordEmbedder(model=\"fasttext-d100\")\n\nembedding = embedder(\"کتاب\")\nprint(embedding.shape)\n\nsimilar_words = embedder.most_similar(\"کتاب\", top_n=5)\nprint(similar_words)\n```\n\n### Contextual Embeddings\n\n`ContextualEmbedder` uses an ALBERT model trained with Masked Language Modeling (MLM) on the Naab dataset to 
generate high-quality contextual embeddings.\nThe resulting embeddings are 768-dimensional vectors representing the semantic meaning of entire phrases or sentences.\n\n```python\nfrom shekar.embeddings import ContextualEmbedder\n\nembedder = ContextualEmbedder(model=\"albert\")\n\nsentence = \"کتاب‌ها دریچه‌ای به جهان دانش هستند.\"\nembedding = embedder(sentence)\nprint(embedding.shape)  # (768,)\n```\n\n## Stemming\n\nThe `Stemmer` is a lightweight, rule-based reducer for Persian word forms. It trims common suffixes while respecting Persian orthography and Zero Width Non-Joiner usage. The goal is to produce stable stems for search, indexing, and simple text analysis without requiring a full morphological analyzer.\n\n```python\nfrom shekar import Stemmer\n\nstemmer = Stemmer()\n\nprint(stemmer(\"نوه‌ام\"))\nprint(stemmer(\"کتاب‌ها\"))\nprint(stemmer(\"خانه‌هایی\"))\nprint(stemmer(\"خونه‌هامون\"))\n```\n\n```output\nنوه\nکتاب\nخانه\nخانه\n```\n\n## Lemmatization\n\nThe `Lemmatizer` maps Persian words to their base dictionary form. Unlike stemming, which only trims affixes, lemmatization uses explicit verb conjugation rules, vocabulary lookups, and a stemmer fallback to ensure valid lemmas. 
This makes it more accurate for tasks like part-of-speech tagging, text normalization, and linguistic analysis where the canonical form of a word is required.\n\n```python\nfrom shekar import Lemmatizer\n\nlemmatizer = Lemmatizer()\n\n# Lemmatizing verbs\nprint(lemmatizer(\"رفتند\"))\nprint(lemmatizer(\"گفته بوده‌ایم\"))\n\n# Lemmatizing words\nprint(lemmatizer(\"کتاب‌ها\"))\nprint(lemmatizer(\"خانه‌هایی\"))\nprint(lemmatizer(\"خونه‌هامون\"))\n\n# Lemmatizing prefixed verbs\nprint(lemmatizer(\"بر نخواهم گشت\"))\nprint(lemmatizer(\"برنمی‌دارم\"))\n```\n\n```output\nرفت/رو\nگفت/گو\nکتاب\nخانه\nخانه\nبرگشت/برگرد\nبرداشت/بردار\n```\n\n## Part-of-Speech Tagging\n\n[![Notebook](https://img.shields.io/badge/Notebook-Jupyter-00A693.svg)](examples/pos_tagging.ipynb)  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/amirivojdan/shekar/blob/main/examples/pos_tagging.ipynb)\n\nThe `POSTagger` class provides part-of-speech tagging for Persian text using a transformer-based model (default: ALBERT). 
It returns one tag per word based on Universal POS tags (following the Universal Dependencies standard).\n\nExample usage:\n\n```python\nfrom shekar import POSTagger\n\npos_tagger = POSTagger()\ntext = \"نوروز، جشن سال نو ایرانی، بیش از سه هزار سال قدمت دارد و در کشورهای مختلف جشن گرفته می‌شود.\"\n\nresult = pos_tagger(text)\nfor word, tag in result:\n    print(f\"{word}: {tag}\")\n```\n\n```output\nنوروز: PROPN\n،: PUNCT\nجشن: NOUN\nسال: NOUN\nنو: ADJ\nایرانی: ADJ\n،: PUNCT\nبیش: ADJ\nاز: ADP\nسه: NUM\nهزار: NUM\nسال: NOUN\nقدمت: NOUN\nدارد: VERB\nو: CCONJ\nدر: ADP\nکشورهای: NOUN\nمختلف: ADJ\nجشن: NOUN\nگرفته: VERB\nمی‌شود: VERB\n.: PUNCT\n```\n\n## Named Entity Recognition (NER)\n\n[![Notebook](https://img.shields.io/badge/Notebook-Jupyter-00A693.svg)](examples/ner.ipynb)  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/amirivojdan/shekar/blob/main/examples/ner.ipynb)\n\nThe `NER` module offers a fast, quantized Named Entity Recognition pipeline using a fine-tuned ALBERT model. It detects common Persian entities such as persons, locations, organizations, and dates. This model is designed for efficient inference and can be easily combined with other preprocessing steps.\n\nExample usage:\n\n```python\nfrom shekar import NER\nfrom shekar import Normalizer\n\ninput_text = (\n    \"شاهرخ مسکوب به سالِ ۱۳۰۴ در بابل زاده شد و دوره ابتدایی را در تهران و در مدرسه علمیه پشت \"\n    \"مسجد سپهسالار گذراند. از کلاس پنجم ابتدایی مطالعه رمان و آثار ادبی را شروع کرد. از همان زمان \"\n    \"در دبیرستان ادب اصفهان ادامه تحصیل داد. 
پس از پایان تحصیلات دبیرستان در سال ۱۳۲۴ از اصفهان به تهران رفت و \"\n    \"در رشته حقوق دانشگاه تهران مشغول به تحصیل شد.\"\n)\n\nnormalizer = Normalizer()\nnormalized_text = normalizer(input_text)\n\nalbert_ner = NER()\nentities = albert_ner(normalized_text)\n\nfor text, label in entities:\n    print(f\"{text} → {label}\")\n```\n\n```output\nشاهرخ مسکوب → PER\nسال ۱۳۰۴ → DAT\nبابل → LOC\nدوره ابتدایی → DAT\nتهران → LOC\nمدرسه علمیه → LOC\nمسجد سپهسالار → LOC\nدبیرستان ادب اصفهان → LOC\nدر سال ۱۳۲۴ → DAT\nاصفهان → LOC\nتهران → LOC\nدانشگاه تهران → ORG\nفرانسه → LOC\n```\n\n## Classification\n\nThe `classification` module provides high-level text classification utilities for Persian, covering both sentiment analysis and offensive language detection through a unified and consistent interface. Each classifier returns a predicted label along with a confidence score.\n\n### Sentiment Analysis\n\nThe `SentimentClassifier` module enables automatic sentiment analysis of Persian text using transformer-based models. It currently supports the `AlbertBinarySentimentClassifier`, a lightweight ALBERT model fine-tuned on the Snapfood dataset to classify text as **positive** or **negative**, returning both the predicted label and its confidence score.\n\n**Example usage:**\n\n```python\nfrom shekar.classification import SentimentClassifier\n\nsentiment_classifier = SentimentClassifier()\n\nprint(sentiment_classifier(\"سریال قصه‌های مجید عالی بود!\"))\nprint(sentiment_classifier(\"فیلم ۳۰۰ افتضاح بود!\"))\n```\n\n```output\n('positive', 0.9923112988471985)\n('negative', 0.9330866932868958)\n```\n\n### Toxicity Detection\n\nThe `toxicity` module currently includes a Logistic Regression classifier trained on TF-IDF features extracted from the [Naseza (ناسزا) dataset](https://github.com/amirivojdan/naseza), a large-scale collection of Persian text labeled for offensive and neutral language. 
The `OffensiveLanguageClassifier` processes input text to determine whether it is neutral or offensive, returning both the predicted label and its confidence score.\n\n```python\nfrom shekar.classification import OffensiveLanguageClassifier\n\noffensive_classifier = OffensiveLanguageClassifier()\n\nprint(offensive_classifier(\"زبان فارسی میهن من است!\"))\nprint(offensive_classifier(\"تو خیلی احمق و بی‌شرفی!\"))\n```\n\n```output\n('neutral', 0.7651197910308838)\n('offensive', 0.7607775330543518)\n```\n\n## Keyword Extraction\n\n[![Notebook](https://img.shields.io/badge/Notebook-Jupyter-00A693.svg)](examples/keyword_extraction.ipynb)  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/amirivojdan/shekar/blob/main/examples/keyword_extraction.ipynb)\n\nThe **shekar.keyword_extraction** module provides tools for automatically identifying and extracting key terms and phrases from Persian text. These algorithms help identify the most important concepts and topics within documents.\n\n```python\nfrom shekar import KeywordExtractor\n\nextractor = KeywordExtractor(max_length=2, top_n=10)\n\ninput_text = (\n    \"زبان فارسی یکی از زبان‌های مهم منطقه و جهان است که تاریخچه‌ای کهن دارد. \"\n    \"زبان فارسی با داشتن ادبیاتی غنی و شاعرانی برجسته، نقشی بی‌بدیل در گسترش فرهنگ ایرانی ایفا کرده است. \"\n    \"از دوران فردوسی و شاهنامه تا دوران معاصر، زبان فارسی همواره ابزار بیان اندیشه، احساس و هنر بوده است. \"\n)\n\nkeywords = extractor(input_text)\n\nfor kw in keywords:\n    print(kw)\n```\n\n```output\nفرهنگ ایرانی\nگسترش فرهنگ\nایرانی ایفا\nزبان فارسی\nتاریخچه‌ای کهن\n```\n\n## Spell Checking\n\nThe `SpellChecker` class provides simple and effective spelling correction for Persian text. It can automatically detect and fix common errors such as extra characters, spacing mistakes, or misspelled words. 
You can use it directly as a callable on a sentence to clean up the text, or call `suggest()` to get a ranked list of correction candidates for a single word.\n\n```python\nfrom shekar import SpellChecker\n\nspell_checker = SpellChecker()\nprint(spell_checker(\"سسلام بر ششما ددوست من\"))\nprint(spell_checker.suggest(\"درود\"))\n```\n\n```output\nسلام بر شما دوست من\n['درود', 'درصد', 'ورود', 'درد', 'درون']\n```\n\n## WordCloud\n\n[![Notebook](https://img.shields.io/badge/Notebook-Jupyter-00A693.svg)](examples/word_cloud.ipynb)  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/amirivojdan/shekar/blob/main/examples/word_cloud.ipynb)\n\nThe `WordCloud` class provides a convenient interface for generating Persian word clouds with correct shaping, directionality, and typography. It is specifically designed to work with right-to-left Persian text and integrates seamlessly with Shekar’s normalization utilities to produce visually accurate and linguistically correct results.\n\nThe WordCloud functionality depends on visualization libraries that are not installed by default. 
To enable this feature, install Shekar with the optional visualization dependencies:\n\n\u003c!-- termynal --\u003e\n```bash\n$ pip install 'shekar[viz]'\n```\n**Example usage:**\n\n```python\nimport requests\nfrom collections import Counter\n\nfrom shekar.visualization import WordCloud\nfrom shekar import WordTokenizer\nfrom shekar.preprocessing import (\n  HTMLTagRemover,\n  PunctuationRemover,\n  StopWordRemover,\n  NonPersianRemover,\n)\npreprocessing_pipeline = HTMLTagRemover() | PunctuationRemover() | StopWordRemover() | NonPersianRemover()\n\n\nurl = \"https://shahnameh.me/p.php?id=F82F6CED\"\nresponse = requests.get(url)\nhtml_content = response.text\nclean_text = preprocessing_pipeline(html_content)\n\nword_tokenizer = WordTokenizer()\ntokens = word_tokenizer(clean_text)\n\nword_freqs = Counter(tokens)\n\nwordCloud = WordCloud(\n        mask=\"Iran\",\n        width=640,\n        height=480,\n        max_font_size=220,\n        min_font_size=6,\n        bg_color=\"white\",\n        contour_color=\"black\",\n        contour_width=5,\n        color_map=\"greens\",\n    )\n\n# If the words appear disconnected, try again with bidi_reshape=True\nimage = wordCloud.generate(word_freqs, bidi_reshape=False)\nimage.show()\n```\n\n![](https://raw.githubusercontent.com/amirivojdan/shekar/main/assets/wordcloud_example.png)\n\n\n## Download Models\n\nIf Shekar Hub is unavailable, you can manually download the models and place them in the cache directory at `home/[username]/.shekar/`.\n\n| Model Name                | Download Link |\n|----------------------------|---------------|\n| FastText Embedding d100    | [Download](https://drive.google.com/file/d/1qgd0slGA3Ar7A2ShViA3v8UTM4qXIEN6/view?usp=drive_link) (50MB)|\n| FastText Embedding d300    | [Download](https://drive.google.com/file/d/1yeAg5otGpgoeD-3-E_W9ZwLyTvNKTlCa/view?usp=drive_link) (500MB)|\n| SentenceEmbedding    | [Download](https://drive.google.com/file/d/1PftSG2QD2M9qzhAltWk_S38eQLljPUiG/view?usp=drive_link) 
(60MB)|\n| POS Tagger  | [Download](https://drive.google.com/file/d/1d80TJn7moO31nMXT4WEatAaTEUirx2Ju/view?usp=drive_link) (38MB)|\n| NER       | [Download](https://drive.google.com/file/d/1DLoMJt8TWlNnGGbHDWjwNGsD7qzlLHfu/view?usp=drive_link) (38MB)|\n| Sentiment Classifier       | [Download](https://drive.google.com/file/d/17gTip7RwipEkA7Rf3-Cv1W8XNHTdaS4c/view?usp=drive_link) (38MB)|\n| Offensive Language Classifier       | [Download](https://drive.google.com/file/d/1ZLiFI6nzpQ2rYjJTKxOYKTfD9IqHZ5tc/view?usp=drive_link) (8MB)|\n| AlbertTokenizer   | [Download](https://drive.google.com/file/d/1w-oe53F0nPePMcoor5FgXRwRMwkYqDqM/view?usp=drive_link) (2MB)|\n\n-----\n\n## Citation\n\nIf you find **Shekar** useful in your research, please consider citing the following paper:\n\n```\n@article{Amirivojdan_Shekar,\nauthor = {Amirivojdan, Ahmad},\ndoi = {10.21105/joss.09128},\njournal = {Journal of Open Source Software},\nmonth = oct,\nnumber = {114},\npages = {9128},\ntitle = {{Shekar: A Python Toolkit for Persian Natural Language Processing}},\nurl = {https://joss.theoj.org/papers/10.21105/joss.09128},\nvolume = {10},\nyear = {2025}\n}\n```\n\n\u003cp align=\"center\"\u003e\u003cem\u003eWith ❤️ for \u003cstrong\u003eIRAN\u003c/strong\u003e\u003c/em\u003e\u003c/p\u003e\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Famirivojdan%2Fshekar","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Famirivojdan%2Fshekar","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Famirivojdan%2Fshekar/lists"}