{"id":13827006,"url":"https://github.com/Dadmatech/DadmaTools","last_synced_at":"2025-07-09T02:33:05.976Z","repository":{"id":43587866,"uuid":"416312227","full_name":"Dadmatech/DadmaTools","owner":"Dadmatech","description":"DadmaTools is a Persian NLP tools developed by Dadmatech Co.","archived":false,"fork":false,"pushed_at":"2024-10-28T07:28:36.000Z","size":97085,"stargazers_count":184,"open_issues_count":13,"forks_count":40,"subscribers_count":8,"default_branch":"main","last_synced_at":"2024-10-29T17:33:31.569Z","etag":null,"topics":["chunker","constituency-parser","dataset-loader","dependency-parser","embedding-vectors","embeddings","lemmatizer","natural-language-processing","ner","nlptoolkit","persian","persian-nlp","postagger","spacy","tokenizer"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Dadmatech.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-10-12T11:40:00.000Z","updated_at":"2024-10-28T07:28:06.000Z","dependencies_parsed_at":"2023-11-23T13:38:16.742Z","dependency_job_id":"4390c45d-e72e-4623-bcc4-bdbc30ada5d9","html_url":"https://github.com/Dadmatech/DadmaTools","commit_stats":{"total_commits":229,"total_committers":13,"mean_commits":"17.615384615384617","dds":0.6986899563318778,"last_synced_commit":"e902d78b53c84d5f4b8aaffee7992e11842182bb"},"previous_names":[],"tags_count":13,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Dadmatech%2FDadmaTools","tags_url":"https://repos.e
cosyste.ms/api/v1/hosts/GitHub/repositories/Dadmatech%2FDadmaTools/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Dadmatech%2FDadmaTools/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Dadmatech%2FDadmaTools/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Dadmatech","download_url":"https://codeload.github.com/Dadmatech/DadmaTools/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":225481151,"owners_count":17481159,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chunker","constituency-parser","dataset-loader","dependency-parser","embedding-vectors","embeddings","lemmatizer","natural-language-processing","ner","nlptoolkit","persian","persian-nlp","postagger","spacy","tokenizer"],"created_at":"2024-08-04T09:01:48.301Z","updated_at":"2025-07-09T02:33:05.956Z","avatar_url":"https://github.com/Dadmatech.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"\u003c!-- \u003ch1 align=\"center\"\u003e\n  \u003cimg src=\"images/dadmatech.jpeg\"  width=\"150\"  /\u003e\n   Dadmatools\n\u003c/h1\u003e --\u003e\n\n\n\u003ch2 align=\"center\"\u003eDadmaTools: A Python NLP Library for Persian\u003c/h2\u003e\n\n\u003cdiv align=\"center\"\u003e\n  \u003ca href=\"https://pypi.org/project/dadmatools/\"\u003e\u003cimg src=\"https://img.shields.io/pypi/v/dadmatools.svg\"\u003e\u003c/a\u003e\n  \u003ca href=\"\"\u003e\u003cimg 
src=\"https://img.shields.io/badge/license-Apache%202-blue.svg\"\u003e\u003c/a\u003e\n  \u003ca href='https://dadmatools.readthedocs.io/en/latest/'\u003e\u003cimg src='https://readthedocs.org/projects/danlp-alexandra/badge/?version=latest' alt='Documentation Status' /\u003e\u003c/a\u003e\n\u003c/div\u003e\n\u003cdiv align=\"center\"\u003e\n  \u003ch5\u003e\n      Named Entity Recognition\n    \u003cspan\u003e | \u003c/span\u003e\n      Part of Speech Tagging\n    \u003cspan\u003e | \u003c/span\u003e\n      Dependency Parsing\n    \u003cspan\u003e | \u003c/span\u003e\n      Informal To Formal\n  \u003c/h5\u003e\n  \u003ch5\u003e\n      Constituency Parsing\n    \u003cspan\u003e | \u003c/span\u003e\n      Chunking\n    \u003cspan\u003e | \u003c/span\u003e\n      Kasreh Ezafe Detection\n  \u003c/h5\u003e\n  \u003ch5\u003e\n      Spellchecker\n    \u003cspan\u003e | \u003c/span\u003e\n       Normalizer\n    \u003cspan\u003e | \u003c/span\u003e\n      Tokenizer\n    \u003cspan\u003e | \u003c/span\u003e\n      Lemmatizer\n    \u003cspan\u003e | \u003c/span\u003e\n      Sentiment Analysis\n  \u003c/h5\u003e\n  \u003ch5\u003e\n  \u003c/h5\u003e\n\u003c/div\u003e\n\n\n# **DadmaTools**\nDadmaTools is a repository for Natural Language Processing resources for the Persian Language. The aim is to make it easier and more applicable to practitioners in the industry to use Persian NLP, and hence this project is licensed to allow commercial use. The project features code examples on how to use the models in popular NLP frameworks such as spaCy and Transformers, as well as Deep Learning frameworks such as PyTorch. 
Furthermore, DadmaTools supports common Persian embeddings and Persian datasets.\nFor more details on how to use the toolkit, read the instructions below.\n\nContents:\n- [Installation](#installation)\n- [NLP Models](#nlp-models)\n  - [Normalizer](#normalizer)\n  - [Pipeline (tok,lem,dep,pos,cons,chunk,kasreh,spellchecker)](#pipeline)\n- [Datasets](#loading-persian-nlp-datasets)\n- [Embeddings](#loading-persian-word-embeddings)\n- [Evaluation](#evaluation)\n- [How to use in Colab](#how-to-use)\n- [Cite us](#cite)\n\n## Installation\n\nTo get started using DadmaTools, install the project with pip:\n\n- **Full Version**  \n  Includes all features, including transformers and trainable modules:  \n  ```bash\n  pip install dadmatools\n  ```\n\n- **Light Version**  \n  For users who prefer only datasets and non-trainable modules without transformers:  \n  ```bash\n  pip install dadmatools[light]\n  ```\n\n\n\n### Install from GitHub\nAlternatively, you can install the latest version from GitHub using:\n```bash\npip install git+https://github.com/Dadmatech/dadmatools.git\n```\n\n\n## NLP Models\n\nNatural Language Processing is an active area of research, and it consists of many different tasks. \nThe DadmaTools repository provides an overview of Persian models for some of the most basic NLP tasks (and is continuously evolving). \n\nHere is the list of NLP tasks we currently cover in the repository. These NLP tasks are defined as pipelines. Therefore, a pipeline list must be created and passed to the model. This allows the user to load only the tasks needed, without loading the others. 
\nEach task has its own abbreviation, as follows:\n-  Named Entity Recognition: ```ner```\n-  Part-of-speech tagging: ```pos```\n-  Dependency parsing: ```dep```\n-  Constituency parsing: ```cons```\n-  Kasreh Ezafe Detection: ```kasreh```\n-  Chunking: ```chunk```\n-  Lemmatizing: ```lem```\n-  Tokenizing: ```tok```\n-  Spellchecker: ```spellchecker```\n-  Normalizing\n-  Informal to formal: ```itf```\n-  Sentiment analysis: ```sent```\n\n**Note** that the normalizer can also be used outside of the pipeline, as it has several configs (the default config is available in the pipeline under the name def-norm).\n**Note** that if no pipeline is passed to the model, the tokenizer will be loaded by default.\n\n\u003c!--### Use Case --\u003e\n\n\u003c!-- These NLP tasks are defined as pipelines. Therefore, a pipeline list must be created and passed through the model. This will allow the user to choose the only task needed without loading others. \nEach task has its abbreviation as following:\n-  ```ner```: Named entity recognition\n-  ```pos```: Part of speech tagging\n-  ```dep```: Dependency parsing\n-  ```cons```: Constituency parsing\n-  ```chunk```: Chunking\n-  ```kasreh```: Kasreh Ezafe Detection\n-  ```spellchecker```: SpellChecker\n-  ```lem```: Lemmatizing\n-  ```tok```: Tokenizing\n-  ```itf```: informal to formal\n-  ```sent```: Sentiment analysis\n\nNote that the normalizer can be used outside of the pipeline as there are several configs.\nNote that if no pipeline is passed to the model the tokenizer will be load as default. --\u003e\n\n### Normalizer\nCleans text and unifies characters.\n\nNote: an option set to None means no action is taken. 
\n```python\nfrom dadmatools.normalizer import Normalizer\n\nnormalizer = Normalizer(\n    full_cleaning=False,\n    unify_chars=True,\n    refine_punc_spacing=True,\n    remove_extra_space=True,\n    remove_puncs=False,\n    remove_html=False,\n    remove_stop_word=False,\n    replace_email_with=\"\u003cEMAIL\u003e\",\n    replace_number_with=None,\n    replace_url_with=\"\",\n    replace_mobile_number_with=None,\n    replace_emoji_with=None,\n    replace_home_number_with=None\n)\n\ntext = \"\"\"\n\u003cp\u003e\nدادماتولز اولین نسخش سال ۱۴۰۰ منتشر شده. \nامیدواریم که این تولز بتونه کار با متن رو براتون شیرین‌تر و راحت‌تر کنه\nلطفا با ایمیل dadmatools@dadmatech.ir با ما در ارتباط باشید\nآدرس گیت‌هاب هم که خب معرف حضور مبارک هست:\n https://github.com/Dadmatech/DadmaTools\n\u003c/p\u003e\n\"\"\"\nnormalized_text = normalizer.normalize(text)\n# \u003cp\u003e دادماتولز اولین نسخش سال 1400 منتشر شده. امیدواریم که این تولز بتونه کار با متن رو براتون شیرین‌تر و راحت‌تر کنه لطفا با ایمیل \u003cEMAIL\u003e با ما در ارتباط باشید آدرس گیت‌هاب هم که خب معرف حضور مبارک هست: \u003c/p\u003e\n\n# full cleaning\nnormalizer = Normalizer(full_cleaning=True)\nnormalized_text = normalizer.normalize(text)\n# دادماتولز نسخش سال منتشر تولز بتونه کار متن براتون شیرین‌تر راحت‌تر کنه ایمیل ارتباط آدرس گیت‌هاب معرف حضور مبارک\n\n```\n\n### Pipeline\nContains the Tokenizer, Lemmatizer, POS Tagger, Dependency Parser, Constituency Parser, Kasreh Ezafe Detection, Spellchecker, Informal to Formal, and Named Entity Recognition.\n\n```python\nimport dadmatools.pipeline.language as language\n\n# every component listed in the pipeline string is loaded\n# the tokenizer is the default tool, so it is loaded even if it is not listed\npips = 'tok,lem,pos,dep,chunk,cons,spellchecker,kasreh,itf,ner,sent'\nnlp = language.Pipeline(pips)\n# doc is a spaCy object\ndoc = nlp('کشور بزرگ ایران توانسته در طی سال‌ها اغشار مختلفی از قومیت‌های گوناگون رو به خوبی تو خودش  جا بده')\n\n```\nThe [```doc```](https://spacy.io/api/doc) object has several extensions. First, ```doc``` contains ```sentences```, which is a list of lists of [```Token```](https://spacy.io/api/token) objects. Each [```Token```](https://spacy.io/api/token) also has its own extensions. Note that DadmaTools defines its own extensions as well. If the pipeline component that populates an extension is not loaded, that extension will have no value.\n\n
To inspect the results, you can use this code:\n\n```python\nprint(doc)\n```\n\n```python\n{'spellchecker': {'orginal': 'کشور بزرگ ایران توانسته در طی سال\\u200cها اغشار مختلفی از قومیت\\u200cهای گوناگون رو به خوبی تو خودش  جا بده', 'corrected': 'کشور بزرگ ایران توانسته در طی سال\\u200cها اقشار مختلفی از قومیت\\u200cهای گوناگون رو به خوبی تو خودش جا بده', 'checked_words': [('اغشار', 'اقشار')]}, 'itf': ' کشور بزرگ ایران توانسته در طی سال\\u200cها اغشار مختلفی از قومیت های گوناگون را به خوبی در خودش جا بده', 'sentences': [{'id': 1, 'tokens': [{'id': 1, 'text': 'کشور', 'upos': 'NOUN', 'xpos': 'N_PL', 'feats': 'Number=Plur|Person=2|Polarity=Neg|Tense=Pres', 'head': 14, 'deprel': 'nsubj', 'lemma': 'کشور', 'ner': 'O', 'kasreh': 'S-kasreh'}, {'id': 2, 'text': 'بزرگ', 'upos': 'ADJ', 'xpos': 'ADJ', 'feats': 'Degree=Pos', 'head': 8, 'deprel': 'amod', 'lemma': 'بزرگ', 'ner': 'O', 'kasreh': 'S-kasreh'}, {'id': 3, 'text': 'ایران', 'upos': 'SCONJ', 'xpos': 'CON', 'feats': 'Number=Plur|Person=3|PronType=Prs', 'head': 2, 'deprel': 'nmod:poss', 'lemma': 'ایران', 'ner': 'S-loc', 'kasreh': 'O'}, {'id': 4, 'text': 'توانسته', 'upos': 'VERB', 'xpos': 'V_PP', 'feats': 'Number=Sing|Person=3|VerbForm=Part', 'head': 14, 'deprel': 'aux', 'lemma': 'توانست#توان', 'ner': 'O', 'kasreh': 'O'}, {'id': 5, 'text': 'در', 'upos': 'ADP', 'xpos': 'P', 'head': 14, 'deprel': 'case', 'lemma': 'در', 'ner': 'O', 'kasreh': 'O'}, {'id': 6, 'text': 'طی', 'upos': 'ADP', 'xpos': 'P', 'head': 5, 'deprel': 'fixed', 'lemma': 'طی', 'ner': 'O', 'kasreh': 'S-kasreh'}, {'id': 7, 'text': 'سال\\u200cها', 'upos': 'AUX', 'xpos': 'V_PRS', 'feats': 
'Number=Sing|Person=3|Tense=Pres', 'head': 14, 'deprel': 'fixed', 'lemma': 'سال', 'ner': 'O', 'kasreh': 'O'}, {'id': 8, 'text': 'اغشار', 'upos': 'NOUN', 'xpos': 'N_PL', 'feats': 'Number=Plur', 'head': 19, 'deprel': 'nsubj', 'lemma': 'اغشار', 'ner': 'O', 'kasreh': 'S-kasreh'}, {'id': 9, 'text': 'مختلفی', 'upos': 'ADJ', 'xpos': 'ADJ', 'feats': 'Degree=Pos', 'head': 8, 'deprel': 'amod', 'lemma': 'مختلفی', 'ner': 'O', 'kasreh': 'O'}, {'id': 10, 'text': 'از', 'upos': 'ADP', 'xpos': 'P', 'head': 15, 'deprel': 'case', 'lemma': 'از', 'ner': 'O', 'kasreh': 'O'}, {'id': 11, 'text': 'قومیت\\u200cهای', 'upos': 'NOUN', 'xpos': 'N_PL', 'feats': 'Number=Plur', 'head': 8, 'deprel': 'nmod:poss', 'lemma': 'قومیت', 'ner': 'O', 'kasreh': 'S-kasreh'}, {'id': 12, 'text': 'گوناگون', 'upos': 'ADJ', 'xpos': 'ADJ', 'feats': 'Degree=Pos', 'head': 11, 'deprel': 'amod', 'lemma': 'گوناگون', 'ner': 'O', 'kasreh': 'O'}, {'id': 13, 'text': 'رو', 'upos': 'PART', 'xpos': 'CLITIC', 'head': 8, 'deprel': 'case', 'lemma': 'رو', 'ner': 'O', 'kasreh': 'O'}, {'id': 14, 'text': 'به', 'upos': 'ADP', 'xpos': 'P', 'head': 19, 'deprel': 'case', 'lemma': 'به', 'ner': 'O', 'kasreh': 'O'}, {'id': 15, 'text': 'خوبی', 'upos': 'ADJ', 'xpos': 'ADJ', 'feats': 'Degree=Pos', 'head': 14, 'deprel': 'advcl', 'lemma': 'خوب', 'ner': 'O', 'kasreh': 'O'}, {'id': 16, 'text': 'تو', 'upos': 'ADP', 'xpos': 'P', 'feats': 'Number=Sing|Person=2|PronType=Prs', 'head': 19, 'deprel': 'case', 'lemma': 'تو', 'ner': 'O', 'kasreh': 'O'}, {'id': 17, 'text': 'خودش', 'upos': 'PRON', 'xpos': 'PRO', 'feats': 'Number=Sing|Person=3|PronType=Prs|Reflex=Yes', 'head': 19, 'deprel': 'obl', 'lemma': 'خودش', 'ner': 'O', 'kasreh': 'O'}, {'id': 18, 'text': 'جا', 'upos': 'VERB', 'xpos': 'PREV', 'feats': 'Number=Sing|Person=3|Tense=Pres', 'head': 19, 'deprel': 'compound:lvc', 'lemma': 'جا', 'ner': 'O', 'kasreh': 'O'}, {'id': 19, 'text': 'بده', 'upos': 'VERB', 'xpos': 'V_SUB', 'feats': 'Mood=Sub', 'head': 0, 'deprel': 'root', 'lemma': 'داد#ده', 'ner': 'O', 
'kasreh': 'O'}]}], 'lang': 'persian', 'sentiment': [{'label': 'positive', 'score': 0.7366364598274231}]}\n```\n\n\n## Loading Persian NLP Datasets\nWe provide an easy-to-use way to load some popular Persian NLP datasets.\n\nHere is the list of supported datasets.\n\n| Dataset | Task |\n| :----------------: | :----------------: |\n| PersianNER | Named Entity Recognition |\n| ARMAN | Named Entity Recognition |\n| Peyma | Named Entity Recognition |\n| FarsTail | Textual Entailment |\n| FaSpell | Spell Checking |\n| PersianNews | Text Classification |\n| PerUDT | Universal Dependency |\n| PnSummary | Text Summarization |\n| SnappfoodSentiment | Sentiment Classification |\n| TEP | Text Translation (eng-fa) |\n| WikipediaCorpus | Corpus |\n| PersianTweets | Corpus |\n\n\nAll datasets are iterators and can be used as shown below:\n```python\nfrom dadmatools.datasets import FarsTail\nfrom dadmatools.datasets import SnappfoodSentiment\nfrom dadmatools.datasets import Peyma\nfrom dadmatools.datasets import PerUDT\nfrom dadmatools.datasets import PersianTweets\nfrom dadmatools.datasets import PnSummary\n\n\nfarstail = FarsTail()\n# length of the dataset\nprint(len(farstail.train))\n\n# use it like a generator\nprint(next(farstail.train))\n\n# dataset details\npn_summary = PnSummary()\nprint('PnSummary dataset information: ', pn_summary.info)\n\n# loop over the dataset\nsnpfood_sa = SnappfoodSentiment()\nfor i, item in enumerate(snpfood_sa.test):\n    print(item['comment'], item['label'])\n\n# get the lemma of the first token of every dev item\nperudt = PerUDT()\nfor token_list in perudt.dev:\n    print(token_list[0]['lemma'])\n\n# get the NER tag of the first token of the first Peyma item\npeyma = Peyma()\nprint(next(peyma.data)[0]['tag'])\n\n# corpus\ntweets = PersianTweets()\nprint('tweets count : ', 
len(tweets.data))\nprint('sample tweet: ', next(tweets.data))\n```\nTo get dataset information:\n```python\n\nfrom dadmatools.datasets import get_all_datasets_info\n\nget_all_datasets_info().keys()\n#dict_keys(['Persian-NEWS', 'fa-wiki', 'faspell', 'PnSummary', 'TEP', 'PerUDT', 'FarsTail', 'Peyma', 'snappfoodSentiment', 'Persian-NER', 'Arman', 'PerSent'])\n\n# filter by task\nget_all_datasets_info(tasks=['NER', 'Sentiment-Analysis'])\n```\nThe output will be:\n\n```json\n{\"ARMAN\": {\"description\": \"ARMAN dataset holds 7,682 sentences with 250,015 sentences tagged over six different classes.\\n\\nOrganization\\nLocation\\nFacility\\nEvent\\nProduct\\nPerson\",\n  \"filenames\": [\"train_fold1.txt\",\n   \"train_fold2.txt\",\n   \"train_fold3.txt\",\n   \"test_fold1.txt\",\n   \"test_fold2.txt\",\n   \"test_fold3.txt\"],\n  \"name\": \"ARMAN\",\n  \"size\": {\"test\": 7680, \"train\": 15361},\n  \"splits\": [\"train\", \"test\"],\n  \"task\": \"NER\",\n  \"version\": \"1.0.0\"},\n \"PersianNer\": {\"description\": \"source: https://github.com/Text-Mining/Persian-NER\",\n  \"filenames\": [\"Persian-NER-part1.txt\",\n   \"Persian-NER-part2.txt\",\n   \"Persian-NER-part3.txt\",\n   \"Persian-NER-part4.txt\",\n   \"Persian-NER-part5.txt\"],\n  \"name\": \"PersianNer\",\n  \"size\": 976599,\n  \"splits\": [],\n  \"task\": \"NER\",\n  \"version\": \"1.0.0\"},\n \"Peyma\": {\"description\": \"source: http://nsurl.org/2019-2/tasks/task-7-named-entity-recognition-ner-for-farsi/\",\n  \"filenames\": [\"peyma/600K\", \"peyma/300K\"],\n  \"name\": \"Peyma\",\n  \"size\": 10016,\n  \"splits\": [],\n  \"task\": \"NER\",\n  \"version\": \"1.0.0\"},\n \"snappfoodSentiment\": {\"description\": \"source: https://huggingface.co/HooshvareLab/bert-fa-base-uncased-sentiment-snappfood\",\n  \"filenames\": [\"snappfood/train.csv\",\n   \"snappfood/test.csv\",\n   \"snappfood/dev.csv\"],\n  \"name\": \"snappfoodSentiment\",\n  \"size\": {\"dev\": 6274, \"test\": 6972, \"train\": 56516},\n  
\"splits\": [\"train\", \"test\", \"dev\"],\n  \"task\": \"Sentiment-Analysis\",\n  \"version\": \"1.0.0\"}}\n```\n\n\n## Loading Persian Word Embeddings\nTo start using embeddings, please install fasttext:\n\n`pip install fasttext`\n\nYou can then download, load, and use several pre-trained Persian word embeddings.\n\nDadmaTools supports the glove, fasttext, and word2vec formats.\n```python\nfrom dadmatools.embeddings import get_embedding, get_all_embeddings_info, get_embedding_info\nfrom pprint import pprint\n\npprint(get_all_embeddings_info())\n\n# get information about a specific embedding\nembedding_info = get_embedding_info('glove-wiki')\n\n#### load embedding ####\nword_embedding = get_embedding('glove-wiki')\n\n# get the vector of a word\nprint(word_embedding['سلام'])\n\n# vocabulary\nvocab = word_embedding.get_vocab()\n\n### some useful functions ###\nprint(word_embedding.top_nearest(\"زمستان\", 10))\nprint(word_embedding.similarity('کتب', 'کتاب'))\nprint(word_embedding.embedding_text('امروز هوای خوبی بود'))\n```\nThe following word embeddings are currently supported: \n\n| Name | Embedding Algorithm | Corpus | \n| :-------------: | :-------------:  | :-------------:  | \n| [`glove-wiki`](https://github.com/Text-Mining/Persian-Wikipedia-Corpus/tree/master/models/glove)  | glove | Wikipedia  |\n| [`fasttext-commoncrawl-bin`](https://fasttext.cc/docs/en/crawl-vectors.html) | fasttext | CommonCrawl |\n| [`fasttext-commoncrawl-vec`](https://fasttext.cc/docs/en/crawl-vectors.html) | fasttext | CommonCrawl |\n| [`word2vec-conll`](http://vectors.nlpl.eu/) | word2vec | Persian CoNLL17 corpus  |\n\n## Evaluation\nWe have compared our POS tagging, dependency parsing, and lemmatization models to `stanza` and `hazm`.\n\n\u003ctable\u003e\n  \u003ctr align='center'\u003e\n    \u003ctd colspan=\"4\"\u003e\u003cb\u003ePerDT (F1 score)\u003c/b\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr align='center'\u003e\n    \u003ctd\u003e\u003cb\u003eToolkit\u003c/b\u003e\u003c/td\u003e\n    
\u003ctd\u003e\u003cb\u003ePOS Tagger (UPOS)\u003c/b\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cb\u003eDependency Parser (UAS/LAS)\u003c/b\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cb\u003eLemmatizer\u003c/b\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr align='center'\u003e\n    \u003ctd\u003eDadmaTools\u003c/td\u003e\n    \u003ctd\u003e\u003cb\u003e97.52%\u003c/b\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cb\u003e95.36%\u003c/b\u003e  /  \u003cb\u003e92.54%\u003c/b\u003e \u003c/td\u003e\n    \u003ctd\u003e\u003cb\u003e99.14%\u003c/b\u003e \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr align='center'\u003e\n    \u003ctd\u003estanza\u003c/td\u003e\n    \u003ctd\u003e97.35%\u003c/td\u003e\n    \u003ctd\u003e93.34%  /  91.05% \u003c/td\u003e\n    \u003ctd\u003e98.97% \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr align='center'\u003e\n    \u003ctd\u003ehazm\u003c/td\u003e\n    \u003ctd\u003e-\u003c/td\u003e\n    \u003ctd\u003e- \u003c/td\u003e\n    \u003ctd\u003e89.01% \u003c/td\u003e\n  \u003c/tr\u003e\n\n\n  \u003ctr align='center'\u003e\n    \u003ctd colspan=\"4\"\u003e\u003cb\u003eSeraji (F1 score)\u003c/b\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr align='center'\u003e\n    \u003ctd\u003e\u003cb\u003eToolkit\u003c/b\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cb\u003ePOS Tagger (UPOS)\u003c/b\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cb\u003eDependency Parser (UAS/LAS)\u003c/b\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cb\u003eLemmatizer\u003c/b\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr align='center'\u003e\n    \u003ctd\u003eDadmaTools\u003c/td\u003e\n    \u003ctd\u003e\u003cb\u003e97.83%\u003c/b\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cb\u003e92.5%\u003c/b\u003e  /  \u003cb\u003e89.23%\u003c/b\u003e \u003c/td\u003e\n    \u003ctd\u003e - \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr align='center'\u003e\n    \u003ctd\u003estanza\u003c/td\u003e\n    \u003ctd\u003e97.43%\u003c/td\u003e\n    \u003ctd\u003e87.20% /  83.89% 
\u003c/td\u003e\n    \u003ctd\u003e - \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr align='center'\u003e\n    \u003ctd\u003ehazm\u003c/td\u003e\n    \u003ctd\u003e-\u003c/td\u003e\n    \u003ctd\u003e- \u003c/td\u003e\n    \u003ctd\u003e86.93% \u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\n\n\u003ctable\u003e\n  \u003ctr align='center'\u003e\n    \u003ctd colspan=\"2\"\u003e\u003cb\u003eTehran university tree bank (F1 score)\u003c/b\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr align='center'\u003e\n    \u003ctd\u003e\u003cb\u003eToolkit\u003c/b\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cb\u003eConstituency Parser\u003c/b\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr align='center'\u003e\n    \u003ctd\u003eDadmaTools (without preprocessing)\u003c/td\u003e\n    \u003ctd\u003e\u003cb\u003e82.88%\u003c/b\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr align='center'\u003e\n    \u003ctd\u003eStanford (with some preprocessing of POS tags)\u003c/td\u003e\n    \u003ctd\u003e80.28\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\n\n## How to use\nYou can see the code and its output in Colab.\n\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1BEhOA9Ju0ZyY81MAM_IT9ADz1MFeXF7S?usp=sharing)\n\n\n## Cite\n```\n@inproceedings{jafari2025dadmatools,\n  title={DadmaTools V2: an Adapter-Based Natural Language Processing Toolkit for the Persian Language},\n  author={Jafari, Sadegh and Farsi, Farhan and Ebrahimi, Navid and Sajadi, Mohamad Bagher and Eetemadi, Sauleh},\n  booktitle={Proceedings of the 1st Workshop on NLP for Languages Using Arabic Script},\n  pages={37--43},\n  year={2025}\n}\n``` \n\n\u003c!-- Read the paper here.  
--\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FDadmatech%2FDadmaTools","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FDadmatech%2FDadmaTools","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FDadmatech%2FDadmaTools/lists"}