{"id":17718564,"url":"https://github.com/linuxscout/arabicstopwords","last_synced_at":"2025-04-26T18:11:06.424Z","repository":{"id":138754852,"uuid":"77249070","full_name":"linuxscout/arabicstopwords","owner":"linuxscout","description":"Arabic Stop Word List","archived":false,"fork":false,"pushed_at":"2024-01-11T20:29:05.000Z","size":8633,"stargazers_count":34,"open_issues_count":3,"forks_count":9,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-04-25T04:04:30.898Z","etag":null,"topics":["arabic-nlp","language","nlp"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/linuxscout.png","metadata":{"files":{"readme":"README.md","changelog":"ChangeLog","contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":"AUTHORS","dei":null,"publiccode":null,"codemeta":null},"funding":{"patreon":"linuxscout"}},"created_at":"2016-12-23T20:28:01.000Z","updated_at":"2025-01-24T22:55:17.000Z","dependencies_parsed_at":null,"dependency_job_id":"0b7acd15-03fd-4ade-9245-c1b084e6cecd","html_url":"https://github.com/linuxscout/arabicstopwords","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linuxscout%2Farabicstopwords","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linuxscout%2Farabicstopwords/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linuxscout%2Farabicstopwords/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linuxscout%2Farabicstopwords/manifests","owner_url":"https://r
epos.ecosyste.ms/api/v1/hosts/GitHub/owners/linuxscout","download_url":"https://codeload.github.com/linuxscout/arabicstopwords/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250847255,"owners_count":21497152,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["arabic-nlp","language","nlp"],"created_at":"2024-10-25T14:54:52.716Z","updated_at":"2025-04-25T15:43:45.523Z","avatar_url":"https://github.com/linuxscout.png","language":"Python","funding_links":["https://patreon.com/linuxscout"],"categories":[],"sub_categories":[],"readme":"# Arabic Stop words\n![Arabic Stop words logo](doc/arabicStopWordsheader.png \"Arabic Stop Words logo\")\n\n![PyPI - Downloads](https://img.shields.io/pypi/dm/Arabic-Stopwords)\n\n  Developers:  Taha Zerrouki: http://tahadz.com\n    taha dot zerrouki at gmail dot com\n    \n\nFeatures |   value\n---------|---------------------------------------------------------------------------------\nAuthors  | [Authors.md](https://github.com/linuxscout/arabicstopwords/main/AUTHORS.md)\nRelease  | 0.9\nLicense  |[GPL](https://github.com/linuxscout/arabicstopwords/main/LICENSE)\nTracker  |[linuxscout/arabicstopwords/Issues](https://github.com/linuxscout/arabicstopwords/issues)\nSource  |[Github](http://github.com/linuxscout/arabicstopwords)\nWebsite  |[ArabicStopwords on SourceForge](https://arabicstopwords.sf.net)\nDoc  |[package Documentation](https://arabicstopwords.readthedocs.io/)\nDownload  |[Python Library](https://pypi.python.org/pypi/https://pypi.org/project/Arabic-Stopwords/)\nDownload  | 
Data set [CSV/SQL/Python](https://github.com/linuxscout/arabicstopwords/releases/latest)\nFeedback |[Comments](https://github.com/linuxscout/arabicstopwords/)\nAccounts  |[@Twitter](https://twitter.com/linuxscout)\nCitation |[T. Zerrouki, Arabic Stop Words](#Citation)\n\n## Description\n\nIt is not easy to determine which words are stop words, and the appropriate list differs from one use case to another.\nFor this purpose, we propose a classified list that developers can parameterize.\n\nThe word list contains only words in their common forms; all other forms are generated by a script.\n\nIt can also be used as a library (see the [arabicstopwords](#Arabic-Stop-words-Library) library section).\n\n## Files\n\n* data-source/ : contains the source data of the stopwords\n* data-source/classified/stopwords.ods: data in LibreOffice format with more detailed information and classified stopwords\n* releases/latest: CSV/SQL/Python formats:\n  * Classified stop words (lemmas)\n  * Inflected forms\n  * Corpus-based lists\n\n* docs: documentation files\n* scripts: scripts used to generate all forms and file formats\n\n## Data\nThis project contains two parts:\n- Data part, which contains the classified stopwords and all generated forms in multiple formats\n  - CSV\n  - Python\n  - SQL / SQLite\n  - additional lists of the most frequent words in corpora (such as Wikipedia and the Tashkeela Corpus)\n- Python library for handling stopwords.\n\n### Data Structure\nTwo formats of data are given:\n- Classified words (lemmas) with features used to generate inflected forms\n- Forms generated from the lemmas by adding affixes.\n\n![Stopwords Example](doc/images/stopwords.png  \"Stopwords Example\")\n    \nMinimal classified data (.ODS/CSV file)\n- 1st field : unvocalised word (في)\n- 2nd field : type of the word: e.g. حرف\n- 3rd field : class of word : e.g. 
preposition\n\nAffixation information in the other fields:\n-    4th field : AIN (in Arabic script) if the word accepts a conjunction 'العطف', '*' otherwise\n-    5th field : TEH (in Arabic script) if the word accepts the definite article 'ال التعريف', '*' otherwise\n-    6th field : JEEM (in Arabic script) if the word accepts attached prepositions 'حروف الجر المتصلة', '*' otherwise\n-    7th field : DAD (in Arabic script) if the word accepts IDAFA attached pronouns 'الضمائر المتصلة', '*' otherwise\n-    8th field : SAD (in Arabic script) if the word accepts verb conjugation 'التصريف', '*' otherwise\n-    9th field : LAM (in Arabic script) if the word accepts LAM QASAM 'لام القسم', '*' otherwise\n-    10th field : MEEM (in Arabic script) if the word has ALEF LAM as its definite article 'معرف', '*' otherwise\n\n\nAll forms data CSV file\n- 1st field : unvocalised word (بأنك)\n- 2nd field : vocalised inflected word, e.g. ف-ب-خمسين-ي\n- 3rd field:  word type (super class): noun, verb, tool (حرف)\n- 4th field:  word type (sub class): إنّ وأخواتها \n- 5th field:  original or lemma: إن\n- 6th field:  procletic : ب\n- 7th field:  stem : أن\n- 8th field:  encletic: ك\n- 9th field:  tags: جر:مضاف\n\n\n\n```csv\nword    vocalized   type    category    original    procletic   stem    encletic    tags\nبأنك    بِأَنّكَ    حرف إن و أخواتها    أن  ب-      -ك  جر:مضاف\nبأنكما  بِأَنّكُمَا حرف إن و أخواتها    أن  ب-      -كما    جر:مضاف\n```\n## How to customize stop word list\n\n* check the minimal form data file (stopwords.csv)\n* comment out with \"#\" all words that you don't need\n* run \n```\nmake\n```\n* collect the script output from the releases folder.\n\n\n## How to update data\n\n* check that the word does not already exist in the minimal form data file (classified/stopwords.ods)\n* add affixation information\n* run \n```\nmake\n```\n* collect the script output from the releases folder.\n\n## Arabic Stop words Library\n### Install\n``` shell\npip install arabicstopwords\n```\n### Usage\n* test if a word is a stop word\n``` python\n\u003e\u003e\u003e import 
arabicstopwords.arabicstopwords as stp\n\u003e\u003e\u003e # test if a word is a stop word\n... stp.is_stop(u'ممكن')\nFalse\n\u003e\u003e\u003e stp.is_stop(u'منكم')\nTrue\n```\n\n* stem a stopword\n```python\n\u003e\u003e\u003e word = u\"لعلهم\"\n\u003e\u003e\u003e stp.stop_stem(word)\nu'لعل'\n```\n* list all stop words\n```python\n\u003e\u003e\u003e stp.stopwords_list()\n......\n\u003e\u003e\u003e len(stp.stopwords_list())\n13629\n\u003e\u003e\u003e len(stp.classed_stopwords_list())\n507\n```\n* give all forms of a stopword\n```python\n\u003e\u003e\u003e stp.stopword_forms(u\"على\")\n....\n\u003e\u003e\u003e len(stp.stopword_forms(u\"على\"))\n144\n```\n\n\n* get a stopword as a list of dictionaries\n``` python\n\u003e\u003e\u003e from arabicstopwords.stopwords_lexicon import stopwords_lexicon \n\u003e\u003e\u003e lexicon = stopwords_lexicon()\n\u003e\u003e\u003e # test if a word is a stop word\n... lexicon.is_stop(u'ممكن')\nFalse\n\u003e\u003e\u003e lexicon.is_stop(u'منكم')\nTrue\n\u003e\u003e\u003e lexicon.get_features_dict(u'منكم')\n[{'vocalized': 'منكم', 'procletic': '', 'tags': 'حرف;حرف جر;ضمير', 'stem': 'من', 'type': 'حرف', 'original': 'من', 'encletic': '-كم'}]\n```\n\n* get a stopword as a tuple\n``` python\n\u003e\u003e\u003e from arabicstopwords.stopwords_lexicon import stopwords_lexicon \n\u003e\u003e\u003e lexicon = stopwords_lexicon()\n\u003e\u003e\u003e tuples = lexicon.get_stopwordtuples(u'منكم')\n\u003e\u003e\u003e tuples\n[\u003cstopwordtuple.stopwordTuple object at 0x7fd93b3d12b0\u003e]\n\u003e\u003e\u003e for tup in tuples:\n...     print(tup)\n... \n{'vocalized': 'منكم', 'procletic': '', 'tags': 'حرف;حرف جر;ضمير', 'stem': 'من', 'type': 'حرف', 'original': 'من', 'encletic': '-كم'}\n\u003e\u003e\u003e for tup in tuples:\n...     dir(tup)\n... 
\n['accept_conjuction', 'accept_conjugation', 'accept_definition', 'accept_inflection', 'accept_interrog', 'accept_preposition', 'accept_pronoun', 'accept_qasam', 'accept_tanwin', 'get_action', 'get_enclitic', 'get_feature', 'get_features_dict', 'get_lemma', 'get_need', 'get_object_type', 'get_procletic', 'get_stem', 'get_tags', 'get_vocalized', 'get_wordclass', 'get_wordtype', 'is_defined', 'stop_dict']\n\u003e\u003e\u003e \n```\n\n* get stopwords by category\n``` python\n\u003e\u003e\u003e from arabicstopwords.stopwords_lexicon import stopwords_lexicon \n\u003e\u003e\u003e lexicon = stopwords_lexicon()\n\u003e\u003e\u003e lexicon.get_categories()\n['حرف', 'ضمير', 'فعل', 'اسم', 'اسم فعل', 'حرف ابجدي']\n\u003e\u003e\u003e lexicon.get_by_category(\"اسم فعل\", lemma=True, vocalized=True)\n['آهاً', 'بَسّْ', 'بَسْ', 'حَايْ', 'صَهْ', 'صَهٍ', 'طَاقْ', 'طَقْ', 'عَدَسْ', 'كِخْ', 'نَخْ', 'هَجْ', 'وَا', 'وَا', 'وَاهاً', 'وَيْ', 'آمِينَ', 'آهٍ', 'أُفٍّ', 'أُفٍّ', 'أَمَامَكَ', 'أَوَّهْ', 'إِلَيْكَ', 'إِلَيْكُمْ', 'إِلَيْكُمَا', 'إِلَيْكُنَّ', 'إيهِ', 'بخٍ', 'بُطْآنَ', 'بَلْهَ', 'حَذَارِ', 'حَيَّ', 'دُونَكَ', 'رُوَيْدَكَ', 'سُرْعَانَ', 'شَتَّانَ', 'عَلَيْكَ', 'مَكَانَكَ', 'مَكَانَكِ', 'مَكَانَكُمْ', 'مَكَانَكُمَا', 'مَكَانَكُنَّ', 'مَهْ', 'هَا', 'هَاؤُمُ', 'هَاكَ', 'هَلُمَّ', 'هَيَّا', 'هِيتَ', 'هَيْهَاتَ', 'وَرَاءَكَ', 'وَرَاءَكِ', 'وُشْكَانَ', 'وَيْكَأَنَّ', 'وَرَاءَكُما', 'وَرَاءَكُمْ', 'وَرَاءَكُنَّ', 'بِئْسَمَا']\n```\n\n## Citation\n\nIf you would like to cite this work in an academic publication, you can use this citation:\n\n```text\nT. Zerrouki, Arabic Stop Words, https://github.com/linuxscout/arabicstopwords/, 2010\n```\n\nAnother Citation:\n\n```text\nZerrouki, Taha. 
\"Towards An Open Platform For Arabic Language Processing.\" (2020).\n```\n\nor in BibTeX format:\n\n```bibtex\n@misc{zerrouki2010arabicstopwords,\n  title={Arabic Stop Words},\n  author={Zerrouki, Taha},\n  url={https://github.com/linuxscout/arabicstopwords},\n  year={2010}\n}\n@thesis{zerrouki2020towards,\n  title={Towards An Open Platform For Arabic Language Processing},\n  author={Zerrouki, Taha},\n  year={2020}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flinuxscout%2Farabicstopwords","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flinuxscout%2Farabicstopwords","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flinuxscout%2Farabicstopwords/lists"}