Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.
Awesome Lists | Featured Topics | Projects
https://github.com/linuxscout/arabicstopwords

Arabic Stop Word List
https://github.com/linuxscout/arabicstopwords
arabic-nlp language nlp
Last synced: 14 days ago
JSON representation
Arabic Stop Word List
Host: GitHub
URL: https://github.com/linuxscout/arabicstopwords
Owner: linuxscout
License: other
Created: 2016-12-23T20:28:01.000Z (almost 8 years ago)
Default Branch: master
Last Pushed: 2024-01-11T20:29:05.000Z (10 months ago)
Last Synced: 2024-09-22T11:06:14.997Z (about 2 months ago)
Topics: arabic-nlp, language, nlp
Language: Python
Size: 8.23 MB
Stars: 32
Watchers: 5
Forks: 9
Open Issues: 3
Metadata Files:
- Readme: README.md
- Changelog: ChangeLog
- Funding: .github/FUNDING.yml
- License: LICENSE
- Authors: AUTHORS
Awesome Lists containing this project

README

        # Arabic Stop words

![Arabic Stop words logo](doc/arabicStopWordsheader.png "Arabic Stop Words logo")

![PyPI - Downloads](https://img.shields.io/pypi/dm/Arabic-Stopwords)

  Developers:  Taha Zerrouki: http://tahadz.com

    taha dot zerrouki at gmail dot com

    

Features |   value

---------|---------------------------------------------------------------------------------

Authors  | [Authors.md](https://github.com/linuxscout/arabicstopwords/main/AUTHORS.md)

Release  | 0.9

License  |[GPL](https://github.com/linuxscout/arabicstopwords/main/LICENSE)

Tracker  |[linuxscout/arabicstopwords/Issues](https://github.com/linuxscout/arabicstopwords/issues)

Source  |[Github](http://github.com/linuxscout/arabicstopwords)

Website  |[ArabicStopwords on SourceForge](https://arabicstopwords.sf.net)

Doc  |[package Documentation](https://arabicstopwords.readthedocs.io/)

Download  |[Python Library](https://pypi.python.org/pypi/https://pypi.org/project/Arabic-Stopwords/)

Download  | Data set [CSV/SQL/Python](https://github.com/linuxscout/arabicstopwords/releases/latest)

Feedbacks |[Comments](https://github.com/linuxscout/arabicstopwords/)

Accounts  |[@Twitter](https://twitter.com/linuxscout))

Citation |[T. Zerrouki‏, Arabic Stop Words](#Citation)

## Description

It's not easy to detemine the stop words, and in other hand, stop words differs according to the case,

for this purpos, we propose a  classified list which can be parametered by  developper.

The Word list contains only words in its common forms, and we have generated all forms by a script.

It can used as library 'see section [arabicstopwords](#Arabic-Stop-words-Library) library'

## Files

* data-source/ : contains  source data of stopwords

* data-source/classified/stopwords.ods: data in LibreOffice format with more valuble informations, and classified stopwords

* releases/latest: csv/sql/python formats:

  * Classified stop words (lemmas)

  * Inflected forms

  * Corpus based lists

* docs: docs files

* scripts: scripts used to generate all forms, and file formats

## Data

This project contains two parts:

- Data part, which contains classified stopwords, all generated forms,  in multiple  format

  - CSV

  - Python

  - SQL / Sqlite

  - another list of most frequent in corpus like (Wikipedia and Tashkeela Corpus)

- Python library for handling stopwords.

### Data Structure

Two fromats of data are given:

- classified words (lemma) with features to generate inflected froms

- Generated forms from lemmas with adding affixes.

![Stopwords Example](doc/images/stopwords.png  "Stopwords Example")

    

Minimal classified  data .ODS/CSV file 

- 1st field : unvocalised word ( في)

- 2nd field : type of the word: e.g. حرف

- 3rd field : class of word : e.g. preposition

Affixation infomration in other fields:

-    4th field : AIN in Arabic , if word accept Conjunction 'العطف', '*' else

-    5th field : TEH in Arabic , if word accept definate article 'ال التعريف', '*' else

-    6th field : JEEM in Arabic , if word accept preposition  article 'حروف الجر المتصلة', '*' else      

-    7th field : DAD in Arabic , if word accept IDAFA  articles 'الضمائر المتصلة', '*' else              

-    7th field : SAD in Arabic , if word accept verb conjugation  articles 'التصريف', '*' else       

-    8th field : LAM in Arabic , if word accept LAM QASAM   articles 'لام القسم', '*' else       

-    8th field : MEEM in Arabic , if word has ALEF LAM as definition article 'معرف', '*' else        

All forms data CSV file

- 1st field : unvocalised word ( بأنك)

- 2nd field : vocalised inflected word with : e.g. ف-ب-خمسين-ي

- 3rd field:  word type (super class): noun, verb, tool حرف

- 4th field:  word type (sub class): إنّ وأخواتها 

- 5th field:  original or lemma: إن

- 6th field:  procletic : ب

- 7th field:  stem : أن

- 8th field:  encletic: ك

- 9th field:  tags: جر:مضاف

```csv

word    vocalized   type    category    original    procletic   stem    encletic    tags

بأنك    بِأَنّكَ    حرف إن و أخواتها    أن  ب-      -ك  جر:مضاف

بأنكما  بِأَنّكُمَا حرف إن و أخواتها    أن  ب-      -كما    جر:مضاف

```

## How to customize stop word list

* check the minimal form data file (stopwords.csv)

* comment by "#" all words which you don't need

* run 

```

make

```

* catch the output of script in releases folder.

## How to update data

* check if the word doesn't exist in the minimal form data file ( classified/stopwords.ods)

* add affixation information

* run 

```

make

```

* catch the output of script in releases folder.

## Arabic Stop words Library

### Install

``` shell

pip install arabicstopwords

```

### Usage

* test if a word is stop

``` python

>>> import arabicstopwords.arabicstopwords as stp

>>> # test if a word is a stop

... stp.is_stop(u'ممكن')

False

>>> stp.is_stop(u'منكم')

True

```

* stem a stopword

```python

>>> word = u"لعلهم"

>>> stp.stop_stem(word)

u'لعل'

```

* list all stop words

```

>>> stp.stopwords_list()

......

>>> len(stp.stopwords_list())

13629

>>> len(stp.classed_stopwords_list())

 507

```

* give all forms of a stopword

```python

>>> stp.stopword_forms(u"على")

....

>>> len(stp.stopword_forms(u"على"))

144

```

* get stopword as list of dictionaries

``` python

>>> from arabicstopwords.stopwords_lexicon import stopwords_lexicon 

>>> lexicon = stopwords_lexicon()

>>> # test if a word is a stop

... lexicon.is_stop(u'ممكن')

False

>>> lexicon.is_stop(u'منكم')

True

>>> lexicon.get_features_dict(u'منكم')

[{'vocalized': 'منكم', 'procletic': '', 'tags': 'حرف;حرف جر;ضمير', 'stem': 'من', 'type': 'حرف', 'original': 'من', 'encletic': '-كم'}]

```

* get stopword as tuple

``` python

>>> from arabicstopwords.stopwords_lexicon import stopwords_lexicon 

>>> lexicon = stopwords_lexicon()

>>> tuples = lexicon.get_stopwordtuples(u'منكم')

>>> tuples

[]

>>> for tup in tuples:

...     print(tup)

... 

{'vocalized': 'منكم', 'procletic': '', 'tags': 'حرف;حرف جر;ضمير', 'stem': 'من', 'type': 'حرف', 'original': 'من', 'encletic': '-كم'}

>>> >>> for tup in tuples:

...     dir(tup)

... 

['accept_conjuction', 'accept_conjugation', 'accept_definition', 'accept_inflection', 'accept_interrog', 'accept_preposition', 'accept_pronoun', 'accept_qasam', 'accept_tanwin', 'get_action', 'get_enclitic', 'get_feature', 'get_features_dict', 'get_lemma', 'get_need', 'get_object_type', 'get_procletic', 'get_stem', 'get_tags', 'get_vocalized', 'get_wordclass', 'get_wordtype', 'is_defined', 'stop_dict']

>>> 

```

* get stopword by categories

``` python

>>> from arabicstopwords.stopwords_lexicon import stopwords_lexicon 

>>> lexicon = stopwords_lexicon()

>>> lexicon.get_categories()

['حرف', 'ضمير', 'فعل', 'اسم', 'اسم فعل', 'حرف ابجدي']

>>> lexicon.get_by_category("اسم فعل", lemma=True, vocalized=True)

['آهاً', 'بَسّْ', 'بَسْ', 'حَايْ', 'صَهْ', 'صَهٍ', 'طَاقْ', 'طَقْ', 'عَدَسْ', 'كِخْ', 'نَخْ', 'هَجْ', 'وَا', 'وَا', 'وَاهاً', 'وَيْ', 'آمِينَ', 'آهٍ', 'أُفٍّ', 'أُفٍّ', 'أَمَامَكَ', 'أَوَّهْ', 'إِلَيْكَ', 'إِلَيْكُمْ', 'إِلَيْكُمَا', 'إِلَيْكُنَّ', 'إيهِ', 'بخٍ', 'بُطْآنَ', 'بَلْهَ', 'حَذَارِ', 'حَيَّ', 'دُونَكَ', 'رُوَيْدَكَ', 'سُرْعَانَ', 'شَتَّانَ', 'عَلَيْكَ', 'مَكَانَكَ', 'مَكَانَكِ', 'مَكَانَكُمْ', 'مَكَانَكُمَا', 'مَكَانَكُنَّ', 'مَهْ', 'هَا', 'هَاؤُمُ', 'هَاكَ', 'هَلُمَّ', 'هَيَّا', 'هِيتَ', 'هَيْهَاتَ', 'وَرَاءَكَ', 'وَرَاءَكِ', 'وُشْكَانَ', 'وَيْكَأَنَّ', 'وَرَاءَكُما', 'وَرَاءَكُمْ', 'وَرَاءَكُنَّ', 'بِئْسَمَا']

```

## Citation

If you would cite it in academic work, can you use this citation

```text

T. Zerrouki‏, Arabic Stop Words,  https://github.com/linuxscout/arabicstopwords/, 2010

```

Another Citation:

```text

Zerrouki, Taha. "Towards An Open Platform For Arabic Language Processing." (2020).

```

or in bibtex format

```bibtex

@misc{zerrouki2010arabicstopwords,

  title={Arabic Stop Words},

  author={Zerrouki, Taha},

  url={https://github.com/linuxscout/arabicstopwords},

  year={2010}

}

@thesis{zerrouki2020towards,

  title={Towards An Open Platform For Arabic Language Processing},

  author={Zerrouki, Taha},

  year={2020}

}

```

##