https://github.com/kargaranamir/parstdex

A package that extracts Persian time and date markers by applying regexes -- AACL 2022
datetime event-extract event-extraction hengam hengamtagger information-extraction nlp parstdex persian persian-calendar persian-datetime persian-time regex-pattern time-date

Host: GitHub
URL: https://github.com/kargaranamir/parstdex
Owner: kargaranamir
License: mit
Created: 2021-11-09T13:16:57.000Z (about 3 years ago)
Default Branch: main
Last Pushed: 2022-11-29T02:47:16.000Z (almost 2 years ago)
Last Synced: 2024-07-10T18:17:01.015Z (4 months ago)
Topics: datetime, event-extract, event-extraction, hengam, hengamtagger, information-extraction, nlp, parstdex, persian, persian-calendar, persian-datetime, persian-time, regex-pattern, time-date
Language: Python
Homepage: https://huggingface.co/spaces/kargaranamir/parstdex
Size: 674 KB
Stars: 25
Watchers: 3
Forks: 2
Open Issues: 2
Metadata Files:
- Readme: README.md
- Contributing: contributing.md
- License: LICENSE

# HengamTagger or Parstdex (Persian time date extractor)

[![Pypi Package](https://badgen.net/pypi/v/parstdex)](https://pypi.org/project/parstdex/)

[![Documentation Status](https://readthedocs.org/projects/parstdex/badge/?version=latest)](https://parstdex.readthedocs.io)

[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/kargaranamir/parstdex)

[![Google Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kargaranamir/parstdex/blob/main/performance_test.ipynb)

## Description 

**Parstdex** (known as **HengamTagger** in our [AACL 2022 paper](https://aclanthology.org/2022.aacl-main.74/)) is a rule-based Persian temporal extractor built on regular expressions. It defines pattern units and combines them into patterns that match temporal expressions.

## How to Install parstdex

```bash
pip install parstdex
```

## How to use

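```python
from parstdex import Parstdex

# Create the extractor.
model = Parstdex()

# Example sentence. In English: "Maria called Nadia on Saturday evening at
# exactly 17:23, but until three days later, on 18 Shahrivar 1378 SH,
# there was no news from Nadia."
sentence = """ماریا شنبه عصر راس ساعت ۱۷ و بیست و سه دقیقه به نادیا زنگ زد اما تا سه روز بعد در تاریخ ۱۸ شهریور سال ۱۳۷۸ ه.ش. خبری از نادیا نشد"""
```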

### Extract spans

```python
model.extract_span(sentence)
```

Output:

```json
{"datetime": [[6, 47], [68, 78], [82, 111]], "date": [[6, 10], [68, 78], [82, 111]], "time": [[11, 47]]}
```
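The spans above appear to be character offsets into the input string with an exclusive end index (in the example, `[6, 10]` covers the four characters of `شنبه`). Under that assumption, a minimal sketch for turning spans back into text could look like the following. `spans_to_text` is a hypothetical helper, not part of parstdex, and the spans are hard-coded from the output shown above so the snippet runs without the package installed.

```python
def spans_to_text(sentence, spans):
    """Slice `sentence` with [start, end) offsets for every marker type."""
    return {
        kind: [sentence[start:end] for start, end in offsets]
        for kind, offsets in spans.items()
    }


sentence = "ماریا شنبه عصر راس ساعت ۱۷ و بیست و سه دقیقه به نادیا زنگ زد اما تا سه روز بعد در تاریخ ۱۸ شهریور سال ۱۳۷۸ ه.ش. خبری از نادیا نشد"

# Copied from the extract_span output above (assumed to be [start, end) offsets).
spans = {
    "datetime": [[6, 47], [68, 78], [82, 111]],
    "date": [[6, 10], [68, 78], [82, 111]],
    "time": [[11, 47]],
}

markers = spans_to_text(sentence, spans)
print(markers["date"])
```

If the library normalizes the text before matching, offsets may refer to the normalized string rather than the raw input, so it is worth checking a few slices by eye.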

### Extract markers

```python
model.extract_marker(sentence)
```

Output:

```json
{
   "datetime":{
      "[6, 47]":"شنبه عصر راس ساعت ۱۷ و بیست و سه دقیقه به",
      "[68, 78]":"سه روز بعد",
      "[82, 111]":"تاریخ ۱۸ شهریور سال ۱۳۷۸ ه.ش."
   },
   "date":{
      "[6, 10]":"شنبه",
      "[68, 78]":"سه روز بعد",
      "[82, 111]":"تاریخ ۱۸ شهریور سال ۱۳۷۸ ه.ش."
   },
   "time":{
      "[11, 47]":"عصر راس ساعت ۱۷ و بیست و سه دقیقه به"
   }
}
```
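In the output above, each marker is keyed by its span written as a string such as `"[6, 47]"`. Assuming the keys always follow that form, they can be parsed back into integer offsets with the standard `json` module. The sketch below uses a hypothetical helper, `parse_marker_keys`, and a small hand-copied sample rather than a live call to the library.

```python
import json


def parse_marker_keys(markers):
    """Convert keys like "[6, 47]" into (start, end) integer tuples."""
    return {
        kind: {tuple(json.loads(key)): text for key, text in found.items()}
        for kind, found in markers.items()
    }


# Sample copied from the extract_marker output above.
sample = {"date": {"[6, 10]": "شنبه", "[68, 78]": "سه روز بعد"}}

parsed = parse_marker_keys(sample)
print(sorted(parsed["date"]))
```

Tuple keys sort numerically, which makes it easier to order markers by position than with the string form.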

### Extract TimeML scheme

```python
model.extract_time_ml(sentence)
```

Output:

```html

ماریا 

شنبه

عصر راس ساعت ۱۷ و بیست و سه دقیقه به

 نادیا زنگ زد اما 

تا سه روز بعد

 در 

تاریخ ۱۸ شهریور سال ۱۳۷۸ ه.ش.

خبری از نادیا نشد

```

### Extract markers' NER tags

#### DATTIM mode (Default):

```python
model.extract_ner(sentence, mode="dattim")
```

Output:

```
[
    ("ماریا", "O"),
    ("شنبه", "B-DAT"),
    ("عصر", "B-TIM"),
    ("راس", "I-TIM"),
    ("ساعت", "I-TIM"),
    ("۱۷", "I-TIM"),
    ("و", "I-TIM"),
    ("بیست", "I-TIM"),
    ("و", "I-TIM"),
    ("سه", "I-TIM"),
    ("دقیقه", "I-TIM"),
    ("به", "I-TIM"),
    ("نادیا", "O"),
    ("زنگ", "O"),
    ("زد", "O"),
    ("اما", "O"),
    ("تا", "B-DAT"),
    ("سه", "I-DAT"),
    ("روز", "I-DAT"),
    ("بعد", "I-DAT"),
    ("در", "I-DAT"),
    ("تاریخ", "I-DAT"),
    ("۱۸", "I-DAT"),
    ("شهریور", "I-DAT"),
    ("سال", "I-DAT"),
    ("۱۳۷۸", "I-DAT"),
    ("ه", "I-DAT"),
    (".", "I-DAT"),
    ("ش", "I-DAT"),
    (".", "I-DAT"),
    ("خبری", "O"),
    ("از", "O"),
    ("نادیا", "O"),
    ("نشد", "O"),
]
```

#### TMP mode:

```python
model.extract_ner(sentence, mode="tmp")
```

Output:

```
[
    ("ماریا", "O"),
    ("شنبه", "B-TMP"),
    ("عصر", "I-TMP"),
    ("راس", "I-TMP"),
    ("ساعت", "I-TMP"),
    ("۱۷", "I-TMP"),
    ("و", "I-TMP"),
    ("بیست", "I-TMP"),
    ("و", "I-TMP"),
    ("سه", "I-TMP"),
    ("دقیقه", "I-TMP"),
    ("به", "I-TMP"),
    ("نادیا", "O"),
    ("زنگ", "O"),
    ("زد", "O"),
    ("اما", "O"),
    ("تا", "B-TMP"),
    ("سه", "I-TMP"),
    ("روز", "I-TMP"),
    ("بعد", "I-TMP"),
    ("در", "I-TMP"),
    ("تاریخ", "I-TMP"),
    ("۱۸", "I-TMP"),
    ("شهریور", "I-TMP"),
    ("سال", "I-TMP"),
    ("۱۳۷۸", "I-TMP"),
    ("ه", "I-TMP"),
    (".", "I-TMP"),
    ("ش", "I-TMP"),
    (".", "I-TMP"),
    ("خبری", "O"),
    ("از", "O"),
    ("نادیا", "O"),
    ("نشد", "O"),
]
```
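Both modes return `(token, tag)` pairs in the BIO scheme: `B-` opens a chunk, `I-` continues it, and `O` marks tokens outside any temporal expression. A minimal sketch for merging such pairs into labelled chunks follows. `group_bio` is a hypothetical helper written for illustration, not a parstdex function, and the input is a shortened copy of the DATTIM output above.

```python
def group_bio(tagged):
    """Merge (token, tag) pairs in BIO format into (label, text) chunks."""
    chunks = []
    current = None  # chunk being built, stored as [label, [tokens]]
    for token, tag in tagged:
        if tag == "O":
            current = None
            continue
        prefix, label = tag.split("-", 1)
        # Start a new chunk on B-, on a stray I-, or when the label changes.
        if prefix == "B" or current is None or current[0] != label:
            current = [label, []]
            chunks.append(current)
        current[1].append(token)
    return [(label, " ".join(tokens)) for label, tokens in chunks]


tagged = [
    ("ماریا", "O"),
    ("شنبه", "B-DAT"),
    ("عصر", "B-TIM"),
    ("راس", "I-TIM"),
    ("نادیا", "O"),
]

chunks = group_bio(tagged)
print(chunks)
```

Joining tokens with a single space is a simplification: it will not reproduce the original spacing around punctuation such as `ه.ش.`, so use the character spans when exact text is needed.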

## File Structure

Parstdex's architecture is flexible and scalable, so it is easy to add patterns that are not covered yet.

```
├── parstdex
│   ├── utils
│   │   ├── annotation
│   │   │   └── ...
│   │   ├── pattern
│   │   │   └── ...
│   │   ├── special_words
│   │   │   └── words.txt
│   │   ├── const.py
│   │   ├── normalizer.py
│   │   ├── pattern_to_regex.py
│   │   ├── deprecation.py
│   │   ├── regex_tool.py
│   │   ├── spans.py
│   │   └── tokenizer.py
│   ├── marker_extractor.py
│   └── settings.py
├── Test
│   ├── data.json
│   └── test_parstdex.py
├── examples.py
├── performance_test.ipynb
├── requirement.txt
└── setup.py
```
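To give a feel for how pattern units can be combined into a pattern, here is a toy sketch. It is an illustration of the general idea only and does not reproduce the library's actual annotation files, pattern syntax, or `pattern_to_regex.py`; the unit names `NUM` and `MONTH` and the helper `compile_pattern` are assumptions made up for this example.

```python
import re

# Hypothetical pattern units: each name maps to a small regex fragment.
UNITS = {
    "NUM": r"[۰-۹]+",                      # one or more Persian digits
    "MONTH": r"(?:فروردین|شهریور|مهر)",    # a few month names
}


def compile_pattern(pattern, units):
    """Replace unit names with their regex; treat other tokens as literals."""
    parts = [units.get(token, re.escape(token)) for token in pattern.split()]
    return re.compile(r"\s+".join(parts))


regex = compile_pattern("NUM MONTH", UNITS)
match = regex.search("در تاریخ ۱۸ شهریور سال ۱۳۷۸")
print(match.group(), match.span())
```

Under this scheme, supporting a new expression means adding a unit or a pattern string rather than writing a new regular expression by hand.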

## Performance Test 

Executable code and performance test results are available on [Google Colab](https://colab.research.google.com/github/kargaranamir/parstdex/blob/main/performance_test.ipynb).

The average time required to extract temporal expressions is `6 ms`. The test used 264 sentences, with an average length of 50 characters, that together cover all of the patterns.
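A rough way to take a similar measurement locally is to time repeated calls and divide by the number of sentences. The sketch below is not the notebook's benchmark: `average_latency_ms` and `dummy_extract` are hypothetical names, and the stand-in extractor exists only so the snippet runs without parstdex installed. With the package available, `Parstdex().extract_span` could be passed instead.

```python
import time


def average_latency_ms(extract, sentences):
    """Return the mean wall-clock time per call, in milliseconds."""
    start = time.perf_counter()
    for sentence in sentences:
        extract(sentence)
    elapsed = time.perf_counter() - start
    return 1000 * elapsed / len(sentences)


def dummy_extract(sentence):
    """Stand-in for an extractor; returns an empty result."""
    return {"datetime": [], "date": [], "time": []}


sentences = ["سه روز بعد"] * 264  # placeholder data, not the real test set
avg = average_latency_ms(dummy_extract, sentences)
print(f"{avg:.3f} ms per call")
```

Results depend on hardware and on whether model construction is included, so keep the `Parstdex()` call outside the timed loop when comparing numbers.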

## How to contribute

Feedback and suggestions are welcome. For more information on how to contribute to Parstdex, see the [contribution document](https://github.com/kargaranamir/parstdex/blob/main/contributing.md).

## Citation

If you use any part of this repository in your research, please cite it using the following BibTeX entry.

```bibtex
@inproceedings{mirzababaei-etal-2022-hengam,
	title        = {Hengam: An Adversarially Trained Transformer for {P}ersian Temporal Tagging},
	author       = {Mirzababaei, Sajad  and Kargaran, Amir Hossein  and Sch{\"u}tze, Hinrich  and Asgari, Ehsaneddin},
	year         = 2022,
	booktitle    = {Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing},
	publisher    = {Association for Computational Linguistics},
	address      = {Online only},
	pages        = {1013--1024},
	url          = {https://aclanthology.org/2022.aacl-main.74}
}
```