Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/kargaranamir/parstdex
A package that extracts Persian time and date markers by applying regexes -- AACL 2022
https://github.com/kargaranamir/parstdex
datetime event-extract event-extraction hengam hengamtagger information-extraction nlp parstdex persian persian-calendar persian-datetime persian-time regex-pattern time-date
Last synced: 3 months ago
JSON representation
A package that extracts Persian time and date markers by applying regexes -- AACL 2022
- Host: GitHub
- URL: https://github.com/kargaranamir/parstdex
- Owner: kargaranamir
- License: mit
- Created: 2021-11-09T13:16:57.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2022-11-29T02:47:16.000Z (almost 2 years ago)
- Last Synced: 2024-07-10T18:17:01.015Z (4 months ago)
- Topics: datetime, event-extract, event-extraction, hengam, hengamtagger, information-extraction, nlp, parstdex, persian, persian-calendar, persian-datetime, persian-time, regex-pattern, time-date
- Language: Python
- Homepage: https://huggingface.co/spaces/kargaranamir/parstdex
- Size: 674 KB
- Stars: 25
- Watchers: 3
- Forks: 2
- Open Issues: 2
-
Metadata Files:
- Readme: README.md
- Contributing: contributing.md
- License: LICENSE
Awesome Lists containing this project
README
# HengamTagger or Parstdex (persian time date extractor)
[![Pypi Package](https://badgen.net/pypi/v/parstdex)](https://pypi.org/project/parstdex/)
[![Documentation Status](https://readthedocs.org/projects/parstdex/badge/?version=latest)](https://parstdex.readthedocs.io)
[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/kargaranamir/parstdex)
[![Google Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kargaranamir/parstdex/blob/main/performance_test.ipynb)## Description
**Parstdex** (knwon as **HengamTagger** in our paper at [aacl](https://aclanthology.org/2022.aacl-main.74/)) is a rule-based Persian temporal extractor built on top of regular expressions specifying pattern units and patterns that can match temporal expressions.## How to Install parstdex
```bash
pip install parstdex
```## How to use
```python
from parstdex import Parstdexmodel = Parstdex()
sentence = """ماریا شنبه عصر راس ساعت ۱۷ و بیست و سه دقیقه به نادیا زنگ زد اما تا سه روز بعد در تاریخ ۱۸ شهریور سال ۱۳۷۸ ه.ش. خبری از نادیا نشد"""
```
### Extract spans
```python
model.extract_span(sentence)
```
output :
```json
{"datetime": [[6, 47], [68, 78], [82, 111]], "date": [[6, 10], [68, 78], [82, 111]], "time": [[11, 47]]}
```### Extract markers
```python
model.extract_marker(sentence)
``````json
{
"datetime":{
"[6, 47]":"شنبه عصر راس ساعت ۱۷ و بیست و سه دقیقه به",
"[68, 78]":"سه روز بعد",
"[82, 111]":"تاریخ ۱۸ شهریور سال ۱۳۷۸ ه.ش."
},
"date":{
"[6, 10]":"شنبه",
"[68, 78]":"سه روز بعد",
"[82, 111]":"تاریخ ۱۸ شهریور سال ۱۳۷۸ ه.ش."
},
"time":{
"[11, 47]":"عصر راس ساعت ۱۷ و بیست و سه دقیقه به"
}
}
```### Extract TimeML scheme
```python
model.extract_time_ml(sentence)
```
output :
```html
ماریاشنبه
عصر راس ساعت ۱۷ و بیست و سه دقیقه به
نادیا زنگ زد اما
تا سه روز بعد
در
تاریخ ۱۸ شهریور سال ۱۳۷۸ ه.ش.
خبری از نادیا نشد
```### Extract markers' NER tags
#### DATTIM mode (Default):
```python
model.extract_ner(sentence, mode="dattim")
```
output :
```
[
("ماریا", "O"),
("شنبه", "B-DAT"),
("عصر", "B-TIM"),
("راس", "I-TIM"),
("ساعت", "I-TIM"),
("۱۷", "I-TIM"),
("و", "I-TIM"),
("بیست", "I-TIM"),
("و", "I-TIM"),
("سه", "I-TIM"),
("دقیقه", "I-TIM"),
("به", "I-TIM"),
("نادیا", "O"),
("زنگ", "O"),
("زد", "O"),
("اما", "O"),
("تا", "B-DAT"),
("سه", "I-DAT"),
("روز", "I-DAT"),
("بعد", "I-DAT"),
("در", "I-DAT"),
("تاریخ", "I-DAT"),
("۱۸", "I-DAT"),
("شهریور", "I-DAT"),
("سال", "I-DAT"),
("۱۳۷۸", "I-DAT"),
("ه", "I-DAT"),
(".", "I-DAT"),
("ش", "I-DAT"),
(".", "I-DAT"),
("خبری", "O"),
("از", "O"),
("نادیا", "O"),
("نشد", "O"),
]```
#### TMP mode:
```python
model.extract_ner(sentence, mode="tmp")
```
output :
```
[
("ماریا", "O"),
("شنبه", "B-TMP"),
("عصر", "I-TMP"),
("راس", "I-TMP"),
("ساعت", "I-TMP"),
("۱۷", "I-TMP"),
("و", "I-TMP"),
("بیست", "I-TMP"),
("و", "I-TMP"),
("سه", "I-TMP"),
("دقیقه", "I-TMP"),
("به", "I-TMP"),
("نادیا", "O"),
("زنگ", "O"),
("زد", "O"),
("اما", "O"),
("تا", "B-TMP"),
("سه", "I-TMP"),
("روز", "I-TMP"),
("بعد", "I-TMP"),
("در", "I-TMP"),
("تاریخ", "I-TMP"),
("۱۸", "I-TMP"),
("شهریور", "I-TMP"),
("سال", "I-TMP"),
("۱۳۷۸", "I-TMP"),
("ه", "I-TMP"),
(".", "I-TMP"),
("ش", "I-TMP"),
(".", "I-TMP"),
("خبری", "O"),
("از", "O"),
("نادیا", "O"),
("نشد", "O"),
]```
## File Structure:
Parstdex architecture is very flexible and scalable and therefore suggests an easy solution to adapt to new patterns which haven't been considered yet.
```
├── parstdex
│ └── utils
| | └── annotation
| | | └── ...
| | └── pattern
| | | └── ...
| | └── special_words
| | | └── words.txt
| | └── const.py
| | └── normalizer.py
| | └── pattern_to_regex.py
| | └── deprecation.py
| | └── regex_tool.py
| | └── spans.py
| | └── tokenizer.py
| └── marker_extractor.py
| └── settings.py
└── Test
│ └── data.json
| └── test_parstdex.py
|
└── examples.py
└── performance_test.ipynb
└── requirement.txt
└── setup.py
```## Performance Test
Executable codes and performance test results are accessible on [google colab](https://colab.research.google.com/github/kargaranamir/parstdex/blob/main/performance_test.ipynb).The average time required to obtain temporal expressions is `6 ms`. This test was conducted using 264 sentences with an average length of 50 characters that covered all of the patterns.
## How to contribute
Please feel free to provide us with any feedback or suggestions. You can find more information on how to contribute to Parstdex by reading the
[contribution document](https://github.com/kargaranamir/parstdex/blob/main/contributing.md).## Citation
If you use any part of this repository in your research, please cite it using the following BibTex entry.
```python
@inproceedings{mirzababaei-etal-2022-hengam,
title = {Hengam: An Adversarially Trained Transformer for {P}ersian Temporal Tagging},
author = {Mirzababaei, Sajad and Kargaran, Amir Hossein and Sch{\"u}tze, Hinrich and Asgari, Ehsaneddin},
year = 2022,
booktitle = {Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing},
publisher = {Association for Computational Linguistics},
address = {Online only},
pages = {1013--1024},
url = {https://aclanthology.org/2022.aacl-main.74}
}
```