https://github.com/de-mh/persian_phonemizer

A tool for translating Persian text to IPA (International Phonetic Alphabet).

Host: GitHub
URL: https://github.com/de-mh/persian_phonemizer
Owner: de-mh
License: mit
Created: 2022-05-19T11:48:32.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2022-08-26T06:58:58.000Z (about 2 years ago)
Last Synced: 2024-07-09T12:03:03.217Z (4 months ago)
Topics: dependency-parser, natural-language-processing, part-of-speech-tagger, persian, phonemization, python
Language: Python
Homepage:
Size: 11 MB
Stars: 56
Watchers: 3
Forks: 2
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

# persian_phonemizer

A tool for translating Persian text to IPA (International Phonetic Alphabet).

In Persian, a single written word can have several pronunciations, and its meaning depends on which one is intended. For example, مرد can be read as mæɾd ("man") or moɾd ("died").

This library helps disambiguate such words.

Some use cases for this library:

* Input for TTS systems

* Helping people learn Persian

* Adding pronunciations for Persian words in texts written in other languages

## Installation

```bash
pip install persian_phonemizer
```

## Usage

Quick start:

```python
>>> from persian_phonemizer import Phonemizer
>>> phonemizer = Phonemizer()
>>> phonemizer.phonemize("آن مرد مرد.")
'ʔɒːn mæɾd moɾd .'
>>> phonemizer.phonemize("دوچرخه جدید علی گم شد.")
'dovtʃʰæɾxeje dʒædiːde ʔæliː ɡom ʃod .'
```

You can also set the package to output Persian text with eraab (vowel diacritics) instead of IPA:

```python
>>> phonemizer = Phonemizer(output_format='eraab')
>>> phonemizer.phonemize("آن مرد مرد.")
'آن مَرد مُرد .'
```
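When several sentences need to be converted, one `Phonemizer` instance can be reused for all of them. The following is a minimal sketch that relies only on the `phonemize` method shown above; `phonemize_all` and `SupportsPhonemize` are hypothetical names introduced for illustration and are not part of the package.

```python
from typing import Iterable, List, Protocol


class SupportsPhonemize(Protocol):
    """Any object exposing phonemize(text) -> str, such as a Phonemizer instance."""

    def phonemize(self, text: str) -> str: ...


def phonemize_all(phonemizer: SupportsPhonemize, sentences: Iterable[str]) -> List[str]:
    """Phonemize every non-blank sentence with one shared phonemizer (hypothetical helper)."""
    results = []
    for sentence in sentences:
        sentence = sentence.strip()
        if sentence:  # skip blank lines rather than passing them on
            results.append(phonemizer.phonemize(sentence))
    return results


# Usage with the real package (assumes it is installed):
# from persian_phonemizer import Phonemizer
# phonemize_all(Phonemizer(), ["آن مرد مرد.", "دوچرخه جدید علی گم شد."])
```

Because the helper only depends on a `phonemize` method, it works the same way with `output_format='eraab'`.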

## What's inside?

- A database of words with their part of speech, pronunciation and meaning, taken from the Moen dictionary

    - A script for parsing the Dehkhoda dictionary is available in the dataset directory. Its results are not used in the package, because some of its pronunciations are outdated and would do more harm than good.

- A part-of-speech tagger and a dependency parser, trained on the [Universal Dependencies](https://universaldependencies.org/) dataset using [spaCy](https://spacy.io/)

- A grapheme-to-phoneme model: a seq-to-seq neural network implemented in PyTorch. More information is available in the [g2p_fa](https://github.com/de-mh/g2p_fa) repo.

These assets were created for this repo, but each of them can also be used on its own.

## How does it work?

This package combines several approaches to find the proper pronunciation:

1. The input text is normalized and tokenized.

2. The root of each input word is found with a lemmatizer, so that complex verbs and nouns are covered.

3. Each word is looked up in the database for its pronunciations.

    - If there is no pronunciation available, the pronunciation is predicted using [g2p_fa](https://github.com/de-mh/g2p_fa).

    - If there is one pronunciation, that one is used.

    - If there is more than one pronunciation, the correct one is chosen based on the part-of-speech tag for that word.

4. Suffix and prefix pronunciations are added to each word.

5. `e` or `je` (the ezafe) is added between words where needed, using the dependency parser.
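Steps 3 and 5 can be illustrated with a small, self-contained sketch. This is not the package's actual code: the toy `LEXICON`, the function names, the part-of-speech labels and the `needs_ezafe` flag (standing in for the dependency parser's decision) are all assumptions made for illustration. The two readings of مرد are taken from the usage example.

```python
from typing import Callable, Dict, List, Tuple

# Toy lexicon: written form -> {POS tag: IPA}. Hypothetical structure;
# the package's real database is not reproduced here.
LEXICON: Dict[str, Dict[str, str]] = {
    "مرد": {"NOUN": "mæɾd", "VERB": "moɾd"},
    "آن": {"DET": "ʔɒːn"},
}


def choose_pronunciation(word: str, pos: str, g2p: Callable[[str], str]) -> str:
    """Step 3: lexicon lookup, POS-based disambiguation, G2P fallback."""
    candidates = LEXICON.get(word, {})
    if not candidates:
        return g2p(word)  # no entry: predict with a G2P model
    first = next(iter(candidates.values()))
    if len(candidates) == 1:
        return first  # a single reading: use it
    # Several readings: pick by POS tag, else fall back to the first one.
    return candidates.get(pos, first)


def add_ezafe(phonemes: str) -> str:
    """Step 5: append `je` after a vowel-final word and `e` otherwise."""
    vowels = ("e", "o", "æ", "ɒː", "iː", "uː")
    return phonemes + ("je" if phonemes.endswith(vowels) else "e")


def phonemize_tagged(tokens: List[Tuple[str, str, bool]],
                     g2p: Callable[[str], str]) -> str:
    """tokens are (word, POS tag, needs_ezafe) triples, as a tagger and
    a dependency parser might supply them (assumed input format)."""
    out = []
    for word, pos, needs_ezafe in tokens:
        ipa = choose_pronunciation(word, pos, g2p)
        out.append(add_ezafe(ipa) if needs_ezafe else ipa)
    return " ".join(out)
```

With a placeholder G2P function, `phonemize_tagged([("آن", "DET", False), ("مرد", "NOUN", False), ("مرد", "VERB", False)], lambda w: "?")` returns `'ʔɒːn mæɾd moɾd'`, and `add_ezafe("dovtʃʰæɾxe")` gives `'dovtʃʰæɾxeje'`, matching the forms in the usage example. Punctuation, lemmatization and affix handling (steps 1, 2 and 4) are left out of the sketch.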