https://github.com/mt-empty/assyrian-translation-model

Assyrian translation model using huggingface
https://github.com/mt-empty/assyrian-translation-model

assyrian nena neo-aramaic northeastern-neo-aramaic syriac translation translation-model

Last synced: 5 months ago
JSON representation

Assyrian translation model using huggingface

Host: GitHub
URL: https://github.com/mt-empty/assyrian-translation-model
Owner: mt-empty
Created: 2022-03-03T12:28:41.000Z (over 4 years ago)
Default Branch: master
Last Pushed: 2025-02-22T10:57:43.000Z (over 1 year ago)
Last Synced: 2025-02-22T11:28:29.423Z (over 1 year ago)
Topics: assyrian, nena, neo-aramaic, northeastern-neo-aramaic, syriac, translation, translation-model
Language: Python
Homepage:
Size: 1.22 MB
Stars: 3
Watchers: 2
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: readme.md

Awesome Lists containing this project

README

# English to Assyrian/Eastern Syriac Model

This is an English to Assyrian/Eastern Syriac machine translation model, it uses [English to Arabic](https://huggingface.co/Helsinki-NLP/opus-mt-en-ar) model as the base model.
The source code is well documented, and was made such that it can be read by inexperienced developers.

Although the project aim is to Build a English to Assyrian - the ones that fall under [Northeastern Neo-Aramaic](https://en.wikipedia.org/wiki/Northeastern_Neo-Aramaic) - the current model mostly provides translation for Classical Syriac. This model is a good initial step, but I hope future work will make it more inline with Assyrian dialects.

*Please note that Assyrian and Easter Syriac are used interchangeably.*

## To Use

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

tokenizer = AutoTokenizer.from_pretrained("mt-empty/english-assyrian")

model = AutoModelForSeq2SeqLM.from_pretrained("mt-empty/english-assyrian")

translator = pipeline("translation", model=model, tokenizer=tokenizer)

print("tomorrow morning", translator("tomorrow morning"))

```

[test.py](./test.py)/[test.ipynb](./test.ipynb) contains examples on how to use the translation pipeline.

## Dataset

The dataset are sourced from:

* [Classical Syriac bible](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models/en-syr)
* [pericopes](https://easy.dans.knaw.nl/ui/datasets/id/easy-dataset:114404/tab/2)

## Tokenizer

[SentencePiece](https://github.com/google/sentencepiece/) was used for tokenization.

## Evaluation

[SacreBlue](https://github.com/mjpost/sacrebleu) was used for evaluation. Running it for 50 epochs produced a score of **33**.

## To Train

Please make sure you have installed all the required dependencies,
Run `python model.py`, it will train for `50` epochs, this can be changed in the code.

## Before Contributing

This project utilizes pre-commit hooks, so please run the following before submitting a pull request:

1. Install requirements, `pip install -r requirements.txt`
2. Configure pre-commit hooks, `pre-commit install`
3. (Optional) Run hooks manually, `pre-commit run --all-files`
4. Submit a pull request

### Main Dependencies

```
datasets
transformers
sentencepiece
pandas
pytorch
sacrebleu
```

Please install the appropriate version of [pytorch](https://pytorch.org/) for your machine, `cuda` is needed if you want to train on GPU.

## Interesting Research Projects

* [North Eastern Neo-Aramaic Database](https://github.com/CambridgeSemiticsLab)
* [Neo-Aramaic Web Corpora: Christian Urmi and Turoyo](http://neo-aramaic.web-corpora.net/index_en.html)

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/mt-empty/assyrian-translation-model

Awesome Lists containing this project

README