https://github.com/shreydan/multilingual-translation

Training a transformer for multilingual translation from scratch. Translates English to Hindi or Telugu. Trained on the Opus100 dataset for learning purposes.

Host: GitHub
URL: https://github.com/shreydan/multilingual-translation
Owner: shreydan
Created: 2023-11-01T18:00:42.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2023-11-01T18:09:13.000Z (over 2 years ago)
Last Synced: 2025-06-04T09:03:49.601Z (about 1 year ago)
Topics: indic-languages, transformers, translation
Language: Jupyter Notebook
Homepage: https://www.kaggle.com/code/shreydan/en-hi-te-translation/
Size: 333 KB
Stars: 2
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Multilingual Machine Translation with Transformers

> This project was for learning purposes only, so the focus was on getting decent results rather than building an alternative to existing multilingual models.

- Implemented a [7M parameter model](./model.py).
- Trained a BERT-style tokenizer.
- Trained on [Opus100](https://huggingface.co/datasets/opus100) Dataset with `en-hi` & `en-te` subsets.
- The full walkthrough is on [Kaggle](https://www.kaggle.com/code/shreydan/en-hi-te-translation).

```
ENGLISH ----> HINDI
        |
        --> TELUGU
```

## Working

- The model knows which language to translate into from the beginning-of-sentence (`bos`) token that starts each sentence:
  - English sentences start with an English-specific `bos` token
  - Hindi sentences start with a Hindi-specific `bos` token
  - Telugu sentences start with a Telugu-specific `bos` token
  - all sentences end with the same end-of-sentence token
- Trained as a sequence-to-sequence transformer with an encoder-decoder architecture. The encoder handles English and the decoder handles both Hindi and Telugu.
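
The tagging scheme above can be sketched roughly as follows. This is only an illustration: the token strings (`[BOS_EN]`, `[BOS_HI]`, `[BOS_TE]`, `[EOS]`) and the helper names are hypothetical placeholders, and the tokenizer in this repository may use different special tokens.

```python
# Hypothetical special tokens; the repository's tokenizer may use different strings.
BOS = {"en": "[BOS_EN]", "hi": "[BOS_HI]", "te": "[BOS_TE]"}
EOS = "[EOS]"


def tag_sentence(tokens, lang):
    """Wrap a token list with the language-specific bos token and the shared eos token."""
    if lang not in BOS:
        raise ValueError(f"unsupported language: {lang!r}")
    return [BOS[lang], *tokens, EOS]


def make_pair(src_tokens, tgt_tokens, tgt_lang):
    """Build one training pair: English for the encoder, Hindi or Telugu for the decoder."""
    if tgt_lang not in ("hi", "te"):
        raise ValueError("target language must be 'hi' or 'te'")
    return tag_sentence(src_tokens, "en"), tag_sentence(tgt_tokens, tgt_lang)


def decoder_prompt(tgt_lang):
    """At inference time, seed the decoder with the target language's bos token."""
    return [BOS[tgt_lang]]
```

Because the target language is carried by the first decoder token, a single decoder can serve both Hindi and Telugu: switching the output language only means switching the token that decoding starts from.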

## Model Config
```py
config = {
    'dim': 128,
    'n_heads': 4,
    'attn_dropout': 0.1,
    'mlp_dropout': 0.1,
    'depth': 8,
    'vocab_size': 30000,
    'max_len': 128
}
```
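
A couple of quantities follow directly from this config. The sketch below derives the per-head dimension and, assuming a single token embedding table of shape `(vocab_size, dim)` (an assumption; `model.py` may organize its embeddings differently), the number of parameters in that table.

```py
config = {
    'dim': 128,
    'n_heads': 4,
    'attn_dropout': 0.1,
    'mlp_dropout': 0.1,
    'depth': 8,
    'vocab_size': 30000,
    'max_len': 128,
}

# Multi-head attention splits the model dimension evenly across heads.
assert config['dim'] % config['n_heads'] == 0
head_dim = config['dim'] // config['n_heads']

# Assumption: one token embedding table of shape (vocab_size, dim).
token_embedding_params = config['vocab_size'] * config['dim']

print(head_dim)                # 32
print(token_embedding_params)  # 3840000
```

Under that assumption, the token embeddings alone would account for a bit over half of the roughly 7M parameters, which is typical for a small model with a 30k vocabulary.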

## Inference Results

```
python inference.py --text 'how are you?' -l hi -s
>>> आप कैसे हैं?

python inference.py --text 'please call me' -l hi
>>> कृपया मुझे पुकारो

python inference.py --text 'what are you doing?' -l te -s -t 0.5
>>> మీరు ఏం చేస్తున్నారు?

python inference.py --text "what's wrong?" -l te -s
>>> ఏమి తప్పు?
```

> The results are kinda hilarious, but at least it works.

> Here's the SOTA model if you really want good-quality multilingual Indic translation: [ai4bharat/indictrans2-indic-en-1B](https://huggingface.co/ai4bharat/indictrans2-indic-en-1B). It's even used officially by the govt. of India.

```
I have refrained my feet from every evil way,
That I might keep thy word.
Psalm 119:101
```
