https://github.com/shreydan/multilingual-translation
Training a transformer for multilingual translation from scratch. Translates English to Hindi or Telugu. Trained on the Opus100 dataset for learning purposes.
- Host: GitHub
- URL: https://github.com/shreydan/multilingual-translation
- Owner: shreydan
- Created: 2023-11-01T18:00:42.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-11-01T18:09:13.000Z (over 2 years ago)
- Last Synced: 2025-06-04T09:03:49.601Z (about 1 year ago)
- Topics: indic-languages, transformers, translation
- Language: Jupyter Notebook
- Homepage: https://www.kaggle.com/code/shreydan/en-hi-te-translation/
- Size: 333 KB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Multilingual Machine Translation with Transformers
> This project was for learning purposes only, so it focuses on getting decent results rather than building an alternative to existing multilingual models.
- Implemented a [7M parameter model](./model.py).
- Trained a BERT-style tokenizer.
- Trained on the [Opus100](https://huggingface.co/datasets/opus100) dataset using the `en-hi` and `en-te` subsets.
- The full walkthrough is on [Kaggle](https://www.kaggle.com/code/shreydan/en-hi-te-translation).
```
ENGLISH ----> HINDI
   |
   --> TELUGU
```
## Working
- The model knows which language to translate into from the beginning-of-sentence (`bos`) token that starts each sequence:
  - English sentences start with their own language-specific `bos` token
  - Hindi sentences start with their own language-specific `bos` token
  - Telugu sentences start with their own language-specific `bos` token
  - all sentences end with a shared end-of-sentence token
- Trained as a sequence-to-sequence transformer with an encoder-decoder architecture: the encoder handles English and the decoder handles both Hindi and Telugu.
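The language-token scheme can be illustrated with a minimal sketch. The token strings (`<bos-en>`, `<bos-hi>`, `<bos-te>`, `<eos>`) and the helper names below are hypothetical placeholders, not the project's actual tokens or code, and the right-shifted decoder input is the usual teacher-forcing setup, assumed here rather than confirmed by this README.

```python
# Hypothetical special-token names; the project's real tokens are not
# shown in this README.
BOS = {"en": "<bos-en>", "hi": "<bos-hi>", "te": "<bos-te>"}
EOS = "<eos>"


def wrap(tokens, lang):
    """Surround a token list with the language-specific BOS and the shared EOS."""
    return [BOS[lang]] + list(tokens) + [EOS]


def make_training_pair(src_tokens, tgt_tokens, tgt_lang):
    """Build encoder input, decoder input, and labels for one example."""
    encoder_input = wrap(src_tokens, "en")
    target = wrap(tgt_tokens, tgt_lang)
    # Teacher forcing (assumed): the decoder sees the target shifted right
    # by one position and is trained to predict the next token.
    decoder_input = target[:-1]
    labels = target[1:]
    return encoder_input, decoder_input, labels


def decoder_prompt(tgt_lang):
    """At inference, decoding starts from the target-language BOS token alone."""
    return [BOS[tgt_lang]]
```

Because the decoder's first token is the only thing that differs between a Hindi and a Telugu request, switching the output language amounts to switching the value returned by `decoder_prompt`.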
## Model Config
```py
config = {
'dim': 128,
'n_heads': 4,
'attn_dropout': 0.1,
'mlp_dropout': 0.1,
'depth': 8,
'vocab_size': 30000,
'max_len': 128
}
```
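Two quantities follow directly from this config and are worth checking by hand. The sketch below only does arithmetic on the values above; the variable names `head_dim` and `token_embedding_params` are illustrative, and whether the model uses one embedding table or several is not stated here.

```python
config = {
    'dim': 128,
    'n_heads': 4,
    'attn_dropout': 0.1,
    'mlp_dropout': 0.1,
    'depth': 8,
    'vocab_size': 30000,
    'max_len': 128
}

# Multi-head attention splits the model dimension evenly across heads.
assert config['dim'] % config['n_heads'] == 0
head_dim = config['dim'] // config['n_heads']

# A single token-embedding table holds vocab_size x dim weights.
token_embedding_params = config['vocab_size'] * config['dim']

print(head_dim)                # 32
print(token_embedding_params)  # 3840000
```

If the 7M figure counts one such table, the embeddings alone would make up a little over half of the model, which is typical for small models with a 30k vocabulary.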
## Inference Results
```
python inference.py --text 'how are you?' -l hi -s
>>> आप कैसे हैं?
python inference.py --text 'please call me' -l hi
>>> कृपया मुझे पुकारो
python inference.py --text 'what are you doing?' -l te -s -t 0.5
>>> మీరు ఏం చేస్తున్నారు?
python inference.py --text "what's wrong?" -l te -s
>>> ఏమి తప్పు?
```
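The `-s` and `-t` flags are not documented in this README. If they stand for sampling and temperature (an assumption), the decoding step they control could look roughly like the sketch below. The function names are hypothetical and this is not the project's `inference.py`.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_next_token(logits, sample=False, temperature=1.0, rng=random):
    """Return the index of the next token, greedily or by sampling."""
    if not sample:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]
```

Under this reading, a call without `-s` would be deterministic, while `-s -t 0.5` would sample from a distribution that favors the top candidates more strongly than the default temperature of 1.0.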
> The results are kinda hilarious, but at least it works.
> If you really want good-quality multilingual Indic translation, here's the SOTA model: [ai4bharat/indictrans2-indic-en-1B](https://huggingface.co/ai4bharat/indictrans2-indic-en-1B). It's even used officially by the Govt. of India.
```
I have refrained my feet from every evil way,
That I might keep thy word.
Psalm 119:101
```