# Transformer-Based English-Urdu Neural Machine Translation

A comprehensive approach to English-to-Urdu neural machine translation (NMT) using transformer-based architectures.

Repository: https://github.com/tahirzia-1/transformer-based-english-urdu-neural-machine-translation
- **Note: Videos could not be uploaded to the repo because their file sizes exceed GitHub's 100 MB limit.**

## 📂 Repository Structure
- **dataset**: Contains parallel corpora for training, validation, and testing.
- **code/2021465_2021758.ipynb**: Notebook implementing a custom Transformer model from scratch.
- **code/huggingface_Model.ipynb**: Notebook fine-tuning the pretrained MarianMT model.

---
## 🎯 Objectives
- Develop a custom Transformer model for English-to-Urdu translation.
- Fine-tune a pretrained MarianMT model for the same task.
- Evaluate and compare both models using BLEU and METEOR metrics.

---
## 🧠 Methodology
### Data Preprocessing
- Lowercasing text.
- Removing punctuation and extra whitespace.
- Tokenization.
- Applying Byte Pair Encoding (BPE).
- Padding/truncating sequences to a fixed length.
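A minimal sketch of the steps above, assuming the Hugging Face `tokenizers` library; the file path, vocabulary size, and sequence length are illustrative assumptions, not the notebooks' exact settings:

```python
import re
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

def clean(text: str) -> str:
    """Lowercase, strip punctuation, and collapse extra whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())  # \w keeps Urdu letters; punctuation is dropped
    return " ".join(text.split())

# Train a BPE tokenizer on the cleaned training corpus (path is hypothetical).
tokenizer = Tokenizer(BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=32000, special_tokens=["<unk>", "<pad>", "<s>", "</s>"])
with open("dataset/train.en", encoding="utf-8") as f:
    tokenizer.train_from_iterator((clean(line) for line in f), trainer)

# Pad/truncate every sequence to a fixed length (128 here); pad_id=1 matches
# the position of "<pad>" in the special-tokens list above.
tokenizer.enable_padding(pad_id=1, pad_token="<pad>", length=128)
tokenizer.enable_truncation(max_length=128)

ids = tokenizer.encode(clean("Hello, how are you?")).ids  # fixed-length list of token ids
```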
### Model Architectures
- **Custom Transformer**:
  - Implemented based on Vaswani et al.'s architecture.
  - Features an encoder-decoder structure, multi-head attention, and positional encoding.
  - Hyperparameters: 6 layers, 512 hidden units, 8 attention heads.
- **Pretrained MarianMT**:
  - Utilizes Hugging Face's MarianMTModel.
  - Fine-tuned on the provided English-Urdu dataset.
  - Employs MarianTokenizer for preprocessing.
### Training Details
- **Custom Transformer**:
  - Optimizer: Adam with a learning-rate scheduler.
  - Loss function: cross-entropy with label smoothing.
  - Batch size: 64.
  - Epochs: 20.
  - Early stopping and gradient clipping applied.
- **MarianMT Model**:
  - Optimizer: AdamW.
  - Batch size: 32.
  - Epochs: 10.
  - Early stopping based on validation loss.
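A condensed sketch of the custom model's training loop under these settings; `model` (the Transformer wrapped so it maps token ids to vocabulary logits), `train_loader` (batch size 64), `val_loader`, `evaluate`, and `pad_id` are hypothetical, and the learning rates and scheduler are assumptions:

```python
import torch
import torch.nn as nn

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1, ignore_index=pad_id)

best_val, bad_epochs, patience = float("inf"), 0, 3
for epoch in range(20):  # 20 epochs
    model.train()
    for src, tgt in train_loader:
        logits = model(src, tgt[:, :-1])  # teacher forcing: predict the next target token
        loss = criterion(logits.reshape(-1, logits.size(-1)), tgt[:, 1:].reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping
        optimizer.step()

    val_loss = evaluate(model, val_loader)  # hypothetical validation helper
    scheduler.step(val_loss)
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # early stopping

# MarianMT fine-tuning differs mainly in the optimizer and budget:
# torch.optim.AdamW(marian_model.parameters(), lr=2e-5), batch size 32,
# up to 10 epochs, with the same early stopping on validation loss.
```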
---
## 📊 Evaluation Metrics
- **BLEU (Bilingual Evaluation Understudy)**:
  - Measures n-gram precision between machine and reference translations.
  - Scores range from 0 to 100; higher scores indicate better translations.
- **METEOR (Metric for Evaluation of Translation with Explicit ORdering)**:
  - Considers synonymy and stemming for a more flexible evaluation.
  - Scores range from 0 to 1; higher scores indicate better translations.
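One way to compute both metrics, assuming `sacrebleu` for corpus-level BLEU and NLTK's `meteor_score`; the notebooks may use different implementations, and METEOR's WordNet synonym matching has limited coverage for Urdu:

```python
import nltk
import sacrebleu
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet")  # METEOR's synonym matching needs WordNet

hypotheses = ["یہ ایک مثال ہے"]  # model outputs (single example sentence)
references = ["یہ ایک مثال ہے"]  # gold translations, one per hypothesis

# Corpus BLEU on the 0-100 scale; sacrebleu takes a list of reference sets.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}")

# Sentence-level METEOR (0-1 scale) averaged over the test set.
meteor = sum(
    meteor_score([ref.split()], hyp.split())
    for ref, hyp in zip(references, hypotheses)
) / len(hypotheses)
print(f"METEOR: {meteor:.3f}")
```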
---
## 📈 Results
| Model               | BLEU Score | METEOR Score (×100) |
|---------------------|------------|---------------------|
| Custom Transformer  | 18.5       | 21.3                |
| Pretrained MarianMT | 26.7       | 29.8                |

The pretrained MarianMT model outperformed the custom Transformer on both BLEU and METEOR, highlighting the advantages of transfer learning in low-resource language translation tasks.
---
## 🚀 How to Run
1. **Clone the Repository**:
   ```bash
   git clone https://github.com/TahirZia-1/Transformer-Based-English-Urdu-Neural-Machine-Translation.git
   ```