https://github.com/rasyosef/text-embedding-models-training

Notebooks to train and evaluate Amharic Text Embedding Models based on BERT and RoBERTa for Passage Retrieval

bert embedding-models embeddings huggingface model-training roberta sentence-transformers transformers



Host: GitHub
URL: https://github.com/rasyosef/text-embedding-models-training
Owner: rasyosef
Created: 2025-02-17T15:47:32.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-02-17T16:06:01.000Z (over 1 year ago)
Last Synced: 2025-09-14T09:02:19.500Z (10 months ago)
Topics: bert, embedding-models, embeddings, huggingface, model-training, roberta, sentence-transformers, transformers
Language: Jupyter Notebook
Homepage:
Size: 71.3 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md


# Training Amharic Text Embedding Models for Passage Retrieval

This repo contains code for training Amharic text embedding models based on three Amharic encoder base models:

- [RoBERTa Base Amharic](https://huggingface.co/rasyosef/roberta-base-amharic)

- [RoBERTa Medium Amharic](https://huggingface.co/rasyosef/roberta-medium-amharic)

- [BERT Medium Amharic](https://huggingface.co/rasyosef/bert-medium-amharic)

We also evaluate the highest-ranking multilingual embedding models from the [Massive Text Embedding Benchmark (MTEB) Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) on the same test set as our embedding models.

# Results

Our largest embedding model, RoBERTa-Base-Amharic-Embed, beats all of the multilingual embedding models on MRR@10, NDCG@10, and Recall@10/50/100, while having roughly one-fifth the parameter count of the strongest of them (110M vs. 568M for snowflake-arctic-embed-l-v2.0).

| Model | Params | MRR@10 | NDCG@10 | Recall@10 | Recall@50 | Recall@100 |
|-------|--------|--------|---------|-----------|-----------|------------|
| gte-modernbert-base | 149M | 0.019 | 0.022 | 0.030 | 0.054 | 0.065 |
| gte-multilingual-base | 305M | 0.649 | 0.684 | 0.794 | 0.876 | 0.904 |
| multilingual-e5-large-instruct | 560M | 0.713 | 0.747 | 0.853 | 0.924 | 0.946 |
| snowflake-arctic-embed-l-v2.0 | 568M | 0.719 | 0.755 | 0.868 | 0.941 | 0.957 |
| BERT-Medium-Amharic-Embed | 40M | 0.657 | 0.696 | 0.817 | 0.916 | 0.945 |
| RoBERTa-Medium-Amharic-Embed | 42M | 0.707 | 0.744 | 0.861 | 0.941 | 0.963 |
| RoBERTa-Base-Amharic-Embed | 110M | **0.755** | **0.790** | **0.897** | **0.957** | **0.971** |
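
For readers unfamiliar with the metrics in the table, the following is a minimal sketch of how MRR@k, Recall@k, and NDCG@k are commonly computed for a single query. It is not taken from the notebooks: the function names, the use of passage IDs, and the binary-relevance assumption (each passage is either relevant or not) are illustrative assumptions. Reported scores would typically be the average of these per-query values over the test set.

```python
import math
from typing import Sequence, Set


def mrr_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 10) -> float:
    """Reciprocal rank of the first relevant passage within the top k, else 0."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant passages that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for pid in ranked_ids[:k] if pid in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 10) -> float:
    """NDCG with binary relevance (an assumption): DCG divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(pos + 2)  # pos is 0-based, so rank 1 gets 1 / log2(2)
        for pos, pid in enumerate(ranked_ids[:k])
        if pid in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

With a hypothetical ranking `["p3", "p1", "p7"]` and a single relevant passage `"p1"`, the relevant passage sits at rank 2, so `mrr_at_k` gives 0.5 and `recall_at_k` at k=10 gives 1.0.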

# Code

The training and eval code can be found in the `notebooks` folder.

# Embedding Models

You can download and use our Amharic text embedding models from Hugging Face. They are fully compatible with the Sentence Transformers library:

 - RoBERTa-Base-Amharic-Embed: https://huggingface.co/rasyosef/roberta-amharic-text-embedding-base

 - RoBERTa-Medium-Amharic-Embed: https://huggingface.co/rasyosef/roberta-amharic-text-embedding-medium

 - BERT-Medium-Amharic-Embed: https://huggingface.co/rasyosef/bert-amharic-text-embedding-medium
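
As a rough illustration of that compatibility, the sketch below loads one of the models with Sentence Transformers and ranks passages by cosine similarity to a query. The helper names `rank_by_score` and `retrieve` are hypothetical and not part of this repo, and the sketch assumes the model needs no special query or passage prefixes; check the model card before relying on that.

```python
from typing import List, Sequence, Tuple

# Model ID taken from the list above; the other two models can be swapped in.
MODEL_ID = "rasyosef/roberta-amharic-text-embedding-base"


def rank_by_score(scores: Sequence[float]) -> List[int]:
    """Return passage indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def retrieve(
    query: str, passages: List[str], model_id: str = MODEL_ID
) -> List[Tuple[str, float]]:
    """Embed the query and passages, then sort passages by cosine similarity.

    Hypothetical helper: assumes plain text input with no prompt prefixes.
    """
    # Imported lazily so the ranking helper above works without the library.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer(model_id)  # downloads the model on first use
    query_emb = model.encode([query], convert_to_tensor=True)
    passage_embs = model.encode(passages, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, passage_embs)[0].tolist()
    return [(passages[i], scores[i]) for i in rank_by_score(scores)]
```

Calling `retrieve(query, passages)` with an Amharic query string and a list of candidate passages returns `(passage, score)` pairs, best match first. For a large corpus it would be more practical to load the model once and cache the passage embeddings rather than re-encoding on every call.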
