https://github.com/rasyosef/text-embedding-models-training
Notebooks to train and evaluate Amharic Text Embedding Models based on BERT and RoBERTa for Passage Retrieval
- Host: GitHub
- URL: https://github.com/rasyosef/text-embedding-models-training
- Owner: rasyosef
- Created: 2025-02-17T15:47:32.000Z
- Default Branch: main
- Last Pushed: 2025-02-17T16:06:01.000Z
- Last Synced: 2025-09-14T09:02:19.500Z
- Topics: bert, embedding-models, embeddings, huggingface, model-training, roberta, sentence-transformers, transformers
- Language: Jupyter Notebook
- Size: 71.3 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
# Training Amharic Text Embedding Models for Passage Retrieval
This repo contains code for training Amharic text embedding models based on three Amharic encoder base models:
- [RoBERTa Base Amharic](https://huggingface.co/rasyosef/roberta-base-amharic)
- [RoBERTa Medium Amharic](https://huggingface.co/rasyosef/roberta-medium-amharic)
- [BERT Medium Amharic](https://huggingface.co/rasyosef/bert-medium-amharic)
We also evaluate the highest-ranking multilingual embedding models from the [Massive Text Embedding Benchmark (MTEB) Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) on the same test set as our embedding models.
# Results
Our largest embedding model, RoBERTa-Base-Amharic-Embed, outperforms every multilingual embedding model we evaluated on MRR@10, NDCG@10, and Recall@10/50/100. It does so with 110M parameters, roughly one fifth the size of the two strongest multilingual models (560M and 568M).
|Model | Params | MRR@10 | NDCG@10 | Recall@10 | Recall@50 | Recall@100 |
|------|--------|--------|---------|-----------|-----------|------------|
|gte-modernbert-base | 149M | 0.019 | 0.022 | 0.030 | 0.054 | 0.065 |
|gte-multilingual-base | 305M | 0.649 | 0.684 | 0.794 | 0.876 | 0.904 |
|multilingual-e5-large-instruct | 560M | 0.713 | 0.747 | 0.853 | 0.924 | 0.946 |
|snowflake-arctic-embed-l-v2.0 | 568M | 0.719 | 0.755 | 0.868 | 0.941 | 0.957 |
|BERT-Medium-Amharic-Embed | 40M | 0.657 | 0.696 | 0.817 | 0.916 | 0.945 |
|RoBERTa-Medium-Amharic-Embed | 42M | 0.707 | 0.744 | 0.861 | 0.941 | 0.963 |
|RoBERTa-Base-Amharic-Embed | 110M | **0.755** | **0.790** | **0.897** | **0.957** | **0.971** |
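
For readers unfamiliar with the metrics in the table, the sketch below shows one common way to compute MRR@k, NDCG@k, and Recall@k with binary relevance. It is an illustration only, not the evaluation code from the notebooks: the function names, the `rankings`/`qrels` dictionaries, and the toy passage IDs are all hypothetical, and the notebooks may compute these scores differently (for example, through a library evaluator).

```python
import math
from typing import Dict, List, Set


def reciprocal_rank_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int = 10) -> float:
    """1 / rank of the first relevant passage in the top k, or 0 if none appears."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int = 10) -> float:
    """NDCG with binary gains: each relevant hit at rank r contributes 1 / log2(r + 1)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, pid in enumerate(ranked_ids[:k], start=1)
        if pid in relevant_ids
    )
    # Ideal ranking places all relevant passages (up to k) at the top.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def recall_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant passages that appear in the top k."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


def evaluate(rankings: Dict[str, List[str]], qrels: Dict[str, Set[str]]) -> Dict[str, float]:
    """Average each metric over all queries that have relevance judgments."""
    queries = list(qrels)
    n = len(queries)
    return {
        "MRR@10": sum(reciprocal_rank_at_k(rankings[q], qrels[q], 10) for q in queries) / n,
        "NDCG@10": sum(ndcg_at_k(rankings[q], qrels[q], 10) for q in queries) / n,
        "Recall@10": sum(recall_at_k(rankings[q], qrels[q], 10) for q in queries) / n,
    }


# Toy data (hypothetical IDs): the relevant passage is ranked 2nd for q1 and 3rd for q2.
rankings = {"q1": ["p3", "p1", "p7"], "q2": ["p2", "p9", "p4"]}
qrels = {"q1": {"p1"}, "q2": {"p4"}}
scores = evaluate(rankings, qrels)
print(scores)
```

With one relevant passage per query, as in the toy data, MRR@10 rewards placing that passage near the top, while Recall@k only checks whether it appears anywhere in the top k.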
# Code
The training and eval code can be found in the `notebooks` folder.
# Embedding Models
You can download our Amharic text embedding models from Hugging Face. They are fully compatible with the Sentence Transformers library.
- RoBERTa-Base-Amharic-Embed: https://huggingface.co/rasyosef/roberta-amharic-text-embedding-base
- RoBERTa-Medium-Amharic-Embed: https://huggingface.co/rasyosef/roberta-amharic-text-embedding-medium
- BERT-Medium-Amharic-Embed: https://huggingface.co/rasyosef/bert-amharic-text-embedding-medium
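
As a minimal sketch of how one of these models might be used for passage retrieval with Sentence Transformers, the example below embeds a query and a list of passages and ranks the passages by cosine similarity. The model ID is taken from the link above; the helper names (`rank_by_similarity`, `retrieve`) are hypothetical, and the sketch assumes no query or passage prefix is needed, which should be checked against the model card.

```python
from typing import List, Sequence, Tuple

# Model ID taken from the Hugging Face link in this README.
MODEL_ID = "rasyosef/roberta-amharic-text-embedding-base"


def rank_by_similarity(
    query_vec: Sequence[float], passage_vecs: Sequence[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (passage index, score) pairs sorted by dot product, highest first.

    For L2-normalized vectors the dot product equals cosine similarity.
    """
    scores = [
        (i, sum(q * p for q, p in zip(query_vec, vec)))
        for i, vec in enumerate(passage_vecs)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def retrieve(query: str, passages: Sequence[str], model_id: str = MODEL_ID) -> List[Tuple[str, float]]:
    """Embed the query and passages, then return passages ordered by similarity.

    Assumption: texts are passed to the model as-is, with no task prefix.
    """
    # Imported lazily so the ranking helper above works without the library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_id)  # downloads the weights on first use
    embeddings = model.encode([query, *passages], normalize_embeddings=True)
    query_vec = embeddings[0].tolist()
    passage_vecs = embeddings[1:].tolist()
    return [(passages[i], score) for i, score in rank_by_similarity(query_vec, passage_vecs)]
```

Calling `retrieve(query, passages)` with an Amharic query string and a list of Amharic passages returns the passages with their similarity scores, best match first. Swapping `model_id` for one of the medium models trades some retrieval quality for a smaller footprint, per the results table.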