https://github.com/evilfreelancer/enbeddrus
Collection of scripts for training bert-based embedder for Russian<>English embeddings extraction
https://github.com/evilfreelancer/enbeddrus
Last synced: 10 months ago
JSON representation
Collection of scripts for training bert-based embedder for Russian<>English embeddings extraction
- Host: GitHub
- URL: https://github.com/evilfreelancer/enbeddrus
- Owner: EvilFreelancer
- License: mit
- Created: 2024-05-19T14:28:34.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-06-10T18:33:55.000Z (over 1 year ago)
- Last Synced: 2025-03-20T22:09:49.551Z (10 months ago)
- Language: Jupyter Notebook
- Size: 754 KB
- Stars: 5
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Enbedrus - ENglish and RUSsian emBEDDer
This is a BERT (uncased) [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 768 dimensional
dense vector space and can be used for tasks like clustering or semantic search.
- **Parameters**: 168 million
- **Layers**: 12
- **Hidden Size**: 768
- **Attention Heads**: 12
- **Vocabulary Size**: 119,547
- **Maximum Sequence Length**: 512 tokens
The Enbeddrus model is designed to extract similar embeddings for comparable English and Russian phrases. It is based on
the [bert-base-multilingual-uncased](https://huggingface.co/google-bert/bert-base-multilingual-cased) model and was
trained over 20 epochs on the following datasets:
- [evilfreelancer/opus-php-en-ru-cleaned](https://huggingface.co/datasets/evilfreelancer/opus-php-en-ru-cleaned) (train): 1.6k lines
- [evilfreelancer/golang-en-ru](https://huggingface.co/datasets/evilfreelancer/golang-en-ru) (train): 554 lines
- [Helsinki-NLP/opus_books](https://huggingface.co/datasets/Helsinki-NLP/opus_books/viewer/en-ru) (en-ru, train): 17.5k lines
The goal of this model is to generate identical or very similar embeddings regardless of whether the text is written in
English or Russian.
[Enbeddrus GGUF](https://ollama.com/evilfreelancer/enbeddrus) version available via Ollama.
## Envaluation test
Models tested via [encodechka](https://github.com/avidale/encodechka)
| Name | evilfreelancer/enbeddrus-v0.1 | evilfreelancer/enbeddrus-v0.1-domain | evilfreelancer/enbeddrus-v0.2 |
| --------------------- | ----------------------------- | ------------------------------------ | ----------------------------- |
| STSBTask | 0.6418501890569303 | 0.6418501890569303 | 0.6382642407246252 |
| ParaphraserTask | 0.5396186809125094 | 0.5396186809125094 | 0.5491558495250873 |
| XnliTask | 0.37045908183632736 | 0.37045908183632736 | 0.36666666666666664 |
| SentimentTask | 0.7306666666666667 | 0.7306666666666667 | 0.7246666666666667 |
| ToxicityTask | 0.8923319999999999 | 0.8923319999999999 | 0.894758 |
| InappropriatenessTask | 0.7092166782043772 | 0.7092166782043772 | 0.719323712657756 |
| IntentsTask | 0.7086 | 0.7162 | 0.7128 |
| IntentsXTask | 0.5116 | 0.46 | 0.5314 |
| FactRuTask | n/a | n/a | n/a |
| RudrTask | n/a | n/a | n/a |
| SpeedTask (cuda) | 4.313722451527913 | 4.339381853739421 | 4.251763025919597 |
| SpeedTask (cpu) | 34.0190052986145 | 34.990905125935875 | 34.441959857940674 |