{"id":16098877,"url":"https://github.com/defasium/slov2idiom","last_synced_at":"2025-07-24T14:33:23.201Z","repository":{"id":47742447,"uuid":"395267694","full_name":"Defasium/slov2idiom","owner":"Defasium","description":"Telegram bot for semantic search of russian idioms powered by BERT embeddings approximation, Python, 2021","archived":false,"fork":false,"pushed_at":"2021-10-09T17:18:14.000Z","size":5926,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-02T10:04:03.405Z","etag":null,"topics":["bert","huggingface","nlp","sentence-embeddings","telegram","telegram-bot"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Defasium.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-08-12T09:38:19.000Z","updated_at":"2023-10-01T17:09:00.000Z","dependencies_parsed_at":"2022-09-23T03:30:23.816Z","dependency_job_id":null,"html_url":"https://github.com/Defasium/slov2idiom","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Defasium/slov2idiom","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Defasium%2Fslov2idiom","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Defasium%2Fslov2idiom/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Defasium%2Fslov2idiom/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Defasium%2Fslov2idiom/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Defasium","d
ownload_url":"https://codeload.github.com/Defasium/slov2idiom/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Defasium%2Fslov2idiom/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266855804,"owners_count":23995554,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-24T02:00:09.469Z","response_time":99,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert","huggingface","nlp","sentence-embeddings","telegram","telegram-bot"],"created_at":"2024-10-09T18:24:57.510Z","updated_at":"2025-07-24T14:33:23.176Z","avatar_url":"https://github.com/Defasium.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# slov2idiom\nTelegram bot for the semantic search of Russian idioms powered by BERT's embeddings approximation, Python, 2021\nTry it yourself in tg: @rudiombot\n\nHow it works:\n\n\u003cp align='center'\u003e\u003cimg src=\"https://user-images.githubusercontent.com/47502256/129484805-1bbadfa9-8b11-4834-8d31-0509b1b20268.gif\" alt=\"bert\"  width=\"300\"/\u003e\u003c/p\u003e\n\nTODO list:\n* add support for inline mode✅\n* add dynamically changing interface✅\n* add support for random search of idioms✅\n* add history and undo button✅\n* add support for searching similar idioms by clicking on them✅\n* change language interface depending on user's region code ❌\n* calculate other 
ranking metrics (MAP, NDCG) ❌\n* add Paraphrase+ benchmark ❌\n* add support for emoji in queries ❌\n* add reranking by idiom's popularity ❌\n* add daily limits for users ❌\n* finetune LaBSE on STSb and Paraphrase+ to get better embeddings ❌\n\n\n## Model architecture\nFirst of all, we need to choose the best model for calculating semantic similarity.\nWe can consider BERT architectures. There are two popular approaches: a slow but accurate one and a production-friendly one.\nThe first is sentence-pair encoding:\n\u003cp align='center'\u003e\u003cimg src=\"https://www.researchgate.net/publication/334783045/figure/fig2/AS:786570592391170@1564544451786/BERT-sentence-pair-encoding.ppm\" alt=\"bert\"  width=\"300\"/\u003e\u003c/p\u003e\n\nHere the two sentences are encoded simultaneously. This means that to find similar sentences we have to run the model on every new query paired with each entry in our dataset. This is a huge drawback of this architecture.\n\nThe second is based on Siamese networks and metric learning:\n\n\u003cp align='center'\u003e\u003cimg src=\"https://miro.medium.com/max/808/1*GNhALCfeEGz5JaXWjc106w.png\" alt=\"sbert\" align=\"center\" width=\"300\"/\u003e\u003c/p\u003e\n\nSuch an approach implies that each sentence or document has its own embedding in a learned metric space. The resulting embeddings can be precomputed and stored.\nTo calculate the semantic similarity between two sentences we can simply compute the cosine distance between their embeddings. In this setting the model's parameters are shared between the two branches during training.\n\nSo the second approach is preferable.\n\n## STSb comparison\nWe tested several pre-trained Russian BERT models on the **STSb benchmark**. 
We calculate cosine similarities between normalized embeddings (taken from the class token (CLS) or by averaging the encoded tokens (MEAN)):\n\n|MODEL|PARAMS|EMBEDDING SIZE|POOLING TYPE|TRAIN SPEARMAN CORR|TEST SPEARMAN CORR|\n|:---:|:---:|:---:|:---:|:---:|:---:|\n|[cointegrated/rubert-tiny](https://huggingface.co/cointegrated/rubert-tiny)|12M|312|CLS|0.48472829|0.49825618|\n|[cointegrated/rubert-tiny](https://huggingface.co/cointegrated/rubert-tiny)|12M|312|MEAN|0.57088664|0.5875781|\n|[sberbank-ai/sbert_large_nlu_ru](https://huggingface.co/sberbank-ai/sbert_large_nlu_ru)|335M|1024|MEAN|0.5677551|0.5845757|\n|[sberbank-ai/ruRoberta-large](https://huggingface.co/sberbank-ai/ruRoberta-large)|355M|1024|MEAN|0.5847077|0.5958275|\n|[DeepPavlov/rubert-base-cased-sentence](https://huggingface.co/DeepPavlov/rubert-base-cased-sentence)|180M|768|CLS|0.6538959|0.66192624|\n|[DeepPavlov/rubert-base-cased-sentence](https://huggingface.co/DeepPavlov/rubert-base-cased-sentence)|180M|768|MEAN|0.6617157|0.6686508|\n|[sentence-transformers/LaBSE](https://huggingface.co/sentence-transformers/LaBSE)|470M|768|CLS|0.743942|0.7541933|\n|[sentence-transformers/LaBSE](https://huggingface.co/sentence-transformers/LaBSE)|470M|768|MEAN|0.733942|0.7364084|\n|[cointegrated/LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru)|127M|768|CLS|0.733942|0.7364201|\n|[cointegrated/LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru)|127M|768|MEAN| **0.754444** | **0.754402** |\n\n\nIn the ideal case, we would like to use a model with as few parameters as possible and high accuracy. However, even if we finetune **rubert-tiny** on the STSb dataset, we would get at best only a 66% Spearman rank correlation.\n\n## Feature selection\nAt the same time, we would like to use smaller embeddings because of the curse of dimensionality when searching among our embeddings.\nThe solution is to find a \"good\" subset of features among these hundreds of dimensions that carries most of the information. 
Random search or, even worse, grid search would take forever to find such a subset of features. A better method is embedded feature selection via Lasso (L1-regularized regression). The algorithm for finding a good subset is the following:\n* calculate BERT MEAN embeddings **E1** and **E2** on the train and test parts of STSb for sentence1 and sentence2 separately\n* calculate the element-wise product **E1**x**E2** (one step before the summation that gives cosine similarity) and the element-wise L1 difference abs(**E1**-**E2**)\n* use the resulting numbers as features (so 768+768 = 1536) for Lasso regression as X with intercept, and the target (similarity score) as y\n* find the optimal C value via cross-validation on the train set\n* argsort the weights of the first 768 parameters of the fitted regression model\n* find the best subset of size N on the train set by selecting the top-N features with the largest signed weights and calculating cosine distances between the reduced embeddings **E1p** and **E2p**\n\nHere is the dependency between the Spearman correlation on the train set and the top-N features of **cointegrated/LaBSE-en-ru** embeddings:\n\n\u003cp align='center'\u003e\u003cimg src='https://user-images.githubusercontent.com/47502256/129255306-fd38fd85-32ae-4955-a14e-c04d1067ad3b.png' alt=\"feature selection\" width=\"400\"/\u003e\u003c/p\u003e\n\n**You can find the full example here:**\n\nGoogle Colab: [URL1](https://colab.research.google.com/drive/1e0hZhAHy408VZAqTDKT2Y0_eHB41sYA6?usp=sharing) \n\nGithub Gist: [URL2](https://gist.github.com/Defasium/e28dc5b1dfde8eab1dd5aa45fd7bb208)\n\nAs you can see from the figure above, we can achieve a **2.5%** boost in performance simply by using a subset of 125 features.\n\n## BERT's embeddings approximation\nThe smallest model (rubert-tiny) runs in around 6-10 ms on the CPU. LaBSE takes 80-100 ms. Instead of calculating embeddings, we can approximate them with a classical ML approach: TF-IDF. 
\nTo further reduce the gap between the two, we can use a modern tokenization algorithm like [**WordPiece**](https://paperswithcode.com/method/wordpiece). \nThe pros of this algorithm:\n * better handles OOV (out-of-vocabulary) cases and is the default tokenizer in BERT architectures.\n * smaller vocab size =\u003e smaller TF-IDF vectors\n * may be faster than lemmatization and doesn't rely on memory-consuming dictionaries\n\nFinally, we can include a special \u003c UNK \u003e token in the vocab; otherwise, we will lose information\n\nWe can use simple linear regression as a universal approximator trained to predict a BERT embedding from a given TF-IDF vector.\nTo achieve better results, instead of fitting directly to the highly correlated embedding dimensions, we can predict the PCA components and then transform them back to the original embedding space.\nSo the final architecture can be illustrated as follows:\n\n\u003cp align='center'\u003e\u003cimg src=\"https://user-images.githubusercontent.com/47502256/129262858-a8c38af6-0dc5-4e41-a0d0-131ad76d1808.png\" alt=\"architecture\"  width=\"500\"/\u003e\u003c/p\u003e\n\nGrey-colored nodes indicate that the PCA weights are not updated during training.\n\nWe can also 'contaminate' our training data by purposely dropping some tokens at random or replacing them with the unknown token '\u003c UNK \u003e'. 
This approach gives the fitted model better results than the LaBSE embeddings themselves:\n  \n|MODEL|Params|P@1|P@3|P@5|P@10|Speed|\n|:---:|:---:|:---:|:---:|:---:|:---:|:---:|\n|ApproximateBERT|1M|75.47|84.82|86.88|88.35|125 us*|\n|[cointegrated/LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru)|127M|73.92|81.16|82.62|83.80|80 ms|\n\nThis speedup was achieved by reimplementing the TF-IDF transformation and using NumPy array multiplications in float64.\n\n## Search engine\nTo find similar idioms we use the [**annoy**](https://github.com/spotify/annoy) library on the 125-feature subset of the LaBSE embeddings.\nOn an Intel Xeon 2.3 GHz, search together with embedding calculation takes around 400 us.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdefasium%2Fslov2idiom","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdefasium%2Fslov2idiom","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdefasium%2Fslov2idiom/lists"}