{"id":13603155,"url":"https://github.com/netease-youdao/BCEmbedding","last_synced_at":"2025-04-11T13:33:05.310Z","repository":{"id":215082922,"uuid":"737989734","full_name":"netease-youdao/BCEmbedding","owner":"netease-youdao","description":"Netease Youdao's open-source embedding and reranker models for RAG products.","archived":false,"fork":false,"pushed_at":"2025-02-05T15:07:11.000Z","size":67938,"stargazers_count":1703,"open_issues_count":31,"forks_count":115,"subscribers_count":9,"default_branch":"master","last_synced_at":"2025-04-10T16:42:07.661Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/netease-youdao.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-01-02T06:34:24.000Z","updated_at":"2025-04-10T09:28:36.000Z","dependencies_parsed_at":"2025-03-13T07:01:06.493Z","dependency_job_id":"8e09be58-1f39-4ee8-8f0f-9b1e6580e1ee","html_url":"https://github.com/netease-youdao/BCEmbedding","commit_stats":{"total_commits":71,"total_committers":6,"mean_commits":"11.833333333333334","dds":"0.12676056338028174","last_synced_commit":"c6327ccc854d65e8d4eb2edac74dbb6eb67733ec"},"previous_names":["netease-youdao/bcembedding"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/netease-youdao%2FBCEmbedding","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/netease-youdao%2FBCEmbedding/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/
GitHub/repositories/netease-youdao%2FBCEmbedding/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/netease-youdao%2FBCEmbedding/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/netease-youdao","download_url":"https://codeload.github.com/netease-youdao/BCEmbedding/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248409841,"owners_count":21098771,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T18:01:54.224Z","updated_at":"2025-04-11T13:33:05.297Z","avatar_url":"https://github.com/netease-youdao.png","language":"Python","funding_links":[],"categories":["Python","文本匹配 文本检索 文本相似度","NLP"],"sub_categories":["其他_文本生成、文本对话","3. 
Pretraining"],"readme":"\u003c!--\n * @Description: \n * @Author: shenlei\n * @Modified: linhui\n * @Date: 2023-12-19 10:31:41\n * @LastEditTime: 2024-05-13 17:05:35\n * @LastEditors: shenlei\n--\u003e\n\n\u003ch1 align=\"center\"\u003eBCEmbedding: Bilingual and Crosslingual Embedding for RAG\u003c/h1\u003e\n\n\u003cdiv align=\"center\"\u003e\n    \u003ca href=\"./LICENSE\"\u003e\n      \u003cimg src=\"https://img.shields.io/badge/license-Apache--2.0-yellow\"\u003e\n    \u003c/a\u003e\n        \n    \u003ca href=\"https://twitter.com/YDopensource\"\u003e\n      \u003cimg src=\"https://img.shields.io/badge/follow-%40YDOpenSource-1DA1F2?logo=twitter\u0026style={style}\"\u003e\n    \u003c/a\u003e\n        \n\u003c/div\u003e\n\u003cbr\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cstrong style=\"background-color: green;\"\u003eEnglish\u003c/strong\u003e\n  |\n  \u003ca href=\"./README_zh.md\" target=\"_Self\"\u003e简体中文\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cdetails open=\"open\"\u003e\n\u003csummary\u003eClick to Open Contents\u003c/summary\u003e\n\n- \u003ca href=\"#-bilingual-and-crosslingual-superiority\" target=\"_Self\"\u003e🌐 Bilingual and Crosslingual Superiority\u003c/a\u003e\n- \u003ca href=\"#-key-features\" target=\"_Self\"\u003e💡 Key Features\u003c/a\u003e\n- \u003ca href=\"#-latest-updates\" target=\"_Self\"\u003e🚀 Latest Updates\u003c/a\u003e\n- \u003ca href=\"#-model-list\" target=\"_Self\"\u003e🍎 Model List\u003c/a\u003e\n- \u003ca href=\"#-manual\" target=\"_Self\"\u003e📖 Manual\u003c/a\u003e\n  - \u003ca href=\"#installation\" target=\"_Self\"\u003eInstallation\u003c/a\u003e\n  - \u003ca href=\"#quick-start\" target=\"_Self\"\u003eQuick Start (`transformers`, `sentence-transformers`)\u003c/a\u003e\n  - \u003ca href=\"#embedding-and-reranker-integrations-for-rag-frameworks\" target=\"_Self\"\u003eEmbedding and Reranker Integrations for RAG Frameworks (`langchain`, `llama_index`)\u003c/a\u003e\n- \u003ca href=\"#%EF%B8%8F-evaluation\" 
target=\"_Self\"\u003e⚙️ Evaluation\u003c/a\u003e\n  - \u003ca href=\"#evaluate-semantic-representation-by-mteb\" target=\"_Self\"\u003eEvaluate Semantic Representation by MTEB\u003c/a\u003e\n  - \u003ca href=\"#evaluate-rag-by-llamaindex\" target=\"_Self\"\u003eEvaluate RAG by LlamaIndex\u003c/a\u003e\n- \u003ca href=\"#-leaderboard\" target=\"_Self\"\u003e📈 Leaderboard\u003c/a\u003e\n  - \u003ca href=\"#semantic-representation-evaluations-in-mteb\" target=\"_Self\"\u003eSemantic Representation Evaluations in MTEB\u003c/a\u003e\n  - \u003ca href=\"#rag-evaluations-in-llamaindex\" target=\"_Self\"\u003eRAG Evaluations in LlamaIndex\u003c/a\u003e\n- \u003ca href=\"#-youdaos-bcembedding-api\" target=\"_Self\"\u003e🛠 Youdao's BCEmbedding API\u003c/a\u003e\n- \u003ca href=\"#-wechat-group\" target=\"_Self\"\u003e🧲 WeChat Group\u003c/a\u003e\n- \u003ca href=\"#%EF%B8%8F-citation\" target=\"_Self\"\u003e✏️ Citation\u003c/a\u003e\n- \u003ca href=\"#-license\" target=\"_Self\"\u003e🔐 License\u003c/a\u003e\n- \u003ca href=\"#-related-links\" target=\"_Self\"\u003e🔗 Related Links\u003c/a\u003e\n\n\u003c/details\u003e\n\u003cbr\u003e\n\n**B**ilingual and **C**rosslingual **Embedding** (`BCEmbedding`) in English and Chinese, developed by NetEase Youdao, encompasses `EmbeddingModel` and `RerankerModel`. 
The `EmbeddingModel` specializes in generating semantic vectors, playing a crucial role in semantic search and question-answering, and the `RerankerModel` excels at refining search results and ranking tasks.\n\n`BCEmbedding` serves as the cornerstone of Youdao's Retrieval Augmented Generation (RAG) implementation, notably [QAnything](http://qanything.ai) [[github](https://github.com/netease-youdao/qanything)], an open-source implementation widely integrated in various Youdao products like [Youdao Speed Reading](https://read.youdao.com/#/home) and [Youdao Translation](https://fanyi.youdao.com/download-Mac?keyfrom=fanyiweb_navigation).\n\nDistinguished for its bilingual and crosslingual proficiency, `BCEmbedding` excels in bridging Chinese and English linguistic gaps, achieving:\n\n- **A high performance on \u003ca href=\"#semantic-representation-evaluations-in-mteb\" target=\"_Self\"\u003eSemantic Representation Evaluations in MTEB\u003c/a\u003e**;\n- **A new benchmark in the realm of \u003ca href=\"#rag-evaluations-in-llamaindex\" target=\"_Self\"\u003eRAG Evaluations in LlamaIndex\u003c/a\u003e**.\n\n\u003cimg src=\"./Docs/assets/rag_eval_multiple_domains_summary.jpg\"\u003e\n\n### Our Goals\n\nProvide a bilingual and crosslingual two-stage retrieval model repository for the RAG community, which can be used directly without finetuning, including `EmbeddingModel` and `RerankerModel`:\n\n- One Model: `EmbeddingModel` handles **bilingual and crosslingual** retrieval tasks in English and Chinese. `RerankerModel` supports **English, Chinese, Japanese and Korean**.\n- One Model: **Cover common business application scenarios with RAG optimization**. e.g. 
Education, Medical Scenario, Law, Finance, Literature, FAQ, Textbook, Wikipedia, General Conversation.\n- Easy to Integrate: We provide an **API** in `BCEmbedding` for LlamaIndex and LangChain integrations.\n- Other Points:\n  - `RerankerModel` supports **long passages (more than 512 tokens, less than 32k tokens) reranking**.\n  - `RerankerModel` provides a **meaningful relevance score** that helps remove low-quality passages.\n  - `EmbeddingModel` **does not need specific instructions**.\n\n### Third-party Examples\n\n- RAG applications: [QAnything](https://github.com/netease-youdao/qanything), [HuixiangDou](https://github.com/InternLM/HuixiangDou), [ChatPDF](https://github.com/shibing624/ChatPDF).\n- Efficient inference: [ChatLLM.cpp](https://github.com/foldl/chatllm.cpp), [Xinference](https://github.com/xorbitsai/inference), [mindnlp (Huawei GPU)](https://github.com/mindspore-lab/mindnlp/tree/master/llm/inference/bce).\n\n## 🌐 Bilingual and Crosslingual Superiority\n\nExisting embedding models often encounter performance challenges in bilingual and crosslingual scenarios, particularly in Chinese, English and their crosslingual tasks. `BCEmbedding`, leveraging the strength of Youdao's translation engine, excels in delivering superior performance across monolingual, bilingual, and crosslingual settings.\n\n`EmbeddingModel` supports ***Chinese (ch) and English (en)*** (support for more languages is coming soon), while `RerankerModel` supports ***Chinese (ch), English (en), Japanese (ja) and Korean (ko)***.\n\n## 💡 Key Features\n\n- **Bilingual and Crosslingual Proficiency**: Powered by Youdao's translation engine, excelling in Chinese, English and their crosslingual retrieval tasks, with upcoming support for additional languages.\n- **RAG-Optimized**: Tailored for diverse RAG tasks including **translation, summarization, and question answering**, ensuring accurate **query understanding**. 
See \u003ca href=\"#rag-evaluations-in-llamaindex\" target=\"_Self\"\u003eRAG Evaluations in LlamaIndex\u003c/a\u003e.\n- **Efficient and Precise Retrieval**: Dual-encoder for efficient retrieval of `EmbeddingModel` in first stage, and cross-encoder of `RerankerModel` for enhanced precision and deeper semantic analysis in second stage.\n- **Broad Domain Adaptability**: Trained on diverse datasets for superior performance across various fields.\n- **User-Friendly Design**: Instruction-free, versatile use for multiple tasks without specifying query instruction for each task.\n- **Meaningful Reranking Scores**: `RerankerModel` provides relevant scores to improve result quality and optimize large language model performance.\n- **Proven in Production**: Successfully implemented and validated in Youdao's products.\n\n## 🚀 Latest Updates\n\n- ***2024-02-04***: **Technical Blog** - See \u003ca href=\"https://zhuanlan.zhihu.com/p/681370855\"\u003e为RAG而生-BCEmbedding技术报告\u003c/a\u003e.\n- ***2024-01-16***: **LangChain and LlamaIndex Integrations** - See \u003ca href=\"#embedding-and-reranker-integrations-for-rag-frameworks\" target=\"_Self\"\u003emore\u003c/a\u003e.\n- ***2024-01-03***: **Model Releases** - [bce-embedding-base_v1](https://huggingface.co/maidalun1020/bce-embedding-base_v1) and [bce-reranker-base_v1](https://huggingface.co/maidalun1020/bce-reranker-base_v1) are available.\n- ***2024-01-03***: **Eval Datasets** [[CrosslingualMultiDomainsDataset](https://huggingface.co/datasets/maidalun1020/CrosslingualMultiDomainsDataset)] - Evaluate the performance of RAG, using [LlamaIndex](https://github.com/run-llama/llama_index).\n- ***2024-01-03***: **Eval Datasets** [[Details](./BCEmbedding/evaluation/c_mteb/Retrieval.py)] - Evaluate the performance of crosslingual semantic representation, using [MTEB](https://github.com/embeddings-benchmark/mteb).\n\n## 🍎 Model List\n\n| Model Name            |     Model Type     |   Languages   | Parameters |                             
                                             Weights                                                                          |\n| :-------------------- | :----------------: | :------------: | :--------: | :-------------------------------------------------------------------------------------------------------------------------------------------------------: |\n| bce-embedding-base_v1 | `EmbeddingModel` |     ch, en     |    279M    | [Huggingface](https://huggingface.co/maidalun1020/bce-embedding-base_v1), [国内通道](https://hf-mirror.com/maidalun1020/bce-embedding-base_v1) |\n| bce-reranker-base_v1  | `RerankerModel` | ch, en, ja, ko |    279M    |  [Huggingface](https://huggingface.co/maidalun1020/bce-reranker-base_v1), [国内通道](https://hf-mirror.com/maidalun1020/bce-reranker-base_v1)  |\n\n## 📖 Manual\n\n### Installation\n\nFirst, create a conda environment and activate it.\n\n```bash\nconda create --name bce python=3.10 -y\nconda activate bce\n```\n\nThen install `BCEmbedding` for minimal installation (To avoid cuda version conflicting, you should install [`torch`](https://pytorch.org/get-started/previous-versions/) that is compatible to your system cuda version manually first):\n\n```bash\npip install BCEmbedding==0.1.5\n```\n\nOr install from source (**recommended**):\n\n```bash\ngit clone git@github.com:netease-youdao/BCEmbedding.git\ncd BCEmbedding\npip install -v -e .\n```\n\n### Quick Start\n\n#### 1. 
Based on `BCEmbedding`\n\nUse `EmbeddingModel`; the `cls` [pooler](./BCEmbedding/models/embedding.py#L24) is the default.\n\n```python\nfrom BCEmbedding import EmbeddingModel\n\n# list of sentences\nsentences = ['sentence_0', 'sentence_1']\n\n# init embedding model\nmodel = EmbeddingModel(model_name_or_path=\"maidalun1020/bce-embedding-base_v1\")\n\n# extract embeddings\nembeddings = model.encode(sentences)\n```\n\nUse `RerankerModel` to calculate relevance scores and rerank:\n\n```python\nfrom BCEmbedding import RerankerModel\n\n# your query and corresponding passages\nquery = 'input_query'\npassages = ['passage_0', 'passage_1']\n\n# construct sentence pairs\nsentence_pairs = [[query, passage] for passage in passages]\n\n# init reranker model\nmodel = RerankerModel(model_name_or_path=\"maidalun1020/bce-reranker-base_v1\")\n\n# method 0: calculate scores of sentence pairs\nscores = model.compute_score(sentence_pairs)\n\n# method 1: rerank passages\nrerank_results = model.rerank(query, passages)\n```\n\nNOTE:\n\n- In the [`RerankerModel.rerank`](./BCEmbedding/models/reranker.py#L137) method, we provide the advanced preprocessing that we use in production for building `sentence_pairs` when \"passages\" are very long.\n\n#### 2. 
Based on `transformers`\n\nFor `EmbeddingModel`:\n\n```python\nfrom transformers import AutoModel, AutoTokenizer\n\n# list of sentences\nsentences = ['sentence_0', 'sentence_1']\n\n# init model and tokenizer\ntokenizer = AutoTokenizer.from_pretrained('maidalun1020/bce-embedding-base_v1')\nmodel = AutoModel.from_pretrained('maidalun1020/bce-embedding-base_v1')\n\ndevice = 'cuda'  # if no GPU, set \"cpu\"\nmodel.to(device)\n\n# get inputs\ninputs = tokenizer(sentences, padding=True, truncation=True, max_length=512, return_tensors=\"pt\")\ninputs_on_device = {k: v.to(device) for k, v in inputs.items()}\n\n# get embeddings\noutputs = model(**inputs_on_device, return_dict=True)\nembeddings = outputs.last_hidden_state[:, 0]  # cls pooler\nembeddings = embeddings / embeddings.norm(dim=1, keepdim=True)  # normalize\n```\n\nFor `RerankerModel`:\n\n```python\nimport torch\nfrom transformers import AutoTokenizer, AutoModelForSequenceClassification\n\n# sentence pairs of [query, passage], built as in the `BCEmbedding` example above\nsentence_pairs = [['input_query', 'passage_0'], ['input_query', 'passage_1']]\n\n# init model and tokenizer\ntokenizer = AutoTokenizer.from_pretrained('maidalun1020/bce-reranker-base_v1')\nmodel = AutoModelForSequenceClassification.from_pretrained('maidalun1020/bce-reranker-base_v1')\n\ndevice = 'cuda'  # if no GPU, set \"cpu\"\nmodel.to(device)\n\n# get inputs\ninputs = tokenizer(sentence_pairs, padding=True, truncation=True, max_length=512, return_tensors=\"pt\")\ninputs_on_device = {k: v.to(device) for k, v in inputs.items()}\n\n# calculate scores\nscores = model(**inputs_on_device, return_dict=True).logits.view(-1,).float()\nscores = torch.sigmoid(scores)\n```\n\n#### 3. Based on `sentence_transformers`\n\nFor `EmbeddingModel`:\n\n```python\nfrom sentence_transformers import SentenceTransformer\n\n# list of sentences\nsentences = ['sentence_0', 'sentence_1']\n\n# init embedding model\n## New update for sentence-transformers. 
So clean up your \"`SENTENCE_TRANSFORMERS_HOME`/maidalun1020_bce-embedding-base_v1\" or \"~/.cache/torch/sentence_transformers/maidalun1020_bce-embedding-base_v1\" first so that the new version is downloaded.\nmodel = SentenceTransformer(\"maidalun1020/bce-embedding-base_v1\")\n\n# extract embeddings\nembeddings = model.encode(sentences, normalize_embeddings=True)\n```\n\nFor `RerankerModel`:\n\n```python\nfrom sentence_transformers import CrossEncoder\n\n# sentence pairs of [query, passage], built as in the `BCEmbedding` example above\nsentence_pairs = [['input_query', 'passage_0'], ['input_query', 'passage_1']]\n\n# init reranker model\nmodel = CrossEncoder('maidalun1020/bce-reranker-base_v1', max_length=512)\n\n# calculate scores of sentence pairs\nscores = model.predict(sentence_pairs)\n```\n\n### Embedding and Reranker Integrations for RAG Frameworks\n\n#### 1. Used in `langchain`\n\nWe provide `BCERerank` in `BCEmbedding.tools.langchain` that inherits the advanced preproc tokenization of `RerankerModel`.\n\n- Install langchain first\n```bash\npip install langchain==0.1.0\npip install langchain-community==0.0.9\npip install langchain-core==0.1.7\npip install langsmith==0.0.77\n```\n\n- Demo\n```python\n# We provide the advanced preproc tokenization for reranking.\nfrom BCEmbedding.tools.langchain import BCERerank\n\nfrom langchain.text_splitter import RecursiveCharacterTextSplitter\nfrom langchain_community.document_loaders import PyPDFLoader\nfrom langchain_community.vectorstores import FAISS\n\nfrom langchain.embeddings import HuggingFaceEmbeddings\nfrom langchain_community.vectorstores.utils import DistanceStrategy\nfrom langchain.retrievers import ContextualCompressionRetriever\n\n\n# init embedding model\nembedding_model_name = 'maidalun1020/bce-embedding-base_v1'\nembedding_model_kwargs = {'device': 'cuda:0'}\nembedding_encode_kwargs = {'batch_size': 32, 'normalize_embeddings': True, 'show_progress_bar': False}\n\nembed_model = HuggingFaceEmbeddings(\n  model_name=embedding_model_name,\n  model_kwargs=embedding_model_kwargs,\n  encode_kwargs=embedding_encode_kwargs\n)\n\nreranker_args = {'model': 
'maidalun1020/bce-reranker-base_v1', 'top_n': 5, 'device': 'cuda:1'}\nreranker = BCERerank(**reranker_args)\n\n# init documents\ndocuments = PyPDFLoader(\"BCEmbedding/tools/eval_rag/eval_pdfs/Comp_en_llama2.pdf\").load()\ntext_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=200)\ntexts = text_splitter.split_documents(documents)\n\n# example 1. retrieval with embedding and reranker\nretriever = FAISS.from_documents(texts, embed_model, distance_strategy=DistanceStrategy.MAX_INNER_PRODUCT).as_retriever(search_type=\"similarity\", search_kwargs={\"score_threshold\": 0.3, \"k\": 10})\n\ncompression_retriever = ContextualCompressionRetriever(\n    base_compressor=reranker, base_retriever=retriever\n)\n\nresponse = compression_retriever.get_relevant_documents(\"What is Llama 2?\")\n```\n\n#### 2. Used in `llama_index`\n\nWe provide `BCERerank` in `BCEmbedding.tools.llama_index` that inherits the advanced preproc tokenization of `RerankerModel`.\n\n- Install llama_index first\n\n```bash\npip install llama-index==0.9.42.post2\n```\n\n- Demo\n```python\n# We provide the advanced preproc tokenization for reranking.\nfrom BCEmbedding.tools.llama_index import BCERerank\n\nimport os\nfrom llama_index.embeddings import HuggingFaceEmbedding\nfrom llama_index import VectorStoreIndex, ServiceContext, SimpleDirectoryReader\nfrom llama_index.node_parser import SimpleNodeParser\nfrom llama_index.llms import OpenAI\nfrom llama_index.retrievers import VectorIndexRetriever\n\n# init embedding model and reranker model\nembed_args = {'model_name': 'maidalun1020/bce-embedding-base_v1', 'max_length': 512, 'embed_batch_size': 32, 'device': 'cuda:0'}\nembed_model = HuggingFaceEmbedding(**embed_args)\n\nreranker_args = {'model': 'maidalun1020/bce-reranker-base_v1', 'top_n': 5, 'device': 'cuda:1'}\nreranker_model = BCERerank(**reranker_args)\n\n# example #1. 
extract embeddings\nquery = 'apples'\npassages = [\n        'I like apples', \n        'I like oranges', \n        'Apples and oranges are fruits'\n    ]\nquery_embedding = embed_model.get_query_embedding(query)\npassages_embeddings = embed_model.get_text_embedding_batch(passages)\n\n# example #2. rag example\nllm = OpenAI(model='gpt-3.5-turbo-0613', api_key=os.environ.get('OPENAI_API_KEY'), api_base=os.environ.get('OPENAI_BASE_URL'))\nservice_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)\n\ndocuments = SimpleDirectoryReader(input_files=[\"BCEmbedding/tools/eval_rag/eval_pdfs/Comp_en_llama2.pdf\"]).load_data()\nnode_parser = SimpleNodeParser.from_defaults(chunk_size=400, chunk_overlap=80)\nnodes = node_parser.get_nodes_from_documents(documents[0:36])\nindex = VectorStoreIndex(nodes, service_context=service_context)\n\nquery = \"What is Llama 2?\"\n\n# example #2.1. retrieval with EmbeddingModel and RerankerModel\nvector_retriever = VectorIndexRetriever(index=index, similarity_top_k=10, service_context=service_context)\nretrieval_by_embedding = vector_retriever.retrieve(query)\nretrieval_by_reranker = reranker_model.postprocess_nodes(retrieval_by_embedding, query_str=query)\n\n# example #2.2. query with EmbeddingModel and RerankerModel\nquery_engine = index.as_query_engine(node_postprocessors=[reranker_model])\nquery_response = query_engine.query(query)\n```\n\n## ⚙️ Evaluation\n\n### Evaluate Semantic Representation by MTEB\n\nWe provide evaluation tools for `embedding` and `reranker` models, based on [MTEB](https://github.com/embeddings-benchmark/mteb) and [C_MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB).\n\nFirst, install `MTEB`:\n\n```\npip install mteb==1.1.1\n```\n\n#### 1. Embedding Models\n\nJust run following cmd to evaluate `your_embedding_model` (e.g. `maidalun1020/bce-embedding-base_v1`) in **bilingual and crosslingual settings** (e.g. 
`[\"en\", \"zh\", \"en-zh\", \"zh-en\"]`).\n\n```bash\npython BCEmbedding/tools/eval_mteb/eval_embedding_mteb.py --model_name_or_path maidalun1020/bce-embedding-base_v1 --pooler cls\n```\n\nThe total evaluation tasks contain ***114 datasets*** of **\"Retrieval\", \"STS\", \"PairClassification\", \"Classification\", \"Reranking\" and \"Clustering\"**.\n\n***NOTE:***\n\n- **All models are evaluated in their recommended pooling method (`pooler`)**.\n  - `mean` pooler: \"jina-embeddings-v2-base-en\", \"m3e-base\", \"m3e-large\", \"e5-large-v2\", \"multilingual-e5-base\", \"multilingual-e5-large\" and \"gte-large\".\n  - `cls` pooler: Other models.\n- \"jina-embeddings-v2-base-en\" model should be loaded with `trust_remote_code`.\n\n```bash\npython BCEmbedding/tools/eval_mteb/eval_embedding_mteb.py --model_name_or_path {mean_pooler_models} --pooler mean\n\npython BCEmbedding/tools/eval_mteb/eval_embedding_mteb.py --model_name_or_path jinaai/jina-embeddings-v2-base-en --pooler mean --trust_remote_code\n```\n\n#### 2. Reranker Models\n\nRun following cmd to evaluate `your_reranker_model` (e.g. \"maidalun1020/bce-reranker-base_v1\") in **bilingual and crosslingual settings** (e.g. `[\"en\", \"zh\", \"en-zh\", \"zh-en\"]`).\n\n```bash\npython BCEmbedding/tools/eval_mteb/eval_reranker_mteb.py --model_name_or_path maidalun1020/bce-reranker-base_v1\n```\n\nThe evaluation tasks contain ***12 datasets*** of **\"Reranking\"**.\n\n#### 3. 
Metrics Visualization Tool\n\nWe provide a one-click script to summarize evaluation results of `embedding` and `reranker` models as [Embedding Models Evaluation Summary](./Docs/EvaluationSummary/embedding_eval_summary.md) and [Reranker Models Evaluation Summary](./Docs/EvaluationSummary/reranker_eval_summary.md).\n\n```bash\npython BCEmbedding/tools/eval_mteb/summarize_eval_results.py --results_dir {your_embedding_results_dir | your_reranker_results_dir}\n```\n\n### Evaluate RAG by LlamaIndex\n\n[LlamaIndex](https://github.com/run-llama/llama_index) is a well-known data framework for LLM-based applications, particularly in RAG. Recently, a [LlamaIndex Blog](https://blog.llamaindex.ai/boosting-rag-picking-the-best-embedding-reranker-models-42d079022e83) post evaluated popular embedding and reranker models in a RAG pipeline and attracted considerable attention. We follow its pipeline to evaluate our `BCEmbedding`.\n\nFirst, install LlamaIndex, and upgrade `transformers` to 4.36.0:\n\n```bash\npip install transformers==4.36.0\n\npip install llama-index==0.9.22\n```\n\nExport your \"openai\" and \"cohere\" app keys, and openai base url (e.g. \"https://api.openai.com/v1\") to env:\n\n```bash\nexport OPENAI_BASE_URL={openai_base_url}  # https://api.openai.com/v1\nexport OPENAI_API_KEY={your_openai_api_key}\nexport COHERE_APPKEY={your_cohere_api_key}\n```\n\n#### 1. Metrics Definition\n\n- Hit Rate:\n\n  Hit rate calculates the fraction of queries where the correct answer is found within the top-k retrieved documents. In simpler terms, it's about how often our system gets it right within the top few guesses. ***The larger, the better.***\n- Mean Reciprocal Rank (MRR):\n\n  For each query, MRR evaluates the system's accuracy by looking at the rank of the highest-placed relevant document. Specifically, it's the average of the reciprocals of these ranks across all the queries. 
So, if the first relevant document is the top result, the reciprocal rank is 1; if it's second, the reciprocal rank is 1/2, and so on. ***The larger, the better.***\n\n#### 2. Reproduce [LlamaIndex Blog](https://blog.llamaindex.ai/boosting-rag-picking-the-best-embedding-reranker-models-42d079022e83)\n\nIn order to compare our `BCEmbedding` with other embedding and reranker models fairly, we provide a one-click script to reproduce results of the LlamaIndex Blog, including our `BCEmbedding`:\n\n```bash\n# At least two GPUs should be available.\nCUDA_VISIBLE_DEVICES=0,1 python BCEmbedding/tools/eval_rag/eval_llamaindex_reproduce.py\n```\n\nThen, summarize the evaluation results by:\n\n```bash\npython BCEmbedding/tools/eval_rag/summarize_eval_results.py --results_dir BCEmbedding/results/rag_reproduce_results\n```\n\nResults reproduced from the LlamaIndex Blog can be checked in ***[Reproduced Summary of RAG Evaluation](./Docs/EvaluationSummary/rag_eval_reproduced_summary.md)***, with some clear ***conclusions***:\n\n- In `WithoutReranker` setting, our `bce-embedding-base_v1` outperforms all the other embedding models.\n- With the embedding model fixed, our `bce-reranker-base_v1` achieves the best performance.\n- ***The combination of `bce-embedding-base_v1` and `bce-reranker-base_v1` is SOTA.***\n\n#### 3. Broad Domain Adaptability\n\nThe evaluation of [LlamaIndex Blog](https://blog.llamaindex.ai/boosting-rag-picking-the-best-embedding-reranker-models-42d079022e83) is **monolingual, uses a small amount of data, and covers a specific domain** (just including \"llama2\" paper). In order to evaluate the **broad domain adaptability, bilingual and crosslingual capability**, we follow the blog to build a multiple domains evaluation dataset (including \"Computer Science\", \"Physics\", \"Biology\", \"Economics\", \"Math\", and \"Quantitative Finance\". 
[Details](./BCEmbedding/tools/eval_rag/eval_pdfs/)), named [CrosslingualMultiDomainsDataset](https://huggingface.co/datasets/maidalun1020/CrosslingualMultiDomainsDataset):\n\n- To prevent test data leakage, English eval data is selected from the latest English articles in various fields on ArXiv, up to date December 30, 2023. Chinese eval data is selected from high-quality, as recent as possible, Chinese articles in the corresponding fields on Semantic Scholar.\n- Use OpenAI `gpt-4-1106-preview` to produce eval data for high quality.\n\nFirst, run following cmd to evaluate the most popular and powerful embedding and reranker models:\n\n```bash\n# There should be two GPUs available at least.\nCUDA_VISIBLE_DEVICES=0,1 python BCEmbedding/tools/eval_rag/eval_llamaindex_multiple_domains.py\n```\n\nThen, run the following script to summarize the evaluation results:\n\n```bash\npython BCEmbedding/tools/eval_rag/summarize_eval_results.py --results_dir BCEmbedding/results/rag_results\n```\n\nThe summary of multiple domains evaluations can be seen in \u003ca href=\"#1-multiple-domains-scenarios\" target=\"_Self\"\u003eMultiple Domains Scenarios\u003c/a\u003e.\n\n## 📈 Leaderboard\n\n### Semantic Representation Evaluations in MTEB\n\n#### 1. 
Embedding Models\n\n| Model                               | Dimensions |  Pooler  | Instructions | Retrieval (47) |    STS (19)    | PairClassification (5) | Classification (21) | Reranking (12) | Clustering (15) | ***AVG*** (119) |\n| :---------------------------------- | :--------: | :------: | :----------: | :-------------: | :-------------: | :--------------------: | :-----------------: | :-------------: | :-------------: | :---------------------: |\n| bge-base-en-v1.5                    |    768    | `cls` |     Need     |      37.14      |      55.06      |         75.45         |        59.73        |      43.00      |      37.74      |          47.19          |\n| bge-base-zh-v1.5                    |    768    | `cls` |     Need     |      47.63      |      63.72      |         77.40         |        63.38        |      54.95      |      32.56      |          53.62          |\n| bge-large-en-v1.5                   |    1024    | `cls` |     Need     |      37.18      |      54.09      |         75.00         |        59.24        |      42.47      |      37.32      |          46.80          |\n| bge-large-zh-v1.5                   |    1024    | `cls` |     Need     |      47.58      |      64.73      |    **79.14**    |        64.19        |      55.98      |      33.26      |          54.23          |\n| gte-large                           |    1024    | `mean` |     Free     |      36.68      |      55.22      |         74.29         |        57.73        |      42.44      |      38.51      |          46.67          |\n| gte-large-zh                        |    1024    | `cls` |     Free     |      41.15      |      64.62      |         77.58         |        62.04        |      55.62      |      33.03      |          51.51          |\n| jina-embeddings-v2-base-en          |    768    | `mean` |     Free     |      31.58      |      54.28      |         74.84         |        58.42        |      41.16      |      34.67      |          44.29          
|\n| m3e-base                            |    768    | `mean` |     Free     |      46.29      |      63.93      |         71.84         |        64.08        |      52.38      |      37.84      |          53.54          |\n| m3e-large                           |    1024    | `mean` |     Free     |      34.85      |      59.74      |         67.69         |        60.07        |      48.99      |      31.62      |          46.78          |\n| e5-large-v2                         |    1024    | `mean` |     Need     |      35.98      |      55.23      |         75.28         |        59.53        |      42.12      |      36.51      |          46.52          |\n| multilingual-e5-base                |    768    | `mean` |     Need     |      54.73      |      65.49      |         76.97         |        69.72        |      55.01      |      38.44      |          58.34          |\n| multilingual-e5-large               |    1024    | `mean` |     Need     |      56.76      | **66.79** |         78.80         |   **71.61**   |      56.49      | **43.09** |     **60.50**     |\n| ***bce-embedding-base_v1*** |    768    | `cls` |     Free     | **57.60** |      65.73      |         74.96         |        69.00        | **57.29** |      38.95      |          59.43          |\n\n***NOTE:***\n\n- Our ***bce-embedding-base_v1*** outperforms other open-source embedding models with comparable model sizes.\n- ***114 datasets including 119 eval results*** (some dataset contains multiple languages) of \"Retrieval\", \"STS\", \"PairClassification\", \"Classification\", \"Reranking\" and \"Clustering\" in ***`[\"en\", \"zh\", \"en-zh\", \"zh-en\"]` setting***, including **MTEB and CMTEB**.\n- The [crosslingual evaluation datasets](./BCEmbedding/evaluation/c_mteb/Retrieval.py) we released belong to `Retrieval` task.\n- More evaluation details should be checked in [Embedding Models Evaluations](./Docs/EvaluationSummary/embedding_eval_summary.md).\n\n#### 2. 
Reranker Models\n\n| Model                              | Reranking (12) | ***AVG*** (12) |\n| :--------------------------------- | :-------------: | :--------------------: |\n| bge-reranker-base                  |      59.04      |         59.04         |\n| bge-reranker-large                 |      60.86      |         60.86         |\n| ***bce-reranker-base_v1*** | **61.29** |  ***61.29***  |\n\n***NOTE:***\n\n- Our ***bce-reranker-base_v1*** outperforms other open-source reranker models.\n- ***12 datasets*** of \"Reranking\" in the ***`[\"en\", \"zh\", \"en-zh\", \"zh-en\"]` setting***.\n- For more evaluation details, see [Reranker Models Evaluations](./Docs/EvaluationSummary/reranker_eval_summary.md).\n\n### RAG Evaluations in LlamaIndex\n\n#### 1. Multiple Domains Scenarios\n\n\u003cimg src=\"./Docs/assets/rag_eval_multiple_domains_summary.jpg\"\u003e\n\n***NOTE:***\n\n- Data Quality:\n  - To prevent test data leakage, English eval data is selected from the latest English articles in various fields on arXiv, published up to December 30, 2023. Chinese eval data is selected from high-quality Chinese articles, as recent as possible, in the corresponding fields on Semantic Scholar.\n  - Eval data is generated with OpenAI `gpt-4-1106-preview` to ensure high quality.\n- Evaluated in the ***`[\"en\", \"zh\", \"en-zh\", \"zh-en\"]` setting***. 
If you are interested in the monolingual setting, see [Chinese RAG evaluations with [\"zh\"] setting](./Docs/EvaluationSummary/rag_eval_multiple_domains_summary_zh.md) and [English RAG evaluations with [\"en\"] setting](./Docs/EvaluationSummary/rag_eval_multiple_domains_summary_en.md).\n- These results are consistent with our ***[Reproduced Results](./Docs/EvaluationSummary/rag_eval_reproduced_summary.md)*** of the [LlamaIndex Blog](https://blog.llamaindex.ai/boosting-rag-picking-the-best-embedding-reranker-models-42d079022e83).\n- In the `WithoutReranker` setting, our `bce-embedding-base_v1` outperforms all the other embedding models.\n- With the embedding model fixed, our `bce-reranker-base_v1` achieves the best performance.\n- **The combination of `bce-embedding-base_v1` and `bce-reranker-base_v1` is SOTA**.\n\n## 🛠 Youdao's BCEmbedding API\n\nIf you prefer not to download and configure the model on your own system, `BCEmbedding` is also available through Youdao's API, which lets you integrate it into your projects without manual setup and maintenance. Detailed instructions and API documentation are available at [Youdao BCEmbedding API](https://ai.youdao.com/DOCSIRMA/html/aigc/api/embedding/index.html). 
Here, you'll find the guidance needed to use `BCEmbedding` across a variety of use cases.\n\n## 🧲 WeChat Group\n\nScan the QR code below to join the WeChat group.\n\n\u003cimg src=\"./Docs/assets/Wechat.jpg\" width=\"20%\" height=\"auto\"\u003e\n\n## ✏️ Citation\n\nIf you use `BCEmbedding` in your research or project, please consider citing it and starring the repository:\n\n```\n@misc{youdao_bcembedding_2023,\n    title={BCEmbedding: Bilingual and Crosslingual Embedding for RAG},\n    author={NetEase Youdao},\n    year={2023},\n    howpublished={\\url{https://github.com/netease-youdao/BCEmbedding}}\n}\n```\n\n## 🔐 License\n\n`BCEmbedding` is licensed under the [Apache 2.0 License](./LICENSE).\n\n## 🔗 Related Links\n\n[Netease Youdao - QAnything](https://github.com/netease-youdao/qanything)\n\n[FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding)\n\n[MTEB](https://github.com/embeddings-benchmark/mteb)\n\n[C_MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB)\n\n[LlamaIndex](https://github.com/run-llama/llama_index) | [LlamaIndex Blog](https://blog.llamaindex.ai/boosting-rag-picking-the-best-embedding-reranker-models-42d079022e83)\n\n[HuixiangDou](https://github.com/internlm/huixiangdou)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnetease-youdao%2FBCEmbedding","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnetease-youdao%2FBCEmbedding","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnetease-youdao%2FBCEmbedding/lists"}