{"id":13754245,"url":"https://github.com/wangyuxinwhy/uniem","last_synced_at":"2025-12-30T14:04:39.407Z","repository":{"id":163321650,"uuid":"638158791","full_name":"wangyuxinwhy/uniem","owner":"wangyuxinwhy","description":"unified embedding model","archived":false,"fork":false,"pushed_at":"2023-09-01T12:15:06.000Z","size":13272,"stargazers_count":853,"open_issues_count":47,"forks_count":70,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-04-28T16:18:14.452Z","etag":null,"topics":["embeddings","huggingface","nlp","sentence-embeddings","sentence-transformers"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/wangyuxinwhy.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-05-09T07:53:48.000Z","updated_at":"2025-04-26T07:39:43.000Z","dependencies_parsed_at":null,"dependency_job_id":"161b1b2b-88ab-4eec-84ed-8ff01b81477b","html_url":"https://github.com/wangyuxinwhy/uniem","commit_stats":null,"previous_names":[],"tags_count":9,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wangyuxinwhy%2Funiem","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wangyuxinwhy%2Funiem/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wangyuxinwhy%2Funiem/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wangyuxinwhy%2Funiem/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/wangyuxinwhy","
download_url":"https://codeload.github.com/wangyuxinwhy/uniem/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253335715,"owners_count":21892715,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["embeddings","huggingface","nlp","sentence-embeddings","sentence-transformers"],"created_at":"2024-08-03T09:01:51.790Z","updated_at":"2025-12-30T14:04:39.379Z","avatar_url":"https://github.com/wangyuxinwhy.png","language":"Python","funding_links":[],"categories":["文本匹配 文本检索 文本相似度","Tools \u0026 Evaluation"],"sub_categories":["其他_文本生成、文本对话","Fine-tuning"],"readme":"# uniem\n[![Release](https://img.shields.io/pypi/v/uniem)](https://pypi.org/project/uniem/)\n![PyPI - Python Version](https://img.shields.io/pypi/pyversions/uniem)\n[![ci](https://github.com/wangyuxinwhy/uniem/actions/workflows/ci.yml/badge.svg)](https://github.com/wangyuxinwhy/uniem/actions/workflows/ci.yml)\n[![cd](https://github.com/wangyuxinwhy/uniem/actions/workflows/cd.yml/badge.svg)](https://github.com/wangyuxinwhy/uniem/actions/workflows/cd.yml)\n\nThe uniem project aims to build the best general-purpose text embedding model for Chinese.\n\nThe project mainly contains the code for training, fine-tuning, and evaluating models. The models and datasets are open-sourced on the [HuggingFace](https://huggingface.co/) community.\n\n## 🌟 Important Updates\n\n- ➿ **2023.07.11**: Released uniem 0.3.0. Besides M3E, `FineTuner` now also supports fine-tuning models such as `sentence_transformers` and `text2vec`, training GPT-family models in the [SGPT](https://github.com/Muennighoff/sgpt) style, and Prefix Tuning. **The FineTuner initialization API has changed slightly and is not compatible with 0.2.0.**\n- ➿ **2023.06.17**: Released uniem 0.2.1, which implements `FineTuner` for native model fine-tuning. **A few lines of code and your model is adapted!**\n- 📊 **2023.06.17**: Released 
[MTEB-zh](https://github.com/wangyuxinwhy/uniem/tree/main/mteb-zh) (official version), which supports automated evaluation of 6 categories of embedding models on 4 task categories and 9 datasets in total\n- 🎉 **2023.06.08**: Released [M3E models](https://huggingface.co/moka-ai/m3e-base), which outperform `openai text-embedding-ada-002` on both Chinese text classification and Chinese text retrieval. For details, see the [M3E models README](https://huggingface.co/moka-ai/m3e-base/blob/main/README.md).\n\n## 🔧 Using M3E\n\nThe M3E model family is fully compatible with [sentence-transformers](https://www.sbert.net/). By **replacing the model name**, you can use M3E models seamlessly in any project that supports sentence-transformers, such as [chroma](https://docs.trychroma.com/getting-started), [guidance](https://github.com/microsoft/guidance), and [semantic-kernel](https://github.com/microsoft/semantic-kernel).\n\nInstallation\n\n```bash\npip install sentence-transformers\n```\n\nUsage\n\n```python\nfrom sentence_transformers import SentenceTransformer\n\nmodel = SentenceTransformer(\"moka-ai/m3e-base\")\nembeddings = model.encode(['Hello World!', '你好,世界!'])\n```\n\n## 🎨 Fine-tuning Models\n\n`uniem` provides a very easy-to-use fine-tuning interface: a few lines of code and your model is adapted!\n\n```python\nfrom datasets import load_dataset\n\nfrom uniem.finetuner import FineTuner\n\ndataset = load_dataset('shibing624/nli_zh', 'STS-B')\n# Fine-tune the m3e-small model\nfinetuner = FineTuner.from_pretrained('moka-ai/m3e-small', dataset=dataset)\nfinetuner.run(epochs=3)\n```\n\nFor details on fine-tuning, see the [uniem fine-tuning tutorial](https://github.com/wangyuxinwhy/uniem/blob/main/examples/finetune.ipynb) or \u003ca target=\"_blank\" href=\"https://colab.research.google.com/github/wangyuxinwhy/uniem/blob/main/examples/finetune.ipynb\"\u003e\n  \u003cimg src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/\u003e\n\u003c/a\u003e\n\n\nTo run it locally, prepare the environment with the following commands\n\n```bash\nconda create -n uniem python=3.10\nconda activate uniem\npip install uniem\n```\n\n## 💯 MTEB-zh\n\nChinese embedding models lack a unified evaluation standard, so we built MTEB-zh, a Chinese evaluation benchmark modeled on [MTEB](https://huggingface.co/spaces/mteb/leaderboard). So far, 6 kinds of models have been compared side by side on a variety of datasets. For the detailed evaluation method and code, see [MTEB-zh](https://github.com/wangyuxinwhy/uniem/tree/main/mteb-zh).\n\n\n### Text Classification\n\n- Datasets: drawn from open datasets on HuggingFace, 6 
text classification datasets covering news, e-commerce reviews, stock comments, and long texts\n- Evaluation: follows the MTEB protocol and reports Accuracy.\n\n|                   | text2vec | m3e-small | m3e-base | m3e-large-0619 | openai | DMetaSoul   | uer     | erlangshen  |\n| ----------------- | -------- | --------- | -------- | ------ | ----------- | ------- | ----------- | ----------- |\n| TNews             | 0.43     | 0.4443    | 0.4827   | **0.4866** | 0.4594 | 0.3084      | 0.3539  | 0.4361      |\n| JDIphone          | 0.8214   | 0.8293    | 0.8533   | **0.8692** | 0.746  | 0.7972      | 0.8283  | 0.8356      |\n| GubaEastmony      | 0.7472   | 0.712     | 0.7621   | 0.7663 | 0.7574 | 0.735       | 0.7534  | **0.7787**      |\n| TYQSentiment      | 0.6099   | 0.6596    | 0.7188   | **0.7247** | 0.68   | 0.6437      | 0.6662  | 0.6444      |\n| StockComSentiment | 0.4307   | 0.4291    | 0.4363   | 0.4475 | **0.4819** | 0.4309      | 0.4555  | 0.4482      |\n| IFlyTek           | 0.414    | 0.4263    | 0.4409   | 0.4445 | **0.4486** | 0.3969      | 0.3762  | 0.4241      |\n| Average           | 0.5755   | 0.5834    | 0.6157   | **0.6231** | 0.5956 | 0.552016667 | 0.57225 | 0.594516667 |\n\n### Retrieval and Ranking\n\n- Datasets: the [T2Ranking](https://github.com/THUIR/T2Ranking/tree/main) dataset. Because T2Ranking is very large, evaluating openai on all of it would be costly in both time and API fees, so we use only the first 10,000 passages of T2Ranking\n- Evaluation: follows the MTEB protocol and reports map@1, map@10, mrr@1, mrr@10, ndcg@1, ndcg@10\n\n|         | text2vec | openai-ada-002 | m3e-small | m3e-base | m3e-large-0619 | DMetaSoul | uer     | erlangshen |\n| ------- | -------- | -------------- | --------- | -------- | --------- | ------- | ---------- | ---------- |\n| map@1   | 0.4684   | 0.6133         | 0.5574    | **0.626**    | 0.6256 | 0.25203   | 0.08647 | 0.25394    |\n| map@10  | 0.5877   | 0.7423         | 0.6878    | **0.7656**   | 0.7627 | 0.33312   | 0.13008 | 0.34714    |\n| mrr@1   | 0.5345   | 0.6931         | 0.6324    | 0.7047   | **0.7063** | 0.29258   | 0.10067 | 0.29447    |\n| mrr@10  | 0.6217   | 0.7668         | 0.712     | **0.7841**   
| 0.7827 | 0.36287   | 0.14516 | 0.3751     |\n| ndcg@1  | 0.5207   | 0.6764         | 0.6159    | 0.6881   | **0.6884** | 0.28358   | 0.09748 | 0.28578    |\n| ndcg@10 | 0.6346   | 0.7786         | 0.7262    | **0.8004**   | 0.7974 | 0.37468   | 0.15783 | 0.39329    |\n\n## 🤝 Contributing\n\nIf you would like to add an evaluation dataset or model to MTEB-zh, feel free to open an issue or PR. I will support it as soon as possible and look forward to your contributions!\n\n## 📜 License\n\nuniem is licensed under the Apache-2.0 License. See the LICENSE file for more details.\n\n## 🏷 Citation\n\nPlease cite this model using the following format:\n\n@software {Moka Massive Mixed Embedding,\nauthor = {Wang Yuxin,Sun Qingxuan,He sicheng},\ntitle = {M3E: Moka Massive Mixed Embedding Model},\nyear = {2023} }\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwangyuxinwhy%2Funiem","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwangyuxinwhy%2Funiem","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwangyuxinwhy%2Funiem/lists"}