{"id":15066216,"url":"https://github.com/shibing624/similarities","last_synced_at":"2025-05-14T14:08:19.272Z","repository":{"id":40358144,"uuid":"462683173","full_name":"shibing624/similarities","owner":"shibing624","description":"Similarities: a toolkit for similarity calculation and semantic search. 相似度计算、匹配搜索工具包，支持亿级数据文搜文、文搜图、图搜图，python3开发，开箱即用。","archived":false,"fork":false,"pushed_at":"2024-10-29T14:09:17.000Z","size":10081,"stargazers_count":842,"open_issues_count":8,"forks_count":82,"subscribers_count":9,"default_branch":"main","last_synced_at":"2025-04-13T01:55:24.388Z","etag":null,"topics":["bm25","deep-learning","faiss","image-search","image-similarity","matching","nlp","pytorch","search-engine","similarity","similarity-search","text-matching"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/similarities/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/shibing624.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2022-02-23T10:17:53.000Z","updated_at":"2025-04-10T02:55:43.000Z","dependencies_parsed_at":"2023-02-17T03:16:09.733Z","dependency_job_id":"159725bb-dcb0-4885-acaf-7fb42f132675","html_url":"https://github.com/shibing624/similarities","commit_stats":{"total_commits":165,"total_committers":5,"mean_commits":33.0,"dds":"0.024242424242424288","last_synced_commit":"6e6e39c258ba36a62efc04dedb66884eef44d841"},"previous_names":[],"tags_count":9,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitH
ub/repositories/shibing624%2Fsimilarities","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shibing624%2Fsimilarities/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shibing624%2Fsimilarities/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shibing624%2Fsimilarities/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/shibing624","download_url":"https://codeload.github.com/shibing624/similarities/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254159730,"owners_count":22024564,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bm25","deep-learning","faiss","image-search","image-similarity","matching","nlp","pytorch","search-engine","similarity","similarity-search","text-matching"],"created_at":"2024-09-25T01:03:48.106Z","updated_at":"2025-05-14T14:08:19.254Z","avatar_url":"https://github.com/shibing624.png","language":"Python","funding_links":[],"categories":["NLP"],"sub_categories":["3. 
Pretraining"],"readme":"[**🇨🇳中文**](https://github.com/shibing624/similarities/blob/main/README.md) | [**🌐English**](https://github.com/shibing624/similarities/blob/main/README_EN.md) | [**📖文档/Docs**](https://github.com/shibing624/similarities/wiki) | [**🤖模型/Models**](https://huggingface.co/shibing624) \n\n\u003cdiv align=\"center\"\u003e\n  \u003ca href=\"https://github.com/shibing624/similarities\"\u003e\n    \u003cimg src=\"https://raw.githubusercontent.com/shibing624/similarities/main/docs/logo.png\" height=\"150\" alt=\"Logo\"\u003e\n  \u003c/a\u003e\n\u003c/div\u003e\n\n-----------------\n\n# Similarities: Similarity Calculation and Semantic Search\n[![PyPI version](https://badge.fury.io/py/similarities.svg)](https://badge.fury.io/py/similarities)\n[![Downloads](https://static.pepy.tech/badge/similarities)](https://pepy.tech/project/similarities)\n[![Contributions welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg)](CONTRIBUTING.md)\n[![License Apache 2.0](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](LICENSE)\n[![python_version](https://img.shields.io/badge/Python-3.5%2B-green.svg)](requirements.txt)\n[![GitHub issues](https://img.shields.io/github/issues/shibing624/similarities.svg)](https://github.com/shibing624/similarities/issues)\n[![Wechat Group](https://img.shields.io/badge/wechat-group-green.svg?logo=wechat)](#Contact)\n\n\n**similarities**: a toolkit for similarity calculation and semantic search, supports text and image. 
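\n\n**similarities** implements a range of similarity calculation and semantic matching and retrieval algorithms for text and images. It supports text-to-text, text-to-image, and image-to-image search over datasets at the hundred-million scale, is written in Python 3, installs with pip, and works out of the box.\n\n**Guide**\n\n- [Features](#Features)\n- [Install](#install)\n- [Usage](#usage)\n- [Contact](#Contact)\n- [Acknowledgements](#Acknowledgements)\n\n## Features\n\n### Text similarity calculation + text search\n\n- Semantic matching models (recommended): built on text2vec, this project implements text similarity calculation and text search with the CoSENT model\n  - Supports Chinese, English, and multilingual SentenceBERT-style pretrained models\n  - Supports several similarity measures, including Cos Similarity, Dot Product, Hamming Distance, and Euclidean Distance\n  - Supports several text search algorithms, including SemanticSearch, Faiss, Annoy, and Hnsw\n  - Supports efficient retrieval over datasets at the hundred-million scale\n  - Supports command-line text-to-vector conversion (multi-GPU), index building, batch retrieval, and starting a service\n- Literal matching models: this project implements several literal matching models, including Word2Vec, BM25, RankBM25, TFIDF, SimHash, Cilin (a Chinese synonym thesaurus), and HowNet sememe matching\n\n\n### Image similarity / image-text similarity calculation + image-to-image / text-to-image search\n- CLIP (Contrastive Language-Image Pre-Training) model: an image-text matching model that can be used for image and text features (embeddings), similarity calculation, image-text retrieval, and zero-shot image classification. Built on PyTorch, this project implements CLIP vector representation, index building (with AutoFaiss), batch retrieval, a backend service (with FastAPI), and a frontend (with Gradio)\n  - Supports CLIP-series models such as [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32)\n  - Supports Chinese-CLIP-series models such as [OFA-Sys/chinese-clip-vit-huge-patch14](https://huggingface.co/OFA-Sys/chinese-clip-vit-huge-patch14)\n  - Supports separate frontend and backend deployment, with a FastAPI backend service and a Gradio frontend\n  - Supports efficient retrieval over datasets at the hundred-million scale, based on Faiss, with GPU acceleration\n  - Supports image-to-image search, text-to-image search, and vector-to-image search\n  - Supports image embedding extraction and text embedding extraction\n  - Supports image similarity calculation and image-text similarity calculation\n  - Supports command-line image-to-vector conversion (multi-GPU), index building, batch retrieval, and starting a service\n- Image feature extraction: built on cv2, this project implements several image feature extraction algorithms, including pHash, dHash, wHash, aHash, and SIFT\n\n## Demo\nImage Search Demo: https://huggingface.co/spaces/shibing624/CLIP-Image-Search\n\n![](https://github.com/shibing624/similarities/blob/main/docs/white_cat.png)\n\nText Search Demo: https://huggingface.co/spaces/shibing624/similarities\n\n![](https://github.com/shibing624/similarities/blob/main/docs/hf_search.png)\n\n\n## Install\n\n```\npip install torch # conda install pytorch\npip install -U similarities\n```\n\nor\n\n```\ngit clone https://github.com/shibing624/similarities.git\ncd similarities\npip install -e .\n```\n\n## Usage\n\n### 1. 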
Text vector similarity calculation\n\nexample: [examples/text_similarity_demo.py](https://github.com/shibing624/similarities/blob/main/examples/text_similarity_demo.py)\n\n\n```python\nfrom similarities import BertSimilarity\nm = BertSimilarity(model_name_or_path=\"shibing624/text2vec-base-chinese\")\nr = m.similarity('如何更换花呗绑定银行卡', '花呗更改绑定银行卡')\nprint(f\"similarity score: {float(r)}\")  # similarity score: 0.855146050453186\n```\n\n- `model_name_or_path`: the model name or path. By default, the Chinese semantic matching model [shibing624/text2vec-base-chinese](https://huggingface.co/shibing624/text2vec-base-chinese) is downloaded from the HF model hub and used. For multilingual use, switch to the [shibing624/text2vec-base-multilingual](https://huggingface.co/shibing624/text2vec-base-multilingual) model, which supports Chinese, English, Korean, Japanese, German, Italian, and other languages\n\n### 2. Text vector search\n\nFind the texts in a candidate document set that are most similar to the query. This is commonly used for matching similar questions in QA scenarios, text search, and related tasks.\n\n#### SemanticSearch exact search: Cos Similarity + topK clustering retrieval, suited to datasets under one million items\n\nexample: [examples/text_semantic_search_demo.py](https://github.com/shibing624/similarities/blob/main/examples/text_semantic_search_demo.py)\n\n#### Approximate search algorithms such as Annoy and Hnswlib, suited to datasets at the million scale\n\nexample: [examples/fast_text_semantic_search_demo.py](https://github.com/shibing624/similarities/blob/main/examples/fast_text_semantic_search_demo.py)\n\n#### Efficient vector retrieval with Faiss, suited to datasets at the hundred-million scale\n\n- Text-to-vector conversion, index building, batch retrieval, and starting the service: [examples/faiss_bert_search_server_demo.py](https://github.com/shibing624/similarities/blob/main/examples/faiss_bert_search_server_demo.py)\n\n- Calling the service from a Python client: [examples/faiss_bert_search_client_demo.py](https://github.com/shibing624/similarities/blob/main/examples/faiss_bert_search_client_demo.py)\n\n\n### 3. Literal text similarity calculation and text search\n\nSupports similarity calculation and literal matching search with algorithms such as Cilin (a Chinese synonym thesaurus), HowNet, word embeddings (WordEmbedding), Tfidf, SimHash, and BM25. This is commonly used for cold-starting text matching.\n\nexample: [examples/literal_text_semantic_search_demo.py](https://github.com/shibing624/similarities/blob/main/examples/literal_text_semantic_search_demo.py)\n\n### 4. 
Image similarity calculation and image search\n\nSupports image similarity calculation and matching search with algorithms such as CLIP, pHash, and SIFT. The Chinese CLIP model supports image-to-image and text-to-image search, as well as cross-modal image-text search in both Chinese and English.\n\nexample: [examples/image_semantic_search_demo.py](https://github.com/shibing624/similarities/blob/main/examples/image_semantic_search_demo.py)\n\n![image_sim](https://github.com/shibing624/similarities/blob/main/docs/image_sim.png)\n\n\n#### Efficient vector retrieval with Faiss, suited to datasets at the hundred-million scale\n\n- Image-to-vector conversion, index building, batch retrieval, and starting the service: [examples/faiss_clip_search_server_demo.py](https://github.com/shibing624/similarities/blob/main/examples/faiss_clip_search_server_demo.py)\n\n- Calling the service from a Python client: [examples/faiss_clip_search_client_demo.py](https://github.com/shibing624/similarities/blob/main/examples/faiss_clip_search_client_demo.py)\n\n- Calling the service from a Gradio frontend: [examples/faiss_clip_search_gradio_demo.py](https://github.com/shibing624/similarities/blob/main/examples/faiss_clip_search_gradio_demo.py)\n\n\u003cimg src=\"https://github.com/shibing624/similarities/blob/main/docs/dog-img.png\"/\u003e\n\n### 5. Clustering\n\nThe community detection (community_detection) algorithm can cluster large datasets and find clusters, that is, groups of similar sentences.\n\nexample: [examples/text_clustering_demo.py](https://github.com/shibing624/similarities/blob/main/examples/text_clustering_demo.py)\n\n\n### 6. 
Semantic deduplication of text and images\n\nThe paraphrase mining (paraphrase_mining_embeddings) algorithm can mine sentence pairs with similar meaning from a large set of sentences or documents. It can be used to detect redundant text and images and for semantic deduplication.\n\n- Semantic deduplication of text: [examples/text_duplicates_demo.py](https://github.com/shibing624/similarities/blob/main/examples/text_duplicates_demo.py)\n- Semantic deduplication of images: [examples/image_duplicates_demo.py](https://github.com/shibing624/similarities/blob/main/examples/image_duplicates_demo.py)\n\n### Command-line mode (CLI)\n\n- Supports batch computation of text and image vectors (embedding)\n- Supports index building (index)\n- Supports batch retrieval (filter)\n- Supports starting a service (server)\n\ncode: [cli.py](https://github.com/shibing624/similarities/blob/main/similarities/cli.py)\n\n```\n\u003e similarities -h\n\nNAME\n    similarities\n\nSYNOPSIS\n    similarities COMMAND\n\nCOMMANDS\n    COMMAND is one of the following:\n\n     bert_embedding\n       Compute embeddings for a list of sentences\n\n     bert_index\n       Build indexes from text embeddings using autofaiss\n\n     bert_filter\n       Entry point of bert filter, batch search index\n\n     bert_server\n       Main entry point of bert search backend, start the server\n\n     clip_embedding\n       Embedding text and image with clip model\n\n     clip_index\n       Build indexes from embeddings using autofaiss\n\n     clip_filter\n       Entry point of clip filter, batch search index\n\n     clip_server\n       Main entry point of clip search backend, start the server\n```\n\nrun:\n\n```shell\npip install similarities -U\nsimilarities clip_embedding -h\n\n# example\ncd examples\nsimilarities clip_embedding data/toy_clip/\n```\n\n- `bert_embedding` and the others are subcommands; those starting with `bert` handle text, and those starting with `clip` handle images\n- For the usage of each subcommand, run it with `-h`, for example `similarities clip_embedding -h`\n- In the example above, `data/toy_clip/` is the `input_dir` argument of `clip_embedding`: the input file directory (required)\n\n\n\n## Contact\n\n- Issues (suggestions): [![GitHub issues](https://img.shields.io/github/issues/shibing624/similarities.svg)](https://github.com/shibing624/similarities/issues)\n- Email me: xuming (xuming624@qq.com)\n- WeChat: add me (*WeChat ID: xuming624, with the note: name-company-NLP*) to join the NLP discussion group.\n\n\u003cimg 
src=\"https://github.com/shibing624/similarities/blob/main/docs/wechat.jpeg\" width=\"200\" /\u003e\n\n## Citation\n\nIf you use similarities in your research, please cite it in the following format:\n\nAPA:\n\n```\nXu, M. Similarities: Compute similarity score for humans (Version 1.0.1) [Computer software]. https://github.com/shibing624/similarities\n```\n\nBibTeX:\n\n```\n@misc{Xu_Similarities_Compute_similarity,\n  title={Similarities: similarity calculation and semantic search toolkit},\n  author={Xu Ming},\n  year={2022},\n  howpublished={\\url{https://github.com/shibing624/similarities}},\n}\n```\n\n## License\n\nThe license is [The Apache License 2.0](/LICENSE), which permits free commercial use. Please include a link to similarities and the license in your product description.\n\n## Contribute\n\nThe project code is still rough. If you improve it, you are welcome to contribute your changes back to this project. Before submitting, note the following two points:\n\n- Add corresponding unit tests in `tests`\n- Run all unit tests with `python -m pytest` and make sure they all pass\n\nAfter that, you can submit a PR.\n\n## Acknowledgements\n\n- [A Simple but Tough-to-Beat Baseline for Sentence Embeddings[Sanjeev Arora and Yingyu Liang and Tengyu Ma, 2017]](https://openreview.net/forum?id=SyK00v5xx)\n- [https://github.com/liuhuanyong/SentenceSimilarity](https://github.com/liuhuanyong/SentenceSimilarity)\n- [https://github.com/qwertyforce/image_search](https://github.com/qwertyforce/image_search)\n- [ImageHash - Official Github repository](https://github.com/JohannesBuchner/imagehash)\n- [https://github.com/openai/CLIP](https://github.com/openai/CLIP)\n- [https://github.com/OFA-Sys/Chinese-CLIP](https://github.com/OFA-Sys/Chinese-CLIP)\n- [https://github.com/UKPLab/sentence-transformers](https://github.com/UKPLab/sentence-transformers)\n- [https://github.com/rom1504/clip-retrieval](https://github.com/rom1504/clip-retrieval)\n\nThanks for their great work!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshibing624%2Fsimilarities","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fshibing624%2Fsimilarities","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshibing624%2Fsimilarities/lists"}