{"id":18918135,"url":"https://github.com/dengbocong/text-similarity","last_synced_at":"2026-03-09T10:02:41.410Z","repository":{"id":46703435,"uuid":"356466603","full_name":"DengBoCong/text-similarity","owner":"DengBoCong","description":"文本相似度（匹配）计算，提供Baseline、训练、推理、指标分析...代码包含TensorFlow/Pytorch双版本","archived":false,"fork":false,"pushed_at":"2022-05-01T08:34:35.000Z","size":60430,"stargazers_count":179,"open_issues_count":6,"forks_count":32,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-09-22T21:47:16.892Z","etag":null,"topics":["bert","deep-learning","mechine-learing","model","nlp","pytorch","similarity","text-classification","transformer"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DengBoCong.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-04-10T04:12:15.000Z","updated_at":"2025-08-15T01:04:18.000Z","dependencies_parsed_at":"2022-09-13T13:50:54.334Z","dependency_job_id":null,"html_url":"https://github.com/DengBoCong/text-similarity","commit_stats":null,"previous_names":["dengbocong/sentence2vec","dengbocong/sim"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/DengBoCong/text-similarity","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DengBoCong%2Ftext-similarity","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DengBoCong%2Ftext-similarity/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DengBoCong%2Ftext-similarity/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DengBoCong%2Ftext-sim
ilarity/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/DengBoCong","download_url":"https://codeload.github.com/DengBoCong/text-similarity/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DengBoCong%2Ftext-similarity/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30290932,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-09T02:57:19.223Z","status":"ssl_error","status_checked_at":"2026-03-09T02:56:26.373Z","response_time":61,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert","deep-learning","mechine-learing","model","nlp","pytorch","similarity","text-classification","transformer"],"created_at":"2024-11-08T10:29:44.325Z","updated_at":"2026-03-09T10:02:41.374Z","avatar_url":"https://github.com/DengBoCong.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ch1 align=\"center\"\u003eText-Similarity\u003c/h1\u003e\n\n\u003cdiv align=\"center\"\u003e\n\n[![Blog](https://img.shields.io/badge/blog-@DengBoCong-blue.svg?style=social)](https://www.zhihu.com/people/dengbocong)\n[![Paper Support](https://img.shields.io/badge/paper-repo-blue.svg?style=social)](https://github.com/DengBoCong/nlp-paper)\n![Stars Thanks](https://img.shields.io/badge/Stars-thanks-brightgreen.svg?style=social\u0026logo=trustpilot)\n![PRs 
Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=social\u0026logo=appveyor)\n\n[comment]: \u003c\u003e ([![PRs Welcome]\u0026#40;https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat-square\u0026#41;]\u0026#40;\u0026#41;)\n\n\u003c/div\u003e\n\n# Overview\n+ **Dataset**: Chinese/English corpora, ☞ [click here](https://github.com/DengBoCong/text-similarity/tree/main/corpus)\n+ **Paper**: detailed notes on the related papers, ☞ [click here](https://github.com/DengBoCong/nlp-paper)\n+ **The following methods are implemented:**\n + TF-IDF\n + BM25\n + LSH\n + SIF/uSIF\n + FastText\n + RNN Base (Siamese RNN, Stack RNN)\n + CNN Base (Fast Text, Text CNN, Char CNN, VDCNN)\n + Bert Base\n + Albert\n + NEZHA\n + RoBERTa\n + SimCSE\n + Poly-Encoder\n + ColBERT\n + RE2 (Simple-Effective-Text-Matching)\n\n# Usages\nInstall and use the package via pip (as shown below), or download the source code and integrate it into your project directly:\n```\npip3 install text-sim\n```\n\n```\n1: The examples directory contains preprocess/train/evaluate code for each model; modify it as needed\n2: The examples below import the actuator method from examples; prepare the corresponding model configuration file and they are ready to run\n3: inference.py in the examples directory contains inference code for trained models\n4: The core code lives under sim; the TensorFlow and Pytorch versions are stored separately, and their import style is largely the same\n5: Related utilities, including word2vec, tokenizer, and data_format, are collected under tools in sim\n```\n\n### TF-IDF\n\n```python\n# Example\n# Sklearn version\nfrom examples.run_tfidf_sklearn import actuator\nactuator(\"./corpus/chinese/breeno/train.tsv\", query1=\"12 23 4160 276\", query2=\"29 23 169 1495\")\n\n# Custom version\nfrom examples.run_tfidf import actuator\nactuator(\"./corpus/chinese/breeno/train.tsv\", query1=\"12 23 4160 276\", query2=\"29 23 169 1495\")\n\n# Tool usage\nfrom sim.tf_idf import TFIdf\n\ntokens_list = [\"这是一个什么样的工具\", \"...\"]\nquery = [\"非常好用的工具\"]\n\ntf_idf = TFIdf(tokens_list, split=\" \")\nprint(tf_idf.get_score(query, 0)) # score\nprint(tf_idf.get_score_list(query, 10)) # [(index, score), ...]\nprint(tf_idf.weight()) # list or numpy array\n```\n\n### BM25\n\n```python\n# Example\nfrom examples.run_bm25 import actuator\nactuator(\"./corpus/chinese/breeno/train.tsv\", query1=\"12 23 4160 276\", query2=\"29 23 169 1495\")\n\n# 
Tool usage\nfrom sim.bm25 import BM25\n\ntokens_list = [\"这是一个什么样的工具\", \"...\"]\nquery = [\"非常好用的工具\"]\n\nbm25 = BM25(tokens_list, split=\" \")\nprint(bm25.get_score(query, 0)) # score\nprint(bm25.get_score_list(query, 10)) # [(index, score), ...]\nprint(bm25.weight()) # list or numpy array\n```\n\n### LSH\n\n```python\nfrom sim.lsh import E2LSH\nfrom sim.lsh import MinHash\n\ne2lsh = E2LSH()\nmin_hash = MinHash()\n\ncandidates = [[3.6216, 8.6661, -2.8073, -0.44699, 0], ...]\nquery = [-2.7769, -5.6967, 5.9179, 0.37671, 1]\nprint(e2lsh.search(candidates, query)) # index in candidates\nprint(min_hash.search(candidates, query)) # index in candidates\n```\n\n### SIF\n+ [A Simple But Tough-To-Beat Baseline For Sentence Embeddings](https://openreview.net/pdf?id=SyK00v5xx)\n+ [Unsupervised Random Walk Sentence Embeddings: A Strong but Simple Baseline](https://aclanthology.org/W18-3012.pdf)\n```python\nsentences = [[\"token1\", \"token2\", \"...\"], ...]\nvector = [[[1, 1, 1], [2, 2, 2], [...]], ...]\nfrom sim.sif_usif import SIF\nfrom sim.sif_usif import uSIF\n\nsif = SIF(n_components=5, component_type=\"svd\")\nsif.fit(tokens_list=sentences, vector_list=vector)\n\nusif = uSIF(n_components=5, n=1, component_type=\"svd\")\nusif.fit(tokens_list=sentences, vector_list=vector)\n```\n\n### FastText\n+ [Bag of Tricks for Efficient Text Classification](https://arxiv.org/pdf/1607.01759.pdf)\n```python\n# TensorFlow version\nfrom examples.tensorflow.run_fast_text import actuator\nactuator(execute_type=\"train\", model_type=\"bert\", model_dir=\"./data/chinese_wwm_L-12_H-768_A-12\")\n\n# Pytorch version\nfrom examples.pytorch.run_fast_text import actuator\nactuator(execute_type=\"train\", model_type=\"bert\", model_dir=\"./data/chinese_wwm_pytorch\")\n```\n\n### RNN Base\n+ [Siamese Recurrent Architectures for Learning Sentence 
Similarity](https://scholar.google.com/scholar_url?url=https://ojs.aaai.org/index.php/AAAI/article/view/10350/10209\u0026hl=zh-CN\u0026sa=T\u0026oi=gsb-gga\u0026ct=res\u0026cd=0\u0026d=7393466935379636447\u0026ei=KQWzYNL5OYz4yATXqJ6YCg\u0026scisig=AAGBfm0zNEZZez8zh5ZB_iG7UTrwXmhJWg)\n+ [Learning Text Similarity with Siamese Recurrent Networks](https://aclanthology.org/W16-1617.pdf)\n```python\n# TensorFlow version\nfrom examples.tensorflow.run_siamese_rnn import actuator\nactuator(\"./data/config/siamse_rnn.json\", execute_type=\"train\")\n\n# Pytorch version\nfrom examples.pytorch.run_siamese_rnn import actuator\nactuator(\"./data/config/siamse_rnn.json\", execute_type=\"train\")\n```\n\n### CNN Base\n+ [Convolutional Neural Networks for Sentence Classification](https://arxiv.org/pdf/1408.5882.pdf)\n+ [Character-Aware Neural Language Models](https://arxiv.org/pdf/1508.06615.pdf)\n+ [Highway Networks](https://arxiv.org/pdf/1505.00387.pdf)\n+ [Very Deep Convolutional Networks for Text Classification](https://arxiv.org/pdf/1606.01781.pdf)\n```python\n# TensorFlow version\nfrom examples.tensorflow.run_cnn_base import actuator\nactuator(execute_type=\"train\", model_type=\"bert\", model_dir=\"./data/chinese_wwm_L-12_H-768_A-12\")\n\n# Pytorch version\nfrom examples.pytorch.run_cnn_base import actuator\nactuator(execute_type=\"train\", model_type=\"bert\", model_dir=\"./data/chinese_wwm_pytorch\")\n```\n\n### Bert Base\n+ [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf)\n```python\n# TensorFlow version\nfrom examples.tensorflow.run_basic_bert import actuator\nactuator(model_dir=\"./data/chinese_wwm_L-12_H-768_A-12\", execute_type=\"train\")\n\n# Pytorch version\nfrom examples.pytorch.run_basic_bert import actuator\nactuator(model_dir=\"./data/chinese_wwm_pytorch\", execute_type=\"train\")\n```\n\n### Albert\n+ [ALBERT: A Lite BERT For Self-supervised Learning Of Language Representations](https://arxiv.org/pdf/1909.11942.pdf)\n```python\n# TensorFlow 
version\nfrom examples.tensorflow.run_albert import actuator\nactuator(model_dir=\"./data/albert_small_zh_google\", execute_type=\"train\")\n\n# Pytorch version\nfrom examples.pytorch.run_albert import actuator\nactuator(model_dir=\"./data/albert_chinese_small\", execute_type=\"train\")\n```\n\n### NEZHA\n+ [NEZHA: Neural Contextualized Representation For Chinese Language Understanding](https://arxiv.org/pdf/1909.00204.pdf)\n```python\n# TensorFlow version\nfrom examples.tensorflow.run_nezha import actuator\nactuator(model_dir=\"./data/NEZHA-Base-WWM\", execute_type=\"train\")\n\n# Pytorch version\nfrom examples.pytorch.run_nezha import actuator\nactuator(model_dir=\"./data/nezha-base-wwm\", execute_type=\"train\")\n```\n\n### RoBERTa\n+ [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/pdf/1907.11692.pdf)\n```python\n# TensorFlow version\nfrom examples.tensorflow.run_basic_bert import actuator\nactuator(model_dir=\"./data/chinese_roberta_L-6_H-384_A-12\", execute_type=\"train\")\n\n# Pytorch version\nfrom examples.pytorch.run_basic_bert import actuator\nactuator(model_dir=\"./data/chinese-roberta-wwm-ext\", execute_type=\"train\")\n```\n\n### SimCSE\n+ [SimCSE: Simple Contrastive Learning of Sentence Embeddings](https://arxiv.org/pdf/2104.08821.pdf)\n```python\n# TensorFlow version\nfrom examples.tensorflow.run_simcse import actuator\nactuator(model_dir=\"./data/chinese_wwm_L-12_H-768_A-12\", execute_type=\"train\", model_type=\"bert\")\n\n# Pytorch version\nfrom examples.pytorch.run_simcse import actuator\nactuator(model_dir=\"./data/chinese_wwm_pytorch\", execute_type=\"train\", model_type=\"bert\")\n```\n\n### Poly-Encoder\n+ [Poly-encoders: Transformer Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring](https://arxiv.org/pdf/1905.01969v2.pdf)\n```python\n# TensorFlow version\nfrom examples.tensorflow.run_poly_encoder import actuator\nactuator(model_dir=\"./data/chinese_wwm_L-12_H-768_A-12\", 
execute_type=\"train\", model_type=\"bert\")\n\n# Pytorch version\nfrom examples.pytorch.run_poly_encoder import actuator\nactuator(model_dir=\"./data/chinese_wwm_pytorch\", execute_type=\"train\", model_type=\"bert\")\n```\n\n### ColBERT\n+ [ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT](https://arxiv.org/pdf/2004.12832.pdf)\n```python\n# TensorFlow version\nfrom examples.tensorflow.run_colbert import actuator\nactuator(model_dir=\"./data/chinese_wwm_L-12_H-768_A-12\", execute_type=\"train\", model_type=\"bert\")\n\n# Pytorch version\nfrom examples.pytorch.run_colbert import actuator\nactuator(model_dir=\"./data/chinese_wwm_pytorch\", execute_type=\"train\", model_type=\"bert\")\n```\n\n### RE2\n+ [Simple and Effective Text Matching with Richer Alignment Features](https://arxiv.org/pdf/1908.00300.pdf)\n```python\n# TensorFlow version\nfrom examples.tensorflow.run_re2 import actuator\nactuator(\"./data/config/re2.json\", execute_type=\"train\")\n\n# Pytorch version\nfrom examples.pytorch.run_re2 import actuator\nactuator(\"./data/config/re2.json\", execute_type=\"train\")\n```\n\n\n# Cite\n```\n@misc{text-similarity,\n title={text-similarity},\n author={Bocong Deng},\n year={2021},\n howpublished={\\url{https://github.com/DengBoCong/text-similarity}},\n}\n```\n\n# Reference\n+ [bert4keras](https://github.com/bojone/bert4keras/)\n+ [albert_zh](https://github.com/brightmart/albert_zh)\n+ [HuggingFace](https://huggingface.co/)\n+ [Self-Attention with Relative Position Representations](https://arxiv.org/pdf/1803.02155.pdf)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdengbocong%2Ftext-similarity","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdengbocong%2Ftext-similarity","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdengbocong%2Ftext-similarity/lists"}