{"id":13856786,"url":"https://github.com/nullnull/simstring","last_synced_at":"2025-04-09T18:22:49.509Z","repository":{"id":6100612,"uuid":"7327946","full_name":"nullnull/simstring","owner":"nullnull","description":"A Python implementation of the SimString, a simple and efficient algorithm for approximate string matching.","archived":false,"fork":false,"pushed_at":"2023-10-24T03:47:16.000Z","size":1260,"stargazers_count":123,"open_issues_count":4,"forks_count":15,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-04-01T16:12:32.374Z","etag":null,"topics":["nlp","nlp-library","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nullnull.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2012-12-26T14:21:54.000Z","updated_at":"2025-02-27T18:27:28.000Z","dependencies_parsed_at":"2024-06-19T17:10:21.641Z","dependency_job_id":"661f9433-7bee-43bd-aab5-9d649823b95d","html_url":"https://github.com/nullnull/simstring","commit_stats":{"total_commits":22,"total_committers":1,"mean_commits":22.0,"dds":0.0,"last_synced_commit":"249a5312fce4c8cf98704a877281701f7db524ae"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nullnull%2Fsimstring","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nullnull%2Fsimstring/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nullnull%2Fsimstring/releases","manifests_url":"http
s://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nullnull%2Fsimstring/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nullnull","download_url":"https://codeload.github.com/nullnull/simstring/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248085836,"owners_count":21045224,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["nlp","nlp-library","python"],"created_at":"2024-08-05T03:01:13.261Z","updated_at":"2025-04-09T18:22:49.488Z","avatar_url":"https://github.com/nullnull.png","language":"Python","readme":"# simstring\n[![PyPI - Status](https://img.shields.io/pypi/status/simstring-pure.svg)](https://pypi.org/project/simstring-pure/)\n[![PyPI version](https://badge.fury.io/py/simstring-pure.svg)](https://badge.fury.io/py/simstring-pure)\n[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/simstring-pure.svg)](https://pypi.org/project/simstring-pure/0.0.1/)\n[![MIT License](http://img.shields.io/badge/license-MIT-blue.svg?style=flat)](LICENSE)\n[![CircleCI](https://circleci.com/gh/nullnull/simstring.svg?style=svg)](https://circleci.com/gh/nullnull/simstring)\n[![Maintainability](https://api.codeclimate.com/v1/badges/66eb2018262f03ece8a3/maintainability)](https://codeclimate.com/github/nullnull/simstring/maintainability)\n\n\nA Python implementation of the [SimString](http://www.chokkan.org/software/simstring/index.html.en), a simple and efficient algorithm for approximate string matching.\n\n## Features\nWith this library, you can extract strings/texts which has 
a certain similarity from a large collection of strings/texts. This is useful when developing applications related to language processing.\n\nThis library supports a variety of similarity functions, such as cosine similarity and Jaccard similarity, and supports word N-grams and character N-grams as features. You can also implement your own feature extractor easily.\n\nSimString has the following features:\n\n* Fast algorithm for approximate string retrieval.\n* 100% exact retrieval. Although some algorithms allow misses (false negatives) in exchange for faster query response, SimString is guaranteed to achieve 100% correct retrieval with fast query response.\n* Unicode support.\n* Extensibility. You can implement your own feature extractor easily.\n* Japanese support. Morpheme N-grams using [MeCab](http://taku910.github.io/mecab/) are supported.\n\n[Please see this paper for more details](http://www.aclweb.org/anthology/C10-1096).\n\n\n## Install\n```\npip install simstring-pure\n```\n\n## Usage\n```python\nfrom simstring.feature_extractor.character_ngram import CharacterNgramFeatureExtractor\nfrom simstring.measure.cosine import CosineMeasure\nfrom simstring.database.dict import DictDatabase\nfrom simstring.searcher import Searcher\n\ndb = DictDatabase(CharacterNgramFeatureExtractor(2))\ndb.add('foo')\ndb.add('bar')\ndb.add('fooo')\n\nsearcher = Searcher(db, CosineMeasure())\nresults = searcher.search('foo', 0.8)\nprint(results)\n# =\u003e ['foo', 'fooo']\n```\n\nTo use a different feature extractor, measure, or database, simply replace these classes. 
You can also substitute your own classes.\n\n```python\nfrom simstring.feature_extractor.word_ngram import WordNgramFeatureExtractor\nfrom simstring.measure.jaccard import JaccardMeasure\nfrom simstring.database.mongo import MongoDatabase\nfrom simstring.searcher import Searcher\n\ndb = MongoDatabase(WordNgramFeatureExtractor(2))\ndb.add('You are so cool.')\n\nsearcher = Searcher(db, JaccardMeasure())\nresults = searcher.search('You are cool.', 0.8)\nprint(results)\n```\n\n## Supported String Similarity Measures\n- Cosine\n- Dice\n- Jaccard\n\n## Run Tests\n```\ndocker-compose run main bash -c 'source activate simstring \u0026\u0026 python -m unittest discover tests'\n```\n\n## Benchmark\n* About 1 ms per search over 5797 strings (company names).\n* About 14 ms per search over 235544 strings (unabridged dictionary).\n\n#### search from `dev/data/company_names.txt`\n```\n$ python dev/benchmark.py\nbenchmark for using dict as database\n## benchmarker:         release 4.0.1 (for python)\n## python version:      3.7.0\n## python compiler:     GCC 7.2.0\n## python platform:     Linux-4.9.87-linuxkit-aufs-x86_64-with-debian-9.4\n## python executable:   /opt/conda/envs/simstring/bin/python\n## cpu model:           Intel(R) Core(TM) i7-6567U CPU @ 3.30GHz  # 3300.000 MHz\n## parameters:          loop=1, cycle=1, extra=0\n\n##                        real    (total    = user    + sys)\ninitialize database(5797 lines)    0.1227    0.1200    0.1200    0.0000\nsearch text(5797 times)    6.9719    6.9400    6.8900    0.0500\n\n## Ranking                real\ninitialize database(5797 lines)    0.1227  (100.0) ********************\nsearch text(5797 times)    6.9719  (  1.8)\n\n## Matrix                 real    [01]    [02]\n[01] initialize database(5797 lines)    0.1227   100.0  5680.9\n[02] search text(5797 times)    6.9719     1.8   100.0\n\nbenchmark for using Mongo as database\n## benchmarker:         release 4.0.1 (for python)\n## python 
version:      3.7.0\n## python compiler:     GCC 7.2.0\n## python platform:     Linux-4.9.87-linuxkit-aufs-x86_64-with-debian-9.4\n## python executable:   /opt/conda/envs/simstring/bin/python\n## cpu model:           Intel(R) Core(TM) i7-6567U CPU @ 3.30GHz  # 3300.000 MHz\n## parameters:          loop=1, cycle=1, extra=0\n\n##                        real    (total    = user    + sys)\ninitialize database(5797 lines)    4.5762    2.4900    1.9200    0.5700\nsearch text(5797 times)  177.8401   60.9100   47.2500   13.6600\n\n## Ranking                real\ninitialize database(5797 lines)    4.5762  (100.0) ********************\nsearch text(5797 times)  177.8401  (  2.6) *\n\n## Matrix                 real    [01]    [02]\n[01] initialize database(5797 lines)    4.5762   100.0  3886.2\n[02] search text(5797 times)  177.8401     2.6   100.0\n```\n\n#### search from `dev/data/unabridged_dictionary.txt`\n```\n$ python dev/benchmark.py\nbenchmark for using dict as database\n## benchmarker:         release 4.0.1 (for python)\n## python version:      3.7.0\n## python compiler:     GCC 7.2.0\n## python platform:     Linux-4.9.87-linuxkit-aufs-x86_64-with-debian-9.4\n## python executable:   /opt/conda/envs/simstring/bin/python\n## cpu model:           Intel(R) Core(TM) i7-6567U CPU @ 3.30GHz  # 3300.000 MHz\n## parameters:          loop=1, cycle=1, extra=0\n\n##                        real    (total    = user    + sys)\ninitialize database(235544 lines)    2.2576    2.2300    2.1200    0.1100\nsearch text(10000 times)  141.0302  140.6400  139.9600    0.6800\n\n## Ranking                real\ninitialize database(235544 lines)    2.2576  (100.0) ********************\nsearch text(10000 times)  141.0302  (  1.6)\n\n## Matrix                 real    [01]    [02]\n[01] initialize database(235544 lines)    2.2576   100.0  6246.8\n[02] search text(10000 times)  141.0302     1.6   
100.0\n```\n","funding_links":[],"categories":["Python"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnullnull%2Fsimstring","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnullnull%2Fsimstring","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnullnull%2Fsimstring/lists"}