{"id":16233477,"url":"https://github.com/toflar/state-set-index","last_synced_at":"2025-03-19T14:31:32.761Z","repository":{"id":182965989,"uuid":"656748631","full_name":"Toflar/state-set-index","owner":"Toflar","description":"Implementation of the State Set Index Algorithm in PHP","archived":false,"fork":false,"pushed_at":"2024-12-07T09:49:09.000Z","size":46,"stargazers_count":11,"open_issues_count":2,"forks_count":2,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-02-28T19:21:03.255Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"PHP","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Toflar.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-06-21T14:56:54.000Z","updated_at":"2025-01-16T14:40:29.000Z","dependencies_parsed_at":null,"dependency_job_id":"44c8b10f-0335-478f-9b14-0800d04501f4","html_url":"https://github.com/Toflar/state-set-index","commit_stats":null,"previous_names":["toflar/state-set-index"],"tags_count":9,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Toflar%2Fstate-set-index","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Toflar%2Fstate-set-index/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Toflar%2Fstate-set-index/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Toflar%2Fstate-set-index/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Toflar","download_url":"https://codeload.g
ithub.com/Toflar/state-set-index/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243997108,"owners_count":20380980,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-10T13:12:47.225Z","updated_at":"2025-03-19T14:31:32.435Z","avatar_url":"https://github.com/Toflar.png","language":"PHP","funding_links":[],"categories":[],"sub_categories":[],"readme":"# State Set Index Implementation for PHP\n\nThis library implements the algorithm presented in the 2012 research paper \"Efficient Similarity Search\nin Very Large String Sets\" by Dandy Fenz, Dustin Lange, Astrid Rheinländer, Felix Naumann,\nand Ulf Leser from the Hasso Plattner Institute, Potsdam, Germany and Humboldt-Universität zu Berlin, Department of \nComputer Science, Berlin, Germany.\n\nThe algorithm allows you to efficiently search huge datasets for strings with typos (within a given Levenshtein distance) while keeping the\nindex size small. 
[Download the paper and read all the details here][Paper].\n\n## Installation\n\nUse Composer:\n\n```\ncomposer require toflar/state-set-index\n```\n\n## Usage\n\n```php\nnamespace App;\n\nuse Toflar\\StateSetIndex\\Alphabet\\Utf8Alphabet;\nuse Toflar\\StateSetIndex\\Config;\nuse Toflar\\StateSetIndex\\DataStore\\InMemoryDataStore;\nuse Toflar\\StateSetIndex\\StateSet\\InMemoryStateSet;\nuse Toflar\\StateSetIndex\\StateSetIndex;\n\n$stateSetIndex = new StateSetIndex(\n    new Config(6, 4),\n    new Utf8Alphabet(),\n    new InMemoryStateSet(),\n    new InMemoryDataStore()\n);\n\n$stateSetIndex-\u003eindex(['Mueller', 'Müller', 'Muentner', 'Muster', 'Mustermann']);\n$stateSetIndex-\u003efind('Mustre', 2); // Will return ['Muster'];\n```\n\n## Configuration\n\nYou can configure the maximum index length and maximum alphabet size with the `Config` object. Read the\npaper for details on what they do. There are no universally recommended values, as they depend heavily on what\nyou want to index and/or search for.\n\n## Customization\n\nThe library ships with the algorithm ready to use. The main customization areas are\nthe alphabet (the way it maps characters to labels) and the storage of the state set and the indexed data, if you want to make the index\npersistent. Hence, there are three interfaces that allow you to implement your own logic:\n\n* The `AlphabetInterface` is straightforward. It consists only of a `map(string $char, int $alphabetSize)` method \n  which the library needs to map characters to an internal label. Whether you load/store the alphabet in some \n  database is up to you. The library ships with an `InMemoryAlphabet` for reference and simple use cases. You don't \n  even need to store the alphabet, as the UTF-8 code points already provide one; that's what `Utf8Alphabet` is \n  for. If you don't want to customize the labels, use `Utf8Alphabet`.\n* The `StateSetInterface` is responsible for loading and storing information about the state set of your index. 
Again, \n  how you load/store the state set in some database is up to you. The library ships with an `InMemoryStateSet` \n  for reference, simple use cases, and tests.\n* The `DataStoreInterface` is responsible for storing the string you index alongside its assigned state. Sometimes \n  you want to completely customize storage, in which case you can use the `NullDataStore` and only use the \n  assignments you get as a return value from calling `$stateSetIndex-\u003eindex()`.\n\nYou can not only ask for the final matching results using `$stateSetIndex-\u003efind('Mustre', 2)`, which are \nalready filtered using a multibyte implementation of the Levenshtein algorithm, but you can also access intermediate \nresults, which you can use, for example, to search your own database for states:\n\n* `$stateSetIndex-\u003efindMatchingStates('Mustre', 2)` returns the matching states only.\n* `$stateSetIndex-\u003efindAcceptedStrings('Mustre', 2)` returns the matching states and the respective accepted strings \n  (unfiltered for false-positives!).\n* `$stateSetIndex-\u003efind('Mustre', 2)` returns the real matches, filtered for false-positives.\n\n[Paper]: https://hpi.de/fileadmin/user_upload/fachgebiete/naumann/publications/PDFs/2012_fenz_efficient.pdf","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftoflar%2Fstate-set-index","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftoflar%2Fstate-set-index","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftoflar%2Fstate-set-index/lists"}