{"id":15653248,"url":"https://github.com/x-tabdeveloping/neofuzz","last_synced_at":"2025-04-07T06:10:30.973Z","repository":{"id":156304948,"uuid":"617458085","full_name":"x-tabdeveloping/neofuzz","owner":"x-tabdeveloping","description":"Blazing fast fuzzy text search for Python.","archived":false,"fork":false,"pushed_at":"2025-01-21T09:00:58.000Z","size":454,"stargazers_count":44,"open_issues_count":3,"forks_count":5,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-31T05:05:18.176Z","etag":null,"topics":["fuzzy","llm","nlp","python","search","semantic"],"latest_commit_sha":null,"homepage":"https://x-tabdeveloping.github.io/neofuzz/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/x-tabdeveloping.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-03-22T12:44:26.000Z","updated_at":"2025-03-28T15:45:57.000Z","dependencies_parsed_at":"2024-10-23T03:35:13.046Z","dependency_job_id":"1bc5c0e9-951e-49d0-a5f9-152802cea099","html_url":"https://github.com/x-tabdeveloping/neofuzz","commit_stats":{"total_commits":47,"total_committers":3,"mean_commits":"15.666666666666666","dds":0.06382978723404253,"last_synced_commit":"0e59f99b482594ca631437025595c0dfe7251979"},"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/x-tabdeveloping%2Fneofuzz","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/x-tabdeveloping%2Fneofuzz/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHu
b/repositories/x-tabdeveloping%2Fneofuzz/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/x-tabdeveloping%2Fneofuzz/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/x-tabdeveloping","download_url":"https://codeload.github.com/x-tabdeveloping/neofuzz/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247601448,"owners_count":20964864,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["fuzzy","llm","nlp","python","search","semantic"],"created_at":"2024-10-03T12:45:07.645Z","updated_at":"2025-04-07T06:10:30.938Z","avatar_url":"https://github.com/x-tabdeveloping.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":["General-Purpose Machine Learning"],"readme":"\u003cimg align=\"left\" width=\"82\" height=\"82\" src=\"docs/_static/logo.svg\"\u003e\n\n# Neofuzz\n\n\u003cbr\u003e\n\nBlazing fast, lightweight and customizable fuzzy and semantic text search in Python.\n\n## Introduction ([Documentation](https://x-tabdeveloping.github.io/neofuzz/))\nNeofuzz is a fuzzy search library based on vectorization and approximate nearest neighbour\nsearch techniques.\n\n### New in version 0.3.0\nNow you can reorder your search results using Levenshtein distance!\nSometimes n-gram processes or vectorized processes don't quite order the results correctly.\nIn these cases you can retrieve a higher number of examples from the indexed corpus, then refine those results with Levenshtein distance.\n\n```python\nfrom neofuzz import 
char_ngram_process\n\nprocess = char_ngram_process()\nprocess.index(corpus)\n\nprocess.extract(\"your query\", limit=30, refine_levenshtein=True)\n```\n\n### Why is Neofuzz fast?\nMost fuzzy search libraries rely on optimizing the hell out of the same couple of fuzzy search algorithms (Hamming distance, Levenshtein distance). Unfortunately, because of the complexity of these algorithms, sometimes no amount of optimization will get you the speed that you want.\n\nNeofuzz starts from the realization that you can’t go above a certain speed limit by relying on traditional algorithms, and uses text vectorization and approximate nearest neighbour search in the vector space to speed up this process.\n\nWhen it comes to the dilemma of speed versus accuracy, Neofuzz goes full-on speed.\n\n### When should I choose Neofuzz?\n - You need to do repeated searches in the same corpus.\n - Levenshtein and Hamming distance are simply not fast enough.\n - You are willing to sacrifice the quality of the results for speed.\n - You don’t mind that the up-front computation to index a corpus might take time.\n - You have very long strings, where other methods would be impractical.\n - You want to rely on semantic content.\n - You need a drop-in replacement for TheFuzz.\n\n### When should I NOT choose Neofuzz?\n - The corpus changes all the time, or you only want to do one search in a corpus. 
(It might still give a speed-up in that case, though.)\n - You value the quality of the results over speed.\n - You don’t mind slower searches in favor of no indexing.\n - You have a small corpus with short strings.\n\n## [Usage](https://x-tabdeveloping.github.io/neofuzz/getting_started.html)\n\nYou can install Neofuzz from PyPI:\n\n```bash\npip install neofuzz\n```\n\nIf you want a plug-and-play experience, you can create a generally good quick-and-dirty\nprocess with `char_ngram_process()`.\n\n```python\nfrom neofuzz import char_ngram_process\n\n# We create a process that takes character 1 to 5-grams as features for\n# vectorization and uses a tf-idf weighting scheme.\n# We will use cosine distance for the nearest neighbour search.\nprocess = char_ngram_process(ngram_range=(1, 5), metric=\"cosine\", tf_idf=True)\n\n# We index the options that we are going to search in\nprocess.index(options)\n\n# Then we can extract the ten most similar items the same way as in\n# thefuzz\nprocess.extract(\"fuzz\", limit=10)\n# Example output:\n# [('fuzzer', 67),\n#  ('Januzzi', 30),\n#  ('Figliuzzi', 25),\n#  ('Fun', 20),\n#  ('Erika_Petruzzi', 20),\n#  ('zu', 20),\n#  ('Zo', 18),\n#  ('blog_BuzzMachine', 18),\n#  ('LW_Todd_Bertuzzi', 18),\n#  ('OFU', 17)]\n```\n\n## [Custom Processes](https://x-tabdeveloping.github.io/neofuzz/custom_vectorizer.html)\n\nYou can customize Neofuzz’s behaviour by making a custom process.\nUnder the hood every Neofuzz Process relies on the same two components:\n\n - A vectorizer, which turns texts into a vectorized form, and can be fully customized.\n - Approximate Nearest Neighbour search, which indexes the vector space and can find neighbours of a given vector very quickly. This component is fixed to be PyNNDescent, but all of its parameters are exposed in the API, so its behaviour can also be altered at will.\n\n### Words as Features\n\nIf you’re more interested in the words/semantic content of the text you can also use them as features. 
This can be especially useful with longer texts, such as literary works.\n\n```python\nfrom neofuzz import Process\nfrom sklearn.feature_extraction.text import TfidfVectorizer\n\n# Vectorization with words is the default in sklearn.\nvectorizer = TfidfVectorizer()\n\n# We use cosine distance because it's way better for high-dimensional spaces.\nprocess = Process(vectorizer, metric=\"cosine\")\n```\n\n### Dimensionality Reduction\n\nYou might find that the speed of your fuzzy search process is not sufficient. In this case it might be desirable to reduce the dimensionality of the produced vectors with some matrix decomposition method or topic model.\n\nHere, for example, I use NMF (an excellent topic model, and an incredibly fast one too) to speed up my fuzzy search pipeline.\n\n```python\nfrom neofuzz import Process\nfrom sklearn.feature_extraction.text import TfidfVectorizer\nfrom sklearn.decomposition import NMF\nfrom sklearn.pipeline import make_pipeline\n\n# Vectorization with tokens again\nvectorizer = TfidfVectorizer()\n# Dimensionality reduction method to 20 dimensions\nnmf = NMF(n_components=20)\n# Create a pipeline of the two\npipeline = make_pipeline(vectorizer, nmf)\n\nprocess = Process(pipeline, metric=\"cosine\")\n```\n\n### Semantic Search/Large Language Models\n\nWith Neofuzz you can easily use semantic embeddings to your advantage: attention-based language models (Bert), simple neural word or document embeddings (Word2Vec, Doc2Vec, FastText, etc.) 
or even OpenAI’s LLMs.\n\nWe recommend you try embetter, which has a lot of built-in sklearn compatible vectorizers.\n```bash\npip install embetter\n```\n\n```python\nfrom embetter.text import SentenceEncoder\nfrom neofuzz import Process\n\n# Here we will use a pretrained Bert sentence encoder as vectorizer\nvectorizer = SentenceEncoder(\"all-distilroberta-v1\")\n# Then we make a process with the language model\nprocess = Process(vectorizer, metric=\"cosine\")\n\n# Remember that the options STILL have to be indexed even though you have a pretrained vectorizer\nprocess.index(options)\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fx-tabdeveloping%2Fneofuzz","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fx-tabdeveloping%2Fneofuzz","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fx-tabdeveloping%2Fneofuzz/lists"}