{"id":19798602,"url":"https://github.com/eriknyquist/boyermoore","last_synced_at":"2025-05-01T05:30:25.553Z","repository":{"id":64764677,"uuid":"578028099","full_name":"eriknyquist/boyermoore","owner":"eriknyquist","description":"Boyer-moore in pure python, search for unicode strings in large files quickly","archived":false,"fork":false,"pushed_at":"2022-12-17T19:36:12.000Z","size":2929,"stargazers_count":22,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-06T08:27:35.795Z","etag":null,"topics":["boyer-moore","boyer-moore-algorithm","boyermoore","file-search","file-searcher","pure-python","python3","string-matching","unicode","utf-8","utf8"],"latest_commit_sha":null,"homepage":"https://eriknyquist.github.io/boyermoore","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/eriknyquist.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-12-14T04:40:54.000Z","updated_at":"2025-03-08T13:02:54.000Z","dependencies_parsed_at":"2023-01-29T18:01:06.166Z","dependency_job_id":null,"html_url":"https://github.com/eriknyquist/boyermoore","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eriknyquist%2Fboyermoore","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eriknyquist%2Fboyermoore/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eriknyquist%2Fboyermoore/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eriknyquist%2Fboyermoore/manifests"
,"owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/eriknyquist","download_url":"https://codeload.github.com/eriknyquist/boyermoore/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251830449,"owners_count":21650802,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["boyer-moore","boyer-moore-algorithm","boyermoore","file-search","file-searcher","pure-python","python3","string-matching","unicode","utf-8","utf8"],"created_at":"2024-11-12T07:30:43.889Z","updated_at":"2025-05-01T05:30:24.021Z","avatar_url":"https://github.com/eriknyquist.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n.. contents:: **Table Of Contents**\n\nBoyer-Moore in pure python: search for unicode strings quickly in large files\n*****************************************************************************\n\n.. |tests_badge| image:: https://github.com/eriknyquist/boyermoore/actions/workflows/tests.yml/badge.svg\n.. |cov_badge| image:: https://github.com/eriknyquist/boyermoore/actions/workflows/coverage.yml/badge.svg\n.. |codeclimate_badge| image:: https://api.codeclimate.com/v1/badges/a5d499edc22f0a05c533/maintainability\n.. |version_badge| image:: https://badgen.net/pypi/v/boyermoore\n.. 
|license_badge| image:: https://badgen.net/pypi/license/boyermoore\n\n|tests_badge| |cov_badge| |codeclimate_badge| |license_badge| |version_badge|\n\n\nThis is an implementation of the Boyer-Moore substring search algorithm in pure python.\n\nIt is a shameless copy-paste of the python reference code provided `here \u003chttps://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-search_algorithm\u003e`_,\nwith modifications to support the following additional features:\n\n* Searching in files without reading the whole file into memory, allowing handling of large files\n* Full unicode support\n\nSee the `API documentation \u003chttps://eriknyquist.github.io/boyermoore/\u003e`_ for more details.\n\nInstalling\n----------\n\nInstall from ``pip``.\n\n::\n\n    pip install boyermoore\n\nSearching for all occurrences of a substring in a file\n------------------------------------------------------\n\n::\n\n    \u003e\u003e\u003e from boyermoore import search_file\n    \u003e\u003e\u003e\n    \u003e\u003e\u003e offsets = search_file(\"pattern!\", \"file.txt\")                 # Find all occurrences of \"pattern!\" in file \"file.txt\"\n    \u003e\u003e\u003e offsets                                                       # Display found occurrences\n    [12, 456, 10422]                                                  # Pattern occurs at byte offsets 12, 456, and 10422\n\nSearching for the first occurrence of a substring in a file\n-----------------------------------------------------------\n\n::\n\n    \u003e\u003e\u003e from boyermoore import search_file\n    \u003e\u003e\u003e\n    \u003e\u003e\u003e offsets = search_file(\"pattern!\", \"file.txt\", greedy=False)   # Find the first occurrence of \"pattern!\" in file \"file.txt\"\n    \u003e\u003e\u003e offsets                                                       # Display found occurrences\n    [12]                                                              # First occurrence of pattern is at byte offset 12\n\nPerformance / 
Speed test\n------------------------\n\nThe following section illustrates the average speed of the ``boyermoore.search_file``\nfunction when searching for a unicode string in files of sizes ranging from 1MB to 2GB.\n\nThe test is implemented in the file ``scripts/speed_test.py`` if you want to inspect the code yourself.\n\nTest environment\n################\n\nThe test was executed using Python 3.9.13 on a Windows 10 system with an Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz\nand 32 GB of RAM.\n\nTest methodology\n################\n\nThe test searches for all occurrences of a fixed unicode string in a series of test files.\nThe unicode string is:\n\n::\n\n    Hello नमस्ते Привет こんにちは\n\n(\"Hello\" in English, followed by the Hindi translation, followed by the Russian translation,\nfollowed by the Japanese translation)\n\nEach test file has 2 occurrences of the unicode string, one at the very beginning (byte offset of 0)\nand one at the very end (byte offset of [file_length - pattern_length]).\n\nTest results\n############\n\nThe following table shows the times taken to search for all occurrences of the unicode\nstring \"Hello नमस्ते Привет こんにちは\" inside test files of various sizes, and compares\nthem to a linear search of the same data.\n\n+-----------+----------------+-------------+\n| File size | Boyer-moore    | Linear time |\n|           | time (seconds) | (seconds)   |\n+===========+================+=============+\n| 1 MB      | 0.01           | 0.08        |\n+-----------+----------------+-------------+\n| 2 MB      | 0.02           | 0.17        |\n+-----------+----------------+-------------+\n| 32 MB     | 0.24           | 2.67        |\n+-----------+----------------+-------------+\n| 64 MB     | 0.47           | 5.24        |\n+-----------+----------------+-------------+\n| 128 MB    | 0.93           | 10.62       |\n+-----------+----------------+-------------+\n| 256 MB    | 1.88           | 21.44       |\n+-----------+----------------+-------------+\n| 512 
MB    | 4.16           | 43.76       |\n+-----------+----------------+-------------+\n| 1 GB      | 7.44           | 85.46       |\n+-----------+----------------+-------------+\n| 2 GB      | 16.03          | 175.93      |\n+-----------+----------------+-------------+\n\nContributions\n*************\n\nContributions are welcome. Please open a pull request at `\u003chttps://github.com/eriknyquist/boyermoore\u003e`_ and ensure that:\n\n#. All existing unit tests pass (run tests via ``python setup.py test``)\n#. New unit tests are added to cover any modified/new functionality (run ``python code_coverage.py``\n   to ensure that coverage is above 98%)\n\nYou will need to install the packages required for development, which are listed in ``dev_requirements.txt``:\n\n::\n\n    pip install -r dev_requirements.txt\n\nIf you have any questions about contributions or unit tests, or need help with them, please\ncontact Erik at eknyquist@gmail.com.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feriknyquist%2Fboyermoore","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Feriknyquist%2Fboyermoore","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feriknyquist%2Fboyermoore/lists"}