https://github.com/eriknyquist/boyermoore
Boyer-moore in pure python, search for unicode strings in large files quickly
boyer-moore boyer-moore-algorithm boyermoore file-search file-searcher pure-python python3 string-matching unicode utf-8 utf8
- Host: GitHub
- URL: https://github.com/eriknyquist/boyermoore
- Owner: eriknyquist
- License: apache-2.0
- Created: 2022-12-14T04:40:54.000Z (almost 3 years ago)
- Default Branch: master
- Last Pushed: 2022-12-17T19:36:12.000Z (almost 3 years ago)
- Last Synced: 2025-04-06T08:27:35.795Z (6 months ago)
- Topics: boyer-moore, boyer-moore-algorithm, boyermoore, file-search, file-searcher, pure-python, python3, string-matching, unicode, utf-8, utf8
- Language: Python
- Homepage: https://eriknyquist.github.io/boyermoore
- Size: 2.79 MB
- Stars: 22
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.rst
- License: LICENSE
README
.. contents:: **Table Of Contents**

Boyer-Moore in pure python: search for unicode strings quickly in large files
*****************************************************************************

.. |tests_badge| image:: https://github.com/eriknyquist/boyermoore/actions/workflows/tests.yml/badge.svg
.. |cov_badge| image:: https://github.com/eriknyquist/boyermoore/actions/workflows/coverage.yml/badge.svg
.. |codeclimate_badge| image:: https://api.codeclimate.com/v1/badges/a5d499edc22f0a05c533/maintainability
.. |version_badge| image:: https://badgen.net/pypi/v/boyermoore
.. |license_badge| image:: https://badgen.net/pypi/license/boyermoore

|tests_badge| |cov_badge| |codeclimate_badge| |license_badge| |version_badge|

This is an implementation of the Boyer-Moore substring search algorithm in pure python.
It is a shameless copy-paste of the python reference code provided `here `_,
with modifications to support the following additional features:

* Searching in files without reading the whole file into memory, allowing handling of large files
* Full unicode support

See the `API documentation `_ for more details.

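
Searching a file without loading it entirely into memory can be illustrated with a
minimal sketch. This is not the code used by ``boyermoore``: it assumes a simplified
Boyer-Moore-Horspool variant (bad-character rule only), a UTF-8 encoded file, and a
hypothetical function name ``horspool_search_file``. The file is read in fixed-size
chunks, and only the unscanned tail of each chunk is carried over, so matches that span
a chunk boundary are still found::

    def horspool_search_file(pattern, filename, chunk_size=64 * 1024):
        """Return the byte offsets of every occurrence of pattern in the file."""
        needle = pattern.encode("utf-8")   # assumption: the file is UTF-8 encoded
        m = len(needle)
        if m == 0:
            return []

        # Bad-character table: for each byte of the pattern (except the last one),
        # the distance from its last occurrence to the end of the pattern.
        shift = {byte: m - 1 - i for i, byte in enumerate(needle[:-1])}

        offsets = []
        buf = b""    # unscanned bytes currently held in memory
        base = 0     # file offset of buf[0]
        with open(filename, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                buf += chunk
                pos = 0
                while pos + m <= len(buf):
                    if buf[pos:pos + m] == needle:
                        offsets.append(base + pos)
                        pos += 1
                    else:
                        # Skip ahead based on the byte aligned with the pattern's end.
                        pos += shift.get(buf[pos + m - 1], m)
                # Drop everything already scanned; keep only the short tail.
                base += pos
                buf = buf[pos:]
        return offsets

With this layout the buffer never holds more than roughly ``chunk_size`` plus the
pattern length in bytes, regardless of the file size. The returned values are byte
offsets, not character indices, because the comparison runs on the encoded bytes.
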
Installing
----------

Install from ``pip``.

::

    pip install boyermoore

Searching for all occurrences of a substring in a file
------------------------------------------------------

::

    >>> from boyermoore import search_file
    >>>
    >>> offsets = search_file("pattern!", "file.txt") # Find all occurrences of "pattern!" in file "file.txt"
    >>> offsets # Display found occurrences
    [12, 456, 10422] # Pattern occurs at byte offsets 12, 456, and 10422

Searching for the first occurrence of a substring in a file
----------------------------------------------------------

::

    >>> from boyermoore import search_file
    >>>
    >>> offsets = search_file("pattern!", "file.txt", greedy=False) # Find the first occurrence of "pattern!" in file "file.txt"
    >>> offsets # Display found occurrences
    [12] # First occurrence of pattern is at byte offset 12

Performance / Speed test
------------------------

The following section illustrates the average speed of the ``boyermoore.search_file``
function when searching for a unicode string in files of sizes ranging from 1 MB to 2 GB.

The test is implemented in the file ``scripts/speed_test.py`` if you want to inspect the code yourself.

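
A measurement of this kind can be approximated with a small harness. The sketch below
is not the contents of ``scripts/speed_test.py``; the helper names ``make_test_file``
and ``average_search_time``, the filler byte, and the number of runs are all
assumptions made for illustration. It writes a file of a given size with the search
string at the very beginning and the very end, then averages the wall-clock time of
any search function that takes a pattern and a filename::

    import time

    PATTERN = "Hello नमस्ते Привет こんにちは"


    def make_test_file(path, size_bytes, pattern=PATTERN, filler=b"a"):
        """Write a file of exactly size_bytes with pattern at both ends."""
        needle = pattern.encode("utf-8")
        padding = size_bytes - 2 * len(needle)
        if padding < 0 or not filler:
            raise ValueError("size_bytes too small, or filler is empty")
        block = filler * (1024 * 1024)    # write the filler in blocks of about 1 MB
        with open(path, "wb") as f:
            f.write(needle)               # occurrence at byte offset 0
            while padding > 0:
                n = min(padding, len(block))
                f.write(block[:n])
                padding -= n
            f.write(needle)               # occurrence at size_bytes - len(needle)


    def average_search_time(search_fn, pattern, path, runs=3):
        """Average wall-clock seconds of search_fn(pattern, path) over several runs."""
        total = 0.0
        for _ in range(runs):
            start = time.perf_counter()
            search_fn(pattern, path)
            total += time.perf_counter() - start
        return total / runs

With the package installed, ``average_search_time(search_file, PATTERN, path)`` would
time ``boyermoore.search_file`` on the generated file. Absolute numbers depend heavily
on the machine and the Python version.
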
Test environment
################

The test was executed using Python 3.9.13 on a Windows 10 system with an Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
and 32 GB of RAM.

Test methodology
################

The test searches for all occurrences of a fixed unicode string in a series of test files.
The unicode string is::

    Hello नमस्ते Привет こんにちは

("Hello" in English, followed by the Hindi translation, followed by the Russian translation,
followed by the Japanese translation)

Each test file has 2 occurrences of the unicode string, one at the very beginning (byte offset of 0)
and one at the very end (byte offset of [file_length - pattern_length]).

Test results
############

The following table shows the times taken to search for all occurrences of the unicode
string "Hello नमस्ते Привет こんにちは" inside test files of various sizes, and compares
them with a linear search of the same data.

+-----------+----------------+-------------+
| File size | Boyer-moore | Linear time |
| | time (seconds) | (seconds) |
+===========+================+=============+
| 1 MB | 0.01 | 0.08 |
+-----------+----------------+-------------+
| 2 MB | 0.02 | 0.17 |
+-----------+----------------+-------------+
| 32 MB | 0.24 | 2.67 |
+-----------+----------------+-------------+
| 64 MB | 0.47 | 5.24 |
+-----------+----------------+-------------+
| 128 MB | 0.93 | 10.62 |
+-----------+----------------+-------------+
| 256 MB | 1.88 | 21.44 |
+-----------+----------------+-------------+
| 512 MB | 4.16 | 43.76 |
+-----------+----------------+-------------+
| 1 GB | 7.44 | 85.46 |
+-----------+----------------+-------------+
| 2 GB | 16.03 | 175.93 |
+-----------+----------------+-------------+

Contributions
*************

Contributions are welcome; please open a pull request at `<https://github.com/eriknyquist/boyermoore>`_ and ensure that:

#. All existing unit tests pass (run tests via ``python setup.py test``)
#. New unit tests are added to cover any modified/new functionality (run ``python code_coverage.py``
   to ensure that coverage is above 98%)

You will need to install the packages required for development, which are listed in ``dev_requirements.txt``:

::

    pip install -r dev_requirements.txt

If you have any questions about / need help with contributions or unit tests, please
contact Erik at eknyquist@gmail.com.