{"id":23441907,"url":"https://github.com/genivia/ugrep-indexer","last_synced_at":"2026-03-03T04:40:23.282Z","repository":{"id":186596848,"uuid":"675413399","full_name":"Genivia/ugrep-indexer","owner":"Genivia","description":"A monotonic indexer to speed up grepping by \u003e10x (ugrep-indexer is now part of ugrep 6.0 and greater)","archived":false,"fork":false,"pushed_at":"2025-06-22T20:21:08.000Z","size":2003,"stargazers_count":77,"open_issues_count":2,"forks_count":2,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-12-24T23:48:53.124Z","etag":null,"topics":["grep","grep-search","index","indexing","regex","search"],"latest_commit_sha":null,"homepage":"https://ugrep.com","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Genivia.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.txt","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2023-08-06T20:36:58.000Z","updated_at":"2025-12-21T17:06:12.000Z","dependencies_parsed_at":"2023-12-19T04:22:32.845Z","dependency_job_id":"959e85d3-e497-441c-9c86-9e6476e075d6","html_url":"https://github.com/Genivia/ugrep-indexer","commit_stats":{"total_commits":79,"total_committers":2,"mean_commits":39.5,"dds":"0.025316455696202556","last_synced_commit":"3f944fe2693b4f560da139955bc1b85815eae979"},"previous_names":["genivia/ugrep-indexer"],"tags_count":10,"template":false,"template_full_name":null,"purl":"pkg:github/Genivia/ugrep-indexer","repository_url":"https://repos.ecosyste.ms/api/v1/
hosts/GitHub/repositories/Genivia%2Fugrep-indexer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Genivia%2Fugrep-indexer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Genivia%2Fugrep-indexer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Genivia%2Fugrep-indexer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Genivia","download_url":"https://codeload.github.com/Genivia/ugrep-indexer/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Genivia%2Fugrep-indexer/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30032063,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-03T03:27:35.548Z","status":"ssl_error","status_checked_at":"2026-03-03T03:27:09.213Z","response_time":61,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["grep","grep-search","index","indexing","regex","search"],"created_at":"2024-12-23T17:19:26.387Z","updated_at":"2026-03-03T04:40:23.268Z","avatar_url":"https://github.com/Genivia.png","language":"C","funding_links":[],"categories":[],"sub_categories":[],"readme":"A monotonic indexer to speed up 
grepping\n========================================\n\n[![ci](https://github.com/Genivia/ugrep-indexer/actions/workflows/ci.yml/badge.svg?branch=main)](https://github.com/Genivia/ugrep-indexer/actions?query=branch%3Amain++)\n\nThe *ugrep-indexer* utility recursively indexes files to speed up recursive\ngrepping.\n\nThe contents of archives and compressed files are also indexed when specified\nwith a command-line option.  This eliminates searching them when none of their\ncontents match the specified patterns.\n\n[ugrep](https://github.com/Genivia/ugrep) is a grep-compatible fast file\nsearcher that supports index-based searching.  Index-based search can be\nsignificantly faster on slow file systems and when file system caching is\nineffective: if the file system on a drive searched is not cached in RAM, i.e.\nit is \"cold\", then indexing will speed up search.  It only searches those files\nthat may match a specified regex pattern by using an index of the file.  This\nindex allows for a quick check if there is a potential match, thus we avoid\nsearching all files.\n\nIndex-based search with ugrep is safe and never skips updated files that may\nnow match.  If any files and directories are added or changed after indexing,\nthen searching will always search these additions and changes made to the file\nsystem by comparing file and directory time stamps to the indexing time stamp.\n\nWhen many files are added or changed after indexing, then we might want to\nre-index to bring the indexes up to date.  
Re-indexing is incremental, so it\nwill not take as much time as the initial indexing process.\n\nA typical but small example of an index-based search, for example on the ugrep\nv3.12.6 repository placed on a separate drive:\n\n    $ cd drive/ugrep\n    $ ugrep-indexer -I\n\n    12247077 bytes scanned and indexed with 19% noise on average\n        1317 files indexed in 28 directories\n          28 new directories indexed\n        1317 new files indexed\n           0 modified files indexed\n           0 deleted files removed from indexes\n         128 binary files ignored with --ignore-binary\n           0 symbolic links skipped\n           0 devices skipped\n     5605227 bytes indexing storage increase at 4256 bytes/file\n\nNormal searching on a cold file system without indexing takes 1.02 seconds\nafter unmounting the `drive` and mounting again to clear FS cache to record the\neffect of indexing:\n\n    $ ugrep -I -l 'std::chrono' --stats\n    src/ugrep.cpp\n\n    Searched 1317 files in 28 directories in 1.02 seconds with 8 threads: 1 matching (0.07593%)\n\nRipgrep 13.0.0 takes longer with 1.18 seconds for the same cold search (ripgrep\nskips binary files by default, so option `-I` is not specified):\n\n    $ time rg -l 'std::chrono'\n    src/ugrep.cpp\n        1.18 real         0.01 user         0.06 sys\n\nBy contrast, with indexing, searching a cold file system only takes 0.0487\nseconds with ugrep, which is 21 times faster, after unmounting `drive` and\nmounting again to clear FS cache to record the effect of indexing:\n\n    $ ugrep --index -I -l 'std::chrono' --stats\n    src/ugrep.cpp\n\n    Searched 1317 files in 28 directories in 0.0487 seconds with 8 threads: 1 matching (0.07593%)\n    Skipped 1316 of 1317 files with non-matching indexes\n\nThere is always some variance in the elapsed time with 0.0487 seconds the best\ntime of four search runs that produced a search time range of 0.0487 (21x speed\nup) to 0.0983 seconds (10x speed up).\n\nThe speed 
increase may be significantly higher in general compared to this\nsmall demo, depending on several factors, the size of the files indexed, the\nread speed of the file system and assuming most files are cold.\n\nThe indexing algorithm that I designed is *provably monotonic*: a higher\naccuracy guarantees an increased search performance by reducing the false\npositive rate, but also increases index storage overhead.  Likewise, a lower\naccuracy decreases search performance, but also reduces the index storage\noverhead.  Therefore, I named my indexer a *monotonic indexer*.\n\nIf file storage space is at a premium, then we can dial down the index storage\noverhead by specifying a lower indexing accuracy.\n\nIndexing the example from above with level 0 (option `-0`) reduces the indexing\nstorage overhead by 8.6 times, from 4256 bytes per file to a measly 490 bytes\nper file:\n\n    12247077 bytes scanned and indexed with 42% noise on average\n        1317 files indexed in 28 directories\n           0 new directories indexed\n        1317 new files indexed\n           0 modified files indexed\n           0 deleted files removed from indexes\n         128 binary files ignored with --ignore-binary\n           0 symbolic links skipped\n           0 devices skipped\n      646123 bytes indexing storage increase at 490 bytes/file\n\nIndexed search is still a lot faster by 12x than non-indexed for this example,\nwith 16 files actually searched (15 false positives):\n\n    Searched 1317 files in 28 directories in 0.0722 seconds with 8 threads: 1 matching (0.07593%)\n    Skipped 1301 of 1317 files with non-matching indexes\n\nRegex patterns that are more complex than this example may have a higher false\npositive rate naturally, which is the rate of files that are considered\npossibly matching when they are not.  
A higher false positive rate may reduce\nsearch speeds when the rate is large enough to be impactful.\n\nThe following table shows how indexing accuracy affects indexing storage\nand the average noise per file indexed.  The rightmost columns show the search\nspeed and false positive rate for `ugrep --index -I -l 'std::chrono'`:\n\n| acc. | index storage (KB) | average noise | false positives | search time (s) |\n| ---- | -----------------: | ------------: | --------------: | --------------: |\n| `-0` |                631 |           42% |              15 |          0.0722 |\n| `-1` |               1276 |           39% |               1 |          0.0506 |\n| `-2` |               1576 |           36% |               0 |          0.0487 |\n| `-3` |               2692 |           31% |               0 |            unch |\n| `-4` |               2966 |           28% |               0 |            unch |\n| `-5` |               4953 |           23% |               0 |            unch |\n| `-6` |               5474 |           19% |               0 |            unch |\n| `-7` |               9513 |           15% |               0 |            unch |\n| `-8` |              10889 |           11% |               0 |            unch |\n| `-9` |              13388 |            7% |               0 |            unch |\n\nIf the specified regex matches many more possible patterns, for example with\nthe search `ugrep --index -I -l '(todo|TODO)[: ]'`, then we may observe a\nhigher rate of false positives among the 1317 files searched, resulting in\nslightly longer search times:\n\n| acc. 
| false positives | search time (s) |\n| ---- | --------------: | --------------: |\n| `-0` |             189 |           0.292 |\n| `-1` |              69 |           0.122 |\n| `-2` |              43 |           0.103 |\n| `-3` |              19 |           0.101 |\n| `-4` |              16 |           0.097 |\n| `-5` |               2 |           0.096 |\n| `-6` |               1 |            unch |\n| `-7` |               0 |            unch |\n| `-8` |               0 |            unch |\n| `-9` |               0 |            unch |\n\nAccuracy `-4` is the default (changed from `-5` in older releases), which\ntends to work very well to search with regex patterns of modest complexity.\n\nOne word of caution.  There is always a tiny bit of overhead to check the\nindexes.  This means that if all files are already cached in RAM, because files\nwere searched or read recently, then indexing will not necessarily speed up\nsearch.  In that case a non-indexed search might be faster.\nFurthermore, an index-based search has a longer start-up time.  This start-up\ntime increases when Unicode character classes and wildcards are used that must\nbe converted to hash tables.\n\nTo summarize, index-based search is most effective when searching a lot of\ncold files and when regex patterns aren't matching too much, i.e. we want to\nlimit the use of unlimited repeats `*` and `+` and limit the use of Unicode\ncharacter classes when possible.  
This reduces the ugrep start-up time and\nlimits the rate of false positive pattern matches (see also Q\u0026A below).\n\nQuick examples\n--------------\n\nRecursively and incrementally index all non-binary files showing progress:\n\n    ugrep-indexer -I -v\n\nRecursively and incrementally index all non-binary files, including non-binary\nfiles stored in archives and in compressed files, showing progress:\n\n    ugrep-indexer -z -I -v\n\nIncrementally index all non-binary files, including archives and compressed\nfiles, show progress, follow symbolic links to files (but not to directories),\nbut do not index files and directories matching the globs in .gitignore:\n\n    ugrep-indexer -z -I -v -S -X\n\nForce re-indexing of all non-binary files, including archives and compressed\nfiles, follow symbolic links to files (but not to directories), but do not\nindex files and directories matching the globs in .gitignore:\n\n    ugrep-indexer -f -z -I -v -S -X\n\nSame, but decrease index file storage to a minimum by decreasing indexing\naccuracy from 4 (default) to 0:\n\n    ugrep-indexer -f -0 -z -I -v -S -X\n\nIncrease search performance by increasing the indexing accuracy from 4\n(default) to 7 at a cost of larger index files:\n\n    ugrep-indexer -f7zIvSX\n\nRecursively delete all hidden `._UG#_Store` index files to restore the\ndirectory tree to non-indexed:\n\n    ugrep-indexer -d\n\nBuild steps\n-----------\n\nConfigure and compile with:\n\n    ./build.sh\n\nIf desired but not required, install with:\n\n    sudo make install\n\nFuture enhancements\n-------------------\n\n- Add an option to create one index file, e.g. specified explicitly to ugrep.\n  This could further improve indexed search speed if the index file is located\n  on a fast file system.  
Otherwise, do not expect much improvement or even\n  possible slow down, since a single index file cannot be searched concurrently\n  and more index entries will be checked when in fact directories are skipped\n  (skipping their indexes too).  Experiments will tell.  *A critical caveat of\n  this approach is that index-based search with `ugrep --index` is no longer\n  safe: new and modified files that are not indexed yet will not be searched.*\n\n- Each N-gram Bloom filter has its own \"bit tier\" in the hash table to avoid\n  hash conflicts.  For example 2-grams do not share any bits with 3-grams.\n  This ensures that we never have any false positives with characters being\n  falsely matched that are actually not part of the pattern.  However, the\n  1-gram (single character) bit space is small (at most 256 bits).  Therefore,\n  we waste some bits when hash tables are larger.  A possible approach to\n  reduce waste is to combine 1-grams with 2-grams to share the same bit space.\n  This is easy to do if we consider a 1-gram being equal to a 2-gram with the\n  second character set to `\\0` (NUL).  We can lower the false positive rate\n  with a second 2-gram hash based on a different hash method.  Or we can expand\n  the \"bit tiers\" from 8 to 9 to store 9-grams.  That will increase the\n  indexing accuracy for longer patterns (9 or longer) at no additional cost.\n  On the other hand, that change may cause more false positives when characters\n  are falsely matched that are not part of the pattern; we lose the advantage\n  of a perfect 1-gram accuracy.\n\nQ\u0026A\n---\n\n### Q: How does it work?\n\nIndexing adds a hidden index file `._UG#_Store` to each directory indexed.\nFiles indexed are scanned (never changed!) by ugrep-indexer to generate index\nfiles.\n\nThe size of the index files depends on the specified accuracy, with `-0` the\nlowest (small index files) and `-9` the highest (large index files).  The\ndefault accuracy is `-4`.  
See the next Q for details on the impact of accuracy\non indexing size versus search speed.\n\nIndexing *never follows symbolic links to directories*, because symbolically\nlinked directories may be located anywhere in a file system, or in another file\nsystem, where we do not want to add index files.  You can still index symbolic\nlinks to files with ugrep-indexer option `-S`.\n\nOption `-v` (`--verbose`) displays the indexing progress and \"noise\" of each\nfile indexed.  Noise is a measure of *entropy* or *randomness* in the input.  A\nhigher level of noise means that indexing was less accurate in representing the\ncontents of a file.  For example, a large file with random data is hard to\nindex accurately and will have a high level of noise.\n\nThe complexity of indexing is linear in the size of a given file to index.\nIn practice it is not a fast process, not as fast as searching, and may take\nsome time to complete a full indexing pass over a large directory tree.  When\nindexing completes, ugrep-indexer displays the results of indexing.  The total\nsize of the indexes added and average indexing noise are also reported.\n\nScanning a file to index results in a 64KB indexing hashes table.  Then, the\nugrep-indexer halves the table with bit compression using bitwise-and as long\nas the target accuracy is not exceeded.  Halving is made possible by the fact\nthat the table encodes hashes for 8 windows at offsets from the start of the\npattern, corresponding to the 8 bits per index hashing table cell.  Combining\nthe two halves of the table may flip some bits to zero from one, which may\ncause a false positive match.  This proves the monotonicity of the indexer.  A\nzero bit hash value indicates a possible match.\n\nThe ugrep-indexer detects \"binary files\", which can be ignored and not indexed\nwith ugrep-indexer option `-I` (`--ignore-binary`).  
This is useful when\nsearching with ugrep option `-I` (`--ignore-binary`) to ignore binary files,\nwhich is a typical scenario.\n\nThe ugrep-indexer obeys .gitignore file exclusions when specified with option\n`-X` (`--ignore-files`).  Ignored files and directories will not be indexed to\nsave file system space.  This works well when searching for files with ugrep\noption `--ignore-files`.\n\nIndexing can be aborted, for example with CTRL-C, which will not result in a\nloss of search capability with ugrep, but will leave the directory structure\nonly partially indexed.\n\nOption `-c` checks indexes for stale references and non-indexed files and\ndirectories.\n\nIndexes are deleted with ugrep-indexer option `-d`.\n\nThe ugrep-indexer has been extensively tested by comparing `ugrep --index`\nsearch results to the \"slow\" non-indexed `ugrep` search results on thousands of\nfiles with thousands of random search patterns.\n\nIndex-based search works with all ugrep options except options `-v`\n(`--invert-match`), `--filter`, `-P` (`--perl-regexp`) and `-Z` (`--fuzzy`).\nOption `-c` (`--count`) with `--index` automatically sets `--min-count=1` to\nskip all files with zero matches.\n\nIf any files or directories were updated, added or deleted after indexing, then\nugrep `--index` will always search these files and directories when they are\npresent on the recursive search path.  You can run ugrep-indexer again to\nincrementally update all indexes.\n\nRegex patterns are converted internally by ugrep with option `--index` to a\nform of hash tables for up to the first 16 bytes of the regex patterns\nspecified, possibly shorter in order to reduce construction time when regex\npatterns are complex.  Therefore, the first 8 to 16 characters of a regex\npattern to search are most critical and should not match too much to limit\nso-called false positive matches that may slow down searching.\n\nIn ugrep, a regex pattern is converted to a DFA.  
An indexing hash finite\nautomaton (HFA) is constructed on top of the DFA to compactly represent hash\ntables as state transitions with labelled edges.  This HFA consists of up to\neight layers, each shifted by one byte to represent the next 8-byte window over\nthe pattern.  Each HFA layer encodes index hashes for that part of the pattern.\nThe index hash function chosen is \"additive\", meaning the next byte is added\nwhen hashed with the previous hash.  This is very important as it critically\nreduces the HFA construction overhead.  We can now encode labelled HFA\ntransitions to states as multiple edges with 16-bit hash value ranges instead\nof a set of single edges each with an individual hash value.  To this end, I\nuse my open-ended ranges library `reflex::ORanges\u003cT\u003e` derived from\n`std::set\u003cT\u003e`.\n\nA very simple single string `maybe_match()` function with the prime 61 index\nhash function is given below to demonstrate index-based searching of a single\nstring:\n\n    // prime 61 hashing\n    uint16_t indexhash(uint16_t h, uint8_t b, size_t size)\n    {\n      return ((h \u003c\u003c 6) - h - h - h + b) \u0026 (size - 1);\n    }\n\n    // return possible match of string given array of hashes of size \u003c= 64K (power of two)\n    bool maybe_match(const char *string, uint8_t *hashes, size_t size)\n    {\n      size_t len = strlen(string); // practically we can and should limit len to e.g. 15 or 16\n      for (const char *window = string; len \u003e 0; ++window, --len)\n      {\n        uint16_t h = window[0] \u0026 (size - 1);\n        if (hashes[h] \u0026 0x01)\n          return false;\n        size_t k, n = len \u003c 8 ? 
len : 8;\n        for (k = 1; k \u003c n; ++k)\n        {\n          h = indexhash(h, window[k], size);\n          if (hashes[h] \u0026 (1 \u003c\u003c k))\n            return false;\n        }\n      }\n      return true;\n    }\n\nThe prime 61 hash was chosen among many other possible hashing functions using\na realistic experimental setup.  A candidate hashing function was tested by\nrepeatedly searching a randomly-drawn word from a 100MB Wikipedia file.\nThe word was mutated with one, two or three random letters.  This mutation is\nchecked to make sure it does not correspond to an actual valid word in the\nWikipedia file.  Then the false positive rate was recorded whenever a mutated\nword matched the file.  A hash function with a minimal false positive rate\nshould be a good candidate overall.\n\nBy using a window of 8 (or shorter depending on the pattern length) the false\npositive rate is lower compared to standard Bloom filters.  More specifically,\n*N²* hash functions are used instead of *N* in a Bloom filter.  For shorter\npatterns, *N* is often too small to limit false positives.  Therefore, *N²* is\nmore effective.  It also rejects any pattern from a match that has a character\nanywhere in the first 8 bytes of the pattern that does not actually occur\nanywhere in an indexed file, whereas a standard Bloom filter might have a false\npositive match.  Furthermore, the bit addressing used to index the hashes table\nenables efficient table compression.\n\n### Q: What is indexing accuracy?\n\nIndexing is a form of lossy compression.  The higher the indexing accuracy, the\nfaster ugrep search performance should be by skipping more files that do not\nmatch.  A higher accuracy reduces noise (less lossy).  A high level of noise\ncauses ugrep to sometimes search indexed files that do not match.  We call\nthese \"false positive matches\".  Higher accuracy requires larger index files.\nNormally we expect 4K or less indexing storage per file on average.  
The\nminimum is 128 bytes of index storage per file, excluding the file name and\na 4-byte index header.  The maximum is 64K bytes storage per file for very\nlarge noisy files.\n\nWhen searching indexed files with `ugrep --index --stats`, option `--stats`\nshows the search statistics after the indexing-based search completed.  When\nmany files are not skipped from searching due to indexing noise (i.e. false\npositives), then a higher accuracy helps to increase the effectiveness of\nindexing, which may speed up searching.\n\n### Q: What about UTF-16 and UTF-32 files?\n\nUTF-16 and UTF-32 files are indexed too.  The indexer treats them as UTF-8\nafter internally converting them to UTF-8 to index.\n\n### Q: Why bother indexing archives and compressed files?\n\nDisk space is saved by archiving (zip/tar/pax/cpio) and compressing files.  On\nthe other hand, searching archives and compressed files is much slower than\nsearching regular files.  Indexing archives and compressed files with\n`ugrep-indexer -z -I` and searching them with `ugrep -z -I --index PATTERN`\nspeeds up searching, i.e. when archives and compressed files are skipped.  On\nthe other hand, disk store requirements will increase with the addition of\nindex file entries for archives and compressed files.  Note that when archives\nand compressed files contain binaries, option `-I` ignores these binaries.\n\n### Q: Why is the start-up time of ugrep higher with option --index?\n\nThe start-up overhead of `ugrep --index` to construct indexing hash tables\ndepends on the regex patterns.  If a regex pattern is very \"permissive\", i.e.\nmatches a lot of possible patterns, then the start-up time of `ugrep --index`\nsignificantly increases to compute hash tables.  This may happen when large\nUnicode character classes and wildcards are used, especially with the unlimited\n`*` and `+` repeats.  
To find out how the start-up time increases, use option\n`ugrep --index -r PATTERN /dev/null --stats=vm` to search /dev/null with your\nPATTERN.\n\n### Q: Why are index files not compressed?\n\nIndex files should be very dense in information content and that is the case\nwith this new indexing algorithm for ugrep that I designed and implemented.\nThe denser an index file is, the more compact it accurately represents the\noriginal file data.  That makes it hard or impossible to compress index files.\nThis is also a good indicator of how effective an index file will be in\npractice.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgenivia%2Fugrep-indexer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgenivia%2Fugrep-indexer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgenivia%2Fugrep-indexer/lists"}