{"id":17166404,"url":"https://github.com/andreaferretti/cello","last_synced_at":"2025-04-10T20:12:32.238Z","repository":{"id":66316524,"uuid":"80541719","full_name":"andreaferretti/cello","owner":"andreaferretti","description":"A string library","archived":false,"fork":false,"pushed_at":"2022-08-31T06:25:23.000Z","size":236,"stargazers_count":79,"open_issues_count":3,"forks_count":10,"subscribers_count":19,"default_branch":"master","last_synced_at":"2025-03-24T17:53:11.545Z","etag":null,"topics":["burrows-wheeler-transform","fm-index","string-search","strings","succinct","suffix-array","wavelet-tree"],"latest_commit_sha":null,"homepage":"https://andreaferretti.github.io/cello/","language":"Nim","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/andreaferretti.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-01-31T17:03:56.000Z","updated_at":"2025-03-14T03:06:47.000Z","dependencies_parsed_at":"2023-02-25T05:00:08.929Z","dependency_job_id":null,"html_url":"https://github.com/andreaferretti/cello","commit_stats":{"total_commits":138,"total_committers":2,"mean_commits":69.0,"dds":0.01449275362318836,"last_synced_commit":"c397c78820464987a8d872a0e5051e6a6af326c8"},"previous_names":[],"tags_count":9,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andreaferretti%2Fcello","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andreaferretti%2Fcello/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andreaferretti%2F
cello/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andreaferretti%2Fcello/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/andreaferretti","download_url":"https://codeload.github.com/andreaferretti/cello/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248288361,"owners_count":21078903,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["burrows-wheeler-transform","fm-index","string-search","strings","succinct","suffix-array","wavelet-tree"],"created_at":"2024-10-14T23:05:29.486Z","updated_at":"2025-04-10T20:12:32.216Z","avatar_url":"https://github.com/andreaferretti.png","language":"Nim","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Cello\n\n![logo](https://raw.githubusercontent.com/andreaferretti/cello/master/img/logo.jpg)\n\nCello is a library of [succinct data structures](https://en.wikipedia.org/wiki/Succinct_data_structure),\noriented in particular towards string searching and other string operations.\n\nUsually, searching for patterns in a string takes `O(n)` time, where `n` is\nthe length of the string. Indices can speed up the search, but take additional\nspace, which can be costly for very large strings. A data structure is called\nsuccinct when it takes `n + o(n)` space, where `n` is the space needed to store\nthe data anyway. 
Hence succinct data structures can provide additional\noperations with limited space overhead.\n\nIt turns out that strings admit succinct indices, which do not take much more\nspace than the string itself, but allow for `O(k)` substring search, where `k`\nis the length of the *substring*. Usually, the substring is much shorter than\nthe string, which considerably improves search times. Cello provides such\nindices and many other related string operations.\n\nAn example of usage would be:\n\n```nim\nlet\n  x = someLongString\n  pattern = someShortString\n  index = searchIndex(x)\n  positions = index.search(pattern)\n\necho positions\n```\nMany intermediate data structures are constructed to provide such indices,\nthough, and as they may be of independent interest, we describe them below.\n\nNotice that a string here just stands for a (usually very long) sequence of\nsymbols taken from a (usually small) alphabet. Prototypical examples include\n\n* genomic data, where the alphabet is `A, C, G, T`\n* time series, where each value is represented by a symbol, such as `HIGH`,\n  `MEDIUM`, `LOW`, or `UP`, `DOWN`\n* binary data: where only two values are available, it is often convenient to\n  store the data as bit sequences to save space.\n\nAt the moment all operations are implemented on\n\n```nim\ntype AnyString = string or seq[char] or Spill[char]\n```\n\nwhere [spills](https://github.com/andreaferretti/spills) are just memory-mapped\nsequences. The library may become generic in the future, although this is not\na priority.\n\nNotice that Cello is not Unicode-aware: think more of searching large genomic\nstrings or symbolized time series, rather than using it for internationalized\ntext, although I may consider Unicode operations in the future.\n\n## Versions\n\nRecent versions of Cello (\u003e= 0.2) require Nim \u003e= 0.20. 
For usage with Nim up to\n0.19.4, use Cello 0.1.6.\n\n## Basic operations\n\nThe most common operations that we implement on various kinds of sequence data\nare `rank` and `select`. We first describe them for sequences of bits, which are\nthe foundation we use to store more complex kinds of data.\n\nFor bit sequences, `rank(i)` counts the number of 1 bits in the first `i`\nplaces. The number of 0 bits can easily be obtained as `i - rank(i)`. Conversely,\n`select(i)` finds the position of the `i`-th 1 bit in the sequence. In this\ncase, there is no obvious relation to the position of the `i`-th 0 bit,\nso we provide a similar operation `select0(i)`.\n\nTo ensure that `rank(select(i)) == i`, we define `select(i)` to be 1-based,\nthat is, we count bits starting from 1.\n\nAs a reference, we implement `rank` and `select` on Nim built-in sets, so\nthat for instance the following is valid:\n\n```nim\nlet x = { 13..27, 35..80 }\n\necho x.rank(16)  # 3\necho x.select(3) # 16\n```\n\nMore generally, one can define `rank` and `select` for sequences of symbols\ntaken from a finite alphabet, relative to a certain symbol. Here, `rank(c, i)`\nis the number of symbols equal to `c` among the first `i` symbols, and\n`select(c, i)` is the position of the `i`-th symbol `c` in the sequence.\n\nAgain, we give a reference implementation for strings, so that the following\nis valid:\n\n```nim\nlet x = \"ABRACADABRA\"\n\necho x.rank('A', 8)   # 4\necho x.select('A', 4) # 8\n```\n\nNotice that in both cases, the implementation of `rank` and `select` is a\nnaive implementation which takes `O(i)` operations. More sophisticated data\nstructures make it possible to perform similar operations in constant (for rank) or\nlogarithmic (for select) time, by using indices. 
*Succinct* data structures\ndo this using indices that take at most `o(n)` space in addition\nto the sequence data itself, where `n` is the sequence length.\n\n## Data structures\n\nWe now describe the succinct data structures that will generalize the bitset\nand the string examples above. In doing so, we also need a few intermediate\ndata structures that may be of independent interest.\n\n### Bit arrays\n\nBit arrays are a generalization of Nim's default `set` collections. They can\nbe seen as an ordered sequence of `bool`, which is actually backed by a\n`seq[int]`. We implement random access - both read and write - as well as\nnaive `rank` and `select`. An example follows:\n\n```nim\nvar x = bits(13..27, 35..80)\n\necho x[12]   # false\necho x[13]   # true\nx[12] = true # or incl(x, 12)\necho x[12]   # true\nx[12] = false\n\necho x.rank(16)    # 3\necho x.select(3)   # 16\necho x.select0(30) # 90\n```\n\n### Int arrays\n\nInt arrays are just integer sequences of fixed length. What distinguishes\nthem from the various types `seq[uint64]`, `seq[uint32]`, `seq[uint16]`, `seq[uint8]`\nis that the integers can have any bit length, such as 23.\n\nThey are backed by a bit array, and can be used to store many integer numbers\nof which an upper bound is known without wasting space. For instance, a sequence\nof positive numbers less than 512 can be backed by an int array where each\nnumber has size 9. Using a `seq[uint16]` would almost double the space\nconsumption.\n\nMost sequence operations are available, but they cannot grow beyond the initial\ncapacity. Here is an example:\n\n```nim\nvar x = ints(200, 13) # 200 ints at most 2^13 - 1\n\nx.add(123)\nx.add(218)\nx.add(651)\necho x[2]   # 651\nx[12] = 1234\necho x[12]   # 1234\n\necho x.len       # 13\necho x.capacity  # 200\n```\n\n### RRR\n\nThe [RRR](http://alexbowe.com/rrr/) bit vector is the first of our collections\nthat is actually succinct. 
It consists of a bit array, plus two int arrays\nthat store `rank(i)` values for various `i`, at different scales.\n\nIt can be created from a bit array, and allows constant time `rank` and\nlogarithmic time `select` and `select0`.\n\n```nim\nlet b: BitArray = ...\nlet r = rrr(b)\n\necho r.rank(123456)\necho r.select(123456)\necho r.select0(123456)\n```\n\nTo convince oneself that the structure really is succinct, `stats(rrr)` returns\na data structure that shows the space taken (in bits) by the bit array, as\nwell as the two auxiliary indices.\n\n[Reference](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.538.8528\u0026rep=rep1\u0026type=pdf)\n\n### Wavelet tree\n\nThe [wavelet tree](http://alexbowe.com/wavelet-trees/) is a tree constructed\nin the following way. An input string over a finite alphabet is given. The\nalphabet is split into two parts - the left and the right one, call them L and R.\n\nFor each character of the string, we use a 1 bit to denote that the character\nbelongs to R and a 0 bit to denote that it belongs to L. In this way, we\nobtain a bit sequence. 
The node stores the bit sequence as an RRR structure,\nand has two children: the one to the left is the wavelet tree associated to\nthe substring composed by the characters in L, taken in order, and similarly\nfor the right child.\n\nThis structure allows computing `rank(c, i)`, where `c` is a character in the\nalphabet, in time `O(log(l))`, and `select(c, i)` in time `O(log(l)log(n))`\nwhere `l` is the size of the alphabet and `n` is the size of the string.\nIt also allows `O(log(l))` random access to read elements of the string.\n\nIt can be used as follows:\n\n```nim\nlet\n  x = \"ACGGTACTACGAGAGTAGCAGTTTAGCGTAGCATGCTAGCG\"\n  w = waveletTree(x)\n\necho w.rank('A', 20)   # 7\necho w.select('A', 7)  # 20\necho w[12]             # 'G'\n```\n\n[Reference](http://people.unipmn.it/manzini/papers/icalp06.pdf)\n\n### Rotated strings\n\nThe next ingredient that we need is the Burrows-Wheeler transform of a string.\nIt can be implemented using string rotations, so that's what we implement\nfirst. It turns out that this implementation is too slow for our purposes,\nbut rotated strings may be useful anyway, so we left them in.\n\nA rotated string is just a view over a string, rotated by a certain amount\nand wrapping around the end of the string. If the underlying string is a `var`,\nour implementation reuses that memory (which is then shared) to avoid\ncopying the string. We just implement random access and printing:\n\n```nim\nvar\n  s = \"The quick brown fox jumps around the lazy dog\"\n  t = s.rotate(20)\n\necho t[10] # n\necho t[20] # y\n\nt[18] = 'e'\n\necho s # The quick brown fox jumps around the lezy dog\necho t # jumps around the lezy dogThe quick brown fox\n```\n\n### Suffix array\n\nThe suffix array of a string is a permutation of the numbers from 0 up to the\nstring length excluded. The permutation is obtained by considering, for each\n`i`, the suffix starting at `i`, and sorting these strings in lexicographical\norder. 
The resulting order is the suffix array.\n\nHere the suffix array is represented as an IntArray. It can be obtained as\nfollows:\n\n```nim\nlet\n  x = \"this is a test.\"\n  y = suffixArray(x)\n\necho y # @[7, 4, 9, 14, 8, 11, 1, 5, 2, 6, 3, 12, 13, 10, 0]\n```\n\nSorting the indices may be a costly operation. One can use the fact that the\nsuffixes of a string form a rather special collection to produce more efficient\nalgorithms. Besides the sort-based one, we offer the\n[DC3 algorithm](http://spencer-carroll.com/the-dc3-algorithm-made-simple/).\n\nNotice that at the moment DC3 is not really optimized and may be neither\nspace nor time efficient.\n\nTo use an alternative algorithm, just pass an additional parameter, of type\n\n```nim\ntype SuffixArrayAlgorithm* {.pure.} = enum\n  Sort, DC3\n```\n\nlike this\n\n```nim\nlet\n  x = \"this is a test.\"\n  y = suffixArray(x, SuffixArrayAlgorithm.DC3)\n\necho y # @[7, 4, 9, 14, 8, 11, 1, 5, 2, 6, 3, 12, 13, 10, 0]\n```\n\n[Reference](https://www.cs.helsinki.fi/u/tpkarkka/publications/jacm05-revised.pdf)\n\n### Burrows-Wheeler transform\n\nThe [Burrows-Wheeler transform](http://michael.dipperstein.com/bwt/) of a string\nis a string one character longer, together with a distinguished character.\nOnce one has a suffix array `sa` for the string `s \u0026 '\\0'`, where `\\0` is our\ndistinguished character, the Burrows-Wheeler transform is the string which at\nthe index `i` has the last character of the rotation of `s` by `sa[i]`. 
The\ndistinguished index is the position of `\\0` in the transformed string.\n\nWe recall the following two facts:\n\n* the Burrows-Wheeler transform can be inverted - the exact algorithm is\n  outside the scope of this documentation\n* whenever a character is a good predictor for the next one (in the original\n  string), the string in the Burrows-Wheeler transform tends to have many\n  repeated characters, which makes it easy to compress by run-length encoding.\n\nAn example of usage is this:\n\n```nim\nlet\n  s = \"The quick brown fox jumps around the lazy dog\"\n  t = burrowsWheeler(s)\n  u = inverseBurrowsWheeler(t)\n\necho t # gskynxeed\\0 l in hh otTu c uwudrrfm abp qjoooza\necho u # The quick brown fox jumps around the lazy dog\n```\n\nNotice that for this to work we assume that `s` does not contain `\\0` itself.\nWe use the fact that Nim strings are not null terminated, hence `\\0` is a\nvalid character. Notice that printing the transformed string may not work as\nintended, since the terminal may interpret the embedded `\\0` as a string\nterminator.\n\n[Reference](http://www.hpl.hp.com/techreports/Compaq-DEC/SRC-RR-124.pdf)\n\n### FM indices\n\nAn [FM index](http://alexbowe.com/fm-index/) for a string puts together\nessentially all the pieces that we have described so far. The index itself\nholds a wavelet tree for the Burrows-Wheeler transform of the string, together\nwith a small auxiliary table having the size of the string alphabet.\n\nIt can be used for various purposes, but the simplest one is backward search.\nGiven a pattern `p` (a small string) and a possibly long string `s`, there is a\nway to search all occurrences of `p` in time `O(L)`, where `L` is the length\nof `p` - the time is independent of `s` - using an FM index for `s`.\n\nEvery occurrence of `p` appears as the prefix of some rotation of `s` - hence\nall such occurrences correspond to consecutive positions in the suffix\narray for `s`. 
The first and last such positions can be found as follows:\n\n```nim\nlet\n  x = \"mississippi\"\n  pattern = \"iss\"\n  fm = fmIndex(x)\n  sa = suffixArray(x)\n  positions = fm.search(pattern)\n\necho positions.first # 2\necho positions.last  # 3\n\nfor j in positions.first .. positions.last:\n  let i = sa[j.int]\n  echo x.rotate(i)\n\n# issippimiss\n# ississippim\n```\n\nFor economy, the FM index itself does not include the suffix array, as some\napplications do not require the latter. Still, it is quite frequent to need\nboth; since computing the FM index requires the suffix array in any case, and\ncomputing the suffix array is quite costly, there is a way to get both at the\nsame time. In the above example, we could equally write\n\n```nim\nlet\n  index = searchIndex(x)\n  fm = index.fmIndex\n  sa = index.suffixArray\n```\n\nThe above type can be used to streamline search:\n\n```nim\nlet\n  index = searchIndex(x)\n  positions = index.search(pattern)\n\necho positions # @[1, 4]\n```\n\n[Reference](http://people.unipmn.it/manzini/papers/focs00draft.pdf)\n\n## Applications\n\nHere we describe a few applications of the above data structures, together\nwith some other string utilities included in Cello.\n\n### Boyer-Moore-Horspool search\n\nTo make a comparison with naive string searching (without using indices),\nan implementation of Boyer-Moore-Horspool string searching is provided.\n\nThe Boyer-Moore algorithm and variations (such as the one used here, due to\nHorspool) scan a string linearly to find a pattern, but use a precomputed table\nbased on the pattern to skip more than one character at a time.\nThe key observation is that after making a comparison for the pattern in a given\nposition, one already knows that some subsequent positions are certain not to\nmatch, and hence can be skipped. 
The resulting algorithm is still `O(n)` in the\nlength of the string, but may perform fewer than `n` actual comparisons.\n\nThe API mimics `strutils.find` and it is meant to be used as follows:\n\n```nim\nlet\n  x = \"mississippi\"\n  pattern = \"iss\"\n\necho boyerMooreHorspool(x, pattern) # 1 (ississippi)\necho boyerMooreHorspool(x, pattern, start = 2)  # 4 (issippi)\n```\n\n[Reference](http://onlinelibrary.wiley.com/doi/10.1002/spe.4380100608/abstract)\n\n### Levenshtein similarity\n\nThe [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance)\n(or edit distance) between two strings is the minimum number of insertions,\ndeletions or substitutions required to change one string into the other.\n\nIt is computed by `strutils.editDistance`. Here we expose a similarity measure\nderived from it, defined as `s = (L - e) / L`, where `L` is the cumulative\nlength of the two strings, and `e` is the edit distance. It is a number\nbetween 0 and 1, which is 1 only if the two strings are equal.\n\nIt is simply used as\n\n```nim\nlet\n  a = someString\n  b = someOtherString\n  s = levenhstein(a, b)\n```\n\n### Ratcliff-Obershelp similarity\n\nThe Levenshtein similarity is a rather crude measure of whether two strings\nresemble each other. A better measure is given by the\n[Ratcliff-Obershelp similarity](https://xlinux.nist.gov/dads/HTML/ratcliffObershelp.html)\nwhich is defined as `s = (2 * m) / L`, where `L` is the cumulative\nlength of the two strings, and `m` is the number of matching characters.\n\nMatching characters are defined recursively: first we find the longest common\nsubstring `lcs` between the two and count the number of characters of `lcs` as\nmatching. 
Then, recursively, we count the matching characters in\nthe chunks to the left of `lcs` and to the right of `lcs`.\n\nFor instance, when comparing `ALEXANDRE` and `ALEKSANDER`, we find the following\nsequence of longest common substrings:\n\n* ALE\n* AND\n* R\n\ngiving a Ratcliff-Obershelp similarity of `2 * (3 + 3 + 1) / (9 + 10)`.\n\nIt is simply used as\n\n```nim\nlet\n  a = someString\n  b = someOtherString\n  s = ratcliffObershelp(a, b)\n```\n\n[Reference](http://collaboration.cmc.ec.gc.ca/science/rpn/biblio/ddj/Website/articles/DDJ/1988/8807/8807c/8807c.htm)\n\n### Jaro similarity\n\nThe Jaro similarity of two strings `a` and `b` is given by\n\n```\n0 if m == 0\n((m / len(a)) + (m / len(b)) + ((m - t / 2) / m)) / 3 otherwise\n```\n\nwhere `m` is the number of matching characters and `t` is the number of transpositions.\nHere two characters are considered matching if they are equal and their\ndistance is less than `max(len(a), len(b)) / 2`. The substrings of `a` and `b`\ngiven by matching characters are permutations of each other. Characters that\nmatch but appear in different positions in these strings are considered transpositions.\n\nFor instance, when comparing `ALEXANDRE` and `ALEKSANDER`, we find the following\nmatches inside `a` and `b` respectively: `ALEANDRE`, `ALEANDER`. Hence here\n`m = 8`, `t = 2`, so that the similarity is `((8 / 9) + (8 / 10) + (7 / 8)) / 3`.\n\n[Reference](https://ilyankou.files.wordpress.com/2015/06/ib-extended-essay.pdf)\n\n### Jaro-Winkler similarity\n\nThe Jaro-Winkler similarity of two strings is a correction to the Jaro similarity\nthat favours strings which have a long common prefix. 
If `L` is the length of\nthe common prefix of two strings and `J` is the Jaro similarity, the Jaro-Winkler\nsimilarity is computed as\n\n```\nJ + p * L * (1 - J)\n```\n\nwhere `p` is a constant factor, commonly set to `p = 0.1`.\n\n**NB** The Jaro-Winkler similarity can be higher than 1, unlike the other\nmetrics implemented in Cello.\n\n### Approximate search\n\nWe implement a naive form of approximate search for strings. The algorithm is\nas follows: when looking for a pattern we randomly select a substring of the\npattern whose length is a given fraction (`exactness`) of the pattern itself.\nWe then search for this substring exactly in the target string. If we find it,\nwe focus on a window around this match having the same length as the pattern.\nWe compare the similarity of the window with the pattern itself, using one\nof the similarity functions above. If this is above a given threshold\n(`tolerance`) we accept the match and return the position of the window;\notherwise we make another attempt. 
After a certain number of attempts\nfail, we return `-1`.\n\nThe algorithm is driven by the following type:\n\n```nim\ntype\n  Similarity {.pure.} = enum\n    RatcliffObershelp, Levenshtein, LongestSubstring, Jaro, JaroWinkler\n  SearchOptions = object\n    exactness, tolerance: float\n    attempts: int\n    similarity: Similarity\n```\n\nand can be used like this:\n\n```nim\nlet\n  s = someLongString\n  pattern = someShortString\n  index = searchIndex(s)\n  options = searchOptions(exactness = 0.2)\n  position = index.searchApproximate(s, pattern, options)\n\necho position\n```\n\nThe defaults are `exactness = 0.1`, `tolerance = 0.7`, `attempts = 30` and\n`similarity = Similarity.RatcliffObershelp`.\n\n## TODO\n\n* Improve DC3 algorithm\n* More applications of suffix arrays\n* Construct wavelet trees in threads\n* Make use of SIMD operations to improve performance\n* Allow data structures to work on memory-mapped files\n* Implement assembly on top of FM indices following [this thesis](ftp://ftp.sanger.ac.uk/pub/resources/theses/js18/thesis.pdf)\n\n# Thanks\n\nThe logo comes from [cliparts.co](http://cliparts.co/clipart/2313124)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fandreaferretti%2Fcello","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fandreaferretti%2Fcello","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fandreaferretti%2Fcello/lists"}