{"id":15520715,"url":"https://github.com/proycon/analiticcl","last_synced_at":"2025-04-06T15:12:01.677Z","repository":{"id":47722598,"uuid":"359600206","full_name":"proycon/analiticcl","owner":"proycon","description":"an approximate string matching or fuzzy-matching system for spelling correction, normalisation or post-OCR correction","archived":false,"fork":false,"pushed_at":"2025-03-03T13:49:36.000Z","size":2365,"stargazers_count":36,"open_issues_count":4,"forks_count":4,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-03-30T14:09:46.362Z","etag":null,"topics":["approximate-string-matching","fuzzy-matching","nlp","normalization","spelling-correction"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/proycon.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-04-19T21:14:09.000Z","updated_at":"2025-03-10T18:02:52.000Z","dependencies_parsed_at":"2024-06-12T17:00:29.642Z","dependency_job_id":"db1030f8-286c-4f52-bf5e-1d13cc099b6c","html_url":"https://github.com/proycon/analiticcl","commit_stats":{"total_commits":392,"total_committers":3,"mean_commits":"130.66666666666666","dds":"0.025510204081632626","last_synced_commit":"6413ee8aec3c54fdcffae6f188e74ecef9779b78"},"previous_names":[],"tags_count":15,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/proycon%2Fanaliticcl","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/proycon%2Fanaliticcl/tags","releases_url":"h
ttps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/proycon%2Fanaliticcl/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/proycon%2Fanaliticcl/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/proycon","download_url":"https://codeload.github.com/proycon/analiticcl/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247500468,"owners_count":20948880,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["approximate-string-matching","fuzzy-matching","nlp","normalization","spelling-correction"],"created_at":"2024-10-02T10:29:00.949Z","updated_at":"2025-04-06T15:12:01.651Z","avatar_url":"https://github.com/proycon.png","language":"Rust","funding_links":[],"categories":["\u003ca name=\"text-processing\"\u003e\u003c/a\u003eText processing"],"sub_categories":[],"readme":"[![Crate](https://img.shields.io/crates/v/analiticcl.svg)](https://crates.io/crates/analiticcl)\n[![Docs](https://docs.rs/analiticcl/badge.svg)](https://docs.rs/analiticcl/)\n[![GitHub build](https://github.com/proycon/analiticcl/actions/workflows/analiticcl.yml/badge.svg?branch=master)](https://github.com/proycon/analiticcl/actions/)\n[![GitHub release](https://img.shields.io/github/release/proycon/analiticcl.svg)](https://GitHub.com/proycon/analiticcl/releases/)\n[![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)\n\n# Analiticcl\n\n## 
Introduction\n\nAnaliticcl is an approximate string matching or fuzzy-matching system that can be used for spelling\ncorrection or text normalisation (such as post-OCR correction or post-HTR correction). Texts can be checked against a\nvalidated or corpus-derived lexicon (with or without frequency information) and spelling variants will be returned.\n\nThe distinguishing feature of the system is the use of anagram hashing to drastically reduce the search space and make\nquick lookups possible even over larger edit distances. The underlying idea is largely derived from prior work *TICCL*\n(Reynaert 2010; Reynaert 2004), which was implemented in [ticcltools](https://github.com/languagemachines/ticcltools).\nThis *analiticcl* implementation attempts to re-implement the core of these ideas from scratch, but also introduces some\nnovelties, such as the use of prime factors for improved anagram hashing. We aim at a high-performance\nimplementation written in [Rust](https://www.rust-lang.org).\n\nAside from reading this documentation, you can also view an in-depth [presentation\nvideo](https://diode.zone/w/kkrqA4MocGwxyC3s68Zsq7) that was presented at the KNAW Humanities Cluster in January 2022.\n\nIf you plan to use analiticcl from Python, we recommend that you also follow [this tutorial](tutorial.ipynb), which takes the form of a Jupyter notebook.\n\n## Demo\n\n![Analiticcl demo](https://raw.githubusercontent.com/CLARIAH/wp3-demos/master/analiticcl.gif)\n\n## Features\n\n* Quick retrieval of spelling variants given an input word due to smart anagram hashing lookup. This is the main feature\n  that drastically reduces the search space.\n* Works against a lexicon, which can either be a validated lexicon (preferred), or a lexicon derived from a corpus.\n* Uses a user-provided alphabet file for anagram hashing, in which multiple characters may be mapped to a single alphabet entry if so\n  desired (e.g. 
for casing or for more phonetic-like lookup behaviour like soundex)\n* Can take into account frequency information from the lexicon\n* Matching against final candidates using a variety of possible distance metrics. Scoring and ranking are implemented as\n  a weighted linear combination including the following components:\n    * Damerau-Levenshtein\n    * Longest common substring\n    * Longest common prefix/suffix\n    * Casing difference (boolean)\n  An exact match always has distance score 1.0.\n* Additionally, frequency information can be used to influence ranking.\n* A confusable list with known confusable patterns and weights can be provided. This is used to favour or penalize certain\n  confusables in the ranking stage (this weight is applied to the whole score).\n* Rather than look up words in spelling-correction style, users may also output the entire hashed anagram index, or\n  output a reverse index of all variants found in the supplied input data for each item in the lexicon.\n* Also supports ingesting explicit variant lists/error lists.\n* Support for language models to consider context information.\n* Multi-threading support\n\nThe current implementation is still a work in progress and should be considered experimental. In particular, the search mode\nand the use of language models are still being evaluated and improved.\n\n## Installation\n\nYou can build and install the latest stable analiticcl release using Rust's package manager:\n\n```\ncargo install analiticcl\n```\n\nor if you want the development version after cloning this repository:\n\n```\ncargo install --path .\n```\n\nNo cargo/rust on your system yet? 
Do ``sudo apt install cargo`` on Debian/Ubuntu-based systems, ``brew install rust`` on macOS, or use [rustup](https://rustup.rs/).\n\nNote that 32-bit architectures are not supported.\n\n## Usage\n\nAnaliticcl is typically used through its command line interface or through the [Python binding](https://github.com/proycon/analiticcl/tree/master/bindings/python). Full syntax help for the command line tool is always available through ``analiticcl --help``.\n\nAnaliticcl can be run in several **modes**, each invoked through a subcommand. Each subcommand also takes its own\n``--help`` parameter for detailed usage information.\n\n* **Query mode** - ``analiticcl query`` - Queries the model for variants for the provided input item (one per line)\n* **Search mode** - ``analiticcl search`` - Searches for variants in running text. This encompasses detection and correction whereas the above query mode only handles correction.\n* **Learn mode** - ``analiticcl learn`` - Learns variants from the input for each item in the lexicon and outputs a weighted variant list.\n* **Index mode** - ``analiticcl index`` - Computes and outputs the anagram index, takes no further input\n\nIn all modes, the performance of the system depends to a large degree on the chosen parameters and on the quality of the lexicons, including the **background lexicon**, the importance of which cannot be overstated, so we dedicate a special section to it later.\n\n### Query Mode\n\nThe query mode takes one input item per line and outputs all variants and their scores found for the given input.\nDefault output is TSV (tab separated fields) in which the first column contains the input and the variants and scores\nare tab delimited fields in the columns thereafter.\n\nYou need to pass at least an [alphabet file](#alphabet-file) and a [lexicon file](#lexicon-file) against which matches\nare made.\n\nExample:\n\n```\n$ analiticcl query --interactive --lexicon examples/eng.aspell.lexicon --alphabet 
examples/simple.alphabet.tsv\nInitializing model...\nLoading lexicons...\nBuilding model...\nComputing anagram values for all items in the lexicon...\n - Found 119773 instances\nAdding all instances to the index...\n - Found 108802 anagrams\nCreating sorted secondary index...\nSorting secondary index...\n ...\nQuerying the model...\n(accepting standard input; enter input to match, one per line)\n```\n\nThe program is now reading standard input. Enter a word to query and press ENTER to get the variants and their scores.\nSpecify the ``--interactive`` parameter; otherwise output may not be returned immediately but will be buffered for\nparallelisation:\n\n```tsv\nseperate\nseperate        separate        0.734375                operate 0.6875          desperate       0.6875          temperate       0.6875          serrate 0.65625         separates       0.609375                separated       0.609375\n```\n\nRather than running it interactively, you can use your shell's standard redirection facilities to provide input and output; multiple input items will then be processed in parallel.\n\n```\n$ analiticcl query --lexicon examples/eng.aspell.lexicon --alphabet examples/simple.alphabet.tsv \u003c input.tsv \u003e output.tsv\n```\n\n\nThe ``--lexicon`` argument can be specified multiple times for multiple lexicons. Lexicons may contain absolute\nfrequency information, but frequencies between multiple lexicons must be balanced! In case you are using multiple\nlexicons, you can get analiticcl to output information on which lexicon a match was found in by setting\n``--output-lexmatch``. The order of the lexicons (and variant lists) matters if there is associated frequency\ninformation. If an entry occurs in multiple lexicons, they will all be returned.\n\nIf you want JSON output rather than TSV, use the ``--json`` flag. The JSON output includes more details than the TSV\noutput. 
Most notably, you will see the distance score (aka similarity score) and the frequency score reported separately, whereas\nthe TSV mode only outputs the combined score.\n\n```\n$ analiticcl query --lexicon examples/eng.aspell.lexicon --alphabet examples/simple.alphabet.tsv --output-lexmatch --json \u003c input.tsv \u003e output.json\n```\n\n```json\n[\n    { \"input\": \"seperate\", \"variants\": [\n        { \"text\": \"separate\", \"score\": 0.734375, \"dist_score\": 0.734375, \"freq_score\": 1, \"lexicons\": [ \"examples/eng.aspell.lexicon\" ] },\n        { \"text\": \"desperate\", \"score\": 0.6875, \"dist_score\": 0.6875, \"freq_score\": 1, \"lexicons\": [ \"examples/eng.aspell.lexicon\" ] },\n        { \"text\": \"operate\", \"score\": 0.6875, \"dist_score\": 0.6875, \"freq_score\": 1, \"lexicons\": [ \"examples/eng.aspell.lexicon\" ] },\n        { \"text\": \"temperate\", \"score\": 0.6875, \"dist_score\": 0.6875, \"freq_score\": 1, \"lexicons\": [ \"examples/eng.aspell.lexicon\" ] },\n        { \"text\": \"serrate\", \"score\": 0.65625, \"dist_score\": 0.65625, \"freq_score\": 1, \"lexicons\": [ \"examples/eng.aspell.lexicon\" ] },\n        { \"text\": \"separated\", \"score\": 0.609375, \"dist_score\": 0.609375, \"freq_score\": 1, \"lexicons\": [ \"examples/eng.aspell.lexicon\" ] },\n        { \"text\": \"separates\", \"score\": 0.609375, \"dist_score\": 0.609375, \"freq_score\": 1, \"lexicons\": [ \"examples/eng.aspell.lexicon\" ] }\n    ] }\n]\n```\n\n\n### Learn Mode\n\nIn learn mode, analiticcl takes a lexicon and collects variants from the input for each item in the lexicon.\n\nIt takes input similar to that of search mode or query mode (add an extra ``--strict`` flag if your input is a list/lexicon\nrather than running text). Instead of outputting the results directly, it collects all variants and associates them with\nthe items in the lexicon, effectively updating the model with this information. 
The output this mode provides is\nthe inverse of what search or query does: for each item in the lexicon, all variants that were found (and their scores)\nare listed. This output constitutes a weighted variant\nlist which can be loaded in again using ``--variants``.\n\nThe learned variants are used as intermediate words to guide the system towards a desired solution. Assume for instance\nthat our lexicon contains the word ``separate``, and we found the variant ``seperate`` in the data during learning. This\nvariant is now associated with the right reference, and on subsequent runs matches against ``seperate`` will count\ntowards matches on ``separate``. This mechanism allows the system to bridge larger edit distances even when it is\nconstrained to smaller ones. For example: ``seperete`` will match against ``seperate`` but not ``separate`` when the\nedit/anagram distance is constrained to 1.\n\nLearn mode may do multiple iterations over the same data (set ``--iterations``). As iterations grow, larger edit\ndistances can be covered, but this is also a source of extra noise so accuracy will go down too.\n\nWhen using learn mode, make sure to choose tight constraints (e.g. ``--max-matches 1`` and a high\n``--score-threshold``). Learning on a list/lexicon using ``--strict`` rather than on running text generally leads to\nbetter results.\n\n### Search Mode\n\nIn query mode you provide an exact input string and ask Analiticcl to correct it as a single unit. Query mode\neffectively implements the *correction* part of a spelling-correction system, but does not really handle the *detection*\naspect. This is where *search mode* comes in. In search mode you can provide running text as input and the system will\nautomatically attempt to detect the parts of your input that can be corrected, and give suggestions for correction.\n\nIn the output, Analiticcl will return UTF-8 byte offsets for fragments in your data that it finds variants for. 
You can\nset ``--unicode-offsets`` if you want unicode codepoint offsets instead. Both types of offsets are zero-indexed and\nthe end offset is always non-inclusive.\n\nYour input does not have to be tokenised, because tokenisation errors in the\ninput may in itself account for variation which the system will attempt to resolve. Search mode can look at n-grams to\nthis end, which effectively makes Analiticcl context-aware. You can use the ``--max-ngram-order`` parameter to set the\nmaximum n-gram order you want to consider. Any setting above 1 enables a language modelling component in Analiticcl,\nwhich requires a frequency list of n-grams as input (using ``--lm``).\n\n### Index Mode\n\nThe index mode simply outputs the anagram index, it takes no further input.\n\n```\n$ analiticcl index --lexicon examples/eng.aspell.lexicon --alphabet examples/simple.alphabet.tsv\n```\n\nIt may be insightful to sort on the number of anagrams and show the top 20 , with a bit of awk scripting and some piping:\n\n\n```\n$ analiticcl index --lexicon examples/eng.aspell.lexicon --alphabet examples/simple.alphabet.tsv | awk -F'\\t' '{ print NF-1\"\\t\"$0 }' | sort -rn | head -n 20\n[...]\n```\n```tsv\n8       1227306 least   slate   Stael   stale   steal   tales   teals   Tesla\n7       98028906        elan's  lane's  Lane's  lean's  Lean's  Lena's  Neal's\n7       55133630        actors  castor  Castor  Castro  costar  Croats  scrota\n7       485214  abets   baste   bates   Bates   beast   beats   betas\n7       416874  bares   baser   bears   braes   saber   sabre   Sabre\n7       411761163       luster  lustre  result  rustle  sutler  ulster  Ulster\n7       409102  alts    last    lats    LSAT    salt    SALT    slat\n7       3781815 notes   onset   Seton   steno   stone   Stone   tones\n7       33080178        carets  caster  caters  crates  reacts  recast  traces\n7       2951915777547   luster's        lustre's        result's        rustle's        sutler's        ulster's       
 Ulster's\n7       286404699       merits  mister  Mister  miters  mitres  remits  timers\n7       28542   east    East    eats    etas    sate    seat    teas\n7       28365   ergo    goer    gore    Gore    ogre    Oreg    Roeg\n7       27489162        capers  crapes  pacers  parsec  recaps  scrape  spacer\n7       1741062 aster   rates   resat   stare   tares   tears   treas\n7       17286   ales    Elsa    lase    leas    Lesa    sale    seal\n7       1446798 pares   parse   pears   rapes   reaps   spare   spear\n7       1403315 opts    post    Post    pots    spot    stop    tops\n7       13674   elan    lane    Lane    lean    Lean    Lena    Neal\n6       96935466        parses  passer  spares  sparse  spears  Spears\n```\nThe large number in the second column is the [anagram value](#theoretical-background) shared by the words on that line.\n\n### Background Lexicon\n\nWe cannot overstate the importance of the background lexicon to reduce false positives. Analiticcl will eagerly\nattempt to match your test input to whatever lexicons you provide. This demands a certain degree of completeness in your\nlexicons. If your lexicon contains a relatively rare word like \"boulder\" and not a more common word like \"builder\", then\nanaliticcl will happily suggest all instances of \"builder\" to be \"boulder\". The risk for this increases as the allowed\nedit distances increase.\n\nSuch background lexicons should also contain morphological variants and not just lemmas. Ideally it is derived automatically from a fully spell-checked corpus.\n\nAnaliticcl **will not** work for you if you just feed it some small lexicons and no complete enough background lexicons, unless you are sure your test texts have a very constrained vocabulary.\n\n### Scores and ranking\n\nIn query mode, analiticcl will return a similarity/distance score between your input and any matching variants. This\nscore is expressed on a scale of 1.0 (exact match) to 0.0. 
The score takes the length of the input into account, so a\nLevenshtein distance of 2 on a longer word weighs less than a Levenshtein distance of 2 on a shorter word. The distance score\nitself consists of multiple components, each with a configurable weight:\n\n* Damerau-Levenshtein\n* Longest common substring\n* Longest common prefix\n* Longest common suffix\n* Casing difference (boolean)\n\nA frequency score on a scale of 1.0 (most frequent variant) to 0.0 is returned separately (not shown in TSV output).\nBy default, the ranking of variants is based primarily on the distance score; the frequency score is only used as a\nsecondary key in case there is a tie (multiple items with the same distance score).\n\nIf you do want frequency information to play a larger role in the ranking of variants, you can use the ``--freq-ranking`` parameter. Its value\nis the weight attributed to the frequency score relative to the distance score and should be in the range 0.0\nto 1.0; a small value around 0.25 is recommended. It is used as ``freq_weight`` to compute a ranking score as follows:\n\n```\nranking_score = (distance_score + freq_weight * freq_score) / (1 + freq_weight)\n```\n\nThis ranking score is subsequently used to rank the results. This may result in a variant with less similarity to the\ninput being preferred over a variant with more similarity to the input, if that first variant is far more frequent.\n\n## Data Formats\n\nAll input for analiticcl must be UTF-8 encoded and use unix-style line endings; NFC Unicode normalisation is strongly\nrecommended.\n\n### Alphabet File\n\nThe alphabet file is a TSV file (tab separated fields) containing all characters of the alphabet. Each line describes a\nsingle alphabet 'character'. An alphabet file may for example start as follows:\n\n```\na\nb\nc\n```\n\nMultiple values on a line may be tab separated and are used to denote equivalents. 
A single line\nrepresenting a single character could for example look like:\n\n```tsv\na\tA\tá\tà\tä\tÁ\tÀ\tÄ\n```\n\nThis means that these are all encoded the same way and are considered identical for all anagram hashing and distance\nmetrics. A common situation is that all numerals are encoded indiscriminately, which you can accomplish with an alphabet entry\nlike:\n\n```tsv\n0\t1\t2\t3\t4\t5\t6\t7\t8\t9\n```\n\nIt is recommended to order the lines in the alphabet file based on the frequency of the character, as this will lead to\nthe best performance (i.e. generally smaller anagram values), but this is not a hard requirement by any means.\n\n\nEntries in the alphabet file are not constrained to a single character but may also correspond to multiple characters, for instance:\n\n```tsv\nae\tæ\n```\n\nEncoding always proceeds according to a greedy matching algorithm in the exact order entries are defined in the alphabet\nfile.\n\n### Lexicon File\n\nThe lexicon is a TSV file (tab separated fields) containing either validated or corpus-derived\nwords or phrases, one lexicon entry per line. The first column typically (this is configurable) contains the word\nand the optional second column contains the absolute frequency count. If no frequency information is available, all\nitems in the lexicon carry the exact same weight.\n\nMultiple lexicons may be passed and analiticcl will remember which lexicon was matched against, so you could use this\ninformation for some simple tagging.\n\n### Variant List\n\nA variant list explicitly relates spelling variants to preferred forms, and in doing so goes a step further than a simple lexicon, which only\nspecifies the validated or corpus-derived form.\n\nA variant list is *directed* and *weighted*: it specifies a normalised/preferred form first, and then specifies variants and variant scores. 
Take the following example (all fields are tab separated):\n\n```tsv\nhuis\thuys\t1.0\thuijs\t1.0\n```\nThis states that the preferred word *huis* has two variants (historical spelling in this case), and both have a score\n(0-1) that expresses how likely it is that the variant maps to the preferred word. When loaded into analiticcl with\n``--variants``, both the preferred form and the variants will be valid results in normalization (as if you\nloaded a lexicon with all three words in it). Any matches on the variants will automatically *expand* to also match on\nthe preferred form.\n\nWhat you might be more interested in is a special flavour of the variant list called an *error list*, loaded\ninto analiticcl using ``--errors``. Consider the following example:\n\n```tsv\nseparate\tseperate\t1.0\tseperete\t1.0\n```\nThis states that the preferred word ``separate`` has two variants that are considered errors. In this case, analiticcl considers these variants *transparent*: it will still match against the variants but they will never be returned as a solution; the preferred variant will be returned as a solution instead. This mechanism helps bridge larger edit distances. In the JSON output, the \"via\" property conveys that a transparent variant was used in matching.\n\nA variant list may also contain an extra column of absolute frequencies, provided that it's consistently\nprovided for *all* references and variants:\n\n```tsv\nseparate\t531\tseperate\t1.0\t4\tseperete\t1.0\t1\n```\n\nHere the reference occurs 531 times, the first misspelling 4 times, and the last variant only 1 time.\n\nAnaliticcl can also *output* variant lists, given input lexicons and a text to train on; this occurs when you run it in *learn mode*.\n\n### Confusable List\n\nThe confusable list is a TSV file (tab separated fields) containing known confusable patterns and weights to assign to\nthese patterns when they are found. The file contains one confusable pattern per line. 
The patterns are expressed in the\nedit script language of [sesdiff](https://github.com/proycon/sesdiff). Consider the following example:\n\n```tsv\n-[y]+[i]\t1.1\n```\n\nThis pattern expresses a deletion of the letter ``y`` followed by insertion of ``i``, which comes down to substitution\nof ``y`` by ``i``. Edits that match against this confusable pattern receive the weight *1.1*, meaning such an edit is\ngiven preference over edits with other confusable patterns, which by definition have weight *1.0*. Weights greater than\n*1.0* are given preference in the score weighting, weights smaller than ``1.0`` imply a penalty. When multiple\nconfusable patterns match, the product of their weights is taken. The final weight is applied to the whole candidate\nscore, so weights should be values fairly close to ``1.0`` in order not to introduce too large bonuses/penalties.\n\nThe edit script language from sesdiff also allows for matching on immediate context; consider the following variant of the above\nwhich only matches the substitution when it comes after a ``c`` or a ``k``:\n\n```tsv\n=[c|k]-[y]+[i]\t1.1\n```\n\nTo force matches on the beginning or end, start or end the pattern with a ``^`` or a ``$``, respectively. A further description of the edit script language\ncan be found in the [sesdiff](https://github.com/proycon/sesdiff) documentation.\n\n### Language Model\n\nIn order to consider context information, analiticcl can construct and apply a simple n-gram language model. The input for this language\nmodel is an n-gram frequency list, provided through the ``--lm`` parameter. It is used in analiticcl's *search mode*.\n\nThis should be a corpus-derived list of unigrams and bigrams, optionally also trigrams (and even all up to quintgrams if\nneeded, higher-order ngrams are not supported though).  This is a TSV file containing the ngram in the first column\n(space character acts as token separator), and the absolute frequency count in the second column. 
It is also recommended\nthat it contains the special tokens ``\u003cbos\u003e`` (begin of sentence) and ``\u003ceos\u003e`` (end of sentence). The items in this list are\n**NOT** used for variant matching; use ``--lexicon`` instead if you want to also match against\nthese items. It is fine to have an entry in both the language model and lexicon; analiticcl will store it only once\ninternally.\n\n### Context Rules\n\nAnother way to consider context information is through context rules. The context rules define certain patterns that are\nto be either favoured or penalized. The context rules are expressed in a tab separated file which can be passed to\nanaliticcl using ``--contextrules``. The first column contains a sequence separated by semicolons, and the second a\nscore close to 1.0 (lower scores penalize the pattern, higher scores favour it):\n\n```tsv\nhello ; world\t1.1\n```\n\nThis means that if the words \"hello world\" appear as a solution for a text/sentence, its total context score will be boosted\n(proportional to the length of the match), effectively preferring this solution over others. This context score is an\nindependent component in the final score function and its weight can be set using ``--contextrules-weight``.\n\nNote that the words also need to be in a lexicon you provide for a rule to work. You can express disjunctions using the\npipe character (``|``), as follows:\n\n```tsv\nhello|hi ; world|planet\t1.1\n```\n\nThis will match all four possible combinations. Rather than match the text, you can match specific lexicons you loaded\nusing the `@` prefix. This makes sense mainly if you use different lexicons and could be used as a form of elementary tagging:\n\n```tsv\n@greetings.tsv ; world\t1.1\n```\n\nHere too you can create disjunctions using the pipe character:\n\n\n```tsv\n@greetings.tsv|@curses.tsv ; world\t1.1\n```\n\nIf you want to negate a match, just add ``!`` as a prefix. 
This also works in combination with ``@``, allowing you to match anything *except* the words from a particular lexicon. If you want to negate an entire disjunction, use parentheses like ``!(a|b|c|)``.\n\nThere are two standalone characters you may use in matching:\n\n* ``?`` - Matches anything\n* ``^`` - Matches anything that does not match with *any* lexicon (i.e. out of vocabulary words)\n\nNote that in all cases, you'll still need to explicitly load the lexicons (or variant lists) using ``--lexicon``, ``--variants``,\netc...\n\nThe rules are applied in the exact order you specify them. Note that a given word in a text may only match against\none pattern (the first that is found). When defining context rules, you'll generally want to specify longer rules before\nshorter ones, as otherwise the longer rules might never be considered. In the following example, the second\npattern would never apply because the first one already matches:\n\n```tsv\nhello\t1.1\nhello ; world\t1.1\n```\n\n### Entity Tagging\n\nAnaliticcl can be used as a simple entity tagger using its context rules. Make sure you understand the above section before you\ncontinue reading.\n\nYou may pass two additional tab-separated columns to the context rules file: the third column specifies a tag to assign\nto any matches, and an *optional* fourth column specifies an offset for tagging (more about this later). For example:\n\n\n```tsv\nhello ; world\t1.1\tgreeting\n```\n\nAny instances of \"hello world\" will be assigned the tag \"greeting\". More specifically, \"hello\" will be assigned the tag\n\"greeting\" and gets sequence number 0, and \"world\" gets the same tag and sequence number 1.\n\nIf you want to tag only a subset and leave certain left or right context untagged, then you can do so by specifying an\noffset (in matches aka words, not characters). Such an offset takes the form ``offset:length``. 
For example:\n\n\n```tsv\nhello ; world\t1.1\tgreeting\t1:1\n```\n\nIn this case only the word \"world\" will get the tag greeting (and sequence number 0).\n\nIt is not possible to apply multiple context rules to the same words once one has matched, but it is possible to assign multiple (even overlapping) tags with a single context rule. Use a semicolon to separate multiple tags and multiple tag offsets (there must be an equal number of both):\n\n```tsv\n@firstname.tsv ; @lastname.tsv\t1.0\tperson;firstname;lastname\t0:2;0:1;1:2\n```\n\nThis mechanism can also be used to assign tags based on lexicons whilst allowing some form of lexicon weighting, even if\nno further context is included:\n\n```tsv\n@greetings.tsv\t1.0\tgreeting\nin|to|from ; @city.tsv\t1.1\tlocation\t1:1\n@firstname.tsv ; @lastname.tsv\t1.0\tperson\n```\n\n## Theoretical Background\n\nA naive approach to find variants would be to compute the edit distance between the input string and all ``n`` items in\nthe lexicon. This, however, is prohibitively expensive (``O(mn)``) when ``m`` input items need to be compared. Anagram\nhashing (Reynaert 2010; Reynaert 2004) aims to drastically reduce the variant search space. For all items in the\nlexicon, an order-independent **anagram value** is computed over all characters that make up the item. All words with\nthe same set of characters (allowing for duplicates) obtain an identical anagram value. This value is subsequently used\nas a hash in a hash table that maps each anagram value to all variant instances. This is effectively what is output\nwhen running ``analiticcl index``.\n\nUnlike earlier work, Analiticcl uses prime factors for computation of anagram values. Each character in the alphabet\ngets assigned a prime number (e.g. a=2, b=3, c=5, d=7, e=11) and the product of these forms the anagram value. 
This\nprovides the following useful properties:\n\n* We can multiply any two anagram values to get an anagram value that represents the union of all characters in both\n    (including duplicates): ``av(A) ∙ av(B) = av(AB)``\n* If anagram value A is divisible by anagram value B (``av(A) % av(B) = 0``), then the set of characters represented by B is fully contained within A.\n    * ``av(A) / av(B) = av(A-B)`` contains the set difference (aka relative complement). It consists of\n        the set of all characters in A that are not in B.\n\nThe caveat of this approach is that it results in fairly large anagram values that quickly exceed a 64-bit register; the\nanaliticcl implementation therefore uses a big-number implementation to deal with arbitrarily large integers.\n\nThe properties of the anagram values facilitate a much quicker lookup. When given an input word to seek variants for\n(e.g. using ``analiticcl query``), we take the following steps:\n\n* we compute the anagram value for the input\n* we look up this anagram value in the index (if it exists) and gather the variant candidates associated with the\n    anagram value\n* we compute all deletions within a certain distance (e.g. by removing any 2 characters). This is a division operation\n    on the anagram values. The maximum distance is set using the ``-k`` parameter.\n* for all of the anagram values resulting from these deletions, we look up which anagram values in our index match or contain (``av(A) % av(B) = 0``) the value under consideration. We again gather the candidates that result from all matches.\n    * To facilitate this lookup, we make use of a *secondary index*, which is grouped by the number of\n        characters. For each length it enumerates, in sorted order, all anagram values that exist for that particular length. This means we\n        can apply a binary search to find the anagrams that we should check our anagram value against (i.e. 
to check whether it is a subset of the anagram), rather than needing to exhaustively try all anagram values in our index.\n* Via the anagram index, we have collected all possibly relevant variant instances, which is a considerably smaller set than\n    the entire set we'd get if we didn't have the anagram heuristic. Now that the set is reduced, we apply more conventional\n    measures:\n    * We compute several metrics between the input and the possible variants:\n        * Damerau-Levenshtein\n        * Longest common substring\n        * Longest common prefix/suffix\n        * Casing difference (binary, different case or not)\n    * A score is computed as a weighted linear combination of the above items (the actual weights\n        are configurable). An exact match always has score 1.0.\n    * A cut-off value prunes candidates that score too low from the list (the parameter ``-n`` expresses how many variants\n        we want)\n    * Optionally, if a confusable list was provided, we compute the edit script between the input and each variant, and\n      rescore when there are known confusables that are either favoured or penalized.\n\n\n## Licence\n\nAnaliticcl is open-source software licenced under the GNU General Public Licence v3.\n\n## References\n\n* Boytsov, Leonid. (2011). Indexing methods for approximate dictionary searching: Comparative analysis. ACM Journal of Experimental Algorithmics. 16. https://doi.org/10.1145/1963190.1963191.\n* Reynaert, Martin. (2004). Text induced spelling correction. In: Proceedings COLING 2004, Geneva (2004). https://doi.org/10.3115/1220355.1220475\n* Reynaert, Martin. (2011). Character confusion versus focus word-based correction of spelling and OCR variants in corpora. IJDAR 14, 173–187 (2011). https://doi.org/10.1007/s10032-010-0133-5\n\n## Acknowledgement\n\nAnaliticcl was developed at the [KNAW Humanities Cluster](https://huc.knaw.nl/). 
Its development was funded in the scope of the [Golden Agents](https://www.goldenagents.org/) project.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fproycon%2Fanaliticcl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fproycon%2Fanaliticcl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fproycon%2Fanaliticcl/lists"}