{"id":16489935,"url":"https://github.com/crowdagger/caribon","last_synced_at":"2025-12-12T13:50:33.510Z","repository":{"id":34254830,"uuid":"38139119","full_name":"crowdagger/caribon","owner":"crowdagger","description":"A repetition detector written in Rust","archived":false,"fork":false,"pushed_at":"2017-03-16T23:07:37.000Z","size":219,"stargazers_count":14,"open_issues_count":1,"forks_count":1,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-18T20:40:43.250Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"lgpl-2.1","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/crowdagger.png","metadata":{"files":{"readme":"README.md","changelog":"ChangeLog.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-06-26T23:40:58.000Z","updated_at":"2022-07-19T02:29:42.000Z","dependencies_parsed_at":"2022-08-20T11:50:24.270Z","dependency_job_id":null,"html_url":"https://github.com/crowdagger/caribon","commit_stats":null,"previous_names":["crowdagger/caribon","lise-henry/caribon"],"tags_count":15,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crowdagger%2Fcaribon","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crowdagger%2Fcaribon/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crowdagger%2Fcaribon/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crowdagger%2Fcaribon/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/crowdagger","download_url":"https://codeload.github.com/crowdagger/caribon/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://git
hub.com","kind":"github","repositories_count":245104460,"owners_count":20561377,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-11T13:45:53.349Z","updated_at":"2025-12-12T13:50:33.449Z","avatar_url":"https://github.com/crowdagger.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"Caribon\n=======\n\nA repetition detector written in Rust.\n\n[![Build Status](https://travis-ci.org/lise-henry/caribon.svg?branch=master)](https://travis-ci.org/lise-henry/caribon)\n\nDemo server\n===========\n\nIf you want to have an idea of what Caribon does without downloading\nanything, you can have a look at\n[this instance that runs Caribon as a web service](http://vps184889.ovh.net/caribon/).\n\nDownloading\n===========\n\nEither use `git` to get the latest version:\n\n`$ git clone https://github.com/lise-henry/caribon.git`\n\nor just download one of the stable(ish)\n[releases](https://github.com/lise-henry/caribon/releases).\n\n(If you only plan to use Caribon as a library for your rust program,\nyou don't need to worry too much about downloading and building, just\nadd `caribon = \"*\"` in your `Cargo.toml` file.)\n\n\nBuild\n=====\n\nYou'll need Rust and Cargo, see their [install instructions](http://www.rust-lang.org/install.html). Then\n\n`$ cargo build`\n\nshould do the job, though you can also run `cargo build --release`\ndirectly. 
You can then run caribon either with:\n\n`$ cargo run --release`\n\nor by directly executing the binary (in `target/debug` or\n`target/release`):\n\n`$ target/release/caribon`\n\nInstalling\n==========\n\nIf you have a recent version of `Cargo`, you can use:\n\n`$ cargo install caribon`\n\n(for the latest version on [crates.io](http://crates.io)) or\n\n`$ cargo install --git https://github.com/lise-henry/caribon`\n\n(for the latest version on GitHub),\n\nwhich will download, build and install Caribon.\n\nCargo run\n=========\n\nIf you don't want to install Caribon, `cargo run` might be the\nsimplest option. Note, though, that command-line arguments must\nbe prefixed by `--` so that Cargo passes them to the binary:\n\n`$ cargo run -- --input=some_text.txt`\n\nAlso note that, by default, `cargo run` builds and runs the program in\ndebug mode, which is slower. This isn't a problem for tiny files, but\nif you plan to detect repetitions in, say, a novel, using\nCPU-intensive features (such as fuzzy string matching, see below), you might want\nto run with `--release`:\n\n`$ cargo run --release -- --input=big_file.html --output=output_big_file.html`\n\nExamples\n========\n\nHere is an\n[example](https://lise-henry.github.io/rust/caribon-examples/example_readme.html)\nof Caribon using the HTML output of a (previous) version of this\nREADME, obtained with the following command:\n\n`$ caribon --language=english --input=README.html\n--output=example.html --fuzzy=0.5`\n\n(Note that `--fuzzy=0.5`, while useful to show that fuzzy string\nmatching does indeed work, is not a very sensible parameter as it is\nquite high (words only need to be 50% similar to be considered the\nsame, matching e.g. `just` and `rust`). 
For real-life usage, a lower value\nwould be recommended.)\n\nHere is another [example](https://lise-henry.github.io/rust/caribon-examples/screenshot.png), displaying repetitions in\n`README.md` to the terminal, using the following command:\n\n`$ caribon --language=english --input=README.md --fuzzy=0.5 | more`\n\n![example](https://lise-henry.github.io/rust/caribon-examples/screenshot.png)\n\nWhile outputting to the terminal might be useful for small files, HTML\noutput gives a more useful result, as highlighting a word will show\nyou the other occurrences of it.\n\nOptions\n=======\n\nCaribon provides a number of options. Here are\nexplanations of a few of them, from the most common to the more\nadvanced ones:\n\n### Language ###\n\n* `--language=[english|french|spanish|...]` specifies the language of the\n  input file. It is important for two reasons. The first one is that\n  Caribon internally uses a stemming library, which will detect when\n  words are derived from the same stem, e.g. \"eats\", \"eat\" and\n  \"eating\" will be considered the same word. (More information on how\n  this stemming library works can be found on the\n  [Snowball project website](http://snowball.tartarus.org/).) The\n  second reason is that for some languages (currently only French and\n  English), Caribon provides a default list of words to ignore for\n  repetition counting (e.g. in English \"it\", \"a\" and so on are on it)\n  to avoid cluttering the result file. It is possible to disable\n  stemming by using \"no_stemmer\" instead of a language. This isn't\n  really advised, but it might be useful if you want to try Caribon on\n  a language that isn't implemented.\n* `--list-languages` prints the list of languages supported by the\n  stemming library.\n\n### Input and output ###\n\n* `--input=[file]` specifies the input file. By default it is `stdin`,\nwhich means you'll have to type your text directly and end it with\n`control-D`. 
If `file` does not exist, the program aborts.\n* `--output=[file]` specifies the output file. It defaults to `stdout`,\nprinting the result to the terminal.\n\nThe extensions of the input and output filenames determine the input and\noutput formats, e.g. if you pass `--input=text.html --output=result.html`, Caribon will\ninfer that the content is in HTML and that it must also output HTML\n(so `$ caribon \u003c input.html \u003e output.html` is NOT equivalent to `$\ncaribon --input=input.html --output=output.html`: in the first case,\nCaribon will consider the input as raw text and will output in\n`terminal` format (see below), while in the second case it will\nunderstand that both files are HTML).\n\nIt is possible to override this behaviour by specifying\n\n* `--input-format=[text|html]` or\n* `--output-format=[terminal|html|markdown]`.\n\nA note on the `terminal` output format: it is designed to print text\nto the terminal, by underlining and colouring some words with UNIX\nterminal special characters (see screenshot above). It is, thus, only activated when no\noutput file name is given and Caribon prints to the standard output,\nHTML output being the default in most other cases.\n\n### Text statistics ###\n\n* `--print-stats`, if passed to Caribon, will also display some statistics\n  about the input text on the standard output.\n\n### Threshold and max-distance ###\n\nThe most useful algorithm of Caribon is local repetition\ndetection. It detects when a word is repeated within a given interval of\nwords. This interval is determined by\n\n* `--max-distance=[value]` (default is currently 50).\n\nSo basically, if `max-distance` is 50 and the word 'foo' occurs twice in this\ninterval, each occurrence will have a \"repetition value\" of 2. If\n'foo' is repeated a third time in a 50-word interval *after the second\noccurrence*, then each of these occurrences will have a repetition\nvalue of 3. 
(If there are then more than 50 words without an occurrence of\n'foo', and 'foo' appears again, the value of the latest occurrence\nwill be reset to 1).\n\nWords are underlined when their \"repetition value\" is higher than a\nthreshold, which can be set by:\n\n* `--threshold=[value]`. The default is `1.9`, so a word will be underlined\n  as soon as it occurs twice locally. If you change the threshold to, say,\n  `2.5`, a word will have to occur three times (locally) to be\n  underlined.\n\n(Why a float value for the threshold, instead of an integer one?\nBecause the local repetition detector will underline words in\ndifferent colours: green, orange and red according to the \"severity\" of\nthe repetitions. So setting the threshold to `1.01` or `1.99` will not\nchange which words are underlined, but they will be in orange or red\nmore quickly in the first case.)\n\n### Fuzzy string matching ###\n\nCaribon uses a stemming library to detect words that are part of the\nsame 'family'. It turns out that this algorithm is not always\nenough; in particular, it doesn't detect repetitions when there is a\ntypo (e.g. \"higlight\" and \"highlight\" should probably be considered a\nrepetition, even if it is misspelled in the first case). 
To solve this\nissue, there is the option of activating fuzzy string matching:\n\n* `--fuzzy=[value]`, where the value is a number between 0.0 and 1.0 which\n  represents the maximal 'difference' between two words for them to\n  still be considered the same: a value of 0.2 means that two words can be at\n  most \"20% different\" and still be considered the same.\n\nInternally, this algorithm uses the\n[Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance)\n(and more specifically the\n[Rust implementation by Florian Ebelling](https://crates.io/crates/edit-distance))\nwhich computes a distance between two strings by counting the number\nof insertions, deletions and substitutions required to go from one to\nthe other. E.g., \"dog\" and \"dogs\" have a distance of 1, while \"dog\" and\n\"cat\" have a distance of 3. This value is then divided by the length\nof the string to match, and two strings are considered \"identical\" (or,\nat least, a repetition) when this value is less than the value given\nto `--fuzzy=`.\n\nE.g., with `--fuzzy=0.2`, \"highlight\" and \"higlight\" will have a\n\"difference\" estimated at 1/9 (Levenshtein distance of 1, as it only needs\none deletion to go from the first to the second, divided by the length\nof \"highlight\", 9), so it will be a repetition. \"Just\" and \"Rust\" will\nhave a \"difference\" of 1/4, so won't be considered a repetition.\n\nFuzzy matching is practical, but you should not set it to too high a\nvalue, or you will get a lot of false positives. Empirically, `0.2`\nor `0.25` is a good choice.\n\nFuzzy matching has a drawback: it requires a lot more CPU. 
Caribon\nstill manages to run reasonably fast (e.g., less than a second to\ndetect repetitions in a whole novel, with fuzzy string matching\nactivated) but it only uses fuzzy string matching for local\nrepetitions, and not for global ones (see below).\n\n### Global repetitions ###\n\nBy default, Caribon only detects repetitions at a local level (if they\nare separated by fewer than `max-distance` words). It is,\nhowever, possible to activate global repetition detection with:\n\n* `--global-threshold=[value]`, value being (again) a number between 0.0\nand 1.0.\n\nIn this case, a word will be considered a repetition (even if it is\nnever repeated in a `max-distance` range of words) if the relative\nnumber of occurrences is higher than the global threshold. I.e., if\n`global-threshold` is set to 0.01, a word will be highlighted (in\nblue) if it represents more than 1% of the total number of words in\nthe document.\n\n### Ignored words ###\n\nSome words, like \"a\" or \"the\", are unavoidably repeated a\nlot and it doesn't make much sense to consider them a repetition. It\nis thus useful to ignore some words. `Caribon` provides a\ndefault list for English and French, but it is in all cases possible\nto provide your own with:\n\n* `--ignore=\"list of common words\"`.\n\nThe words in this list must be separated by either spaces or commas (or, actually,\nanything that isn't a letter), and the list must be enclosed in\nquotes. This list *replaces* the default one\nprovided by Caribon (for English and French, at least). If you want to\n*add* words to the default list instead of replacing it, use:\n\n* `--add-ignored=\"list of more ignored words\"`\n\nAnother option for ignoring words is:\n\n* `--ignore-proper=[true|false]` (default is false)\n\nIf set to true, Caribon will try to ignore proper nouns. That is, a word will not\ncount as a repetition if it starts with a capital letter and\nis not at the beginning of a sentence.\n\nLibrary\n=======\n\nIt is possible to use Caribon as a Rust library. 
The documentation is\navailable [here](http://lise-henry.github.io/rust/caribon/index.html); in order to\nbe certain to have the documentation version corresponding to the code\nyou downloaded, you can also generate it with\n`cargo doc`.\n\nThe Caribon library is also available on\n[Crates.io](https://crates.io/crates/caribon), allowing you to easily\nuse it in any Cargo project: just add\n\n`caribon = \"0.7\"`\n\nin the dependencies section of your\n`Cargo.toml` file.\n\nCaribon-server\n==============\n\nIf you are not a big fan of command-line interfaces, you can have a\nlook at [Caribon-server](https://github.com/lise-henry/caribon-server), which\nruns Caribon as a web service (using the\n[Iron framework](https://github.com/iron/iron)). See [here for an\ninstance running it](http://vps184889.ovh.net/caribon/).\n\nCurrent features\n================\n\n* Built-in lists of ignored words (common words whose repetitions don't\n  matter) for French and English, though they are not complete.\n* Stemming support for languages supported by the Snowball (http://snowball.tartarus.org/)\n  project.\n* Additionally (because stemming algorithms aren't always perfect, and sometimes\n  you make typos), support for fuzzy string matching (based on Levenshtein distance).\n* Counts repetitions locally and globally.\n* Detects HTML tags in input. 
Normally works for both HTML fragments\n  and full HTML pages.\n* Outputs the detected repetitions either in an HTML file (the most\n  useful option), directly to the terminal, or to a Markdown file (with less useful information).\n\nChangeLog\n=========\n\n[See here](ChangeLog.md).\n\nLicense\n=======\n\nCaribon is licensed under the\n[GNU Lesser General Public License](LICENSE), version 2.1\nor (at your option) any later version.\n\nCredits\n=======\n\nCaribon is written by Élisabeth Henry `\u003cliz.henry at ouvaton.org\u003e`.\n\nThis software uses (Rust bindings to) the\n[C Stemming library](http://snowball.tartarus.org/dist/libstemmer_c.tgz)\nwritten by Dr Martin Porter, licensed under the BSD License.\n\nIt also uses the [Rust implementation](https://crates.io/crates/edit-distance) of\nLevenshtein distance written by Florian Ebelling, licensed under the Apache 2.0 License.\n\nToDo\n====\n\nLibrary\n-------\n* Make colour highlighting more configurable;\n* Complete builtin lists of ignored words and provide them for other\n  languages (currently, only French and English);\n* Provide an algorithm to detect repetitions of expressions, not just\n  single words;\n* Make library callable from C (and languages other than Rust);\n* Enhance documentation and add tests.\n\nProgram\n-------\n* Add options to select highlighting colours\n* Find better default values?\n* Make different repositories for program and library?\n* Add a variant with GUI (Gtk+?)?\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcrowdagger%2Fcaribon","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcrowdagger%2Fcaribon","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcrowdagger%2Fcaribon/lists"}