{"id":13508694,"url":"https://github.com/abitdodgy/gibran","last_synced_at":"2025-10-21T17:42:10.262Z","repository":{"id":62429805,"uuid":"43402130","full_name":"abitdodgy/gibran","owner":"abitdodgy","description":"Gibran is an Elixir natural language processor, and a port of WordsCounted.","archived":false,"fork":false,"pushed_at":"2017-04-23T19:38:21.000Z","size":49,"stargazers_count":65,"open_issues_count":3,"forks_count":3,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-10-07T00:46:48.849Z","etag":null,"topics":["elixir-lang","natural-language-processing","nlp"],"latest_commit_sha":null,"homepage":"http://hexdocs.pm/gibran","language":"Elixir","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/abitdodgy.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-09-30T00:22:48.000Z","updated_at":"2024-06-26T05:26:39.000Z","dependencies_parsed_at":"2022-11-01T20:06:17.479Z","dependency_job_id":null,"html_url":"https://github.com/abitdodgy/gibran","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/abitdodgy/gibran","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abitdodgy%2Fgibran","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abitdodgy%2Fgibran/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abitdodgy%2Fgibran/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abitdodgy%2Fgibran/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/abitdodgy","download_url":"https://codeload.github.com/abitdodgy/gibran/tar
.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abitdodgy%2Fgibran/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":280304009,"owners_count":26307859,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-21T02:00:06.614Z","response_time":58,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["elixir-lang","natural-language-processing","nlp"],"created_at":"2024-08-01T02:00:57.079Z","updated_at":"2025-10-21T17:42:10.216Z","avatar_url":"https://github.com/abitdodgy.png","language":"Elixir","funding_links":[],"categories":["Natural Language Processing (NLP)"],"sub_categories":[],"readme":"Gibran\n=========\n\n\u003e Yesterday is but today's memory, and tomorrow is today's dream.\n\n![Gibran](http://d.gr-assets.com/authors/1353732571p5/6466154.jpg)\n\n[Gibran][2] is an Elixir natural language processor. 
Lofty goals for Gibran include:\n\n- Metaphone phonetic coding system\n- Soundex algorithm\n- Porter Stemming algorithm\n- String similarity as [described by Simon White](http://www.catalysoft.com/articles/StrikeAMatch.html)\n\nCurrently, Gibran ships with the following features:\n\n- Token count, unique token count, and character count\n- Average characters per token\n- `HashDict`s of tokens and their frequencies, lengths, and densities\n- The longest token(s) and its length\n- The most frequent token(s) and its frequency\n- Unique tokens\n- Levenshtein distance algorithm\n\n## Usage\n\nLet's start with something simple.\n\n```elixir\nalias Gibran.Tokeniser\nalias Gibran.Counter\n\nstr = \"Yesterday is but today's memory, and tomorrow is today's dream.\"\nTokeniser.tokenise(str)\n# =\u003e [\"yesterday\", \"is\", \"but\", \"today's\", \"memory\", \"and\", \"tomorrow\", \"is\", \"today's\", \"dream\"]\n\nTokeniser.tokenise(str) |\u003e Counter.uniq_token_count\n# =\u003e 8\n```\n\nBy default Gibran uses the following regular expression to tokenise strings: `~r/[^\\p{L}'-]/u`. You can provide your own regular expression through the `pattern` option. 
You can combine `pattern` with `exclude` to create sophisticated tokenisation strategies.\n\n```elixir\nTokeniser.tokenise(str, exclude: \u0026String.length(\u00261) \u003c 4) |\u003e Counter.token_count\n# =\u003e 6\n```\n\nThe `exclude` option accepts a string, a function, a regular expression, or a list combining any of those types.\n\n```elixir\n# Using `exclude` with a function.\nTokeniser.tokenise(\"Kingdom of the Imagination\", exclude: \u0026(String.length(\u00261) \u003c 10))\n# =\u003e [\"imagination\"]\n\n# Using `exclude` with a regular expression.\nTokeniser.tokenise(\"Sand and Foam\", exclude: ~r/and/)\n# =\u003e [\"foam\"]\n\n# Using `exclude` with a string.\nTokeniser.tokenise(\"Eye of The Prophet\", exclude: \"eye of\")\n# =\u003e [\"the\", \"prophet\"]\n\n# Using `exclude` with a list that combines several types.\nTokeniser.tokenise(\"Eye of The Prophet\", exclude: [\"eye\", \u0026(String.ends_with?(\u00261, \"he\")), ~r/of/])\n# =\u003e [\"prophet\"]\n```\n\nGibran provides a shortcut for working with strings directly (instead of running them through the tokeniser first).\n\n```elixir\nGibran.from_string(str, :token_count, opts: [exclude: \u0026String.length(\u00261) \u003c 4])\n# =\u003e 6\n```\n\nTo avoid inconsistencies that arise from character casing, Gibran normalises input before applying transformations.\n\n### Levenshtein distance\n\nOrdinary use:\n\n```elixir\niex(1)\u003e Gibran.Levenshtein.distance(\"kitten\", \"sitting\")\n3\n```\n\nThe Levenshtein distance for the same string is 0.\n\n```elixir\niex(2)\u003e Gibran.Levenshtein.distance(\"snail\", \"snail\")\n0\n```\n\nThe Levenshtein distance is case-sensitive.\n\n```elixir\niex(3)\u003e Gibran.Levenshtein.distance(\"HOUSEBOAT\", \"houseboat\")\n9\n```\n\nThe function can accept charlists as well as strings.\n\n```elixir\niex(4)\u003e Gibran.Levenshtein.distance('jogging', 'logger')\n4\n```\n\nThe doctests contain extensive usage examples. 
Please take a look there for more details.\n\n  [1]: https://github.com/abitdodgy/words_counted\n  [2]: https://en.wikipedia.org/wiki/Kahlil_Gibran\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabitdodgy%2Fgibran","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fabitdodgy%2Fgibran","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabitdodgy%2Fgibran/lists"}