{"id":13896755,"url":"https://github.com/leafo/lapis-bayes","last_synced_at":"2025-02-23T08:13:56.279Z","repository":{"id":36458476,"uuid":"40763558","full_name":"leafo/lapis-bayes","owner":"leafo","description":"Naive Bayes classifier for use in Lua","archived":false,"fork":false,"pushed_at":"2024-03-04T17:40:29.000Z","size":143,"stargazers_count":30,"open_issues_count":0,"forks_count":3,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-01-04T17:28:59.927Z","etag":null,"topics":["classifier","lapis","moonscript","naive-bayes-classifier"],"latest_commit_sha":null,"homepage":"","language":"MoonScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/leafo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2015-08-15T14:14:47.000Z","updated_at":"2024-03-04T00:41:53.000Z","dependencies_parsed_at":"2024-06-19T20:06:09.857Z","dependency_job_id":"2b2308f7-76bb-49e9-8bc5-676c1d179da7","html_url":"https://github.com/leafo/lapis-bayes","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/leafo%2Flapis-bayes","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/leafo%2Flapis-bayes/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/leafo%2Flapis-bayes/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/leafo%2Flapis-bayes/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/leafo","download_url":"https://cod
eload.github.com/leafo/lapis-bayes/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240286520,"owners_count":19777353,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["classifier","lapis","moonscript","naive-bayes-classifier"],"created_at":"2024-08-06T18:03:08.141Z","updated_at":"2025-02-23T08:13:56.241Z","avatar_url":"https://github.com/leafo.png","language":"MoonScript","readme":"# lapis-bayes\n\n![test](https://github.com/leafo/lapis-bayes/workflows/test/badge.svg)\n\n`lapis-bayes` is a [Naive Bayes\nclassifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) for use in\nLua. It can be used to classify text into any category that has been trained\nfor ahead of time.\n\nIt's built on top of [Lapis](http://leafo.net/lapis), but can be used as a\nstandalone library as well. 
It requires PostgreSQL to store and parse training\ndata.\n\n## Install\n\n```bash\n$ luarocks install lapis-bayes\n```\n\n## Quick start\n\nCreate a new migration that looks like this:\n\n```lua\n-- migrations.lua\n{\n  ...\n\n  [1439944992] = require(\"lapis.bayes.schema\").run_migrations\n}\n```\n\nRun migrations:\n\n```bash\n$ lapis migrate\n```\n\nTrain the classifier:\n\n```lua\nlocal bayes = require(\"lapis.bayes\")\n\nbayes.train_text(\"spam\", \"Cheap Prom Dresses 2014 - Buy discount Prom Dress\")\nbayes.train_text(\"spam\", \"Older Models Rolex Watches - View Large Selection of Rolex\")\nbayes.train_text(\"spam\", \"Hourglass Underwire - $125.00 : Professional swimwear\")\n\nbayes.train_text(\"ham\", \"Games I've downloaded so I remember them and stuff\")\nbayes.train_text(\"ham\", \"Secret Tunnel's Collection of Rad Games That I Dig\")\nbayes.train_text(\"ham\", \"Things I need to pay for when I get my credit card back\")\n```\n\nClassify text:\n\n```lua\nassert(\"ham\" == bayes.classify_text({\"spam\", \"ham\"}, \"Games to download\"))\nassert(\"spam\" == bayes.classify_text({\"spam\", \"ham\"}, \"discount rolex watch\"))\n```\n\n## Reference\n\nThe `lapis.bayes` module includes a set of functions that operate on the\ndefault classifier and tokenizer:\n\n* `BayesClassifier`\n* `PostgresTextTokenizer`\n\nIf the classifier or tokenizer needs to be customized, the classes will\nneed to be manually instantiated.\n\n#### `num_words = bayes.train_text(category, text)`\n\n```lua\nlocal bayes = require(\"lapis.bayes\")\nbayes.train_text(\"spam\", \"Cheap Prom Dresses 2014 - Buy discount Prom Dress\")\n```\n\nInserts the tokenized words from `text` into the database associated with the\ncategory named `category`. Categories don't need to be created ahead of time;\nuse any name you'd like. Later, when classifying text, you'll list all the\neligible categories.\n\nThe tokenizer will normalize words and remove stop words before inserting into\nthe database. 
The number of words kept from the original text is returned.\n\n#### `category, score = bayes.classify_text({category1, category2, ...}, text)`\n\n```lua\nlocal bayes = require(\"lapis.bayes\")\nprint(bayes.classify_text({\"spam\", \"ham\"}, \"Games to download\"))\n```\n\nAttempts to classify text. If none of the words in `text` are available in any\nof the listed categories, then `nil` and an error message are returned.\n\nReturns the name of the category that best matches, along with a probability\nscore in natural log (`math.log`). The closer to 0 this is, the better the\nmatch.\n\nThe input text is normalized using the same tokenizer as the trainer: stop\nwords are removed and stems are used. Only words that are available in at least\none category are used for the classification.\n\n## Tokenization\n\nWhenever a string is passed to any train or classify function, it's passed through the default tokenizer to turn the string into an array of *words*.\n\n* For classification, these words are used to check the database for existing probabilities\n* For training, the words are inserted directly into the database\n\nTokenization is more complicated than just splitting the string by spaces: text\ncan be normalized and extraneous data can be stripped.\n\nSometimes, you may want to explicitly provide the words for insertion and\nclassification. You can bypass tokenization by passing an array of words in\nplace of the string when calling any classify or train function.\n\nYou can customize the tokenizer by providing a `tokenize_text` option. This\nshould be a function that takes a single argument, the string of text, and\nreturns an array of tokens. 
For example:\n\n```lua\nlocal bayes = require(\"lapis.bayes\")\nbayes.train_text(\"spam\", \"Cheap Prom Dresses 2014 - Buy discount Prom Dress\", {\n  tokenize_text = function(text)\n    -- your custom tokenizer goes here\n    return {tok1, tok2, ...}\n  end\n})\n```\n\n### Built-in tokenizers\n\n*Postgres Text* is the default tokenizer used when no tokenizer is provided.\nYou can customize the tokenizer when instantiating the classifier:\n\n```lua\nBayesClassifier = require \"lapis.bayes.classifiers.bayes\"\nclassifier = BayesClassifier({\n  tokenizer = \u003ctokenizer instance\u003e\n})\n\nlocal result = classifier:classify_text(...)\n```\n\n#### Postgres Text\n\nUses Postgres `tsvector` objects to normalize text. This will remove stop\nwords, normalize capitalization and symbols, and convert words to lexemes.\nDuplicates are removed.\n\n\u003e Note: The characteristics of this tokenizer may not be suitable for your\n\u003e objectives with spam detection. If you have very specific training data,\n\u003e preserving symbols, capitalization, and duplication could be beneficial. This\n\u003e tokenizer aims to generalize spam text to match a wider range of text that\n\u003e may not have specific training.\n\nThis tokenizer requires an active connection to a Postgres database (provided\nin the Lapis config). It will issue queries when tokenizing. 
The tokenizer\nuses a query that is specific to English:\n\n```sql\nselect unnest(tsvector_to_array(to_tsvector('english', 'my text here'))) as word\n```\n\nExample:\n\n```lua\nlocal Tokenizer = require \"lapis.bayes.tokenizers.postgres_text\"\n\nlocal t = Tokenizer(opts)\n\nlocal tokens = t:tokenize_text(\"Hello world This Is my tests example\") --\u003e {\"exampl\", \"hello\", \"test\", \"world\"}\n\nlocal tokens2 = t:tokenize_text([[\n  \u003cdiv class='what is going on'\u003ehello world\u003ca href=\"http://leafo.net/hi.png\"\u003emy image\u003c/a\u003e\u003c/div\u003e\n]]) --\u003e {\"hello\", \"imag\", \"world\"}\n```\n\nTokenizer options:\n\n* `min_len`: minimum token length (default `2`)\n* `max_len`: maximum token length (default `12`); tokens that don't fulfill the length requirements will be excluded, not truncated\n* `strip_numbers`: remove tokens that are numbers (default `true`)\n* `symbols_split_tokens`: split apart tokens that contain a symbol before tokenization, e.g. 
`hello:world` goes to `hello world` (default `false`)\n* `filter_text`: custom pre-filter function to process incoming text; takes the text as its first argument and should return text (optional, default `nil`)\n* `filter_tokens`: custom post-filter function to process output tokens; takes a token array and should return a token array (optional, default `nil`)\n\n#### URL Domains\n\nExtracts mentions of domains from the text; all other text is ignored.\n\n```lua\nlocal Tokenizer = require \"lapis.bayes.tokenizers.url_domains\"\n\nlocal t = Tokenizer(opts)\nlocal tokens = t:tokenize_text([[\n  Please go to my https://leafo.net website \u003ca href='itch.io'\u003ehmm\u003c/a\u003e\n]]) --\u003e {\"leafo.net\", \"itch.io\"}\n```\n\n## Schema\n\n`lapis-bayes` creates two tables:\n\n* `lapis_bayes_categories`\n* `lapis_bayes_word_classifications`\n\n## Running outside of Lapis\n\n### Creating a configuration\n\nIf you're not running `lapis-bayes` directly inside of Lapis, you'll need to\ncreate a configuration file that instructs your script on how to connect to a\ndatabase.\n\nIn your project root, create `config.lua`:\n\n```lua\n-- config.lua\nlocal config = require(\"lapis.config\")\n\nconfig(\"development\", {\n  postgres = {\n    database = \"lapis_bayes\"\n  }\n})\n```\n\nThe example above provides the minimum required for `lapis-bayes` to connect to\na PostgreSQL database. You're responsible for creating the actual database if\nit doesn't already exist.\n\nFor PostgreSQL you might run the command:\n\n```bash\n$ createdb -U postgres lapis_bayes\n```\n\nWe're using the standard Lapis configuration format; you can read more about it\nhere: http://leafo.net/lapis/reference/configuration.html\n\n### Creating the schema\n\nAfter the database connection has been established, the schema (database tables)\nneeds to be created. 
This is done using [Lapis'\nmigrations](http://leafo.net/lapis/reference/database.html#database-migrations).\n\nCreate a file, `migrations.lua`, and make it look like the following:\n\n\n```lua\n-- migrations.lua\n\nreturn {\n  require(\"lapis.bayes.schema\").run_migrations\n}\n```\n\nYou can test your configuration now by running the migrations with the\nfollowing command from your shell: *(Note you must be in the same directory as\nyour code, `migrations.lua` and `config.lua`)*\n\n```bash\n$ lapis migrate\n```\n\nYou're now ready to start training and classifying text! (Go back to the top of\nthis document for the tutorial)\n\n# Contact\n\nAuthor: Leaf Corcoran (leafo) ([@moonscript](http://twitter.com/moonscript))  \nEmail: leafot@gmail.com  \nHomepage: \u003chttp://leafo.net\u003e  \nLicense: MIT\n\n\n","funding_links":[],"categories":["MoonScript"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fleafo%2Flapis-bayes","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fleafo%2Flapis-bayes","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fleafo%2Flapis-bayes/lists"}