{"id":13423705,"url":"https://github.com/bnosac/ruimtehol","last_synced_at":"2025-04-07T09:20:12.785Z","repository":{"id":56935428,"uuid":"149111454","full_name":"bnosac/ruimtehol","owner":"bnosac","description":"R package to Embed All the Things! using StarSpace","archived":false,"fork":false,"pushed_at":"2024-02-23T08:21:33.000Z","size":41171,"stargazers_count":101,"open_issues_count":19,"forks_count":13,"subscribers_count":10,"default_branch":"master","last_synced_at":"2025-03-31T07:07:32.735Z","etag":null,"topics":["classification","embeddings","natural-language-processing","nlp","r","similarity","starspace","text-mining"],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mpl-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bnosac.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-09-17T10:55:58.000Z","updated_at":"2024-12-30T22:23:48.000Z","dependencies_parsed_at":"2024-05-01T18:38:05.272Z","dependency_job_id":null,"html_url":"https://github.com/bnosac/ruimtehol","commit_stats":null,"previous_names":[],"tags_count":9,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bnosac%2Fruimtehol","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bnosac%2Fruimtehol/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bnosac%2Fruimtehol/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bnosac%2Fruimtehol/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owner
s/bnosac","download_url":"https://codeload.github.com/bnosac/ruimtehol/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247622983,"owners_count":20968575,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["classification","embeddings","natural-language-processing","nlp","r","similarity","starspace","text-mining"],"created_at":"2024-07-31T00:00:40.971Z","updated_at":"2025-04-07T09:20:12.764Z","avatar_url":"https://github.com/bnosac.png","language":"C++","funding_links":[],"categories":["C++"],"sub_categories":[],"readme":"# ruimtehol: R package to Embed All the Things! using StarSpace\n\nThis repository contains an R package which wraps the StarSpace C++ library (https://github.com/facebookresearch/StarSpace), allowing the following:\n\n- Text classification\n- Learning word, sentence or document level embeddings\n- Finding sentence or document similarity\n- Ranking web documents\n- Content-based recommendation (e.g. recommend text/music based on the content)\n- Collaborative filtering based recommendation (e.g. 
recommend text/music based on interest)\n- Identification of entity relationships\n\n\u003cimg src=\"vignettes/logo-ruimtehol.png\" width=\"600\"\u003e\n\n\n\n## Installation\n\n- For regular users, install the package from your local CRAN mirror: `install.packages(\"ruimtehol\")`\n- To install the development version of this package: `devtools::install_github(\"bnosac/ruimtehol\", build_vignettes = TRUE)`\n\nSee the vignette and the documentation of the functions:\n\n```r\nvignette(\"ground-control-to-ruimtehol\", package = \"ruimtehol\")\nhelp(package = \"ruimtehol\")\n```\n\n\n## Main functionalities\n\nThis R package allows you to *build Starspace models* on your own text, *get embeddings* of words/ngrams/sentences/documents/labels, get *predictions* from a model (e.g. classification / ranking), and get *nearest neighbours similarity*.\n\nThe following functions are made available.\n\n| Function                      | Functionality                                                  |\n|-------------------------------|----------------------------------------------------------------|\n| `starspace`                   | Low-level interface to build a Starspace model                 |\n| `starspace_load_model`        | Load a pre-trained model or a tab-separated file               |\n| `starspace_save_model`        | Save a Starspace model                                         |\n| `starspace_embedding`         | Get embeddings of documents/words/ngrams/labels                |\n| `starspace_knn`               | Find k-nearest neighbouring information for new text           |\n| `starspace_dictionary`        | Get words/labels part of the model dictionary                  |\n| `predict.textspace`           | Get predictions from a Starspace model                         |\n| `as.matrix`                   | Get words and label embeddings                                 |\n| `embedding_similarity`        | Cosine/dot product similarity between embeddings - top-n most 
similar text                  |\n| `embed_wordspace`             | Build a Starspace model which calculates word/ngram embeddings                              |\n| `embed_sentencespace`         | Build a Starspace model which calculates sentence embeddings                                |\n| `embed_articlespace`          | Build a Starspace model for embedding articles - sentence-article similarities              |\n| `embed_tagspace`              | Build a Starspace model for multi-label classification                                      |\n| `embed_docspace`              | Build a Starspace model for content-based recommendation                                    |\n| `embed_pagespace`             | Build a Starspace model for interest-based recommendation                                   |\n| `embed_entityrelationspace`   | Build a Starspace model for entity relationship completion                                  |\n\n\n\n## Example\n\n\n### Short example showing word embeddings\n\n\n```r\nlibrary(ruimtehol)\nset.seed(123456789)\n\n## Get some training data\ndownload.file(\"https://s3.amazonaws.com/fair-data/starspace/wikipedia_train250k.tgz\", \"wikipedia_train250k.tgz\")\nx \u003c- readLines(\"wikipedia_train250k.tgz\", encoding = \"UTF-8\")\nx \u003c- x[-c(1:9)]\nx \u003c- x[sample(x = length(x), size = 10000)]\nwriteLines(text = x, sep = \"\\n\", con = \"wikipedia_train10k.txt\")\n```\n\n```r\n## Train\nset.seed(123456789)\nmodel \u003c- starspace(file = \"wikipedia_train10k.txt\", fileFormat = \"labelDoc\", dim = 10, trainMode = 3)\nmodel\n\nObject of class textspace\n dimension of the embedding: 10\n training arguments:\n      loss: hinge\n      margin: 0.05\n      similarity: cosine\n      epoch: 5\n      adagrad: TRUE\n      lr: 0.01\n      termLr: 1e-09\n      norm: 1\n      maxNegSamples: 10\n      negSearchLimit: 50\n      p: 0.5\n      shareEmb: TRUE\n      ws: 5\n      dropoutLHS: 0\n      dropoutRHS: 0\n      initRandSd: 
0.001\n```\n\n```r\n## Get the word/label embeddings as a matrix and the model dictionary\nembedding \u003c- as.matrix(model)\ndictionary \u003c- starspace_dictionary(model)\n\n## Look up the embeddings of 2 words\nembedding[c(\"school\", \"house\"), ]\n\n              [,1]         [,2]        [,3]        [,4]         [,5]        [,6]       [,7]       [,8]         [,9]       [,10]\nschool 0.008395348  0.002858619 0.004770191 -0.03791502 -0.016193179 0.008368539 -0.0221493 0.01587386 -0.002012054 0.029385706\nhouse  0.005371093 -0.007831781 0.010563998  0.01040361  0.000616577 0.005770847 -0.0097075 0.01678141 -0.004738560 0.009139475\n```\n\n```r\n## Save trained model as a binary file or as TSV so that you can inspect the embeddings e.g. with data.table::fread(\"wikipedia_embeddings.tsv\")\nstarspace_save_model(model, file = \"textspace.ruimtehol\",      method = \"ruimtehol\")\nstarspace_save_model(model, file = \"wikipedia_embeddings.tsv\", method = \"tsv-data.table\")\n## Load a pre-trained model or pre-trained embeddings\nmodel \u003c- starspace_load_model(\"textspace.ruimtehol\",      method = \"ruimtehol\")\nmodel \u003c- starspace_load_model(\"wikipedia_embeddings.tsv\", method = \"tsv-data.table\", trainMode = 3)\n\n## Get the document embedding\nstarspace_embedding(model, \"get the embedding of a full document\")\n\n                                          [,1]        [,2]      [,3]       [,4]      [,5]      [,6]       [,7]      [,8]     [,9]     [,10]\nget the embedding of a full document 0.1489144 -0.09543591 0.1242385 -0.1080941 0.6971645 0.3131362 -0.3405705 0.3293449 0.231894 -0.281555\n```\n\nThe following functions do similar things. 
They find the word or sentence closest to a provided sentence.\n\n```r\n## What is the closest term in the dictionary\nstarspace_knn(model, \"What does this bunch of text look like\", k = 10)\n\n## What is the closest sentence in a vector of sentences\npredict(model, newdata = \"what does this bunch of text look like\",\n        basedoc = c(\"what does this bunch of text look like\",\n                    \"word abracadabra was not part of the dictionary\",\n                    \"give me back my mojo\",\n                    \"cosine distance is what i show\"))\n\n## Get cosine similarity between 2 sentence vectors\nembedding_similarity(\n  starspace_embedding(model, \"what does this bunch of text look like\"),\n  starspace_embedding(model, \"word abracadabra was not part of the dictionary\"),\n  type = \"cosine\")\n```\n\n### Short example showing classification modelling (tagspace)\n\n\nBelow, Starspace is used for classification:\n\n```r\nlibrary(ruimtehol)\ndata(\"dekamer\", package = \"ruimtehol\")\ndekamer$x \u003c- strsplit(dekamer$question, \"\\\\W\")\ndekamer$x \u003c- sapply(dekamer$x, FUN = function(x) paste(setdiff(x, \"\"), collapse = \" \"))\ndekamer$x \u003c- tolower(dekamer$x)\ndekamer$y \u003c- strsplit(dekamer$question_theme, split = \",\")\ndekamer$y \u003c- lapply(dekamer$y, FUN = function(x) gsub(\" \", \"-\", x))\n\nset.seed(123456789)\nmodel \u003c- embed_tagspace(x = dekamer$x, y = dekamer$y,\n                        dim = 50,\n                        lr = 0.01, epoch = 40, loss = \"softmax\", adagrad = TRUE,\n                        similarity = \"cosine\", negSearchLimit = 50,\n                        ngrams = 2, minCount = 2)\nplot(model)\n\ntext \u003c- c(\"de nmbs heeft het treinaanbod uitgebreid via onteigening ...\",\n          \"de migranten komen naar europa de asielcentra ...\")\npredict(model, text, k = 3)\npredict(model, \"koning filip\", k = 10, type = 
\"knn\")\npredict(model, \"koning filip\", k = 10, type = \"embedding\")\n```\n\n## Notes\n\n- Why did you call the package ruimtehol? Because that is the translation of StarSpace in WestVlaams.\n- The R wrapper is distributed under the Mozilla Public License 2.0. The package contains a copy of the StarSpace C++ code (namely all code under src/Starspace) which has a BSD license (which is available in file LICENSE.notes) and also has an accompanying PATENTS file which you can inspect [here](inst/PATENTS).\n\n## Support in text mining\n\nNeed support in text mining?\nContact BNOSAC: http://www.bnosac.be\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbnosac%2Fruimtehol","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbnosac%2Fruimtehol","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbnosac%2Fruimtehol/lists"}