{"id":19723142,"url":"https://github.com/bnosac/doc2vec","last_synced_at":"2025-04-29T22:30:54.481Z","repository":{"id":55894366,"uuid":"312112015","full_name":"bnosac/doc2vec","owner":"bnosac","description":"Distributed Representations of Sentences and Documents","archived":false,"fork":false,"pushed_at":"2021-11-11T11:49:12.000Z","size":3351,"stargazers_count":48,"open_issues_count":9,"forks_count":7,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-04-16T13:40:52.454Z","etag":null,"topics":["doc2vec","embeddings","natural-language-processing","paragraph2vec","r-package","word2vec"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bnosac.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-11-11T23:05:53.000Z","updated_at":"2024-12-11T06:22:38.000Z","dependencies_parsed_at":"2022-08-15T08:50:40.211Z","dependency_job_id":null,"html_url":"https://github.com/bnosac/doc2vec","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bnosac%2Fdoc2vec","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bnosac%2Fdoc2vec/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bnosac%2Fdoc2vec/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bnosac%2Fdoc2vec/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bnosac","download_url":"https://codeload.github.com/bnosac/doc2vec/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"htt
ps://github.com","kind":"github","repositories_count":251592936,"owners_count":21614445,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["doc2vec","embeddings","natural-language-processing","paragraph2vec","r-package","word2vec"],"created_at":"2024-11-11T23:19:41.320Z","updated_at":"2025-04-29T22:30:52.463Z","avatar_url":"https://github.com/bnosac.png","language":"C++","readme":"# doc2vec \n\nThis repository contains an R package for building `Paragraph Vector` models, also known as `doc2vec` models. You can train the distributed memory ('PV-DM') and the distributed bag of words ('PV-DBOW') models. \nIt also lets you build a `top2vec` model to cluster documents based on these embeddings.\n\n- doc2vec is based on the paper *Distributed Representations of Sentences and Documents* [Mikolov et al.](https://arxiv.org/pdf/1405.4053.pdf), while top2vec is based on the paper *Distributed Representations of Topics* [Angelov](https://arxiv.org/abs/2008.09470)\n- The doc2vec part is an Rcpp wrapper around https://github.com/hiyijian/doc2vec\n- The package allows one \n    - to train paragraph embeddings (also known as document embeddings) on character data or data in a text file\n    - to use the embeddings to find similar documents, paragraphs, sentences or words\n    - to cluster document embeddings using top2vec\n- Note. 
For word vectors in R, see the package https://github.com/bnosac/word2vec, details [here](https://www.bnosac.be/index.php/blog/100-word2vec-in-r); for Starspace embeddings, see the package https://github.com/bnosac/ruimtehol, details [here](https://CRAN.R-project.org/package=ruimtehol/vignettes/ground-control-to-ruimtehol.pdf)\n\n## Installation\n\n- For regular users, install the package from your local CRAN mirror with `install.packages(\"doc2vec\")`\n- To install the development version of this package: `remotes::install_github(\"bnosac/doc2vec\")`\n\nConsult the documentation of the functions\n\n\n```r\nhelp(package = \"doc2vec\")\n```\n\n\n## Example on doc2vec\n\n- Take some data and standardise it a bit. \n    - Make sure it has columns doc_id and text \n    - Make sure that each text has fewer than 1000 words (words are considered separated by a single space)\n    - Make sure that each text does not contain newline symbols \n\n\n```r\nlibrary(doc2vec)\nlibrary(tokenizers.bpe)\nlibrary(udpipe)\ndata(belgium_parliament, package = \"tokenizers.bpe\")\nx \u003c- subset(belgium_parliament, language %in% \"dutch\")\nx \u003c- data.frame(doc_id = sprintf(\"doc_%s\", 1:nrow(x)), \n                text   = x$text, \n                stringsAsFactors = FALSE)\nx$text   \u003c- tolower(x$text)\nx$text   \u003c- gsub(\"[^[:alpha:]]\", \" \", x$text)\nx$text   \u003c- gsub(\"[[:space:]]+\", \" \", x$text)\nx$text   \u003c- trimws(x$text)\nx$nwords \u003c- txt_count(x$text, pattern = \" \")\nx        \u003c- subset(x, nwords \u003c 1000 \u0026 nchar(text) \u003e 0)\n```\n\n-  Build the model \n\n\n```r\n## Low-dimensional model using DM, low number of iterations, for speed and display purposes\nmodel \u003c- paragraph2vec(x = x, type = \"PV-DM\", dim = 5, iter = 3,  \n                       min_count = 5, lr = 0.05, threads = 1)\nstr(model)\n```\n\n```\n## List of 3\n##  $ model  :\u003cexternalptr\u003e \n##  $ data   :List of 4\n##   ..$ file        : chr 
\"C:\\\\Users\\\\Jan\\\\AppData\\\\Local\\\\Temp\\\\Rtmpk9Npjg\\\\textspace_1c446bffa0e.txt\"\n##   ..$ n           : num 170469\n##   ..$ n_vocabulary: num 3867\n##   ..$ n_docs      : num 1000\n##  $ control:List of 9\n##   ..$ min_count: int 5\n##   ..$ dim      : int 5\n##   ..$ window   : int 5\n##   ..$ iter     : int 3\n##   ..$ lr       : num 0.05\n##   ..$ skipgram : logi FALSE\n##   ..$ hs       : int 0\n##   ..$ negative : int 5\n##   ..$ sample   : num 0.001\n##  - attr(*, \"class\")= chr \"paragraph2vec_trained\"\n```\n\n\n```r\n## More realistic model\nmodel \u003c- paragraph2vec(x = x, type = \"PV-DBOW\", dim = 100, iter = 20, \n                       min_count = 5, lr = 0.05, threads = 4)\n```\n\n-  Get the embedding of the documents or words and get the vocabulary\n\n\n```r\nembedding \u003c- as.matrix(model, which = \"words\")\nembedding \u003c- as.matrix(model, which = \"docs\")\nvocab     \u003c- summary(model,   which = \"docs\")\nvocab     \u003c- summary(model,   which = \"words\")\n```\n\n-  Get the embedding of specific documents / words or sentences. 
\n\n\n```r\nsentences \u003c- list(\n  sent1 = c(\"geld\", \"diabetes\"),\n  sent2 = c(\"frankrijk\", \"koning\", \"proximus\"))\nembedding \u003c- predict(model, newdata = sentences,                     type = \"embedding\")\nembedding \u003c- predict(model, newdata = c(\"geld\", \"koning\"),           type = \"embedding\", which = \"words\")\nembedding \u003c- predict(model, newdata = c(\"doc_1\", \"doc_10\", \"doc_3\"), type = \"embedding\", which = \"docs\")\nncol(embedding)\n```\n\n```\n## [1] 100\n```\n\n```r\nembedding[, 1:4]\n```\n\n```\n##              [,1]        [,2]       [,3]        [,4]\n## doc_1  0.05721277 -0.10298843  0.1089350 -0.03075439\n## doc_10 0.09553983  0.05211980 -0.0513489 -0.11847925\n## doc_3  0.08008177 -0.03324692  0.1563442  0.06585038\n```\n\n-  Get similar documents or words when providing sentences, documents or words\n\n\n```r\nnn \u003c- predict(model, newdata = c(\"proximus\", \"koning\"), type = \"nearest\", which = \"word2word\", top_n = 5)\nnn\n```\n\n```\n## [[1]]\n##      term1              term2 similarity rank\n## 1 proximus telefoontoestellen  0.5357178    1\n## 2 proximus            belfius  0.5169221    2\n## 3 proximus                ceo  0.4839031    3\n## 4 proximus            klanten  0.4819543    4\n## 5 proximus               taal  0.4590944    5\n## \n## [[2]]\n##    term1          term2 similarity rank\n## 1 koning     ministerie  0.5615162    1\n## 2 koning verplaatsingen  0.5484987    2\n## 3 koning        familie  0.4911003    3\n## 4 koning       grondwet  0.4871097    4\n## 5 koning       gedragen  0.4694150    5\n```\n\n```r\nnn \u003c- predict(model, newdata = c(\"proximus\", \"koning\"), type = \"nearest\", which = \"word2doc\",  top_n = 5)\nnn\n```\n\n```\n## [[1]]\n##      term1   term2 similarity rank\n## 1 proximus doc_105  0.6684639    1\n## 2 proximus doc_863  0.5917463    2\n## 3 proximus doc_186  0.5233522    3\n## 4 proximus doc_620  0.4919243    4\n## 5 proximus doc_862  0.4619178    5\n## 
\n## [[2]]\n##    term1   term2 similarity rank\n## 1 koning  doc_44  0.6686417    1\n## 2 koning  doc_45  0.5616031    2\n## 3 koning doc_583  0.5379452    3\n## 4 koning doc_943  0.4855201    4\n## 5 koning doc_797  0.4573555    5\n```\n\n```r\nnn \u003c- predict(model, newdata = c(\"doc_198\", \"doc_285\"), type = \"nearest\", which = \"doc2doc\",   top_n = 5)\nnn\n```\n\n```\n## [[1]]\n##     term1   term2 similarity rank\n## 1 doc_198 doc_343  0.5522854    1\n## 2 doc_198 doc_899  0.4902798    2\n## 3 doc_198 doc_983  0.4847047    3\n## 4 doc_198 doc_642  0.4829021    4\n## 5 doc_198 doc_336  0.4674844    5\n## \n## [[2]]\n##     term1   term2 similarity rank\n## 1 doc_285 doc_319  0.5318567    1\n## 2 doc_285 doc_286  0.5100293    2\n## 3 doc_285 doc_113  0.5056069    3\n## 4 doc_285 doc_526  0.4840761    4\n## 5 doc_285 doc_488  0.4805686    5\n```\n\n```r\nsentences \u003c- list(\n  sent1 = c(\"geld\", \"frankrijk\"),\n  sent2 = c(\"proximus\", \"onderhandelen\"))\nnn \u003c- predict(model, newdata = sentences, type = \"nearest\", which = \"sent2doc\", top_n = 5)\nnn\n```\n\n```\n## $sent1\n##   term1   term2 similarity rank\n## 1 sent1 doc_742  0.4830917    1\n## 2 sent1 doc_151  0.4340138    2\n## 3 sent1 doc_825  0.4263285    3\n## 4 sent1 doc_740  0.4059283    4\n## 5 sent1 doc_776  0.4024554    5\n## \n## $sent2\n##   term1   term2 similarity rank\n## 1 sent2 doc_105  0.5497447    1\n## 2 sent2 doc_863  0.5061581    2\n## 3 sent2 doc_862  0.4973840    3\n## 4 sent2 doc_620  0.4793786    4\n## 5 sent2 doc_186  0.4755909    5\n```\n\n```r\nsentences \u003c- strsplit(setNames(x$text, x$doc_id), split = \" \")\nnn \u003c- predict(model, newdata = sentences, type = \"nearest\", which = \"sent2doc\", top_n = 5)\n```\n\n## Example on top2vec\n\n\nTop2vec clusters documents semantically and finds the most semantically relevant terms for each 
topic\n\n![](tools/example-viz.png)\n\n\n```r\nlibrary(doc2vec)\nlibrary(word2vec)\nlibrary(uwot)\nlibrary(dbscan)\ndata(be_parliament_2020, package = \"doc2vec\")\nx      \u003c- data.frame(doc_id = be_parliament_2020$doc_id,\n                     text   = be_parliament_2020$text_nl,\n                     stringsAsFactors = FALSE)\nx$text \u003c- txt_clean_word2vec(x$text)\nx      \u003c- subset(x, txt_count_words(text) \u003c 1000)\n\nd2v    \u003c- paragraph2vec(x, type = \"PV-DBOW\", dim = 50, \n                        lr = 0.05, iter = 10,\n                        window = 15, hs = TRUE, negative = 0,\n                        sample = 0.00001, min_count = 5, \n                        threads = 1)\nmodel  \u003c- top2vec(d2v, \n                  control.dbscan = list(minPts = 50), \n                  control.umap = list(n_neighbors = 15L, n_components = 3), umap = tumap, \n                  trace = TRUE)\ninfo   \u003c- summary(model, top_n = 7)\ninfo$topwords\n```\n\n## Note\n\nThe package has some hard limits, namely\n\n- Each document should contain fewer than 1000 words\n- Each word has a maximum length of 100 letters\n\n\n## Support in text mining\n\nNeed support in text mining?\nContact BNOSAC: http://www.bnosac.be\n\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbnosac%2Fdoc2vec","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbnosac%2Fdoc2vec","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbnosac%2Fdoc2vec/lists"}