{"id":19723120,"url":"https://github.com/bnosac/btm","last_synced_at":"2025-09-12T07:41:46.676Z","repository":{"id":39705927,"uuid":"160680962","full_name":"bnosac/BTM","owner":"bnosac","description":"Biterm Topic Modelling for Short Text with R","archived":false,"fork":false,"pushed_at":"2023-02-11T14:26:49.000Z","size":182,"stargazers_count":96,"open_issues_count":4,"forks_count":15,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-09-08T16:40:56.771Z","etag":null,"topics":["biterm-topic-modelling","natural-language-processing","r","topic-modeling"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bnosac.png","metadata":{"files":{"readme":"README.md","changelog":"NEWS.md","contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-12-06T13:47:19.000Z","updated_at":"2025-08-21T21:54:35.000Z","dependencies_parsed_at":"2023-02-15T02:35:36.231Z","dependency_job_id":null,"html_url":"https://github.com/bnosac/BTM","commit_stats":{"total_commits":54,"total_committers":2,"mean_commits":27.0,"dds":0.05555555555555558,"last_synced_commit":"205ae029a1ea275611a0d17b27c053f4c5ed8151"},"previous_names":[],"tags_count":8,"template":false,"template_full_name":null,"purl":"pkg:github/bnosac/BTM","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bnosac%2FBTM","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bnosac%2FBTM/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bnosac%2FBTM/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub
/repositories/bnosac%2FBTM/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bnosac","download_url":"https://codeload.github.com/bnosac/BTM/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bnosac%2FBTM/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":274777620,"owners_count":25347648,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-12T02:00:09.324Z","response_time":60,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["biterm-topic-modelling","natural-language-processing","r","topic-modeling"],"created_at":"2024-11-11T23:19:35.208Z","updated_at":"2025-09-12T07:41:46.625Z","avatar_url":"https://github.com/bnosac.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# BTM - Biterm Topic Modelling for Short Text with R\n\nThis is an R package wrapping the C++ code available at https://github.com/xiaohuiyan/BTM for constructing a **Biterm Topic Model (BTM)**. This model models word-word co-occurrences patterns (e.g., biterms). 
\n\n\u003e Topic modelling using biterms is particularly good for finding topics in short texts (such as short survey answers or Twitter data).\n\n### Installation\n\nThis R package is on CRAN; install it with `install.packages('BTM')`\n\n### What\n\nThe Biterm Topic Model (BTM) is a word co-occurrence based topic model that learns topics by modeling word-word co-occurrence patterns (e.g., biterms).\n\n- A biterm consists of two words co-occurring in the same context, for example, in the same short text window. \n- BTM models the biterm occurrences in a corpus (unlike LDA models, which model the word occurrences in a document). \n- It's a generative model. In the generation procedure, a biterm is generated by drawing two words independently from the same topic `z`. In other words, the distribution of a biterm `b=(wi,wj)` is defined as: `P(b) = sum_z{P(wi|z)*P(wj|z)*P(z)}` where the sum runs over the `k` topics you want to extract.\n- Estimation of the topic model is done with the Gibbs sampling algorithm, which provides estimates for `P(w|z)=phi` and `P(z)=theta`.\n\nMore detail can be found in the following paper:\n\n\u003e Xiaohui Yan, Jiafeng Guo, Yanyan Lan, Xueqi Cheng. A Biterm Topic Model For Short Text. 
WWW2013.\n\u003e https://github.com/xiaohuiyan/xiaohuiyan.github.io/blob/master/paper/BTM-WWW13.pdf\n\n\n![](tools/biterm-topic-model-example.png)\n\n\n### Example\n\n```\nlibrary(udpipe)\nlibrary(BTM)\ndata(\"brussels_reviews_anno\", package = \"udpipe\")\n\n## Taking only nouns of Dutch data\nx \u003c- subset(brussels_reviews_anno, language == \"nl\")\nx \u003c- subset(x, xpos %in% c(\"NN\", \"NNP\", \"NNS\"))\nx \u003c- x[, c(\"doc_id\", \"lemma\")]\n\n## Building the model\nset.seed(321)\nmodel  \u003c- BTM(x, k = 3, beta = 0.01, iter = 1000, trace = 100)\n\n## Inspect the model - topic frequency + conditional term probabilities\nmodel$theta\n[1] 0.3406998 0.2413721 0.4179281\n\ntopicterms \u003c- terms(model, top_n = 10)\ntopicterms\n[[1]]\n         token probability\n1  appartement  0.06168297\n2      brussel  0.04057012\n3        kamer  0.02372442\n4      centrum  0.01550855\n5      locatie  0.01547671\n6         stad  0.01229227\n7        buurt  0.01181460\n8     verblijf  0.01155985\n9         huis  0.01111402\n10         dag  0.01041345\n\n[[2]]\n         token probability\n1  appartement  0.05687312\n2      brussel  0.01888307\n3        buurt  0.01883812\n4        kamer  0.01465696\n5     verblijf  0.01339812\n6     badkamer  0.01285862\n7   slaapkamer  0.01276870\n8          dag  0.01213928\n9          bed  0.01195945\n10        raam  0.01164474\n\n[[3]]\n         token probability\n1  appartement 0.061804812\n2      brussel 0.035873377\n3      centrum 0.022193831\n4         huis 0.020091282\n5        buurt 0.019935537\n6     verblijf 0.018611710\n7     aanrader 0.014614272\n8        kamer 0.011447470\n9      locatie 0.010902365\n10      keuken 0.009448751\nscores \u003c- predict(model, newdata = x)\n```\n\n**Make a specific topic called the background**\n\n```\n# If you set background to TRUE\n# The first topic is set to a background topic that equals to the empirical word distribution. 
\n# This can be used to filter out common words.\nset.seed(321)\nmodel      \u003c- BTM(x, k = 5, beta = 0.01, background = TRUE, iter = 1000, trace = 100)\ntopicterms \u003c- terms(model, top_n = 5)\ntopicterms\n```\n\n### Visualisation of your model\n\n- Visualisation can be done using the textplot package (https://github.com/bnosac/textplot), which is also available on CRAN (https://cran.r-project.org/package=textplot) \n- An example visualisation built on a model of all R packages from the Natural Language Processing and Machine Learning task views is shown above (see also https://www.bnosac.be/index.php/blog/98-biterm-topic-modelling-for-short-texts)\n\n```\nlibrary(textplot)\nlibrary(ggraph)\nlibrary(concaveman)\nplot(model)\n```\n\n### Provide your own set of biterms\n\nAn interesting use case of this package is to \n\n- cluster based on part-of-speech tags like nouns and adjectives which are found in the text in the neighbourhood of one another\n- cluster dependency relationships provided by NLP tools like udpipe (https://CRAN.R-project.org/package=udpipe)\n\nThis can be done by providing your own set of biterms to cluster on. 
\n\n**Example clustering cooccurrences of nouns/adjectives**\n\n```\nlibrary(data.table)\nlibrary(udpipe)\n## Annotate text with parts of speech tags\ndata(\"brussels_reviews\", package = \"udpipe\")\nanno \u003c- subset(brussels_reviews, language %in% \"nl\")\nanno \u003c- data.frame(doc_id = anno$id, text = anno$feedback, stringsAsFactors = FALSE)\nanno \u003c- udpipe(anno, \"dutch\", trace = 10)\n\n## Get cooccurrences of nouns / adjectives and proper nouns\nbiterms \u003c- as.data.table(anno)\nbiterms \u003c- biterms[, cooccurrence(x = lemma, \n                                  relevant = upos %in% c(\"NOUN\", \"PROPN\", \"ADJ\"),\n                                  skipgram = 2), \n                   by = list(doc_id)]\n                   \n## Build the model\nset.seed(123456)\nx     \u003c- subset(anno, upos %in% c(\"NOUN\", \"PROPN\", \"ADJ\"))\nx     \u003c- x[, c(\"doc_id\", \"lemma\")]\nmodel \u003c- BTM(x, k = 5, beta = 0.01, iter = 2000, background = TRUE, \n             biterms = biterms, trace = 100)\ntopicterms \u003c- terms(model, top_n = 5)\ntopicterms\n```\n\n**Example clustering dependency relationships**\n\n```\nlibrary(udpipe)\nlibrary(tm)\nlibrary(data.table)\ndata(\"brussels_reviews\", package = \"udpipe\")\nexclude \u003c- stopwords(\"nl\")\n\n## Do annotation on Dutch text\nanno \u003c- subset(brussels_reviews, language %in% \"nl\")\nanno \u003c- data.frame(doc_id = anno$id, text = anno$feedback, stringsAsFactors = FALSE)\nanno \u003c- udpipe(anno, \"dutch\", trace = 10)\nanno \u003c- setDT(anno)\nanno \u003c- merge(anno, anno, \n              by.x = c(\"doc_id\", \"paragraph_id\", \"sentence_id\", \"head_token_id\"), \n              by.y = c(\"doc_id\", \"paragraph_id\", \"sentence_id\", \"token_id\"), \n              all.x = TRUE, all.y = FALSE, suffixes = c(\"\", \"_parent\"), sort = FALSE)\n\n## Specify a set of relationships you are interested in (e.g. 
objects of a verb)\nanno$relevant \u003c- anno$dep_rel %in% c(\"obj\") \u0026 !is.na(anno$lemma_parent)\nbiterms \u003c- subset(anno, relevant == TRUE)\nbiterms \u003c- data.frame(doc_id = biterms$doc_id, \n                      term1 = biterms$lemma, \n                      term2 = biterms$lemma_parent,\n                      cooc = 1, \n                      stringsAsFactors = FALSE)\nbiterms \u003c- subset(biterms, !term1 %in% exclude \u0026 !term2 %in% exclude)\n\n## Keep in x only the terms which were used in the biterms object, so that frequency stats of terms can be computed in BTM\nanno \u003c- anno[, keep := relevant | (token_id %in% head_token_id[relevant == TRUE]), by = list(doc_id, paragraph_id, sentence_id)]\nx    \u003c- subset(anno, keep == TRUE, select = c(\"doc_id\", \"lemma\"))\nx    \u003c- subset(x, !lemma %in% exclude)\n\n## Build the topic model\nmodel \u003c- BTM(data = x, \n             biterms = biterms, \n             k = 6, iter = 2000, background = FALSE, trace = 100)\ntopicterms \u003c- terms(model, top_n = 5)\ntopicterms\n```\n\n\n## Support in text mining\n\nNeed support in text mining?\nContact BNOSAC: http://www.bnosac.be\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbnosac%2Fbtm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbnosac%2Fbtm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbnosac%2Fbtm/lists"}