{"id":13857341,"url":"https://github.com/contefranz/OpTop","last_synced_at":"2025-07-13T21:32:23.319Z","repository":{"id":221611922,"uuid":"138142794","full_name":"contefranz/OpTop","owner":"contefranz","description":"Optimal topic identification from a pool of Latent Dirichlet Allocation models","archived":false,"fork":false,"pushed_at":"2022-02-10T08:44:52.000Z","size":335,"stargazers_count":12,"open_issues_count":2,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-11-22T01:04:33.379Z","etag":null,"topics":["latent-dirichlet-allocation","lda","model-selection","natural-language-processing","nlp","text-mining","topic-modeling"],"latest_commit_sha":null,"homepage":"","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/contefranz.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-06-21T08:34:59.000Z","updated_at":"2024-08-20T10:09:05.000Z","dependencies_parsed_at":null,"dependency_job_id":"de0aac10-1287-4b6f-8169-eebee45c4170","html_url":"https://github.com/contefranz/OpTop","commit_stats":null,"previous_names":["contefranz/optop"],"tags_count":13,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/contefranz%2FOpTop","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/contefranz%2FOpTop/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/contefranz%2FOpTop/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/contefranz%2FOpTop/manifests","own
er_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/contefranz","download_url":"https://codeload.github.com/contefranz/OpTop/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":225920277,"owners_count":17545463,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["latent-dirichlet-allocation","lda","model-selection","natural-language-processing","nlp","text-mining","topic-modeling"],"created_at":"2024-08-05T03:01:33.825Z","updated_at":"2024-11-22T15:30:30.693Z","avatar_url":"https://github.com/contefranz.png","language":"R","funding_links":[],"categories":["R"],"sub_categories":[],"readme":"[![lifecycle](https://lifecycle.r-lib.org/articles/figures/lifecycle-experimental.svg)](https://www.tidyverse.org/lifecycle/#maturing)\n[![release](https://img.shields.io/badge/release-v0.9.5-blue.svg)](https://github.com/contefranz/OpTop/releases/tag/0.9.5)\n[![license](https://img.shields.io/badge/license-GPL--3-blue.svg)](https://en.wikipedia.org/wiki/GNU_General_Public_License)\n[![DOI](https://zenodo.org/badge/138142794.svg)](https://zenodo.org/badge/latestdoi/138142794)\n\n# OpTop: detect the optimal number of topics from a pool of LDA models\n\n## Overview\n\n__OpTop__ is an `R` package that implements the testing approach described in \nthe paper _A Statistical Approach for Optimal Topic Model Identification_ \nby Lewis and Grossetti (2019). 
\n\nLatent Dirichlet Allocation (LDA) was developed by Blei, Ng, and Jordan in \n2003 [Blei et al., (2003)] and is based on the idea that a corpus can be \nrepresented by a set of topics. LDA has been used extensively in computational \nlinguistics, is replicable, and is automated so it cannot be influenced by \nresearcher prejudice. LDA uses a likelihood approach to discover clusters of \ntext, namely topics that frequently appear in a corpus.\n\nOne of the open challenges in topic modeling is to rigorously determine the \noptimal number of topics for a corpus. Extant research relies on heuristic \napproaches such as iterative trial-and-error procedures to select the number \nof topics. For example, a standard approach is to determine which specification \nis the least perplexed by the test sets. Perplexity is based on the intuition \nthat a high degree of similarity, identified as a low level of perplexity, can \nbe used to determine the appropriate number of topics [Blei et al., (2003); \nHornik and Grün, (2011)].\n\n__OpTop__ introduces a set of parametric tests to identify the optimal number of topics from a \ncollection of LDA models. OpTop also includes several tests to explore topic stability and redundancy.\n\n\n## Installation\n\nThe package is not on CRAN yet. 
You can install the development version as follows:\n``` r\n# Install the development version from GitHub:\ndevtools::install_github(\"contefranz/OpTop\")\n```\n\n## Functions\n\nAll the procedures described in the paper will be implemented in this package.\nThe package is in beta and contains the following functions, most of whose internals \nare written in `C++` and `C` for performance.\n\n* `get_topic_models()`: helper that retrieves, from a specified environment, the list of topic models\nthe user wants to process.\n\n* `optimal_topic()`: implements _Test 1_ of optimality from the methodological \npaper [Lewis and Grossetti (2019)].\n\n* `topic_stability()`: implements _Test 2_ of topic stability from the \nmethodological paper [Lewis and Grossetti (2019)].\n\n* `agg_topic_stability()`: implements _Test 3_ of aggregate topic stability \nfrom the methodological paper [Lewis and Grossetti (2019)].\n\n* `agg_document_stability()`: implements _Test 4_ of overall topic stability and\n_Test 5_ of relative topic importance from the methodological paper \n[Lewis and Grossetti (2019)].\n\n* `sim_dfm()`: convenience function that simulates a **quanteda** `dfm` object from a given \nLDA model of class `LDA_VEM` from **topicmodels**.\n\n## Bug Reporting\n\nBugs and issues can be reported at\n[https://github.com/contefranz/OpTop/issues](https://github.com/contefranz/OpTop/issues).\n\n## Authors\n\n* [Francesco Grossetti](http://faculty.unibocconi.eu/francescogiovannigrossetti/) \n\n  Assistant Professor of Data Science and Accounting Information Systems  \n  Bocconi Institute for Data Science and Analytics ([BIDSA](https://www.bidsa.unibocconi.eu/wps/wcm/connect/Site/Bidsa/Home))  \n  Accounting Department, Bocconi University.  \n  Contact Francesco at: francesco.grossetti@unibocconi.it.  \n\n* [Craig M. Lewis](https://business.vanderbilt.edu/bio/craig-lewis/)\n\n  Madison S. Wigginton Professor of Finance  \n  Owen Business School, Vanderbilt University. 
 \n  Contact Craig at: craig.lewis@owen.vanderbilt.edu.  \n\n## Bibliography\n\n1. Lewis, C. and Grossetti, F. (2022): _A Statistical Approach\nfor Optimal Topic Model Identification_ (forthcoming in the Journal of Machine Learning Research)\n2. Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). _Latent Dirichlet Allocation_.\nJournal of Machine Learning Research, 3(Jan):993–1022.\n3. Benoit K., Watanabe K., Wang H., Nulty P., Obeng A., Müller S., Matsuo A.\n(2018): _`quanteda`: An R package for the\nquantitative analysis of textual data_. Journal of Open Source Software, 3(30), 774. doi: 10.21105/joss.00774\n(URL: http://doi.org/10.21105/joss.00774), URL: https://quanteda.io\n\n***\n  \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcontefranz%2FOpTop","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcontefranz%2FOpTop","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcontefranz%2FOpTop/lists"}