{"id":15288173,"url":"https://github.com/christoph/robics","last_synced_at":"2025-04-13T07:34:31.895Z","repository":{"id":57462265,"uuid":"242975040","full_name":"Christoph/robics","owner":"Christoph","description":"Automatic detection of robust parametrizations for LDA and NMF. Compatible with scikit-learn and gensim.","archived":false,"fork":false,"pushed_at":"2020-11-13T11:29:20.000Z","size":129,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-02-18T09:39:26.368Z","etag":null,"topics":["gensim","lda","natural-language-processing","nmf","robust-parametrizations","scikit-learn","topic-modeling","topic-models"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Christoph.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-02-25T10:42:01.000Z","updated_at":"2021-01-23T06:02:37.000Z","dependencies_parsed_at":"2022-09-10T19:10:31.018Z","dependency_job_id":null,"html_url":"https://github.com/Christoph/robics","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Christoph%2Frobics","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Christoph%2Frobics/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Christoph%2Frobics/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Christoph%2Frobics/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Christoph","download_url":"https://codeload.github.com/Christoph/robics/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240260762,"owners_count":19773378,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["gensim","lda","natural-language-processing","nmf","robust-parametrizations","scikit-learn","topic-modeling","topic-models"],"created_at":"2024-09-30T15:44:33.211Z","updated_at":"2025-02-23T02:30:41.460Z","avatar_url":"https://github.com/Christoph.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# robics\n**rob**ustTop**ics** is a library targeted at **non-machine learning experts** interested in building robust\ntopic models. The main goal is to provide a simple to use framework to check if\na topic model reaches each run the same or at least a similar result.\n\n## Features\n- Supports sklearn (LatentDirichletAllocation, NMF) and gensim (LdaModel, ldamulticore, nmf) topic models\n- Creates samples based on the [sobol sequence](https://en.wikipedia.org/wiki/Sobol_sequence) which requires less samples than grid-search and makes sure the whole parameter space is used which is not sure in random-sampling.\n- Simple topic matching between the different re-initializations for each sample using word vector based coherence scores.\n- Ranking of all models based on four metrics:\n  - [Jaccard distance](https://en.wikipedia.org/wiki/Jaccard_index) of the top n words for each topic\n  - Similarity of topic distributions based on the [Jensen Shannon Divergence](https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence)\n  - Ranking correlation of the top n words based on [Kendall's Tau](https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient)\n  - Word vector based coherence score (simple version of the TC-W2V)\n- Word based analysis of samples and topic model instances.\n\n## Install\n- **Python Version:** 3.5+\n- **Package Managers:** pip\n\n### pip\nUsing pip, robics releases are available as source packages and binary wheels:\n```\npip install robics\n```\n\n## Examples\nTest dataset from sklearn\n```python\nfrom sklearn.datasets import fetch_20newsgroups\n\n# PREPROCESSING\ndataset = fetch_20newsgroups(\n    shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))\n\n# Only 1000 dokuments for performance reasons\ndocuments = dataset.data[:1000]\n```\n\nLoad word vectors used for coherence computation\n```python\nimport spacy\n\nnlp = spacy.load(\"en_core_web_md\")\n```\n\n\nDetect robust sklearn topic models\n```python\nfrom sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer\nfrom sklearn.decomposition import LatentDirichletAllocation, NMF\nfrom robics import RobustTopics\n\n# Document vectorization using TFIDF\ntf_vectorizer = CountVectorizer(\n    max_df=0.95, min_df=2, stop_words='english')\ntf = tf_vectorizer.fit_transform(documents)\ntf_feature_names = tf_vectorizer.get_feature_names()\n\n\n# TOPIC MODELLING\nrobustTopics = RobustTopics(nlp)\n\n# Load  NMF and LDA models\nrobustTopics.load_sklearn_model(\n    LatentDirichletAllocation, tf, tf_vectorizer, dimension_range=[5, 50], n_samples=4, n_initializations=3)\nrobustTopics.load_sklearn_model(NMF, tf, tf_vectorizer, dimension_range=[5, 50], n_samples=4, n_initializations=3)\n\n# Fit the models - Warning, this might take a lot of time based on the number of samples (n_models*n_sample*n_initializations)\nrobustTopics.fit_models()\n\n```\n\nAnalysis\nWe start by looking at the ranking of all models\n\n```python\nrobustTopics.rank_models()\n\n    # Output:\n    [{'model': sklearn.decomposition._nmf.NMF,\n    'model_id': 1,\n    'sample_id': 0,\n    'n_topics': 27,\n    'params': {'n_components': 27},\n    'topic_coherence': 0.3184709231908624,\n    'jaccard': 0.9976484420928865,\n    'kendalltau': 0.9987839560655095,\n    'jensenshannon': 0.9999348301145501},\n    {'model': sklearn.decomposition._nmf.NMF,\n    'model_id': 1,\n    'sample_id': 3,\n    'n_topics': 21,\n    'params': {'n_components': 21},\n    'topic_coherence': 0.31157823571739196,\n    'jaccard': 1.0,\n    'kendalltau': 1.0,\n    'jensenshannon': 0.9999784246484099},\n    ...\n     {'model': sklearn.decomposition._lda.LatentDirichletAllocation,\n    'model_id': 0,\n    'sample_id': 1,\n    'n_topics': 38,\n    'params': {'n_components': 38},\n    'topic_coherence': 0.30808284548733383,\n    'jaccard': 0.07074176815158277,\n    'kendalltau': 0.1023597955928783,\n    'jensenshannon': 0.22596536037699871}]\n\n```\n\nThe top model is the NMF model with 27 topics (model_id 1 and sample_id 0). The next step is to look at the model on a word level.\n\n```python\nrobustTopics.display_sample_topic_consistency(model_id=1, sample_id=0)\n\n    # Output:\n    Words in 3 out of 3 runs:\n    ['edu', 'graphics', 'pub', 'mail', 'ray', '128', 'send', '3d', 'ftp', 'com', 'objects', 'server', 'amiga', 'image', 'archie', 'files', 'file', 'images', 'archive', 'package', 'section', 'firearm', 'weapon', 'military', 'license', 'shall', 'dangerous', 'person', 'division', 'application', 'means', 'device', 'use', 'following', 'issued', 'state', 'act',...\n    Words in 2 out of 3 runs:\n    ['know']\n\n    Words in 1 out of 3 runs:\n    ['goals']\n\n```\n\nMost of the words are in all of the three re-initializations. Only 'know' and 'goals' are inconsistent.\n\nIn comparison lets look at the model performing worst:\n```python\nrobustTopics.display_sample_topic_consistency(model_id=0, sample_id=1)\n\n    # Output:\n    Words in 3 out of 3 runs:\n    ['know', 'know', 'people', 'use', 'drive', 'think', 'just', 'like', 'just', 'people', 'just', 'know', 'think', 'good', 'think', 'like', 'people', 'know', 'don', 'think', 'like', 'just', 'don', 'know', 'just', 'don', 'think', 'like', 'windows', 'like', 'don']\n\n    Words in 2 out of 3 runs:\n    ['just', 'think', 'people', 'dc', 'just', 'like', 'want', 'said', 'use', 'said', 'local', 'say', 'god', 'shall', 'rights', 'know', 'like', 'window', 'like', 'just', 'time', 'new', 'like', 'just', 'good', 'don', 'used', 'does', 'think', 'like', 'new', 'state', 'like', 'contact', 'know', 'bike', 'just', 'like', 'year', 'data', 'use', 'way', 've', 'people', 'don', 'know', 'didn', 'years', 'little', 'rocket', 'like', 'generation', 'build', 'll', 'max', 'ssrt', 'just', 'time', 'using', 'edu', 'ftp', 'file', 'available', 'server', '10', 'good', 'new', 'know', 'people', 'way', 'know', 'good', 'don', 'just', 'like', 'time', 'like', 'don', 'insurance', 'year', 'time', 'car', 'years', 'people', 'say', 'new', 'new', 'need', 'just', 'think', 'like', 'good', 'just', 'don', 'work', 'time', 'government', 'rights', 'make', 'people', 'edu', 'com', 'graphics', 'time', 'good', 'people', 'need', 'like', 'don', 'know', 'just', 'people', 'server', 'edu', 'file', 'video', 'support', 'mit', 'ftp', 'linux', 'time', 'binaries', 'information', 'available', 'new', 'greek', '10', 'just', 'just', 'does', 'know', 'use', 'need', 'like', 'don', 'use', 'know', 'think', 'like', 'new', 'like', 'just', 'edu', 've', 'does', 'problem', 'think', 'just', 'way', 'power', 'wrong', 'edu', 'points', 'point', 'just', 'hp', 'don', 'good', 'people', 'phone', 'food', 'just', 'bit', 'good', 'know', 'just', 'like', 'use', 'people', 'used', 'going', 'people', 'know', 'time', 'say', 'use', 'drive', '10', 'like', 'observations', 'think', 'don', 'want', 'thing', 'know', 'll', 'good', 'like', 'mm', 'used', 'say']\n\n    Words in 1 out of 3 runs:\n    ['jews', 'israel', 'true', 'state', 'year',\n    ...\n```\nThe worst model has most words in the 1 out of 3 runs section and only filler words are consistent between the runs.\n\"know\" appears five times which means, it is in five different topics in all three runs a top word.\n\nLet us look now at the topic in the top and bottom models.\n```python\n# Top model\nrobustTopics.display_sample_topics(1, 0)\n\n    # Output\n    Topic 0\n    In 3 runs: edu graphics pub mail ray 128 send 3d ftp com objects server amiga image archie files file images archive package\n    Topic 1\n    In 3 runs: section firearm weapon military license shall dangerous person division application means device use following issued state act designed code automatic\n    Topic 2\n    In 3 runs: aids health care said children medical infected new patients disease 1993 10 information research april national study trials service number\n    Topic 3\n    In 3 runs: god good people just suppose brothers makes fisher like does jews joseph did worship judaism right instead jesus end read\n    Topic 4\n    In 3 runs: server support edu file 386bsd ftp mit binaries vga information supported linux svga readme available os new faq files video\n    Topic 5\n    In 3 runs: edu com mil navy cs vote misc votes health ca hp nrl gov email cc creation au john thomas uk\n    Topic 6\n    In 3 runs: probe space titan earth orbiter launch mission jupiter orbit atmosphere 93 saturn gravity 10 surface satellite ray 12 possible 97\n\n# Bottom model\nrobustTopics.display_sample_topics(0, 1)\n\n    # Output\n    Topic 0\n    In 3 runs: know\n    Topic 1\n    In 3 runs: know\n    Topic 2\n    Topic 3\n    Topic 4\n    In 3 runs: people\n    Topic 5\n    Topic 6\n    In 3 runs: use\n    Topic 7\n    Topic 8\n    In 3 runs: drive think just\n```\nThe top model produces over all runs consistent topic with words that are connected.\nIn strong contrast to the bottom model, which has very little consistent words and these are provide little information.\n\nWe want to take a look at the topics of one of the instances from the top model (model 1, sample 0, instance 0).\n\n```python\nrobustTopics.display_run_topics(1, 0, 0)\n\n    # Output\n    Topic 0:\n    edu graphics pub mail ray 128 send 3d ftp com objects server amiga image archie files file images archive package\n    Topic 1:\n    section firearm weapon military license shall dangerous person division application means device use following issued state act designed code automatic\n    Topic 2:\n    aids health care said children medical infected new patients disease 1993 10 information research april national study trials service number\n    Topic 3:\n    god good people just suppose brothers makes fisher like does jews joseph did worship judaism right instead jesus end read\n    Topic 4:\n    server support edu file 386bsd ftp mit binaries vga information supported linux svga readme available os new faq files video\n    Topic 5:\n    edu com mil navy cs vote misc votes health ca hp nrl gov email cc creation au john thomas uk\n    Topic 6:\n    probe space titan earth orbiter launch mission jupiter orbit atmosphere 93 saturn gravity 10 surface satellite ray 12 possible 97\n```\n\nIf the output is meaningful we can now use this model for further use\nor export more indept information for additional analysis.\n\n```python\n# Use the results in your own workflow\n# Export specific model based on the ranked models and the analysis (model_id, sample_id, run_id)\nsklearn_model = robustTopics.export_model(1,0,0)\n\n# Look at the full reports inclusing separate values for each initialization\nrobustTopics.models[1].report_full\n\n# Convert the full report to a pandas dataframe for further use or export\nimport pandas as pd\nreport = pd.DataFrame.from_records(robustTopics.models[model_id].report) # or report_full\n```\n\nGensim topic models are also supported. Following is a example how to setup a\nsimple pipeline. The analysis steps are exactly the same as above.\n```python\nfrom gensim.models import LdaModel, nmf, ldamulticore\nfrom gensim.utils import simple_preprocess\nfrom gensim import corpora\nfrom robics import RobustTopics\n\ndef docs_to_words(docs):\n    for doc in docs:\n        yield(simple_preprocess(str(doc), deacc=True))\n\ntokenized_data = list(docs_to_words(documents))\ndictionary = corpora.Dictionary(tokenized_data)\ncorpus = [dictionary.doc2bow(text) for text in tokenized_data]\n\n# TOPIC MODELLING\nrobustTopics = RobustTopics(nlp)\n\n# Load 4 different models\nrobustTopics.load_gensim_model(\n    ldamulticore.LdaModel, corpus, dictionary, dimension_range=[5, 50], n_samples=4, n_initializations=3)\nrobustTopics.load_gensim_model(\n    nmf.Nmf, corpus, dictionary, dimension_range=[5, 50], n_samples=4, n_initializations=3)\n\nrobustTopics.fit_models()\n\n# Same analysis steps as in the sklearn example\n```\n\n## Next Steps\n- Visual interface\n- Adding support for more modells if required.\n- Add logging\n- Writing unit tests.\n- Improving the overall performance.\n- Implementing the Cv coherence measure from this [paper](https://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf)\n\n## Contribution\nI am happy to receive help in any of the things mentioned above or other interesting feature request.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchristoph%2Frobics","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fchristoph%2Frobics","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchristoph%2Frobics/lists"}