{"id":13696247,"url":"https://github.com/machine-intelligence-laboratory/TopicNet","last_synced_at":"2025-05-03T16:33:03.239Z","repository":{"id":35109685,"uuid":"206595209","full_name":"machine-intelligence-laboratory/TopicNet","owner":"machine-intelligence-laboratory","description":"Interface for easier topic modelling.","archived":false,"fork":false,"pushed_at":"2024-07-29T08:54:27.000Z","size":11013,"stargazers_count":138,"open_issues_count":30,"forks_count":17,"subscribers_count":10,"default_branch":"master","last_synced_at":"2025-04-12T01:17:58.811Z","etag":null,"topics":["bigartm-library","custom-score","document-representation","modalities","multimodal-data","multimodal-learning","pypi","topic-modeling","topic-modelling"],"latest_commit_sha":null,"homepage":"https://machine-intelligence-laboratory.github.io/TopicNet","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/machine-intelligence-laboratory.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-09-05T15:20:12.000Z","updated_at":"2025-01-09T03:35:38.000Z","dependencies_parsed_at":"2024-04-08T02:53:07.504Z","dependency_job_id":"881a8110-ee98-477d-9bfd-d6ed5360e953","html_url":"https://github.com/machine-intelligence-laboratory/TopicNet","commit_stats":{"total_commits":198,"total_committers":10,"mean_commits":19.8,"dds":0.6565656565656566,"last_synced_commit":"88963c16c65b90789739419ec1697843c9a97129"},"previous_names":[],"tags_count":11,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposit
ories/machine-intelligence-laboratory%2FTopicNet","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/machine-intelligence-laboratory%2FTopicNet/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/machine-intelligence-laboratory%2FTopicNet/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/machine-intelligence-laboratory%2FTopicNet/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/machine-intelligence-laboratory","download_url":"https://codeload.github.com/machine-intelligence-laboratory/TopicNet/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252216133,"owners_count":21713103,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigartm-library","custom-score","document-representation","modalities","multimodal-data","multimodal-learning","pypi","topic-modeling","topic-modelling"],"created_at":"2024-08-02T18:00:37.668Z","updated_at":"2025-05-03T16:32:58.221Z","avatar_url":"https://github.com/machine-intelligence-laboratory.png","language":"Python","funding_links":[],"categories":["Libraries \u0026 Toolkits","Python"],"sub_categories":[],"readme":"\u003ch1 align=\"center\"\u003eTopicNet\u003c/h1\u003e\n\u003cimg align=\"right\" height=\"15%\" width=\"15%\" src=\"https://avatars3.githubusercontent.com/u/49844788?s=200\u0026v=4\" style=\"max-width:100%;\"\u003e\n\n\u003cdiv align=\"center\"\u003e\n    \u003ca href=\"https://pypi.org/project/topicnet\"\u003e\n        \u003cimg alt=\"PyPI Version\" 
src=\"https://img.shields.io/pypi/v/topicnet?color=blue\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://www.python.org/downloads/\"\u003e\n        \u003cimg alt=\"Python Version\" src=\"https://img.shields.io/pypi/pyversions/TopicNet\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://app.travis-ci.com/machine-intelligence-laboratory/TopicNet\"\u003e\n        \u003cimg alt=\"Travis Build Status\" src=\"https://api.travis-ci.com/machine-intelligence-laboratory/TopicNet.svg?branch=master\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://codecov.io/gh/machine-intelligence-laboratory/TopicNet\"\u003e\n        \u003cimg alt=\"Code Coverage\" src=\"https://codecov.io/gh/machine-intelligence-laboratory/TopicNet/branch/master/graph/badge.svg\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://github.com/machine-intelligence-laboratory/TopicNet/blob/master/LICENSE.txt\"\u003e\n        \u003cimg alt=\"License\" src=\"https://img.shields.io/pypi/l/TopicNet?color=Black\"\u003e\n    \u003c/a\u003e\n\u003c/div\u003e\n\n\u003cdiv align=\"center\"\u003e\n    A high-level interface developed by \u003ca href=\"http://machine-intelligence.ru/en\"\u003eMachine Intelligence Laboratory\u003c/a\u003e for the \u003ca href=\"https://github.com/bigartm/bigartm\"\u003eBigARTM\u003c/a\u003e library.\n\u003c/div\u003e\n\n\n## What is TopicNet\n\nThe `TopicNet` library was created to assist in the task of building topic models.\nIt aims to automate the model training routine, freeing more time for the artistic process of constructing a target functional for the task at hand.\n\nConsider using TopicNet if:\n\n* you want to explore BigARTM functionality without writing boilerplate code;\n* you need help with rapid solution prototyping;\n* you want to build a good topic model quickly (out of the box, with default parameters);\n* you have an ARTM model at hand and you want to explore its topics.\n\n`TopicNet` provides an infrastructure for your prototyping with the help of the `Experiment` class and 
helps to observe results of your actions via [`viewers`](topicnet/viewers) module.\n\n\u003cp\u003e\n    \u003cdiv align=\"center\"\u003e\n        \u003cimg src=\"./docs/readme_images/training_scheme_example.png\" width=\"50%\" alt/\u003e\n    \u003c/div\u003e\n    \u003cem\u003e\n        Example of the two-stage experiment scheme.\n        At the first stage, regularizer with parameter \u003cimg src=\"./docs/readme_images/tau.svg\"\u003e taking values in some range \u003cimg src=\"./docs/readme_images/tau1-tau2-tau3.svg\"\u003e is applied.\n        Best models after the first stage are \u003cem\u003eModel 1\u003c/em\u003e and \u003cem\u003eModel 2\u003c/em\u003e — so \u003cem\u003eModel 3\u003c/em\u003e is not taking part in the training process anymore.\n        The second stage is connected with another regularizer with parameter \u003cimg src=\"./docs/readme_images/xi.svg\"\u003e taking values in range \u003cimg src=\"./docs/readme_images/xi1-xi2.svg\"\u003e.\n        As a result of this stage, two descendant models of \u003cem\u003eModel 1\u003c/em\u003e and two descendant models of \u003cem\u003eModel 2\u003c/em\u003e are obtained.\n    \u003c/em\u003e\n\u003c/p\u003e\n\nAnd here is sample code of the TopicNet baseline experiment:\n\n```python\nfrom topicnet.cooking_machine.config_parser import build_experiment_environment_from_yaml_config\nfrom topicnet.cooking_machine.recipes import ARTM_baseline as config_string\n\n\nconfig_string = config_string.format(\n    dataset_path      = '/data/datasets/NIPS/dataset.csv',\n    modality_list     = ['@word'],\n    main_modality     = '@word',\n    specific_topics   = [f'spc_topic_{i}' for i in range(19)],\n    background_topics = [f'bcg_topic_{i}' for i in range( 1)],\n)\nexperiment, dataset = (\n    build_experiment_environment_from_yaml_config(\n        yaml_string   = config_string,\n        experiment_id = 'sample_config',\n        save_path     = 'sample_save_folder_path',\n    
)\n)\n\nexperiment.run(dataset)\n\nbest_model = experiment.select('PerplexityScore@all -\u003e min')[0]\n```\n\n\n## How to Start\n\nDefine a `TopicModel` from an ARTM model at hand or with the help of the `model_constructor` module, where you can set the model's main parameters. Then create an `Experiment`, assigning this model the root position and specifying a path to store your experiment. After that, you can define a set of training stages using the functionality provided by the `cooking_machine.cubes` module.\n\nFor more details, see the documentation [here](https://machine-intelligence-laboratory.github.io/TopicNet/).\n\nIf you want to get familiar with BigARTM (which is not necessary, but generally useful), we recommend the [video tutorial](https://youtu.be/AIN00vWOJGw) by [Murat Apishev](https://github.com/MelLain).\nThe tutorial is in Russian, but it comes with a [Colab Notebook](https://colab.research.google.com/drive/13oUI1yxZHdQWUfmMpFY4KVlkyWzAkoky).\n\n\n## Installation\n\n**Core library functionality is based on the BigARTM library**.\nSo BigARTM should also be installed on the machine.\nFortunately, the installation process is no longer difficult.\nDetailed instructions follow.\n\n\n### Via Pip\n\nThe easiest way to install everything is via `pip` (currently this works reliably only on Linux!):\n\n```bash\npip install topicnet\n```\n\nThe command installs the BigARTM library as well as TopicNet.\nHowever, the [BigARTM Command Line Utility](https://bigartm.readthedocs.io/en/stable/tutorials/bigartm_cli.html) will not be built.\nA pip installation makes BigARTM available only through its Python interface.\n\nIf you are working on Windows or Mac, install BigARTM yourself first; after that, `pip install topicnet` will work just fine.\nWe hope to bring all-in-`pip` installation support to these systems.\nFor now, you may find the following guide useful.\n\n### BigARTM for Non-Linux Users\n\nTo avoid installing BigARTM you can use [docker 
images](https://hub.docker.com/r/xtonev/bigartm/tags) with different versions of the BigARTM library preinstalled:\n\n```bash\ndocker pull xtonev/bigartm:v0.10.0\ndocker run -t -i xtonev/bigartm:v0.10.0\n```\n\nTo check that everything installed successfully:\n\n```bash\n$ python\n\n\u003e\u003e\u003e import artm\n\u003e\u003e\u003e artm.version()\n```\n\nAlternatively, you can follow the [BigARTM installation manual](https://bigartm.readthedocs.io/en/stable/installation/index.html).\nHere are a few tips that may provide additional help for Windows users:\n\n1. Go to the [installation page for Windows](http://docs.bigartm.org/en/stable/installation/windows.html) and download the 7z archive in the Downloads section.\n2. Use Anaconda `conda install` to download all the Python packages that BigARTM requires.\n3. Path variables must be set through the GUI window of system variables; if the variable `PYTHONPATH` is missing, add it to the **system-wide** variables. Close the GUI window.\n\nAfter setting up the environment, you can fork this repository or use `pip install topicnet` to install the library.\n\n\n### From Source\n\nOne can also install the library from GitHub, which may give more flexibility in development (for example, making one's own viewers or regularizers a part of the module as .py files):\n\n```bash\ngit clone https://github.com/machine-intelligence-laboratory/TopicNet.git\ncd TopicNet\npip install .\n```\n\n### Google Colab \u0026 Kaggle Notebooks\n\nSince installation on Linux can be done with `pip` alone, TopicNet can be used in online services such as\n[Google Colab](https://colab.research.google.com) and\n[Kaggle Notebooks](https://www.kaggle.com/kernels).\nAll you need to do is run the following command in a notebook cell:\n\n```bash\n! 
pip install topicnet\n```\n\nThere is also a [notebook in Google Colab](https://colab.research.google.com/drive/1Tr1ZO03iPufj11HtIH3JjaWWU1Wyxkzv) made by [Nikolay Gerasimenko](https://github.com/Nikolay-Gerasimenko), where BigARTM is built from source.\nThis may be useful, for example, if you plan to use the BigARTM Command Line Utility.\n\n\n# Usage\n\nLet's say you have a handful of raw texts mined from some source and you want to perform topic modelling on them.\nWhere should you start?\n\n## Data Preparation\n\nEvery ML problem starts with a data preprocessing step.\nTopicNet does not perform data preprocessing itself.\nInstead, it expects the data to be prepared by the user and loaded via the [Dataset](topicnet/cooking_machine/dataset.py) class.\nHere is a basic example of how one can achieve that: [rtl_wiki_preprocessing](topicnet/demos/RTL-Wiki-Preprocessing.ipynb).\n\nFor the convenience of everyone who wants to use TopicNet, and of everyone interested in topic modeling in general, we provide a couple of already preprocessed datasets (see the [DemoDataset.ipynb](topicnet/dataset_manager/DemoDataset.ipynb) notebook for more information).\nThese datasets can be downloaded from code.\nFor example:\n\n```python\nfrom topicnet.dataset_manager import api\n\n\ndataset = api.load_dataset('postnauka')\n```\n\nOr, in case the API is not working, you can go to [TopicNet's page on Hugging Face](https://huggingface.co/TopicNet) and get the needed .csv files there.\n\n\n## Training a Topic Model\n\nHere we can finally get to the main part: making your own, best of them all, manually crafted topic model.\n\n### Get Your Data\n\nWe need to load our previously prepared data with `Dataset`:\n\n```python\nfrom topicnet.cooking_machine.dataset import Dataset\n\n\nDATASET_PATH = '/Wiki_raw_set/wiki_data.csv'\n\ndataset = Dataset(DATASET_PATH)\n```\n\n### Make an Initial Model\n\nIf you want to start from a fresh model, we suggest using this code:\n\n```python\nfrom topicnet.cooking_machine.model_constructor import 
init_simple_default_model\n\n\nartm_model = init_simple_default_model(\n    dataset=dataset,\n    modalities_to_use={'@lemmatized': 1.0, '@bigram': 0.5},\n    main_modality='@lemmatized',\n    specific_topics=14,\n    background_topics=1,\n)\n```\n\nNote that here we have a model with two modalities: `'@lemmatized'` and `'@bigram'`.\nFurther, if needed, one can define a custom score to be calculated during model training.\n\n```python\nimport numpy as np\n\nfrom topicnet.cooking_machine.models.base_score import BaseScore\n\n\nclass CustomScore(BaseScore):\n    def __init__(self):\n        super().__init__()\n\n    def call(self,\n             model,\n             eps=1e-5,\n             n_specific_topics=14):\n\n        # Phi columns of the specific topics only\n        phi = model.get_phi().values[:, :n_specific_topics]\n        # Share of near-zero entries among all entries below 1\n        specific_sparsity = np.sum(phi \u003c eps) / np.sum(phi \u003c 1)\n\n        return specific_sparsity\n```\n\nNow, a `TopicModel` with the custom score can be defined:\n\n```python\nfrom topicnet.cooking_machine.models.topic_model import TopicModel\n\n\ncustom_scores = {'SpecificSparsity': CustomScore()}\ntopic_model = TopicModel(artm_model, model_id='Groot', custom_scores=custom_scores)\n```\n\n### Define an Experiment\n\nAn `Experiment` is necessary for further model training and tuning:\n\n```python\nfrom topicnet.cooking_machine.experiment import Experiment\n\n\nexperiment = Experiment(\n    experiment_id=\"simple_experiment\", save_path=\"experiments\", topic_model=topic_model\n)\n```\n\n### Toy with the Cubes\n\nDefine the next stage of model training to select a decorrelator parameter:\n\n```python\nimport artm\n\nfrom topicnet.cooking_machine.cubes import RegularizersModifierCube\n\n\nmy_first_cube = RegularizersModifierCube(\n    num_iter=5,\n    tracked_score_function='PerplexityScore@lemmatized',\n    regularizer_parameters={\n        'regularizer': artm.DecorrelatorPhiRegularizer(name='decorrelation_phi', tau=1),\n        'tau_grid': [0, 1, 2, 3, 4, 5],\n    },\n    reg_search='grid',\n    
verbose=True,\n)\n\nmy_first_cube(topic_model, dataset)\n```\n\nSelect the model with the best perplexity score:\n\n```python\nperplexity_criterion = 'PerplexityScore@lemmatized -\u003e min COLLECT 1'\n# select() returns a list of models, so take the first one\nbest_model = experiment.select(perplexity_criterion)[0]\n```\n\n### Alternatively: Use Recipes\n\nIf you need a topic model now, you can use one of the code snippets we call *recipes*.\n\n```python\nfrom topicnet.cooking_machine.recipes import BaselineRecipe\n\n\nEXPERIMENT_PATH = '/home/user/experiment/'\n\ntraining_pipeline = BaselineRecipe()\ntraining_pipeline.format_recipe(dataset_path=DATASET_PATH)\nexperiment, dataset = training_pipeline.build_experiment_environment(\n    save_path=EXPERIMENT_PATH\n)\n```\n\nAfter that, you can expect the following result:\n![run_result](./docs/readme_images/experiment_train.gif)\n\n\n### View the Results\n\nBrowsing the model is easy: create a viewer and call its `view()` method (or `view_from_jupyter()`, which is recommended when working in a Jupyter Notebook):\n\n```python\nfrom topicnet.viewers import TopTokensViewer\n\n\ntoptok_viewer = TopTokensViewer(best_model, num_top_tokens=10, method='phi')\ntoptok_viewer.view_from_jupyter()\n```\n\nMore information about the different viewers is available here: [`viewers`](topicnet/viewers).\n\n# FAQ\n\n### In the examples, modalities are written like **@modality**. Is it a VowpalWabbit format?\n\nIt is a convention, taken by TopicNet from BigARTM, to designate modalities with the @ sign.\n\n### CubeCreator helps to perform a grid search over initial model parameters. 
How can I do it with modalities?\n\nThe modality search space can be defined using the library's standard logic:\n\n```python\nfrom topicnet.cooking_machine.cubes import CubeCreator\n\n\nclass_ids_cube = CubeCreator(\n    num_iter=5,\n    parameters=[\n        {\n            'name': 'class_ids',\n            'values': {\n                '@text':   [1, 2, 3],\n                '@ngrams': [4, 5, 6],\n            },\n        },\n    ],\n    reg_search='grid',\n    verbose=True,\n)\n```\n\nHowever, for modalities, a couple of slightly more convenient forms are available:\n\n```python\n# Option 1: one entry per modality\nparameters = [\n    {\n        'name'  : 'class_ids@text',\n        'values': [1, 2, 3]\n    },\n    {\n        'name'  : 'class_ids@ngrams',\n        'values': [4, 5, 6]\n    }\n]\n\n# Option 2: a single dictionary covering all modalities\nparameters = [\n    {\n        'class_ids@text'  : [1, 2, 3],\n        'class_ids@ngrams': [4, 5, 6]\n    }\n]\n```\n\n# Contribution\n\nIf you find a bug, or if you would like the library to have some new features, you are welcome to contact us or create an issue or a pull request!\n\nIt is also worth noting that the TopicNet library is always open to improvements in several areas:\n\n* New custom regularizers.\n* New topic model scores.\n* New topic models or recipes to train topic models for a particular task/with some special properties.\n* New datasets (so as to make them available for everyone to download and conduct experiments with topic models).\n\n\n# Citing TopicNet\n\nWhen citing `topicnet` in academic papers and theses, please use this BibTeX entry:\n\n```\n@InProceedings{bulatov-EtAl:2020:LREC,\n  author    = {Bulatov, Victor  and  Alekseev, Vasiliy  and  Vorontsov, Konstantin  and  Polyudova, Darya  and  Veselova, Eugenia  and  Goncharov, Alexey  and  Egorov, Evgeny},\n  title     = {TopicNet: Making Additive Regularisation for Topic Modelling Accessible},\n  booktitle      = {Proceedings of The 12th Language Resources and Evaluation Conference},\n  month          = {May},\n  year           = {2020},\n  address        = {Marseille, France},\n  publisher      = {European Language Resources 
Association},\n  pages     = {6747--6754},\n  url       = {https://www.aclweb.org/anthology/2020.lrec-1.833}\n}\n\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmachine-intelligence-laboratory%2FTopicNet","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmachine-intelligence-laboratory%2FTopicNet","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmachine-intelligence-laboratory%2FTopicNet/lists"}