{"id":13482402,"url":"https://github.com/bigartm/bigartm","last_synced_at":"2025-04-08T02:36:10.669Z","repository":{"id":20787485,"uuid":"24072579","full_name":"bigartm/bigartm","owner":"bigartm","description":"Fast topic modeling platform","archived":false,"fork":false,"pushed_at":"2023-08-19T16:18:52.000Z","size":17619,"stargazers_count":668,"open_issues_count":138,"forks_count":120,"subscribers_count":40,"default_branch":"master","last_synced_at":"2025-04-01T01:36:41.921Z","etag":null,"topics":["bigartm","bigdata","c-plus-plus","machine-learning","python","python-api","regularizer","text-mining","topic-modeling"],"latest_commit_sha":null,"homepage":"http://bigartm.org/","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bigartm.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2014-09-15T20:26:11.000Z","updated_at":"2025-03-21T11:39:07.000Z","dependencies_parsed_at":"2023-10-20T18:24:30.833Z","dependency_job_id":null,"html_url":"https://github.com/bigartm/bigartm","commit_stats":{"total_commits":1285,"total_committers":51,"mean_commits":25.19607843137255,"dds":0.6038910505836577,"last_synced_commit":"47e37f982de87aa67bfd475ff1f39da696b181b3"},"previous_names":[],"tags_count":30,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigartm%2Fbigartm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigartm%2Fbigartm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigartm%2Fbigartm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/b
igartm%2Fbigartm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bigartm","download_url":"https://codeload.github.com/bigartm/bigartm/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247765460,"owners_count":20992314,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigartm","bigdata","c-plus-plus","machine-learning","python","python-api","regularizer","text-mining","topic-modeling"],"created_at":"2024-07-31T17:01:01.650Z","updated_at":"2025-04-08T02:36:10.649Z","avatar_url":"https://github.com/bigartm.png","language":"C++","funding_links":[],"categories":["C++","Libraries","Libraries \u0026 Toolkits","Python","APIs and Libraries","函式庫","Packages","Machine Learning"],"sub_categories":["Videos and Online Courses","General-Purpose Machine Learning","Knowledge Graphs","書籍","Libraries","CI/CD"],"readme":"\u003cp align=\"center\"\u003e\n\t\u003cimg alt=\"BigARTM Logo\" src=\"http://bigartm.org/img/BigARTM-logo.svg\" width=\"250\"\u003e\n\u003c/p\u003e\n\nThe state-of-the-art platform for topic modeling.\n\n[![Build Status](https://secure.travis-ci.org/bigartm/bigartm.png)](https://travis-ci.org/bigartm/bigartm)\n[![Windows Build Status](https://ci.appveyor.com/api/projects/status/i18k840shuhr2jtk/branch/master?svg=true)](https://ci.appveyor.com/project/bigartm/bigartm)\n[![GitHub 
license](https://img.shields.io/badge/license-New%20BSD-blue.svg)](https://raw.github.com/bigartm/bigartm/master/LICENSE.txt)\n[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.288960.svg)](https://doi.org/10.5281/zenodo.288960)\n\n  - [Full Documentation](http://docs.bigartm.org/)\n  - [User Mailing List](https://groups.google.com/forum/#!forum/bigartm-users)\n  - [Download Releases](https://github.com/bigartm/bigartm/releases)\n  - [User survey](http://goo.gl/forms/tr5EsPMcL2)\n\n\n# What is BigARTM?\n\nBigARTM is a powerful tool for [topic modeling](https://en.wikipedia.org/wiki/Topic_model) based on a novel technique called Additive Regularization of Topic Models. This technique effectively builds multi-objective models by adding a weighted sum of regularizers to the optimization criterion. BigARTM combines very different objectives well, including sparsing, smoothing, topic decorrelation, and many others. Such a combination of regularizers significantly improves several quality measures at once, with almost no loss of perplexity.\n\n### References\n\n* Vorontsov K., Frei O., Apishev M., Romov P., Dudarenko M. BigARTM: [Open Source Library for Regularized Multimodal Topic Modeling of Large Collections](https://s3-eu-west-1.amazonaws.com/artm/Voron15aist.pdf) //  Analysis of Images, Social Networks and Texts. 2015.\n* Vorontsov K., Frei O., Apishev M., Romov P., Dudarenko M., Yanina A. [Non-Bayesian Additive Regularization for Multimodal Topic Modeling of Large Collections](https://s3-eu-west-1.amazonaws.com/artm/Voron15cikm-tm.pdf) // Proceedings of the 2015 Workshop on Topic Models: Post-Processing and Applications, October 19, 2015 - pp. 29-37.\n* Vorontsov K., Potapenko A., Plavin A. [Additive Regularization of Topic Models for Topic Selection and Sparse Factorization.](https://s3-eu-west-1.amazonaws.com/artm/voron15slds.pdf) // Statistical Learning and Data Sciences. 2015 - pp. 193-202.\n* Vorontsov K. V., Potapenko A. A. 
[Additive Regularization of Topic Models](https://s3-eu-west-1.amazonaws.com/artm/voron-potap14artm-eng.pdf) // Machine Learning Journal, Special Issue “Data Analysis and Intelligent Optimization”, Springer, 2014.\n* More publications can be found on our [wiki page](https://github.com/bigartm/bigartm/wiki/Publications).\n\n### Related Software Packages\n\n- [TopicNet](https://github.com/machine-intelligence-laboratory/TopicNet/) is a high-level interface for BigARTM which is helpful for rapid solution prototyping and for exploring the topics of finished ARTM models.\n- [David Blei's List](http://www.cs.columbia.edu/~blei/topicmodeling_software.html) of Open Source topic modeling software\n- [MALLET](http://mallet.cs.umass.edu/topics.php): Java-based toolkit for language processing with a topic modeling package\n- [Gensim](https://radimrehurek.com/gensim/): Python topic modeling library\n- [Vowpal Wabbit](https://github.com/JohnLangford/vowpal_wabbit) has an implementation of the [Online-LDA algorithm](https://github.com/JohnLangford/vowpal_wabbit/wiki/Latent-Dirichlet-Allocation)\n\n\n# Installation\n### Installing with pip (Linux only)\n\nWe have a PyPI release for Linux:\n```bash\n$ pip install bigartm\n```\nor\n```bash\n$ pip install bigartm10\n```\n\n### Installing on Windows\nWe suggest [using pre-built binaries](https://bigartm.readthedocs.io/en/master/installation/windows.html).\n\nIt is also possible to [compile C++ code on Windows](https://bigartm.readthedocs.io/en/master/devguide/dev_build_windows.html) if you want the latest development version.\n\n### Installing on Linux / MacOS\nDownload a [binary release](https://github.com/bigartm/bigartm/releases) or build from source using cmake:\n```bash\n$ mkdir build \u0026\u0026 cd build\n$ cmake ..\n$ make install\n```\n\nSee [here](https://bigartm.readthedocs.io/en/master/installation/linux.html) for detailed instructions.\n\n# How to Use\n\n### Command-line interface\n\nCheck out [documentation for 
`bigartm`](http://docs.bigartm.org/en/latest/tutorials/bigartm_cli.html).\n\nExamples:\n\n* Basic model (20 topics, written to a CSV file, inferred in 10 passes)\n\n```bash\nbigartm.exe -d docword.kos.txt -v vocab.kos.txt --write-model-readable model.txt\n--passes 10 --batch-size 50 --topics 20\n```\n\n* Basic model with fewer tokens (extreme values filtered out based on token frequency)\n```bash\nbigartm.exe -d docword.kos.txt -v vocab.kos.txt --dictionary-max-df 50% --dictionary-min-df 2\n--passes 10 --batch-size 50 --topics 20 --write-model-readable model.txt\n```\n\n* Simple regularized model (increase sparsity up to 60-70%)\n```bash\nbigartm.exe -d docword.kos.txt -v vocab.kos.txt --dictionary-max-df 50% --dictionary-min-df 2\n--passes 10 --batch-size 50 --topics 20  --write-model-readable model.txt \n--regularizer \"0.05 SparsePhi\" \"0.05 SparseTheta\"\n```\n\n* More advanced regularized model, with 10 sparse objective topics and 2 smooth background topics\n```bash\nbigartm.exe -d docword.kos.txt -v vocab.kos.txt --dictionary-max-df 50% --dictionary-min-df 2\n--passes 10 --batch-size 50 --topics obj:10;background:2 --write-model-readable model.txt\n--regularizer \"0.05 SparsePhi #obj\"\n--regularizer \"0.05 SparseTheta #obj\"\n--regularizer \"0.25 SmoothPhi #background\"\n--regularizer \"0.25 SmoothTheta #background\" \n```\n\n### Interactive Python interface\n\nBigARTM provides a full-featured and clear Python API (see [Installation](http://docs.bigartm.org/en/latest/installation/index.html) to configure the Python API for your OS).\n\nExample:\n\n```python\nimport artm\n\n# Prepare data\n# Case 1: data in CountVectorizer format\nfrom sklearn.feature_extraction.text import CountVectorizer\nfrom sklearn.datasets import fetch_20newsgroups\nfrom numpy import array\n\ncv = CountVectorizer(max_features=1000, stop_words='english')\nn_wd = array(cv.fit_transform(fetch_20newsgroups().data).todense()).T\nvocabulary = cv.get_feature_names()\n\nbv = 
artm.BatchVectorizer(data_format='bow_n_wd',\n                          n_wd=n_wd,\n                          vocabulary=vocabulary)\n\n# Case 2: data in UCI format (https://archive.ics.uci.edu/ml/datasets/Bag+of+Words)\nbv = artm.BatchVectorizer(data_format='bow_uci',\n                          collection_name='kos',\n                          target_folder='kos_batches')\n\n# Learn simple LDA model (or you can use advanced artm.ARTM)\nmodel = artm.LDA(num_topics=15, dictionary=bv.dictionary)\nmodel.fit_offline(bv, num_collection_passes=20)\n\n# Print results\nmodel.get_top_tokens()\n```\n\nRefer to the [tutorials](http://docs.bigartm.org/en/latest/tutorials/python_tutorial.html) for details on how to start using BigARTM from Python; the [user's guide](http://docs.bigartm.org/en/latest/tutorials/python_userguide/index.html) covers more advanced features and cases.\n\n### Low-level API\n\n  - [C++ Interface](http://docs.bigartm.org/en/latest/api_references/cpp_interface.html)\n  - [Plain C Interface](http://docs.bigartm.org/en/latest/api_references/c_interface.html)\n\n\n## Contributing\n\nRefer to the [Developer's Guide](http://docs.bigartm.org/en/latest/devguide.html) and follow the [Code Style](https://github.com/bigartm/bigartm/wiki/Code-style).\n\nTo report a bug, use the [issue tracker](https://github.com/bigartm/bigartm/issues). To ask a question, use [our mailing list](https://groups.google.com/forum/#!forum/bigartm-users). 
Feel free to open a [pull request](https://github.com/bigartm/bigartm/pulls).\n\n\n## License\n\nBigARTM is released under the [New BSD License](https://raw.github.com/bigartm/bigartm/master/LICENSE), which allows unlimited redistribution for any purpose (even for commercial use) as long as its copyright notices and the license’s disclaimers of warranty are maintained.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbigartm%2Fbigartm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbigartm%2Fbigartm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbigartm%2Fbigartm/lists"}