{"id":19865604,"url":"https://github.com/ct-clmsn/miniaturist","last_synced_at":"2026-06-09T03:06:19.480Z","repository":{"id":76760439,"uuid":"386125526","full_name":"ct-clmsn/miniaturist","owner":"ct-clmsn","description":"latent dirichlet allocation (topic modeling) implementations for hpc and cloud systems","archived":false,"fork":false,"pushed_at":"2021-09-14T01:34:44.000Z","size":377,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-11T15:46:02.266Z","etag":null,"topics":["cloud-computing","cxx17","distributed-computing","hdfs","hpc","hpx","latent-dirichlet-allocation","machine-learning","natural-language-processing","nlp","phylanx","pybind11","supercomputing","topic-modeling"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsl-1.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ct-clmsn.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-07-15T01:33:02.000Z","updated_at":"2021-09-14T01:34:46.000Z","dependencies_parsed_at":null,"dependency_job_id":"06185326-3b57-4e72-bbff-611647db2658","html_url":"https://github.com/ct-clmsn/miniaturist","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ct-clmsn%2Fminiaturist","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ct-clmsn%2Fminiaturist/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ct-clmsn%2Fminiaturist/releases","mani
fests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ct-clmsn%2Fminiaturist/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ct-clmsn","download_url":"https://codeload.github.com/ct-clmsn/miniaturist/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241275029,"owners_count":19937295,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cloud-computing","cxx17","distributed-computing","hdfs","hpc","hpx","latent-dirichlet-allocation","machine-learning","natural-language-processing","nlp","phylanx","pybind11","supercomputing","topic-modeling"],"created_at":"2024-11-12T15:23:23.251Z","updated_at":"2026-06-09T03:06:14.460Z","avatar_url":"https://github.com/ct-clmsn.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003c!-- Copyright (c) 2021 Christopher Taylor                                          --\u003e\n\u003c!--                                                                                --\u003e\n\u003c!--   Distributed under the Boost Software License, Version 1.0. 
(See accompanying --\u003e\n\u003c!--   file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)        --\u003e\n\u003c!--                                                                                --\u003e\n# [miniaturist](https://github.com/ct-clmsn/miniaturist)\n\nThis project provides implementations of a Latent Dirichlet\nAllocation algorithm found [here](https://www.ics.uci.edu/~asuncion/software/fast.htm).\nLatent Dirichlet Allocation is colloquially called \"topic modeling\". This\nproject offers topic modeling capabilities that work as a sequential program,\na parallel (threaded) program, and a distributed parallel program that can be\nexecuted on High Performance Computing (HPC)/Supercomputing systems or a Cloud.\n\nThe following implementations are provided:\n\n* lda - the sequential implementation\n* parlda - the parallel implementation for multi-core systems\n* distparlda - the distributed, parallel implementation for High Performance Computers (HPC)/Supercomputers\n* distparldahdfs - the distributed, parallel implementation for Clouds ([Hadoop filesystem, HDFS](https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html))\n\nOptional extensions provided:\n\n* A plugin for the [Phylanx](https://github.com/STEllAR-GROUP/phylanx) distributed array toolkit\n* [Python](https://www.python.org/) bindings (modules: pylda, pyparlda, pydistparlda)\n\nThe following tools are provided:\n\n* vocab - a sequential program to compute the vocabulary set found in all\ndocuments of a corpus\n* distvocab - a distributed memory program to compute the vocabulary set\nfound in all documents of a corpus\n* distvocabhdfs - a distributed memory program to compute the vocabulary set\nfound in all documents of a corpus on HDFS\n\nThe implementation uses a modified version of the collapsed Gibbs\nsampler as defined by Newman, Asuncion, Smyth, and Welling.\n\nModifications have been made to the Newman et al. treatment in an\neffort to utilize term-document matrices as the 
storage structure\nfor document histograms.\n\nThis implementation only scales in the direction of documents. Users\nare required to provide a vocabulary list in order to make use of this\nimplementation. Vocabulary lists can be populated programmatically or\nloaded into a modeling program from a file containing a newline-delimited\nlist of words.\n\nThe vocabulary building tools print out the set of words encountered during\na single linear traversal of the documents. All vocabulary building tools print\nresults to stdout (the terminal).\n\nThe distributed vocabulary building tool prints to the stdout of each\nmachine it runs on. Users should redirect the output\nof the distributed vocabulary building tool either to a distributed filesystem,\nusing a filename that is unique to the locality identifier (integer) of\nthe program instance, or into a remote /tmp directory that is accessible\nfor a file copy (scp).\n\n## How To Build The Container\n\nContainer users should use one of the following commands. The container is\nUbuntu 20.04 based and uses some packages from 'universe'.\n\n* `sudo singularity build miniaturist.sif miniaturist.def`\n* `singularity build --fakeroot miniaturist.sif miniaturist.def`\n\nThe container uses wget to download a binary build of cmake from\nKitware's website and the OTF2 source from VI-HPS. Other software\ndependencies are cloned with git from their respective source\nrepositories and compiled into the container.\n\n## How To Build From Source\n\nThis project is built with cmake. Create a directory\ncalled 'build' and change directory into 'build'. 
At this point\nthe user will need to type something like the following:\n\n* `cmake -Dblaze_DIR=\u003cPATH_TO_BLAZE_CMAKEFILE\u003e -DHPX_DIR=\u003cPATH_TO_HPX_CMAKEFILE\u003e ..`\n\nThese are possible directories where the blaze and hpx cmakefiles can\nbe found:\n\n* `PATH_TO_BLAZE_CMAKEFILE=/usr/share/blaze/cmake`\n* `PATH_TO_HPX_CMAKEFILE=/usr/lib/cmake/HPX`\n\nAdd the following for Cloud (Hadoop File System - HDFS) support:\n\n* `cmake -Dblaze_DIR=\u003cPATH_TO_BLAZE_CMAKEFILE\u003e -DHPX_DIR=\u003cPATH_TO_HPX_CMAKEFILE\u003e -Dlibhdfs3_DIR=\u003cPATH_TO_LIBHDFS3_INSTALL\u003e ..`\n\nThis is a possible location for libhdfs3.so and hdfs/hdfs.h:\n\n* `PATH_TO_LIBHDFS3_INSTALL=/usr/`\n\nAdd the following for Python bindings to be built. To inform cmake where\npybind11 is installed, use the `-Dpybind11_DIR=\u003cPATH_TO_PYBIND11_CMAKEFILE\u003e`\nflag.\n\n* `cmake -Dblaze_DIR=\u003cPATH_TO_BLAZE_CMAKEFILE\u003e -DHPX_DIR=\u003cPATH_TO_HPX_CMAKEFILE\u003e -Dpybind11_DIR=\u003cPATH_TO_PYBIND11_CMAKEFILE\u003e`\n\nHere is a possible directory where the pybind11 cmakefiles are located:\n\n* `PATH_TO_PYBIND11_CMAKEFILE=/usr/share/cmake/pybind11`\n\n## How To Use\n\nTopic Modeling Program names:\n\n* lda, single node, sequential (no threads), implementation\n* parlda, single node, parallel, implementation\n* distparlda, distributed (multi-node), parallel, implementation\n* distparldahdfs, distributed (multi-node), parallel, HDFS implementation\n\nCommand line arguments for all topic modeling programs:\n\n* --num_topics=[enter an unsigned integer value for number of topics], required\n* --vocab_list=[enter a valid path to the file containing the vocabulary list], required\n* --corpus_dir=[enter a valid path to the directory containing the training corpus], required\n* --regex=[enter a regular expression], default [\\p{L}\\p{M}]+\n* --num_iters=[enter an unsigned integer value for iterations], default 1000\n* --alpha=[enter a floating point number for alpha prior], default 
0.1\n* --beta=[enter a floating point number for beta prior], default 0.01\n\nAdditional command line arguments for parlda:\n\n* --hpx:threads=[enter an unsigned integer value for number of threads], optional\n* --hpx:numa-sensitive=1, runtime system's thread scheduler considers numa domains, optional\n\nAdditional command line arguments for distparlda:\n\n* --hpx:threads=[enter an unsigned integer value for number of threads], optional\n* --hpx:numa-sensitive=1, runtime system's thread scheduler considers numa domains, optional\n* --hpx:nodes=[enter an unsigned integer value for number of threads], optional\n\nAdditional command line arguments for distparldahdfs:\n\n* --hpx:threads=[enter an unsigned integer value for number of threads], optional\n* --hpx:numa-sensitive=1, runtime system's thread scheduler considers numa domains, optional\n* --hpx:nodes=[enter an unsigned integer value for number of threads], optional\n* --hdfs_namenode_address=[enter string], required\n* --hdfs_namenode_port=[unsigned integer for hdfs namenode port], required\n* --hdfs_buffer_size=[unsigned integer buffer size for file reads from hdfs], default 1024\n* --hdfs_block_size=[unsigned integer buffer size for file writes to hdfs], default 1024\n\nVocabulary Building Program names:\n\n* vocab, single node, sequential (no thread), vocabulary builder\n* distvocab, distributed, sequential (no thread), vocabulary builder\n* distvocabhdfs, distributed, sequential (no thread), vocabulary builder for HDFS\n\nCommand line arguments for all vocabulary programs:\n\n* --corpus_dir=[enter a valid path to the directory containing the training corpus], required\n* --regex=[enter a regular expression], default [\\p{L}\\p{M}]+, optional\n* --filter=[unsigned integer frequency count above which vocabulary words are printed out], optional\n* --histogram, print out the global count of each word (default off), optional\n\nAdditional command line arguments for distvocabhdfs:\n\n* 
--hdfs_namenode_address=[enter string], required\n* --hdfs_namenode_port=[unsigned integer for hdfs namenode port], required\n* --hdfs_buffer_size=[unsigned integer buffer size for file reads from hdfs], default 1024\n* --hdfs_block_size=[unsigned integer buffer size for file writes to hdfs], default 1024\n\nTopic Modeling Libraries:\n\n* libldalib.a, ldalib.hpp single node, sequential (no threads), implementation\n* libparldalib.a, parldalib.hpp single node, parallel, implementation\n* libdistparldalib.a, distparldalib.hpp distributed (multi-node), parallel, implementation\n\nThe Python bindings require users to type the following in python3.8:\n\n`from pylda import lda`\n\n## How To Use Container\n\nUse the following commands to print out the command line arguments enumerated\nabove for each program in the container.\n\n* `singularity help miniaturist.sif`\n* `singularity help --app vocab miniaturist.sif`\n* `singularity help --app distvocab miniaturist.sif`\n* `singularity help --app distvocabhdfs miniaturist.sif`\n* `singularity help --app lda miniaturist.sif`\n* `singularity help --app parlda miniaturist.sif`\n* `singularity help --app distparlda miniaturist.sif`\n* `singularity help --app distparldahdfs miniaturist.sif`\n\n## Implementation Notes\n\nThis implementation loads the corpus into an inverted index. The inverted index\nis converted into a sparse matrix that is transposed into a document-term matrix.\nThis step is required to permit the parallel processing of the corpus by document.\nThe implementation processes subsets of the document-term matrix in parallel chunks.\nA sparse matrix is used to store the document-term matrix to minimize a\nrequired O(N^2) algorithmic cost spent traversing the matrix.\n\nThe most time-consuming portion of each implementation is the creation of the sparse\nmatrix which stores the document-term matrix. 
To accelerate the creation of the matrix,\nusers are strongly encouraged to spend a fair amount of time studying the corpus\nto identify a reasonably sized vocabulary set. The larger the vocabulary set, the sparser\nthe document-term matrix becomes. There is a direct correlation between each implementation's\nperformance and the size of the vocabulary set.\n\n## Usage Notes\n\nIf you use the vocabulary building tools with a specific regular expression in mind, make\nsure to use the same regular expression when invoking lda, parlda, or distparlda. Using the\nsame regular expression to parse text when building the vocabulary and when modeling is\nimportant for successful program execution.\n\n## HPX Compilation Flags\n\nTake time to review the build options for HPX [here](https://hpx-docs.stellar-group.org/latest/html/manual/building_hpx.html).\nBelow is a short list of recommended options.\n\nFor Cloud Environments:\n\n* HPX_WITH_PARCELPORT_TCP=ON\n* HPX_WITH_FAULT_TOLERANCE=ON\n\nFor HPC Environments w/MPI:\n\n* HPX_WITH_PARCELPORT_MPI=ON\n\nFor HPC Environments w/libfabric:\n\n* HPX_WITH_PARCELPORT_LIBFABRIC=ON\n* HPX_WITH_PARCELPORT_LIBFABRIC_PROVIDER=[gni,verbs,psm2,etc]\n\nDynamic runtime instrumentation for Cloud and/or HPC:\n\n* HPX_WITH_APEX=ON (requires [APEX](https://github.com/khuck/xpress-apex) installation)\n\nFor OpenMP enabled LAPACK, BLAS, Blaze implementations:\n\n* [hpxMP](https://github.com/STEllAR-GROUP/hpxMP) - provides HPX/APEX the ability to manage OpenMP\n\n## Licenses\n\n* The Jump Consistency Hash source code was taken from a blog post\nwith no license provided. The blog post and author are referenced\nin a comment at the top of the file 'jch.hpp'.\n\n* The drand48 logic comes from [musl-libc](https://www.musl-libc.org/). 
The\nsource code is MIT Licensed.\n\n* The remainder of the source code in this project is [Boost Licensed](https://www.boost.org/users/license.html)\nand the license terms can be found in the file 'LICENSE'.\n\n## Dependencies\n\n* C++17\n* STE||AR HPX\n* Blaze\n* ICU\n* LAPACK\n* BLAS\n* OpenSSL\n* pkg-config\n* cmake \u003e= 3.17\n\n## Optional Dependencies\n\n* APEX\n* hpxMP\n* Phylanx\n* pybind11 (Python support)\n* libhdfs3 (Hadoop Filesystem/HDFS support)\n* singularity\n\n## Special Thanks\n\n* [STE||AR Group](https://github.com/STEllAR-GROUP/hpx)\n* [Phylanx](https://github.com/STEllAR-GROUP/phylanx)\n* Blaze C++ Library ([Blaze](https://bitbucket.org/blaze-lib/blaze/src/master/))\n* Erlend Hamberg (Jump Consistency Hash)\n* Erik Muttersbach ([libhdfs3](https://github.com/erikmuttersbach/libhdfs3))\n\n## References\n* D. Newman, A. Asuncion, P. Smyth, M. Welling. \"Distributed Algorithms for Topic Models.\" JMLR 2009.\n* H. Kaiser, M. Brodowicz and T. Sterling: ParalleX: An Advanced Parallel Execution Model for Scaling-Impaired Applications, International Conference on Parallel Processing Workshops (2009 – Los Alamos, California).\n* Kevin Huck, Allan Porterfield, Nick Chaimov, Hartmut Kaiser, Allen D. Malony, Thomas Sterling, Rob Fowler. An Autonomic Performance Environment for Exascale. Supercomputing frontiers and innovations, 2.3 (2015).\n* Jeremy Kemp; Tianyi Zhang; Shahrzad Shirzad; Bryce Adelstein Lelbach aka wash; Hartmut Kaiser; Bibek Wagle; Parsa Amini; Alireza Kheirkhahan. \"[hpxMP](https://github.com/STEllAR-GROUP/hpxMP) v0.3.0: An OpenMP runtime implemented using HPX\".\n\n## Author\n\nChristopher Taylor\n\n## Date\n\n07/03/2021\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fct-clmsn%2Fminiaturist","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fct-clmsn%2Fminiaturist","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fct-clmsn%2Fminiaturist/lists"}