{"id":22668562,"url":"https://github.com/lanl/pydnmfk","last_synced_at":"2025-04-12T11:08:19.850Z","repository":{"id":48985131,"uuid":"355419372","full_name":"lanl/pyDNMFk","owner":"lanl","description":"Python Distributed Non Negative Matrix Factorization with custom clustering ","archived":false,"fork":false,"pushed_at":"2023-08-22T05:27:08.000Z","size":13021,"stargazers_count":20,"open_issues_count":1,"forks_count":6,"subscribers_count":5,"default_branch":"main","last_synced_at":"2024-06-27T00:19:26.003Z","etag":null,"topics":["cupy","distributed-computing","hpc","latent-features","machine-learning","mpi4py","nccl","nonnegative-matrix-factorization","outofmemory","python","tensorfactorization"],"latest_commit_sha":null,"homepage":"https://lanl.github.io/pyDNMFk/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lanl.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2021-04-07T05:10:21.000Z","updated_at":"2024-05-18T04:44:56.000Z","dependencies_parsed_at":"2023-09-26T03:55:44.440Z","dependency_job_id":null,"html_url":"https://github.com/lanl/pyDNMFk","commit_stats":{"total_commits":138,"total_committers":3,"mean_commits":46.0,"dds":"0.35507246376811596","last_synced_commit":"c4616e641d57d8d58a368cf5b65f9e1d0f5b2d39"},"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lanl%2FpyDNMFk","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lanl%2FpyDNMFk/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lanl%2FpyDNMFk/releases","mani
fests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lanl%2FpyDNMFk/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lanl","download_url":"https://codeload.github.com/lanl/pyDNMFk/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":228911888,"owners_count":17990774,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cupy","distributed-computing","hpc","latent-features","machine-learning","mpi4py","nccl","nonnegative-matrix-factorization","outofmemory","python","tensorfactorization"],"created_at":"2024-12-09T15:15:45.142Z","updated_at":"2024-12-09T15:15:45.896Z","avatar_url":"https://github.com/lanl.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# [pyDNMFk: Python Distributed Non Negative Matrix Factorization with determination of hidden features](https://github.com/lanl/pyDNMFk)\n\n\n\u003cdiv align=\"center\", style=\"font-size: 50px\"\u003e\n\n[![Build Status](https://github.com/lanl/pyDNMFk/actions/workflows/ci_test.yml/badge.svg?branch=main)](https://github.com/lanl/pyDNMFk/actions/workflows/ci_test.yml/badge.svg?branch=main) [![License](https://img.shields.io/badge/License-BSD%203--Clause-blue.svg)](https://img.shields.io/badge/License-BSD%203--Clause-blue.svg) [![Python Version](https://img.shields.io/badge/python-v3.7.1-blue)](https://img.shields.io/badge/python-v3.7.1-blue) [![DOI](https://img.shields.io/badge/DOI-10.5281%2Fzenodo.4722448-blue.svg)](https://doi.org/10.5281/zenodo.4722448)\n\n\u003c/div\u003e\n\n\u003cp 
align=\"center\"\u003e\n  \u003cimg width=\"160\" height=\"200\" src=\"docs/RD100.png\"\u003e\n\u003c/p\u003e\n\n\n\u003cbr\u003e\n\n[pyDNMFk](https://github.com/lanl/pyDNMFk) is a software package for applying non-negative matrix factorization in a distributed fashion to large datasets. It minimizes the difference between the reconstructed data and the original data under a choice of objectives (Frobenius norm or KL divergence). Additionally, the Custom Clustering algorithm allows for automated determination of the number of latent features.\n\n\u003cdiv align=\"center\", style=\"font-size: 50px\"\u003e\n\n### [:information_source: Documentation](https://lanl.github.io/pyDNMFk/) \u0026emsp; [:orange_book: Examples](examples/) \u0026emsp; [:bar_chart: Datasets](data/) \u0026emsp; [:page_facing_up: Paper](https://ieeexplore.ieee.org/abstract/document/9286234)\n\n\u003c/div\u003e\n\n\u003chr/\u003e\n\n\n![plot](./docs/pyDNMFk_RD500.png)\n\n## Features:\n\n* Utilization of MPI4py for distributed operation.\n* Distributed NNSVD and SVD initializations.\n* Distributed Custom Clustering algorithm for automated determination of the number of latent features (k).\n* Minimization of a KL divergence or Frobenius norm objective.\n* Optimization with multiplicative updates, BCD, and HALS.\n* Checkpoints that track runtime status and enable restart from a saved state.\n* Distributed pruning of zero rows and zero columns of the data. 
\n\n![plot](./docs/pyDNMFk.png)\n\nOverview of the pyDNMFk workflow implementation.\n## Installation:\n\nOn a desktop machine:\n```\ngit clone https://github.com/lanl/pyDNMFk.git\ncd pyDNMFk\nconda create --name pyDNMFk python=3.7.1 openmpi mpi4py\nsource activate pyDNMFk\npython setup.py install\n```\n\n\u003chr/\u003e\n\nOn an HPC server:\n```\ngit clone https://github.com/lanl/pyDNMFk.git\ncd pyDNMFk\nconda create --name pyDNMFk python=3.7.1 \nsource activate pyDNMFk\nmodule load \u003copenmpi\u003e\npip install mpi4py\npython setup.py install\n```\n\n## Prerequisites\n* conda\n* numpy\u003e=1.2\n* matplotlib\n* MPI4py\n* scipy\n* h5py\n\n## Documentation\n\nYou can find the documentation [here](https://lanl.github.io/pyDNMFk/). \n\n\n## Usage\n**[main.py](main.py) can be used to run the software from the command line:**\n\n```bash\nmpirun -n \u003cprocs\u003e python main.py [-h] [--process PROCESS] --p_r P_R --p_c P_C [--k K]\n               [--fpath FPATH] [--ftype FTYPE] [--fname FNAME] [--init INIT]\n               [--itr ITR] [--norm NORM] [--method METHOD] [--verbose VERBOSE]\n               [--results_path RESULTS_PATH] [--checkpoint CHECKPOINT]\n               [--timing_stats TIMING_STATS] [--prune PRUNE]\n               [--precision PRECISION] [--perturbations PERTURBATIONS]\n               [--noise_var NOISE_VAR] [--start_k START_K] [--end_k END_K]\n               [--step_k STEP_K] [--sill_thr SILL_THR] [--sampling SAMPLING]\n\n\narguments:\n  -h, --help            show this help message and exit\n  --process PROCESS     pyDNMF/pyDNMFk\n  --p_r P_R             Now of row processors\n  --p_c P_C             Now of column processors\n  --k K                 feature count\n  --fpath FPATH         data path to read(eg: tmp/)\n  --ftype FTYPE         data type : mat/folder/h5\n  --fname FNAME         File name\n  --init INIT           NMF initializations: rand/nnsvd\n  --itr ITR             NMF iterations, default:1000\n  --norm NORM           Reconstruction Norm 
for NMF to optimize:KL/FRO\n  --method METHOD       NMF update method:MU/BCD/HALS\n  --verbose VERBOSE\n  --results_path RESULTS_PATH\n                        Path for saving results\n  --checkpoint CHECKPOINT\n                        Enable checkpoint to track the pyNMFk state\n  --timing_stats TIMING_STATS\n                        Switch to turn on/off benchmarking.\n  --prune PRUNE         Prune zero row/column.\n  --precision PRECISION\n                        Precision of the data(float32/float64/float16).\n  --perturbations PERTURBATIONS\n                        perturbation for NMFk\n  --noise_var NOISE_VAR\n                        Noise variance for NMFk\n  --start_k START_K     Start index of K for NMFk\n  --end_k END_K         End index of K for NMFk\n  --step_k STEP_K       step for K search\n  --sill_thr SILL_THR   SIll Threshold for K estimation\n  --sampling SAMPLING   Sampling noise for NMFk i.e uniform/poisson\n```\n\n**Example on running  pyDNMFk using [main.py](main.py):**\n```bash\nmpirun -n 4 python main.py --p_r=4 --p_c=1 --process='pyDNMFk'  --fpath='data/' --ftype='mat' --fname='swim' --init='nnsvd' --itr=5000 --norm='kl' --method='mu' --results_path='results/' --perturbations=20 --noise_var=0.015 --start_k=2 --end_k=5 --sill_thr=.9 --sampling='uniform'\n```\n\n**Example estimation of k using the provided sample dataset:**\n```python\n'''Imports block'''\nimport pyDNMFk.config as config\nconfig.init(0)\nfrom pyDNMFk.pyDNMFk import *\nfrom pyDNMFk.data_io import *\nfrom pyDNMFk.dist_comm import *\nfrom scipy.io import loadmat\nfrom mpi4py import MPI\ncomm = MPI.COMM_WORLD\nargs = parse()  \n\n\n'''parameters initialization block'''\n\n# Data Read here\nargs.fpath = 'data/'\nargs.fname = 'wtsi'  \nargs.ftype = 'mat'\nargs.precision = np.float32\n\n#Distributed Comm config block\np_r, p_c = 4, 1  \n\n#NMF config block\nargs.norm = 'kl'\nargs.method = 'mu'\nargs.init = 'nnsvd'\nargs.itr = 5000\nargs.verbose = True\n\n#Cluster config 
block\nargs.start_k = 2 \nargs.end_k = 5\nargs.sill_thr = 0.9\n\n#Data Write\nargs.results_path = 'results/'\n\n\n'''Parameters prep block'''\ncomms = MPI_comm(comm, p_r, p_c)\ncomm1 = comms.comm\nrank = comm.rank\nsize = comm.size\nargs.size, args.rank, args.comm, args.p_r, args.p_c = size, rank, comms, p_r, p_c\nargs.row_comm, args.col_comm, args.comm1 = comms.cart_1d_row(), comms.cart_1d_column(), comm1\nA_ij = data_read(args).read().astype(args.precision)\n\nnopt = PyNMFk(A_ij, factors=None, params=args).fit()\nprint('Estimated k with NMFk is ',nopt)\n```\n\n**Example of running pyDNMFk to get the W and H matrices:**\n```python\n# Use \"mpirun -n 4 python code.py\" to run this example\nfrom pyDNMFk.runner import pyDNMFk_Runner\nimport numpy as np\n\nrunner = pyDNMFk_Runner(itr=100, init='nnsvd', verbose=True, \n                        norm='fro', method='mu', precision=np.float32,\n                        checkpoint=False, sill_thr=0.6)\n\nresults = runner.run(grid=[4,1], fpath='data/', fname='wtsi', \n                     ftype='mat', results_path='results/',\n                     k_range=[1,3], step_k=1)\n\nW = results[\"W\"]\nH = results[\"H\"]\n```\n\n**See the [examples](examples/) or [tests](tests/) for more use cases.**\n\u003chr/\u003e\n\n## Benchmarking\n\n![plot](./docs/benchmark.png)\nFigure: Scaling benchmarks for 10 iterations of Frobenius-norm-based MU updates with MPI\noperations for i) strong and ii) weak scaling, and communication vs. computation \noperations for iii) strong and iv) weak scaling. 
\n\n## Scalability\n![plot](./docs/scalability.png)\n\n## Authors\n\n* [Manish Bhattarai](mailto:ceodspspectrum@lanl.gov) - Los Alamos National Laboratory\n* [Ben Nebgen](mailto:bnebgen@lanl.gov) - Los Alamos National Laboratory\n* [Erik Skau](mailto:ewskau@lanl.gov) - Los Alamos National Laboratory\n* [Maksim Eren](mailto:maksim@lanl.gov) - Los Alamos National Laboratory\n* [Gopinath Chennupati](mailto:gchennupati@lanl.gov) - Los Alamos National Laboratory\n* [Raviteja Vangara](mailto:rvangara@lanl.gov) - Los Alamos National Laboratory\n* [Hristo Djidjev](mailto:djidjev@lanl.gov) - Los Alamos National Laboratory\n* [John Patchett](mailto:patchett@lanl.gov) - Los Alamos National Laboratory\n* [Jim Ahrens](mailto:ahrens@lanl.gov) - Los Alamos National Laboratory\n* [Boian Alexandrov](mailto:boian@lanl.gov) - Los Alamos National Laboratory\n\n## How to cite pyDNMFk?\n\n```latex\n  @misc{pyDNMFk,\n  author = {Bhattarai, Manish and Nebgen, Ben and Skau, Erik and Eren, Maksim and Chennupati, Gopinath and Vangara, Raviteja and Djidjev, Hristo and Patchett, John and Ahrens, Jim and Alexandrov, Boian},\n  title = {pyDNMFk: Python Distributed Non Negative Matrix Factorization},\n  year = {2021},\n  publisher = {GitHub},\n  journal = {GitHub repository},\n  doi = {10.5281/zenodo.4722448},\n  howpublished = {\\url{https://github.com/lanl/pyDNMFk}}\n}\n\n\n@article{vangara2021finding,\n  title={Finding the Number of Latent Topics With Semantic Non-Negative Matrix Factorization},\n  author={Vangara, Raviteja and Bhattarai, Manish and Skau, Erik and Chennupati, Gopinath and Djidjev, Hristo and Tierney, Tom and Smith, James P and Stanev, Valentin G and Alexandrov, Boian S},\n  journal={IEEE Access},\n  volume={9},\n  pages={117217--117231},\n  year={2021},\n  publisher={IEEE}\n}\n\n @inproceedings{bhattarai2020distributed,\n  title={Distributed Non-Negative Tensor Train Decomposition},\n  author={Bhattarai, Manish and Chennupati, Gopinath and Skau, Erik and Vangara, Raviteja and 
Djidjev, Hristo and Alexandrov, Boian S},\n  booktitle={2020 IEEE High Performance Extreme Computing Conference (HPEC)},\n  pages={1--10},\n  year={2020},\n  organization={IEEE}\n}\n@inproceedings {s.20211055,\nbooktitle = {EuroVis 2021 - Short Papers},\neditor = {Agus, Marco and Garth, Christoph and Kerren, Andreas},\ntitle = {{Selection of Optimal Salient Time Steps by Non-negative Tucker Tensor Decomposition}},\nauthor = {Pulido, Jesus and Patchett, John and Bhattarai, Manish and Alexandrov, Boian and Ahrens, James},\nyear = {2021},\npublisher = {The Eurographics Association},\nISBN = {978-3-03868-143-4},\nDOI = {10.2312/evs.20211055}\n}\n@article{chennupati2020distributed,\n  title={Distributed non-negative matrix factorization with determination of the number of latent features},\n  author={Chennupati, Gopinath and Vangara, Raviteja and Skau, Erik and Djidjev, Hristo and Alexandrov, Boian},\n  journal={The Journal of Supercomputing},\n  pages={1--31},\n  year={2020},\n  publisher={Springer}\n}\n```\n\n## Acknowledgments\nLos Alamos National Lab (LANL), T-1\n\n## Copyright Notice\n\u003e© (or copyright) 2020. Triad National Security, LLC. All rights reserved.\nThis program was produced under U.S. Government contract 89233218CNA000001 for Los Alamos\nNational Laboratory (LANL), which is operated by Triad National Security, LLC for the U.S.\nDepartment of Energy/National Nuclear Security Administration. All rights in the program are\nreserved by Triad National Security, LLC, and the U.S. Department of Energy/National Nuclear\nSecurity Administration. 
The Government is granted for itself and others acting on its behalf a\nnonexclusive, paid-up, irrevocable worldwide license in this material to reproduce, prepare\nderivative works, distribute copies to the public, perform publicly and display publicly, and to permit\nothers to do so.\n\n\n## License\n\nThis program is open source under the BSD-3 License.\nRedistribution and use in source and binary forms, with or without\nmodification, are permitted provided that the following conditions are met:\n\n1. Redistributions of source code must retain the above copyright notice, this\n   list of conditions and the following disclaimer.\n\n2. Redistributions in binary form must reproduce the above copyright notice,\n   this list of conditions and the following disclaimer in the documentation\n   and/or other materials provided with the distribution.\n\n3. Neither the name of the copyright holder nor the names of its\n   contributors may be used to endorse or promote products derived from\n   this software without specific prior written permission.\n\nTHIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS \"AS IS\"\nAND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE\nIMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE\nDISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE\nFOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL\nDAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR\nSERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER\nCAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,\nOR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE\nOF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flanl%2Fpydnmfk","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flanl%2Fpydnmfk","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flanl%2Fpydnmfk/lists"}