{"id":13906080,"url":"https://github.com/peterwittek/somoclu","last_synced_at":"2026-02-18T22:02:20.786Z","repository":{"id":6402441,"uuid":"7640569","full_name":"peterwittek/somoclu","owner":"peterwittek","description":"Massively parallel self-organizing maps: accelerate training on multicore CPUs, GPUs, and clusters","archived":false,"fork":false,"pushed_at":"2024-01-18T11:58:52.000Z","size":4012,"stargazers_count":276,"open_issues_count":37,"forks_count":72,"subscribers_count":24,"default_branch":"master","last_synced_at":"2025-12-19T20:31:12.128Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://peterwittek.github.io/somoclu/","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/peterwittek.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGES","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2013-01-16T06:33:16.000Z","updated_at":"2025-11-06T21:37:53.000Z","dependencies_parsed_at":"2024-11-25T22:30:36.587Z","dependency_job_id":null,"html_url":"https://github.com/peterwittek/somoclu","commit_stats":{"total_commits":577,"total_committers":21,"mean_commits":"27.476190476190474","dds":0.6221837088388216,"last_synced_commit":"e6c57a23a321552cea88b7888c9b0369146e8f14"},"previous_names":[],"tags_count":22,"template":false,"template_full_name":null,"purl":"pkg:github/peterwittek/somoclu","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/peterwittek%2Fsomoclu","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/peterwittek%2Fsomoclu/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repos
itories/peterwittek%2Fsomoclu/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/peterwittek%2Fsomoclu/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/peterwittek","download_url":"https://codeload.github.com/peterwittek/somoclu/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/peterwittek%2Fsomoclu/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29596329,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-18T20:59:56.587Z","status":"ssl_error","status_checked_at":"2026-02-18T20:58:41.434Z","response_time":162,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-06T23:01:29.023Z","updated_at":"2026-02-18T22:02:20.768Z","avatar_url":"https://github.com/peterwittek.png","language":"C","funding_links":[],"categories":["分布式机器学习","Python"],"sub_categories":["General-Purpose Machine Learning"],"readme":"# Somoclu\n[![Ubuntu2004](https://github.com/peterwittek/somoclu/actions/workflows/u2004.yml/badge.svg)](https://github.com/peterwittek/somoclu/actions/workflows/u2004.yml)\n[![Ubuntu2004-R](https://github.com/peterwittek/somoclu/actions/workflows/u2004-r.yml/badge.svg)](https://github.com/peterwittek/somoclu/actions/workflows/u2004-r.yml)\n\nSomoclu is a 
massively parallel implementation of self-organizing maps. It exploits multicore CPUs, it is able to rely on MPI for distributing the workload in a cluster, and it can be accelerated by CUDA. A sparse kernel is also included, which is useful for training maps on vector spaces generated in text mining processes.\n\nKey features:\n\n* Fast execution by parallelization: OpenMP, MPI, and CUDA are supported.\n* Multi-platform: Linux, macOS, and Windows are supported.\n* Planar and toroid maps.\n* Rectangular and hexagonal grids.\n* Gaussian and bubble neighborhood functions.\n* Both dense and sparse input data are supported.\n* Large maps of several hundred thousand neurons are feasible.\n* Integration with [Databionic ESOM Tools](http://databionic-esom.sourceforge.net/).\n* [Python](https://somoclu.readthedocs.io/), [R](https://cran.r-project.org/web/packages/Rsomoclu/), [Julia](https://github.com/peterwittek/Somoclu.jl), and [MATLAB](https://github.com/peterwittek/somoclu/tree/master/src/MATLAB) interfaces for the dense CPU and GPU kernels.\n\nFor more information, refer to the manuscript about the library [1].\n\n# Usage\n\n## Basic Command Line Use\n\nSomoclu takes a plain text input file -- either dense or sparse data. 
Example files are included.\n\n    $ [mpirun -np NPROC] somoclu [OPTIONs] INPUT_FILE OUTPUT_PREFIX\n\nArguments:\n\n    -c FILENAME              Specify an initial codebook for the map.\n    -d NUMBER                Coefficient in the Gaussian neighborhood function\n                             exp(-||x-y||^2/(2*(coeff*radius)^2)) (default: 0.5)\n    -e NUMBER                Maximum number of epochs\n    -g TYPE                  Grid type: square or hexagonal (default: square)\n    -h, --help               This help text\n    -k NUMBER                Kernel type\n                                0: Dense CPU\n                                1: Dense GPU\n                                2: Sparse CPU\n    -l NUMBER                Starting learning rate (default: 0.1)\n    -L NUMBER                Finishing learning rate (default: 0.01)\n    -m TYPE                  Map type: planar or toroid (default: planar)\n    -n FUNCTION              Neighborhood function (bubble or gaussian, default: gaussian)\n    -p NUMBER                Compact support for Gaussian neighborhood\n                             (0: false, 1: true, default: 0)\n    -r NUMBER                Start radius (default: half of the map in direction min(x,y))\n    -R NUMBER                End radius (default: 1)\n    -s NUMBER                Save interim files (default: 0):\n                                0: Do not save interim files\n                                1: Save U-matrix only\n                                2: Also save codebook and best matching\n    -t STRATEGY              Radius cooling strategy: linear or exponential (default: linear)\n    -T STRATEGY              Learning rate cooling strategy: linear or exponential (default: linear)\n    -v NUMBER                Verbosity level, 0-2 (default: 0)\n    -x, --columns NUMBER     Number of columns in map (size of SOM in direction x)\n    -y, --rows    NUMBER     Number of rows in map (size of SOM in direction y)\n\nExamples:\n\n    $ 
somoclu data/rgbs.txt data/rgbs\n    $ mpirun -np 4 somoclu -k 0 --rows 20 --columns 20 data/rgbs.txt data/rgbs\n\nWith random initialization, the initial codebook will be filled with random numbers ranging from 0 to 1. Either supply your own initial codebook or normalize your data to fall in this range.\n\nIf the range of the values of the features includes negative numbers, the codebook will eventually adjust. It is, however, not advised to have negative values, especially if the codebook is initialized from 0 to 1. This comes from the batch training nature of the parallel implementation. The batch update rule will change the codebook values with weighted averages of the data points, and with negative values, the updates can cancel out.\n\nThe maps generated by the GPU and the CPU kernels are likely to be different. For computational efficiency, Somoclu uses single-precision floats. This occasionally results in identical distances between a data instance and the neurons. The CPU version will pick the best matching unit with the lowest coordinate values. Such sequentiality cannot be guaranteed in the reduction kernel of the GPU variant. This is not a bug, but it is better to be aware of it.\n\n## Efficient Parallel and Distributed Execution\n\nThe CPU kernels use OpenMP to load multicore processors. On a single node, this is more efficient than launching tasks with MPI to match the number of cores. 
The MPI tasks replicate the codebook, which is especially inefficient for large maps.\n\nFor instance, given a single node with eight cores, the following execution will use 1/8th of the memory and run 10-20% faster than launching eight MPI tasks:\n\n    $ somoclu -x 200 -y 200 data/rgbs.txt data/rgbs\n\nOr, equivalently:\n\n    $ OMP_NUM_THREADS=8 somoclu -x 200 -y 200 data/rgbs.txt data/rgbs\n\nAvoid the following on a single node:\n\n    $ OMP_NUM_THREADS=1 mpirun -np 8 somoclu -x 200 -y 200 data/rgbs.txt data/rgbs\n\nThe same caveats apply to the sparse CPU kernel.\n\n## Visualisation\n\nThe primary purpose of generating a map is visualisation. Apart from the Python interface, Somoclu does not come with its own functions for visualisation, since there are numerous generic tools that are capable of plotting high-quality figures. The R version integrates with [kohonen](https://cran.r-project.org/package=kohonen) and the MATLAB version with [somtoolbox](http://www.cis.hut.fi/somtoolbox/).\n\nThe U-matrix and codebook output formats of the command-line version are compatible with [Databionic ESOM Tools](http://databionic-esom.sourceforge.net/) for more advanced visualisation.\n\n# Input File Formats\n\nOne sparse and two dense data formats are supported. All of them are plain text files. The entries can be separated by any white-space character. One row represents one data instance across all formats. Comment lines starting with a hash mark are ignored.\n\nThe sparse format follows the [libsvm](http://www.csie.ntu.edu.tw/~cjlin/libsvm/) guidelines. The first feature is zero-indexed. For instance, the vector [ 1.2 0 0 3.4] is represented as the following line in the file:\n\n    0:1.2 3:3.4\n\nThe file is parsed twice: once to get the number of instances and features, and the second time to read the data in the individual threads.\n\nThe basic dense format includes the coordinates of the data vectors, separated by white space. 
Just like the sparse format, this file is parsed twice to get the basic dimensions right.\n\nThe .lrn file of [Databionic ESOM Tools](http://databionic-esom.sourceforge.net/) is also accepted and it is parsed only once. The format is described as follows:\n\n% n\n\n% m\n\n% s1 s2 .. sm\n\n% var_name1 var_name2 .. var_namem\n\nx11 x12 .. x1m\n\nx21 x22 .. x2m\n\n. . . .\n\n. . . .\n\nxn1 xn2 .. xnm\n\nHere n is the number of rows in the file, that is, the number of data instances. Parameter m defines the number of columns in the file. The next row defines the column mask: the value 1 for a column means the column should be used in the training. Note that the first column in this format is always a unique key, so this should have the value 9 in the column mask. The row with the variable names is ignored by Somoclu. The elements of the matrix follow -- from here, the file is identical to the basic dense format, with the addition of the first column as the unique key.\n\nIf the input file is sparse but a dense kernel is invoked, Somoclu will execute, but the results will be incorrect. Invoking a sparse kernel on a dense input file is likely to lead to a segmentation fault.\n\n# Interfaces\n\n[Python](https://somoclu.readthedocs.io/), [Julia](https://github.com/peterwittek/Somoclu.jl), [R](https://cran.r-project.org/web/packages/Rsomoclu/), and [MATLAB](https://github.com/peterwittek/somoclu/tree/master/src/MATLAB) interfaces are available for the dense CPU and GPU kernels. MPI and the sparse kernel are not supported through the interfaces. For respective examples, see the folders in src.\n\nThe Python version is also available in [PyPI](https://pypi.python.org/pypi/somoclu). 
You can install it with\n\n    $ pip install somoclu\n\nAlternatively, it is also available on [conda-forge](https://github.com/conda-forge/somoclu-feedstock):\n\n    $ conda install somoclu\n\nSome pre-built binaries in the wheel format or as a Windows installer are provided at [PyPI Downloads](https://pypi.python.org/pypi/somoclu#downloads); they are tested with [Anaconda](https://www.continuum.io/downloads) distributions. If you encounter errors like `ImportError: DLL load failed: The specified module could not be found` when running `import somoclu`, you may need to use [Dependency Walker](http://www.dependencywalker.com/) as shown [here](http://stackoverflow.com/a/24704384/1136027) on `_somoclu_wrap.pyd` to find the missing DLLs and place them in the right place. Usually the right version (32/64bit) of `vcomp90.dll, msvcp90.dll, msvcr90.dll` should be put in `C:\\Windows\\System32` or `C:\\Windows\\SysWOW64`.\n\nThe wheel binaries for macOS are compiled with the system `clang++`, which means that by default they are not parallelized. To use the parallel version on Mac, you can either use the version in [conda-forge](https://github.com/conda-forge/somoclu-feedstock) or compile it from source with your favourite OpenMP-friendly compiler. To get it working with the GPU kernel, you might have to follow the instructions at the [Somoclu - Python Interface](https://github.com/peterwittek/somoclu/tree/master/src/Python).\n\nThe R version is available on CRAN. You can install it with\n\n    install.packages(\"Rsomoclu\")\n\nTo get it working with the GPU kernel, download the source zip file and specify your CUDA directory the following way:\n\n    R CMD INSTALL src/Rsomoclu_version.tar.gz --configure-args=/path/to/cuda\n\nThe Julia version is available on [GitHub](https://github.com/peterwittek/Somoclu.jl). 
The standard `Pkg.add(\"Somoclu\")` should work.\n\nFor using the MATLAB toolbox, install SOM-Toolbox following the instructions at [ilarinieminen/SOM-Toolbox](https://github.com/ilarinieminen/SOM-Toolbox) and define the location of your MATLAB install to the configure script:\n\n    ./configure --without-mpi --with-matlab=/usr/local/MATLAB/R2014a\n\nFor the GPU kernel, specify the location of your CUDA library for the configure script. More detailed instructions are in the [MATLAB source folder](https://github.com/peterwittek/somoclu/tree/master/src/MATLAB).\n\n# Compilation \u0026 Installation\n\nThese are the instructions for compiling the core library and the command line interface. The only dependency is a C++ compiler chain -- GCC, ICC, clang, and VC were tested.\n\nMulticore execution is supported through OpenMP -- the compiler must support this. Distributed systems are supported through MPI. The package was tested with OpenMPI. It should also work with other MPI flavours. CUDA support is optional.\n\n## Linux or macOS\n\nIf you have just cloned the git repository first run\n\n    $ ./autogen.sh\n\nThen follow the standard POSIX procedure:\n\n    $ ./configure [options]\n    $ make\n    $ make install\n\nOptions for configure\n\n    --prefix=PATH           Set directory prefix for installation\n\nBy default Somoclu is installed into /usr/local. 
If you prefer a\ndifferent location, use this option to select an installation\ndirectory.\n\n    --without-mpi           Disregard any MPI installation found.\n    --with-mpi=MPIROOT      Use MPI root directory.\n    --with-mpi-compilers=DIR or --with-mpi-compilers=yes\n                              use MPI compiler (mpicxx) found in directory DIR, or\n                              in your PATH if =yes\n    --with-mpi-libs=\"LIBS\"  MPI libraries [default \"-lmpi\"]\n    --with-mpi-incdir=DIR   MPI include directory [default MPIROOT/include]\n    --with-mpi-libdir=DIR   MPI library directory [default MPIROOT/lib]\n\nThe flags above select which MPI library to use. They are especially useful if MPI is installed in a non-standard location, or when multiple MPI libraries are available.\n\n    --with-cuda=/path/to/cuda           Set path for CUDA\n\nSomoclu looks for CUDA in /usr/local/cuda. If your installation is not there, then specify the path with this parameter. If you do not want CUDA enabled, pass `--without-cuda` instead.\n\n## Windows\n\nUse the `somoclu.sln` under `src/Windows/somoclu` as an example Visual Studio 2015 solution. Modify the CUDA version or VC compiler version according to your needs.\n\nThe default solution enables all of OpenMP, MPI, and CUDA. The default MPI installation path is `C:\\Program Files (x86)\\Microsoft SDKs\\MPI\\`; modify the settings if yours is in a different path. The default CUDA version in the configuration is 9.1. Disable MPI by removing the `HAVE_MPI` macro in the project properties (`Properties -\u003e Configuration Properties -\u003e C/C++ -\u003e Preprocessor`). Disable CUDA by removing the `CUDA` macro in the solution properties and unchecking CUDA in `Project -\u003e Custom Build Rules`. 
If you open the solution without CUDA installed, please remove the following sections in `somoclu.vcxproj`:\n\n```\n  \u003cImportGroup Label=\"ExtensionSettings\"\u003e\n    \u003cImport Project=\"$(VCTargetsPath)\\BuildCustomizations\\CUDA 9.1.props\" /\u003e\n  \u003c/ImportGroup\u003e\n```\n\nand\n\n```\n  \u003cImportGroup Label=\"ExtensionTargets\"\u003e\n    \u003cImport Project=\"$(VCTargetsPath)\\BuildCustomizations\\CUDA 9.1.targets\" /\u003e\n  \u003c/ImportGroup\u003e\n```\n\nor change the version number according to which you installed.\n\nThe usage is identical to the Linux version through command line (see the relevant section).\n\n# Acknowledgment\n\nThis work was supported by the European Commission Seventh Framework Programme under Grant Agreement Number FP7-601138 PERICLES and by the AWS in Education Machine Learning Grant award.\n\n# Citation\n\n1. Peter Wittek, Shi Chao Gao, Ik Soo Lim, Li Zhao (2017). Somoclu: An Efficient Parallel Library for Self-Organizing Maps. Journal of Statistical Software, 78(9), pp.1--21. DOI:[10.18637/jss.v078.i09](https://doi.org/10.18637/jss.v078.i09).\n   arXiv:[1305.1422](https://arxiv.org/abs/1305.1422).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpeterwittek%2Fsomoclu","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpeterwittek%2Fsomoclu","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpeterwittek%2Fsomoclu/lists"}