https://github.com/dennisfrancis/clusterrows
A LibreOffice Calc extension that clusters the rows in a table and colors them to indicate the clusters.
clustering cpp expectation-maximization gaussian-mixture-models gmm libreoffice-calc python
Last synced: 12 months ago
- Host: GitHub
- URL: https://github.com/dennisfrancis/clusterrows
- Owner: dennisfrancis
- License: gpl-3.0
- Created: 2016-11-27T19:51:44.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2025-01-19T11:40:42.000Z (about 1 year ago)
- Last Synced: 2025-03-27T23:41:51.033Z (about 1 year ago)
- Topics: clustering, cpp, expectation-maximization, gaussian-mixture-models, gmm, libreoffice-calc, python
- Language: C++
- Homepage:
- Size: 2.3 MB
- Stars: 11
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# ClusterRows
ClusterRows is a LibreOffice Calc extension that clusters the rows in a table and colors them to indicate the clusters.
## Install
Use the extension manager of LibreOffice to install the pre-built extension downloaded from the [releases page](https://github.com/dennisfrancis/ClusterRows/releases). Alternatively you can install the downloaded extension from the console as:
```
$ unopkg add /ClusterRows.oxt
```
## What's new?
* Support for full covariance GMM in GMMCluster() formula.
* Support for Windows.
* Platform specific extension packages in the releases page.
* Dialog-based workflow.
* GMMCluster() formula.
* Input range and output location may be selected interactively and edited in the dialog.
## How to use?
1. Click the toolbar item named `Cluster rows`, which is next to the AutoFilter item, or click the `Clustering` menu item under `Data`.
2. A dialog appears in which you can set the input data cell range (with or without a header), the output location, and the clustering parameters. If the input cell range was selected before launching the dialog, the input range and output location fields are pre-filled. By default, the output location is the column next to the last column of the input data cell range. If the first row of the input range contains the column headers, check the `Header in the first row` checkbox. You can also choose whether the data rows are colored according to the cluster assignments.

3. After you press the *Compute* button, two new columns, **ClusterId** and **Confidence**, are written to the specified output location. These columns have headers if the dialog option `Header in the first row` was checked. **ClusterId** gives the cluster to which the row is assigned, and **Confidence** is the algorithm's confidence, on a scale of [0, 1], that this assignment is correct (a higher number means higher confidence). Depending on the choice made in the dialog, the data rows are also colored according to their cluster assignments.
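
To make the two output columns concrete, here is a minimal Python sketch of how a cluster id and a confidence value could be derived for each row. It assumes that the confidence is the largest posterior membership probability of the row, which is a common choice for mixture models but is not confirmed by this README. The function name `assign_clusters` and the sample probabilities are hypothetical.

```python
def assign_clusters(responsibilities):
    """Return a (cluster_id, confidence) pair for every row.

    responsibilities[i][k] is the posterior probability that row i belongs
    to cluster k; each inner list is assumed to sum to 1.
    Assumption: confidence is the largest posterior probability of the row.
    """
    results = []
    for row in responsibilities:
        confidence = max(row)
        cluster_id = row.index(confidence)  # first cluster with that probability
        results.append((cluster_id, confidence))
    return results


# Hypothetical posterior probabilities for three rows and three clusters.
rows = [
    [0.90, 0.05, 0.05],  # clearly cluster 0
    [0.10, 0.60, 0.30],  # cluster 1, with moderate confidence
    [0.34, 0.33, 0.33],  # nearly undecided, so confidence is low
]
assignments = assign_clusters(rows)
for cluster_id, confidence in assignments:
    print(cluster_id, confidence)
```

Under this assumption, a confidence close to `1/numClusters` means the row sits between clusters, while a value close to 1 means the assignment is unambiguous.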
## Advanced usage via `GMMCLUSTER` formula
The dialog-based clustering described above uses an in-house array formula, `GMMCLUSTER`, to compute the clusters and confidence scores. You can therefore use this array formula directly instead of the dialog. Doing so lets you specify the exact data range (without the header) and the placement of the results. In addition, the parameters (including the data range) can be specified programmatically as functions of other cells or ranges. The syntax of `GMMCLUSTER` is:
```
GMMCLUSTER(data, numClusters, numEpochs, numIterations, fullGMM)
```
where **data** is the array (cell range) holding the data, **numClusters** is the desired number of clusters (optional; by default it is estimated automatically), **numEpochs** is the maximum number of epochs to use (optional), **numIterations** is the maximum number of iterations in each epoch (optional), and **fullGMM** specifies whether to fit a full covariance GMM (optional; the default is 0, i.e. FALSE). After entering the formula expression, press `Ctrl+Shift+Enter` instead of just `Enter` to commit it as an array formula.
## Implementation
The project uses an in-house C++ implementation of the full [Expectation Maximization](https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm) algorithm to compute the clusters. In auto mode (when the number of clusters is specified as 0), it chooses the number of clusters using the [Bayesian information criterion](https://en.wikipedia.org/wiki/Bayesian_information_criterion).
The project does not depend on any machine learning libraries, but it uses [Eigen](https://eigen.tuxfamily.org/index.php?title=Main_Page) for its linear algebra capabilities. The full source code of ClusterRows is available under the [GPLv3 license](https://www.gnu.org/licenses/gpl-3.0.en.html).
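
The following Python sketch illustrates the general idea of fitting a Gaussian mixture with Expectation Maximization and picking the number of clusters by the Bayesian information criterion. It is not the extension's C++ code: it is one-dimensional for brevity, and the function names (`fit_gmm_1d`, `bic`, `choose_num_clusters`), the quantile-based initialisation, the variance floor and the synthetic data are all assumptions made for this illustration.

```python
import math
import random


def fit_gmm_1d(xs, k, iterations=100):
    """Fit a k-component 1-D Gaussian mixture with EM.

    Returns (weights, means, variances, log_likelihood), where the
    log-likelihood is the one measured in the last E-step.
    """
    n = len(xs)
    ordered = sorted(xs)
    grand_mean = sum(xs) / n
    grand_var = sum((x - grand_mean) ** 2 for x in xs) / n
    var_floor = 1e-3 * grand_var  # assumption: stops components collapsing onto a point
    # Assumption: deterministic start with means at evenly spaced quantiles.
    means = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    variances = [grand_var] * k
    weights = [1.0 / k] * k
    log_lik = -math.inf
    for _ in range(iterations):
        # E-step: responsibilities, computed in the log domain for stability.
        resp = []
        log_lik = 0.0
        for x in xs:
            logs = [
                math.log(w) - 0.5 * math.log(2 * math.pi * v) - 0.5 * (x - m) ** 2 / v
                for w, m, v in zip(weights, means, variances)
            ]
            top = max(logs)
            log_total = top + math.log(sum(math.exp(value - top) for value in logs))
            log_lik += log_total
            resp.append([math.exp(value - log_total) for value in logs])
        # M-step: re-estimate weights, means and variances from the responsibilities.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, var_floor)
    return weights, means, variances, log_lik


def bic(log_lik, k, n):
    """BIC = p * ln(n) - 2 * ln(L); lower is better."""
    num_params = 3 * k - 1  # k means + k variances + (k - 1) free weights
    return num_params * math.log(n) - 2.0 * log_lik


def choose_num_clusters(xs, candidates=range(1, 6)):
    """Fit one mixture per candidate k and return the k with the lowest BIC."""
    scores = {}
    for k in candidates:
        *_, log_lik = fit_gmm_1d(xs, k)
        scores[k] = bic(log_lik, k, len(xs))
    return min(scores, key=scores.get), scores


# Synthetic data: three well-separated 1-D clusters of 100 points each.
rng = random.Random(0)
data = [rng.gauss(center, 1.0) for center in (0.0, 10.0, 20.0) for _ in range(100)]
best_k, scores = choose_num_clusters(data)
print("selected number of clusters:", best_k)
```

The penalty term `p * ln(n)` grows with the number of parameters, so adding a cluster only pays off when it improves the log-likelihood by more than the extra parameters cost.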
## Planned features
* See TODO.md for short term planned features.
* Add support for other common clustering algorithms.
## Sample usage of the extension
The `testdocs` directory contains a spreadsheet file called `no-results.ods`. That sheet holds a dataset in the range `A1:C301`, synthetically generated from a 3-cluster Gaussian mixture model with diagonal covariance matrices. This toy dataset has two dimensions/variables (columns A and B). Column C holds the ground-truth cluster id of each row. To test the extension, a copy of the dataset (excluding the ground-truth column) is placed at `F1:G301`. Go to the menu `Data > Statistics > Clustering...`, select or enter the input data range `F1:G301`, check the option `Header in the first row`, and click the `Compute` button in the dialog. The extension will compute the 3 clusters and produce the columns `ClusterId` and `Confidence` next to the data. The cell `L9` in that sheet indicates the clustering accuracy, which measures how well the cluster assignments agree with the ground truth in the original dataset. Typically we get around 97% accuracy for this dataset.
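
Because cluster ids are arbitrary labels, comparing them with the ground truth requires matching each computed id to a ground-truth id first. The formula used in cell `L9` is not reproduced here; the Python sketch below shows one common way to compute such an accuracy, by trying every relabelling and keeping the best match. The function name `clustering_accuracy` and the sample labels are hypothetical.

```python
from itertools import permutations


def clustering_accuracy(true_ids, predicted_ids):
    """Fraction of rows that match the ground truth under the best relabelling.

    Every one-to-one mapping from predicted ids to ground-truth ids is tried,
    which is fine for a handful of clusters but grows factorially with more.
    """
    true_labels = sorted(set(true_ids))
    pred_labels = sorted(set(predicted_ids))
    # If there are more predicted clusters than true ones, the extras map to None.
    padding = [None] * max(0, len(pred_labels) - len(true_labels))
    best_hits = 0
    for targets in permutations(true_labels + padding, len(pred_labels)):
        mapping = dict(zip(pred_labels, targets))
        hits = sum(mapping[p] == t for t, p in zip(true_ids, predicted_ids))
        best_hits = max(best_hits, hits)
    return best_hits / len(true_ids)


truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
found = [2, 2, 2, 0, 0, 1, 1, 1, 1]  # same grouping, different ids, one row off
accuracy = clustering_accuracy(truth, found)
print(accuracy)  # 8 of 9 rows agree after relabelling
```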

The document also contains visualizations of the data in the sheet `charts`. The left chart shows the data points colored according to the ground truth clusters of the dataset. The middle chart shows the data points without the cluster information (which is the input to the clustering algorithm). The right chart shows the data colored according to the cluster assignments made by the algorithm.

The `testdocs` directory also contains a spreadsheet called `no-results-full.ods`, which has a similar dataset in the range `A1:C901` generated from a 3-cluster Gaussian mixture model, but with full covariance matrices. A cluster analysis on this dataset without the `Full covariance GMM?` checkbox checked gives poorer results because the algorithm does not learn the off-diagonal elements of the cluster covariance matrices. Doing a full covariance GMM (i.e. with that checkbox checked) produces good results. The far-right chart in the `charts` tab visualizes the clusters found by the algorithm.
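
The effect of the off-diagonal elements can be seen on a single two-dimensional Gaussian. The Python sketch below is an illustration only, not the extension's code: it evaluates the same two points under a covariance matrix with strong correlation and under its diagonal-only counterpart. The function name `log_density_2d` and the numbers are hypothetical.

```python
import math


def log_density_2d(point, mean, cov):
    """Log density of a 2-D Gaussian with covariance [[a, b], [b, d]]."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    dx = point[0] - mean[0]
    dy = point[1] - mean[1]
    # Quadratic form (p - mean)^T cov^-1 (p - mean), using the closed-form 2x2 inverse.
    quad = (d * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    return -math.log(2.0 * math.pi) - 0.5 * math.log(det) - 0.5 * quad


mean = (0.0, 0.0)
full_cov = [[1.0, 0.9], [0.9, 1.0]]  # strongly correlated variables
diag_cov = [[1.0, 0.0], [0.0, 1.0]]  # same variances, off-diagonal terms dropped

along = (2.0, 2.0)    # lies along the direction of correlation
across = (2.0, -2.0)  # lies against it

for name, cov in (("full", full_cov), ("diagonal", diag_cov)):
    print(name, log_density_2d(along, mean, cov), log_density_2d(across, mean, cov))
```

The diagonal model scores both points identically, while the full model treats the point along the correlation direction as far more plausible than the one across it. A diagonal covariance GMM cannot express that difference, which is consistent with the poorer results described above.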

## Building the extension from source
Building natively is known to work only on GNU/Linux systems. Builds for Windows and macOS can be done using the CMake toolchain files provided with the project.
1. Install `cmake` and optionally `ccmake`. In Ubuntu 20.04+ `sudo apt install cmake cmake-curses-gui` will install them.
2. Install LibreOffice SDK. For instance in Ubuntu 20.04+ `sudo apt install libreoffice-dev` will install it.
3. In the terminal go to `/build/linux/`
   * Run `cmake ../../ -DCMAKE_BUILD_TYPE=RelWithDebInfo`
   * Run `make ClusterRows`
   * Alternatively run `make deploy` to build and deploy this extension to the default LibreOffice installation.
   * Or run `make deployrun` to build, deploy and start LibreOffice Calc with a test document.
   * See the debug logs using `make showlogs`
The built extension is placed in `/extension`. When building for Linux, the file is named `ClusterRows-Linux.oxt` and can be installed manually by invoking `unopkg add `.
If you get errors when running any of these commands, or if you want to report a bug, please open an issue here.
As always pull-requests are welcome. Happy hacking!