{"id":15893335,"url":"https://github.com/nunofachada/amvidc","last_synced_at":"2025-03-20T12:34:22.274Z","repository":{"id":11368142,"uuid":"13804086","full_name":"nunofachada/amvidc","owner":"nunofachada","description":"Data clustering algorithm based on agglomerative hierarchical clustering (AHC) which uses minimum volume increase (MVI) and minimum direction change (MDC) clustering criteria.","archived":false,"fork":false,"pushed_at":"2016-01-12T17:19:43.000Z","size":134,"stargazers_count":8,"open_issues_count":0,"forks_count":4,"subscribers_count":1,"default_branch":"master","last_synced_at":"2024-10-07T08:09:53.690Z","etag":null,"topics":["algorithm","cluster-analysis","clustering","clustering-algorithm","clustering-criteria","convex-hull","convexhull","data-clustering-algorithm","fscore","matlab","matlab-toolbox","minimum-direction-change","minimum-volume-increase","pddp","principal-components","volume"],"latest_commit_sha":null,"homepage":"","language":"Matlab","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-2-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nunofachada.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"license.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2013-10-23T13:53:51.000Z","updated_at":"2024-09-14T02:37:15.000Z","dependencies_parsed_at":"2022-09-21T01:42:21.575Z","dependency_job_id":null,"html_url":"https://github.com/nunofachada/amvidc","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nunofachada%2Famvidc","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nunofachada%2Famvidc/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/Git
Hub/repositories/nunofachada%2Famvidc/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nunofachada%2Famvidc/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nunofachada","download_url":"https://codeload.github.com/nunofachada/amvidc/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":221766823,"owners_count":16877360,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["algorithm","cluster-analysis","clustering","clustering-algorithm","clustering-criteria","convex-hull","convexhull","data-clustering-algorithm","fscore","matlab","matlab-toolbox","minimum-direction-change","minimum-volume-increase","pddp","principal-components","volume"],"created_at":"2024-10-06T08:09:57.954Z","updated_at":"2024-10-28T02:22:04.312Z","avatar_url":"https://github.com/nunofachada.png","language":"Matlab","funding_links":[],"categories":[],"sub_categories":[],"readme":"# User Manual\n\n## Introduction\n\nAMVIDC is a data clustering algorithm based on agglomerative \nhierarchical clustering (AHC) which uses minimum volume increase (MVI) \nand minimum direction change (MDC) as clustering criteria. The\nalgorithm is presented in detail in the following publication:\n\n-   Fachada, N., Figueiredo, M.A.T., Lopes, V.V., Martins, R.C., Rosa, \nA.C., [Spectrometric differentiation of yeast strains using minimum volume \nincrease and minimum direction change clustering criteria](http://www.sciencedirect.com/science/article/pii/S0167865514000889),\nPattern Recognition Letters, vol. 
45, pp. 55-61 (2014), doi: http://dx.doi.org/10.1016/j.patrec.2014.03.008\n\n### Data format\n\nData for clustering is presented as a set of samples (or points), each \nwith a constant number of dimensions. As such, for the rest of this \nguide, data matrices are considered to be in the following format:\n\n-   *m* x *n*, with *m* samples (points) and *n* dimensions (variables)\n\n### Generating data\n\nAMVIDC was inspired by the differentiation of spectrometric data. \nHowever, to further validate the clustering algorithms, synthetic\ndata sets can be generated with [generateData](https://github.com/FakenMC/generateData), \nwhich generates data in the *m* x *n* format, with *m* samples (points) and \n*n* dimensions (variables) according to a set of parameters.\n\n## Running the algorithm\n\nAMVIDC is implemented in the [clusterdata_amvidc](clusterdata_amvidc.m) \nfunction:\n\n    idx = clusterdata_amvidc(X, k, idx_init);\n\nwhere **X**, **k** and **idx\\_init** are the data matrix, maximum number \nof clusters and initial clustering, respectively. Initial clustering \nis required so that all possible new clusters have volume, a requirement \nfor MVI. 
The [clusterdata_amvidc](clusterdata_amvidc.m) function has many \noptional parameters, with reasonable defaults, as specified in the \nfollowing table:\n\n  Parameter    | Default                |  Options/Description\n  ------------ | ---------------------- | ------------------------------------------------------------------------------------------------------\n  *volume*     | ‘convhull’             |  Volume type: ‘ellipsoid’ or ‘convhull’\n  *tol*        | 0.01                   |  Tolerance for minimum volume ellipse calculation (‘ellipsoid’ volume only)\n  *dirweight*  | 0                      |  Direction weight in last iteration (0 means MDC linkage is ignored)\n  *dirpower*   | 2                      |  Convergence power for dirweight (higher values make convergence steeper and concentrated toward the end)\n  *dirtype*    | ‘svd’                  |  Direction type: ‘pca’, ‘svd’\n  *nvi*        | true                   |  Allow negative volume increase?\n  *loglevel*   | 3 (show warnings only) |  Log level: 0 (show all messages) to 4 (only show critical errors), default is 3 (show warnings)\n\nFor example, to perform clustering using ellipsoid volume taking into\naccount direction change, where cluster direction is determined using\nPCA, one would do:\n\n    idx = clusterdata_amvidc(X, k, idx_init, 'volume', 'ellipsoid', 'dirweight', 0.5, 'dirpower', 4, 'dirtype', 'pca');\n\nAs specified, the [clusterdata_amvidc](clusterdata_amvidc.m) function \nrequires initial clusters which, if joined, produce new clusters with \nvolume. Two functions are included for this purpose (however, others can be \nused):\n\n-   [initClust](initClust.m) - Performs very simple initial clustering based\n    on AHC with single linkage (nearest neighbor) and user-defined\n    distance. Each sample is assigned to the same cluster as its\n    nearest point. 
Allows defining a minimum size for each cluster,\n    distance type (as supported by Matlab `pdist`) and the number of\n    clusters which are allowed to be smaller than the minimum size.\n-   [pddp](pddp.m) - Performs PDDP (principal direction divisive\n    partitioning) on input data. This implementation always selects the\n    largest cluster for division, with the algorithm proceeding while\n    the division of a cluster yields sub-clusters which can have a\n    volume.\n\n## Analysis of results\n\n### F-score\n\nThe [F-score](http://en.wikipedia.org/wiki/F1_score) measure is used \nto evaluate clustering results. The measure is implemented in the \n[fscore](fscore.m) function. To run this function:\n\n    eval = fscore(idx, numclasses, numclassmembers);\n\nwhere:\n\n-   **idx** - *m* x *1* vector containing the cluster indices of each\n    point (as returned by the clustering functions)\n-   **numclasses** - Correct number of clusters\n-   **numclassmembers** - Vector with the correct size of each cluster\n    (or a scalar if all clusters are of the same size)\n\nThe [fscore](fscore.m) function returns:\n\n-   **eval** - Value between 0 (worst case) and 1 (perfect clustering)\n\n### Plotting clusters\n\nVisualizing how an algorithm grouped clusters can provide important \ninsight into its effectiveness. Also, it may be important to visually \ncompare an algorithm’s clustering result with the correct result. The \n[plotClusters](plotClusters.m) function can show two clustering results in the same \nimage (e.g. the correct one and one returned by an algorithm). 
The \n[plotClusters](plotClusters.m) function can be executed in the following way:\n\n    h_out = plotClusters(X, dims, idx_marker, idx_encircle, encircle_method, h_in);\n\nwhere:\n\n-   **X** - Data matrix, *m* x *n*, with m samples (points) and n\n    dimensions (variables)\n-   **dims** - Number of dimensions (2 or 3)\n-   **idx_marker** - Clustering result^ to be shown directly in\n    points using markers\n-   **idx_encircle** - Clustering result^ to be shown using\n    encirclement/grouping of points\n-   **encircle_method** - How to encircle the **idx_encircle**\n    result: ‘convhull’ (default), ‘ellipsoid’ or ‘none’\n-   **h_in** - (Optional) Existing figure handle in which to create\n    the plot\n\n^ *m* x *1* vector containing the cluster indices of each point\n\nThe [plotClusters](plotClusters.m) function returns:\n\n-   **h_out** - Figure handle of plot\n\n## Example\n\nIn this example, we demonstrate how to test AMVIDC using \n[Fisher's iris data](http://en.wikipedia.org/wiki/Iris_flower_data_set), \nwhich is included in the MATLAB Statistics Toolbox. We chose this data set\nas it is readily available, not necessarily because AMVIDC is the most\nappropriate algorithm to apply in this case. First, we load the data:\n\n    \u003e\u003e load fisheriris\n\nThe data set consists of 150 samples, 50 samples for each of three \nspecies of the Iris flower. Four features (variables) were measured per \nsample. The data itself loads into the `meas` variable, while the\nspecies with which each sample is associated is given in the `species`\nvariable. The samples are ordered by species, so the first 50 samples\nbelong to one species, and so on. 
First, we test the \n[k-Means](http://en.wikipedia.org/wiki/K-means_clustering) algorithm,\nspecifying three clusters, one per species:\n\n    \u003e\u003e idx_km = kmeans(meas, 3);\n\nWe can evaluate the performance of k-Means using the [fscore](fscore.m)\nfunction (the value of 1 being perfect clustering):\n\n```\n\u003e\u003e fscore(idx_km, 3, [50, 50, 50])\n\nans =\n\n    0.8918\n```\n\nVisual observation can be accomplished with the [plotClusters](plotClusters.m) \nfunction. First, [PCA](http://en.wikipedia.org/wiki/Principal_component_analysis)\nis applied on the data, yielding its principal components (i.e., the \ncomponents which have the largest possible variance). The first two \ncomponents (the two directions of highest variance) are useful for \nvisually discriminating the data in 2D, even though k-Means was \nperformed on the four dimensions of the data. \n\n    \u003e\u003e [~, iris_pca] = princomp(meas);\n\nWe can now plot the data:\n\n    \u003e\u003e plotClusters(iris_pca, 2, [50,50,50], idx_km);\n    \u003e\u003e legend(unique(species), 'Location','Best')\n\n![k-Means clustering of the Iris data set](images/kmeans.png \"k-Means clustering of the Iris data set\")\n\nAMVIDC is a computationally expensive algorithm, so it is preferable to\napply it on a reduced number of dimensions. The following command applies \nAMVIDC clustering to the first two principal components of the data set,\nusing [pddp](pddp.m) for the initial clustering, ellipsoid volume \nminimization and direction change minimization:\n\n    \u003e\u003e idx_amvidc = clusterdata_amvidc(iris_pca(:, 1:2), 3, pddp(iris_pca(:, 1:2)), 'dirweight', 0.6, 'dirpower', 8, 'volume', 'ellipsoid');\n\nThe [fscore](fscore.m) evaluation is obtained as follows:\n\n```\n\u003e\u003e fscore(idx_amvidc, 3, [50, 50, 50])\n\nans =\n\n    0.9599\n```\n\nThis is slightly better than the k-Means result. 
Visual inspection also provides\ngood insight into the clustering result:\n\n    \u003e\u003e plotClusters(iris_pca, 2, [50,50,50], idx_amvidc, 'ellipsoid');\n    \u003e\u003e legend(unique(species), 'Location','Best');\n\n![AMVIDC clustering of the Iris data set](images/amvidc.png \"AMVIDC clustering of the Iris data set\")\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnunofachada%2Famvidc","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnunofachada%2Famvidc","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnunofachada%2Famvidc/lists"}