https://github.com/nunofachada/generatedata
Generates 2D data clusters
- Host: GitHub
- URL: https://github.com/nunofachada/generatedata
- Owner: nunofachada
- License: mit
- Created: 2014-04-28T23:05:22.000Z (over 11 years ago)
- Default Branch: master
- Last Pushed: 2023-01-26T17:48:21.000Z (over 2 years ago)
- Last Synced: 2025-04-02T14:22:06.650Z (6 months ago)
- Topics: axis, center, clustering, dataset, dataset-generation, datasets, distance, matlab, octave, octave-functions, octave-scripts, slope, totalpoints
- Language: MATLAB
- Homepage:
- Size: 35.2 KB
- Stars: 2
- Watchers: 1
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
[Releases](https://github.com/fakenmc/generateData/releases)
[MIT License](https://opensource.org/licenses/MIT/)
[MATLAB Central File Exchange](https://www.mathworks.com/matlabcentral/fileexchange/37435-generate-data-for-clustering)

# generateData
## Summary
A MATLAB/Octave function which generates 2D data clusters. Data is
created along straight lines, which can be more or less parallel
depending on the selected input parameters.

## Synopsis
```MATLAB
[data, clustPoints, idx, centers, angles, lengths] = ...
    generateData(angleMean, angleStd, numClusts, xClustAvgSep, yClustAvgSep, ...
                 lengthMean, lengthStd, lateralStd, totalPoints, ...)
```

## Input parameters
### Required parameters
Parameter | Description
-------------- | -----------
`angleMean` | Mean angle in radians of the lines on which clusters are based. Angles are drawn from the normal distribution.
`angleStd` | Standard deviation of line angles.
`numClusts` | Number of clusters (and therefore of lines) to generate.
`xClustAvgSep` | Average separation of line centers along the X axis.
`yClustAvgSep` | Average separation of line centers along the Y axis.
`lengthMean` | Mean length of the lines on which clusters are based. Line lengths are drawn from the folded normal distribution.
`lengthStd` | Standard deviation of line lengths.
`lateralStd` | Cluster "fatness", i.e., the standard deviation of the distance from each point to its projection on the line. The way this distance is obtained is controlled by the optional `'pointOffset'` parameter.
`totalPoints` | Total points in generated data. These will be randomly divided between clusters using the half-normal distribution with unit standard deviation.

### Optional named parameters
Parameter name | Parameter values | Default value | Description
-------------- | ---------------------------------- | ------------- | -----------
`allowEmpty` | `true`, `false` | `false` | Allow empty clusters?
`pointDist` | `'unif'`, `'norm'` | `'unif'` | Specifies the distribution of points along lines, with two possible values: 1) `'unif'` distributes points uniformly along lines; or, 2) `'norm'` distributes points along lines using a normal distribution (line center is the mean and the line length is equal to 3 standard deviations).
`pointOffset` | `'1D'`, `'2D'` | `'2D'` | Controls how points are created from their projections on the lines, with two possible values: 1) `'1D'` places points on a second line perpendicular to the cluster line using a normal distribution centered at their intersection; or, 2) `'2D'` places points using a bivariate normal distribution centered at the point projection.

## Return values
Value | Description
------------- | --------------------------------------------------------------------------------------
`data` | Matrix (`totalPoints` x *2*) with the generated data.
`clustPoints` | Vector (`numClusts` x *1*) containing number of points in each cluster.
`idx` | Vector (`totalPoints` x *1*) containing the cluster indices of each point.
`centers` | Matrix (`numClusts` x *2*) containing line centers from where clusters were generated.
`angles` | Vector (`numClusts` x *1*) containing the effective angles of the lines used to generate clusters.
`lengths` | Vector (`numClusts` x *1*) containing the effective lengths of the lines used to generate clusters.

## Usage examples
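As a first illustration, the `centers`, `angles` and `lengths` return values can be combined to recover the endpoints of each supporting line, for example to overlay the lines on a plot of the data. The sketch below is not part of the original function: it uses hypothetical stand-in values instead of real `generateData` output, and it assumes that each line is centered on its row of `centers`, that `lengths` holds full line lengths, and that `angles` are measured counter-clockwise from the *x* axis.

```MATLAB
% Stand-in values shaped like the documented outputs (hypothetical numbers;
% in practice these would come from a call to generateData).
centers = [0 0; 10 5];   % numClusts x 2 line centers
angles  = [0; pi / 2];   % numClusts x 1 line angles, in radians
lengths = [4; 6];        % numClusts x 1 line lengths

% Half-length offset of each line along its own direction
% (assumption: angles are measured from the x axis).
dx = (lengths / 2) .* cos(angles);
dy = (lengths / 2) .* sin(angles);

% Endpoints of each line: one row per cluster, columns are [x y].
p1 = centers - [dx dy];
p2 = centers + [dx dy];

% Each segment should be as long as the corresponding entry of lengths.
segLengths = sqrt(sum((p2 - p1) .^ 2, 2));
```

With real output, calling `hold on` followed by `plot([p1(:, 1) p2(:, 1)]', [p1(:, 2) p2(:, 2)]', 'k-')` should draw one segment per cluster on top of an existing plot of the points.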
### Basic usage
```MATLAB
[data, cp, idx] = generateData(pi / 2, pi / 8, 5, 15, 15, 5, 1, 2, 200);
```

The previous command creates 5 clusters with a total of 200 points, with
a mean angle of π/2 (*std*=π/8), separated on average by 15 units in both
*x* and *y* directions, with mean length of 5 units (*std*=1) and a
"fatness" or spread of 2 units.

The following command plots the generated clusters:
```MATLAB
scatter(data(:, 1), data(:, 2), 8, idx);
```

### Using optional parameters
The following command generates 7 clusters with a total of 100 000 points.
Optional parameters are used to override the defaults.

```MATLAB
[data, cp, idx] = generateData(0, pi / 16, 7, 25, 25, 25, 5, 1, 100000, ...
    'pointDist', 'norm', 'pointOffset', '1D', 'allowEmpty', true);
```

The generated clusters can be visualized with the same `scatter` command used
in the previous example.

### Reproducible cluster generation
To make cluster generation reproducible, set the random number generator seed
to a specific value (e.g. 123) before generating the data:

```MATLAB
rng(123);
```

For GNU Octave, use the following instructions instead:
```MATLAB
rand("state", 123);
randn("state", 123);
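
% Sketch (an addition, not from the original README): a portable variant that
% picks the seeding calls at run time, for scripts meant to run unchanged in
% both environments. It relies on OCTAVE_VERSION being a builtin that exists
% only in GNU Octave, so exist() returns 0 for it in MATLAB.
if exist('OCTAVE_VERSION', 'builtin') ~= 0
    rand('state', 123);
    randn('state', 123);
else
    rng(123);
end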
```

## Previous behaviors and reproducibility of results
Before [v2.0.0](https://github.com/fakenmc/generateData/tree/v2.0.0), lines
supporting clusters were parameterized with slopes instead of angles. We found
this caused difficulties when choosing line orientation, thus the change to
angles, which are much easier to work with.
Version [v1.3.0](https://github.com/fakenmc/generateData/tree/v1.3.0) still
uses slopes, for those who prefer this behavior.

For reproducing results in studies published before May 2020, use version
[v1.2.0](https://github.com/fakenmc/generateData/tree/v1.2.0) instead.
Subsequent versions were optimized in a way that changed the order in which
the required random values are generated, thus producing slightly different
results.

## Reference
If you use this function in your work, please cite the following reference:
- Fachada, N., & Rosa, A. C. (2020).
[generateData—A 2D data generator](https://doi.org/10.1016/j.simpa.2020.100017).
Software Impacts, 4:100017. doi: [10.1016/j.simpa.2020.100017](https://doi.org/10.1016/j.simpa.2020.100017)

## Multidimensional alternative
The [*MOCluGen*](https://github.com/clugen/MOCluGen) toolbox extends
*generateData* with arbitrary dimensions and statistical distributions.
Therefore, *generateData* offers a limited subset of the functionality provided
by *MOCluGen*, although it's probably simpler to use.

## License
This script is made available under the [MIT License](LICENSE).