https://github.com/deric/handl-data-generators

Clustering data generators
https://github.com/deric/handl-data-generators

Last synced: about 1 year ago
JSON representation

Clustering data generators

Host: GitHub
URL: https://github.com/deric/handl-data-generators
Owner: deric
License: lgpl-3.0
Created: 2015-05-26T15:39:08.000Z (about 11 years ago)
Default Branch: master
Last Pushed: 2015-11-23T12:33:35.000Z (over 10 years ago)
Last Synced: 2025-02-09T20:42:19.842Z (over 1 year ago)
Language: C
Size: 1.13 MB
Stars: 3
Watchers: 3
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md

Awesome Lists containing this project

README

# J. Handl: data generators

Syntetic data generators for creating datasets with Gaussian distribution. The code was taken from [official website](http://personalpages.manchester.ac.uk/mbs/Julia.Handl/generators.html) and slightly modified for modern compilers.

## Requirements

You'll need `g++` compiler.

* Debian: `apt-get install build-essential` should be enough

and run

```
$ make
```

## mult_generator

Currently does not take any parameters, all settings is hard-coded in constants:

```c
#define DIM 2 // dimensionality of the data
#define NUM 40 // number of clusters

#define MAXMU 10 // mean in each dimension is in range [0,MAXMU]
#define MINMU -10
#define MINSIGMA 0
#define MAXSIGMA 20*sqrt(DIM) // standard deviation (to be added on top
// of row sum in each dimension is in range [0,MAXSIGMA]
#define MAXSIZE 100 // size of each cluster is in range [MINSIZE,MAXSIZE]
#define MINSIZE 10
#define RUNS 10 // number of data sets to be generated
```

simply run:

```
$ ./mult_generator
```

![mult_generator](https://raw.githubusercontent.com/deric/handl-data-generators/screens/img/2d-4c-no9.png)

## elly

Ellipsoid generator

```
$ ./elly [-k ] [-d ] [-s ]
```
where all parameters are optional and:
* `` is a positive int >= 2
* `` is a positive int >= 2
* `` is a long int.

![elly example](https://raw.githubusercontent.com/deric/handl-data-generators/screens/img/elly-2d10c13s.png)

## cure

CURE data sets generator. See Guha, Sudipto, Rajeev Rastogi, and Kyuseok Shim. "CURE: an efficient clustering algorithm for
large databases." ACM SIGMOD Record. Vol. 27. No. 2. ACM, 1998. for more details.

The distribution of data points is just approximated

```
$ ./cure -n [-d ] [-s ] [-l ] [-m ] [-t type of data]
```
where:
* `-l` minimal x/y value
* `-m` maximal x/y value
* `-t` type of dataset, currently supports values 0-2

![cure t0](https://raw.githubusercontent.com/deric/handl-data-generators/screens/img/cure-t0-2k-2d.png)

![cure t1](https://raw.githubusercontent.com/deric/handl-data-generators/screens/img/cure-t1-2k-2d.png)

![cure t2](https://raw.githubusercontent.com/deric/handl-data-generators/screens/img/cure-t2-4k.png)

## disk

Disk in disk dataset are two clusters formed by a circle and an annulus around it.

![disk-4600](https://raw.githubusercontent.com/deric/handl-data-generators/screens/img/disk-4600.png)

## Authors

* Julia Handl
* Joshua Knowles
* John Burkardt
* Tomas Barton

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/deric/handl-data-generators

Awesome Lists containing this project

README