Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/prateekiiest/a-novel-clustering-approach
A new approach to clustering the IRIS dataset
- Host: GitHub
- URL: https://github.com/prateekiiest/a-novel-clustering-approach
- Owner: prateekiiest
- License: gpl-2.0
- Created: 2018-09-29T00:30:59.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2018-09-30T01:05:00.000Z (over 6 years ago)
- Last Synced: 2024-11-13T07:40:48.644Z (2 months ago)
- Topics: cluster-analysis, clustering, clustering-algorithm, correlation, iris, machine-learning, machine-learning-algorithms
- Language: Python
- Homepage:
- Size: 223 KB
- Stars: 2
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# A Novel Approach to Clustering
## Abstract
Clustering is one of the most important topics in machine learning and is used to find potential structure in data, i.e. how the data items are similar to each other. Clustering is a method of unsupervised learning and a common technique for statistical data analysis used in many fields.
In Data Science, we can use clustering analysis to gain some valuable insights from our data by seeing what groups the data points fall into when we apply a clustering algorithm.
Here we propose a new clustering approach to gather data points that are highly similar to each other into one cluster and separate the points that are highly dissimilar in nature.
## Data Set
We use the iris data set to evaluate our approach.
- The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres.
- The data set contains 150 observations of iris flowers. There are four columns of measurements of the flowers in centimetres. The fifth column is the species of the flower observed. All observed flowers belong to one of three species; a quick check of these counts is sketched below the image.

![](https://raw.githubusercontent.com/prateekiiest/A-Novel-Clustering-Approach/master/images/iris.jpeg)
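As a quick check of these counts, here is a short sketch using the copy of the iris data bundled with scikit-learn; note that the repository itself reads a local `iris_new.csv`, which is assumed to contain the same 150 observations.

```
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame                      # 150 rows: 4 measurement columns + target

print(df.shape)                      # (150, 5)
print(df['target'].value_counts())   # 50 observations per species
print(iris.target_names)             # ['setosa' 'versicolor' 'virginica']
```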
### DataSet visualization
We first visualize the data set.
#### Feature Plot
```
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# load the data, drop the Id column, and plot pairwise feature scatter plots
wbcd = pd.read_csv('iris_new.csv')
tmp = wbcd.drop('Id', axis=1)
g = sns.pairplot(tmp, hue='Species', markers='+')
plt.show()
```

![](https://raw.githubusercontent.com/prateekiiest/A-Novel-Clustering-Approach/master/images/iris_feature_plot.png)
#### Feature Correlation
We then see the correlation among the features of the data set.
```
# correlation matrix of the four numeric features (Species is non-numeric)
corr = tmp.drop('Species', axis=1).corr()
colormap = sns.diverging_palette(220, 10, as_cmap=True)
plt.figure(figsize=(14, 14))
sns.heatmap(corr, cbar=True, square=True, annot=True, fmt='.2f',
            annot_kws={'size': 8}, cmap=colormap,
            linewidths=0.1, linecolor='white')
plt.title('Correlation of Iris Data Features', y=1.05, size=15)
plt.show()
```

![](https://raw.githubusercontent.com/prateekiiest/A-Novel-Clustering-Approach/master/images/feature_plot.png)
#### Class Distribution
```
# number of observations per species (50 each)
tmp = wbcd.drop('Id', axis=1)
sns.countplot(tmp['Species'], label="Count")
plt.show()
```

![](https://raw.githubusercontent.com/prateekiiest/A-Novel-Clustering-Approach/master/images/iris_plot_species.png)
----------------------------------
## Clustering Approach
We now move on to our proposed clustering approach.
The algorithm is as follows; a minimal Python sketch of the loop is given after the steps.
- We initially have 150 clusters, one per object: c_1, c_2, ..., c_150.
- For every pair of clusters c_i and c_j, compute the cluster similarity
  s_ij = | c_i ∩ c_j | / | c_i ∪ c_j |
- This gives a cluster similarity matrix S = (s_ij).
  If s_kl is the highest value in S, merge clusters c_k and c_l.
  If several pairs tie for the highest value, merge each such pair separately.
- If any other cluster c_t (t = 1, 2, ..., 150 with t ≠ k and t ≠ l) is a subset of (c_k ∪ c_l), discard it.
  Let c_d be the number of clusters discarded.
- The number of clusters remaining is then n = 150 - c_d - 1.
- If the number of clusters equals m (a predefined, user-supplied value), stop; otherwise go back to step 2.
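Below is a minimal Python sketch of this merge-and-discard loop. The names (`novel_clustering`, `jaccard`, `items`) are illustrative and not from the repository, and the sketch assumes each observation has already been encoded as a set (the README does not say how the numeric iris measurements are turned into sets, e.g. by binning feature values), so that encoding step is left to the caller; ties for the highest similarity are resolved by merging one pair per iteration.

```
from itertools import combinations

def jaccard(a, b):
    # cluster similarity s_ij = |c_i ∩ c_j| / |c_i ∪ c_j|
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def novel_clustering(items, m):
    # items: list of sets, one per observation (how the iris rows become
    # sets is an assumption left to the caller, e.g. binned feature values)
    # m: user-supplied number of clusters to stop at
    clusters = [set(s) for s in items]          # step 1: one cluster per object
    while len(clusters) > m:
        # steps 2-3: similarity matrix S = (s_ij); pick the most similar pair
        pairs = combinations(range(len(clusters)), 2)
        k, l = max(pairs, key=lambda p: jaccard(clusters[p[0]], clusters[p[1]]))
        merged = clusters[k] | clusters[l]      # merge c_k and c_l
        # step 4: discard any other cluster that is a subset of c_k ∪ c_l
        clusters = [c for t, c in enumerate(clusters)
                    if t not in (k, l) and not c <= merged]
        clusters.append(merged)                 # repeat until m clusters remain
    return clusters
```

With m = 3 and some set encoding of the iris rows, the loop stops at three clusters as in the last step above; how closely those clusters match the three species depends on the chosen encoding.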