
# Introduction
First of all: Hello Everyone!

My name is Moein Verkiani, but you're going to know me as **Moligarch**, or, if you are Persian, as **Kian**!
In this file we are going to work through some hands-on ML exercises on the IRIS dataset.
It's good to read [What is Overfitting and how to avoid it?](overfitting.md) after finishing these practices.

Now, let's prepare our environment for further operations:

+ Import libraries
+ Modify environment variable
+ Define dataset

```python
# Prepare the environment
%reset -f
import os

# Limit math-library threading; these must be set before NumPy/scikit-learn load MKL
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
from sklearn import datasets, tree, svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.preprocessing import StandardScaler, label_binarize
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc
from sklearn.model_selection import train_test_split, KFold, cross_val_score, StratifiedKFold, LeaveOneOut, LeavePOut
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.multiclass import OneVsRestClassifier

# Load the dataset twice: once from CSV (string labels) and once from sklearn (numeric targets)
iris_data = pd.read_csv('iris.csv')
iris = datasets.load_iris()
Xsk = iris['data']
ysk = iris['target']
```

```python
print(type(Xsk), Xsk.shape, type(ysk), ysk.shape)
#print(Xsk, ysk)
```

<class 'numpy.ndarray'> (150, 4) <class 'numpy.ndarray'> (150,)

# Basic Statistics

In this part we're going to calculate some basic statistical metrics in order to understand our dataset better.

The IRIS dataset is a collection of 4 features (petal/sepal length and width) and 1 target that contains 3 species:

* Setosa
* Versicolor
* Virginica

The statistical metrics mentioned above are better defined if we compute them for each species separately.

```python
# Calculate feature means, with respect to the kind of flower
# Rows 0-49 are Setosa, 50-99 Versicolor, 100-149 Virginica
X_arr = np.asarray(Xsk)
setosa_mean = [np.mean(X_arr[:50, i]) for i in range(4)]
versicolor_mean = [np.mean(X_arr[50:100, i]) for i in range(4)]
virginica_mean = [np.mean(X_arr[100:150, i]) for i in range(4)]

species = {'setosa': setosa_mean, 'versicolor': versicolor_mean, 'virginica': virginica_mean}

Xmean_df = pd.DataFrame(species, index=['sepal length', 'sepal width', 'petal length', 'petal width'])
print('Features Mean\n', Xmean_df)
```

    Features Mean
                   setosa  versicolor  virginica
    sepal length    5.006       5.936      6.588
    sepal width     3.428       2.770      2.974
    petal length    1.462       4.260      5.552
    petal width     0.246       1.326      2.026

```python
# Calculate feature standard deviations
setosa_std = [np.std(X_arr[:50, i]) for i in range(4)]
versicolor_std = [np.std(X_arr[50:100, i]) for i in range(4)]
virginica_std = [np.std(X_arr[100:150, i]) for i in range(4)]
X_std = [np.std(X_arr[:150, i]) for i in range(4)]

categ = {'Total': X_std, 'setosa': setosa_std, 'versicolor': versicolor_std, 'virginica': virginica_std}

Xstd_df = pd.DataFrame(categ, index=['sepal length', 'sepal width', 'petal length', 'petal width'])
print('Features Standard Deviation\n', Xstd_df)
```

    Features Standard Deviation
                     Total    setosa  versicolor  virginica
    sepal length  0.825301  0.348947    0.510983   0.629489
    sepal width   0.434411  0.375255    0.310644   0.319255
    petal length  1.759404  0.171919    0.465188   0.546348
    petal width   0.759693  0.104326    0.195765   0.271890

```python
# Calculate feature variances
setosa_var = [np.var(X_arr[:50, i]) for i in range(4)]
versicolor_var = [np.var(X_arr[50:100, i]) for i in range(4)]
virginica_var = [np.var(X_arr[100:150, i]) for i in range(4)]
X_var = [np.var(X_arr[:150, i]) for i in range(4)]

categ = {'Total': X_var, 'setosa': setosa_var, 'versicolor': versicolor_var, 'virginica': virginica_var}

Xvar_df = pd.DataFrame(categ, index=['sepal length', 'sepal width', 'petal length', 'petal width'])
print('Features Variance\n', Xvar_df)
```

    Features Variance
                     Total    setosa  versicolor  virginica
    sepal length  0.681122  0.121764    0.261104   0.396256
    sepal width   0.188713  0.140816    0.096500   0.101924
    petal length  3.095503  0.029556    0.216400   0.298496
    petal width   0.577133  0.010884    0.038324   0.073924

## Scale

When your data has different value ranges, and even different measurement units, it can be difficult to compare features. What are kilograms compared to meters? Or altitude compared to time?

The answer to this problem is scaling. We can scale data into new values that are easier to compare.

The standardization method uses this formula:

```
z = (x - u) / s
```
Where `z` is the new value, `x` is the original value, `u` is the mean and `s` is the standard deviation.
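To make the formula concrete, here is a minimal NumPy sketch that standardizes the feature matrix by hand (using `Xsk` from the setup cell); it yields the same values as `StandardScaler` below, which also uses the population standard deviation:

```python
# Standardize each feature column by hand: z = (x - u) / s
u = Xsk.mean(axis=0)   # per-feature mean
s = Xsk.std(axis=0)    # per-feature (population) standard deviation
z = (Xsk - u) / s      # broadcasts across all 150 rows

print(z[:3])           # first three standardized rows
```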

scikit-learn does all of this with a single command:

```python
scale = StandardScaler()

scaledX = scale.fit_transform(Xsk)
print(scaledX,scaledX.shape)
```

    [[-9.00681170e-01  1.01900435e+00 -1.34022653e+00 -1.31544430e+00]
     [-1.14301691e+00 -1.31979479e-01 -1.34022653e+00 -1.31544430e+00]
     [-1.38535265e+00  3.28414053e-01 -1.39706395e+00 -1.31544430e+00]
     [-1.50652052e+00  9.82172869e-02 -1.28338910e+00 -1.31544430e+00]
     [-1.02184904e+00  1.24920112e+00 -1.34022653e+00 -1.31544430e+00]
     ...
     [ 6.86617933e-02 -1.31979479e-01  7.62758269e-01  7.90670654e-01]] (150, 4)

```python
print(type(iris_data))
```

<class 'pandas.core.frame.DataFrame'>

## Data Visualization

If you are trying to discuss or illustrate something to your colleagues, co-workers, or managers, you need to SHOW them what you mean! So although we know **Data Talks Everywhere!**, without data visualization you are using only a fraction of your data's potential. It also helps you understand the relationships within a dataset better (though not in every case, I believe!).

So let's dig deeper.

```python
# set up a figure twice as wide as it is tall
fig = plt.figure(figsize=(12,6))
# =============
# First subplot
# =============
# set up the axes for the first plot
ax = fig.add_subplot(1, 2, 1, projection='3d')

x1 = Xsk[:,0]
x2 = Xsk[:,1]

ax.scatter(x1, x2, ysk, marker='o')
ax.set_xlabel('Sepal L')
ax.set_ylabel('Sepal W')
ax.set_zlabel('Category')
# ==============
# Second subplot
# ==============
# set up the axes for the second plot
ax = fig.add_subplot(1, 2, 2, projection='3d')

x3 = Xsk[:,2]
x4 = Xsk[:,3]

ax.scatter(x3, x4, ysk, marker='x')
ax.set_xlabel('Petal L')
ax.set_ylabel('Petal W')
ax.set_zlabel('Category')
plt.show()
```


![png](/assets/output_11_0.png)

```python
# Plot every pairwise feature relationship
sn.pairplot(iris_data)
```


![png](/assets/output_12_1.png)

```python
plt.hist(ysk, 25)
plt.title("Data Distribution")
plt.xlabel("Class")
plt.ylabel("Frequency")
plt.show()
```


![png](/assets/output_13_0.png)



 

```python
iris_data.head()
```

       sepal.length  sepal.width  petal.length  petal.width variety
    0           5.1          3.5           1.4          0.2  Setosa
    1           4.9          3.0           1.4          0.2  Setosa
    2           4.7          3.2           1.3          0.2  Setosa
    3           4.6          3.1           1.5          0.2  Setosa
    4           5.0          3.6           1.4          0.2  Setosa

```python
iris_data.describe()
```

           sepal.length  sepal.width  petal.length  petal.width
    count    150.000000   150.000000    150.000000   150.000000
    mean       5.843333     3.057333      3.758000     1.199333
    std        0.828066     0.435866      1.765298     0.762238
    min        4.300000     2.000000      1.000000     0.100000
    25%        5.100000     2.800000      1.600000     0.300000
    50%        5.800000     3.000000      4.350000     1.300000
    75%        6.400000     3.300000      5.100000     1.800000
    max        7.900000     4.400000      6.900000     2.500000

```python
iris_data.shape
```

(150, 5)

# Classification
Classification in machine learning is the process of recognizing, understanding, and grouping objects and ideas into preset categories. It relies on algorithms that learn how to assign a class label to examples from the problem domain. There are many different types of classification tasks in machine learning, and specialized modeling approaches for each.
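Every classifier in this section follows the same scikit-learn estimator API, so the cells below differ mainly in the constructor call. A minimal sketch of that shared pattern (`DecisionTreeClassifier` stands in for any of the models that follow):

```python
# The fit/predict pattern shared by every classifier below
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier  # placeholder for any estimator

X_train, X_test, y_train, y_test = train_test_split(Xsk, ysk, test_size=0.33, random_state=0)

model = DecisionTreeClassifier()     # 1. construct with hyperparameters
model.fit(X_train, y_train)          # 2. learn from the training split
y_pred = model.predict(X_test)       # 3. predict on held-out data
print(model.score(X_test, y_test))   # 4. mean accuracy on the test split
```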

## Decision Tree

A decision tree is a flow-chart-like model that can help you make decisions based on previous experience. We will use a **confusion matrix** to evaluate the accuracy of our model.

```python
# Map string labels to integers and select the feature columns
d = {'Setosa': 0, 'Versicolor': 1, 'Virginica': 2}
features = ['sepal.length', 'sepal.width', 'petal.length', 'petal.width']
Xtree = iris_data[features]
ytree = iris_data['variety'].map(d)

print(iris_data)
```

         sepal.length  sepal.width  petal.length  petal.width    variety
    0             5.1          3.5           1.4          0.2     Setosa
    1             4.9          3.0           1.4          0.2     Setosa
    2             4.7          3.2           1.3          0.2     Setosa
    3             4.6          3.1           1.5          0.2     Setosa
    4             5.0          3.6           1.4          0.2     Setosa
    ..            ...          ...           ...          ...        ...
    145           6.7          3.0           5.2          2.3  Virginica
    146           6.3          2.5           5.0          1.9  Virginica
    147           6.5          3.0           5.2          2.0  Virginica
    148           6.2          3.4           5.4          2.3  Virginica
    149           5.9          3.0           5.1          1.8  Virginica

    [150 rows x 5 columns]

```python
dtree = tree.DecisionTreeClassifier()
dtree.fit(Xtree, ytree)

#Plot the tree
plt.figure(figsize=(15,10))
tree.plot_tree(dtree, feature_names=features, fontsize=10)
plt.show()
```


![png](/assets/output_21_0.png)

```python
# Wrap the sample in a DataFrame so it carries the fitted feature names
print(dtree.predict(pd.DataFrame([[5.5, 4, 4, 1.5]], columns=features)))
```

[1]

### Confusion Matrix
It is a table used in classification problems to assess where the model's errors were made.

The rows represent the actual classes the outcomes should have been, while the columns represent the predictions we have made. Using this table, it is easy to see which predictions are wrong.
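A tiny hand-made example (hypothetical labels, not taken from the iris data) makes the row/column convention explicit:

```python
# Rows = actual class, columns = predicted class
y_true = [0, 0, 1, 1, 2, 2]   # what the outcomes should have been
y_pred = [0, 0, 1, 2, 2, 2]   # what the model guessed

# The off-diagonal 1 in row 1, column 2 is the class-1 sample mistaken for class 2
print(confusion_matrix(y_true, y_pred))
```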

```python
clf = tree.DecisionTreeClassifier()
X_train, X_test, y_train, y_test = train_test_split(Xtree, ytree, test_size=0.33, random_state=0)
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
cm = confusion_matrix(y_test, predictions, labels=clf.classes_)

df_cfm = pd.DataFrame(cm, index = clf.classes_, columns = clf.classes_)
plt.figure(figsize = (10,7))
cfm_plot = sn.heatmap(df_cfm, annot=True)
plt.title("Decision Tree", fontsize = 22, fontweight="bold")
plt.xlabel("Predicted Label", fontsize = 22)
plt.ylabel("True Label", fontsize = 22)
sn.set(font_scale=1.4)
plt.show()
```


![png](/assets/output_24_0.png)

### AUC - ROC Curve

In classification, there are many different evaluation metrics. The most popular is accuracy, which measures how often the model is correct. This is a great metric because it is easy to understand and getting the most correct guesses is often desired. There are some cases where you might consider using another evaluation metric.

Another common metric is AUC, the area under the receiver operating characteristic (ROC) curve. The ROC curve plots the true positive (TP) rate against the false positive (FP) rate at different classification thresholds. The thresholds are different probability cutoffs that separate the two classes in binary classification. It uses probability to tell us how well a model separates the classes.
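Before tackling the three-class case, a minimal binary sketch (with hypothetical labels and scores) shows how `roc_curve` sweeps the threshold over predicted probabilities:

```python
# Each threshold yields one (FPR, TPR) point on the curve
y_true = [0, 0, 0, 1, 1, 1]               # hypothetical binary labels
y_score = [0.1, 0.4, 0.6, 0.5, 0.8, 0.9]  # hypothetical predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(thresholds)        # probability cutoffs tried
print(auc(fpr, tpr))     # area under the resulting curve
```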

```python
clf = tree.DecisionTreeClassifier()
X_train, X_test, y_train, y_test = train_test_split(Xsk, ysk, test_size=0.33, random_state=0)
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
cm = confusion_matrix(y_test, predictions, labels=clf.classes_)

df_cfm = pd.DataFrame(cm, index = clf.classes_, columns = clf.classes_)
plt.figure(figsize = (10,7))
cfm_plot = sn.heatmap(df_cfm, annot=True)
plt.title("Decision Tree", fontsize = 22, fontweight="bold")
plt.xlabel("Predicted Label", fontsize = 22)
plt.ylabel("True Label", fontsize = 22)
sn.set(font_scale=1.4)
plt.show()
```


![png](/assets/output_26_0.png)

```python
# Binarize the output (one indicator column per class)
y = label_binarize(ysk, classes=clf.classes_)
n_classes = y.shape[1]

# Shuffle and split training and test sets
X_train, X_test, y_train, y_test = train_test_split(Xsk, y, test_size=0.33, random_state=0)

# Learn to predict each class against the others
classifier = OneVsRestClassifier(clf)
y_score = classifier.fit(X_train, y_train).predict_proba(X_test)

# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
```

```python
plt.figure()
lw = 2
plt.plot(
    fpr[2],
    tpr[2],
    color="darkorange",
    lw=lw,
    label="ROC curve (area = %0.2f)" % roc_auc[2],
)
plt.plot([0, 1], [0, 1], color="navy", lw=lw, linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend(loc="lower right")
plt.show()
```


![png](/assets/output_28_0.png)

### Cross Validation

When adjusting models, we aim to increase overall performance on unseen data. Hyperparameter tuning can lead to much better performance on test sets. However, optimizing parameters against the test set can cause information leakage, making the model perform worse on unseen data. To correct for this we can perform cross validation (CV).

To better understand CV, we will be performing different methods on the iris dataset.
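To see what `cross_val_score` does under the hood in the cells below, this sketch prints the train/test index splits that a 5-fold `KFold` generates; the model is refit and scored once per fold:

```python
# Inspect the index splits behind 5-fold cross validation
k_folds = KFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(k_folds.split(Xsk)):
    # Without shuffling, each fold holds out a contiguous 30-row slice
    print(f"fold {fold}: train={len(train_idx)} test={len(test_idx)} "
          f"test rows {test_idx[0]}..{test_idx[-1]}")
```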

```python
# K-Fold Cross Validation

clf = DecisionTreeClassifier(random_state=42)

k_folds = KFold(n_splits = 5)

scores = cross_val_score(clf, Xsk, ysk, cv = k_folds)

print("Cross Validation Scores: ", scores)
print("Average CV Score: ", scores.mean())
print("Number of CV Scores used in Average: ", len(scores))
```

Cross Validation Scores: [1. 1. 0.83333333 0.93333333 0.8 ]
Average CV Score: 0.9133333333333333
Number of CV Scores used in Average: 5

```python
# Stratified K-Fold

clf = DecisionTreeClassifier(random_state=42)

sk_folds = StratifiedKFold(n_splits = 5)

scores = cross_val_score(clf, Xsk, ysk, cv = sk_folds)

print("Cross Validation Scores: ", scores)
print("Average CV Score: ", scores.mean())
print("Number of CV Scores used in Average: ", len(scores))
```

Cross Validation Scores: [0.96666667 0.96666667 0.9 0.93333333 1. ]
Average CV Score: 0.9533333333333334
Number of CV Scores used in Average: 5

```python
#Leave One Out
X, y = datasets.load_iris(return_X_y=True)

clf = DecisionTreeClassifier(random_state=42)

loo = LeaveOneOut()

scores = cross_val_score(clf, X, y, cv = loo)

print("Cross Validation Scores: ", scores)
print("Average CV Score: ", scores.mean())
print("Number of CV Scores used in Average: ", len(scores))
```

Cross Validation Scores: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1.
1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0.
1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 0. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1.]
Average CV Score: 0.94
Number of CV Scores used in Average: 150

```python
clf = DecisionTreeClassifier(random_state=42)

lpo = LeavePOut(p=2)

scores = cross_val_score(clf, Xsk, ysk, cv = lpo)

print("Cross Validation Scores: ", scores)
print("Average CV Score: ", scores.mean())
print("Number of CV Scores used in Average: ", len(scores))
```

Cross Validation Scores: [1. 1. 1. ... 1. 1. 1.]
Average CV Score: 0.9382997762863534
Number of CV Scores used in Average: 11175

### Ensemble

Ensemble methods combine several models to get more robust predictions than any single one. Below we use bagging (bootstrap aggregation), which trains many decision trees on random resamples of the training data; first, a single decision tree serves as a baseline:

```python
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(Xtree, ytree, test_size = 0.25, random_state = 22)

dtree = DecisionTreeClassifier(random_state = 22)
dtree.fit(X_train,y_train)

y_pred = dtree.predict(X_test)

print("Train data accuracy:",accuracy_score(y_true = y_train, y_pred = dtree.predict(X_train)))
print("Test data accuracy:",accuracy_score(y_true = y_test, y_pred = y_pred))
```

Train data accuracy: 1.0
Test data accuracy: 0.9210526315789473

```python
from sklearn.ensemble import BaggingClassifier

X_train, X_test, y_train, y_test = train_test_split(Xtree, ytree, test_size=0.25, random_state=22)

estimator_range = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]

models = []
scores = []

for n_estimators in estimator_range:
    # Create bagging classifier
    clf = BaggingClassifier(n_estimators=n_estimators, random_state=22)

    # Fit the model
    clf.fit(X_train, y_train)

    # Append the model and score to their respective lists
    models.append(clf)
    scores.append(accuracy_score(y_true=y_test, y_pred=clf.predict(X_test)))

# Generate the plot of scores against number of estimators
plt.figure(figsize=(9, 6))
plt.plot(estimator_range, scores)

# Adjust labels and font (to make them visible)
plt.xlabel("n_estimators", fontsize=18)
plt.ylabel("score", fontsize=18)
plt.tick_params(labelsize=16)

# Visualize plot
plt.show()
```


![png](/assets/output_36_0.png)

```python
X_train, X_test, y_train, y_test = train_test_split(Xtree, ytree, test_size = 0.25, random_state = 22)

clf = BaggingClassifier(n_estimators=12, oob_score=True, random_state=22)

clf.fit(X_train, y_train)

plt.figure(figsize=(15, 10))

plot_tree(clf.estimators_[0], feature_names = features, fontsize=14)
```

[Text(0.375, 0.9, 'petal.length <= 2.45\ngini = 0.661\nsamples = 71\nvalue = [35, 44, 33]'),
Text(0.25, 0.7, 'gini = 0.0\nsamples = 23\nvalue = [35, 0, 0]'),
Text(0.5, 0.7, 'petal.width <= 1.7\ngini = 0.49\nsamples = 48\nvalue = [0, 44, 33]'),
Text(0.25, 0.5, 'petal.length <= 5.0\ngini = 0.044\nsamples = 26\nvalue = [0, 43, 1]'),
Text(0.125, 0.3, 'gini = 0.0\nsamples = 24\nvalue = [0, 42, 0]'),
Text(0.375, 0.3, 'sepal.length <= 6.15\ngini = 0.5\nsamples = 2\nvalue = [0, 1, 1]'),
Text(0.25, 0.1, 'gini = 0.0\nsamples = 1\nvalue = [0, 1, 0]'),
Text(0.5, 0.1, 'gini = 0.0\nsamples = 1\nvalue = [0, 0, 1]'),
Text(0.75, 0.5, 'petal.length <= 4.85\ngini = 0.059\nsamples = 22\nvalue = [0, 1, 32]'),
Text(0.625, 0.3, 'gini = 0.0\nsamples = 1\nvalue = [0, 1, 0]'),
Text(0.875, 0.3, 'gini = 0.0\nsamples = 21\nvalue = [0, 0, 32]')]


![png](/assets/output_37_1.png)

## SVM

A support vector machine looks for the hyperplane that separates the classes with the largest margin; here we use the linear variant, `LinearSVC`.

```python
clf = svm.LinearSVC(max_iter=3080)
X_train, X_test, y_train, y_test = train_test_split(Xtree, ytree, test_size = 0.33, random_state=0)
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
cm = confusion_matrix(y_test, predictions, labels=clf.classes_)

df_cfm = pd.DataFrame(cm, index = clf.classes_, columns = clf.classes_)
plt.figure(figsize = (10,7))
cfm_plot = sn.heatmap(df_cfm, annot=True)
plt.title("SVM", fontsize = 22, fontweight="bold")
plt.xlabel("Predicted Label", fontsize = 22)
plt.ylabel("True Label", fontsize = 22)
sn.set(font_scale=1.4)
plt.show()
```


![png](/assets/output_39_0.png)

```python
#report = classification_report(y_test, predictions, target_names=clf.classes_, labels=clf.classes_, zero_division=0, output_dict=True)
#df = pd.DataFrame(report).transpose()
#df.to_csv("Report_SVM.csv")
```

## Random Forest

A random forest is itself an ensemble: many decision trees trained on random subsets of rows and features, whose votes are combined.

```python
clf = RandomForestClassifier()
```

```python
X_train, X_test, y_train, y_test = train_test_split(Xtree, ytree, test_size=0.33, random_state=0)
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
cm = confusion_matrix(y_test, predictions, labels=clf.classes_)

df_cfm = pd.DataFrame(cm, index = clf.classes_, columns = clf.classes_)
plt.figure(figsize = (10,7))
cfm_plot = sn.heatmap(df_cfm, annot=True)
plt.title("Random Forest", fontsize = 22, fontweight="bold")
plt.xlabel("Predicted Label", fontsize = 22)
plt.ylabel("True Label", fontsize = 22)
sn.set(font_scale=1.4)
plt.show()
```


![png](/assets/output_43_0.png)

```python
#report = classification_report(y_test, predictions, target_names=clf.classes_, labels=clf.classes_, zero_division=0, output_dict=True)
#df = pd.DataFrame(report).transpose()
#df.to_csv("Report_RF.csv")
```

## Logistic Regression

### Grid Search

Logistic regression has a regularization hyperparameter `C`; higher values mean weaker regularization. The loop below manually sweeps a range of `C` values and scores each fitted model on the full dataset:

```python
logit = LogisticRegression(max_iter=10000)

C = [0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2]

scores = []

for choice in C:
    logit.set_params(C=choice)
    logit.fit(Xsk, ysk)
    scores.append(logit.score(Xsk, ysk))

print(scores)
```

[0.9666666666666667, 0.9666666666666667, 0.9733333333333334, 0.9733333333333334, 0.98, 0.98, 0.9866666666666667, 0.9866666666666667]
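Note that the loop above scores each `C` on the same data the model was fit on, which rewards overfitting. The standard tool is `GridSearchCV`, which cross-validates every candidate instead; a minimal sketch:

```python
# Cross-validated grid search over C (sketch; 5-fold CV per candidate)
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2]}
search = GridSearchCV(LogisticRegression(max_iter=10000), param_grid, cv=5)
search.fit(Xsk, ysk)

print(search.best_params_)   # C with the best mean cross-validated accuracy
print(search.best_score_)
```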

```python
clf = LogisticRegression()
X_train, X_test, y_train, y_test = train_test_split(Xtree, ytree, test_size=0.33, random_state=0)
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
cm = confusion_matrix(y_test, predictions, labels=clf.classes_)

df_cfm = pd.DataFrame(cm, index = clf.classes_, columns = clf.classes_)
plt.figure(figsize = (10,7))
cfm_plot = sn.heatmap(df_cfm, annot=True)
plt.title("Logistic Regression", fontsize = 22, fontweight="bold")
plt.xlabel("Predicted Label", fontsize = 22)
plt.ylabel("True Label", fontsize = 22)
sn.set(font_scale=1.4)
plt.show()
```


![png](/assets/output_48_0.png)

```python
#report = classification_report(y_test, predictions, target_names=clf.classes_, labels=clf.classes_, zero_division=0, output_dict=True)
#df = pd.DataFrame(report).transpose()
#df.to_csv("Report_LR.csv")
```

## Gaussian Naïve Bayes

```python
from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()
X_train, X_test, y_train, y_test = train_test_split(Xtree, ytree, test_size=0.33, random_state=0)
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
cm = confusion_matrix(y_test, predictions, labels=clf.classes_)

df_cfm = pd.DataFrame(cm, index = clf.classes_, columns = clf.classes_)
plt.figure(figsize = (10,7))
cfm_plot = sn.heatmap(df_cfm, annot=True)
plt.title("Gaussian Naïve Bays", fontsize = 22, fontweight="bold")
plt.xlabel("Predicted Label", fontsize = 22)
plt.ylabel("True Label", fontsize = 22)
sn.set(font_scale=1.4)
plt.show()
```


![png](/assets/output_51_0.png)

```python
#report = classification_report(y_test, predictions, target_names=clf.classes_, labels=clf.classes_, zero_division=0, output_dict=True)
#df = pd.DataFrame(report).transpose()
#df.to_csv("Report_GNB.csv")
```

## KNN

k-nearest neighbors classifies a sample by a majority vote of its k closest training points; with `n_neighbors=1` each sample takes the label of its single nearest neighbor.

```python
clf = KNeighborsClassifier(n_neighbors=1)
X_train, X_test, y_train, y_test = train_test_split(Xtree, ytree, test_size =0.33, random_state=0)
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
cm = confusion_matrix(y_test, predictions, labels=clf.classes_)

df_cfm = pd.DataFrame(cm, index = clf.classes_, columns = clf.classes_)
plt.figure(figsize = (10,7))
cfm_plot = sn.heatmap(df_cfm, annot=True)
plt.title("K-NN", fontsize = 22, fontweight="bold")
plt.xlabel("Predicted Label", fontsize = 22)
plt.ylabel("True Label", fontsize = 22)
sn.set(font_scale=1.4)
plt.show()
```


![png](/assets/output_54_0.png)

```python
#report = classification_report(y_test, predictions, target_names=clf.classes_, labels=clf.classes_, zero_division=0, output_dict=True)
#df = pd.DataFrame(report).transpose()
#df.to_csv("Report_KNN.csv")
```

# Hierarchical Clustering

Hierarchical clustering is an unsupervised learning method for clustering data points. The algorithm builds clusters by measuring the dissimilarities between data points. Unsupervised learning means the model needs no labeled "target" variable, so this method can be used on any data to visualize and interpret the relationships between individual data points.
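The dendrogram below is drawn from a linkage matrix, where each row records one merge: the two clusters joined, the distance at which they merged, and the size of the new cluster. A minimal sketch on five hypothetical petal (length, width) pairs:

```python
# Each linkage row = [cluster_i, cluster_j, merge_distance, new_cluster_size]
from scipy.cluster.hierarchy import linkage

pts = [[1.4, 0.2], [1.3, 0.2], [4.7, 1.4], [4.5, 1.5], [6.0, 2.5]]  # hypothetical points
Z = linkage(pts, method='ward', metric='euclidean')
print(Z)   # four merges reduce five points to a single cluster
```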

```python
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
fig = plt.figure(figsize=(15,5))

data_to_analyze = iris_data[['petal.length', 'petal.width']]

# =============
# First subplot
# =============

ax = fig.add_subplot(1, 2, 1)
groups = AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='ward')
groups.fit_predict(data_to_analyze)
plt.scatter(iris_data['petal.length'] ,iris_data['petal.width'], c= groups.labels_, cmap='cool')

# =============
# Second subplot
# =============

ax = fig.add_subplot(1, 2, 2)
data_to_analyze = list(zip(iris_data['petal.length'], iris_data['petal.width']))
linkage_data = linkage(data_to_analyze, method='ward', metric='euclidean')
dendrogram(linkage_data)
plt.show()
```


![png](/assets/output_57_0.png)

## K-means

K-means partitions the data into k clusters by minimizing inertia, the within-cluster sum of squared distances to the centroids. The elbow method below plots inertia against k; a reasonable k sits where the curve bends.

```python
from sklearn.cluster import KMeans

inertias = []

for i in range(1, 11):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(data_to_analyze)
    inertias.append(kmeans.inertia_)

plt.plot(range(1, 11), inertias, marker='o')
plt.title('Elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()
```


![png](/assets/output_59_1.png)

```python
kmeans = KMeans(n_clusters=3)
kmeans.fit(data_to_analyze)

plt.scatter(iris_data['petal.length'], iris_data['petal.width'], c=kmeans.labels_, cmap='cool')
plt.show()
```


![png](/assets/output_60_0.png)