
# Introduction
First of all: Hello Everyone!

My name is Moein Verkiani, but you're going to know me as **Moligarch**, or, if you are Persian, as **Kian**!
In this file we are going to work through some hands-on ML exercises on the IRIS dataset.
It's good to read [What is Overfitting and how to avoid it?](overfitting.md) after finishing these practices.

Now, let's prepare our environment for further operations:

+ Import libraries
+ Modify environment variable
+ Define dataset

```python
# Prepare the environment
%reset -f
import os

# Limit math-library threading; these must be set before NumPy/scikit-learn load MKL
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
from sklearn import datasets, tree, svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.preprocessing import StandardScaler, label_binarize
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc
from sklearn.model_selection import train_test_split, KFold, cross_val_score, StratifiedKFold, LeaveOneOut, LeavePOut
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.multiclass import OneVsRestClassifier

# Load the dataset twice: once from CSV (string labels) and once from sklearn (numeric targets)
iris_data = pd.read_csv('iris.csv')
iris = datasets.load_iris()
Xsk = iris['data']
ysk = iris['target']
```

```python
print(type(Xsk), Xsk.shape, type(ysk), ysk.shape)
#print(Xsk, ysk)
```

<class 'numpy.ndarray'> (150, 4) <class 'numpy.ndarray'> (150,)

# Basic Statistics

In this part we're going to calculate some basic statistical metrics in order to understand our dataset better.

The IRIS dataset is a collection of 4 features (petal/sepal length and width) and 1 target that contains 3 species:

* Setosa
* Versicolor
* Virginica

The statistical metrics mentioned above are better defined if we compute them for each species separately.

```python
# Calculate feature means, with respect to the kind of flower
# Rows 0-49 are Setosa, 50-99 Versicolor, 100-149 Virginica
X_arr = np.asarray(Xsk)
setosa_mean = [np.mean(X_arr[:50, i]) for i in range(4)]
versicolor_mean = [np.mean(X_arr[50:100, i]) for i in range(4)]
virginica_mean = [np.mean(X_arr[100:150, i]) for i in range(4)]

species = {'setosa': setosa_mean, 'versicolor': versicolor_mean, 'virginica': virginica_mean}

Xmean_df = pd.DataFrame(species, index=['sepal length', 'sepal width', 'petal length', 'petal width'])
print('Features Mean\n', Xmean_df)
```

    Features Mean
                   setosa  versicolor  virginica
    sepal length    5.006       5.936      6.588
    sepal width     3.428       2.770      2.974
    petal length    1.462       4.260      5.552
    petal width     0.246       1.326      2.026

```python
# Calculate feature standard deviations
setosa_std = [np.std(X_arr[:50, i]) for i in range(4)]
versicolor_std = [np.std(X_arr[50:100, i]) for i in range(4)]
virginica_std = [np.std(X_arr[100:150, i]) for i in range(4)]
X_std = [np.std(X_arr[:150, i]) for i in range(4)]

categ = {'Total': X_std, 'setosa': setosa_std, 'versicolor': versicolor_std, 'virginica': virginica_std}

Xstd_df = pd.DataFrame(categ, index=['sepal length', 'sepal width', 'petal length', 'petal width'])
print('Features Standard Deviation\n', Xstd_df)
```

    Features Standard Deviation
                     Total    setosa  versicolor  virginica
    sepal length  0.825301  0.348947    0.510983   0.629489
    sepal width   0.434411  0.375255    0.310644   0.319255
    petal length  1.759404  0.171919    0.465188   0.546348
    petal width   0.759693  0.104326    0.195765   0.271890

```python
# Calculate feature variances
setosa_var = [np.var(X_arr[:50, i]) for i in range(4)]
versicolor_var = [np.var(X_arr[50:100, i]) for i in range(4)]
virginica_var = [np.var(X_arr[100:150, i]) for i in range(4)]
X_var = [np.var(X_arr[:150, i]) for i in range(4)]

categ = {'Total': X_var, 'setosa': setosa_var, 'versicolor': versicolor_var, 'virginica': virginica_var}

Xvar_df = pd.DataFrame(categ, index=['sepal length', 'sepal width', 'petal length', 'petal width'])
print('Features Variance\n', Xvar_df)
```

    Features Variance
                     Total    setosa  versicolor  virginica
    sepal length  0.681122  0.121764    0.261104   0.396256
    sepal width   0.188713  0.140816    0.096500   0.101924
    petal length  3.095503  0.029556    0.216400   0.298496
    petal width   0.577133  0.010884    0.038324   0.073924

## Scale

When your data has different value ranges, and even different measurement units, it can be difficult to compare features. What are kilograms compared to meters? Or altitude compared to time?

The answer to this problem is scaling. We can scale data into new values that are easier to compare.

The standardization method uses this formula:

```
z = (x - u) / s
```
Where `z` is the new value, `x` is the original value, `u` is the mean and `s` is the standard deviation.
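To make the formula concrete, here is a minimal NumPy sketch that standardizes the feature matrix by hand (using `Xsk` from the setup cell); it yields the same values as `StandardScaler` below, which also uses the population standard deviation:

```python
# Standardize each feature column by hand: z = (x - u) / s
u = Xsk.mean(axis=0)   # per-feature mean
s = Xsk.std(axis=0)    # per-feature (population) standard deviation
z = (Xsk - u) / s      # broadcasts across all 150 rows

print(z[:3])           # first three standardized rows
```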

scikit-learn does all of this with a single command:

```python
scale = StandardScaler()

scaledX = scale.fit_transform(Xsk)
print(scaledX,scaledX.shape)
```

    [[-9.00681170e-01  1.01900435e+00 -1.34022653e+00 -1.31544430e+00]
     [-1.14301691e+00 -1.31979479e-01 -1.34022653e+00 -1.31544430e+00]
     [-1.38535265e+00  3.28414053e-01 -1.39706395e+00 -1.31544430e+00]
     [-1.50652052e+00  9.82172869e-02 -1.28338910e+00 -1.31544430e+00]
     [-1.02184904e+00  1.24920112e+00 -1.34022653e+00 -1.31544430e+00]
     ...
     [ 6.86617933e-02 -1.31979479e-01  7.62758269e-01  7.90670654e-01]] (150, 4)

```python
print(type(iris_data))
```

<class 'pandas.core.frame.DataFrame'>

## Data Visualization

If you are trying to discuss or illustrate something to your colleagues, co-workers, or managers, you need to SHOW them what you mean! So although we know **Data Talks Everywhere!**, without data visualization you are using only a fraction of your data's potential. It also helps you understand the relationships within a dataset better (though not in every case, I believe!).

So let's dig deeper.

```python
# set up a figure twice as wide as it is tall
fig = plt.figure(figsize=(12,6))
# =============
# First subplot
# =============
# set up the axes for the first plot
ax = fig.add_subplot(1, 2, 1, projection='3d')

x1 = Xsk[:,0]
x2 = Xsk[:,1]

ax.scatter(x1, x2, ysk, marker='o')
ax.set_xlabel('Sepal L')
ax.set_ylabel('Sepal W')
ax.set_zlabel('Category')
# ==============
# Second subplot
# ==============
# set up the axes for the second plot
ax = fig.add_subplot(1, 2, 2, projection='3d')

x3 = Xsk[:,2]
x4 = Xsk[:,3]

ax.scatter(x3, x4, ysk, marker='x')
ax.set_xlabel('Petal L')
ax.set_ylabel('Petal W')
ax.set_zlabel('Category')
plt.show()
```


![png](/assets/output_11_0.png)

```python
# Plot every pairwise feature relationship
sn.pairplot(iris_data)
```


![png](/assets/output_12_1.png)

```python
plt.hist(ysk, 25)
plt.title("Data Distribution")
plt.xlabel("Class")
plt.ylabel("Frequency")
plt.show()
```


![png](/assets/output_13_0.png)



 

```python
iris_data.head()
```

       sepal.length  sepal.width  petal.length  petal.width variety
    0           5.1          3.5           1.4          0.2  Setosa
    1           4.9          3.0           1.4          0.2  Setosa
    2           4.7          3.2           1.3          0.2  Setosa
    3           4.6          3.1           1.5          0.2  Setosa
    4           5.0          3.6           1.4          0.2  Setosa

```python
iris_data.describe()
```

           sepal.length  sepal.width  petal.length  petal.width
    count    150.000000   150.000000    150.000000   150.000000
    mean       5.843333     3.057333      3.758000     1.199333
    std        0.828066     0.435866      1.765298     0.762238
    min        4.300000     2.000000      1.000000     0.100000
    25%        5.100000     2.800000      1.600000     0.300000
    50%        5.800000     3.000000      4.350000     1.300000
    75%        6.400000     3.300000      5.100000     1.800000
    max        7.900000     4.400000      6.900000     2.500000

```python
iris_data.shape
```

(150, 5)

# Classification
Classification in machine learning is the process of recognizing, understanding, and grouping objects and ideas into preset categories. It relies on algorithms that learn how to assign a class label to examples from the problem domain. There are many different types of classification tasks in machine learning, and specialized modeling approaches for each.
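Every classifier in this section follows the same scikit-learn estimator API, so the cells below differ mainly in the constructor call. A minimal sketch of that shared pattern (`DecisionTreeClassifier` stands in for any of the models that follow):

```python
# The fit/predict pattern shared by every classifier below
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier  # placeholder for any estimator

X_train, X_test, y_train, y_test = train_test_split(Xsk, ysk, test_size=0.33, random_state=0)

model = DecisionTreeClassifier()     # 1. construct with hyperparameters
model.fit(X_train, y_train)          # 2. learn from the training split
y_pred = model.predict(X_test)       # 3. predict on held-out data
print(model.score(X_test, y_test))   # 4. mean accuracy on the test split
```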

## Decision Tree

A decision tree is a flow-chart-like model that can help you make decisions based on previous experience. We will use a **confusion matrix** to evaluate the accuracy of our model.

```python
# Map string labels to integers and select the feature columns
d = {'Setosa': 0, 'Versicolor': 1, 'Virginica': 2}
features = ['sepal.length', 'sepal.width', 'petal.length', 'petal.width']
Xtree = iris_data[features]
ytree = iris_data['variety'].map(d)

print(iris_data)
```

         sepal.length  sepal.width  petal.length  petal.width    variety
    0             5.1          3.5           1.4          0.2     Setosa
    1             4.9          3.0           1.4          0.2     Setosa
    2             4.7          3.2           1.3          0.2     Setosa
    3             4.6          3.1           1.5          0.2     Setosa
    4             5.0          3.6           1.4          0.2     Setosa
    ..            ...          ...           ...          ...        ...
    145           6.7          3.0           5.2          2.3  Virginica
    146           6.3          2.5           5.0          1.9  Virginica
    147           6.5          3.0           5.2          2.0  Virginica
    148           6.2          3.4           5.4          2.3  Virginica
    149           5.9          3.0           5.1          1.8  Virginica

    [150 rows x 5 columns]

```python
dtree = tree.DecisionTreeClassifier()
dtree.fit(Xtree, ytree)

#Plot the tree
plt.figure(figsize=(15,10))
tree.plot_tree(dtree, feature_names=features, fontsize=10)
plt.show()
```


![png](/assets/output_21_0.png)

```python
# Wrap the sample in a DataFrame so it carries the fitted feature names
print(dtree.predict(pd.DataFrame([[5.5, 4, 4, 1.5]], columns=features)))
```

[1]

### Confusion Matrix
It is a table used in classification problems to assess where the model's errors were made.

The rows represent the actual classes the outcomes should have been, while the columns represent the predictions we have made. Using this table, it is easy to see which predictions are wrong.
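A tiny hand-made example (hypothetical labels, not taken from the iris data) makes the row/column convention explicit:

```python
# Rows = actual class, columns = predicted class
y_true = [0, 0, 1, 1, 2, 2]   # what the outcomes should have been
y_pred = [0, 0, 1, 2, 2, 2]   # what the model guessed

# The off-diagonal 1 in row 1, column 2 is the class-1 sample mistaken for class 2
print(confusion_matrix(y_true, y_pred))
```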

```python
clf = tree.DecisionTreeClassifier()
X_train, X_test, y_train, y_test = train_test_split(Xtree, ytree, test_size=0.33, random_state=0)
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
cm = confusion_matrix(y_test, predictions, labels=clf.classes_)

df_cfm = pd.DataFrame(cm, index = clf.classes_, columns = clf.classes_)
plt.figure(figsize = (10,7))
cfm_plot = sn.heatmap(df_cfm, annot=True)
plt.title("Decision Tree", fontsize = 22, fontweight="bold")
plt.xlabel("Predicted Label", fontsize = 22)
plt.ylabel("True Label", fontsize = 22)
sn.set(font_scale=1.4)
plt.show()
```


![png](/assets/output_24_0.png)

### AUC - ROC Curve

In classification, there are many different evaluation metrics. The most popular is accuracy, which measures how often the model is correct. This is a great metric because it is easy to understand and getting the most correct guesses is often desired. There are some cases where you might consider using another evaluation metric.

Another common metric is AUC, the area under the receiver operating characteristic (ROC) curve. The ROC curve plots the true positive (TP) rate against the false positive (FP) rate at different classification thresholds. The thresholds are different probability cutoffs that separate the two classes in binary classification. It uses probability to tell us how well a model separates the classes.
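Before tackling the three-class case, a minimal binary sketch (with hypothetical labels and scores) shows how `roc_curve` sweeps the threshold over predicted probabilities:

```python
# Each threshold yields one (FPR, TPR) point on the curve
y_true = [0, 0, 0, 1, 1, 1]               # hypothetical binary labels
y_score = [0.1, 0.4, 0.6, 0.5, 0.8, 0.9]  # hypothetical predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(thresholds)        # probability cutoffs tried
print(auc(fpr, tpr))     # area under the resulting curve
```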

```python
clf = tree.DecisionTreeClassifier()
X_train, X_test, y_train, y_test = train_test_split(Xsk, ysk, test_size=0.33, random_state=0)
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
cm = confusion_matrix(y_test, predictions, labels=clf.classes_)

df_cfm = pd.DataFrame(cm, index = clf.classes_, columns = clf.classes_)
plt.figure(figsize = (10,7))
cfm_plot = sn.heatmap(df_cfm, annot=True)
plt.title("Decision Tree", fontsize = 22, fontweight="bold")
plt.xlabel("Predicted Label", fontsize = 22)
plt.ylabel("True Label", fontsize = 22)
sn.set(font_scale=1.4)
plt.show()
```


![png](/assets/output_26_0.png)

```python
# Binarize the output (one indicator column per class)
y = label_binarize(ysk, classes=clf.classes_)
n_classes = y.shape[1]

# Shuffle and split training and test sets
X_train, X_test, y_train, y_test = train_test_split(Xsk, y, test_size=0.33, random_state=0)

# Learn to predict each class against the others
classifier = OneVsRestClassifier(clf)
y_score = classifier.fit(X_train, y_train).predict_proba(X_test)

# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
```

```python
plt.figure()
lw = 2
plt.plot(
    fpr[2],
    tpr[2],
    color="darkorange",
    lw=lw,
    label="ROC curve (area = %0.2f)" % roc_auc[2],
)
plt.plot([0, 1], [0, 1], color="navy", lw=lw, linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend(loc="lower right")
plt.show()
```


![png](/assets/output_28_0.png)

### Cross Validation

When adjusting models, we aim to increase overall performance on unseen data. Hyperparameter tuning can lead to much better performance on test sets. However, optimizing parameters against the test set can cause information leakage, making the model perform worse on unseen data. To correct for this we can perform cross validation (CV).

To better understand CV, we will be performing different methods on the iris dataset.
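To see what `cross_val_score` does under the hood in the cells below, this sketch prints the train/test index splits that a 5-fold `KFold` generates; the model is refit and scored once per fold:

```python
# Inspect the index splits behind 5-fold cross validation
k_folds = KFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(k_folds.split(Xsk)):
    # Without shuffling, each fold holds out a contiguous 30-row slice
    print(f"fold {fold}: train={len(train_idx)} test={len(test_idx)} "
          f"test rows {test_idx[0]}..{test_idx[-1]}")
```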

```python
# K-Fold Cross Validation

clf = DecisionTreeClassifier(random_state=42)

k_folds = KFold(n_splits = 5)

scores = cross_val_score(clf, Xsk, ysk, cv = k_folds)

print("Cross Validation Scores: ", scores)
print("Average CV Score: ", scores.mean())
print("Number of CV Scores used in Average: ", len(scores))
```

Cross Validation Scores: [1. 1. 0.83333333 0.93333333 0.8 ]
Average CV Score: 0.9133333333333333
Number of CV Scores used in Average: 5

```python
# Stratified K-Fold

clf = DecisionTreeClassifier(random_state=42)

sk_folds = StratifiedKFold(n_splits = 5)

scores = cross_val_score(clf, Xsk, ysk, cv = sk_folds)

print("Cross Validation Scores: ", scores)
print("Average CV Score: ", scores.mean())
print("Number of CV Scores used in Average: ", len(scores))
```

Cross Validation Scores: [0.96666667 0.96666667 0.9 0.93333333 1. ]
Average CV Score: 0.9533333333333334
Number of CV Scores used in Average: 5

```python
#Leave One Out
X, y = datasets.load_iris(return_X_y=True)

clf = DecisionTreeClassifier(random_state=42)

loo = LeaveOneOut()

scores = cross_val_score(clf, X, y, cv = loo)

print("Cross Validation Scores: ", scores)
print("Average CV Score: ", scores.mean())
print("Number of CV Scores used in Average: ", len(scores))
```

Cross Validation Scores: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1.
1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0.
1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 0. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1.]
Average CV Score: 0.94
Number of CV Scores used in Average: 150

```python
clf = DecisionTreeClassifier(random_state=42)

lpo = LeavePOut(p=2)

scores = cross_val_score(clf, Xsk, ysk, cv = lpo)

print("Cross Validation Scores: ", scores)
print("Average CV Score: ", scores.mean())
print("Number of CV Scores used in Average: ", len(scores))
```

Cross Validation Scores: [1. 1. 1. ... 1. 1. 1.]
Average CV Score: 0.9382997762863534
Number of CV Scores used in Average: 11175

### Ensemble

Ensemble methods combine several models to get more robust predictions than any single one. Below we use bagging (bootstrap aggregation), which trains many decision trees on random resamples of the training data; first, a single decision tree serves as a baseline:

```python
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(Xtree, ytree, test_size = 0.25, random_state = 22)

dtree = DecisionTreeClassifier(random_state = 22)
dtree.fit(X_train,y_train)

y_pred = dtree.predict(X_test)

print("Train data accuracy:",accuracy_score(y_true = y_train, y_pred = dtree.predict(X_train)))
print("Test data accuracy:",accuracy_score(y_true = y_test, y_pred = y_pred))
```

Train data accuracy: 1.0
Test data accuracy: 0.9210526315789473

```python
from sklearn.ensemble import BaggingClassifier

X_train, X_test, y_train, y_test = train_test_split(Xtree, ytree, test_size=0.25, random_state=22)

estimator_range = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]

models = []
scores = []

for n_estimators in estimator_range:
    # Create bagging classifier
    clf = BaggingClassifier(n_estimators=n_estimators, random_state=22)

    # Fit the model
    clf.fit(X_train, y_train)

    # Append the model and score to their respective lists
    models.append(clf)
    scores.append(accuracy_score(y_true=y_test, y_pred=clf.predict(X_test)))

# Generate the plot of scores against number of estimators
plt.figure(figsize=(9, 6))
plt.plot(estimator_range, scores)

# Adjust labels and font (to make them visible)
plt.xlabel("n_estimators", fontsize=18)
plt.ylabel("score", fontsize=18)
plt.tick_params(labelsize=16)

# Visualize plot
plt.show()
```


![png](/assets/output_36_0.png)

```python
X_train, X_test, y_train, y_test = train_test_split(Xtree, ytree, test_size = 0.25, random_state = 22)

clf = BaggingClassifier(n_estimators=12, oob_score=True, random_state=22)

clf.fit(X_train, y_train)

plt.figure(figsize=(15, 10))

plot_tree(clf.estimators_[0], feature_names = features, fontsize=14)
```

[Text(0.375, 0.9, 'petal.length <= 2.45\ngini = 0.661\nsamples = 71\nvalue = [35, 44, 33]'),
Text(0.25, 0.7, 'gini = 0.0\nsamples = 23\nvalue = [35, 0, 0]'),
Text(0.5, 0.7, 'petal.width <= 1.7\ngini = 0.49\nsamples = 48\nvalue = [0, 44, 33]'),
Text(0.25, 0.5, 'petal.length <= 5.0\ngini = 0.044\nsamples = 26\nvalue = [0, 43, 1]'),
Text(0.125, 0.3, 'gini = 0.0\nsamples = 24\nvalue = [0, 42, 0]'),
Text(0.375, 0.3, 'sepal.length <= 6.15\ngini = 0.5\nsamples = 2\nvalue = [0, 1, 1]'),
Text(0.25, 0.1, 'gini = 0.0\nsamples = 1\nvalue = [0, 1, 0]'),
Text(0.5, 0.1, 'gini = 0.0\nsamples = 1\nvalue = [0, 0, 1]'),
Text(0.75, 0.5, 'petal.length <= 4.85\ngini = 0.059\nsamples = 22\nvalue = [0, 1, 32]'),
Text(0.625, 0.3, 'gini = 0.0\nsamples = 1\nvalue = [0, 1, 0]'),
Text(0.875, 0.3, 'gini = 0.0\nsamples = 21\nvalue = [0, 0, 32]')]


![png](/assets/output_37_1.png)

## SVM

A support vector machine looks for the hyperplane that separates the classes with the largest margin; here we use the linear variant, `LinearSVC`.

```python
clf = svm.LinearSVC(max_iter=3080)
X_train, X_test, y_train, y_test = train_test_split(Xtree, ytree, test_size = 0.33, random_state=0)
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
cm = confusion_matrix(y_test, predictions, labels=clf.classes_)

df_cfm = pd.DataFrame(cm, index = clf.classes_, columns = clf.classes_)
plt.figure(figsize = (10,7))
cfm_plot = sn.heatmap(df_cfm, annot=True)
plt.title("SVM", fontsize = 22, fontweight="bold")
plt.xlabel("Predicted Label", fontsize = 22)
plt.ylabel("True Label", fontsize = 22)
sn.set(font_scale=1.4)
plt.show()
```


![png](/assets/output_39_0.png)

```python
#report = classification_report(y_test, predictions, target_names=clf.classes_, labels=clf.classes_, zero_division=0, output_dict=True)
#df = pd.DataFrame(report).transpose()
#df.to_csv("Report_SVM.csv")
```

## Random Forest

A random forest is itself an ensemble: many decision trees trained on random subsets of rows and features, whose votes are combined.

```python
clf = RandomForestClassifier()
```

```python
X_train, X_test, y_train, y_test = train_test_split(Xtree, ytree, test_size=0.33, random_state=0)
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
cm = confusion_matrix(y_test, predictions, labels=clf.classes_)

df_cfm = pd.DataFrame(cm, index = clf.classes_, columns = clf.classes_)
plt.figure(figsize = (10,7))
cfm_plot = sn.heatmap(df_cfm, annot=True)
plt.title("Random Forest", fontsize = 22, fontweight="bold")
plt.xlabel("Predicted Label", fontsize = 22)
plt.ylabel("True Label", fontsize = 22)
sn.set(font_scale=1.4)
plt.show()
```


![png](/assets/output_43_0.png)

```python
#report = classification_report(y_test, predictions, target_names=clf.classes_, labels=clf.classes_, zero_division=0, output_dict=True)
#df = pd.DataFrame(report).transpose()
#df.to_csv("Report_RF.csv")
```

## Logistic Regression

### Grid Search

Logistic regression has a regularization hyperparameter `C`; higher values mean weaker regularization. The loop below manually sweeps a range of `C` values and scores each fitted model on the full dataset:

```python
logit = LogisticRegression(max_iter=10000)

C = [0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2]

scores = []

for choice in C:
    logit.set_params(C=choice)
    logit.fit(Xsk, ysk)
    scores.append(logit.score(Xsk, ysk))

print(scores)
```

[0.9666666666666667, 0.9666666666666667, 0.9733333333333334, 0.9733333333333334, 0.98, 0.98, 0.9866666666666667, 0.9866666666666667]
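Note that the loop above scores each `C` on the same data the model was fit on, which rewards overfitting. The standard tool is `GridSearchCV`, which cross-validates every candidate instead; a minimal sketch:

```python
# Cross-validated grid search over C (sketch; 5-fold CV per candidate)
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2]}
search = GridSearchCV(LogisticRegression(max_iter=10000), param_grid, cv=5)
search.fit(Xsk, ysk)

print(search.best_params_)   # C with the best mean cross-validated accuracy
print(search.best_score_)
```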

```python
clf = LogisticRegression()
X_train, X_test, y_train, y_test = train_test_split(Xtree, ytree, test_size=0.33, random_state=0)
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
cm = confusion_matrix(y_test, predictions, labels=clf.classes_)

df_cfm = pd.DataFrame(cm, index = clf.classes_, columns = clf.classes_)
plt.figure(figsize = (10,7))
cfm_plot = sn.heatmap(df_cfm, annot=True)
plt.title("Logistic Regression", fontsize = 22, fontweight="bold")
plt.xlabel("Predicted Label", fontsize = 22)
plt.ylabel("True Label", fontsize = 22)
sn.set(font_scale=1.4)
plt.show()
```


![png](/assets/output_48_0.png)

```python
#report = classification_report(y_test, predictions, target_names=clf.classes_, labels=clf.classes_, zero_division=0, output_dict=True)
#df = pd.DataFrame(report).transpose()
#df.to_csv("Report_LR.csv")
```

## Gaussian Naïve Bayes

```python
from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()
X_train, X_test, y_train, y_test = train_test_split(Xtree, ytree, test_size=0.33, random_state=0)
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
cm = confusion_matrix(y_test, predictions, labels=clf.classes_)

df_cfm = pd.DataFrame(cm, index = clf.classes_, columns = clf.classes_)
plt.figure(figsize = (10,7))
cfm_plot = sn.heatmap(df_cfm, annot=True)
plt.title("Gaussian Naïve Bays", fontsize = 22, fontweight="bold")
plt.xlabel("Predicted Label", fontsize = 22)
plt.ylabel("True Label", fontsize = 22)
sn.set(font_scale=1.4)
plt.show()
```


![png](/assets/output_51_0.png)

```python
#report = classification_report(y_test, predictions, target_names=clf.classes_, labels=clf.classes_, zero_division=0, output_dict=True)
#df = pd.DataFrame(report).transpose()
#df.to_csv("Report_GNB.csv")
```

## KNN

k-nearest neighbors classifies a sample by a majority vote of its k closest training points; with `n_neighbors=1` each sample takes the label of its single nearest neighbor.

```python
clf = KNeighborsClassifier(n_neighbors=1)
X_train, X_test, y_train, y_test = train_test_split(Xtree, ytree, test_size =0.33, random_state=0)
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
cm = confusion_matrix(y_test, predictions, labels=clf.classes_)

df_cfm = pd.DataFrame(cm, index = clf.classes_, columns = clf.classes_)
plt.figure(figsize = (10,7))
cfm_plot = sn.heatmap(df_cfm, annot=True)
plt.title("K-NN", fontsize = 22, fontweight="bold")
plt.xlabel("Predicted Label", fontsize = 22)
plt.ylabel("True Label", fontsize = 22)
sn.set(font_scale=1.4)
plt.show()
```


![png](/assets/output_54_0.png)

```python
#report = classification_report(y_test, predictions, target_names=clf.classes_, labels=clf.classes_, zero_division=0, output_dict=True)
#df = pd.DataFrame(report).transpose()
#df.to_csv("Report_KNN.csv")
```

# Hierarchical Clustering

Hierarchical clustering is an unsupervised learning method for clustering data points. The algorithm builds clusters by measuring the dissimilarities between data points. Unsupervised learning means the model needs no labeled "target" variable, so this method can be used on any data to visualize and interpret the relationships between individual data points.
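The dendrogram below is drawn from a linkage matrix, where each row records one merge: the two clusters joined, the distance at which they merged, and the size of the new cluster. A minimal sketch on five hypothetical petal (length, width) pairs:

```python
# Each linkage row = [cluster_i, cluster_j, merge_distance, new_cluster_size]
from scipy.cluster.hierarchy import linkage

pts = [[1.4, 0.2], [1.3, 0.2], [4.7, 1.4], [4.5, 1.5], [6.0, 2.5]]  # hypothetical points
Z = linkage(pts, method='ward', metric='euclidean')
print(Z)   # four merges reduce five points to a single cluster
```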

```python
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
fig = plt.figure(figsize=(15,5))

data_to_analyze = iris_data[['petal.length', 'petal.width']]

# =============
# First subplot
# =============

ax = fig.add_subplot(1, 2, 1)
groups = AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='ward')
groups.fit_predict(data_to_analyze)
plt.scatter(iris_data['petal.length'] ,iris_data['petal.width'], c= groups.labels_, cmap='cool')

# =============
# Second subplot
# =============

ax = fig.add_subplot(1, 2, 2)
data_to_analyze = list(zip(iris_data['petal.length'], iris_data['petal.width']))
linkage_data = linkage(data_to_analyze, method='ward', metric='euclidean')
dendrogram(linkage_data)
plt.show()
```


![png](/assets/output_57_0.png)

## K-means

K-means partitions the data into k clusters by minimizing inertia, the within-cluster sum of squared distances to the centroids. The elbow method below plots inertia against k; a reasonable k sits where the curve bends.

```python
from sklearn.cluster import KMeans

inertias = []

for i in range(1, 11):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(data_to_analyze)
    inertias.append(kmeans.inertia_)

plt.plot(range(1, 11), inertias, marker='o')
plt.title('Elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()
```


![png](/assets/output_59_1.png)

```python
kmeans = KMeans(n_clusters=3)
kmeans.fit(data_to_analyze)

plt.scatter(iris_data['petal.length'], iris_data['petal.width'], c=kmeans.labels_, cmap='cool')
plt.show()
```


![png](/assets/output_60_0.png)