{"id":22039449,"url":"https://github.com/mpolinowski/python-scikitlearn-cheatsheet","last_synced_at":"2026-05-06T20:37:15.705Z","repository":{"id":170021828,"uuid":"645277749","full_name":"mpolinowski/python-scikitlearn-cheatsheet","owner":"mpolinowski","description":"SciKit Learn Machine Learning Cheat Sheet","archived":false,"fork":false,"pushed_at":"2023-06-17T11:59:37.000Z","size":21575,"stargazers_count":8,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-23T13:14:41.505Z","etag":null,"topics":["cheatsheet","data-exploration","feature-engineering","python","scikitlearn-machine-learning","sklearn"],"latest_commit_sha":null,"homepage":"https://mpolinowski.github.io/docs/Development/Python/2023-05-20-python-sklearn-cheat-sheet/2023-05-20","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mpolinowski.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-05-25T09:53:19.000Z","updated_at":"2025-03-11T22:44:33.000Z","dependencies_parsed_at":null,"dependency_job_id":"1014b06f-a235-4b66-81c7-e4217f13564e","html_url":"https://github.com/mpolinowski/python-scikitlearn-cheatsheet","commit_stats":null,"previous_names":["mpolinowski/python-scikitlearn-cheatsheet"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/mpolinowski/python-scikitlearn-cheatsheet","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mpolinowski%2Fpython-scikitlearn-cheatsheet","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mpolinowski%2Fpython-scikitlearn-cheatsheet/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mpolinowski%2Fpython-scikitlearn-cheatsheet/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mpolinowski%2Fpython-scikitlearn-cheatsheet/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mpolinowski","download_url":"https://codeload.github.com/mpolinowski/python-scikitlearn-cheatsheet/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mpolinowski%2Fpython-scikitlearn-cheatsheet/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32711672,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-06T19:35:05.142Z","status":"ssl_error","status_checked_at":"2026-05-06T19:35:03.996Z","response_time":117,"last_error":"SSL_read: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cheatsheet","data-exploration","feature-engineering","python","scikitlearn-machine-learning","sklearn"],"created_at":"2024-11-30T11:10:52.095Z","updated_at":"2026-05-06T20:37:15.687Z","avatar_url":"https://github.com/mpolinowski.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# scikit-learn - Machine Learning in Python\n\n* Simple and efficient tools for predictive data analysis\n* Accessible to everybody, and reusable in various contexts\n* Built on NumPy, SciPy, and matplotlib\n* Open source, commercially usable - BSD license\n\n\u003e Regressions ++ Classifications ++ Clustering ++ Dimensionality Reduction ++ Model Selection ++ Pre-processing\n\n\n\u003c!-- TOC --\u003e\n\n- [scikit-learn - Machine Learning in Python](#scikit-learn---machine-learning-in-python)\n  - [Working with Missing Values](#working-with-missing-values)\n    - [Missing Indicator](#missing-indicator)\n    - [Simple Imputer](#simple-imputer)\n    - [Drop Missing Data](#drop-missing-data)\n  - [Categorical Data Preprocessing](#categorical-data-preprocessing)\n    - [Ordinal Encoder](#ordinal-encoder)\n    - [Label Encoder](#label-encoder)\n    - [OneHot  Encoder](#onehot--encoder)\n  - [Loading SK Datasets](#loading-sk-datasets)\n    - [Toy Datasets](#toy-datasets)\n    - [Real World Datasets](#real-world-datasets)\n    - [OpenML Datasets](#openml-datasets)\n  - [Supervised Learning - Regression Models](#supervised-learning---regression-models)\n    - [Simple Linear Regression](#simple-linear-regression)\n      - [Data Pre-processing](#data-pre-processing)\n      - [Model Training](#model-training)\n      - [Predictions](#predictions)\n      - [Model Evaluation](#model-evaluation)\n    - [ElasticNet Regression](#elasticnet-regression)\n      - [Dataset](#dataset)\n      - [Preprocessing](#preprocessing)\n      - [Grid Search for Hyperparameters](#grid-search-for-hyperparameters)\n      - [Model Evaluation](#model-evaluation-1)\n    - [Multiple Linear Regression](#multiple-linear-regression)\n  - [Supervised Learning - Logistic Regression Model](#supervised-learning---logistic-regression-model)\n    - [Binary Logistic Regression](#binary-logistic-regression)\n      - [Dataset](#dataset-1)\n      - [Model Fitting](#model-fitting)\n      - [Model Predictions](#model-predictions)\n      - [Model Evaluation](#model-evaluation-2)\n    - [Logistic Regression Pipelines](#logistic-regression-pipelines)\n      - [Dataset Preprocessing](#dataset-preprocessing)\n    - [Pipeline](#pipeline)\n      - [Cross Validation](#cross-validation)\n        - [Train | Test Split](#train--test-split)\n        - [Model Fitting](#model-fitting-1)\n        - [Model Evaluation](#model-evaluation-3)\n        - [Adjusting Hyper Parameter](#adjusting-hyper-parameter)\n      - [Train | Validation | Test Split](#train--validation--test-split)\n        - [Model Fitting and Evaluation](#model-fitting-and-evaluation)\n        - [Adjusting Hyper Parameter](#adjusting-hyper-parameter-1)\n      - [k-fold Cross 
        - [Train-Test Split](#train-test-split)
        - [Model Scoring](#model-scoring)
        - [Adjusting Hyper Parameter](#adjusting-hyper-parameter-2)
        - [Model Fitting and Final Evaluation](#model-fitting-and-final-evaluation)
      - [Cross Validate](#cross-validate)
        - [Dataset (re-import)](#dataset-re-import)
        - [Model Scoring](#model-scoring-1)
        - [Adjusting Hyper Parameter](#adjusting-hyper-parameter-3)
        - [Model Fitting and Final Evaluation](#model-fitting-and-final-evaluation-1)
      - [Grid Search](#grid-search)
        - [Hyperparameter Search](#hyperparameter-search)
        - [Model Evaluation](#model-evaluation-4)
  - [Supervised Learning - KNN Algorithm](#supervised-learning---knn-algorithm)
    - [Dataset](#dataset-2)
    - [Data Pre-processing](#data-pre-processing-1)
    - [Model Fitting](#model-fitting-2)
  - [Supervised Learning - Decision Tree Classifier](#supervised-learning---decision-tree-classifier)
    - [Dataset](#dataset-3)
    - [Preprocessing](#preprocessing-1)
    - [Model Fitting](#model-fitting-3)
    - [Evaluation](#evaluation)
  - [Supervised Learning - Random Forest Classifier](#supervised-learning---random-forest-classifier)
    - [Dataset](#dataset-4)
    - [Preprocessing](#preprocessing-2)
    - [Model Fitting](#model-fitting-4)
    - [Evaluation](#evaluation-1)
    - [Random Forest Hyperparameter Tuning](#random-forest-hyperparameter-tuning)
      - [Testing Hyperparameters](#testing-hyperparameters)
      - [Grid-Search Cross-Validation](#grid-search-cross-validation)
    - [Random Forest Classifier 1 - Penguins](#random-forest-classifier-1---penguins)
      - [Feature Importance](#feature-importance)
      - [Model Evaluation](#model-evaluation-5)
    - [Random Forest Classifier - Banknote Authentication](#random-forest-classifier---banknote-authentication)
      - [Grid Search for Hyperparameters](#grid-search-for-hyperparameters-1)
      - [Model Training and Evaluation](#model-training-and-evaluation)
      - [Optimizations](#optimizations)
    - [Random Forest Regressor](#random-forest-regressor)
      - [vs Linear Regression](#vs-linear-regression)
      - [vs Polynomial Regression](#vs-polynomial-regression)
      - [vs KNeighbors Regression](#vs-kneighbors-regression)
      - [vs Decision Tree Regression](#vs-decision-tree-regression)
      - [vs Support Vector Regression](#vs-support-vector-regression)
      - [vs Gradient Boosting Regression](#vs-gradient-boosting-regression)
      - [vs Ada Boosting Regression](#vs-ada-boosting-regression)
      - [Finally, Random Forest Regression](#finally-random-forrest-regression)
  - [Supervised Learning - SVC Model](#supervised-learning---svc-model)
    - [Dataset](#dataset-5)
      - [Preprocessing](#preprocessing-3)
      - [Model Training](#model-training-1)
      - [Model Evaluation](#model-evaluation-6)
    - [Margin Plots for Support Vector Classifier](#margin-plots-for-support-vector-classifier)
      - [SVC with a Linear Kernel](#svc-with-a-linear-kernel)
      - [SVC with a Radial Basis Function Kernel](#svc-with-a-radial-basis-function-kernel)
      - [SVC with a Sigmoid Kernel](#svc-with-a-sigmoid-kernel)
      - [SVC with a Polynomial Kernel](#svc-with-a-polynomial-kernel)
    - [Grid Search for Support Vector Classifier](#grid-search-for-support-vector-classifier)
    - [Support Vector Regression](#support-vector-regression)
      - [Base Model Run](#base-model-run)
      - [Grid Search for better Hyperparameter](#grid-search-for-better-hyperparameter)
    - [Example Task - Wine Fraud](#example-task---wine-fraud)
      - [Data Exploration](#data-exploration)
      - [Regression Model](#regression-model)
  - [Supervised Learning - Boosting Methods](#supervised-learning---boosting-methods)
    - [Dataset Exploration](#dataset-exploration)
    - [Adaptive Boosting](#adaptive-boosting)
      - [Feature Exploration](#feature-exploration)
      - [Optimizing Hyperparameters](#optimizing-hyperparameters)
    - [Gradient Boosting](#gradient-boosting)
      - [Gridsearch for best Hyperparameter](#gridsearch-for-best-hyperparameter)
      - [Feature Importance](#feature-importance-1)
  - [Supervised Learning - Naive Bayes NLP](#supervised-learning---naive-bayes-nlp)
    - [Feature Extraction](#feature-extraction)
      - [CountVectorizer \& TfidfTransformer](#countvectorizer--tfidftransformer)
      - [TfidfVectorizer](#tfidfvectorizer)
      - [Dataset Exploration](#dataset-exploration-1)
      - [Data Preprocessing](#data-preprocessing)
      - [TFIDF Vectorizer](#tfidf-vectorizer)
      - [Model Comparison](#model-comparison)
      - [Model Deployment](#model-deployment)
    - [Text Classification](#text-classification)
      - [Data Exploration](#data-exploration-1)
      - [Top 30 Features by Label](#top-30-features-by-label)
      - [Data Preprocessing](#data-preprocessing-1)
      - [Model Training](#model-training-2)
  - [Unsupervised Learning - KMeans Clustering](#unsupervised-learning---kmeans-clustering)
    - [Dataset Exploration](#dataset-exploration-2)
    - [Dataset Preprocessing](#dataset-preprocessing-1)
    - [Model Training](#model-training-3)
    - [Choosing a K Value](#choosing-a-k-value)
      - [Re-fitting the Model](#re-fitting-the-model)
    - [Example 1 : Color Quantization](#example-1--color-quantization)
    - [Example 2 : Country Clustering](#example-2--country-clustering)
      - [Dataset Exploration](#dataset-exploration-3)
      - [Dataset Preprocessing](#dataset-preprocessing-2)
      - [Model Training](#model-training-4)
      - [Model Evaluation](#model-evaluation-7)
      - [Plotly Choropleth Map](#plotly-choropleth-map)
  - [Unsupervised Learning - Agglomerative Clustering](#unsupervised-learning---agglomerative-clustering)
    - [Dataset Preprocessing](#dataset-preprocessing-3)
    - [Assigning Cluster Labels](#assigning-cluster-labels)
      - [Known Number of Clusters](#known-number-of-clusters)
      - [Unknown Number of Clusters](#unknown-number-of-clusters)
  - [Unsupervised Learning - Density-based Spatial Clustering (DBSCAN)](#unsupervised-learning---density-based-spatial-clustering-dbscan)
    - [DBSCAN vs KMeans](#dbscan-vs-kmeans)
    - [DBSCAN Hyperparameter Tuning](#dbscan-hyperparameter-tuning)
      - [Elbow Plot](#elbow-plot)
    - [Realworld Dataset](#realworld-dataset)
      - [Dataset Exploration](#dataset-exploration-4)
      - [Data Preprocessing](#data-preprocessing-2)
      - [Model Hyperparameter Tuning](#model-hyperparameter-tuning)
  - [Dimension Reduction - Principal Component Analysis (PCA)](#dimension-reduction---principal-component-analysis-pca)
    - [Dataset Preprocessing](#dataset-preprocessing-4)
    - [Model Fitting](#model-fitting-5)
    - [Dataset 2](#dataset-2)
      - [Dataset 2 Preprocessing](#dataset-2-preprocessing)
      - [Model Fitting](#model-fitting-6)

<!-- /TOC -->
```python
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import numpy as np
import pandas as pd
import plotly.express as px
import seaborn as sns
from sklearn import svm
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris, load_wine, fetch_20newsgroups, fetch_openml
from sklearn.impute import MissingIndicator, SimpleImputer
from sklearn.ensemble import (
    RandomForestClassifier,
    RandomForestRegressor,
    GradientBoostingRegressor,
    AdaBoostRegressor,
    GradientBoostingClassifier,
    AdaBoostClassifier
)
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer
)
from sklearn.linear_model import (
    LinearRegression,
    LogisticRegression,
    Ridge,
    ElasticNet
)
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    classification_report,
    confusion_matrix,
    ConfusionMatrixDisplay,
    accuracy_score
)
from sklearn.model_selection import (
    train_test_split,
    GridSearchCV,
    cross_val_score,
    cross_validate
)
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import (
    MinMaxScaler,
    StandardScaler,
    OrdinalEncoder,
    LabelEncoder,
    OneHotEncoder,
    PolynomialFeatures
)
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
```

## Working with Missing Values

```python
X_missing = pd.DataFrame(
    np.array([5,2,3,np.NaN,np.NaN,4,-3,2,1,8,np.NaN,4,10,np.NaN,5]).reshape(5,3)
)
X_missing.columns = ['f1','f2','f3']

X_missing
```

|  | f1 | f2 | f3 |
| -- | -- | -- | -- |
| 0 | 5.0 | 2.0 | 3.0 |
| 1 | NaN | NaN | 4.0 |
| 2 | -3.0 | 2.0 | 1.0 |
| 3 | 8.0 | NaN | 4.0 |
| 4 | 10.0 | NaN | 5.0 |

```python
X_missing.isnull().sum()

# f1    1
# f2    3
# f3    0
# dtype: int64
```

### Missing Indicator

```python
# by default only features with missing values (here: f1, f2) get an indicator column
indicator = MissingIndicator(missing_values=np.NaN)
indicator = indicator.fit_transform(X_missing)
indicator = pd.DataFrame(indicator, columns=['a1', 'a2'])
indicator
```

|  | a1 | a2 |
| -- | -- | -- |
| 0 | False | False |
| 1 | True | True |
| 2 | False | False |
| 3 | False | True |
| 4 | False | True |


### Simple Imputer

```python
imputer_mean = SimpleImputer(missing_values=np.NaN, strategy='mean')
X_filled_mean = pd.DataFrame(imputer_mean.fit_transform(X_missing))
X_filled_mean.columns = ['f1','f2','f3']
X_filled_mean
```

|  | f1 | f2 | f3 |
| -- | -- | -- | -- |
| 0 | 5.0 | 2.0 | 3.0 |
| 1 | 5.0 | 2.0 | 4.0 |
| 2 | -3.0 | 2.0 | 1.0 |
| 3 | 8.0 | 2.0 | 4.0 |
| 4 | 10.0 | 2.0 | 5.0 |

```python
imputer_median = SimpleImputer(missing_values=np.NaN, strategy='median')
X_filled_median = pd.DataFrame(imputer_median.fit_transform(X_missing))
X_filled_median.columns = ['f1','f2','f3']
X_filled_median
```

|  | f1 | f2 | f3 |
| -- | -- | -- | -- |
| 0 | 5.0 | 2.0 | 3.0 |
| 1 | 6.5 | 2.0 | 4.0 |
| 2 | -3.0 | 2.0 | 1.0 |
| 3 | 8.0 | 2.0 | 4.0 |
| 4 | 10.0 | 2.0 | 5.0 |

```python
imputer_freq = SimpleImputer(missing_values=np.NaN, strategy='most_frequent')
X_filled_freq = pd.DataFrame(imputer_freq.fit_transform(X_missing))
X_filled_freq.columns = ['f1','f2','f3']
X_filled_freq
```

|  | f1 | f2 | f3 |
| -- | -- | -- | -- |
| 0 | 5.0 | 2.0 | 3.0 |
| 1 | -3.0 | 2.0 | 4.0 |
| 2 | -3.0 | 2.0 | 1.0 |
| 3 | 8.0 | 2.0 | 4.0 |
| 4 | 10.0 | 2.0 | 5.0 |
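`SimpleImputer` also supports a `constant` strategy; a minimal sketch (the fill value `0` is an arbitrary choice for this example):

```python
imputer_const = SimpleImputer(missing_values=np.NaN, strategy='constant', fill_value=0)
X_filled_const = pd.DataFrame(imputer_const.fit_transform(X_missing))
X_filled_const.columns = ['f1','f2','f3']

# every NaN is replaced by the constant fill value 0
X_filled_const
```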

### Drop Missing Data

```python
X_missing_dropped = X_missing.dropna(axis=1)
X_missing_dropped
```

|  | f3 |
| -- | -- |
| 0 | 3.0 |
| 1 | 4.0 |
| 2 | 1.0 |
| 3 | 4.0 |
| 4 | 5.0 |

```python
X_missing_dropped = X_missing.dropna(axis=0).reset_index(drop=True)
X_missing_dropped
```

|   | f1 | f2 | f3 |
| -- | -- | -- | -- |
| 0 | 5.0 | 2.0 | 3.0 |
| 1 | -3.0 | 2.0 | 1.0 |


## Categorical Data Preprocessing

```python
X_cat_df = pd.DataFrame(
    np.array([
        ['M', 'O-', 'medium'],
        ['M', 'O-', 'high'],
        ['F', 'O+', 'high'],
        ['F', 'AB', 'low'],
        ['F', 'B+', 'medium']
    ])
)

X_cat_df.columns = ['f1','f2','f3']

X_cat_df
```

|  | f1 | f2 | f3 |
| -- | -- | -- | -- |
| 0 | M | O- | medium |
| 1 | M | O- | high |
| 2 | F | O+ | high |
| 3 | F | AB | low |
| 4 | F | B+ | medium |


### Ordinal Encoder

```python
encoder_ord = OrdinalEncoder(dtype='int')

# categories are sorted alphabetically: high=0, low=1, medium=2
X_cat_df.f3 = encoder_ord.fit_transform(X_cat_df.f3.values.reshape(-1, 1))
X_cat_df
```

|  | f1 | f2 | f3 |
| -- | -- | -- | -- |
| 0 | M | O- | 2 |
| 1 | M | O- | 0 |
| 2 | F | O+ | 0 |
| 3 | F | AB | 1 |
| 4 | F | B+ | 2 |


### Label Encoder

```python
encoder_lab = LabelEncoder()
X_cat_df['f2'] = encoder_lab.fit_transform(X_cat_df['f2'])
X_cat_df
```

|  | f1 | f2 | f3 |
| -- | -- | -- | -- |
| 0 | M | 3 | 2 |
| 1 | M | 3 | 0 |
| 2 | F | 2 | 0 |
| 3 | F | 0 | 1 |
| 4 | F | 1 | 2 |


### OneHot Encoder

```python
encoder_oh = OneHotEncoder(dtype='int')

onehot_df = pd.DataFrame(
    encoder_oh.fit_transform(X_cat_df[['f1']])
    .toarray(),
    columns=['F', 'M']
)

onehot_df['f2'] = X_cat_df.f2
onehot_df['f3'] = X_cat_df.f3
onehot_df
```

|   | F | M | f2 | f3 |
| -- | -- | -- | -- | -- |
| 0 | 0 | 1 | 3 | 2 |
| 1 | 0 | 1 | 3 | 0 |
| 2 | 1 | 0 | 2 | 0 |
| 3 | 1 | 0 | 0 | 1 |
| 4 | 1 | 0 | 1 | 2 |
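Instead of hard-coding the column names `['F', 'M']`, the fitted encoder can report them itself; a sketch using `get_feature_names_out` (available in newer scikit-learn releases):

```python
encoder_oh = OneHotEncoder(dtype='int')
onehot_arr = encoder_oh.fit_transform(X_cat_df[['f1']]).toarray()

# the encoder derives the names from the input column: 'f1_F', 'f1_M'
onehot_df = pd.DataFrame(onehot_arr, columns=encoder_oh.get_feature_names_out(['f1']))
onehot_df
```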

## Loading SK Datasets

### Toy Datasets

|  |  |  |
| -- | -- | -- |
| load_iris(*[, return_X_y, as_frame]) | classification | Load and return the iris dataset. |
| load_diabetes(*[, return_X_y, as_frame, scaled]) | regression | Load and return the diabetes dataset. |
| load_digits(*[, n_class, return_X_y, as_frame]) | classification | Load and return the digits dataset. |
| load_linnerud(*[, return_X_y, as_frame]) | multi-output regression | Load and return the physical exercise Linnerud dataset. |
| load_wine(*[, return_X_y, as_frame]) | classification | Load and return the wine dataset. |
| load_breast_cancer(*[, return_X_y, as_frame]) | classification | Load and return the breast cancer wisconsin dataset. |

```python
iris_ds = load_iris()
iris_data = iris_ds.data
col_names = iris_ds.feature_names
target_names = iris_ds.target_names

print(
    'Iris Dataset',
    '\n * Data array: ',
    iris_data.shape,
    '\n * Column names: ',
    col_names,
    '\n * Target names: ',
    target_names
)

# Iris Dataset 
#  * Data array:  (150, 4) 
#  * Column names:  ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'] 
#  * Target names:  ['setosa' 'versicolor' 'virginica']
```

```python
iris_df = pd.DataFrame(data=iris_data, columns=col_names)

iris_df.head()
```

|  | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
| -- | -- | -- | -- | -- |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


### Real World Datasets

|  |  |  |
| -- | -- | -- |
| fetch_olivetti_faces(*[, data_home, ...]) | classification | Load the Olivetti faces data-set from AT&T. |
| fetch_20newsgroups(*[, data_home, subset, ...]) | classification | Load the filenames and data from the 20 newsgroups dataset. |
| fetch_20newsgroups_vectorized(*[, subset, ...]) | classification | Load and vectorize the 20 newsgroups dataset. |
| fetch_lfw_people(*[, data_home, funneled, ...]) | classification | Load the Labeled Faces in the Wild (LFW) people dataset. |
| fetch_lfw_pairs(*[, subset, data_home, ...]) | classification | Load the Labeled Faces in the Wild (LFW) pairs dataset. |
| fetch_covtype(*[, data_home, ...]) | classification | Load the covertype dataset. |
| fetch_rcv1(*[, data_home, subset, ...]) | classification | Load the RCV1 multilabel dataset. |
| fetch_kddcup99(*[, subset, data_home, ...]) | classification | Load the kddcup99 dataset. |
| fetch_california_housing(*[, data_home, ...]) | regression | Load the California housing dataset. |
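All of these loaders share a common interface; as a sketch (the dataset is downloaded on first use), `fetch_california_housing` can return ready-made pandas objects via `as_frame=True`:

```python
from sklearn.datasets import fetch_california_housing

# as_frame=True returns a DataFrame/Series instead of numpy arrays
housing = fetch_california_housing(as_frame=True)

print(housing.data.shape, housing.target.shape)
# (20640, 8) (20640,)
```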

```python
newsgroups_train = fetch_20newsgroups(subset='train')
train_data = newsgroups_train.data
col_names = newsgroups_train.filenames.shape
target_names = newsgroups_train.target.shape

print(
    'Newsgroup - Train Subset',
    '\n * Data array: ',
    len(train_data),
    '\n * Column names: ',
    col_names,
    '\n * Target names: ',
    target_names
)

# Newsgroup - Train Subset 
#  * Data array:  11314 
#  * Column names:  (11314,) 
#  * Target names:  (11314,)
```

```python
print('Target Names: ', newsgroups_train.target_names)

# Target Names:  ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
```

### OpenML Datasets

* [openml.org](https://openml.org/search?type=data&sort=runs&status=active)
* [Mice Protein Dataset](https://openml.org/search?type=data&status=active&id=40966)

```python
mice_ds = fetch_openml(name='miceprotein', version=4, parser="auto")
```

```python
print(
    'Mice Protein Dataset',
    '\n * Data Shape: ',
    mice_ds.data.shape,
    '\n * Target Shape: ',
    mice_ds.target.shape,
    '\n * Target Names: ',
    np.unique(mice_ds.target)
)

# Mice Protein Dataset 
#  * Data Shape:  (1080, 77) 
#  * Target Shape:  (1080,) 
#  * Target Names:  ['c-CS-m' 'c-CS-s' 'c-SC-m' 'c-SC-s' 't-CS-m' 't-CS-s' 't-SC-m' 't-SC-s']

```

```python
print(mice_ds.DESCR)
```

## Supervised Learning - Regression Models

### Simple Linear Regression

```python
iris_df.plot(
    figsize=(12,5),
    kind='scatter',
    x='sepal length (cm)',
    y='sepal width (cm)',
    title='Iris Dataset :: Sepal Width & Length'
)

print(iris_df.corr())
```

> The __Sepal Width__ has very little correlation to any other metric, while the other three correlate nicely:

|  | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
| -- | -- | -- | -- | -- |
| sepal length (cm) | 1.000000 | -0.117570 | 0.871754 | 0.817941 |
| sepal width (cm) | -0.117570 | 1.000000 | -0.428440 | -0.366126 |
| petal length (cm) | 0.871754 | -0.428440 | 1.000000 | 0.962865 |
| petal width (cm) | 0.817941 | -0.366126 | 0.962865 | 1.000000 |

![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_01.webp)

![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_02.webp)
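The same correlation matrix can also be rendered as a heatmap with seaborn (already imported above) — a quick sketch:

```python
plt.figure(figsize=(8,6))
sns.heatmap(iris_df.corr(), annot=True, cmap='coolwarm')
plt.title('Iris Dataset :: Feature Correlation')
```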

#### Data Pre-processing

```python
iris_df['petal length (cm)'][:1]
# 0    1.4
# Name: petal length (cm), dtype: float64
```

```python
iris_df['petal length (cm)'].values.reshape(-1,1)[:1]
# array([[1.4]])
```

```python
# scikit-learn expects a 2D input => reshape the feature column
X = iris_df['petal length (cm)'].values.reshape(-1,1)
y = iris_df['petal width (cm)'].values.reshape(-1,1)
```

```python
# train/test split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
print(X_train.shape, X_test.shape)
# (120, 1) (30, 1) 80:20 split
```

#### Model Training

```python
regressor = LinearRegression()
regressor.fit(X_train,y_train)

intercept = regressor.intercept_
slope = regressor.coef_

print(' Intercept: ', intercept, '\n Slope: ', slope)
#  Intercept:  [-0.35135666] 
#  Slope:  [[0.41310505]]
```

#### Predictions

```python
y_pred = regressor.predict([X_test[0]])
print(' Prediction: ', y_pred, '\n True Value: ', y_test[0])
#  Prediction:  [[0.22699041]] 
#  True Value:  [0.2]
```

```python
def predict(value):
    return (slope*value + intercept)[0][0]
```

```python
print('Prediction: ', predict(X_test[0]))
# Prediction:  0.22699041280334376
```

```python
iris_df['petal width (cm) prediction'] = iris_df['petal length (cm)'].apply(predict)
print(' Prediction: ', iris_df['petal width (cm) prediction'][0], '\n True Value: ', iris_df['petal width (cm)'][0])
#  Prediction:  0.22699041280334376 
#  True Value:  0.2
```

```python
iris_df.head(10)
```

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | petal width (cm) prediction |
| -- | -- | -- | -- | -- | -- |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0.226990 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0.226990 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0.185680 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0.268301 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0.226990 |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | 0.350922 |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | 0.226990 |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | 0.268301 |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | 0.226990 |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | 0.268301 |

```python
iris_df.plot(
    figsize=(12,5),
    kind='scatter',
    x='petal width (cm)',
    y='petal width (cm) prediction',
    # colorizing adds no information - it just looks pretty
    c='petal width (cm) prediction',
    colormap='summer',
    title='Iris Dataset - Petal Width True vs Prediction'
)
```

![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_03.webp)


#### Model Evaluation

```python
mae = mean_absolute_error(
    iris_df['petal width (cm)'],
    iris_df['petal width (cm) prediction']
)

mse = mean_squared_error(
    iris_df['petal width (cm)'],
    iris_df['petal width (cm) prediction']
)

rmse = np.sqrt(mse)

print(' MAE: ', mae, '\n MSE: ', mse, '\n RMSE: ', rmse)

#  MAE:  0.1569441318761155 
#  MSE:  0.04209214667485277 
#  RMSE:  0.2051637070118708
```
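Note that these metrics are computed over the full dataset, training samples included; scoring only the held-out test split gives a less optimistic estimate (a sketch reusing the fitted regressor):

```python
# evaluate on the 20% of samples the model has never seen
y_test_pred = regressor.predict(X_test)

print(
    ' Test MAE: ', mean_absolute_error(y_test, y_test_pred),
    '\n Test RMSE: ', np.sqrt(mean_squared_error(y_test, y_test_pred))
)
```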

### ElasticNet Regression

#### Dataset

```python
!wget https://raw.githubusercontent.com/Satish-Vennapu/DataScience/main/AMES_Final_DF.csv -P datasets
```

```python
ames_df = pd.read_csv('datasets/AMES_Final_DF.csv')
ames_df.head(5).transpose()
```

|  | 0 | 1 | 2 | 3 | 4 |
| -- | -- | -- | -- | -- | -- |
| Lot Frontage | 141.0 | 80.0 | 81.0 | 93.0 | 74.0 |
| Lot Area | 31770.0 | 11622.0 | 14267.0 | 11160.0 | 13830.0 |
| Overall Qual | 6.0 | 5.0 | 6.0 | 7.0 | 5.0 |
| Overall Cond | 5.0 | 6.0 | 6.0 | 5.0 | 5.0 |
| Year Built | 1960.0 | 1961.0 | 1958.0 | 1968.0 | 1997.0 |
| ... |
| Sale Condition_AdjLand | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Sale Condition_Alloca | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Sale Condition_Family | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Sale Condition_Normal | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Sale Condition_Partial | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
_274 rows × 5 columns_

```python
# the target value is:
ames_df['SalePrice']
```

|  |  |
| -- | -- |
|0 | 215000 |
|1 | 105000 |
|2 | 172000 |
|3 | 244000 |
|4 | 189900 |
|    ...    |
|2920 | 142500 |
|2921 | 131000 |
|2922 | 132000 |
|2923 | 170000 |
|2924 | 188000 |
_Name: SalePrice, Length: 2925, dtype: int64_


#### Preprocessing

```python
# remove target column from training dataset
X_ames = ames_df.drop('SalePrice', axis=1)
y_ames = ames_df['SalePrice']

print(X_ames.shape, y_ames.shape)
# (2925, 273) (2925,)
```

```python
# train/test split
X_ames_train, X_ames_test, y_ames_train, y_ames_test = train_test_split(
    X_ames,
    y_ames,
    test_size=0.1,
    random_state=101
)

print(X_ames_train.shape, X_ames_test.shape)
# (2632, 273) (293, 273)
```

```python
# normalize feature set
scaler = StandardScaler()
X_ames_train_scaled = scaler.fit_transform(X_ames_train)

X_ames_test_scaled = scaler.transform(X_ames_test)
```

#### Grid Search for Hyperparameters

```python
base_ames_elastic_net_model = ElasticNet(max_iter=int(1e4))
```

```python
param_grid = {
    'alpha': [50, 75, 100, 125, 150],
    'l1_ratio':[0.2, 0.4, 0.6, 0.8, 1.0]
}
```

```python
grid_ames_model = GridSearchCV(
    estimator=base_ames_elastic_net_model,
    param_grid=param_grid,
    scoring='neg_mean_squared_error',
    cv=5, verbose=1
)

grid_ames_model.fit(X_ames_train_scaled, y_ames_train)

print(
    'Results:\nBest Estimator: ',
    grid_ames_model.best_estimator_,
    '\nBest Hyperparameter: ',
    grid_ames_model.best_params_
)
```

__Results__:
* Best Estimator:  `ElasticNet(alpha=125, l1_ratio=1.0, max_iter=10000)`
* Best Hyperparameter:  `{'alpha': 125, 'l1_ratio': 1.0}`


#### Model Evaluation

```python
y_ames_pred = grid_ames_model.predict(X_ames_test_scaled)

print(
    'MAE: ',
    mean_absolute_error(y_ames_test, y_ames_pred),
    'MSE: ',
    mean_squared_error(y_ames_test, y_ames_pred),
    'RMSE: ',
    np.sqrt(mean_squared_error(y_ames_test, y_ames_pred))
)

# MAE:  14185.506207185055 MSE:  422714457.5190704 RMSE:  20560.020854052418
```

```python
# average SalePrice
np.mean(ames_df['SalePrice'])
# 180815.53743589742

rel_error_avg = mean_absolute_error(y_ames_test, y_ames_pred) * 100 / np.mean(ames_df['SalePrice'])
print('Predictions are on average off by: ', rel_error_avg.round(2), '%')
# Predictions are on average off by:  7.85 %
```

```python
plt.figure(figsize=(10,4))

plt.scatter(y_ames_test,y_ames_pred, c='mediumspringgreen', s=3)
plt.axline((0, 0), slope=1, color='dodgerblue', linestyle=(':'))

plt.title('Prediction Accuracy :: MAE: ' + str(mean_absolute_error(y_ames_test, y_ames_pred).round(2)) + ' US$')
plt.xlabel('True Sales Price')
plt.ylabel('Predicted Sales Price')
plt.savefig('assets/Scikit_Learn_11.webp', bbox_inches='tight')
```

![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_11.webp)
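Since the search settled on `l1_ratio=1.0` (pure lasso), many coefficients are driven to exactly zero; a quick way to count the surviving features (a sketch):

```python
# best_estimator_ is the winning model, refit on the full training set
best_coefs = grid_ames_model.best_estimator_.coef_

print('Features kept: ', np.sum(best_coefs != 0), ' of ', len(best_coefs))
```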

### Multiple Linear Regression

Above, the `petal length` was used to predict the `petal width` in a simple linear regression. As explored earlier, the `sepal length` can be added as a second feature (only the `sepal width` shows no linear correlation):

```python
print(iris_df.corr())
```

|  | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
| -- | -- | -- | -- | -- |
| sepal length (cm) | 1.000000 | -0.117570 | 0.871754 | 0.817941 |
| sepal width (cm) | -0.117570 | 1.000000 | -0.428440 | -0.366126 |
| petal length (cm) | 0.871754 | -0.428440 | 1.000000 | 0.962865 |
| petal width (cm) | 0.817941 | -0.366126 | 0.962865 | 1.000000 |

```python
X_multi = iris_df[['petal length (cm)', 'sepal length (cm)']]
y = iris_df['petal width (cm)']
```

```python
regressor_multi = LinearRegression()
regressor_multi.fit(X_multi, y)

intercept_multi = regressor_multi.intercept_
slope_multi = regressor_multi.coef_

print(' Intercept: ', intercept_multi, '\n Slope: ', slope_multi)

#  Intercept:  -0.00899597269816943 
#  Slope:  [ 0.44937611 -0.08221782]
```

```python
def predict_multi(petal_length, sepal_length):
    return (slope_multi[0]*petal_length + slope_multi[1]*sepal_length + intercept_multi)
```

```python
y_pred = predict_multi(
    iris_df['petal length (cm)'][0],
    iris_df['sepal length (cm)'][0]
)

print(' Prediction: ', y_pred, '\n True value: ', iris_df['petal width (cm)'][0])
#  Prediction:  0.20081970121763193 
#  True value:  0.2
```

```python
iris_df['petal width (cm) prediction (multi)'] = (
    (
        slope_multi[0] * iris_df['petal length (cm)']
    ) + (
        slope_multi[1] * iris_df['sepal length (cm)']
    ) + (
        intercept_multi
    ) 
)
```

```python
iris_df.head(10)
```

|    | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | petal width (cm) prediction | petal width (cm) prediction (multi) |
| -- | -- | -- | -- | -- | -- | -- |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0.226990 | 0.200820 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0.226990 | 0.217263 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0.185680 | 0.188769 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0.268301 | 0.286866 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0.226990 | 0.209041 |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | 0.350922 | 0.310967 |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | 0.226990 | 0.241929 |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | 0.268301 | 0.253979 |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | 0.226990 | 0.258372 |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | 0.268301 | 0.262201 |

```python
iris_df.plot(
    figsize=(12,5),
    kind='scatter',
    x='petal width (cm)',
    y='petal width (cm) prediction (multi)',
    c='petal width (cm) prediction',
    colormap='summer',
    title='Iris Dataset - Petal Width True vs Prediction (multi)'
)
```

![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_04.webp)
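The column computed above just spells out the fitted model's own formula; `regressor_multi.predict` should therefore return the same values — a quick sanity check (sketch):

```python
# compare the hand-rolled formula against the model's predict()
print(regressor_multi.predict(X_multi[:1]))
# should match iris_df['petal width (cm) prediction (multi)'][0]
```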

```python
mae_multi = mean_absolute_error(
    iris_df['petal width (cm)'],
    iris_df['petal width (cm) prediction (multi)']
)

mse_multi = mean_squared_error(
    iris_df['petal width (cm)'],
    iris_df['petal width (cm) prediction (multi)']
)

rmse_multi = np.sqrt(mse_multi)

print(' MAE_Multi: ', mae_multi,' MAE: ', mae, '\n MSE_Multi: ', mse_multi, ' MSE: ', mse, '\n RMSE_Multi: ', rmse_multi, ' RMSE: ', rmse)
```

The accuracy of the model was improved by adding an additional, correlating feature:

|          | Multi Regression   | Single Regression |
| --       | --                 | --                |
| Mean Absolute Error | 0.15562108079300102 | 0.1569441318761155 |
| Mean Squared Error | 0.04096208526408982 | 0.04209214667485277 |
| Root Mean Squared Error | 0.20239092189149646 | 0.2051637070118708 |


## Supervised Learning - Logistic Regression Model

### Binary Logistic Regression

#### Dataset

```python
np.random.seed(666)

# generate 10 integer feature values in the range 0-9
x_data_logistic_binary = np.random.randint(10, size=(10)).reshape(-1, 1)
# generate a binary class label for each value
y_data_logistic_binary = np.random.randint(2, size=10)
```

#### Model Fitting

```python
logistic_binary_model = LogisticRegression(
    solver='liblinear',
    C=10.0,
    random_state=0
)

logistic_binary_model.fit(x_data_logistic_binary, y_data_logistic_binary)

intercept_logistic_binary = logistic_binary_model.intercept_
slope_logistic_binary = logistic_binary_model.coef_

print(' Intercept: ', intercept_logistic_binary, '\n Slope: ', slope_logistic_binary)

#  Intercept:  [-0.4832956] 
#  Slope:  [[0.11180522]]
```

#### Model Predictions

```python
prob_pred_logistic_binary = logistic_binary_model.predict_proba(x_data_logistic_binary)
y_pred_logistic_binary = logistic_binary_model.predict(x_data_logistic_binary)


print('Prediction Probabilities: ', prob_pred_logistic_binary[:1])

unique, counts = np.unique(y_pred_logistic_binary, return_counts=True)
print('Classes: ', unique, '| Number of Class Instances: ', counts)

# probabilities e.g. below -> 58% certainty that the first element is class 0

# Prediction Probabilities:  [[0.58097284 0.41902716]]
# Classes:  [0 1] | Number of Class Instances:  [5 5]
```
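For a binary model, `predict` is equivalent to thresholding the positive-class probability at 0.5; a quick check (sketch):

```python
# class 1 is predicted whenever its probability exceeds 0.5
manual_pred = (prob_pred_logistic_binary[:, 1] > 0.5).astype(int)

print(np.array_equal(manual_pred, y_pred_logistic_binary))
# expected: True
```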

#### Model Evaluation

```python
conf_mtx = confusion_matrix(y_data_logistic_binary, y_pred_logistic_binary)
conf_mtx

# rows = true class, columns = predicted class:
# [2, 3] [TN, FP]
# [3, 2] [FN, TP]
```

![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/confusion-matrix.webp)

```python
report = classification_report(y_data_logistic_binary, y_pred_logistic_binary)
print(report)
```

|              | precision | recall | f1-score | support |
|    --        |  --  |  --  |  --  |  --  |
| 0            | 0.40 | 0.40 | 0.40 |   5  |
| 1            | 0.40 | 0.40 | 0.40 |   5  |
| accuracy     |      |      | 0.40 |  10  |
| macro avg    | 0.40 | 0.40 | 0.40 |  10  |
| weighted avg | 0.40 | 0.40 | 0.40 |  10  |


### Logistic Regression Pipelines

#### Dataset Preprocessing

```python
iris_ds = load_iris()

# train/test split
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(
    iris_ds.data,
    iris_ds.target,
    test_size=0.2,
    random_state=42
)
print(X_train_iris.shape, X_test_iris.shape)
# (120, 4) (30, 4)
```

### Pipeline

```python
pipe_iris = Pipeline([
    ('minmax', MinMaxScaler()),
    ('log_reg', LogisticRegression()),
])

pipe_iris.fit(X_train_iris, y_train_iris)
```

```python
iris_score = pipe_iris.score(X_test_iris, y_test_iris)
print('Prediction Accuracy: ', iris_score.round(4)*100, '%')
# Prediction Accuracy:  96.67 %
```

#### Cross Validation

##### Train | Test Split

```python
!wget https://raw.githubusercontent.com/reisanar/datasets/master/Advertising.csv -P datasets
```

```python
adv_df = pd.read_csv('datasets/Advertising.csv')
adv_df.head(5)
```

|  | TV | Radio | Newspaper | Sales |
| -- | -- | -- | -- | -- |
| 0 | 230.1 | 37.8 | 69.2 | 22.1 |
| 1 | 44.5 | 39.3 | 45.1 | 10.4 |
| 2 | 17.2 | 45.9 | 69.3 | 9.3 |
| 3 | 151.5 | 41.3 | 58.5 | 18.5 |
| 4 | 180.8 | 10.8 | 58.4 | 12.9 |

```python
# Split ds into features and targets
X_adv = adv_df.drop('Sales', axis=1)
y_adv = adv_df['Sales']
```

```python
# 70:30 train/test split
X_adv_train, X_adv_test, y_adv_train, y_adv_test = train_test_split(
    X_adv, y_adv, test_size=0.3, random_state=666
)

print(X_adv_train.shape, y_adv_train.shape)
# (140, 3) (140,)
```

```python
# normalize features
scaler_adv = StandardScaler()
scaler_adv.fit(X_adv_train)

X_adv_train = scaler_adv.transform(X_adv_train)
X_adv_test = scaler_adv.transform(X_adv_test)
```

##### Model Fitting

```python
model_adv1 = Ridge(
    alpha=100.0
)

model_adv1.fit(X_adv_train, y_adv_train)
```

##### Model Evaluation

```python
y_adv_pred = model_adv1.predict(X_adv_test)

mean_squared_error(y_adv_test, y_adv_pred)
# 6.528575771818745
```

##### Adjusting Hyper Parameter

```python
model_adv2 = Ridge(
    alpha=1.0
)

model_adv2.fit(X_adv_train, y_adv_train)
```

```python
y_adv_pred2 = model_adv2.predict(X_adv_test)
mean_squared_error(y_adv_test, y_adv_pred2)
# 2.3319016551123535
```

#### Train | Validation | Test Split

```python
# 70:30 train/temp split
X_adv_train, X_adv_temp, y_adv_train, y_adv_temp = train_test_split(
    X_adv, y_adv, test_size=0.3, random_state=666
)

# 50:50 test/val split
X_adv_test, X_adv_val, y_adv_test, y_adv_val = train_test_split(
    X_adv_temp, y_adv_temp, test_size=0.5, random_state=666
)

print(X_adv_train.shape, X_adv_test.shape, X_adv_val.shape)
# (140, 3) (30, 3) (30, 3)
```

```python
# normalize features
scaler_adv = StandardScaler()
scaler_adv.fit(X_adv_train)

X_adv_train = scaler_adv.transform(X_adv_train)
X_adv_test = scaler_adv.transform(X_adv_test)
X_adv_val = scaler_adv.transform(X_adv_val)
```

##### Model Fitting and Evaluation

```python
model_adv3 = Ridge(
    alpha=100.0
)

model_adv3.fit(X_adv_train, y_adv_train)
```

```python
# do evaluation with the validation set
y_adv_pred3 = model_adv3.predict(X_adv_val)
mean_squared_error(y_adv_val, y_adv_pred3)
# 7.136230975501291
```

##### Adjusting Hyper Parameter

```python
model_adv4 = Ridge(
    alpha=1.0
)

model_adv4.fit(X_adv_train, y_adv_train)

y_adv_pred4 = model_adv4.predict(X_adv_val)
mean_squared_error(y_adv_val, y_adv_pred4)
# 2.6393803874124435
```

```python
# only once you are certain that you have the best performance
# do a final evaluation with the test set
y_adv4_final_pred = model_adv4.predict(X_adv_test)
mean_squared_error(y_adv_test, y_adv4_final_pred)
# 2.024422922812264
```

#### k-fold Cross Validation

Do a train/test split, then divide the training set into k folds (e.g. 5-10) and use each fold once as the validation set for a training run. The resulting error is the average over all k runs.

##### Train-Test Split

```python
# 70:30 train/temp split
X_adv_train, X_adv_test, y_adv_train, y_adv_test = train_test_split(
    X_adv, y_adv, test_size=0.3, random_state=666
)
```

```python
# normalize features
scaler_adv = StandardScaler()
scaler_adv.fit(X_adv_train)

X_adv_train = scaler_adv.transform(X_adv_train)
X_adv_test = scaler_adv.transform(X_adv_test)
```

##### Model Scoring

```python
model_adv5 = Ridge(
    alpha=100.0
)
```

```python
# do a 5-fold cross-eval
scores = cross_val_score(
    estimator=model_adv5,
    X=X_adv_train,
    y=y_adv_train,
    scoring='neg_mean_squared_error',
    cv=5
)

# take the mean of all five neg. error values
abs(scores.mean())
# 8.688107513529168
```

##### Adjusting Hyper Parameter

```python
model_adv6 = Ridge(
    alpha=1.0
)
```

```python
# do a 5-fold cross-eval
scores = cross_val_score(
    estimator=model_adv6,
    X=X_adv_train,
    y=y_adv_train,
    scoring='neg_mean_squared_error',
    cv=5
)

# take the mean of all five neg. error values
abs(scores.mean())
# 3.3419582340688576
```
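What `cross_val_score` automates can be spelled out with `KFold` directly — a minimal sketch (assuming the same five unshuffled folds):

```python
from sklearn.model_selection import KFold

fold_errors = []

for train_idx, val_idx in KFold(n_splits=5).split(X_adv_train):
    # fit a fresh model on the k-1 training folds
    model = Ridge(alpha=1.0)
    model.fit(X_adv_train[train_idx], y_adv_train.iloc[train_idx])
    # score it on the held-out fold
    y_fold_pred = model.predict(X_adv_train[val_idx])
    fold_errors.append(mean_squared_error(y_adv_train.iloc[val_idx], y_fold_pred))

# the cross-validation error is the average over all folds
np.mean(fold_errors)
```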

##### Model Fitting and Final Evaluation

```python
model_adv6.fit(X_adv_train, y_adv_train)

y_adv6_final_pred = model_adv6.predict(X_adv_test)
mean_squared_error(y_adv_test, y_adv6_final_pred)
# 2.3319016551123535
```

#### Cross Validate


##### Dataset (re-import)

```python
adv_df = pd.read_csv('datasets/Advertising.csv')
X_adv = adv_df.drop('Sales', axis=1)
y_adv = adv_df['Sales']
```

```python
# 70:30 train/test split
X_adv_train, X_adv_test, y_adv_train, y_adv_test = train_test_split(
    X_adv, y_adv, test_size=0.3, random_state=666
)
```

```python
# normalize features
scaler_adv = StandardScaler()
scaler_adv.fit(X_adv_train)

X_adv_train = scaler_adv.transform(X_adv_train)
X_adv_test = scaler_adv.transform(X_adv_test)
```

##### Model Scoring

```python
model_adv7 = Ridge(
    alpha=100.0
)
```

```python
scores = cross_validate(
    model_adv7,
    X_adv_train,
    y_adv_train,
    scoring=[
        'neg_mean_squared_error',
        'neg_mean_absolute_error'
    ],
    cv=10
)
```

```python
scores_df = pd.DataFrame(scores)
scores_df
```

|   | fit_time | score_time | test_neg_mean_squared_error | test_neg_mean_absolute_error |
| -- | -- | -- | -- | -- |
| 0 | 0.016399 | 0.000749 | -12.539147 | -2.851864 |
| 1 | 0.000684 | 0.000452 | -2.806466 | -1.423516 |
| 2 | 0.000937 | 0.000782 | -11.142227 | -2.740332 |
| 3 | 0.001060 | 0.000633 | -7.237347 | -2.196963 |
| 4 | 0.001045 | 0.000738 | -11.313985 | -2.690813 |
| 5 | 0.000650 | 0.000510 | -3.169169 | -1.526568 |
| 6 | 0.000698 | 0.000429 | -6.578249 | -1.727616 |
| 7 | 0.000600 | 0.000423 | -5.740245 | -1.640964 |
| 8 | 0.000565 | 0.000463 | -10.268075 | -2.415688 |
| 9 | 0.000562 | 0.000487 | -10.641669 | -1.974407 |

```python
abs(scores_df.mean())
```

| | |
| -- | -- |
| fit_time                    |    0.002320 |
| score_time                  |    0.000566 |
| test_neg_mean_squared_error   | 8.143658 |
| test_neg_mean_absolute_error  | 2.118873 |
_dtype: float64_


##### Adjusting Hyper Parameter

```python
model_adv8 = Ridge(
    alpha=1.0
)
```

```python
scores = cross_validate(
    model_adv8,
    X_adv_train,
    y_adv_train,
    scoring=[
        'neg_mean_squared_error',
        'neg_mean_absolute_error'
    ],
    cv=10
)

abs(pd.DataFrame(scores).mean())
```

| | |
| -- | -- |
| fit_time                    |    0.001141 |
| score_time                  |    0.000777 |
| test_neg_mean_squared_error   | 3.272673 |
| test_neg_mean_absolute_error  | 1.345709 |
_dtype: float64_


##### Model Fitting and Final Evaluation

```python
model_adv8.fit(X_adv_train, y_adv_train)

y_adv8_final_pred = model_adv8.predict(X_adv_test)
mean_squared_error(y_adv_test, y_adv8_final_pred)
# 2.3319016551123535
```

#### Grid Search

Loop through a set of hyperparameters to find an optimum.


##### Hyperparameter Search

```python
base_elastic_net_model = ElasticNet()
```

```python
param_grid = {
    'alpha': [0.1, 1, 5, 10, 50, 100],
    'l1_ratio':[0.1, 0.3, 0.5, 0.7, 0.9, 1.0]
}
```

```python
grid_model = GridSearchCV(
    estimator=base_elastic_net_model,
    param_grid=param_grid,
    scoring='neg_mean_squared_error',
    cv=5, verbose=2
)

grid_model.fit(X_adv_train, y_adv_train)

print(
    'Results:\nBest Estimator: ',
    grid_model.best_estimator_,
    '\nBest Hyperparameter: ',
    grid_model.best_params_
)
```

__Results__:
* Best Estimator:  `ElasticNet(alpha=0.1, l1_ratio=1.0)`
* Best Hyperparameter:  `{'alpha': 0.1, 'l1_ratio': 1.0}`

```python
gridcv_results = pd.DataFrame(grid_model.cv_results_)
```

|  | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_alpha | param_l1_ratio | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 0 | 0.001156 | 0.000160 | 0.000449 | 0.000038 | 0.1 | 0.1 | {'alpha': 0.1, 'l1_ratio': 0.1} | -1.924119 | -3.384152 | -3.588444 | -3.703040 | -5.091974 | -3.538346 | 1.007264 | 6 |
| 1 | 0.001144 | 0.000181 | 0.000407 | 0.000091 | 0.1 | 0.3 | {'alpha': 0.1, 'l1_ratio': 0.3} | -1.867117 | -3.304382 | -3.561106 | -3.623188 | -5.061781 | -3.483515 | 1.016000 | 5 |
| 2 | 0.000623 | 0.000026 | 0.000272 | 0.000052 | 0.1 | 0.5 | {'alpha': 0.1, 'l1_ratio': 0.5} | -1.812633 | -3.220727 | -3.539711 | -3.547572 | -5.043259 | -3.432780 | 1.028406 | 4 |
| 3 | 0.000932 | 0.000165 | 0.000321 | 0.000060 | 0.1 | 0.7 | {'alpha': 0.1, 'l1_ratio': 0.7} | -1.750153 | -3.144120 | -3.525226 | -3.477228 | -5.034008 | -3.386147 | 1.046722 | 3 |
| 4 | 0.000725 | 0.000106 | 0.000259 | 0.000024 | 0.1 | 0.9 | {'alpha': 0.1, 'l1_ratio': 0.9} | -1.693440 | -3.075686 | -3.518777 | -3.413393 | -5.029683 | -3.346196 | 1.065195 | 2 |
| 5 | 0.000654 | 0.000053 | 0.000274 | 0.000026 | 0.1 | 1.0 | {'alpha': 0.1, 'l1_ratio': 1.0} | -1.667506 | -3.044928 | -3.518866 | -3.384363 | -5.031297 | -3.329392 | 1.075006 | 1 |
| 6 | 0.000595 | 0.000016 | 0.000244 | 0.000002 | 1 | 0.1 | {'alpha': 1, 'l1_ratio': 0.1} | -8.575470 | -11.021534 | -8.212152 | -6.808719 | -10.792072 | -9.081990 | 1.604192 | 12 |
| 7 | 0.000591 | 0.000018 | 0.000244 | 0.000002 | 1 | 0.3 | {'alpha': 1, 'l1_ratio': 0.3} | -8.131855 | -10.448423 | -7.774620 | -6.179358 | -10.071728 | -8.521197 | 1.569173 | 11 |
| 8 | 0.000628 | 0.000049 | 0.000266 | 0.000023 | 1 | 0.5 | {'alpha': 1, 'l1_ratio': 0.5} | -7.519809 | -9.562473 | -7.261824 | -5.453399 | -9.213320 | -7.802165 | 1.481785 | 10 |
| 9 | 0.000594 | 0.000015 | 0.000243 | 0.000002 | 1 | 0.7 | {'alpha': 1, 'l1_ratio': 0.7} | -6.614835 | -8.351711 | -6.702104 | -4.698977 | -8.230616 | -6.919649 | 1.329741 | 9 |
| 10 | 0.000714 | 0.000108 | 0.000268 | 0.000033 | 1 | 0.9 | {'alpha': 1, 'l1_ratio': 0.9} | -5.537250 | -6.887828 | -6.148400 | -4.106124 | -7.101573 | -5.956235 | 1.078430 | 8 |
| 11 | 0.000649 | 0.000067 | 0.000263 | 0.000028 | 1 | 1.0 | {'alpha': 1, 'l1_ratio': 1.0} | -4.932027 | -6.058207 | -5.892529 | -3.798441 | -6.472871 | -5.430815 | 0.959804 | 7 |
| 12 | 0.000645 | 0.000042 | 0.000264 | 0.000040 | 5 | 0.1 | {'alpha': 5, 'l1_ratio': 0.1} | -21.863798 | -25.767488 | -18.768865 | -12.608680 | -23.207907 | -20.443347 | 4.520904 | 13 |
| 13 | 0.000617 | 0.000030 | 0.000281 | 0.000038 | 5 | 0.3 | {'alpha': 5, 'l1_ratio': 0.3} | -23.626694 | -27.439028 | -20.266203 | -12.788078 | -24.609195 | -21.745840 | 5.031493 | 14 |
| 14 | 0.000599 | 0.000011 | 0.000249 | 0.000013 | 5 | 0.5 | {'alpha': 5, 'l1_ratio': 0.5} | -26.202964 | -29.867138 | -22.527913 | -13.423857 | -26.835934 | -23.771561 | 5.675911 | 15 |
| 15 | 0.000588 | 0.000013 | 0.000276 | 0.000035 | 5 | 0.7 | {'alpha': 5, 'l1_ratio': 0.7} | -27.768946 | -33.428462 | -23.506474 | -14.599984 | -29.112276 | -25.683228 | 6.382379 | 17 |
| 16 | 0.000580 | 0.000003 | 0.000271 | 0.000001 | 5 | 0.9 | {'alpha': 5, 'l1_ratio': 0.9} | -29.868949 | -34.423737 | -25.623955 | -16.750237 | -31.056181 | -27.544612 | 6.087093 | 19 |
| 17 | 0.000591 | 0.000011 | 0.000259 | 0.000021 | 5 | 1.0 | {'alpha': 5, 'l1_ratio': 1.0} | -29.868949 | -34.423737 | -25.623955 | -16.750237 | -31.056181 | -27.544612 | 6.087093 | 19 |
| 18 | 0.000632 | 0.000028 | 0.000250 | 0.000012 | 10 | 0.1 | {'alpha': 10, 'l1_ratio': 0.1} | -26.179546 | -30.396420 | -22.386698 | -14.596498 | -27.292337 | -24.170300 | 5.429322 | 16 |
| 19 | 0.000593 | 0.000020 | 0.000239 | 0.000001 | 10 | 0.3 | {'alpha': 10, 'l1_ratio': 0.3} | -28.704426 | -33.379967 | -24.561645 | -15.634153 | -29.883725 | -26.432783 | 6.090062 | 18 |
| 20 | 0.000595 | 0.000036 | 0.000245 | 0.000013 | 10 | 0.5 | {'alpha': 10, 'l1_ratio': 0.5} | -29.868949 | -34.423737 | -25.623955 | -16.750237 | -31.056181 | -27.544612 | 6.087093 | 19 |
| 21 | 0.000610 | 0.000053 | 0.000258 | 0.000015 | 10 | 0.7 | {'alpha': 10, 'l1_ratio': 0.7} | -29.868949 | -34.423737 | -25.623955 | -16.750237 | -31.056181 | -27.544612 | 6.087093 | 19 |
| 22 | 0.000597 | 0.000022 | 0.000248 | 0.000015 | 10 | 0.9 | {'alpha': 10, 'l1_ratio': 0.9} | -29.868949 | -34.423737 | -25.623955 | -16.750237 | -31.056181 | -27.544612 | 6.087093 | 19 |
| 23 | 0.000623 | 0.000057 | 0.000305 | 0.000076 | 10 | 1.0 | {'alpha': 10, 'l1_ratio': 1.0} | -29.868949 | -34.423737 | -25.623955 | -16.750237 | -31.056181 | -27.544612 | 6.087093 | 19 |
| 24 | 0.000602 | 0.000016 | 0.000252 | 0.000013 | 50 | 0.1 | {'alpha': 50, 'l1_ratio': 0.1} | -29.868949 | -34.423737 | -25.623955 | -16.750237 | -31.056181 | -27.544612 | 6.087093 | 19 |
| 25 | 0.000577 | 0.000009 | 0.000238 | 0.000001 | 50 | 0.3 | {'alpha': 50, 'l1_ratio': 0.3} | -29.868949 | -34.423737 | -25.623955 | -16.750237 | -31.056181 | -27.544612 | 6.087093 | 19 |
| 26 | 0.000607 | 0.000046 | 0.000245 | 0.000010 | 50 | 0.5 | {'alpha': 50, 'l1_ratio': 0.5} | -29.868949 | -34.423737 | -25.623955 | -16.750237 | -31.056181 | -27.544612 | 6.087093 | 19 |
| 27 | 0.000569 | 0.000004 | 0.000259 | 0.000012 | 50 | 0.7 | {'alpha': 50, 'l1_ratio': 0.7} | -29.868949 | -34.423737 | -25.623955 | -16.750237 | -31.056181 | -27.544612 | 6.087093 | 19 |
| 28 | 0.000582 | 0.000022 | 0.000244 | 0.000011 | 50 | 0.9 | {'alpha': 50, 'l1_ratio': 0.9} | -29.868949 | -34.423737 | -25.623955 | -16.750237 | -31.056181 | -27.544612 | 6.087093 | 19 |
| 29 | 0.000603 | 0.000041 | 0.000251 | 0.000015 | 50 | 1.0 | {'alpha': 50, 'l1_ratio': 1.0} | -29.868949 | -34.423737 | -25.623955 | -16.750237 | -31.056181 | -27.544612 | 6.087093 | 19 |
| 30 | 0.000670 | 0.000106 | 0.000251 | 0.000013 | 100 | 0.1 | {'alpha': 100, 'l1_ratio': 0.1} | -29.868949 | -34.423737 | -25.623955 | -16.750237 | -31.056181 | -27.544612 | 6.087093 | 19 |
| 31 | 0.000764 | 0.000179 | 0.000343 | 0.000054 | 100 | 0.3 | {'alpha': 100, 'l1_ratio': 0.3} | -29.868949 | -34.423737 | -25.623955 | -16.750237 | -31.056181 | -27.544612 | 6.087093 | 19 |
| 32 | 0.000623 | 0.000077 | 0.000244 | 0.000007 | 100 | 0.5 | {'alpha': 100, 'l1_ratio': 0.5} | -29.868949 | -34.423737 | -25.623955 | -16.750237 | -31.056181 | -27.544612 | 6.087093 | 19 |
| 33 | 0.000817 | 0.000156 | 0.000329 | 0.000076 | 100 | 0.7 | {'alpha': 100, 'l1_ratio': 0.7} | -29.868949 | -34.423737 | -25.623955 | -16.750237 | -31.056181 | -27.544612 | 6.087093 | 19 |
| 34 | 0.000590 | 0.000017 | 0.000242 | 0.000004 | 100 | 0.9 | {'alpha': 100, 'l1_ratio': 0.9} | -29.868949 | -34.423737 | -25.623955 | -16.750237 | -31.056181 | -27.544612 | 6.087093 | 19 |
| 35 | 0.000595 | 0.000027 | 0.000242 | 0.000007 | 100 | 1.0 | {'alpha': 100, 'l1_ratio': 1.0} | -29.868949 | -34.423737 | -25.623955 | -16.750237 | -31.056181 | -27.544612 | 6.087093 | 19 |

```python
gridcv_results[
    [
        'param_alpha',
        'param_l1_ratio'
    ]
].plot(title='Grid Search Hyperparameter :: Parameter', figsize=(12,8))
```

![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_08.webp)

```python
gridcv_results[
    [
        'mean_fit_time',
        'std_fit_time',
        'mean_score_time'
    ]
].plot(title='Grid Search Hyperparameter :: Timing', figsize=(12,8))
```

![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_09.webp)

```python
gridcv_results[
    [
        'split0_test_score',
        'split1_test_score',
        'split2_test_score',
        'split3_test_score',
        'split4_test_score',
        'mean_test_score',
        'std_test_score',
        'rank_test_score'
    ]
].plot(title='Grid Search Hyperparameter :: Test Scores', figsize=(12,8))
```

![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_10.webp)


##### Model Evaluation

```python
y_grid_pred = grid_model.predict(X_adv_test)

mean_squared_error(y_adv_test, y_grid_pred)
# 2.380865536033581
```
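For larger grids, `RandomizedSearchCV` samples a fixed number of parameter combinations instead of trying them all — a sketch using the same scoring setup (the `n_iter` and `random_state` values are arbitrary):

```python
from sklearn.model_selection import RandomizedSearchCV

random_model = RandomizedSearchCV(
    estimator=ElasticNet(),
    param_distributions=param_grid,
    n_iter=10,
    scoring='neg_mean_squared_error',
    cv=5,
    random_state=42
)

random_model.fit(X_adv_train, y_adv_train)
print(random_model.best_params_)
```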
\n\n### Model Fitting\n\n```python\n# model for k=3\nknn = KNeighborsClassifier(n_neighbors=3)\nknn.fit(X_train_wine, y_train_wine)\n\ny_pred_wine_knn3 = knn.predict(X_test_wine)\nprint('Accuracy Score: ', (accuracy_score(y_test_wine, y_pred_wine_knn3)*100).round(2), '%')\n# Accuracy Score:  98.15 %\n```\n\n```python\n# model for k=5\nknn = KNeighborsClassifier(n_neighbors=5)\nknn.fit(X_train_wine, y_train_wine)\n\ny_pred_wine_knn5 = knn.predict(X_test_wine)\nprint('Accuracy Score: ', (accuracy_score(y_test_wine, y_pred_wine_knn5)*100).round(2), '%')\n# Accuracy Score:  98.15 %\n```\n\n```python\n# model for k=7\nknn = KNeighborsClassifier(n_neighbors=7)\nknn.fit(X_train_wine, y_train_wine)\n\ny_pred_wine_knn7 = knn.predict(X_test_wine)\nprint('Accuracy Score: ', (accuracy_score(y_test_wine, y_pred_wine_knn7)*100).round(2), '%')\n# Accuracy Score:  96.3 %\n```\n\n```python\n# model for k=9\nknn = KNeighborsClassifier(n_neighbors=9)\nknn.fit(X_train_wine, y_train_wine)\n\ny_pred_wine_knn9 = knn.predict(X_test_wine)\nprint('Accuracy Score: ', (accuracy_score(y_test_wine, y_pred_wine_knn9)*100).round(2), '%')\n```\n\n## Supervised Learning - Decision Tree Classifier\n\n* Does not require normalization\n* Is not sensitive to missing values\n\n### Dataset\n\n```python\n!wget https://gist.githubusercontent.com/Dviejopomata/ea5869ba4dcff84f8c294dc7402cd4a9/raw/4671f90b8b04ba4db9d67acafaa4c0827cd233c2/bill_authentication.csv -P datasets\n```\n\n```python\nbill_auth_df = pd.read_csv('datasets/bill_authentication.csv')\nbill_auth_df.head(3)\n```\n\n|   | Variance | Skewness | Curtosis | Entropy | Class |\n|  -- | -- | -- | -- | -- | -- |\n| 0 | 3.6216 | 8.6661 | -2.8073 | -0.44699 | 0 |\n| 1 | 4.5459 | 8.1674 | -2.4586 | -1.46210 | 0 |\n| 2 | 3.8660 | -2.6383 | 1.9242 | 0.10645 | 0 |\n\n\n### Preprocessing\n\n```python\n# remove target feature from training set\nX_bill = bill_auth_df.drop('Class', axis=1)\ny_bill = bill_auth_df['Class']\n```\n\n```python\nX_train_bill, X_test_bill, y_train_bill, y_test_bill = train_test_split(X_bill, y_bill, test_size=0.2)\n```\n\n### Model Fitting\n\n```python\ntree_classifier = DecisionTreeClassifier()\n\ntree_classifier.fit(X_train_bill, y_train_bill)\n```\n\n### Evaluation\n\n```python\ny_pred_bill = tree_classifier.predict(X_test_bill)\n```\n\n```python\nconf_mtx_bill = confusion_matrix(y_test_bill, y_pred_bill)\nconf_mtx_bill\n\n# array([[150,   2],\n#        [  4, 119]])\n```\n\n```python\nconf_mtx_bill_plot = ConfusionMatrixDisplay(\n    confusion_matrix=conf_mtx_bill,\n    display_labels=[False,True]\n)\n\nconf_mtx_bill_plot.plot()\nplt.show()\n```\n\n![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_05.webp)\n\n```python\nreport_bill = classification_report(\n    y_test_bill, y_pred_bill\n)\nprint(report_bill)\n```\n\n|              | precision | recall | f1-score | support |\n|    --        |  --  |  --  |  --  |  --  |\n| 0            | 0.97 | 0.99 | 0.98 |   152  |\n| 1            | 0.98 | 0.97 | 0.98 |   123  |\n| accuracy     |      |      | 0.98 |  275  |\n| macro avg    | 0.98 | 0.98 | 0.98 |  275  |\n| weighted avg | 0.98 | 0.98 | 0.98 |  275  |
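\n\nAs a quick sanity check the top of the fitted tree can be drawn with `plot_tree`. This sketch is not part of the original notebook - the class names are hypothetical labels for the 0/1 classes:\n\n```python\nfrom sklearn.tree import plot_tree\n\nplt.figure(figsize=(12,8))\nplot_tree(\n    tree_classifier,\n    feature_names=list(X_bill.columns),\n    class_names=['Legit', 'Fake'],  # hypothetical names for the 0/1 classes\n    filled=True,\n    max_depth=2  # only draw the first two levels for readability\n)\nplt.show()\n```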
\n\n## Supervised Learning - Random Forest Classifier\n\n* Does not require normalization\n* Is not sensitive to missing values\n* Low risk of overfitting\n* Efficient with large datasets\n* High accuracy\n\n### Dataset\n\n```python\n!wget https://raw.githubusercontent.com/xjcjiacheng/data-analysis/master/heart%20disease%20UCI/heart.csv -P datasets\n```\n\n```python\nheart_df = pd.read_csv('datasets/heart.csv')\nheart_df.head(5)\n```\n\n|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |\n| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |\n| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |\n| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |\n| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |\n| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |\n| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |\n\n\n### Preprocessing\n\n```python\n# remove target feature from training set\nX_heart = heart_df.drop('target', axis=1)\ny_heart = heart_df['target']\n```\n\n```python\nX_train_heart, X_test_heart, y_train_heart, y_test_heart = train_test_split(\n    X_heart,\n    y_heart,\n    test_size=0.2,\n    random_state=0\n)\n```\n\n### Model Fitting\n\n```python\nforest_classifier = RandomForestClassifier(n_estimators=10, criterion='entropy')\n\nforest_classifier.fit(X_train_heart, y_train_heart)\n```\n\n### Evaluation\n\n```python\ny_pred_heart = forest_classifier.predict(X_test_heart)\n```\n\n```python\nconf_mtx_heart = confusion_matrix(y_test_heart, y_pred_heart)\nconf_mtx_heart\n\n# array([[24,  3],\n#        [ 5, 29]])\n```\n\n```python\nconf_mtx_heart_plot = ConfusionMatrixDisplay(\n    confusion_matrix=conf_mtx_heart,\n    display_labels=[False,True]\n)\n\nconf_mtx_heart_plot.plot()\nplt.show()\n```\n\n![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_06.webp)\n\n```python\nreport_heart = classification_report(\n    y_test_heart, y_pred_heart\n)\nprint(report_heart)\n```\n\n|              | precision | recall | f1-score | support |\n|    --        |  --  |  --  |  --  |  --  |\n| 0            | 0.83 | 0.89 | 0.86 |   27  |\n| 1            | 0.91 | 0.85 | 0.88 |   34  |\n| accuracy     |      |      | 0.87 |  61  |\n| macro avg    | 0.87 | 0.87 | 0.87 |  61  |\n| weighted avg | 0.87 | 0.87 | 0.87 |  61  |\n\n\n### Random Forest Hyperparameter Tuning\n\n\n#### Testing Hyperparameters\n\n```python\nrdnfor_classifier = RandomForestClassifier(\n    n_estimators=2,\n    min_samples_split=2,\n    min_samples_leaf=1,\n    criterion='entropy'\n)\nrdnfor_classifier.fit(X_train_heart, y_train_heart)\n```\n\n```python\nrdnfor_pred = rdnfor_classifier.predict(X_test_heart)\nprint('Accuracy Score: ', accuracy_score(y_test_heart, rdnfor_pred).round(4)*100, '%')\n\n# Accuracy Score:  73.77 %\n```\n\n#### Grid-Search Cross-Validation\n\nTry a set of values for each selected hyperparameter to find the optimal configuration.\n\n```python\nparam_grid = {\n    'n_estimators': [5, 25, 50, 75, 100, 125],\n    # note: current scikit-learn rejects an integer min_samples_split of 1 - use values \u003e= 2 (or a float fraction)\n    'min_samples_split': [1,2,3],\n    'min_samples_leaf': [1,2,3],\n    'criterion': ['gini', 'entropy', 'log_loss'],\n    'max_features' : ['sqrt', 'log2']\n}\n\ngrid_search = GridSearchCV(\n    estimator = rdnfor_classifier,\n    param_grid = param_grid\n)\n\ngrid_search.fit(X_train_heart, y_train_heart)\n```\n\n```python\nprint('Best Parameter: ', grid_search.best_params_)\n# Best Parameter:  {\n# 'criterion': 'entropy',\n# 'max_features': 'sqrt',\n# 'min_samples_leaf': 2,\n# 'min_samples_split': 1,\n# 'n_estimators': 25\n# }\n```\n\n```python\nrdnfor_classifier_optimized = RandomForestClassifier(\n    n_estimators=25,\n    min_samples_split=1,\n    min_samples_leaf=2,\n    criterion='entropy',\n    max_features='sqrt'\n)\n\nrdnfor_classifier_optimized.fit(X_train_heart, y_train_heart)\n```\n\n```python\nrdnfor_pred_optimized = rdnfor_classifier_optimized.predict(X_test_heart)\nprint('Accuracy Score: ', accuracy_score(y_test_heart, rdnfor_pred_optimized).round(4)*100, '%')\n\n# Accuracy Score:  85.25 %\n```
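\n\nAn exhaustive grid grows multiplicatively with every added parameter. A hedged sketch, not part of the original notebook, of the same search with `RandomizedSearchCV`, which evaluates a fixed number of sampled candidates instead:\n\n```python\nfrom sklearn.model_selection import RandomizedSearchCV\n\n# same search space, restricted to min_samples_split values current scikit-learn accepts\nparam_dist = {\n    'n_estimators': [5, 25, 50, 75, 100, 125],\n    'min_samples_split': [2, 3, 4],\n    'min_samples_leaf': [1, 2, 3],\n    'criterion': ['gini', 'entropy', 'log_loss'],\n    'max_features': ['sqrt', 'log2']\n}\n\nrandom_search = RandomizedSearchCV(\n    RandomForestClassifier(),\n    param_distributions=param_dist,\n    n_iter=25,  # evaluate 25 random combinations instead of all of them\n    random_state=42\n)\n\nrandom_search.fit(X_train_heart, y_train_heart)\nrandom_search.best_params_\n```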
\n\n### Random Forest Classifier - Penguins\n\n```python\n!wget https://github.com/remijul/dataset/raw/master/penguins_size.csv -P datasets\n```\n\n```python\npeng_df = pd.read_csv('datasets/penguins_size.csv')\npeng_df = peng_df.dropna()\npeng_df.head(5)\n```\n\n|   | species | island | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g | sex |\n| -- | -- | -- | -- | -- | -- | -- | -- |\n| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |\n| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |\n| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |\n| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |\n| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | MALE |\n\n```python\n# drop labels and encode string values\nX_peng = pd.get_dummies(peng_df.drop('species', axis=1), drop_first=True)\ny_peng = peng_df['species']\n```\n\n```python\n# train/test split\nX_peng_train, X_peng_test, y_peng_train, y_peng_test = train_test_split(\n    X_peng,\n    y_peng,\n    test_size=0.3,\n    random_state=42\n)\n```\n\n```python\n# creating the model\nrfc_peng = RandomForestClassifier(\n    n_estimators=10,\n    max_features='sqrt',\n    random_state=42\n)\n```\n\n```python\n# model training and running predictions\nrfc_peng.fit(X_peng_train, y_peng_train)\npeng_pred = rfc_peng.predict(X_peng_test)\nprint('Accuracy Score: ', accuracy_score(y_peng_test, peng_pred, normalize=True).round(4)*100, '%')\n# Accuracy Score:  98.02 %\n```\n\n#### Feature Importance\n\n```python\n# feature importance for classification\npeng_index = ['importance']\npeng_data_columns = pd.Series(X_peng.columns)\npeng_importance_array = rfc_peng.feature_importances_\npeng_importance_df = pd.DataFrame(peng_importance_array, peng_data_columns, peng_index)\npeng_importance_df\n```\n\n|  | importance |\n| -- | -- |\n| culmen_length_mm | 0.288928 |\n| culmen_depth_mm | 0.111021 |\n| flipper_length_mm | 0.357994 |\n| body_mass_g | 0.025477 |\n| island_Dream | 0.178498 |\n| island_Torgersen | 0.031042 |\n| sex_FEMALE | 0.004716 |\n| sex_MALE | 0.002324 |\n\n```python\npeng_importance_df.sort_values(\n    by='importance',\n    ascending=False\n).plot(\n    kind='barh',\n    title='Feature Importance for Species Classification',\n    figsize=(12,4)\n)\n```\n\n![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_28.webp)\n\n\n#### Model Evaluation\n\n```python\nreport_peng = classification_report(y_peng_test, peng_pred)\nprint(report_peng)\n```\n\n|  | precision | recall | f1-score | support |\n| -- | -- | -- | -- | -- |\n| Adelie | 0.98 | 0.98 | 0.98 | 49 |\n| Chinstrap | 0.94 | 0.94 | 0.94 | 18 |\n| Gentoo | 1.00 | 1.00 | 1.00 |  34 |\n|     accuracy |      |      | 0.98 | 101 |\n|    macro avg | 0.97 | 0.97 | 0.97 | 101 |\n| weighted avg | 0.98 | 0.98 | 0.98 | 101 |\n\n```python\nconf_mtx_peng = confusion_matrix(y_peng_test, peng_pred)\n\nconf_mtx_peng_plot = ConfusionMatrixDisplay(\n    confusion_matrix=conf_mtx_peng\n)\n\nconf_mtx_peng_plot.plot(cmap='plasma')\n```\n\n![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_27.webp)
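\n\nImpurity-based importances like the ones above can overstate high-cardinality features. A small sketch, not part of the original notebook, that cross-checks them with permutation importance on the held-out split:\n\n```python\nfrom sklearn.inspection import permutation_importance\n\n# shuffle each feature on the test split and measure the resulting accuracy drop\nperm_peng = permutation_importance(\n    rfc_peng,\n    X_peng_test,\n    y_peng_test,\n    n_repeats=10,\n    random_state=42\n)\n\npd.Series(\n    perm_peng.importances_mean,\n    index=X_peng.columns\n).sort_values(ascending=False)\n```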
\n\n### Random Forest Classifier - Banknote Authentication\n\n```python\n!wget https://github.com/jbrownlee/Datasets/raw/master/banknote_authentication.csv -P datasets\n```\n\n```python\nmoney_df = pd.read_csv('datasets/banknote_authentication.csv')\nmoney_df.head(5)\n```\n\n|  | Variance_Wavelet | Skewness_Wavelet | Curtosis_Wavelet | Image_Entropy | Class |\n| -- | -- | -- | -- | -- | -- |\n| 0 | 3.62160 | 8.6661 | -2.8073 | -0.44699 | 0 |\n| 1 | 4.54590 | 8.1674 | -2.4586 | -1.46210 | 0 |\n| 2 | 3.86600 | -2.6383 | 1.9242 | 0.10645 | 0 |\n| 3 | 3.45660 | 9.5228 | -4.0112 | -3.59440 | 0 |\n| 4 | 0.32924 | -4.4552 | 4.5718 | -0.98880 | 0 |\n\n```python\nsns.pairplot(money_df, hue='Class', palette='winter')\n```\n\n![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_29.webp)\n\n```python\n# drop label for training\nX_money = money_df.drop('Class', axis=1)\ny_money = money_df['Class']\nprint(X_money.shape, y_money.shape)\n```\n\n```python\nX_money_train, X_money_test, y_money_train, y_money_test = train_test_split(\n    X_money,\n    y_money,\n    test_size=0.15,\n    random_state=42\n)\n```\n\n#### Grid Search for Hyperparameters\n\n```python\nrfc_money_base = RandomForestClassifier(oob_score=True)\n```\n\n```python\nparam_grid = {\n    'n_estimators': [64, 96, 128, 160, 192],\n    'max_features': [2,3,4],\n    'bootstrap': [True, False]\n}\n```\n\n```python\ngrid_money = GridSearchCV(rfc_money_base, param_grid)\ngrid_money.fit(X_money_train, y_money_train)\ngrid_money.best_params_\n# {'bootstrap': True, 'max_features': 2, 'n_estimators': 96}\n```\n\n#### Model Training and Evaluation\n\n```python\nrfc_money = RandomForestClassifier(\n    bootstrap=True,\n    max_features=2,\n    n_estimators=96,\n    oob_score=True\n)\nrfc_money.fit(X_money_train, y_money_train)\nprint('Out-of-Bag Score: ', rfc_money.oob_score_.round(4)*100, '%')\n# Out-of-Bag Score:  99.14 %\n```\n\n```python\nmoney_pred = rfc_money.predict(X_money_test)\nmoney_report = classification_report(y_money_test, money_pred)\nprint(money_report)\n```\n\n|  | precision | recall | f1-score | support |\n| -- | -- | -- | -- | -- |\n|     0 | 0.99 | 1.00 | 1.00 | 111 |\n|     1 | 1.00 | 0.99 | 0.99 |  95 |\n|    accuracy |  |  | 1.00 | 206 |\n|   macro avg | 1.00 | 0.99 | 1.00 | 206 |\n|weighted avg | 1.00 | 1.00 | 1.00 | 206 |\n\n```python\nconf_mtx_money = confusion_matrix(y_money_test, money_pred)\n\nconf_mtx_money_plot = ConfusionMatrixDisplay(\n    confusion_matrix=conf_mtx_money\n)\n\nconf_mtx_money_plot.plot(cmap='plasma')\n```\n\n![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_30.webp)\n\n\n#### Optimizations\n\n```python\n# verify number of estimators found by grid search\nerrors = []\nmisclassifications = []\n\nfor n in range(1,200):\n    rfc = RandomForestClassifier(n_estimators=n, max_features=2)\n    rfc.fit(X_money_train, y_money_train)\n    preds = rfc.predict(X_money_test)\n    \n    err = 1 - accuracy_score(y_money_test, preds)\n    errors.append(err)\n    \n    n_missed = np.sum(preds != y_money_test)\n    misclassifications.append(n_missed)\n```\n\n```python\nplt.figure(figsize=(12,4))\nplt.title('Errors as a Function of n_estimators')\nplt.xlabel('Estimators')\nplt.ylabel('Error Score')\nplt.plot(range(1,200), errors)\n# there is no noticeable improvement above ~10 estimators\n```\n\n![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_31.webp)\n\n```python\nplt.figure(figsize=(12,4))\nplt.title('Misclassifications as a Function of n_estimators')\nplt.xlabel('Estimators')\nplt.ylabel('Misclassifications')\nplt.plot(range(1,200), misclassifications)\n# and the same for 
misclassifications\n```\n\n![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_32.webp)\n\n\n### Random Forest Regressor\n\nComparing different regression models to a random forest regression model.\n\n```python\n# dataset\n!wget https://github.com/vineetsingh028/Rock_Density_Prediction/raw/master/rock_density_xray.csv -P datasets\n```\n\n```python\nrock_df = pd.read_csv('datasets/rock_density_xray.csv')\nrock_df.columns = ['Signal', 'Density']\nrock_df.head(5)\n```\n\n|   | Signal | Density |\n| -- | -- | -- |\n| 0 | 72.945124 | 2.456548 |\n| 1 | 14.229877 | 2.601719 |\n| 2 | 36.597334 | 1.967004 |\n| 3 | 9.578899 | 2.300439 |\n| 4 | 21.765897 | 2.452374 |\n\n```python\nplt.figure(figsize=(12,5))\nplt.title('X-Ray Bounce Signal Strength vs Rock Density')\nsns.scatterplot(data=rock_df, x='Signal', y='Density')\n# the signal vs density plot follows a sine wave - spoiler alert: the simpler algorithms\n# will fail trying to fit this dataset...\n```\n\n![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_33.webp)\n\n```python\n# train-test split\nX_rock = rock_df['Signal'].values.reshape(-1,1)\ny_rock = rock_df['Density']\n\nX_rock_train, X_rock_test, y_rock_train, y_rock_test = train_test_split(\n    X_rock,\n    y_rock,\n    test_size=0.1,\n    random_state=42\n)\n```\n\n```python\n# normalization\nscaler = StandardScaler()\nX_rock_train_scaled = scaler.fit_transform(X_rock_train)\nX_rock_test_scaled = scaler.transform(X_rock_test)\n```\n\n#### vs Linear Regression\n\n```python\nlr_rock = LinearRegression()\nlr_rock.fit(X_rock_train_scaled, y_rock_train)\n```\n\n```python\nlr_rock_preds = lr_rock.predict(X_rock_test_scaled)\n\nmae = mean_absolute_error(y_rock_test, lr_rock_preds)\nrmse = np.sqrt(mean_squared_error(y_rock_test, lr_rock_preds))\nmean_abs = y_rock_test.mean()\navg_error = mae * 100 / mean_abs\n\nprint('MAE: ', mae.round(2), 'RMSE: ', rmse.round(2), 'Relative Avg. Error: ', avg_error.round(2), '%')\n# MAE:  0.24 RMSE:  0.3 Relative Avg. Error:  10.93 %\n```\n\n```python\n# visualize predictions\nplt.figure(figsize=(12,5))\nplt.plot(X_rock_test, lr_rock_preds, c='mediumspringgreen')\nsns.scatterplot(data=rock_df, x='Signal', y='Density', c='dodgerblue')\nplt.title('Linear Regression Predictions')\nplt.show()\n# the error appears small because the linear regression predicts close to the average\n# but a straight line cannot follow the contours of the underlying sine wave\n```\n\n![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_34.webp)\n\n\n#### vs Polynomial Regression\n\n```python\n# helper function\ndef run_model(model, X_train, y_train, X_test, y_test, df):\n    \n    # FIT MODEL\n    model.fit(X_train, y_train)\n    \n    # EVALUATE\n    y_preds = model.predict(X_test)\n    mae = mean_absolute_error(y_test, y_preds)\n    rmse = np.sqrt(mean_squared_error(y_test, y_preds))\n    mean_abs = y_test.mean()\n    avg_error = mae * 100 / mean_abs\n    print('MAE: ', mae.round(2), 'RMSE: ', rmse.round(2), 'Relative Avg. 
Error: ', avg_error.round(2), '%')\n    \n    # PLOT RESULTS\n    signal_range = np.arange(0,100)\n    output = model.predict(signal_range.reshape(-1,1))\n    \n    \n    plt.figure(figsize=(12,5))\n    sns.scatterplot(data=df, x='Signal', y='Density', c='dodgerblue')\n    plt.plot(signal_range,output, c='mediumspringgreen')\n    plt.title('Regression Predictions')\n    plt.show()\n```\n\n```python\n# test helper on previous linear regression\nrun_model(\n    model=lr_rock,\n    X_train=X_rock_train,\n    y_train=y_rock_train,\n    X_test=X_rock_test,\n    y_test=y_rock_test,\n    df=rock_df\n)\n```\n\n\u003e MAE:  0.24 RMSE:  0.3 Relative Avg. Error:  10.93 %\n\n![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_35.webp)\n\n```python\n# build polynomial model\npipe_poly = make_pipeline(\n    PolynomialFeatures(degree=6),\n    LinearRegression()\n)\n```\n\n```python\n# run model\nrun_model(\n    model=pipe_poly,\n    X_train=X_rock_train,\n    y_train=y_rock_train,\n    X_test=X_rock_test,\n    y_test=y_rock_test,\n    df=rock_df\n)\n# with a HARD LIMIT of 0-100 for the xray signal a 6th degree polynomial is a good fit\n```\n\n\u003e MAE:  0.13 RMSE:  0.14 Relative Avg. Error:  5.7 %\n\n![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_36.webp)\n\n\n#### vs KNeighbors Regression\n\n```python\n# build KNN models for a set of k values\nk_values = [1,5,10,25]\n\nfor k in k_values:\n    model = KNeighborsRegressor(n_neighbors=k)\n    print(model)\n    \n    # run model\n    run_model(\n        model,\n        X_train=X_rock_train,\n        y_train=y_rock_train,\n        X_test=X_rock_test,\n        y_test=y_rock_test,\n        df=rock_df\n    )\n```\n\u003e KNeighborsRegressor(n_neighbors=1)\n\u003e\n\u003e MAE:  0.12 RMSE:  0.17 Relative Avg. Error:  5.47 %\n\n![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_37.webp)\n\n\u003e KNeighborsRegressor()\n\u003e\n\u003e MAE:  0.13 RMSE:  0.15 Relative Avg. Error:  5.9 %\n\n![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_38.webp)\n\n\u003e KNeighborsRegressor(n_neighbors=10)\n\u003e\n\u003e MAE:  0.12 RMSE:  0.14 Relative Avg. Error:  5.44 %\n\n![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_39.webp)\n\n\u003e KNeighborsRegressor(n_neighbors=25)\n\u003e\n\u003e MAE:  0.14 RMSE:  0.16 Relative Avg. Error:  6.18 %\n\n![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_40.webp)
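\n\nInstead of eyeballing the four runs above, a grid search can pick `n_neighbors` by cross-validated error. A minimal sketch, not part of the original notebook:\n\n```python\n# search k=1..25 by cross-validated mean absolute error\nknn_grid = GridSearchCV(\n    KNeighborsRegressor(),\n    param_grid={'n_neighbors': list(range(1, 26))},\n    scoring='neg_mean_absolute_error'\n)\n\nknn_grid.fit(X_rock_train, y_rock_train)\nknn_grid.best_params_\n```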
\n\n#### vs Decision Tree Regression\n\n```python\ntree_model = DecisionTreeRegressor()\n\n# run model\nrun_model(\n    model=tree_model,\n    X_train=X_rock_train,\n    y_train=y_rock_train,\n    X_test=X_rock_test,\n    y_test=y_rock_test,\n    df=rock_df\n)\n```\n\n\u003e MAE:  0.12 RMSE:  0.17 Relative Avg. Error:  5.47 %\n\n![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_41.webp)\n\n\n#### vs Support Vector Regression\n\n```python\nsvr_rock = svm.SVR()\n\nparam_grid = {\n    'C': [0.01, 0.1, 1, 5, 10, 100, 1000],\n    'gamma': ['auto', 'scale']\n}\n\nrock_grid = GridSearchCV(svr_rock, param_grid)\n```\n\n```python\n# run model\nrun_model(\n    model=rock_grid,\n    X_train=X_rock_train,\n    y_train=y_rock_train,\n    X_test=X_rock_test,\n    y_test=y_rock_test,\n    df=rock_df\n)\n```\n\n\u003e MAE:  0.13 RMSE:  0.14 Relative Avg. Error:  5.75 %\n\n![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_42.webp)\n\n\n#### vs Gradient Boosting Regression\n\n```python\ngbr_rock = GradientBoostingRegressor()\n\n# run model\nrun_model(\n    model=gbr_rock,\n    X_train=X_rock_train,\n    y_train=y_rock_train,\n    X_test=X_rock_test,\n    y_test=y_rock_test,\n    df=rock_df\n)\n```\n\n\u003e MAE:  0.13 RMSE:  0.15 Relative Avg. Error:  5.76 %\n\n![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_44.webp)\n\n\n#### vs Ada Boosting Regression\n\n```python\nabr_rock = AdaBoostRegressor()\n\n# run model\nrun_model(\n    model=abr_rock,\n    X_train=X_rock_train,\n    y_train=y_rock_train,\n    X_test=X_rock_test,\n    y_test=y_rock_test,\n    df=rock_df\n)\n```\n\n\u003e MAE:  0.13 RMSE:  0.14 Relative Avg. Error:  5.67 %\n\n![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_45.webp)
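\n\nAll of these comparisons call the same helper, so the whole sweep can be driven from one dictionary. A compact sketch, not in the original notebook, shown here before the final random forest run below:\n\n```python\n# run every candidate model through the same helper in one loop\nmodels = {\n    'KNN (k=10)': KNeighborsRegressor(n_neighbors=10),\n    'Decision Tree': DecisionTreeRegressor(),\n    'Gradient Boosting': GradientBoostingRegressor(),\n    'Ada Boost': AdaBoostRegressor(),\n}\n\nfor name, model in models.items():\n    print(name)\n    run_model(\n        model,\n        X_train=X_rock_train,\n        y_train=y_rock_train,\n        X_test=X_rock_test,\n        y_test=y_rock_test,\n        df=rock_df\n    )\n```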
\n\n#### Finally, Random Forest Regression\n\n```python\nrfr_rock = RandomForestRegressor(n_estimators=10)\n\n# run model\nrun_model(\n    model=rfr_rock,\n    X_train=X_rock_train,\n    y_train=y_rock_train,\n    X_test=X_rock_test,\n    y_test=y_rock_test,\n    df=rock_df\n)\n```\n\n\u003e MAE:  0.11 RMSE:  0.14 Relative Avg. Error:  5.1 %\n\n![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_43.webp)\n\n\u003c!-- #region --\u003e\n## Supervised Learning - SVC Model\n\n__Support Vector Machines__ (`SVM`s) are a set of supervised learning methods used for classification, regression and outlier detection.\n\n\n* Effective in high dimensional spaces.\n* Still effective in cases where the number of dimensions is greater than the number of samples.\n\u003c!-- #endregion --\u003e\n\n### Dataset\n\n* [Three different varieties of wheat - Kaggle.com](https://www.kaggle.com/datasets/dongeorge/seed-from-uci)\n\nMeasurements of geometrical properties of kernels belonging to three different varieties of wheat:\n\n* __A__: Area,\n* __P__: Perimeter,\n* __C__ = 4πA/P²: Compactness,\n* __LK__: Length of kernel,\n* __WK__: Width of kernel,\n* __A\_Coef__: Asymmetry coefficient,\n* __LKG__: Length of kernel groove.\n\n```python\n!wget https://raw.githubusercontent.com/prasertcbs/basic-dataset/master/Seed_Data.csv -P datasets\n```\n\n```python\nwheat_df = pd.read_csv('datasets/Seed_Data.csv')\nwheat_df.head(5)\n```\n\n|   | A | P | C | LK | WK | A_Coef | LKG | target |\n| -- | -- | -- | -- | -- | -- | -- | -- | -- |\n| 0 | 15.26 | 14.84 | 0.8710 | 5.763 | 3.312 | 2.221 | 5.220 | 0 |\n| 1 | 14.88 | 14.57 | 0.8811 | 5.554 | 3.333 | 1.018 | 4.956 | 0 |\n| 2 | 14.29 | 14.09 | 0.9050 | 5.291 | 3.337 | 2.699 | 4.825 | 0 |\n| 3 | 13.84 | 13.94 | 0.8955 | 5.324 | 3.379 | 2.259 | 4.805 | 0 |\n| 4 | 16.14 | 14.99 | 0.9034 | 5.658 | 3.562 | 1.355 | 5.175 | 0 |\n\n```python\nwheat_df.info()\n\n# \u003cclass 'pandas.core.frame.DataFrame'\u003e\n# RangeIndex: 210 entries, 0 to 209\n# Data columns (total 8 columns):\n#  #   Column  Non-Null Count  Dtype  \n# ---  ------  --------------  -----  \n#  0   A       210 non-null    float64\n#  1   P       210 non-null    float64\n#  2   C       210 non-null    float64\n#  3   LK      210 non-null    float64\n#  4   WK      210 non-null    float64\n#  5   A_Coef  210 non-null    float64\n#  6   LKG     210 non-null    float64\n#  7   target  210 non-null    int64  \n# dtypes: float64(7), int64(1)\n# memory usage: 13.2 KB\n```\n\n#### Preprocessing\n\n```python\n# remove target feature from training set\nX_wheat = wheat_df.drop('target', axis=1)\ny_wheat = wheat_df['target']\n\nprint(X_wheat.shape, y_wheat.shape)\n# (210, 7) (210,)\n```\n\n```python\n# train/test split\nX_train_wheat, X_test_wheat, y_train_wheat, y_test_wheat = train_test_split(\n    X_wheat,\n    y_wheat,\n    test_size=0.2,\n    random_state=42\n)\n```\n\n```python\n# normalization\nsc_wheat = StandardScaler()\nX_train_wheat = sc_wheat.fit_transform(X_train_wheat)\n# only transform the test set - do not re-fit the scaler on test data\nX_test_wheat = sc_wheat.transform(X_test_wheat)\n```\n\n#### Model Training\n\n```python\n# SVM classifier fitting\nclf_wheat = svm.SVC()\nclf_wheat.fit(X_train_wheat, y_train_wheat)\n```\n\n#### Model Evaluation\n\n```python\n# Predictions\ny_wheat_pred = clf_wheat.predict(X_test_wheat)\n```\n\n```python\nprint(\n    'Accuracy Score: ',\n    accuracy_score(y_test_wheat, y_wheat_pred, normalize=True).round(4)*100, '%'\n)\n# Accuracy Score:  90.48 %\n```\n\n```python\nreport_wheat = classification_report(\n    y_test_wheat, y_wheat_pred\n)\nprint(report_wheat)\n```\n\n|               | precision | recall | f1-score | support | \n| -- | -- | -- | -- | -- |\n|           0  | 0.82 | 0.82 | 0.82 | 11 |\n|           1  | 1.00 | 0.93 | 0.96 | 14 |\n|            2 | 0.89 | 0.94 | 0.91 | 17 
|\n|   accuracy   |      |      | 0.90 | 42 |\n|    macro avg | 0.90 | 0.90 | 0.90 | 42 |\n| weighted avg | 0.91 | 0.90 | 0.91 | 42 |\n\n```python\nconf_mtx_wheat = confusion_matrix(y_test_wheat, y_wheat_pred)\nconf_mtx_wheat\n\n# array([[ 9,  0,  2],\n#        [ 1, 13,  0],\n#        [ 1,  0, 16]])\n```\n\n```python\nconf_mtx_wheat_plot = ConfusionMatrixDisplay(\n    confusion_matrix=conf_mtx_wheat\n)\n\nconf_mtx_wheat_plot.plot()\nplt.show()\n```\n\n![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_07.webp)\n\n\n### Margin Plots for Support Vector Classifier\n\n```python\n# get dataset\n!wget https://github.com/alpeshraj/mouse_viral_study/raw/main/mouse_viral_study.csv -P datasets\n```\n\n```python\nmice_df = pd.read_csv('datasets/mouse_viral_study.csv')\nmice_df.head(5)\n```\n\n|   | Med_1_mL | Med_2_mL | Virus Present |\n| -- | -- | -- | -- |\n| 0 | 6.508231 | 8.582531 | 0 |\n| 1 | 4.126116 | 3.073459 | 1 |\n| 2 | 6.427870 | 6.369758 | 0 |\n| 3 | 3.672953 | 4.905215 | 1 |\n| 4 | 1.580321 | 2.440562 | 1 |\n\n```python\nsns.scatterplot(data=mice_df, x='Med_1_mL',y='Med_2_mL',hue='Virus Present', palette='winter')\n```\n\n![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_12.webp)\n\n```python\n# visualizing a hyperplane to separate the two features\nsns.scatterplot(data=mice_df, x='Med_1_mL',y='Med_2_mL',hue='Virus Present', palette='winter')\n\nx = np.linspace(0,10,100)\nm = -1\nb = 11\ny = m*x + b\n\nplt.plot(x,y,c='fuchsia')\n```\n\n![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_13.webp)\n\n\n#### SVC with a Linear Kernel\n\n```python\n# use a support vector classifier to maximize the margin between the two classes\n\ny_vir = mice_df['Virus Present']\nX_vir = mice_df.drop('Virus Present',axis=1)\n\n# kernel : {'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'}\n# the smaller the C value the more feature vectors will be inside the margin\nmodel_vir = svm.SVC(kernel='linear', C=1000)\n\nmodel_vir.fit(X_vir, y_vir)\n```\n\n```python\n# import helper function\nfrom helper.svm_margin_plot import plot_svm_boundary\n```\n\n```python\nplot_svm_boundary(model_vir, X_vir, y_vir)\n```\n\n![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_14.webp)\n\n```python\n# the smaller the C value the more feature vectors will be inside the margin\nmodel_vir_low_reg = svm.SVC(kernel='linear', C=0.005)\nmodel_vir_low_reg.fit(X_vir, y_vir)\nplot_svm_boundary(model_vir_low_reg, X_vir, y_vir)\n```\n\n![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_15.webp)
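\n\nThe effect of `C` can also be read off the number of support vectors each fit keeps. A small sketch, not from the original notebook:\n\n```python\n# a smaller C widens the margin, so more samples end up as support vectors\nfor C in [1000, 1, 0.005]:\n    model = svm.SVC(kernel='linear', C=C).fit(X_vir, y_vir)\n    print(f'C={C}: {model.n_support_.sum()} support vectors')\n```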
\n\n#### SVC with a Radial Basis Function Kernel\n\n```python\nmodel_vir_rbf = svm.SVC(kernel='rbf', C=1)\nmodel_vir_rbf.fit(X_vir, y_vir)\nplot_svm_boundary(model_vir_rbf, X_vir, y_vir)\n```\n\n![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_16.webp)\n\n```python\n# gamma : {'scale', 'auto'} or float, default='scale'\n# - if ``gamma='scale'`` (default) is passed then it uses 1 / (n_features * X.var()) as value of gamma,\n# - if 'auto', uses 1 / n_features\n# - if float, must be non-negative.\nmodel_vir_rbf_auto_gamma = svm.SVC(kernel='rbf', C=1, gamma='auto')\nmodel_vir_rbf_auto_gamma.fit(X_vir, y_vir)\nplot_svm_boundary(model_vir_rbf_auto_gamma, X_vir, y_vir)\n```\n\n![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_17.webp)\n\n\n#### SVC with a Sigmoid Kernel\n\n```python\nmodel_vir_sigmoid = svm.SVC(kernel='sigmoid', gamma='scale')\nmodel_vir_sigmoid.fit(X_vir, y_vir)\nplot_svm_boundary(model_vir_sigmoid, X_vir, y_vir)\n```\n\n![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_18.webp)\n\n\n#### SVC with a Polynomial Kernel\n\n```python\nmodel_vir_poly = svm.SVC(kernel='poly', C=1, degree=2)\nmodel_vir_poly.fit(X_vir, y_vir)\nplot_svm_boundary(model_vir_poly, X_vir, y_vir)\n```\n\n![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_19.webp)\n\n\n### Grid Search for Support Vector Classifier\n\n```python\nsvm_base_model = svm.SVC()\n\nparam_grid = {\n    'C':[0.01, 0.1, 1],\n    'kernel': ['linear', 'rbf']\n}\n```\n\n```python\ngrid = GridSearchCV(svm_base_model, param_grid)\ngrid.fit(X_vir, y_vir)\n```\n\n```python\ngrid.best_params_\n# {'C': 0.01, 'kernel': 'linear'}\n```\n\n### Support Vector Regression\n\n```python\n# dataset\n!wget https://github.com/fsdhakan/ML/raw/main/cement_slump.csv -P datasets\n```\n\n```python\ncement_df = pd.read_csv('datasets/cement_slump.csv')\ncement_df.head(5)\n```\n\n|    | Cement | Slag | Fly ash | Water | SP | Coarse Aggr. | Fine Aggr. | SLUMP(cm) | FLOW(cm) | Compressive Strength (28-day)(Mpa) |\n| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |\n| 0 | 273.0 | 82.0 | 105.0 | 210.0 | 9.0 | 904.0 | 680.0 | 23.0 | 62.0 | 34.99 |\n| 1 | 163.0 | 149.0 | 191.0 | 180.0 | 12.0 | 843.0 | 746.0 | 0.0 | 20.0 | 41.14 |\n| 2 | 162.0 | 148.0 | 191.0 | 179.0 | 16.0 | 840.0 | 743.0 | 1.0 | 20.0 | 41.81 |\n| 3 | 162.0 | 148.0 | 190.0 | 179.0 | 19.0 | 838.0 | 741.0 | 3.0 | 21.5 | 42.08 |\n| 4 | 154.0 | 112.0 | 144.0 | 220.0 | 10.0 | 923.0 | 658.0 | 20.0 | 64.0 | 26.82 |\n\n```python\nplt.figure(figsize=(8,8))\nsns.heatmap(cement_df.corr(), annot=True, cmap='viridis')\n```\n\n![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_20.webp)\n\n```python\n# drop labels\nX_cement = cement_df.drop('Compressive Strength (28-day)(Mpa)', axis=1)\ny_cement = cement_df['Compressive Strength (28-day)(Mpa)']\n```\n\n```python\n# train/test split\nX_train_cement, X_test_cement, y_train_cement, y_test_cement = train_test_split(\n    X_cement,\n    y_cement,\n    test_size=0.3,\n    random_state=42\n)\n```\n\n```python\n# normalize\nscaler = StandardScaler()\nX_train_cement_scaled = scaler.fit_transform(X_train_cement)\nX_test_cement_scaled = scaler.transform(X_test_cement)\n```\n\n#### Base Model Run\n\n```python\nbase_model_cement = svm.SVR()\n```\n\n```python\nbase_model_cement.fit(X_train_cement_scaled, y_train_cement)\n\nbase_model_predictions = base_model_cement.predict(X_test_cement_scaled)\n```\n\n```python\nmae = mean_absolute_error(y_test_cement, base_model_predictions)\n# mean_squared_error() returns the plain MSE (no square root)\nmse = mean_squared_error(y_test_cement, base_model_predictions)\nmean_abs = y_test_cement.mean()\navg_error = mae * 100 / mean_abs\n\nprint('MAE: ', mae.round(2), 'MSE: ', mse.round(2), 'Relative Avg. 
Error: ', avg_error.round(2), '%')\n```\n\n| MAE | MSE |  Relative Avg. Error |\n| -- | -- | -- |\n| 4.68 | 36.95 | 12.75 % |\n\n\n#### Grid Search for Better Hyperparameters\n\n```python\nparam_grid = {\n    'C': [0.001,0.01,0.1,0.5,1],\n    'kernel': ['linear', 'rbf', 'poly'],\n    'gamma': ['scale', 'auto'],\n    'degree': [2,3,4],\n    'epsilon': [0,0.01,0.1,0.5,1,2]\n}\n```\n\n```python\ncement_grid = GridSearchCV(base_model_cement, param_grid)\ncement_grid.fit(X_train_cement_scaled, y_train_cement)\n```\n\n```python\ncement_grid.best_params_\n# {'C': 1, 'degree': 2, 'epsilon': 2, 'gamma': 'scale', 'kernel': 'linear'}\n```\n\n```python\ncement_grid_predictions = cement_grid.predict(X_test_cement_scaled)\n```\n\n```python\nmae_grid = mean_absolute_error(y_test_cement, cement_grid_predictions)\nmse_grid = mean_squared_error(y_test_cement, cement_grid_predictions)\nmean_abs = y_test_cement.mean()\navg_error_grid = mae_grid * 100 / mean_abs\n\nprint('MAE: ', mae_grid.round(2), 'MSE: ', mse_grid.round(2), 'Relative Avg. Error: ', avg_error_grid.round(2), '%')\n```\n\n| MAE | MSE |  Relative Avg. Error |\n| -- | -- | -- |\n| 1.85 | 5.2 | 5.05 % |\n\n\n### Example Task - Wine Fraud\n\n#### Data Exploration\n\n```python\n# dataset\n!wget https://github.com/CAPGAGA/Fraud-in-Wine/raw/main/wine_fraud.csv -P datasets\n```\n\n```python\nwine_df = pd.read_csv('datasets/wine_fraud.csv')\nwine_df.head(5)\n```\n\n|  | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | type |\n| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |\n| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | Legit | red |\n| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | Legit | red |\n| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | Legit | red |\n| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.9980 | 3.16 | 0.58 | 9.8 | Legit | red |\n| 4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | Legit | red |\n\n```python\nwine_df.value_counts('quality')\n```\n\n| quality | count |\n| -- | -- |\n| Legit | 6251 |\n| Fraud | 246 |\n_dtype: int64_\n\n```python\nwine_df['quality'].value_counts().plot(\n    kind='bar',\n    figsize=(10,5),\n    title='Wine - Quality distribution')\n\n```\n\n![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_21.webp)\n\n```python\nplt.figure(figsize=(10, 5))\nplt.title('Wine - Quality distribution by Type')\n\nsns.countplot(\n    data=wine_df,\n    x='quality',\n    hue='type',\n    palette='winter'\n)\n\nplt.savefig('assets/Scikit_Learn_22.webp', bbox_inches='tight')\n```\n\n![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_22.webp)\n\n```python\nwine_df_white = wine_df[wine_df['type'] == 'white']\nwine_df_red = wine_df[wine_df['type'] == 'red']\n```\n\n```python\n# fraud percentage by wine type\nlegit_white_wines = wine_df_white.value_counts('quality')['Legit']\nfraud_white_wines = wine_df_white.value_counts('quality')['Fraud']\nwhite_fraud_percentage = fraud_white_wines * 100 / (legit_white_wines + fraud_white_wines)\n\nlegit_red_wines = wine_df_red.value_counts('quality')['Legit']\nfraud_red_wines = wine_df_red.value_counts('quality')['Fraud']\nred_fraud_percentage = fraud_red_wines * 100 
/ (legit_red_wines + fraud_red_wines)\n\nprint(\n    'Fraud Percentage: \\nWhite Wines: ',\n    white_fraud_percentage.round(2),\n    '% \\nRed Wines: ',\n    red_fraud_percentage.round(2),\n    '%'\n)\n```\n\n| Fraud Percentage: | |\n| -- | -- |\n| White Wines: | 3.74 % |\n| Red Wines: | 3.94 % |\n\n```python\n# make features numeric\nfeature_map = {\n    'Legit': 0,\n    'Fraud': 1,\n    'red': 0,\n    'white': 1\n}\n\nwine_df['quality_enc'] = wine_df['quality'].map(feature_map)\nwine_df['type_enc'] = wine_df['type'].map(feature_map)\nwine_df[['quality', 'quality_enc', 'type', 'type_enc']]\n```\n\n|  | quality | quality_enc | type | type_enc |\n| -- | -- | -- | -- | -- |\n| 0 | Legit | 0 | red | 0 |\n| 1 | Legit | 0 | red | 0 |\n| 2 | Legit | 0 | red | 0 |\n| 3 | Legit | 0 | red | 0 |\n| 4 | Legit | 0 | red | 0 |\n| ... |\n| 6492 | Legit | 0 | white | 1 |\n| 6493 | Legit | 0 | white | 1 |\n| 6494 | Legit | 0 | white | 1 |\n| 6495 | Legit | 0 | white | 1 |\n| 6496 | Legit | 0 | white | 1 |\n_6497 rows × 4 columns_\n\n```python\n# find correlations\nwine_df.corr(numeric_only=True)\n```\n\n|  | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality_enc | type_enc |\n| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |\n| fixed acidity | 1.000000 | 0.219008 | 0.324436 | -0.111981 | 0.298195 | -0.282735 | -0.329054 | 0.458910 | -0.252700 | 0.299568 | -0.095452 | 0.021794 | -0.486740 |\n| volatile acidity | 0.219008 | 1.000000 | -0.377981 | -0.196011 | 0.377124 | -0.352557 | -0.414476 | 0.271296 | 0.261454 | 0.225984 | -0.037640 | 0.151228 | -0.653036 |\n| citric acid | 0.324436 | -0.377981 | 1.000000 | 0.142451 | 0.038998 | 0.133126 | 0.195242 | 0.096154 | -0.329808 | 0.056197 | -0.010493 | -0.061789 | 0.187397 |\n| residual sugar | -0.111981 | -0.196011 | 0.142451 | 1.000000 | -0.128940 | 0.402871 | 0.495482 | 0.552517 | -0.267320 | -0.185927 | -0.359415 | -0.048756 | 0.348821 |\n| chlorides | 0.298195 | 0.377124 | 0.038998 | -0.128940 | 1.000000 | -0.195045 | -0.279630 | 0.362615 | 0.044708 | 0.395593 | -0.256916 | 0.034499 | -0.512678 |\n| free sulfur dioxide | -0.282735 | -0.352557 | 0.133126 | 0.402871 | -0.195045 | 1.000000 | 0.720934 | 0.025717 | -0.145854 | -0.188457 | -0.179838 | -0.085204 | 0.471644 |\n| total sulfur dioxide | -0.329054 | -0.414476 | 0.195242 | 0.495482 | -0.279630 | 0.720934 | 1.000000 | 0.032395 | -0.238413 | -0.275727 | -0.265740 | -0.035252 | 0.700357 |\n| density | 0.458910 | 0.271296 | 0.096154 | 0.552517 | 0.362615 | 0.025717 | 0.032395 | 1.000000 | 0.011686 | 0.259478 | -0.686745 | 0.016351 | -0.390645 |\n| pH | -0.252700 | 0.261454 | -0.329808 | -0.267320 | 0.044708 | -0.145854 | -0.238413 | 0.011686 | 1.000000 | 0.192123 | 0.121248 | 0.020107 | -0.329129 |\n| sulphates | 0.299568 | 0.225984 | 0.056197 | -0.185927 | 0.395593 | -0.188457 | -0.275727 | 0.259478 | 0.192123 | 1.000000 | -0.003029 | -0.034046 | -0.487218 |\n| alcohol | -0.095452 | -0.037640 | -0.010493 | -0.359415 | -0.256916 | -0.179838 | -0.265740 | -0.686745 | 0.121248 | -0.003029 | 1.000000 | -0.051141 | 0.032970 |\n| quality_enc | 0.021794 | 0.151228 | -0.061789 | -0.048756 | 0.034499 | -0.085204 | -0.035252 | 0.016351 | 0.020107 | -0.034046 | -0.051141 | 1.000000 | -0.004598 |\n| type_enc | -0.486740 | -0.653036 | 0.187397 | 0.348821 | -0.512678 | 0.471644 | 0.700357 | -0.390645 | -0.329129 | -0.487218 | 0.032970 | -0.004598 | 1.000000 
|\n\n```python\nplt.figure(figsize=(12,8))\nsns.heatmap(wine_df.corr(numeric_only=True), annot=True, cmap='viridis')\n```\n\n![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_23.webp)\n\n```python\n# how does the quality correlate to measurements\nwine_df.corr(numeric_only=True)['quality_enc']\n```\n\n| Quality Correlation | |\n| -- | -- |\n| fixed acidity        |  0.021794 |\n| volatile acidity     |  0.151228 |\n| citric acid          | -0.061789 |\n| residual sugar       | -0.048756 |\n| chlorides            |  0.034499 |\n| free sulfur dioxide  | -0.085204 |\n| total sulfur dioxide | -0.035252 |\n| density              |  0.016351 |\n| pH                   |  0.020107 |\n| sulphates            | -0.034046 |\n| alcohol              | -0.051141 |\n| quality_enc          |  1.000000 |\n| type_enc             | -0.004598 |\n_Name: quality_enc, dtype: float64_\n\n```python\nwine_df.corr(numeric_only=True)['quality_enc'][:-2].sort_values().plot(\n    figsize=(12,5),\n    kind='bar',\n    title='Correlation of Measurements to Quality'\n)\n```\n\n![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_24.webp)\n\n\n#### Classification Model\n\n```python\n# separate target + remove string values\nX_wine = wine_df.drop(['quality_enc', 'quality', 'type'], axis=1)\ny_wine = wine_df['quality']\n\nprint(X_wine.shape, y_wine.shape)\n```\n\n```python\n# train-test split\nX_wine_train, X_wine_test, y_wine_train, y_wine_test = train_test_split(\n    X_wine,\n    y_wine,\n    test_size=0.1,\n    random_state=42\n)\n```\n\n```python\n# normalization\nscaler = StandardScaler()\nX_wine_train_scaled = scaler.fit_transform(X_wine_train)\nX_wine_test_scaled = scaler.transform(X_wine_test)\n```\n\n```python\n# create the SVC model using class_weight to balance out the\n# dataset that leans heavily towards non-frauds\nsvc_wine_base = svm.SVC(\n    kernel='rbf',\n    class_weight='balanced'\n)\n```\n\n```python\n# grid search\nparam_grid = {\n    'C': [0.5, 1, 1.5, 2, 2.5],\n    'gamma' : ['scale', 'auto']\n}\n\nwine_grid = GridSearchCV(svc_wine_base, param_grid)\nwine_grid.fit(X_wine_train_scaled, y_wine_train)\nprint('Best Params: ', wine_grid.best_params_)\n# Best Params:  {'C': 2.5, 'gamma': 'auto'}\n```\n\n```python\ny_wine_pred = wine_grid.predict(X_wine_test_scaled)\n```\n\n```python\nprint(\n    'Accuracy Score: ',\n    accuracy_score(y_wine_test, y_wine_pred, normalize=True).round(4)*100, '%'\n)\n# Accuracy Score:  84.77 %\n```\n\n```python\nreport_wine = classification_report(\n    y_wine_test, y_wine_pred\n)\nprint(report_wine)\n```\n\n|         | precision | recall | f1-score | support |\n| --      | -- | -- | -- | -- |\n| Fraud   | 0.16 | 0.68 | 0.26 |  25 |\n| Legit   | 0.99 | 0.85 | 0.92 | 625 |\n|     accuracy |      |      | 0.85 | 650 |\n|    macro avg | 0.57 | 0.77 | 0.59 | 650 |\n| weighted avg | 0.95 | 0.85 | 0.89 | 650 |\n\n```python\nconf_mtx_wine = confusion_matrix(y_wine_test, y_wine_pred)\nconf_mtx_wine\n\n# array([[ 17,   8],\n#        [ 91, 534]])\n```\n\n```python\nconf_mtx_wine_plot = ConfusionMatrixDisplay(\n    confusion_matrix=conf_mtx_wine\n)\n\nconf_mtx_wine_plot.plot(cmap='plasma')\n```\n\n![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_25.webp)
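\n\nAccuracy is a questionable target on a dataset with only ~4% frauds - note how the expanded grid below raises accuracy while fraud recall collapses. A hedged sketch, not in the original notebook, that scores the same grid on fraud recall instead:\n\n```python\nfrom sklearn.metrics import make_scorer, recall_score\n\n# optimize for recall on the 'Fraud' class instead of plain accuracy\nfraud_recall = make_scorer(recall_score, pos_label='Fraud')\n\nrecall_grid = GridSearchCV(svc_wine_base, param_grid, scoring=fraud_recall)\nrecall_grid.fit(X_wine_train_scaled, y_wine_train)\nrecall_grid.best_params_\n```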
\n\n```python\n# expand grid search\nparam_grid = {\n    'C': [1000, 1050, 1100, 1150, 1200],\n    'gamma' : ['scale', 'auto']\n}\n\nwine_grid = GridSearchCV(svc_wine_base, param_grid)\nwine_grid.fit(X_wine_train_scaled, y_wine_train)\nprint('Best Params: ', wine_grid.best_params_)\n# Best Params:  {'C': 1100, 'gamma': 'scale'}\n```\n\n```python\ny_wine_pred = wine_grid.predict(X_wine_test_scaled)\nprint('Accuracy Score: ', accuracy_score(y_wine_test, y_wine_pred, normalize=True).round(4)*100, '%')\n# Accuracy Score:  94.31 %\nreport_wine = classification_report(y_wine_test, y_wine_pred)\nprint(report_wine)\nconf_mtx_wine = confusion_matrix(y_wine_test, y_wine_pred)\n\nconf_mtx_wine_plot = ConfusionMatrixDisplay(\n    confusion_matrix=conf_mtx_wine\n)\n\nconf_mtx_wine_plot.plot(cmap='plasma')\n```\n\n\u003c!-- #region --\u003e\n|         | precision | recall | f1-score | support |\n| --      | -- | -- | -- | -- |\n| Fraud   | 0.29 | 0.32 | 0.30 |  25 |\n| Legit   | 0.97 | 0.97 | 0.97 | 625 |\n|     accuracy |      |      | 0.94 | 650 |\n|    macro avg | 0.63 | 0.64 | 0.64 | 650 |\n| weighted avg | 0.95 | 0.94 | 0.94 | 650 |\n\n\n![scikit-learn - Machine Learning in Python](https://github.com/mpolinowski/python-scikitlearn-cheatsheet/raw/master/assets/Scikit_Learn_26.webp)\n\u003c!-- #endregion --\u003e\n\n## Supervised Learning - Boosting Methods\n\n```python\n# dataset - label mushrooms as poisonous or edible\n!wget https://github.com/semnan-university-ai/Mushroom/raw/main/Mushroom.csv -P datasets\n```\n\n### Dataset Exploration\n\n```python\nshroom_df = pd.read_csv('datasets/Mushroom.csv')\nshroom_df.head(5).transpose()\n```\n\n\u003c!-- #region --\u003e\n[Mushroom Data Set](https://archive.ics.uci.edu/ml/datasets/mushroom)\n\n1. __cap-shape__: bell = `b`, conical = `c`, convex = `x`, flat = `f`,  knobbed = `k`, sunken = `s`\n2. __cap-surface__: fibrous = `f`, grooves = `g`, scaly = `y`, smooth = `s`\n3. __cap-color__: brown = `n`, buff = `b`, cinnamon = `c`, gray = `g`, green = `r`,  pink = `p`, purple = `u`, red = `e`, white = `w`, yellow = `y`\n4. __bruises?__: bruises = `t`, no = `f`\n5. __odor__: almond = `a`, anise = `l`, creosote = `c`, fishy = `y`, foul = `f`,  musty = `m`, none = `n`, pungent = `p`, spicy = `s`\n6. __gill-attachment__: attached = `a`, descending = `d`, free = `f`, notched = `n`\n7. __gill-spacing__: close = `c`, crowded = `w`, distant = `d`\n8. __gill-size__: broad = `b`, narrow = `n`\n9. __gill-color__: black = `k`, brown = `n`, buff = `b`, chocolate = `h`, gray = `g`,  green = `r`, orange = `o`, pink = `p`, purple = `u`, red = `e`,  white = `w`, yellow = `y`\n10. __stalk-shape__: enlarging = `e`, tapering = `t`\n11. __stalk-root__: bulbous = `b`, club = `c`, cup = `u`, equal = `e`,  rhizomorphs = `z`, rooted = `r`, missing = `?`\n12. __stalk-surface-above-ring__: fibrous = `f`, scaly = `y`, silky = `k`, smooth = `s`\n13. __stalk-surface-below-ring__: fibrous = `f`, scaly = `y`, silky = `k`, smooth = `s`\n14. __stalk-color-above-ring__: brown = `n`, buff = `b`, cinnamon = `c`, gray = `g`, orange = `o`,  pink = `p`, red = `e`, white = `w`, yellow = `y`\n15. __stalk-color-below-ring__: brown = `n`, buff = `b`, cinnamon = `c`, gray = `g`, orange = `o`,  pink = `p`, red = `e`, white = `w`, yellow = `y`\n16. __veil-type__: partial = `p`, universal = `u`\n17. __veil-color__: brown = `n`, orange = `o`, white = `w`, yellow = `y`\n18. __ring-number__: none = `n`, one = `o`, two = `t`\n19. __ring-type__: cobwebby = `c`, evanescent = `e`, flaring = `f`, large = `l`,  none = `n`, pendant = `p`, sheathing = `s`, zone = `z`\n20. 
__spore-print-color__: black = `k`, brown = `n`, buff = `b`, chocolate = `h`, green = `r`,  orange = `o`, purple = `u`, white = `w`, yellow = `y`\n21. __population__: abundant = `a`, clustered = `c`, numerous = `n`,  scattered = `s`, several = `v`, solitary = `y`\n22. __habitat__: grasses = `g`, leaves = `l`, meadows = `m`, paths = `p`,  urban = `u`, waste = `w`, woods = `d`\n\n\n|  | 0 | 1 | 2 | 3 | 4 |\n| -- | -- | -- | -- | -- | -- |\n| class | p | e | e | p | e |\n| cap-shape | x | x | b | x | x |\n| cap-surface | s | s | s | y | s |\n| cap-color | n | y | w | w | g |\n| bruises | t | t | t | t | f |\n| odor | p | a | l | p | n |\n| gill-attachment | f | f | f | f | f |\n| gill-spacing | c | c | c | c | w |\n| gill-size | n | b | b | n | b |\n| gill-color | k | k | n | n | k |\n| stalk-shape | e | e | e | e | t |\n| stalk-root | e | c | c | e | e |\n| stalk-surface-above-ring | s | s | s | s | s |\n| stalk-surface-below-ring | s | s | s | s | s |\n| stalk-color-above-ring | w | w | w | w | w |\n| stalk-color-below-ring | w | w | w | w | w |\n| veil-type | p | p | p | p | p |\n| veil-color | w | w | w | w | w |\n| ring-number | o | o | o | o | o |\n| ring-type | p | p | p | p | e |\n| spore-print-color | k | n | n | k | n |\n| population | s | n | n | s | a |\n| habitat | u | g | m | u | g |\n\u003c!-- #endregion --\u003e\n\n```python\nshroom_df.isnull().sum()\n```\n\n| | |\n| -- | -- |\n| class | 0 |\n| cap-shape | 0 |\n| cap-surface | 0 |\n| cap-color | 0 |\n| bruises | 0 |\n| odor | 0 |\n| gill-attachment | 0 |\n| gill-spacing | 0 |\n| gill-size | 0 |\n| gill-color | 0 |\n| stalk-shape | 0 |\n| stalk-root | 0 |\n| stalk-surface-above-ring | 0 |\n| stalk-surface-below-ring | 0 |\n| stalk-color-above-ring | 0 |\n| stalk-color-below-ring | 0 |\n| veil-type | 0 |\n| veil-color | 0 |\n| ring-number | 0 |\n| ring-type | 0 |\n| spore-print-color | 0 |\n| population | 0 |\n| habitat | 0 |\n_dtype: int64_\n\n```python\nfeature_df = shroom_df.describe().transpose().reset_index(\n    names=['feature']\n).sort_values(\n    'unique', ascending=False\n)\n```\n\n|    | feature | count | unique | top | freq |\n| -- | -- | -- | -- | -- | -- |\n| 9 | gill-color | 8124 | 12 | b | 1728 |\n| 3 | cap-color | 8124 | 10 | n | 2284 |\n| 20 | spore-print-color | 8124 | 9 | w | 2388 |\n| 5 | odor | 8124 | 9 | n | 3528 |\n| 15 | stalk-color-below-ring | 8124 | 9 | w | 4384 |\n| 14 | stalk-color-above-ring | 8124 | 9 | w | 4464 |\n| 22 | habitat | 8124 | 7 | d | 3148 |\n| 1 | cap-shape | 8124 | 6 | x | 3656 |\n| 21 | population | 8124 | 6 | v | 4040 |\n| 19 | ring-type | 8124 | 5 | p | 3968 |\n| 11 | stalk-root | 8124 | 5 | b | 3776 |\n| 12 | stalk-surface-above-ring | 8124 | 4 | s | 5176 |\n| 13 | stalk-surface-below-ring | 8124 | 4 | s | 4936 |\n| 17 | veil-color | 8124 | 4 | w | 7924 |\n| 2 | cap-surface | 8124 | 4 | y | 3244 |\n| 18 | ring-number | 8124 | 3 | o | 7488 |\n| 10 | stalk-shape | 8124 | 2 | t | 4608 |\n| 8 | gill-size | 8124 | 2 | b | 5612 |\n| 7 | gill-spacing | 8124 | 2 ","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmpolinowski%2Fpython-scikitlearn-cheatsheet","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmpolinowski%2Fpython-scikitlearn-cheatsheet","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmpolinowski%2Fpython-scikitlearn-cheatsheet/lists"}