{"id":16133641,"url":"https://github.com/gabboraron/intro_to_machine_learning-kaggle","last_synced_at":"2026-04-12T23:38:45.472Z","repository":{"id":131753109,"uuid":"451227161","full_name":"gabboraron/Intro_to_Machine_Learning-Kaggle","owner":"gabboraron","description":"kaggle course about machine learning basics","archived":false,"fork":false,"pushed_at":"2022-02-11T13:46:42.000Z","size":79,"stargazers_count":0,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-02-12T21:34:13.221Z","etag":null,"topics":["kaggle","kaggle-courses","machine-learning","model-validation","random-forest","underfitting-overfitting"],"latest_commit_sha":null,"homepage":"https://www.kaggle.com/learn/intro-to-machine-learning","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gabboraron.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-01-23T20:49:03.000Z","updated_at":"2024-01-04T16:39:50.000Z","dependencies_parsed_at":"2023-06-05T21:30:13.337Z","dependency_job_id":null,"html_url":"https://github.com/gabboraron/Intro_to_Machine_Learning-Kaggle","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gabboraron%2FIntro_to_Machine_Learning-Kaggle","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gabboraron%2FIntro_to_Machine_Learning-Kaggle/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gabboraron%2F
Intro_to_Machine_Learning-Kaggle/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gabboraron%2FIntro_to_Machine_Learning-Kaggle/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gabboraron","download_url":"https://codeload.github.com/gabboraron/Intro_to_Machine_Learning-Kaggle/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247502473,"owners_count":20949266,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["kaggle","kaggle-courses","machine-learning","model-validation","random-forest","underfitting-overfitting"],"created_at":"2024-10-09T22:45:15.036Z","updated_at":"2025-10-29T11:10:29.348Z","avatar_url":"https://github.com/gabboraron.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Intro to Machine Learning - kaggle\n\u003e ## Summary from article *[How to build your own Neural Network from scratch in Python](https://towardsdatascience.com/how-to-build-your-own-neural-network-from-scratch-in-python-68998a08e4f6)* \n\u003e \n\u003e The output `ŷ` of a simple 2-layer Neural Network is: ![ŷ=r(W2r(W1x+b1)+b2)](https://miro.medium.com/max/355/1*E1_l8PGamc2xTNS87XGNcA.png) where `W` are weights and `b` are biases, and only these affect the output `ŷ`. The right values for the weights and biases determine the strength of the predictions. The process of fine-tuning the weights and biases from the input data is known as training. 
\n\u003e \n\u003e Each iteration of training consists of:\n\u003e - Calculating the predicted output `ŷ`, known as `feedforward`\n\u003e - Updating the weights and biases, known as `backpropagation`\n\u003e \n\u003e A way to evaluate the “goodness” of our predictions is the loss function.\n\u003e \n\u003e There are many available loss functions, and the nature of our problem should dictate our choice of loss function. In this tutorial, we’ll use a simple sum-of-squares error as our loss function: ![Sum_of_squares_error = sum(from i=1; to n)(y-ŷ)^2](https://miro.medium.com/max/300/1*iNa1VLdaeqwUAxpNXs3jwQ.png) That is, the sum-of-squares error is simply the sum of the squared differences between each predicted value and the actual value. The difference is squared so that its sign does not matter and every term is non-negative. \n\u003e \n\u003e **Our goal in training is to find the best set of weights and biases that minimizes the loss function.**\n\u003e \n\u003e We need to find a way to propagate the error back, and to update our weights and biases. In order to know the appropriate amount to adjust the weights and biases by, we need to know the derivative of the loss function with respect to the weights and biases. Recall from calculus that the derivative of a function is simply the slope of the function. If we have the derivative, we can simply update the weights and biases by increasing or reducing them according to it. This is known as gradient descent. However, we can’t directly calculate the derivative of the loss function with respect to the weights and biases because the equation of the loss function does not contain the weights and biases. Therefore, we need the chain rule to help us calculate it. \n\u003e ![Loss(y,ŷ) = sum(from i=1; to n)((y-ŷ)^2) =\u003e (d Loss(y,ŷ) / d W) = 2(y-ŷ) * z(1-z) * x ](https://miro.medium.com/max/700/1*7zxb2lfWWKaVxnmq2o69Mw.png)\n\u003e \n\u003e Our Neural Network should learn the ideal set of weights to represent this function. 
Note that it isn’t exactly trivial for us to work out the weights just by inspection alone.\n\u003e \n\u003e code: [github.com/jamesloyys/ ... neural_network_backprop-py](https://gist.github.com/jamesloyys/ff7a7bb1540384f709856f9cdcdee70d#file-neural_network_backprop-py)\n\n\n## basics\n```Python\nfrom learntools.core import binder\nbinder.bind(globals())\nfrom learntools.machine_learning.ex2 import *\nprint(\"Setup Complete\")\n\nimport pandas as pd\n# Path of the file to read\niowa_file_path = '../input/home-data-for-ml-course/train.csv'\n\n# Fill in the line below to read the file into a variable home_data\nhome_data = pd.read_csv(iowa_file_path)\n\n# Call line below with no argument to check that you've loaded the data correctly\nstep_1.check()\n```\n- `X.describe()` - view summary statistics of the dataset `X`\n- `X.head()` - get the first few rows of the dataset (five by default)\n- `round(argument)` - round the argument to the nearest integer\n- `home_data[\"LotArea\"].mean()` - get the mean of column *LotArea*\n\n\n## Your First Machine Learning Model\nTo list the columns of the dataset:\n\n````Python\nimport pandas as pd\n\nmelbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'\nmelbourne_data = pd.read_csv(melbourne_file_path) \nmelbourne_data.columns\n\n# dropna drops missing values (think of na as \"not available\")\nmelbourne_data = melbourne_data.dropna(axis=0)\n````\n\nWe'll use the `.` dot notation to select the column we want to predict, which is called the prediction target. By convention, the prediction target is called `y`. So the code we need to save the house prices in the Melbourne data is\n```Python\ny = melbourne_data.Price\n```\n\nSometimes, you will use all columns except the target as features. Other times you'll be better off with fewer features. We select multiple features by providing a list of column names inside brackets. Each item in that list should be a string (with quotes).  
By convention, this data is called `X`.\n\n````Python\nmelbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']\n\nX = melbourne_data[melbourne_features]\n````\n\n### Building Your Model\nUse [scikit-learn](https://scikit-learn.org/stable/) to create your models. \n\nBuilding a model takes the following steps:\n1. Define: What type of model will it be?\n2. Fit: capture patterns from the provided data\n3. Predict: use the fitted model to make predictions\n4. Evaluate: determine how accurate the model's predictions are\n\nExample of defining a decision tree:\n\n````Python\nfrom sklearn.tree import DecisionTreeRegressor\n\n# Define model. Specify a number for random_state to ensure same results each run\nmelbourne_model = DecisionTreeRegressor(random_state=1)\n\n# Fit model\nmelbourne_model.fit(X, y)\n````\n\u003e Many machine learning models allow some randomness in model training. Specifying a number for random_state ensures you get the same results in each run. This is considered a good practice. You can use any number, and model quality won't depend meaningfully on exactly what value you choose.\n````Python\n# Making predictions for the first 5 houses (X.head())\nprint(melbourne_model.predict(X.head()))\n````\n\n## Model Validation\nYou've built a model. But how good is it?\n\nYou'd first need to summarize the model quality in an understandable way. If you compare predicted and actual home values for 10,000 houses, you'll likely find a mix of good and bad predictions. Looking through a list of 10,000 predicted and actual values would be pointless. We need to summarize this into a single metric.\n\nThere are many metrics for summarizing model quality, but we'll start with one called **Mean Absolute Error (also called MAE)**. Let's break down this metric starting with the last word, error. 
This is: `error=actual−predicted`. MAE takes the absolute value of each error and then averages those absolute errors.\n\n```Python\n# Data Loading Code Hidden Here\nimport pandas as pd\n\n# Load data\nmelbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'\nmelbourne_data = pd.read_csv(melbourne_file_path) \n# Filter rows with missing values\nfiltered_melbourne_data = melbourne_data.dropna(axis=0)\n# Choose target and features\ny = filtered_melbourne_data.Price\nmelbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', \n                        'YearBuilt', 'Lattitude', 'Longtitude']\nX = filtered_melbourne_data[melbourne_features]\n\nfrom sklearn.tree import DecisionTreeRegressor\n# Define model\nmelbourne_model = DecisionTreeRegressor()\n# Fit model\nmelbourne_model.fit(X, y)\n\n\n# calculate the mean absolute error\nfrom sklearn.metrics import mean_absolute_error\n\npredicted_home_prices = melbourne_model.predict(X)\nmean_absolute_error(y, predicted_home_prices)\n```\n\n\u003e Imagine that, in the large real estate market, door color is unrelated to home price.\n\u003e \n\u003e However, in the sample of data you used to build the model, all homes with green doors were very expensive. The model's job is to find patterns that predict home prices, so it will see this pattern, and it will always predict high prices for homes with green doors.\n\u003e \n\u003e Since this pattern was derived from the training data, the model will appear accurate in the training data.\n\u003e \n\u003e But if this pattern doesn't hold when the model sees new data, the model would be very inaccurate when used in practice.\n\u003e \n\u003e The most straightforward way to measure performance on new data is to exclude some data from the model-building process, and then use those to test the model's accuracy on data it hasn't seen before. This data is called **validation data**.\n\nThe `scikit-learn` library has a function `train_test_split` to break up the data into two pieces. 
We'll use some of that data as training data to fit the model, and we'll use the other data as validation data to calculate `mean_absolute_error`.\n\n````Python\nfrom sklearn.model_selection import train_test_split\n\n# split data into training and validation data, for both features and target\n# The split is based on a random number generator. Supplying a numeric value to\n# the random_state argument guarantees we get the same split every time we\n# run this script.\ntrain_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)\n# Define model\nmelbourne_model = DecisionTreeRegressor()\n# Fit model\nmelbourne_model.fit(train_X, train_y)\n\n# get predicted prices on validation data\nval_predictions = melbourne_model.predict(val_X)\nprint(mean_absolute_error(val_y, val_predictions))\n````\n\n## Underfitting and Overfitting\nAt the end of this step, you will understand the concepts of underfitting and overfitting, and you will be able to apply these ideas to make your models more accurate.\n\nSet parameters for [DecisionTreeRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html). The most important thing is to set the tree's depth; *a tree's depth is a measure of how many splits it makes before coming to a prediction*. \n\n\u003e If a tree only had 1 split, it divides the data into 2 groups. If each group is split again, we would get 4 groups of houses. Splitting each of those again would create 8 groups. If we keep doubling the number of groups by adding more splits at each level, we'll have 2^10 groups of houses by the time we get to the 10th level. That's 1024 leaves. With so many leaves, each leaf contains only a few houses, so its prediction closely matches those houses but may be unreliable for new data.\n\u003e \n\u003e This is a phenomenon called overfitting, where a model matches the training data almost perfectly, but does poorly in validation and other new data. 
On the flip side, if we make our tree very shallow, it doesn't divide up the houses into very distinct groups.\n\u003e \n\u003e At an extreme, if a tree divides houses into only 2 or 4 groups, each group still has a wide variety of houses. Resulting predictions may be far off for most houses, even in the training data (and it will be bad in validation too for the same reason). When a model fails to capture important distinctions and patterns in the data, so it performs poorly even in training data, that is called underfitting.\n\nThe `max_leaf_nodes` argument provides a very sensible way to control overfitting vs underfitting. The more leaves we allow the model to make, the more we move from underfitting toward overfitting.\n\nWe can use a utility function to help compare MAE scores from different values for `max_leaf_nodes`:\n\n```Python\nfrom sklearn.metrics import mean_absolute_error\nfrom sklearn.tree import DecisionTreeRegressor\n\ndef get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):\n    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)\n    model.fit(train_X, train_y)\n    preds_val = model.predict(val_X)\n    mae = mean_absolute_error(val_y, preds_val)\n    return mae\n```\n\n```Python\n# Data Loading Code Runs At This Point\nimport pandas as pd\n    \n# Load data\nmelbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'\nmelbourne_data = pd.read_csv(melbourne_file_path) \n# Filter rows with missing values\nfiltered_melbourne_data = melbourne_data.dropna(axis=0)\n# Choose target and features\ny = filtered_melbourne_data.Price\nmelbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 'YearBuilt', 'Lattitude', 'Longtitude']\nX = filtered_melbourne_data[melbourne_features]\n\nfrom sklearn.model_selection import train_test_split\n# split data into training and validation data, for both features and target\ntrain_X, val_X, train_y, val_y = train_test_split(X, 
y,random_state = 0)\n\n\n# compare MAE with differing values of max_leaf_nodes\nfor max_leaf_nodes in [5, 50, 500, 5000]:\n    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)\n    print(\"Max leaf nodes: %d  \\t\\t Mean Absolute Error:  %d\" %(max_leaf_nodes, my_mae))\n```\nfile: [exercise-underfitting-and-overfitting.ipynb](https://github.com/gabboraron/Intro_to_Machine_Learning-Kaggle/blob/main/exercise-underfitting-and-overfitting.ipynb); Kaggle version [kaggle.com/sndorburian](https://www.kaggle.com/sndorburian/exercise-underfitting-and-overfitting)\n\n## Random Forests\nA deep tree with lots of leaves will overfit because each prediction is coming from historical data from only the few houses at its leaf. But a shallow tree with few leaves will perform poorly because it fails to capture as many distinctions in the raw data.\n \nThe random forest uses many trees, and it makes a prediction by averaging the predictions of each component tree. It generally has much better predictive accuracy than a single decision tree and it works well with default parameters.\n\nWe will use the variables from `train_X, val_X, train_y, val_y = train_test_split(X, y,random_state = 0)`\n\nWe build a [random forest](https://en.wikipedia.org/wiki/Random_forest) model similarly to how we built a decision tree in scikit-learn - this time using the [`RandomForestRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) class instead of `DecisionTreeRegressor`.\n\n```Python\nfrom sklearn.ensemble import RandomForestRegressor\nfrom sklearn.metrics import mean_absolute_error\n\nforest_model = RandomForestRegressor(random_state=1)\nforest_model.fit(train_X, train_y)\nmelb_preds = forest_model.predict(val_X)\nprint(mean_absolute_error(val_y, melb_preds))\n```\n\n\u003e There is likely room for further improvement, but this is a big improvement over the best decision tree error of 250,000. 
There are parameters which allow you to change the performance of the Random Forest much as we changed the maximum depth of the single decision tree. But one of the best features of Random Forest models is that they generally work reasonably even without this tuning.\n\n## Conclusion\nAll in one example: [Exercise: Housing Prices Competition: exercise-machine-learning-competitions.ipynb](https://github.com/gabboraron/Intro_to_Machine_Learning-Kaggle/blob/main/exercise-machine-learning-competitions.ipynb)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgabboraron%2Fintro_to_machine_learning-kaggle","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgabboraron%2Fintro_to_machine_learning-kaggle","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgabboraron%2Fintro_to_machine_learning-kaggle/lists"}