{"id":20733385,"url":"https://github.com/moindalvs/gradient_boosting_algorithms_from_scratch","last_synced_at":"2025-10-30T10:40:20.059Z","repository":{"id":48663455,"uuid":"516778635","full_name":"MoinDalvs/Gradient_Boosting_Algorithms_From_Scratch","owner":"MoinDalvs","description":"4 Boosting Algorithms You Should Know – GBM, XGBoost, LightGBM \u0026 CatBoost","archived":false,"fork":false,"pushed_at":"2022-08-27T15:15:42.000Z","size":1130,"stargazers_count":9,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-14T09:57:50.935Z","etag":null,"topics":["boosting-algorithms","catboost-algorithm","data-science","decision-trees","gradient-boosting","lightbgm","random-forest","xgboost-algorithm"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MoinDalvs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-07-22T14:22:17.000Z","updated_at":"2025-04-12T10:18:02.000Z","dependencies_parsed_at":"2023-01-16T17:00:26.027Z","dependency_job_id":null,"html_url":"https://github.com/MoinDalvs/Gradient_Boosting_Algorithms_From_Scratch","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/MoinDalvs/Gradient_Boosting_Algorithms_From_Scratch","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MoinDalvs%2FGradient_Boosting_Algorithms_From_Scratch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MoinDalvs%2FGradient_Boosting_Algorithms_From_Scratch/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHu
b/repositories/MoinDalvs%2FGradient_Boosting_Algorithms_From_Scratch/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MoinDalvs%2FGradient_Boosting_Algorithms_From_Scratch/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MoinDalvs","download_url":"https://codeload.github.com/MoinDalvs/Gradient_Boosting_Algorithms_From_Scratch/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MoinDalvs%2FGradient_Boosting_Algorithms_From_Scratch/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":281793537,"owners_count":26562612,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-30T02:00:06.501Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["boosting-algorithms","catboost-algorithm","data-science","decision-trees","gradient-boosting","lightbgm","random-forest","xgboost-algorithm"],"created_at":"2024-11-17T05:25:07.866Z","updated_at":"2025-10-30T10:40:20.029Z","avatar_url":"https://github.com/MoinDalvs.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"## 0.1 Table of Contents\u003ca class=\"anchor\" id=\"0.1\"\u003e\u003c/a\u003e\n1. 
[Quick Introduction to Boosting (What is Boosting?)](#1)\n    - 1.1 [Gradient Boosting Machine (GBM)](#1.1)\n    - 1.2 [What is boosting?](#1.2)\n    - 1.3 [Improvements to Basic Gradient Boosting](#1.3)\n    - 1.4 [Summary](#1.4)\n    - 1.5 [Maths Intuition (Regression)](#1.5)\n    - 1.6 [Maths Intuition (Classification)](#1.6)\n2. [XGBM (Extreme Gradient Boosting Machine)](#2)\n    - 2.1 [XGBoost Features](#2.1)\n    - 2.2 [XGBM Optimizations](#2.2)\n    - 2.3 [System Features](#2.3)\n    - 2.4 [Algorithm Features](#2.4)\n    - 2.5 [Weak Learner Tree Splitting](#2.5)\n    - 2.6 [XGBoost Training Features](#2.6)\n    - 2.7 [XGBoost Algorithm — Parameters](#2.7)\n3. [LightGBM](#3)\n    - 3.1 [What are split points?](#3.1)\n    - 3.2 [How are the optimum split points created?](#3.2)\n    - 3.3 [Structural Differences](#3.3)\n    - 3.4 [What is Gradient-based One-Side Sampling (GOSS)?](#3.4)\n    - 3.5 [What is EFB(Exclusive Feature Bundling)?](#3.1)\n    - 3.6 [Advantages of Light GBM](#3.6)\n    - 3.7 [Performance comparison](#3.7)\n    - 3.8 [Tuning Parameters of Light GBM](#3.8)\n4. [CatBoost](#4)\n    - 4.1 [Compared computational efficiency](#4.1)\n    - 4.2 [Advantages of CatBoost](#4.2)\n\n## 1) Quick Introduction to Boosting (What is Boosting?)\u003ca class=\"anchor\" id=\"1\"\u003e\u003c/a\u003e\n### Picture this scenario:\n\nYou’ve built a linear regression model that gives you a decent 77% accuracy on the validation dataset. Next, you decide to expand your portfolio by building a k-Nearest Neighbour (KNN) model and a decision tree model on the same dataset. These models gave you an accuracy of 62% and 89% on the validation set respectively.\n\nIt’s obvious that all three models work in completely different ways. 
For instance, the linear regression model tries to capture linear relationships in the data while the decision tree model attempts to capture the non-linearity in the data.\n![image](https://user-images.githubusercontent.com/99672298/180591154-566b9fb9-14f7-403c-abe7-f280cb18a0c5.png)\\\nHow about, instead of using any one of these models for making the final predictions, we use a combination of all of these models?\n\nI’m thinking of an average of the predictions from these models. By doing this, we would be able to capture more information from the data, right?\n\nThat’s primarily the idea behind ensemble learning. And where does boosting come in?\n\nBoosting is one of the techniques that uses the concept of ensemble learning. A boosting algorithm combines multiple simple models (also known as weak learners or base estimators) to generate the final output.\n\n### An Ensemble of Weak Learners\nWhen training a new error-predicting model to predict a model’s current errors, we regularize its complexity to prevent overfitting. This regularized model will have ‘errors’ when predicting the original model’s ‘errors’. It might not necessarily predict 2. Since the new improved model’s prediction depends on the new error-predicting model’s prediction, it too will have errors albeit lower than before.\n\nTo mitigate this, we perform 2 measures. First, we reduce our reliance or trust on any single error predictor by applying a small weight, η (typically between 0 to 0.1) to its output. Then, instead of stopping after 1 iteration of improvement, we repeat the process multiple times, learning new error predictors for newly formed improved models until the accuracy or error is satisfactory. 
This is summed up using the equations below where x is an input.\n\n+ improved_model(x) = current_model(x) + η × error_prediction_model(x) \n+ current_model(x) = improved_model(x) \nRepeat above 2 steps till satisfactory\\\nTypically, the error-predicting model predicts the negative gradient and so, we use addition instead of a subtraction. After every iteration, a new predictor accounting for the errors of the previous model will be learned and added into the ensemble. The number of iterations to perform and η are hyperparameters.\n\n![image](https://user-images.githubusercontent.com/99672298/180615088-41f249ac-1993-4afa-9fd5-45622d44d45c.png)\n\n## 1.1 Gradient Boosting Machine (GBM)\u003ca class=\"anchor\" id=\"1.1\"\u003e\u003c/a\u003e\n\n[Table of Content](#0.1)\n\nA Gradient Boosting Machine or GBM is an ensemble machine learning algorithm that can be used for classification or regression predictive modeling problems, which combines the predictions from multiple decision trees to generate the final predictions. Keep in mind that all the weak learners in a gradient boosting machine are decision trees. The main objective of Gradient Boost is to minimize the loss function by adding weak learners using a gradient descent optimization algorithm. The generalization allowed arbitrary differentiable loss functions to be used, expanding the technique beyond binary classification problems to support regression, multi-class classification, and more.\n\nThe models that form the ensemble, also known as base learners, could be either from the same learning algorithm or different learning algorithms. Bagging and boosting are two widely used ensemble learners. Though these two techniques can be used with several statistical models, the most predominant usage has been with decision trees.\n\nBut if we are using the same algorithm, then how is using a hundred decision trees better than using a single decision tree? 
How do different decision trees capture different signals/information from the data?\n#### Bagging\nWhile decision trees are one of the most easily interpretable models, they exhibit highly variable behavior. Consider a single training dataset that we randomly split into two parts. Now, let’s use each part to train a decision tree in order to obtain two models.\n\nWhen we fit both these models, they would yield different results. Decision trees are said to be associated with high variance due to this behavior. Bagging, or bootstrap aggregation, helps to reduce the variance of a learner. Several decision trees, generated in parallel, form the base learners of the bagging technique. Data sampled with replacement is fed to these learners for training. The final prediction is the averaged output from all the learners.\n\nHere is the trick – the nodes in every decision tree take a different subset of features for selecting the best split. This means that the individual trees aren’t all the same and hence they are able to capture different signals from the data.\n\n![21 07 2022_12 45 18_REC](https://user-images.githubusercontent.com/99672298/180593714-ed5f2021-1aa2-4f30-861b-86af44fb1980.png)\n\n#### Boosting\nIn boosting, by contrast, the trees are built sequentially such that each subsequent tree aims to reduce the errors of the previous tree. Each tree learns from its predecessors and updates the residual errors. Hence, the tree that grows next in the sequence will learn from an updated version of the residuals.\n\nThe base learners in boosting are weak learners in which the bias is high, and the predictive power is just a tad better than random guessing. Each of these weak learners contributes some vital information for prediction, enabling the boosting technique to produce a strong learner by effectively combining these weak learners. As we already know, errors play a major role in any machine learning algorithm. 
There are mainly two types of error: bias error and variance error. The final strong learner helps bring down both the bias and the variance.\n\nIn contrast to bagging techniques like Random Forest, in which trees are grown to their maximum extent, boosting makes use of trees with fewer splits. Such small trees, which are not very deep, are highly interpretable. Parameters like the number of trees or iterations, the rate at which the gradient boosting learns, and the depth of the tree, could be optimally selected through validation techniques like k-fold cross validation. Having a large number of trees might lead to overfitting. So, it is necessary to carefully choose the stopping criteria for boosting.\n\n![image](https://user-images.githubusercontent.com/99672298/180591175-9472eed5-f0f1-4fb7-9dce-0b06bee28f12.png)\\\n\n### The first realization of boosting that saw great success in application was Adaptive Boosting or AdaBoost for short.\nAdaBoost is again a boosting method. The weak learners in AdaBoost are decision trees with a single split, called decision stumps for their shortness. This algorithm starts by assigning equal weights to all the data points and building a decision stump. Then it increases the weights for all the points which are misclassified and lowers the weight for those that are easy to classify or are correctly classified. A new decision stump is made for these weighted data points. The idea behind this is to improve the predictions made by the first stump. New weak learners are added sequentially that focus their training on the more difficult patterns. The main difference between AdaBoost and Gradient Boosting is that Gradient Boosting has a fixed base estimator, i.e., decision trees, whereas in AdaBoost we can change the base estimator according to our needs.\n\nAdaBoost uses multiple iterations to generate a single composite strong learner. It creates a strong learner by iteratively adding weak learners. 
During each phase of training, a new weak learner is added to the ensemble, and a weighting vector is adjusted to focus on examples that were misclassified in previous rounds. The result is a classifier that has higher accuracy than the weak learner classifiers.\n\n![image](https://user-images.githubusercontent.com/99672298/180611411-faf62bc8-9795-444d-85f4-3a6ffab8cec8.png)\n\nGradient Boosting trains many models in a gradual, additive and sequential manner. The major difference between AdaBoost and the Gradient Boosting algorithm is how the two algorithms identify the shortcomings of weak learners (e.g., decision trees). Like AdaBoost, Gradient Boost builds fixed-size trees based on the previous tree's errors, but unlike AdaBoost, each tree can be larger than a stump. Also unlike AdaBoost, Gradient Boost starts by making a single leaf, instead of a tree or a stump. While the AdaBoost model identifies the shortcomings by using high-weight data points, gradient boosting does the same by using gradients of the loss function.\n\n## 1.2 What is boosting?\u003ca class=\"anchor\" id=\"1.2\"\u003e\u003c/a\u003e\n\nWhile studying machine learning you must have come across this term called Boosting. Boosting is an ensemble learning technique to build a strong classifier from several weak classifiers in series. Boosting algorithms play a crucial role in dealing with bias-variance trade-offs. Unlike bagging algorithms, which only control for high variance in a model, boosting controls both the aspects (bias \u0026 variance) and is considered to be more effective.\n\nBelow are a few types of boosting algorithms:\n\n1. AdaBoost (Adaptive Boosting)\n2. Gradient Boosting\n3. XGBoost\n4. CatBoost\n5. LightGBM\n\nThe principle behind boosting algorithms is that we first build a model on the training dataset, and then build a second model to rectify the errors present in the first model. 
Let me try to explain what exactly this means and how it works.\n![image](https://user-images.githubusercontent.com/99672298/180592033-ba34e5cd-e22c-4a6b-a4b6-3588662a0cfb.png)\\\nSuppose you have n data points and 2 output classes (0 and 1). You want to create a model to detect the class of the test data. Now what we do is randomly select observations from the training dataset and feed them to model 1 (M1). We also assume that initially all the observations have an equal weight, which means an equal probability of getting selected.\n\nRemember in ensembling techniques the weak learners combine to make a strong model so here M1, M2, M3….Mn all are weak learners.\n\nSince M1 is a weak learner, it will surely misclassify some of the observations. Now before feeding the observations to M2 what we do is update the weights of the observations which are wrongly classified. You can think of it as a bag that initially contains 10 different color balls but after some time some kid takes out his favorite color ball and puts 4 red balls inside the bag instead. Now of course the probability of selecting a red ball is higher. The same phenomenon happens in boosting techniques: when an observation is wrongly classified, its weight gets increased, and for those which are correctly classified, their weights get decreased. The probability of selecting a wrongly classified observation gets increased, hence the next model is trained mostly on the observations that were misclassified by model 1.\n\nSimilarly, it happens with M2: the weights of the wrongly classified observations are again updated and then fed to M3. This procedure is continued until the errors are minimized and the dataset is predicted correctly. 
Now when a new data point comes in (test data), it passes through all the models (weak learners) and the class which gets the highest vote is the output for our test data.\n\n## 1.3 Improvements to Basic Gradient Boosting\u003ca class=\"anchor\" id=\"1.3\"\u003e\u003c/a\u003e\n### Gradient boosting is a greedy algorithm and can overfit a training dataset quickly.\n\nIt can benefit from regularization methods that penalize various parts of the algorithm and generally improve the performance of the algorithm by reducing overfitting.\n\nIn this section we will look at 4 enhancements to basic gradient boosting:\n\n+ Tree Constraints\n+ Shrinkage\n+ Random sampling\n+ Penalized Learning\n\n1. Tree Constraints\nIt is important that the weak learners have skill but remain weak.\n\nThere are a number of ways that the trees can be constrained.\n\nA good general heuristic is that the more constrained tree creation is, the more trees you will need in the model, and the reverse: the less constrained the individual trees, the fewer trees will be required.\n\nBelow are some constraints that can be imposed on the construction of decision trees:\n\n+ **Number of trees:** generally, adding more trees to the model can be very slow to overfit. The advice is to keep adding trees until no further improvement is observed.\n+ **Tree depth:** deeper trees are more complex trees and shorter trees are preferred. Generally, better results are seen with 4-8 levels.\n+ **Number of nodes or number of leaves:** like depth, this can constrain the size of the tree, but is not constrained to a symmetrical structure if other constraints are used.\n+ **Number of observations per split:** imposes a minimum constraint on the amount of training data at a training node before a split can be considered.\n+ **Minimum improvement to loss:** a constraint on the improvement of any split added to a tree.\n\n2. 
Weighted Updates\nThe predictions of each tree are added together sequentially.\n\nThe contribution of each tree to this sum can be weighted to slow down the learning by the algorithm. This weighting is called a shrinkage or a learning rate.\n\n3. Stochastic Gradient Boosting\nA big insight into bagging ensembles and random forest was allowing trees to be greedily created from subsamples of the training dataset.\n\nThis same benefit can be used to reduce the correlation between the trees in the sequence in gradient boosting models.\n\nThis variation of boosting is called stochastic gradient boosting.A few variants of stochastic boosting that can be used:\n\n+ Subsample rows before creating each tree.\n+ Subsample columns before creating each tree\n+ Subsample columns before considering each split.\n\n4. Penalized Gradient Boosting\nAdditional constraints can be imposed on the parameterized trees in addition to their structure.\n\n+ L1 regularization of weights.\n+ L2 regularization of weights.\n\n#### Gradient boosting is a greedy algorithm and can overfit a training dataset quickly. So regularization methods are used to improve the performance of the algorithm by reducing overfitting.\n\n+ **Subsampling:** This is the simplest form of regularization method introduced for GBM’s. This improves the generalization properties of the model and reduces the computation efforts. Subsampling introduces randomness into the fitting procedure. At each learning iteration, only a random part of the training data is used to fit a consecutive base-learner. The training data is sampled without replacement.\n+ **Shrinkage:** Shrinkage is commonly used in ridge regression where it shrinks regression coefficients to zero and, thus, reduces the impact of potentially unstable regression coefficients. In GBM’s, shrinkage is used for reducing the impact of each additionally fitted base-learner. 
It reduces the size of incremental steps and thus penalizes the importance of each consecutive iteration. The intuition behind this technique is that it is better to improve a model by taking many small steps than by taking fewer large steps. If one of the boosting iterations turns out to be erroneous, its negative impact can be corrected easily in subsequent steps.\n+ **Early Stopping:** One important practical consideration that can be derived from Decision Tree is early stopping or tree pruning. This means that if the ensemble was trimmed by the number of trees, corresponding to the validation set minima on the error curve, the overfitting would be circumvented at the minimal accuracy expense. Another observation is that the optimal number of boosts, at which the early stopping is considered, varies concerning the shrinkage parameter λ. Therefore, a trade-off between the number of boosts and λ should be considered.\n\n## 1.4 Summary:\u003ca class=\"anchor\" id=\"1.4\"\u003e\u003c/a\u003e\nGradient boosting involves three elements:\n\n+ 1. A loss function to be optimized.\n+ 2. A weak learner to make predictions.\n+ 3. An additive model to add weak learners to minimize the loss function.\n\n1. Loss Function\nThe loss function used depends on the type of problem being solved.\n\nIt must be differentiable, but many standard loss functions are supported and you can define your own.\n\nFor example, regression may use a squared error and classification may use logarithmic loss.\n\n2. 
Weak Learner\nDecision trees are used as the weak learner in gradient boosting.\n\nSpecifically regression trees are used that output real values for splits and whose output can be added together, allowing subsequent models outputs to be added and “correct” the residuals in the predictions.\n\nTrees are constructed in a greedy manner, choosing the best split points based on purity scores like Gini or to minimize the loss.\n\nInitially, such as in the case of AdaBoost, very short decision trees were used that only had a single split, called a decision stump. Larger trees can be used generally with 4-to-8 levels.\n\nIt is common to constrain the weak learners in specific ways, such as a maximum number of layers, nodes, splits or leaf nodes.\n\nThis is to ensure that the learners remain weak, but can still be constructed in a greedy manner.\n\n3. Additive Model\nTrees are added one at a time, and existing trees in the model are not changed.\n\nA gradient descent procedure is used to minimize the loss when adding trees.\n\nTraditionally, gradient descent is used to minimize a set of parameters, such as the coefficients in a regression equation or weights in a neural network. After calculating error or loss, the weights are updated to minimize that error.\n\nInstead of parameters, we have weak learner sub-models or more specifically decision trees. After calculating the loss, to perform the gradient descent procedure, we must add a tree to the model that reduces the loss (i.e. follow the gradient). 
We do this by parameterizing the tree, then modifying the parameters of the tree to move in the right direction (reducing the residual loss).\n\nThe output for the new tree is then added to the output of the existing sequence of trees in an effort to correct or improve the final output of the model.\n\nA fixed number of trees are added or training stops once loss reaches an acceptable level or no longer improves on an external validation dataset.\n\n\n[Table of Content](#0.1)\n## Maths Intuition\n### 1.5 Understand Gradient Boosting Algorithm with example (Regression)\u003ca class=\"anchor\" id=\"1.5\"\u003e\u003c/a\u003e\nLet’s understand the intuition behind Gradient boosting with the help of an example. Here our target column is continuous hence we will use Gradient Boosting Regressor.\n\nFollowing is a sample from a random dataset where we have to predict the car price based on various features. The target column is price and other features are independent features.\n\n![image](https://user-images.githubusercontent.com/99672298/180592447-05d51d72-bd76-40b2-850f-da745d8e0e75.png)\\\n_______________________________________________________________________________________________________________________________________________________________\n#### Step -1 The first step in gradient boosting is to build a base model to predict the observations in the training dataset. 
For simplicity we take an average of the target column and assume that to be the predicted value as shown below:\n_______________________________________________________________________________________________________________________________________________________________\n![image](https://user-images.githubusercontent.com/99672298/180592468-df49c744-2394-4254-b90f-63809377f4fb.png)\n\nLooking at this may give you a headache, but don’t worry we will try to understand what is written here.\n\nHere L is our loss function\n\nGamma is our predicted value\n\nargmin means we have to find a predicted value/gamma for which the loss function is minimum.\n\nSince the target column is continuous our loss function will be:\n\n![image](https://user-images.githubusercontent.com/99672298/180592483-f1f2c325-649a-4e1e-b866-949fb111a529.png)\\\n\nHere yi is the observed value\n\nAnd gamma is the predicted value\n\nNow we need to find the value of gamma for which this loss function is minimum. We all studied how to find minima and maxima in 12th grade: we differentiate the function and set the derivative equal to 0. We will do the same here.\n\n![image](https://user-images.githubusercontent.com/99672298/180592494-7fdb2c75-654f-45a9-b182-1ac654c43747.png)\n\nLet’s see how to do this with the help of our example. 
Remember that y_i is our observed value and gamma is our predicted value. By plugging the values into the above formula we get:\n\n![image](https://user-images.githubusercontent.com/99672298/180592500-c818dd45-37f4-492f-a639-715eb4cf0bba.png)\n\nWe end up with the average of the observed car prices, and this is why I asked you to take the average of the target column and assume it to be your first prediction.\n\nHence for gamma=14500, the loss function will be minimum so this value will become our prediction for the base model.\n![image](https://user-images.githubusercontent.com/99672298/180592508-eb40a933-f93a-401c-b225-fe751fd84807.png)\n_______________________________________________________________________________________________________________________________________________________________\n#### Step-2 The next step is to calculate the pseudo residuals which are (observed value – predicted value)\n_______________________________________________________________________________________________________________________________________________________________\n\n![20 07 2022_20 25 10_REC](https://user-images.githubusercontent.com/99672298/180602144-1cba6543-31a0-437f-b893-847962ac1744.png)\n\nAgain the question comes: why only observed – predicted? There is a mathematical reason for it, so let’s see where this formula comes from. This step can be written as:\n\n![image](https://user-images.githubusercontent.com/99672298/180592566-be077eb8-3843-4735-bf10-4269b26fd5e0.png)\n\nHere F(xi) is the previous model and m is the number of DT made.\n\nThe predicted value here is the prediction made by the previous model. 
In our example the prediction made by the previous model (initial base model prediction) is 14500, so to calculate the residuals our formula becomes:\n\n![image](https://user-images.githubusercontent.com/99672298/180592632-e80350f1-b5f7-4239-99d3-1f69af26087d.png)\n![image](https://user-images.githubusercontent.com/99672298/180592557-6acc1beb-8907-4353-af6f-8ddb627f0055.png)\n\nIn the next step, we will build a model on these pseudo residuals and make predictions. Why do we do this? Because we want to minimize these residuals and minimizing the residuals will eventually improve our model accuracy and prediction power. So, using the residual as the target and the original features cylinder number, cylinder height, and engine location, we will generate new predictions. Note that the predictions, in this case, will be the error values, not the predicted car price values since our target column is an error now.\n\nLet’s say hm(x) is our DT made on these residuals.\n_______________________________________________________________________________________________________________________________________________________________\n#### Step- 3 In this step we find the output values for each leaf of our decision tree. That means there might be a case where 1 leaf gets more than 1 residual, hence we need to find the final output of all the leaves. To find the output we can simply take the average of all the numbers in a leaf, whether there is only 1 number or more than 1.\\\n_______________________________________________________________________________________________________________________________________________________________\nLet’s see why we take the average of all the numbers. Mathematically this step can be represented as:\n\n![image](https://user-images.githubusercontent.com/99672298/180592730-8c4444e7-79e1-4f5f-a5b6-22e04c6c0a41.png)\n\nHere hm(xi) is the DT made on residuals and m is the number of DT. 
When m=1 we are talking about the 1st DT and when it is “M” we are talking about the last DT.\n\nThe output value for the leaf is the value of gamma that minimizes the loss function. The left-hand side “Gamma” is the output value of a particular leaf. On the right-hand side, [Fm-1(xi)+γhm(xi)] is similar to step 1, but the difference is that here we are taking previous predictions into account, whereas earlier there was no previous prediction.\n\nLet’s understand this even better with the help of an example. Suppose this is our regressor tree:\n\n![image](https://user-images.githubusercontent.com/99672298/180592739-c1ad662e-81a4-45db-95e1-690e5383f617.png)\n\nWe see that the 1st residual goes in R1,1, the 2nd and 3rd residuals go in R2,1, and the 4th residual goes in R3,1.\n\nLet’s calculate the output for the first leaf, that is R1,1\n\n![image](https://user-images.githubusercontent.com/99672298/180592751-4a6799b5-04e3-4898-95bd-3f95b87ec836.png)\n\nNow we need to find the value for gamma for which this function is minimum. So we find the derivative of this equation w.r.t. gamma and put it equal to 0.\n\n![image](https://user-images.githubusercontent.com/99672298/180592757-40d1952d-d41e-4801-8c75-7411fdcdd00e.png)\n\nHence the leaf R1,1 has an output value of -2500. Now let’s solve for R2,1\n\n![image](https://user-images.githubusercontent.com/99672298/180592813-221232e1-55c8-416c-9e55-cc7231d01ae8.png)\n\nLet’s take the derivative to get the value of gamma for which this function is minimum:\n\n![image](https://user-images.githubusercontent.com/99672298/180592825-83b97463-e9d2-4a92-8009-51139455f696.png)\n\nWe end up with the average of the residuals in the leaf R2,1. 
Hence if we get any leaf with more than 1 residual, we can simply find the average of that leaf and that will be our final output.\n\nNow after calculating the output of all the leaves, we get:\n\n![image](https://user-images.githubusercontent.com/99672298/180592833-a5b59e63-bd97-4da3-8b41-2d76155690d7.png)\n_______________________________________________________________________________________________________________________________________________________________\n#### Step-4 This is finally the last step where we have to update the predictions of the previous model. It can be updated as:\n_______________________________________________________________________________________________________________________________________________________________\n\n![image](https://user-images.githubusercontent.com/99672298/180592838-6d150e8f-9cd1-4b1c-a9b7-defced68e81b.png)\n\nwhere m is the number of decision trees made.\n\nSince we have just started building our model, m=1. Now to make a new DT our new predictions will be:\n\n![image](https://user-images.githubusercontent.com/99672298/180592844-6b95fe59-f048-49f7-8ff9-3bb64b6fa3d9.png)\n\nHere Fm-1(x) is the previous prediction. Since m=1, Fm-1(x) is F0(x), our base model, hence the previous prediction is 14500.\n\nnu is the learning rate, which is usually selected between 0 and 1. It reduces the effect each tree has on the final prediction, and this improves accuracy in the long run. Let’s take nu=0.1 in this example.\n\nhm(x) is the most recent DT made on the residuals.\n\nLet’s calculate the new prediction now:\n\n![image](https://user-images.githubusercontent.com/99672298/180592852-56692fcf-636a-41d1-aaee-0f16474df415.png)\n\n[Table of Content](#0.1)\n## Maths Intuition\n### 1.6 Gradient Boosting Classifier\u003ca class=\"anchor\" id=\"1.6\"\u003e\u003c/a\u003e\nWhat is Gradient Boosting Classifier?\nA gradient boosting classifier is used when the target column is binary. 
All the steps explained in the Gradient boosting regressor are used here, the only difference is we change the loss function. Earlier we used Mean squared error when the target column was continuous but this time, we will use log-likelihood as our loss function.\n\n![image](https://user-images.githubusercontent.com/99672298/180600358-32739748-9481-4bfd-bba3-7e58dc86f6eb.png)\n\nLet’s see how this loss function works,\nThe first step is creating an initial constant prediction value F₀. L is the loss function and we are using log loss (or more generally called cross-entropy loss) for it.\n_______________________________________________________________________________________________________________________________________________________________\n### Step 1\n_______________________________________________________________________________________________________________________________________________________________\n![image](https://user-images.githubusercontent.com/99672298/180600390-f0be5a77-4591-4c8b-8a1e-2f31009b59ad.png)\n\n![image](https://user-images.githubusercontent.com/99672298/180598331-ffc05535-2648-43c4-bc9b-806142c2a406.png)\n\nyᵢ is our classification target and it is either 0 or 1. p is the predicted probability of class 1. You might see L taking different values depending on the target class yᵢ.\n\n![image](https://user-images.githubusercontent.com/99672298/180598339-1f18ea4e-3c6a-4f91-9728-cad35b91faae.png)\n\nAs −log(x) is the decreasing function of x, the better the prediction (i.e. increasing p for yᵢ=1), the smaller loss we will have.\n\nargmin means we are searching for the value γ (gamma) that minimizes ΣL(yᵢ,γ). While it is more straightforward to assume γ is the predicted probability p, we assume γ is log-odds as it makes all the following computations easier. 
For those who forgot the log-odds definition, it is defined as log(odds) = log(p/(1-p)).\n\nTo be able to solve the argmin problem in terms of log-odds, we are transforming the loss function into a function of log-odds.\n\nOur first step in the gradient boosting algorithm was to initialize the model with some constant value. There we used the average of the target column, but here we’ll use log(odds) to get that constant value. The question is: why log(odds)?\n\nWhen we differentiate this loss function, we will get a function of log(odds), and then we need to find the value of log(odds) for which the loss function is minimum.\n\nConfused, right? Okay, let’s see how it works:\n\nLet’s first transform this loss function so that it is a function of log(odds); I’ll tell you later why we did this transformation.\n\n![image](https://user-images.githubusercontent.com/99672298/180598182-fa3f6c4b-05ba-40d8-b0d4-2ccce4dae77b.png)\n\nNow we might want to replace p in the above equation with something that is expressed in terms of log-odds. By transforming the log-odds expression shown earlier, p can be represented by log-odds:\n\n![image](https://user-images.githubusercontent.com/99672298/180598541-74d8e309-9fdc-4ebb-9849-716cda4754f2.png)\n\nThen, we substitute this value for p in the previous L equation and simplify it.\n\n![image](https://user-images.githubusercontent.com/99672298/180599089-9e0ddfc0-8f5b-4861-b1be-8bd58d736d01.png)\n\nNow this is our loss function, and we need to minimize it. For this, we take its derivative w.r.t. log(odds) and then set it equal to 0:\n\n![image](https://user-images.githubusercontent.com/99672298/180600153-6b966f24-22b0-4eb3-89bb-9af03278c1bf.png)\\\n\nIn the equations above, we replaced the fraction containing log-odds with p to simplify the equation. 
Next, we are setting ∂ΣL/∂log(odds) equal to 0 and solving it for p.\n\n![image](https://user-images.githubusercontent.com/99672298/180600167-184bf77d-82b5-4fa2-a56b-1b45ed99f8c0.png)\n\nIn this binary classification problem, y is either 0 or 1. So, the mean of y is actually the proportion of class 1. You might now see why we used p = mean(y) for our initial prediction.\n\nAs γ is log-odds instead of probability p, we are converting it into log-odds.\n\n![image](https://user-images.githubusercontent.com/99672298/180600342-5b65fb23-a0de-43cf-9786-2bffe3cd5232.png)\n_______________________________________________________________________________________________________________________________________________________________\n### Step2\n_______________________________________________________________________________________________________________________________________________________________\n\n![image](https://user-images.githubusercontent.com/99672298/180600786-92aa4e79-b132-4408-9caa-125175bb7051.png)\n\nThe whole step2 processes from 2–1 to 2–4 are iterated M times. M denotes the number of trees we are creating and the small m represents the index of each tree.\n_______________________________________________________________________________________________________________________________________________________________\n#### Step2-1\n_______________________________________________________________________________________________________________________________________________________________\n\n\n![image](https://user-images.githubusercontent.com/99672298/180600692-5b95f055-62d3-49d9-be58-220ba93165d6.png)\n\nWe are calculating residuals rᵢ𝑚 by taking a derivative of the loss function with respect to the previous prediction F𝑚-₁ and multiplying it by −1. As you can see in the subscript index, rᵢ𝑚 is computed for each single sample i. Some of you might be wondering why we are calling this rᵢ𝑚 residuals. 
This value is actually the negative gradient, which gives us the direction (+/−) and the magnitude in which the loss function can be minimized. You will see why we are calling it residuals shortly. By the way, this technique where you use a gradient to minimize the loss of your model is very similar to the gradient descent technique which is typically used to optimize neural networks. (In fact, they are slightly different from each other.)\n\nLet’s compute the residuals here. F𝑚-₁ in the equation means the prediction from the previous step. In this first iteration, it is F₀. As in the previous step, we are taking a derivative of L with respect to log-odds instead of p since our prediction F𝑚 is log-odds. Below we are using L expressed by log-odds which we got in the previous step.\n\n![image](https://user-images.githubusercontent.com/99672298/180600756-026564dd-fe2c-4548-9885-5e4d21b90286.png)\n\nIn the previous step, we also got this equation:\n\n![image](https://user-images.githubusercontent.com/99672298/180600760-bf137c5f-0e24-44c2-9dd7-eae2f8dd77d5.png)\n\nSo, we can replace the second term in the rᵢ𝑚 equation with p.\n\n![image](https://user-images.githubusercontent.com/99672298/180600767-9e79e9b5-42fc-49af-9a75-f45dc67b9417.png)\n\nYou might now see why we call r residuals. This also gives us an interesting insight: the negative gradient, which provides the direction and the magnitude in which the loss is minimized, is actually just the residuals.\n\n-----------------------------------------------------------------------------OR-----------------------------------------------------------------------------\n\n![image](https://user-images.githubusercontent.com/99672298/180598190-6bf2c163-ad33-43ce-a07d-77e556163a21.png)\n\nHere y are the observed values.\\\nYou might be wondering why we transformed the loss function into a function of log(odds). 
Actually, sometimes it is easier to use the function of log(odds), and sometimes it’s easier to use the function of predicted probability “p”.\n\nIt is not compulsory to transform the loss function; we did this just to make the calculations easier.\n\nHence the minimum value of this loss function will be our first prediction (base model prediction).\n\nNow in the Gradient boosting regressor our next step was to calculate the pseudo residuals, where we multiplied the derivative of the loss function by -1. We will do the same here, but now the loss function is different, and we are dealing with the probability of an outcome.\n\n![image](https://user-images.githubusercontent.com/99672298/180598203-60423cee-c552-44f7-ae93-a1c057db4adb.png)\n\nAfter finding the residuals we can build a decision tree with all the independent variables as features and the residuals as the target.\n\nOnce we have our first decision tree, we find the final output of the leaves, because there might be a case where a leaf gets more than one residual, so we need to calculate a single output value for it.\n\n![image](https://user-images.githubusercontent.com/99672298/180598211-b3ab87e1-f5b0-40af-80e4-4959b5a0f046.png)\n\nFinally, we are ready to get new predictions by adding the new tree we made on the residuals to our base model.\n_______________________________________________________________________________________________________________________________________________________________\n\n[Table of Content](#0.1)\n## 2. Extreme Gradient Boosting Machine (XGBM)\u003ca class=\"anchor\" id=\"2\"\u003e\u003c/a\u003e\n\n![image](https://user-images.githubusercontent.com/99672298/180611820-2137c89b-1484-418d-bde8-7818814751a2.png)\n\nXGBoost is an extension of gradient boosted decision trees (GBM) and is specially designed to improve speed and performance. In fact, XGBoost is simply an improved version of the GBM algorithm! The working procedure of XGBoost is the same as GBM. 
Its key ideas are `Regularized Learning`, `Gradient Tree Boosting` and `Shrinkage and Column Subsampling`. The trees in XGBoost are built sequentially, trying to correct the errors of the previous trees. It is an implementation of Gradient Boosting machines which exploits various optimizations to train powerful predictive models very quickly.\n\n### 2.1 XGBoost Features\u003ca class=\"anchor\" id=\"2.1\"\u003e\u003c/a\u003e\n+ **Regularized Learning:** The regularization term helps to smooth the final learned weights to avoid over-fitting. The regularized objective will tend to select a model employing simple and predictive functions.\n+ **Gradient Tree Boosting:** The tree ensemble model cannot be optimized using traditional optimization methods in Euclidean space. Instead, the model is trained in an additive manner.\n+ **Shrinkage and Column Subsampling:** Besides the regularized objective, two additional techniques are used to further prevent overfitting. The first technique is shrinkage, introduced by Friedman. Shrinkage scales newly added weights by a factor η after each step of tree boosting. Similar to a learning rate in stochastic optimization, shrinkage reduces the influence of each tree and leaves space for future trees to improve the model. The second technique is column (feature) subsampling.\n+ **Column and Row Subsampling:** To reduce training time, XGBoost provides the option of training every tree with only a randomly sampled subset of the original data rows, where the size of this subset is determined by the user. The same applies to the columns/features of the dataset. Apart from savings in training time, subsampling the columns during training has the effect of decorrelating the trees, which can reduce overfitting and boost model performance. This technique is also used in Random Forest. Column sub-sampling prevents over-fitting even more so than the traditional row sub-sampling. 
The usage of column sub-samples also speeds up computations of the parallel algorithm.\n\n#### But there are certain features that make XGBoost slightly better than GBM:\n\n+ One of the most important points is that XGBM implements parallel preprocessing (at the node level), which makes it faster than GBM. Parallel learning splits up the dataset so that multiple computers can work on it at the same time.\n+ XGBoost also includes a variety of regularization techniques that reduce overfitting and improve overall performance. You can select the regularization technique by setting the hyperparameters of the XGBoost algorithm.\n+ Additionally, if you are using the XGBM algorithm, you don’t have to worry about imputing missing values in your dataset. The XGBM model can handle the missing values on its own. During the training process, the model learns whether missing values should be in the right or left node.\n\n#### In other words, the first three parts give us a conceptual idea of how XGBoost is fit to training data and how it makes predictions, and the other parts we are going to discuss describe optimization techniques for large datasets\n\n![22 07 2022_16 05 44_REC](https://user-images.githubusercontent.com/99672298/180612232-b6f1e813-5f3e-4632-b3ad-c055d0e0b137.png)\n\n### 2.2 XGBM Optimizations:\u003ca class=\"anchor\" id=\"2.2\"\u003e\u003c/a\u003e\n+ **Exact Greedy Algorithm:** The main problem in tree learning is to find the best split. This algorithm enumerates all the possible splits on all the features. It is computationally demanding to enumerate all the possible splits for continuous features.\n+ **Approximate Algorithm:** The exact greedy algorithm is very powerful since it greedily enumerates over all possible splitting points. However, it is impossible to do so efficiently when the data does not fit entirely into memory. 
The Approximate Algorithm proposes candidate splitting points according to percentiles of the feature distribution. The algorithm then maps the continuous features into buckets split by these candidate points, aggregates the statistics, and finds the best solution among proposals based on the aggregated statistics. So when we have a huge training dataset, XGBoost uses the Approximate Greedy Algorithm.\n+ **Weighted Quantile Sketch:** One important step in the approximate algorithm is to propose candidate split points. Weighted Quantile Sketch merges the data into an approximate histogram for finding an approximate best split: before finding the best split, we form a histogram for each feature, and the boundaries of the histogram bins are then used as candidate points. In the Weighted Quantile Sketch, the data points are assigned weights based on the “confidence” of their current predictions, and the histograms are built such that each bin has approximately the same total weight (as opposed to the same number of points in the traditional quantile sketch). As a result, more candidate points, and thus a more detailed search, will exist in areas where the model is doing poorly. XGBoost has a distributed weighted quantile sketch algorithm to effectively handle weighted data.\n+ **Parallelization for faster tree building process:** When finding optimal splits, the evaluation of candidate points can be parallelized at the feature/column level. For example, core 1 can be finding the best split point and its corresponding loss for feature A while core 2 can be doing the same for feature B. In the end, we compare the losses and use the best one as the split point.\n+ **Sparsity-aware Split Finding:** In many real-world problems, it is quite common for the input x to be sparse. 
There are multiple possible causes for sparsity:\n  + Presence of missing values in the data\n  + Frequent zero entries in the statistics\n  + Artifacts of feature engineering such as one-hot encoding\n\n  It is important to make the algorithm aware of the sparsity pattern in the data, and XGBoost handles all sparsity patterns in a unified way. Each tree node gets a default direction for missing entries, chosen based on which direction reduces the loss more. On top of this, XGBoost ensures that sparse data are not iterated over during the split finding process, preventing unnecessary computation.\n+ **Hardware Optimizations:** XGBoost stores the frequently used gradient statistics (the gradients g and Hessians h) in the cache to minimize data access costs. When disk usage is required (due to data not fitting into memory), the data is compressed before storage, reducing the IO cost involved at the expense of some compression computation. If multiple disks exist, the data can be sharded to increase disk reading throughput.\n\n### 2.3 System Features\u003ca class=\"anchor\" id=\"2.3\"\u003e\u003c/a\u003e\nThe library provides a system for use in a range of computing environments, not least:\n\n+ **Parallelization:** Parallelization of tree construction using all of your CPU cores during training. Collecting statistics for each column can be parallelized, giving us a parallel algorithm for split finding.\n+ **Cache-aware Access:** XGBoost has been designed to make optimal use of hardware. This is done by allocating internal buffers in each thread, where the gradient statistics can be stored.\n+ **Distributed Computing** for training very large models using a cluster of machines.\n+ **Out-of-Core Computing** for very large datasets that don’t fit into memory.\n+ **Cache Optimization** of data structures and algorithm to make the best use of hardware.\n+ **Column Block for Parallel Learning**: The most time-consuming part of tree learning is to get the data into sorted order. 
In order to reduce the cost of sorting, the data is stored in column blocks, in sorted order and in compressed format.\n\n### 2.4 Algorithm Features\u003ca class=\"anchor\" id=\"2.4\"\u003e\u003c/a\u003e\nThe implementation of the algorithm was engineered for the efficiency of computing time and memory resources. A design goal was to make the best use of available resources to train the model. Some key algorithm implementation features include:\n\n+ **Sparse Aware implementation** with automatic handling of missing data values.\n+ **Block Structure** to support the parallelization of tree construction.\n+ **Continued Training** so that you can further boost an already fitted model on new data.\n\n### 2.5 Weak Learner Tree Splitting\u003ca class=\"anchor\" id=\"2.5\"\u003e\u003c/a\u003e\nSo far, we have the objective function for the t-th step. The next step is to build the t-th tree, and this tree should be constructed to reduce the objective function value as much as possible.\n\nIn order to build a tree that reduces the objective function value, we only allow node splits that reduce it, and we look for the best split, the one that reduces it the most.\n\nSo for each split we measure the reduction in the objective function value, that is, the difference between the tree's objective function value before and after the node split.\n\n![gain](https://user-images.githubusercontent.com/99672298/180615921-900d45fa-5aaf-412c-bbcc-6708ab2759f8.png)\n\nGain is how much the objective function value is reduced by the split.\n\n${Left}_{Similarity}$ is the similarity score of the left child leaf\n\n${Right}_{Similarity}$ is the similarity score of the right child leaf\n\n${Root}_{Similarity}$ is the similarity score of the parent leaf\n\nFor simplicity, each leaf computes its own Similarity Score, and the splitting gain can be expressed as\n\nLeft(Similarity Score) + Right(Similarity Score) - Parent(Similarity Score)\n\n![21 07 2022_15 11 06_REC](https://user-images.githubusercontent.com/99672298/180616098-a0b22e3a-ffd2-4254-834f-bd0833959cf8.png)\n![21 07 2022_15 25 
37_REC](https://user-images.githubusercontent.com/99672298/180616124-39cd86dc-442f-41b7-bf18-f0ddd9cd3555.png)\n![21 07 2022_16 15 17_REC](https://user-images.githubusercontent.com/99672298/180617051-8af16b85-3817-44b6-9c9f-8b41c625ecde.png)\n![21 07 2022_15 25 37_REC](https://user-images.githubusercontent.com/99672298/180617218-fa0b9fe1-13c8-4ae1-b93c-e2278a201e33.png)\n![21 07 2022_16 17 02_REC](https://user-images.githubusercontent.com/99672298/180617262-f9ccde6f-0106-47db-8640-42b36c97ae1a.png)\n![21 07 2022_16 17 29_REC](https://user-images.githubusercontent.com/99672298/180617265-fc2db054-f2cb-4913-90ff-f6556a45ba5b.png)\n![21 07 2022_16 17 55_REC](https://user-images.githubusercontent.com/99672298/180617271-f753a548-898b-403b-8adf-c2cad46568b7.png)\n![21 07 2022_19 09 57_REC](https://user-images.githubusercontent.com/99672298/180617307-ad884318-e96f-4db3-bd4f-7f025f67f470.png)\n![21 07 2022_19 21 56_REC](https://user-images.githubusercontent.com/99672298/180617317-83acf7db-8fb8-495e-bfaa-e38f0c45927e.png)\n![19 07 2022_22 19 52_REC](https://user-images.githubusercontent.com/99672298/180617330-cd26bd30-24c1-4095-a446-7a866467c9e6.png)\n![20 07 2022_12 32 54_REC](https://user-images.githubusercontent.com/99672298/180617331-fc0de8c9-d35a-45df-af83-c10c5e6d5b35.png)\n\n#### Simplified Summary For Regression and Classification\nThe tree node similarity score and the leaf output wᵢ depend on the chosen loss function, because gᵢ and hᵢ are the first- and second-order derivatives of the loss function.\n\nStatQuest with Josh Starmer gives a good simplified summary for quick reference.\n\n![image](https://user-images.githubusercontent.com/99672298/180617100-c2594a44-25e6-473d-8287-e6f0fd8cc56d.png)\n\n+ The Similarity Score is computed for every node in the tree\n+ The Output Value is normally computed only for leaf nodes (wᵢ)\n\n### 2.6 XGBoost Training Features\u003ca class=\"anchor\" id=\"2.6\"\u003e\u003c/a\u003e\n+ When searching for the best feature value for a node split, 
XGBoost provides an option to search over the feature’s quantiles or histogram bins instead of trying every feature value.\n+ When building a feature histogram, XGBoost may split the feature data across multiple computers to calculate partial histograms, then merge them back into an aggregate histogram, much like a Hadoop MapReduce operation. The generated histogram is cached for the next split.\n+ XGBoost can automatically handle missing values in features. In the node split step, XGBoost assigns all instances with a missing value to either the left or the right child, depending on which side has the larger gain.\n+ XGBoost provides many hyper-parameters to deal with overfitting.\n\n### 2.7 XGBoost Algorithm — Parameters\u003ca class=\"anchor\" id=\"2.7\"\u003e\u003c/a\u003e\na. General Parameters\n\nFollowing are the general parameters used in the XGBoost algorithm:\n\n+ **booster:** The default value is `gbtree`. You need to specify the booster to use: `gbtree` (tree-based) or `gblinear` (linear function).\n+ **num_pbuffer:** This is set automatically by XGBoost and does not need to be set by the user. Read the documentation of XGBoost for more details.\n+ **num_feature:** This is set automatically by XGBoost and does not need to be set by the user.\n\nb. Booster Parameters\n\nBelow are the tree-specific parameters of the XGBoost algorithm:\n\n+ **eta:** The default value is set to 0.3. This is the step size shrinkage used in each update to prevent overfitting. After each boosting step, we can directly get the weights of new features; eta shrinks these feature weights to make the boosting process more conservative. The range is 0 to 1. A low eta value means the model is more robust to overfitting.\n+ **gamma:** The default value is set to 0. This is the minimum loss reduction required to make a further partition on a leaf node of the tree. The larger gamma is, the more conservative the algorithm will be. The range is 0 to ∞. 
\n+ **max_depth:** The default value is set to 6. You need to specify the maximum depth of a tree. The range is 1 to ∞.\n+ **min_child_weight:** The default value is set to 1. You need to specify the minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with a sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression mode, this corresponds to the minimum number of instances needed in each node. The larger, the more conservative the algorithm will be. The range is 0 to ∞.\n+ **max_delta_step:** The default value is set to 0. This is the maximum delta step we allow each tree’s weight estimation to be. If the value is set to 0, it means there is no constraint. If it is set to a positive value, it can help make the update step more conservative. Usually, this parameter is not needed, but it might help in logistic regression, especially when a class is extremely imbalanced. Setting it to a value of 1–10 might help control the update. The range is 0 to ∞.\n+ **subsample:** The default value is set to 1. You need to specify the subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly samples half of the data instances to grow each tree, which helps prevent overfitting. The range is 0 to 1.\n+ **colsample_bytree:** The default value is set to 1. You need to specify the subsample ratio of columns when constructing each tree. The range is 0 to 1.\n\nc. Linear Booster Specific Parameters\n\nThese are the linear-booster-specific parameters of the XGBoost algorithm:\n\n+ **lambda and alpha:** These are regularization terms on weights. The default value of lambda is 1 and of alpha is 0.\n+ **lambda_bias:** L2 regularization term on bias, with a default value of 0.\n\nd. 
Learning Task Parameters\n\nFollowing are the learning task parameters of the XGBoost algorithm:\n\n+ **base_score:** The default value is set to 0.5. You need to specify the initial prediction score of all instances (the global bias).\n+ **objective:** The default value is set to `reg:linear` (named `reg:squarederror` in recent XGBoost releases). You need to specify the learning task and its objective, for example linear regression or Poisson regression.\n+ **eval_metric:** You need to specify the evaluation metrics for validation data. If none is given, a default metric is assigned according to the objective.\n+ **seed:** As always, here you specify the seed to reproduce the same set of outputs.\n\n_______________________________________________________________________________________________________________________________________________________________\n\n[Table of Content](#0.1)\n## 3 Light Gradient Boosting Machine\u003ca class=\"anchor\" id=\"3\"\u003e\u003c/a\u003e\n\n![image](https://user-images.githubusercontent.com/99672298/180640882-5b68aa13-21ef-4ece-8f7b-82a8998a679c.png)\n\n\nLightGBM extends the gradient boosting algorithm by adding a type of automatic feature selection as well as focusing on boosting examples with larger gradients. This can result in a dramatic speedup of training and improved predictive performance.\nLightGBM is able to handle huge amounts of data with ease. But keep in mind that this algorithm does not perform well with a small number of data points.\n\nLet’s take a moment to understand why that’s the case.\n\nThe trees in LightGBM have a leaf-wise growth, rather than a level-wise growth. After the first split, the next split is done only on the leaf node that has a higher delta loss.\n\nConsider the example I’ve illustrated in the below image:\n\n![image](https://user-images.githubusercontent.com/99672298/180636740-9397eccd-31ec-4ef4-b321-867284a25805.png)\n\nAfter the first split, the left node had a higher loss and is selected for the next split. 
Now, we have three leaf nodes, and the middle leaf node has the highest loss. The leaf-wise split of the LightGBM algorithm enables it to work with large datasets.\n\nIn order to speed up the training process, `LightGBM uses a histogram-based method for selecting the best split`. For any continuous variable, instead of using the individual values, these are divided into bins or buckets. This makes the training process faster and lowers memory usage.\n\nLight GBM is a fast, distributed, high-performance gradient boosting framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.\n\nSince it is based on decision tree algorithms, it splits the tree leaf-wise with the best fit, whereas other boosting algorithms split the tree depth-wise or level-wise. So when growing on the same leaf in Light GBM, the leaf-wise algorithm can reduce more loss than the level-wise algorithm, and hence it often results in better accuracy than other existing boosting algorithms achieve. 
Also, it is surprisingly fast, hence the word ‘Light’.\n\nBelow is a diagrammatic representation by the makers of Light GBM to explain the difference clearly.\n\n#### Level-wise tree growth in XGBOOST.\n\n![image](https://user-images.githubusercontent.com/99672298/180640379-3fb4e9da-74e0-486e-9062-4d6ce20d7613.png)\n\n#### Leaf wise tree growth in Light GBM.\n\n![image](https://user-images.githubusercontent.com/99672298/180640390-abbe1e54-36ff-4112-a4b8-1b98d7691403.png)\n\nLeaf-wise splits lead to an increase in complexity and may lead to overfitting. This can be overcome by setting another parameter, `max_depth`, which specifies the depth to which splitting will occur.\n\nThe costliest operation is training the decision tree, and the most time-consuming task is finding the optimum split points.\n### 3.1 What are split points?\u003ca class=\"anchor\" id=\"3.1\"\u003e\u003c/a\u003e\n\nSplit points are the feature values on which data is divided at a tree node. In the above example, data division happens at node 1 on Height (180) and at node 2 on Weight (80). The optimum splits are selected from a pool of candidate splits on the basis of information gain. In other words, split points with maximum information gain are selected.\n\n### 3.2 How are the optimum split points created?\u003ca class=\"anchor\" id=\"3.2\"\u003e\u003c/a\u003e\n\nSplit finding algorithms are used to find candidate splits.\nOne of the most popular split finding algorithms is the pre-sorted algorithm, which enumerates all possible split points on pre-sorted values. This method is simple but highly inefficient in terms of computation power and memory.\nThe second method is the histogram-based algorithm, which buckets continuous features into discrete bins to construct feature histograms during training. It costs O(#data * #feature) for histogram building and O(#bin * #feature) for split point finding. 
Since #bin is much smaller than #data, histogram building will dominate the computational complexity.\n\n![image](https://user-images.githubusercontent.com/99672298/180641741-f09e9643-010c-4df5-a0d6-658bc373500b.png)\n\nWhat makes LightGBM special?\n\nLightGBM aims to reduce the complexity of histogram building (O(#data * #feature)) by down-sampling data and features using GOSS and EFB.\nWhat makes LightGBM different is that it uses a unique technique called Gradient-based One-Side Sampling (GOSS) to filter out the data instances to find a split value. This is different from XGBoost, which uses pre-sorted and histogram-based algorithms to find the best split.\n\n![image](https://user-images.githubusercontent.com/99672298/180641782-8be3cf55-0e40-464a-b181-4236ab269e8d.png)\n\n### 3.3 Structural Differences\u003ca class=\"anchor\" id=\"3.3\"\u003e\u003c/a\u003e\nLightGBM uses a novel technique, Gradient-based One-Side Sampling (GOSS), to filter out the data instances for finding a split value, while XGBoost uses a pre-sorted algorithm and a histogram-based algorithm for computing the best split. Here instances mean observations/samples.\n\nFirst, let us understand how pre-sorted splitting works:\n\n1. For each node, enumerate over all features.\n2. For each feature, sort the instances by feature value.\n3. Use a linear scan to decide the best split along that feature on the basis of information gain.\n4. Take the best split solution across all the features.\n\nIn simple terms, the histogram-based algorithm splits all the data points for a feature into discrete bins and uses these bins to find the split value.\nWhile it is more efficient in training speed than the pre-sorted algorithm, which enumerates all possible split points on the pre-sorted feature values, it is still behind GOSS in terms of speed. So what makes the GOSS method efficient? In AdaBoost, the sample weight serves as a good indicator for the importance of samples. 
However, in Gradient Boosting Decision Tree (GBDT), there are no native sample weights, and thus the sampling methods proposed for AdaBoost cannot be directly applied. Here comes gradient-based sampling. The gradient represents the slope of the tangent of the loss function, so logically, if the gradients of some data points are large, these points are important for finding the optimal split point, as they have higher error. GOSS keeps all the instances with large gradients and performs random sampling on the instances with small gradients.\n\nFor example, let’s say I have 500k rows of data where 10k rows have higher gradients. So my algorithm will choose (10k rows of higher gradient + x% of the remaining 490k rows chosen randomly).\nAssuming x is 10%, the total number of rows selected is 59k (10k + 49k) out of 500k, and the split value is found on the basis of these rows. The basic assumption here is that training instances with small gradients have smaller training error and are already well trained. In order to keep the same data distribution, when computing the information gain, GOSS introduces a constant multiplier for the data instances with small gradients. Thus, GOSS achieves a good balance between reducing the number of data instances and keeping the accuracy for learned decision trees.\n\n### 3.4 What is GOSS?\u003ca class=\"anchor\" id=\"3.4\"\u003e\u003c/a\u003e\n\nGradient-based One-Side Sampling, or GOSS for short, is a modification to the gradient boosting method that focuses attention on those training examples that result in a larger gradient, in turn speeding up learning and reducing the computational complexity of the method.\n\nGOSS is a novel sampling method which down-samples the instances on the basis of gradients. As we know, instances with small gradients are well trained (small training error) and those with large gradients are under-trained. 
A naive approach to downsampling is to discard instances with small gradients and focus solely on instances with large gradients, but this would alter the data distribution. In a nutshell, GOSS retains instances with large gradients while performing random sampling on instances with small gradients.\n\nWith GOSS, we exclude a significant proportion of data instances with small gradients, and only use the rest to estimate the information gain. The LightGBM paper shows that, since the data instances with larger gradients play a more important role in the computation of information gain, GOSS can obtain quite an accurate estimate of the information gain with a much smaller data size.\n\nIntuitive steps for GOSS calculation\n1. Sort the instances by absolute gradient in descending order.\n2. Select the top a * 100% instances. [ Under-trained / large gradients ]\n3. Randomly sample b * 100% instances from the rest of the data. This will reduce the contribution of well-trained examples by a factor of b ( b \u003c 1 ).\n4. Without step 3, the fraction of samples having small gradients would be 1-a (after sampling it is b). In order to maintain the original distribution, LightGBM amplifies the contribution of the sampled small-gradient instances by a constant (1-a)/b. This puts more focus on the under-trained instances without changing the data distribution by much.\n\n### 3.5 What is EFB(Exclusive Feature Bundling)?\u003ca class=\"anchor\" id=\"3.5\"\u003e\u003c/a\u003e\n\nExclusive Feature Bundling, or EFB for short, is an approach for bundling sparse (mostly zero) mutually exclusive features, such as categorical variable inputs that have been one-hot encoded. As such, it is a type of automatic feature selection.\n\nRemember that histogram building takes O(#data * #feature). If we are able to reduce #feature, we will speed up tree learning. LightGBM achieves this by bundling features together. We generally work with high-dimensional data. 
Such data have many features which are mutually exclusive, i.e. they never take nonzero values simultaneously. LightGBM safely identifies such features and bundles them into a single feature to reduce the complexity to O(#data * #bundle) where #bundle \u003c\u003c #feature.\n\nPart 1 of EFB : Identifying features that could be bundled together\n\nIntuitive explanation for creating feature bundles\n\n1. Construct a graph whose weighted edges measure the conflict between features. Conflict measures how often two features take nonzero values for the same instance.\n2. Sort the features by count of nonzero instances in descending order.\n3. Loop over the ordered list of features and assign each feature to an existing bundle (if conflict \u003c threshold) or otherwise create a new bundle.\n\n#### Algorithm for merging features\n\nWe will try to understand the intuition behind merging features through an example. But before that, let’s answer the following question:\n\n#### What is EFB achieving?\n\nEFB merges features to reduce the training complexity. In order to keep the merge reversible, the values of exclusive features must reside in different bins.\n\n#### Example of the merge\n\nIn the example below you can see that feature1 and feature2 are mutually exclusive. In order to achieve non-overlapping buckets, we add the number of buckets of feature1 as an offset to feature2. This makes sure that nonzero data points of the bundled features ( feature1 and feature2 ) reside in different buckets. 
In feature_bundle, buckets 1 to 4 contain nonzero instances of feature1 and buckets 5 and 6 contain nonzero instances of feature2.\n\n![image](https://user-images.githubusercontent.com/99672298/180641900-3ed03585-a526-4582-aded-ae616c0ce31c.png)\n\n#### Intuitive explanation for merging features\n\n+ Calculate the offset to be added to every feature in the feature bundle.\n+ Iterate over every data instance and feature.\n+ Initialise the new bucket as zero for instances where all features are zero.\n+ Calculate the new bucket for every nonzero instance of a feature by adding the respective offset to the original bucket of that feature.\n\n### 3.6 Advantages of Light GBM\u003ca class=\"anchor\" id=\"3.6\"\u003e\u003c/a\u003e\n+ **Faster training speed and higher efficiency:** Light GBM uses a histogram-based algorithm, i.e. it buckets continuous feature values into discrete bins, which speeds up the training procedure.\n+ **Lower memory usage:** Replacing continuous values with discrete bins results in lower memory usage.\n+ **Often better accuracy than other boosting algorithms:** It produces much more complex trees by following a leaf-wise split approach rather than a level-wise approach, which is the main factor in achieving higher accuracy. However, it can sometimes lead to overfitting, which can be mitigated by setting the max_depth parameter.\n+ **Compatibility with Large Datasets:** It performs equally well on large datasets, with a significant reduction in training time compared to XGBoost.\n+ **Parallel learning supported.**\n\n### 3.7 Performance comparison\u003ca class=\"anchor\" id=\"3.7\"\u003e\u003c/a\u003e\n![image](https://user-images.githubusercontent.com/99672298/180640529-704578c6-ac98-485d-ba89-10131bdd9d81.png)\n\nApplying Light GBM instead of XGBoost gives only a slight increase in accuracy and AUC score, but there is a significant difference in the execution time of the training procedure. 
In this comparison, Light GBM is almost 7 times faster than XGBoost and is a much better approach when dealing with large datasets.\n\nThis turns out to be a huge advantage when you are working on large datasets in limited-time competitions.\n\n### 3.8 Tuning Parameters of Light GBM\u003ca class=\"anchor\" id=\"3.8\"\u003e\u003c/a\u003e\nLight GBM uses leaf-wise splitting rather than depth-wise splitting, which enables it to converge much faster but can also lead to overfitting. So here is a quick guide to tuning the parameters in Light GBM.\n\n**For best fit**\n+ **num_leaves :** This parameter sets the maximum number of leaves in a tree. In theory, a tree of depth max_depth has at most num_leaves = 2^(max_depth) leaves. However, this is not a good estimate in the case of Light GBM, since splitting takes place leaf-wise rather than depth-wise: a leaf-wise tree with the same number of leaves is typically much deeper. Hence num_leaves should be set smaller than 2^(max_depth), otherwise it may lead to overfitting, and the two parameters should be tuned separately rather than derived from each other.\n+ **min_data_in_leaf :** It is also one of the important parameters in dealing with overfitting. Setting its value too small may cause overfitting, so it must be set according to the size of the data. For large datasets, a value in the hundreds or thousands is usually enough.\n+ **max_depth:** It specifies the maximum depth or level up to which a tree can grow.\n \n\n**For faster speed**\n+ **bagging_fraction**: Sets the fraction of the data used for bagging, which gives faster results\n+ **feature_fraction :** Sets the fraction of the features to be used at each iteration\n+ **max_bin :** A smaller value of max_bin can save much time, as bucketing the feature values into fewer discrete bins is computationally inexpensive.\n \n\n**For better accuracy**\n+ **Use bigger training data**\n+ **num_leaves :** Setting it to a high value produces deeper trees with increased accuracy but can lead to overfitting. 
Hence a very high value is not preferred.\n+ **max_bin :** Setting it to a high value has an effect similar to increasing num_leaves and also slows down the training procedure.\n\n_______________________________________________________________________________________________________________________________________________________________\n\n[Table of Content](#0.1)\n## 4. CatBoost\u003ca class=\"anchor\" id=\"4\"\u003e\u003c/a\u003e\n\nAs the name suggests, CatBoost is a boosting algorithm that can handle categorical variables in the data. Most machine learning algorithms cannot work with strings or categories in the data. Thus, converting categorical variables into numerical values is an essential preprocessing step.\n\nCatBoost can internally handle categorical variables in the data. These variables are transformed to numerical ones using various statistics on combinations of features.\n\nAnother reason why CatBoost is widely used is that it works well with the default set of hyperparameters. Hence, as users, we do not have to spend a lot of time tuning the hyperparameters.\n\nThe name “CatBoost” comes from two words, “Category” and “Boosting”. As discussed, the library works well with multiple categories of data, such as audio, text and images, including historical data. “Boost” comes from the gradient boosting machine learning algorithm, on which this library is based. Gradient boosting is a powerful machine learning algorithm that is widely applied to multiple types of business challenges such as fraud detection, item recommendation and forecasting, and it performs well on them. It can also return very good results with relatively little data, unlike deep learning (DL) models that need to learn from a massive amount of data.\n\nCatBoost builds upon the theory of decision trees and gradient boosting. 
The main idea of boosting is to sequentially combine many weak models (models performing slightly better than random chance) and thus, through greedy search, create a strong, competitive predictive model. Because gradient boosting fits the decision trees sequentially, the fitted trees learn from the mistakes of former trees and hence reduce the errors. This process of adding a new function to the existing ones continues until the selected loss function no longer decreases.\n\nCatBoost grows its decision trees differently from similar gradient boosting models. CatBoost grows oblivious trees, which means that the trees are grown under the rule that all nodes at the same level test the same predictor with the same condition, and hence the index of a leaf can be calculated with bitwise operations. The oblivious tree procedure allows for a simple fitting scheme and efficiency on CPUs, while the tree structure acts as a regularizer that helps find a good solution and avoid overfitting.\n\n### 4.1 Compared computational efficiency:\u003ca class=\"anchor\" id=\"4.1\"\u003e\u003c/a\u003e\n\n![image](https://user-images.githubusercontent.com/99672298/180643905-c5459310-0513-41dc-9377-6b4e26fef05b.png)\n\n### 4.2 Advantages of CatBoost Library\u003ca class=\"anchor\" id=\"4.2\"\u003e\u003c/a\u003e\n+ **Performance:** CatBoost provides state-of-the-art results and is competitive with any leading machine learning algorithm on the performance front.\n+ **Handling Categorical features automatically:**  We can use CatBoost without any explicit pre-processing to convert categories into numbers. CatBoost converts categorical values into numbers using various statistics on combinations of categorical features and combinations of categorical and numerical features. 
The CatBoost documentation describes these transformations in more detail.\n+ **Robust:** It reduces the need for extensive hyper-parameter tuning and lowers the chances of overfitting, which leads to more generalized models. That said, CatBoost still has multiple parameters to tune, such as the number of trees, learning rate, regularization, tree depth, fold size, bagging temperature and others. All of these parameters are described in the CatBoost documentation.\n\n_______________________________________________________________________________________________________________________________________________________________\n\n[Table of Content](#0.1)\n\n## Author\n\n\u003ctable\u003e\n\u003ctr\u003e\n\u003ctd\u003e\n     \u003cimg src=\"https://avatars.githubusercontent.com/u/99672298?v=4\" width=\"180\"/\u003e\n     \n     moindalvs@gmail.com\n\n\u003cp align=\"center\"\u003e\n\u003ca href = \"https://github.com/MoinDalvs\"\u003e\u003cimg src = \"http://www.iconninja.com/files/241/825/211/round-collaboration-social-github-code-circle-network-icon.svg\" width=\"36\" height = \"36\"/\u003e\u003c/a\u003e\n\u003ca href = \"https://twitter.com/DalvsHubot\"\u003e\u003cimg src = \"https://www.shareicon.net/download/2016/07/06/107115_media.svg\" width=\"36\" height=\"36\"/\u003e\u003c/a\u003e\n\u003ca href = \"https://www.linkedin.com/in/moin-dalvi-277b0214a//\"\u003e\u003cimg src = \"http://www.iconninja.com/files/863/607/751/network-linkedin-social-connection-circular-circle-media-icon.svg\" width=\"36\" height=\"36\"/\u003e\u003c/a\u003e\n\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e \n  \u003c/table\u003e\n  \n\u003cdiv style=\"display:fill;\n            border-radius: false;\n            border-style: solid;\n            border-color:#000000;\n            border-style: false;\n            border-width: 2px;\n            color:#CF673A;\n            font-size:15px;\n            font-family: Georgia;\n            background-color:#E8DCCC;\n            text-align:center;\n            letter-spacing:0.1px;\n      
      padding: 0.1em;\"\u003e\n\n**\u003ch2\u003e♡ Thank you for taking the time ♡**\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmoindalvs%2Fgradient_boosting_algorithms_from_scratch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmoindalvs%2Fgradient_boosting_algorithms_from_scratch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmoindalvs%2Fgradient_boosting_algorithms_from_scratch/lists"}