{"id":23398873,"url":"https://github.com/leogaudin/dslr","last_synced_at":"2025-04-08T19:38:39.021Z","repository":{"id":266697348,"uuid":"898980847","full_name":"leogaudin/dslr","owner":"leogaudin","description":"42 • A guide to implement a linear classification model: logistic regression.","archived":false,"fork":false,"pushed_at":"2025-01-04T14:31:44.000Z","size":2651,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-14T15:24:05.197Z","etag":null,"topics":["42","classifier-training","data-science","dslr","gradient-descent","logistic-regression","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/leogaudin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-12-05T11:58:19.000Z","updated_at":"2025-01-04T14:31:48.000Z","dependencies_parsed_at":"2025-01-04T15:22:56.789Z","dependency_job_id":"23a6a741-7273-4515-ac8a-84eca74a7465","html_url":"https://github.com/leogaudin/dslr","commit_stats":null,"previous_names":["leogaudin/dslr"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/leogaudin%2Fdslr","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/leogaudin%2Fdslr/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/leogaudin%2Fdslr/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/leogaudin%2Fdslr/manifests","owner_url":"https://
repos.ecosyste.ms/api/v1/hosts/GitHub/owners/leogaudin","download_url":"https://codeload.github.com/leogaudin/dslr/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247913586,"owners_count":21017170,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["42","classifier-training","data-science","dslr","gradient-descent","logistic-regression","python"],"created_at":"2024-12-22T09:49:41.219Z","updated_at":"2025-04-08T19:38:39.006Z","avatar_url":"https://github.com/leogaudin.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ch1 align='center'\u003e🎩 dslr\u003c/h1\u003e\n\n\u003e ⚠️ This guide is written assuming that you have done `ft_linear_regression` (if not, do it), and does not go in depth on things seen in the `Python Piscine for Data Science`, like `pandas`, `numpy`, `matplotlib`, etc.\n\n## Table of Contents\n\n- [Introduction](#introduction) 👋\n- [Decrypting the subject](#decrypting-the-subject) 🔍\n\t- [Logistic Regression](#logistic-regression) 📈\n\t- [Multi-classifier](#multi-classifier) 🔢\n\t- [One-vs-all](#one-vs-all) 🏁\n\t- [Mathematics](#mathematics) 🧮\n\t\t- [Sigmoid function](#sigmoid-function) 📈\n\t\t- [Hypothesis function](#hypothesis-function) 🤔\n\t\t- [Cost function](#cost-function) 💰\n\t\t- [Gradient descent](#gradient-descent) 📉\n- [Resources](#resources) 📖\n\n## Introduction\n\nWith `dslr`, we take a step further into the world of data science.\n\nWe are given a dataset of students at Hogwarts, with their grades in various subjects, and their 
house.\n\nThe final goal is to predict the house of a student based on similar data.\n\nThe subject is divided into the following parts:\n\n1. **Data Analysis**\n2. **Data Visualization**\n\t1. Histogram\n\t2. Scatter plot\n\t3. Pair plot\n3. **Logistic Regression**\n\nParts 1 and 2 honestly do not require a tutorial. The first part is just reading the data and doing some basic statistics, and the second part is just plotting the data into Matplotlib (see appendix in the subject)\n\nThe third part is where the real work is done, and where this guide will focus.\n\n## Decrypting the subject\n\nLet's first recap all the cryptic terms used in the subject:\n\n- Logistic Regression\n- Multi-classifier\n- One-vs-all\n\nAnd worst of all, appendix VIII.1, *Mathematics*:\n\n\u003cp align='center'\u003e\n\t\u003cimg src='./assets/appendix.webp' alt='Appendix' width='auto' /\u003e\n\u003c/p\u003e\n\n\u003e *What is this supposed to mean?*\n\n### Logistic Regression\n\nLogistic Regression is a classification algorithm used to tell if an object is part of a class or not.\n\nUnlike linear regression which takes a scalar input and gives a scalar output (e.g. `price = mileage * weight`), logistic regression gives a probability of the input being part of a class (e.g. 
`[0, 1] = input * weight`).\n\n\u003cp align='center'\u003e\n\t\u003cimg src='./assets/sigmoid_example.webp' alt='Sigmoid Example' width='auto' /\u003e\n\u003c/p\u003e\n\n\u003e *Here, the sigmoid function gives the probability of a student passing their exam based on the number of hours they studied.*\n\n### Multi-classifier\n\nMulti-classification is simply when you have more than 2 classes to classify.\n\nInstead of having a binary class like \"passing the exam\" or \"not passing\", you have multiple classes like `Gryffindor`, `Hufflepuff`, `Ravenclaw`, or `Slytherin`.\n\nThat is where one-vs-all comes in.\n\n### One-vs-all\n\nThe one-vs-all strategy is a way to apply the logic we just described to the problem of having multiple classes.\n\nLet's take the example from [this lecture](https://www.cs.rice.edu/~as143/COMP642_Spring22/Scribes/Lect5):\n\n\u003e Suppose you have classes `A`, `B`, and `C`. We will build one model for each class:\n\u003e\n\u003e - Model 1: `A` or `BC`\n\u003e - Model 2: `B` or `AC`\n\u003e - Model 3: `C` or `AB`\n\u003e\n\u003e Another way to think about the models is each class vs everything else (hence the name):\n\u003e\n\u003e - Model 1: `A` or not `A`\n\u003e - Model 2: `B` or not `B`\n\u003e - Model 3: `C` or not `C`\n\nIn our case, we will have 4 models:\n\n- Model 1: `Gryffindor` or not `Gryffindor`\n- Model 2: `Hufflepuff` or not `Hufflepuff`\n- Model 3: `Ravenclaw` or not `Ravenclaw`\n- Model 4: `Slytherin` or not `Slytherin`\n\n### Mathematics\n\nNow let's dive into this appendix, starting with the last equation before the derivative:\n\n#### Sigmoid function\n\n$$\ng(z) = \\frac{1}{1 + e^{-z}}\n$$\n\nThis is the sigmoid function we talked about earlier, the blue line representing the probability of passing the exam.\n\n\u003cp align='center'\u003e\n\t\u003cimg src='./assets/sigmoid_curve.gif' alt='Sigmoid curve' width='auto' /\u003e\n\u003c/p\u003e\n\n\u003e *Here is how the sigmoid curve changes based on the value of `z`. 
The higher `z` is, the steeper the curve is, i.e. there is a threshold where the probability goes from 0 to 1 almost instantly.*\n\n#### Hypothesis function\n\n$$\nh_{\\theta}(x) = g(\\theta^T x)\n$$\n\n$h_{\\theta}(x)$ is the hypothesis we are making based on the input $x$ and the weights $\\theta$ passed to $g(z)$.\n\nWe just learned what this $g(z)$ is, but what about $\\theta^T x$?\n\nThe $T$ in $\\theta^T$ means \"transpose\", which is a matrix operation that turns a column vector into a row vector (and vice versa).\n\nIndeed, we have many parameters in our dataset, not a single one like in `ft_linear_regression`. We therefore need to \"group\" them in a vector.\n\n\u003e 💡 If you arrived here right after C, think of it as a 1-D array.\n\u003e\n\u003e If you are not familiar with vectors and matrices, consider doing the `matrix` project, which gives you a very useful foundation for `dslr`.\n\nAssuming that $\\theta$ and $x$ are both column vectors as follows:\n\n$$\n\\theta = \\begin{bmatrix}\nw_{\\text{param1}} \\\\\nw_{\\text{param2}} \\\\\nw_{\\text{param3}} \\\\\n\\ldots\n\\end{bmatrix}\n$$\n\n$$\nx = \\begin{bmatrix}\n\\text{param1} \\\\\n\\text{param2} \\\\\n\\text{param3} \\\\\n\\ldots\n\\end{bmatrix}\n$$\n\n\"Multiplying\" them as they are, with an element-wise product for example, would result in a third vector that would look like:\n\n$$\n\\theta x = \\begin{bmatrix}\nw_{\\text{param1}} \\times \\text{param1} \\\\\nw_{\\text{param2}} \\times \\text{param2} \\\\\nw_{\\text{param3}} \\times \\text{param3} \\\\\n\\ldots\n\\end{bmatrix}\n$$\n\nThe notation $\\theta^T x$ makes it clear that we are instead doing a dot product between the two vectors (a row vector times a column vector), which would look like:\n\n$$\n\\theta^T x = w_{\\text{param1}} \\times \\text{param1} + w_{\\text{param2}} \\times \\text{param2} + w_{\\text{param3}} \\times \\text{param3} + \\ldots\n$$\n\nThis operation gives us a scalar (i.e. 
a single value), which is what we want to pass to the sigmoid function.\n\n#### Cost function\n\nNow that we know what $g(z)$ and $h_{\\theta}(x)$ are, let's dive into the cost function, which this guide calls the **negative log-loss function** (more commonly known as log loss or binary cross-entropy).\n\n$$\nJ(\\theta) = -\\frac{1}{m} \\sum_{i=1}^{m} \\left[ y_i \\log(h_{\\theta}(x_i)) + (1 - y_i) \\log(1 - h_{\\theta}(x_i)) \\right]\n$$\n\nTo make it lighter on our eyes, let's write $h_{\\theta}(x_i)$ as $y_{\\text{predicted}}$, literally the value we predicted $y_i$ to be based on our hypothesis.\n\n$$\nJ(\\theta) = -\\frac{1}{m} \\sum_{i=1}^{m} \\left[ y_i \\log(y_{\\text{predicted}}) + (1 - y_i) \\log(1 - y_{\\text{predicted}}) \\right]\n$$\n\nNow, what are those two logarithms doing here?\n\n\u003e 💡 The $\\log$ used here is in base $e$, so it could also be written $\\ln$.\n\n$y_i$, the actual value, can only be 0 or 1 (true or false, Gryffindor or not Gryffindor).\n\nSo we only have two cases:\n\n- If $y_i = 0$, the first $\\log$ is eliminated, and we are left with $- \\log(1 - y_{\\text{predicted}})$\n- If $y_i = 1$, the second $\\log$ is eliminated, and we are left with $- \\log(y_{\\text{predicted}})$\n\n\u003e The $-$ sign in front of the $\\log$ did not appear magically: it is the one at the beginning of the equation, in the $- \\frac{1}{m}$ factor.\n\nWhat this cost function does is give **disproportionate importance** to large errors: the cost grows without bound as the prediction approaches the wrong extreme.\n\n\u003cp align='center'\u003e\n\t\u003cimg src='./assets/neg_log.webp' alt='Negative log-loss function' width='auto' /\u003e\n\u003c/p\u003e\n\n\u003e See how the closer we are to 1, the lower $- \\log(x)$ is, i.e. 
the less the error weighs, and as we get closer to 0, the higher it is.\n\u003e\n\u003e Example:\n\u003e\n\u003e - $y_i$ is 0, so the cost function is $- \\log(1 - y_{\\text{predicted}})$.\n\u003e - We make a first hypothesis $y_{\\text{predicted}} = 0.99$, very far from the actual value.\n\u003e - The first cost is $- \\log(1 - 0.99) = - \\log(0.01) \\approx 4.6$.\n\u003e - We make a second hypothesis $y_{\\text{predicted}} = 0.9$, a bit closer but still very far.\n\u003e - The second cost is $- \\log(1 - 0.9) = - \\log(0.1) \\approx 2.3$.\n\u003e\n\u003e The negative log-loss function makes the first hypothesis twice as bad as the second one, although they are only 0.09 apart.\n\n#### Gradient descent\n\nGreat, now we know how to determine how bad our hypothesis is.\n\nIf we were in `ft_linear_regression`, we would simply need to differentiate it to know in what direction and by how much we should correct our weights, to make our hypothesis less lame.\n\nBut here, we have multiple weights and parameters, so the problem is a bit more complex.\n\nYes, we know how to calculate a global sum of all the errors we are making, but how do we know **how to individually correct each parameter**?\n\nThe subject says:\n\n\u003e The loss function gives us the following partial derivative:\n\u003e\n\u003e $$\n\u003e \\frac{\\partial J(\\theta)}{\\partial \\theta_j} = \\frac{1}{m} \\sum_{i=1}^{m} (h_{\\theta}(x_i) - y_i) x_{ij}\n\u003e $$\n\u003e\n\u003e This is the derivative of the cost function with respect to the $j$-th parameter of $\\theta$.\n\nLet's say we are in the middle of training our model.\n\nWe have the following parameters/weights:\n\n$$\n\\theta = \\begin{bmatrix}\n0.5 \\\\\n0.3 \\\\\n0.2 \\\\\n\\end{bmatrix}\n$$\n\nThe following input:\n\n$$\nx = \\begin{bmatrix}\n1 \\\\\n2 \\\\\n3 \\\\\n\\end{bmatrix}\n$$\n\nThe following output (actual value):\n\n$$\ny = 1\n$$\n\nAnd the following hypothesis:\n\n$$\ny_{\\text{predicted}} = 0.7\n$$\n\nWe can calculate the derivative of the 
cost function with respect to each parameter, which for the first, second and third elements of $\\theta$ and $x$ would give, respectively:\n\n$$\n(0.7 - 1) \\times 1 = -0.3\n$$\n\n$$\n(0.7 - 1) \\times 2 = -0.6\n$$\n\n$$\n(0.7 - 1) \\times 3 = -0.9\n$$\n\n\u003e ⚠️ This assumes $m = 1$ for simplicity, i.e. we only have one observation in our dataset.\n\nWe would then update our parameters by subtracting each derivative from the corresponding weight:\n\n$$\n\\theta = \\begin{bmatrix}\n0.5 - (-0.3) \\\\\n0.3 - (-0.6) \\\\\n0.2 - (-0.9) \\\\\n\\end{bmatrix}\n= \\begin{bmatrix}\n0.8 \\\\\n0.9 \\\\\n1.1 \\\\\n\\end{bmatrix}\n$$\n\nSince the actual value is 1 and we predicted 0.7, the weights increase, which pushes the next prediction closer to 1.\n\n\u003e ⚠️ We also omitted the learning rate $\\alpha$ for simplicity, but it should be there in your implementation.\n\n## Resources\n\n- [📺 YouTube − Multiclass - One-vs-rest classification](https://www.youtube.com/watch?v=EYXSve6T5BU)\n- [📺 YouTube − Logistic Regression Machine Learning Example](https://www.youtube.com/watch?v=U1omz0B9FTw)\n- [📺 YouTube − Logistic Regression Cost Function](https://www.youtube.com/watch?v=ar8mUO3d05w)\n- [📺 YouTube − Derivative of Cost function for Logistic Regression](https://www.youtube.com/watch?v=0VMK18nphpg)\n- [📺 YouTube − Logistic Regression in Python from Scratch](https://www.youtube.com/watch?v=nzNp05AyBM8)\n- [📖 Rice University − Multi-Class Classification: One-vs-All](https://www.cs.rice.edu/~as143/COMP642_Spring22/Scribes/Lect5)\n- [💬 Stack Exchange − Theta * X vs Sum_j=1(Theta_j * x_j)](https://math.stackexchange.com/questions/3485981/thetatx-vs-sum-j-1n-theta-j-x-j)\n- [💬 Stack Exchange − Theta transposes to x](https://math.stackexchange.com/questions/60212/theta-transposes-to-x)\n- [📖 Wikipedia − Dot product](https://en.wikipedia.org/wiki/Dot_product)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fleogaudin%2Fdslr","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fleogaudin%2Fdslr","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fleogaudin%2Fdslr/lists"}