https://github.com/visweswaran1998/sklearn
Trying to implement Scikit Learn for Python in C++ (Single Headers and No dependencies)
- Host: GitHub
- URL: https://github.com/visweswaran1998/sklearn
- Owner: VISWESWARAN1998
- License: mit
- Created: 2019-08-26T09:47:44.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2020-06-09T13:23:13.000Z (about 6 years ago)
- Last Synced: 2025-10-14T05:17:01.095Z (9 months ago)
- Topics: machine-learning
- Language: C++
- Homepage:
- Size: 340 KB
- Stars: 48
- Watchers: 10
- Forks: 19
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE
README
# sklearn
Trying to implement Scikit Learn for Python in C++
#### PREPROCESSING:
1. [Standardization](https://github.com/VISWESWARAN1998/sklearn#standardization)
2. [Normalization](https://github.com/VISWESWARAN1998/sklearn#normalization)
3. [Label Encoding](https://github.com/VISWESWARAN1998/sklearn#label-encoding)
4. [Label Binarization](https://github.com/VISWESWARAN1998/sklearn#label-binarization)
#### REGRESSION:
1. [Least Squares Regression](https://github.com/VISWESWARAN1998/sklearn#least-squares-regressionsimple-linear-regression)
2. [Multiple Linear Regression](https://github.com/VISWESWARAN1998/sklearn#multiple-linear-regression)
#### CLASSIFICATION:
1. [Gaussian Naive Bayes](https://github.com/VISWESWARAN1998/sklearn#classification---gaussian-naive-bayes)
2. [Logistic Regression](https://github.com/VISWESWARAN1998/sklearn#logistic-regression)
#### STANDARDIZATION
**SOURCE NEEDED:** preprocessing.h, preprocessing.cpp and statx.h
StandardScaler will standardize features by removing the mean and scaling to unit variance. _ref:_ [Scikit Learn docs](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
```c++
// SWAMI KARUPPASWAMI THUNNAI
#include <iostream>
#include <vector>
#include "preprocessing.h"

int main()
{
    StandardScaler scaler({0, 0, 1, 1});
    std::vector<double> scaled = scaler.scale();
    // Print each scaled value next to its inverse-scaled (original) value
    for (double i : scaled)
    {
        std::cout << i << " " << scaler.inverse_scale(i) << "\n";
    }
}
```
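To make the arithmetic behind standardization concrete, here is a minimal standalone sketch. It does not use `preprocessing.h`; the name `standardize_sketch` is hypothetical, and the choice of the population standard deviation (dividing by n) follows scikit-learn's `StandardScaler` and is only an assumption about what this library does.

```c++
#include <cmath>
#include <vector>

// Hypothetical helper for illustration only, not part of preprocessing.h.
// Returns (x - mean) / stddev for each value, using the population
// standard deviation (divide by n), which is an assumption.
std::vector<double> standardize_sketch(const std::vector<double>& values)
{
    if (values.empty()) return {};

    double mean = 0.0;
    for (double v : values) mean += v;
    mean /= values.size();

    double variance = 0.0;
    for (double v : values) variance += (v - mean) * (v - mean);
    variance /= values.size();
    const double stddev = std::sqrt(variance);

    std::vector<double> scaled;
    scaled.reserve(values.size());
    for (double v : values)
    {
        // Constant input has stddev 0; map it to 0 instead of dividing by zero.
        scaled.push_back(stddev > 0.0 ? (v - mean) / stddev : 0.0);
    }
    return scaled;
}
```

For the input `{0, 0, 1, 1}` the mean is 0.5 and the population standard deviation is 0.5, so this sketch returns `-1, -1, 1, 1`.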
#### NORMALIZATION:
**SOURCE NEEDED:** preprocessing.h, preprocessing.cpp and statx.h
```c++
// SWAMI KARUPPASWAMI THUNNAI
#include <iostream>
#include <vector>
#include "preprocessing.h"

int main()
{
    std::vector<double> normalized_vec = preprocessing::normalize({ 800, 10, 12, 78, 56, 49, 7, 1200, 1500 });
    for (double i : normalized_vec) std::cout << i << " ";
}
```
#### LABEL ENCODING:
**SOURCE NEEDED:** preprocessing.h and preprocessing.cpp
Label encoding is the process of encoding categorical data as numerical data. For example, if a column in the dataset contains country values like GERMANY, FRANCE and ITALY, the label encoder converts them into numbers like this:
country - categorical |country - numerical
-------------------|-------------------
GERMANY | 1
FRANCE | 0
ITALY | 2
_Example code:_
```c++
// SWAMI KARUPPASWAMI THUNNAI
#include <iostream>
#include <string>
#include <vector>
#include "preprocessing.h"

int main()
{
    std::vector<std::string> categorical_data = { "GERMANY", "FRANCE", "ITALY" };
    LabelEncoder encoder(categorical_data);
    // fit_transorm is spelled this way in the original example
    auto numerical_data = encoder.fit_transorm();
    for (std::size_t i = 0; i < categorical_data.size(); i++)
    {
        std::cout << categorical_data[i] << " - " << numerical_data[i] << "\n";
    }
}
```
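The table above is consistent with assigning codes in alphabetical order (FRANCE = 0, GERMANY = 1, ITALY = 2). The following standalone sketch shows one way to get that mapping. The function name `encode_labels_sketch` is hypothetical, and sorting the labels is an assumption inferred from the table, not a statement about how `LabelEncoder` is implemented.

```c++
#include <map>
#include <string>
#include <vector>

// Hypothetical helper for illustration only, not part of preprocessing.h.
// Assigns each distinct label an integer code in sorted order, then
// returns the code for every input label.
std::vector<unsigned long> encode_labels_sketch(const std::vector<std::string>& labels)
{
    // std::map keeps its keys sorted, so iteration order is alphabetical.
    std::map<std::string, unsigned long> codes;
    for (const std::string& label : labels) codes[label] = 0;

    unsigned long next = 0;
    for (auto& entry : codes) entry.second = next++;

    std::vector<unsigned long> encoded;
    encoded.reserve(labels.size());
    for (const std::string& label : labels) encoded.push_back(codes.at(label));
    return encoded;
}
```

Called with `{ "GERMANY", "FRANCE", "ITALY" }`, this sketch returns `1, 0, 2`, matching the table.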
#### Label Binarization:
```c++
// SWAMI KARUPPASWAMI THUNNAI
#include <iostream>
#include <string>
#include <vector>
#include "preprocessing.h"

int main()
{
    std::vector<std::string> ip_addresses = { "A", "B", "A", "B", "C" };
    LabelBinarizer binarize(ip_addresses);
    std::vector<std::vector<unsigned long int>> result = binarize.fit();
    for (const std::vector<unsigned long int>& i : result)
    {
        for (unsigned long int j : i) std::cout << j << " ";
        std::cout << "\n";
    }
    // Predict for a label that was not in the fitted data
    std::cout << "Prediction:\n-------------\n";
    std::string test = "D";
    std::vector<unsigned long int> prediction = binarize.predict(test);
    for (unsigned long int i : prediction) std::cout << i << " ";
}
```
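Label binarization is commonly done as one-hot encoding: one column per distinct label, with a 1 in the column of the matching label. The standalone sketch below illustrates that idea. `OneHotSketch` is a hypothetical type, and both the column order (sorted labels) and the all-zero row for an unseen label such as "D" are assumptions, not documented behaviour of `LabelBinarizer`.

```c++
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical type for illustration only, not part of preprocessing.h.
struct OneHotSketch
{
    std::vector<std::string> classes; // sorted, distinct labels

    explicit OneHotSketch(std::vector<std::string> labels) : classes(std::move(labels))
    {
        std::sort(classes.begin(), classes.end());
        classes.erase(std::unique(classes.begin(), classes.end()), classes.end());
    }

    // One column per known class. A label that was never seen gets all
    // zeros (an assumption made for this sketch).
    std::vector<unsigned long> encode(const std::string& label) const
    {
        std::vector<unsigned long> row(classes.size(), 0);
        auto it = std::lower_bound(classes.begin(), classes.end(), label);
        if (it != classes.end() && *it == label)
        {
            row[static_cast<std::size_t>(it - classes.begin())] = 1;
        }
        return row;
    }
};
```

With the labels `A, B, A, B, C`, this sketch encodes "A" as `1 0 0`, "B" as `0 1 0`, "C" as `0 0 1`, and the unseen "D" as `0 0 0`.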
#### LEAST SQUARES REGRESSION(SIMPLE LINEAR REGRESSION)
**HEADERS NEEDED:** lsr.h and lsr.cpp
_Creating new model and saving it:_
**DATASET:**
X|y
-|--
2|4
3|5
5|7
7|10
9|15
```c++
// SWAMI KARUPPASWAMI THUNNAI
#include <iostream>
#include "lsr.h"

int main()
{
    // Arguments: X values, y values, debug flag (DEBUG prints debug messages)
    simple_linear_regression slr({2, 3, 5, 7, 9}, {4, 5, 7, 10, 15}, DEBUG);
    slr.fit();
    std::cout << slr.predict(8);
    slr.save_model("model.txt");
}
```
Loading existing model
```c++
// SWAMI KARUPPASWAMI THUNNAI
#include <iostream>
#include "lsr.h"

int main()
{
    // Load a previously saved model; no call to fit() is shown here
    simple_linear_regression slr("model.txt");
    std::cout << slr.predict(8);
}
```
**SAMPLE PREDICTION PLOTTED:**

#### Multiple Linear Regression:
Training and saving the model
```c++
// SWAMI KARUPPASWAMI THUNNAI
#include <iostream>
#include "mlr.h"

int main()
{
    // Arguments: feature rows, target values, debug flag
    LinearRegression mlr({ {110, 40}, {120, 30}, {100, 20}, {90, 0}, {80, 10} }, {100, 90, 80, 70, 60}, NODEBUG);
    mlr.fit();
    std::cout << mlr.predict({ 110, 40 });
    mlr.save_model("model.json");
}
```
Loading the saved model
```c++
// SWAMI KARUPPASWAMI THUNNAI
#include <iostream>
#include "mlr.h"

int main()
{
    // Don't use fit method here
    LinearRegression mlr("model.json");
    std::cout << mlr.predict({ 110, 40 });
}
```
#### Classification - Gaussian Naive Bayes
Classifying male or female using height, weight and foot size, and saving the model.
**HEADERS / SOURCE NEEDED:** naive_bayes.h, naive_bayes.cpp, json.h
```c++
// SWAMI KARUPPASWAMI THUNNAI
#include <iostream>
#include "naive_bayes.h"

int main()
{
    // Rows are {height, weight, foot size}; labels: 0 = male, 1 = female
    gaussian_naive_bayes nb({ {6, 180, 12}, {5.92, 190, 11}, {5.58, 170, 12}, {5.92, 165, 10}, {5, 100, 6}, {5.5, 150, 8}, {5.42, 130, 7}, {5.75, 150, 9} }, { 0, 0, 0, 0, 1, 1, 1, 1 }, DEBUG);
    nb.fit();
    nb.save_model("model.json");
    // predict() returns per-class probabilities, indexed by class label
    auto probabilities = nb.predict({ 6, 130, 8 });
    double male = probabilities[0];
    double female = probabilities[1];
    if (male > female) std::cout << "MALE";
    else std::cout << "FEMALE";
}
```
_Loading a saved model:_
```c++
// SWAMI KARUPPASWAMI THUNNAI
#include <iostream>
#include "naive_bayes.h"

int main()
{
    gaussian_naive_bayes nb(NODEBUG);
    nb.load_model("model.json");
    // predict() returns per-class probabilities, indexed by class label
    auto probabilities = nb.predict({ 6, 130, 8 });
    double male = probabilities[0];
    double female = probabilities[1];
    if (male > female) std::cout << "MALE";
    else std::cout << "FEMALE";
}
```
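As a rough illustration of what a Gaussian Naive Bayes classifier computes, here is a standalone sketch. `gaussian_nb_sketch` is a hypothetical function. It assumes a class prior equal to the class frequency, the sample variance (divide by n - 1) per feature, and probabilities normalized to sum to 1; `naive_bayes.cpp` may differ on any of these points.

```c++
#include <cmath>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical helper for illustration only, not the class from naive_bayes.h.
// Returns normalized class probabilities for one sample.
// Assumes every class has at least two training rows.
std::map<int, double> gaussian_nb_sketch(const std::vector<std::vector<double>>& X,
                                         const std::vector<int>& y,
                                         const std::vector<double>& sample)
{
    const double pi = 3.14159265358979323846;

    // Group training row indices by class label.
    std::map<int, std::vector<std::size_t>> rows_by_class;
    for (std::size_t i = 0; i < y.size(); ++i) rows_by_class[y[i]].push_back(i);

    std::map<int, double> scores;
    double total = 0.0;
    for (const auto& entry : rows_by_class)
    {
        const std::vector<std::size_t>& rows = entry.second;
        // Prior: fraction of training rows that belong to this class.
        double score = static_cast<double>(rows.size()) / y.size();
        for (std::size_t f = 0; f < sample.size(); ++f)
        {
            double mean = 0.0;
            for (std::size_t r : rows) mean += X[r][f];
            mean /= rows.size();

            double var = 0.0;
            for (std::size_t r : rows) var += (X[r][f] - mean) * (X[r][f] - mean);
            var /= (rows.size() - 1); // sample variance (an assumption)

            // Multiply by the Gaussian likelihood of this feature value.
            const double diff = sample[f] - mean;
            score *= std::exp(-diff * diff / (2.0 * var)) / std::sqrt(2.0 * pi * var);
        }
        scores[entry.first] = score;
        total += score;
    }

    // Normalize so the class probabilities sum to 1.
    if (total > 0.0)
    {
        for (auto& entry : scores) entry.second /= total;
    }
    return scores;
}
```

On the eight training rows used above, this sketch assigns the sample `{ 6, 130, 8 }` a much higher probability for class 1 (female) than for class 0 (male).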
#### Logistic Regression:
Do not be confused by the word "regression" in logistic regression: it is generally used for classification problems. The heart of logistic regression is the sigmoid activation function. An activation function takes any input value and outputs a value within a certain range. In our case (sigmoid), the output lies between 0 and 1.
In the image, you can see the output (y) of the sigmoid activation function for -3 <= x <= 3.

The idea behind logistic regression is to take the output of linear regression, i.e., y = mx + c, and apply the logistic function 1/(1+e^-y), which outputs a value between 0 and 1. This makes it a binary classifier: for example, it can be used to predict whether a person is male or female from certain parameters.
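The two formulas in this paragraph can be written down directly. The sketch below is standalone; `sigmoid_sketch` and `logistic_output_sketch` are hypothetical names and are not taken from `logistic_regression.h`.

```c++
#include <cmath>

// Hypothetical helpers for illustration only.

// Logistic (sigmoid) function: maps any real y into the open interval (0, 1).
double sigmoid_sketch(double y)
{
    return 1.0 / (1.0 + std::exp(-y));
}

// Linear output y = m * x + c passed through the sigmoid.
double logistic_output_sketch(double m, double c, double x)
{
    return sigmoid_sketch(m * x + c);
}
```

`sigmoid_sketch(0)` is exactly 0.5, while `sigmoid_sketch(3)` is about 0.95 and `sigmoid_sketch(-3)` is about 0.05, which is why 0.5 is the usual threshold between the two classes.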
But we can also use logistic regression to classify multi-class problems with some modifications. Here, we use the one-vs-rest principle: one binary model is trained per class. For example, if there are 10 classes, 10 models are trained, each one relabelling its own class as 1 and all the other classes as 0, so that it predicts the probability of that class. If you don't understand, here is a detailed explanation: [https://prakhartechviz.blogspot.com/2019/02/multi-label-classification-python.html](https://prakhartechviz.blogspot.com/2019/02/multi-label-classification-python.html)
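The relabelling step of one vs rest is small enough to sketch on its own. `one_vs_rest_labels_sketch` is a hypothetical helper, shown only to illustrate how the labels change for each per-class model; it is not taken from `logistic_regression.h`.

```c++
#include <vector>

// Hypothetical helper for illustration only.
// Builds the binary target used to train the model for one class:
// 1 where the label equals positive_class, 0 everywhere else.
std::vector<int> one_vs_rest_labels_sketch(const std::vector<int>& labels, int positive_class)
{
    std::vector<int> binary;
    binary.reserve(labels.size());
    for (int label : labels)
    {
        binary.push_back(label == positive_class ? 1 : 0);
    }
    return binary;
}
```

For labels `{ 0, 1, 2, 1 }`, the target for class 1 is `{ 0, 1, 0, 1 }` and the target for class 2 is `{ 0, 0, 1, 0 }`. At prediction time, the class whose model reports the highest probability is chosen.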
We are going to take a simple classification problem to classify whether it is a male or female.
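We classify male or female using height, weight and foot size, and then save the model. Here is our dataset (the values used in the code below, where class 0 is male and class 1 is female):
height|weight|foot size|class
------|------|---------|-----
6|180|12|0
5.92|190|11|0
5.58|170|12|0
5.92|165|10|0
5|100|6|1
5.5|150|8|1
5.42|130|7|1
5.75|150|9|1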
All we have to do is to predict whether the person is male or female using height, weight and foot size.
```c++
// SWAMI KARUPPASWAMI THUNNAI
#include <iostream>
#include "logistic_regression.h"

int main()
{
    // Rows are {height, weight, foot size}; labels: 0 = male, 1 = female
    logistic_regression lg({ { 6, 180, 12 },{ 5.92, 190, 11 },{ 5.58, 170, 12 },
        { 5.92, 165, 10 },{ 5, 100, 6 },{ 5.5, 150, 8 },{ 5.42, 130, 7 },{ 5.75, 150, 9 } },
        { 0, 0, 0, 0, 1, 1, 1, 1 }, NODEBUG);
    lg.fit();
    // Save the model
    lg.save_model("model.json");
    // predict() returns per-class probabilities, indexed by class label
    auto probabilities = lg.predict({ 6, 130, 8 });
    double male = probabilities[0];
    double female = probabilities[1];
    if (male > female) std::cout << "MALE";
    else std::cout << "FEMALE";
}
```
and loading a saved model:
```c++
// SWAMI KARUPPASWAMI THUNNAI
#include <iostream>
#include "logistic_regression.h"

int main()
{
    // Load a previously saved model; no call to fit() is shown here
    logistic_regression lg("model.json");
    auto probabilities = lg.predict({ 6, 130, 8 });
    double male = probabilities[0];
    double female = probabilities[1];
    if (male > female) std::cout << "MALE";
    else std::cout << "FEMALE";
}
```