https://github.com/AndriySkol/MLLinearModels



# MLLinearModels

MLLinearModels is a small library that provides functionality to train and use linear regression models such as "Ordinary Least Squares", "Ridge", "Lasso", and "Elastic Net".
This library consists of 4 main classes:
* `RidgeModel` represents "Ridge" and "Ordinary least squares" models.
* `RidgeModelCV` represents a cross-validation procedure for "Ridge"/"OLS" models.
* `ElasticNetModel` represents "Lasso" and "Elastic Net" models.
* `ElasticNetModelCV` represents a cross-validation procedure for "Lasso" and "Elastic Net" models.

Both `RidgeModel` and `ElasticNetModel` provide:
* `fit: X to: y checkInput: check` - trains the model; accepts X as a `PMMatrix` and y as a `PMVector`, while `checkInput` specifies whether the given data should be preprocessed.
* `predict: X` - returns a vector of predictions, one for each row of the matrix.
* `score: X output: y` - evaluates the R2 coefficient of the predictions, where y is the vector of true values.
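
For reference, the R2 coefficient is conventionally defined as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on plain arrays with made-up values. It assumes that `score:output:` follows this conventional definition, which this README does not spell out, and all variable names are illustrative.

```smalltalk
| yTrue yPred mean ssRes ssTot r2 |
"Made-up values for illustration only."
yTrue := #(3 5 7 9).
yPred := #(2.5 5.5 6.5 9.5).

"Mean of the true values."
mean := (yTrue inject: 0 into: [ :sum :each | sum + each ]) / yTrue size.

"Residual sum of squares: squared differences between truth and prediction."
ssRes := 0.
yTrue with: yPred do: [ :t :p | ssRes := ssRes + (t - p) squared ].

"Total sum of squares: squared differences between truth and its mean."
ssTot := yTrue inject: 0 into: [ :sum :each | sum + (each - mean) squared ].

"A value close to 1 means the predictions explain most of the variance."
r2 := 1 - (ssRes / ssTot).
```

With these values the residual sum of squares is 1 and the total sum of squares is 20, so `r2` comes out at roughly 0.95.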

### Installation
This library requires the PolyMath project to be installed: https://github.com/PolyMathOrg/PolyMath.

In addition, the DataFrame library is highly recommended (though not required) for manipulating data: https://github.com/PolyMathOrg/DataFrame.

Afterwards, the library can be loaded from its git repository using Iceberg.

### Loading data for the tutorial
We will load the housing data and split it into train and test sets:
```smalltalk
df := DataFrame loadHousing.
df addColumn: ((1 to: df size) collect:[:i | 100 random > 85]) named: #isTest.

trainX := (df selectAllWhere: [:isTest | isTest not ]) columnsFrom: 1 to: 3.
trainY := (df selectAllWhere: [:isTest | isTest not ]) columnAt: 4.

testX := (df selectAllWhere: [:isTest | isTest ]) columnsFrom: 1 to: 3.
testY := (df selectAllWhere: [:isTest | isTest ]) columnAt: 4.
```
To interact with the library, though, we need to convert the DataFrame data into the `PMMatrix` and `PMVector` classes from PolyMath.
```smalltalk
trainXMatrix := PMMatrix rows: trainX asArrayOfRows .
trainYVec := trainY asPMVector .
testXMatrix := PMMatrix rows: testX asArrayOfRows .
testYVec := testY asPMVector.
```
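
Because DataFrame is optional, the same two calls can presumably be used to build inputs directly from literal arrays. The following is a minimal sketch with toy values; the names `toyX` and `toyY` are hypothetical and are not part of the tutorial data.

```smalltalk
"Toy data for illustration: four rows with two features each."
toyX := PMMatrix rows: #( #(1 2) #(2 1) #(3 4) #(4 3) ).

"One target value per row; here each target equals x1 + 2 * x2."
toyY := #(5 4 11 10) asPMVector.
```

Objects built this way have the same classes as `trainXMatrix` and `trainYVec`, so they should be usable wherever those are used.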
### Using RidgeModel
```smalltalk
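"alpha = 0 disables the penalty, which gives ordinary least squares."
olsModel :=
RidgeModel new alpha: 0;
shouldCenter: true;
shouldNormalize: true.

olsModel fit: trainXMatrix to: trainYVec checkInput: true.
"R2 coefficient on the held-out test set."
r2Coefficient := olsModel score: testXMatrix output: testYVec.
"Mean squared error: sum of squared residuals divided by the number of test samples."
mseError := (((olsModel predict: testXMatrix) - testYVec) inject: 0 into: [ :a :b | a + b squared ]) / testYVec size.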
```
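
For context, ridge regression is usually formulated as the minimization below, where $w$ is the coefficient vector and $\alpha$ is the penalty weight. This is the textbook form; the exact scaling used inside `RidgeModel` is not stated in this README, so treat it as an assumption.

$$
\min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
$$

With $\alpha = 0$ the penalty term vanishes and the problem reduces to ordinary least squares, which is why the example sets `alpha: 0`. The error computed at the end of the example is the usual mean squared error over the $n$ test samples:

$$
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2
$$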

### Using ElasticNetModel
`tol` is the parameter that specifies the accuracy of the solution.
```smalltalk
lasso :=
ElasticNetModel new
shouldCenter: true;
shouldNormalize: true;
l1Ratio: 1;
alpha: 6.36;
tol: 1e-3.

lasso fit: trainXMatrix to: trainYVec checkInput: true.
lasso score: testXMatrix output: testYVec.
```
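
The parameter names `alpha` and `l1Ratio` mirror those of scikit-learn's elastic net. Assuming the library follows the same convention (this README does not confirm it), the objective being minimized would be the following, with $\rho$ standing for `l1Ratio` and $n$ for the number of training rows:

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
$$

Under that assumption, $\rho = 1$ leaves only the L1 penalty, which is the Lasso model configured in the example, while values between 0 and 1 mix the L1 and L2 penalties.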
### Using RidgeModelCV
This class requires an array of alpha values to choose from.

`nFolds` - the number of groups used to perform k-fold cross-validation.

If `nFolds` is `nil` or 1, an efficient leave-one-out cross-validation is performed instead.

As a result of training, this model will contain:
* `model` property - the best estimated ridge model;
* `mses` property - the evaluated MSE for each alpha;
* `minAlpha` - the best alpha;
* `minMse` - the smallest error, which corresponds to minAlpha.
```smalltalk
ridgeCV := RidgeCVModel new
shouldCenter: true;
shouldNormalize: true;
alphas: {1e-3 . 5e-3 . 1e-2 . 3e-2 . 5e-2 . 7e-2 . 1e-1 . 3e-1 . 5e-1. 1 . 5 . 10 . 20}.

ridgeCV fit: trainXMatrix to: trainYVec checkInput: true.
ridgeCV model score: testXMatrix output: testYVec.
```
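
To illustrate what `nFolds` controls, the sketch below partitions sample indices into folds using plain Pharo collections. It shows the general idea of k-fold cross-validation only; how the library actually assigns rows to folds is not documented here, and all names in the snippet are hypothetical.

```smalltalk
| nSamples nFolds foldOf testIndices trainIndices |
nSamples := 10.
nFolds := 3.

"Assign each sample index to a fold in round-robin order."
foldOf := (1 to: nSamples) collect: [ :i | (i - 1) \\ nFolds + 1 ].

"For fold 1: its samples are held out, all others are used for training."
testIndices := (1 to: nSamples) select: [ :i | (foldOf at: i) = 1 ].
trainIndices := (1 to: nSamples) reject: [ :i | (foldOf at: i) = 1 ].
```

In k-fold cross-validation this split is repeated once per fold, and each candidate alpha is judged by its error on the held-out samples across all folds.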

### Using ElasticNetModelCV
This class requires an array of l1Ratio values to choose from.

If an array of alphas is not passed, the alphas will be autogenerated (though the generated grid does not work well when l1Ratio is small).

In that case, `epsilon` specifies the difference between the maximum and minimum alpha generated for each l1Ratio.

`nAlphas` - the number of alphas in the range (minAlpha, maxAlpha).

`nFolds` - the number of groups used to perform k-fold cross-validation.

As a result of training, this model will contain:
* `model` property - the best estimated elastic net model;
* `mses` property - the evaluated MSE for each point of the l1Ratio/alpha grid;
* `minAlpha` - the best alpha;
* `minL1Ratio` - the best l1Ratio;
* `minMse` - the smallest error, which corresponds to minAlpha and minL1Ratio.
```smalltalk
elasticNetCV := ElasticNetCVModel new
shouldCenter: true;
shouldNormalize: true;
l1Ratios: { 0.1 . 0.2 . 0.3 . 0.4 . 0.5 . 0.6 . 0.7 . 0.8 . 0.9 . 0.99 . 1 };
alphas: { 1e-3 . 5e-3 . 1e-2 . 3e-2 . 5e-2 . 7e-2 . 1e-1 . 3e-1 . 5e-1 . 1 . 5 . 10 . 20 };
nFolds: 10.

"No alphas given: the alpha grid is generated from nAlphas and epsilon."
elasticNetCVAutoAlpha := ElasticNetCVModel new
shouldCenter: true;
shouldNormalize: true;
l1Ratios: { 0.1 . 0.2 . 0.3 . 0.4 . 0.5 . 0.6 . 0.7 . 0.8 . 0.9 . 0.99 . 1 };
nAlphas: 100;
epsilon: 1e-3;
nFolds: 10.

elasticNetCV fit: trainXMatrix to: trainYVec checkInput: true.
elasticNetCV model score: testXMatrix output: testYVec.
```
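
The variant with the autogenerated alpha grid can presumably be trained and inspected in the same way. In the sketch below, the selectors `minAlpha`, `minL1Ratio` and `minMse` are assumed accessors named after the properties listed for this class; check the class for the exact selectors before relying on them.

```smalltalk
"Fit the variant that generates its own alpha grid."
elasticNetCVAutoAlpha fit: trainXMatrix to: trainYVec checkInput: true.
elasticNetCVAutoAlpha model score: testXMatrix output: testYVec.

"Assumed accessors, named after the documented properties."
bestAlpha := elasticNetCVAutoAlpha minAlpha.
bestL1Ratio := elasticNetCVAutoAlpha minL1Ratio.
bestMse := elasticNetCVAutoAlpha minMse.
```

Comparing `bestMse` between the explicit and the autogenerated grid is one way to see whether the generated alphas suit the chosen l1Ratio values.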