# GLMNet

[![Build Status](https://travis-ci.org/JuliaStats/GLMNet.jl.svg?branch=v0.0.4)](https://travis-ci.org/JuliaStats/GLMNet.jl)
[![Coverage Status](https://coveralls.io/repos/github/JuliaStats/GLMNet.jl/badge.svg)](https://coveralls.io/github/JuliaStats/GLMNet.jl)

[glmnet](http://www.jstatsoft.org/v33/i01/) is an R package by Jerome Friedman, Trevor Hastie, and Rob Tibshirani that fits entire Lasso or ElasticNet regularization paths for linear, logistic, multinomial, and Cox models using cyclic coordinate descent. This Julia package wraps the Fortran code from glmnet.

## Quick start

To fit a basic regression model:

```julia
julia> using GLMNet

julia> y = collect(1:100) + randn(100)*10;

julia> X = [1:100 (1:100)+randn(100)*5 (1:100)+randn(100)*10 (1:100)+randn(100)*20];

julia> path = glmnet(X, y)
Least Squares GLMNet Solution Path (86 solutions for 4 predictors in 930 passes):
──────────────────────────────
      df   pct_dev         λ
──────────────────────────────
 [1]   0  0.0       30.0573
 [2]   1  0.152922  27.3871
 [3]   1  0.279881  24.9541
  ⋮
[84]   4  0.90719    0.0133172
[85]   4  0.9072     0.0121342
[86]   4  0.907209   0.0110562
──────────────────────────────
```

`path` represents the Lasso or ElasticNet fits for varying values of λ. The intercept for each λ value is stored in `path.a0`, and the coefficients for each fit are stored in compressed form in `path.betas`.

```julia
julia> path.betas
4×86 CompressedPredictorMatrix:
0.0 0.0925032 0.176789 0.253587 0.323562 0.387321 0.445416 0.498349 0.546581 0.590527 0.63057 0.667055 0.700299 … 1.33905 1.34855 1.35822 1.36768 1.37563 1.3829 1.39005 1.39641 1.40204 1.40702 1.41195
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 -0.165771 -0.17235 -0.178991 -0.185479 -0.190945 -0.195942 -0.200851 -0.20521 -0.209079 -0.212501 -0.215883
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 -0.00968611 -0.0117121 -0.0135919 -0.0154413 -0.0169859 -0.0183965 -0.0197951 -0.0210362 -0.0221345 -0.0231023 -0.0240649
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 -0.110093 -0.110505 -0.111078 -0.11163 -0.112102 -0.112533 -0.112951 -0.113324 -0.113656 -0.113953 -0.11424
```

This `CompressedPredictorMatrix` can be indexed like any other `AbstractMatrix`, or converted to a `Matrix` using `convert(Matrix, path.betas)`.
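For example, individual entries can be read directly, or the full path materialized when convenient. A minimal sketch (the `(4, 86)` shape matches the display above; actual values depend on the random data):

```julia
julia> β10 = path.betas[:, 10];   # coefficient vector at the 10th λ value

julia> B = convert(Matrix, path.betas);   # dense copy of the whole path

julia> size(B)
(4, 86)
```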

One can visualize the path with:

```julia
julia> using Plots, LinearAlgebra, LaTeXStrings

julia> betaNorm = [norm(x, 1) for x in eachslice(path.betas,dims=2)];

julia> extraOptions = (xlabel=L"\| \beta \|_1",ylabel=L"\beta_i", legend=:topleft,legendtitle="Variable", labels=[1 2 3 4]);

julia> plot(betaNorm, path.betas'; extraOptions...)
```
![regression-lasso-path](regressionLassoPath.svg)

To predict the output for each model along the path for a given set of predictors, use `predict`:

```julia
julia> predict(path, [22 22+randn()*5 22+randn()*10 22+randn()*20])
1×86 Array{Float64,2}:
50.3295 47.6932 45.291 43.1023 41.108 39.2909 37.6352 36.1265 34.7519 33.4995 32.3583 31.3184 30.371 29.5077 28.7211 28.0044 … 21.3966 21.3129 21.2472 21.1746 21.1191 21.0655 21.0127 20.9687 20.9284 20.8885 20.8531 20.8218 20.7942 20.7667
```

To find the best value of λ by cross-validation, use `glmnetcv`:

```julia
julia> cv = glmnetcv(X, y)
Least Squares GLMNet Cross Validation
86 models for 4 predictors in 10 folds
Best λ 0.136 (mean loss 101.530, std 10.940)

julia> argmin(cv.meanloss)
59

julia> coef(cv) # equivalent to cv.path.betas[:, 59]
4-element Array{Float64,1}:
1.1277676556880305
0.0
0.0
-0.08747434292954445
```
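The selected λ can also be queried directly: `lambdamin` (used again in the plots below) returns the λ with the smallest mean cross-validated loss, and `predict` accepts the cross-validation object to predict with the model chosen by CV. A minimal sketch reusing `cv` and `X` from above:

```julia
julia> λbest = lambdamin(cv);   # the λ reported as "Best λ" above

julia> yhat = predict(cv, X);   # predictions at the cross-validated λ
```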

### A classification example

```julia
julia> using RDatasets, DataFrames, StatsBase   # StatsBase provides sample

julia> iris = dataset("datasets", "iris");

julia> X = Matrix(iris[:, 1:4]);

julia> y = string.(iris[!, :Species]);

julia> iTrain = sample(1:size(X,1), 100, replace = false);

julia> iTest = setdiff(1:size(X,1), iTrain);

julia> iris_cv = glmnetcv(X[iTrain, :], y[iTrain])
Multinomial GLMNet Cross Validation
100 models for 4 predictors in 10 folds
Best λ 0.001 (mean loss 0.130, std 0.054)

julia> yht = round.(predict(iris_cv, X[iTest, :], outtype = :prob), digits=3);

julia> DataFrame(target=y[iTest], set=yht[:,1], ver=yht[:,2], vir=yht[:,3])[5:5:50, :]
10×4 DataFrame
│ Row │ target     │ set     │ ver     │ vir     │
│     │ String     │ Float64 │ Float64 │ Float64 │
├─────┼────────────┼─────────┼─────────┼─────────┤
│ 1   │ setosa     │ 0.997   │ 0.003   │ 0.0     │
│ 2   │ setosa     │ 0.995   │ 0.005   │ 0.0     │
│ 3   │ setosa     │ 0.999   │ 0.001   │ 0.0     │
│ 4   │ versicolor │ 0.0     │ 0.997   │ 0.003   │
│ 5   │ versicolor │ 0.0     │ 0.36    │ 0.64    │
│ 6   │ versicolor │ 0.0     │ 0.05    │ 0.95    │
│ 7   │ virginica  │ 0.0     │ 0.002   │ 0.998   │
│ 8   │ virginica  │ 0.0     │ 0.001   │ 0.999   │
│ 9   │ virginica  │ 0.0     │ 0.0     │ 1.0     │
│ 10  │ virginica  │ 0.0     │ 0.001   │ 0.999   │

julia> irisLabels = reshape(names(iris)[1:4], (1, 4));

julia> βs = iris_cv.path.betas;

julia> λs = iris_cv.lambda;

julia> sharedOpts = (legend=false, xlabel=L"\lambda", xscale=:log10);

julia> p1 = plot(λs, βs[:,1,:]', ylabel=L"\beta_i"; sharedOpts...);

julia> p2 = plot(λs, βs[:,2,:]', title="Across cross-validation runs"; sharedOpts...);

julia> p3 = plot(λs, βs[:,3,:]', legend=:topright, legendtitle="Variable", labels=irisLabels, xlabel=L"\lambda", xscale=:log10);

julia> plot(p1, p2, p3, layout=(1,3))
```
![iris-lasso-path](iris_path.svg)

```julia
julia> plot(iris_cv.lambda, iris_cv.meanloss, xscale=:log10, legend=false, yerror=iris_cv.stdloss, xlabel=L"\lambda", ylabel="loss")

julia> vline!([lambdamin(iris_cv)])
```
![iris-cv](lambda_plot.svg)

## Fitting models

`glmnet` has two required parameters: the n x m predictor matrix `X` and the dependent variable `y`. It additionally accepts an optional third argument, `family`, which can be used to specify a generalized linear model. Currently `Normal()` (least squares, the default), `MvNormal()` (multi-response Gaussian), `Binomial()` (logistic), `Poisson()`, `Multinomial()`, and `CoxPH()` (Cox proportional hazards) are supported.

- For linear and Poisson models, `y` is a numeric vector of length n.
- For logistic models, `y` is either a string vector of length n or an n x 2 matrix, where the first column is the count of negative responses for each row in `X` and the second column is the count of positive responses.
- For multinomial models, `y` is either a string vector (with at least 3 unique values) or an n x k matrix, where k is the number of unique values (classes).
- For Cox models, `y` is an n x 2 matrix, where the first column is survival time and the second column is (right) censoring status. Note that for survival data, `glmnet` has another method `glmnet(X::Matrix, time::Vector, status::Vector)` (`glmnetcv` has a corresponding method as well). A sketch of these formats follows this list.
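To make these shapes concrete, here is a minimal sketch of the logistic (count-matrix) and Cox response formats. The data are synthetic and purely illustrative; family types such as `Binomial()` come from the Distributions package:

```julia
using GLMNet, Distributions

X = randn(100, 5)

# Logistic: an n x 2 matrix of [negative_count positive_count] per row of X
ybin = Float64.([rand(1:5, 100) rand(1:5, 100)])
logit_path = glmnet(X, ybin, Binomial())

# Cox: survival time and right-censoring status via the dedicated method
stime  = 10 .* rand(100)              # survival times
status = rand(0:1, 100)               # 1 = event observed, 0 = right-censored
cox_path = glmnet(X, stime, status)   # same model as glmnet(X, [stime status], CoxPH())
```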

`glmnet` also accepts many optional keyword parameters, as described below (a combined example follows the list):

- `weights`: A vector of length n of weights.
- `alpha`: The tradeoff between lasso and ridge regression. This defaults to `1.0`, which specifies a lasso model.
- `penalty_factor`: A vector of length m of penalties for each predictor/column in `X`. This defaults to all ones, which weighs each predictor equally. To unpenalize a predictor, set the corresponding entry to zero.
- `constraints`: A 2 x m matrix specifying lower bounds (first row) and upper bounds (second row) of the predictors. By default, this is `[-Inf; Inf]` for each predictor in `X`.
- `dfmax`: The maximum number of predictors in the largest model.
- `pmax`: The maximum number of predictors in any model.
- `nlambda`: The number of values of λ along the path to consider.
- `lambda_min_ratio`: The smallest λ value to consider, as a ratio of the λ value that gives the null model (i.e., the model with only an intercept). If the number of observations (n) exceeds the number of variables (m), this defaults to `0.0001`, otherwise `0.01`.
- `lambda`: The λ values to consider. By default, this is determined from `nlambda` and `lambda_min_ratio`.
- `tol`: Convergence criterion, with the default value of `1e-7`.
- `standardize`: Whether to standardize predictors so that they are in the same units, with the default value of `true`. Beta values are always presented on the original scale.
- `standardize_response`: (only for `MvNormal()`) Whether to standardize the response variables so that they are in the same units. Defaults to `false`.
- `intercept`: Whether to fit an intercept term. The intercept is always unpenalized. The default value is `true`.
- `maxit`: The maximum number of iterations of the cyclic coordinate descent algorithm. If convergence is not achieved, a warning is returned. The default value is `1e6`.
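Several of these keyword arguments can be combined in a single call. A hedged sketch reusing `X` and `y` from the quick start; the particular values are illustrative, not recommendations:

```julia
path = glmnet(X, y;
              alpha = 0.5,                            # elastic net midway between ridge (0) and lasso (1)
              penalty_factor = [0.0, 1.0, 1.0, 1.0],  # a zero entry leaves that predictor unpenalized
              nlambda = 50,                           # 50 values of λ along the path
              lambda_min_ratio = 0.001,               # stop the path earlier than the default
              tol = 1e-8)                             # tighter convergence criterion
```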

## See also

- [Lasso.jl](https://github.com/simonster/Lasso.jl), a pure Julia implementation of the glmnet coordinate descent algorithm that often achieves better performance.
- [LARS.jl](https://github.com/simonster/LARS.jl), an implementation of least angle regression for fitting entire linear (but not generalized linear) Lasso and Elastic Net coordinate paths.