Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
SparseStep: Approximating the Counting Norm for Sparse Regularization
https://github.com/gjjvdburg/sparsestep
- Host: GitHub
- URL: https://github.com/gjjvdburg/sparsestep
- Owner: GjjvdBurg
- License: gpl-3.0
- Created: 2017-01-25T13:08:02.000Z
- Default Branch: master
- Last Pushed: 2021-01-12T16:31:17.000Z
- Last Synced: 2024-06-11T17:08:26.959Z
- Topics: feature-selection, lasso-variants, r, regularized-linear-regression, sparse-regression, sparse-regularization
- Language: R
- Homepage: https://arxiv.org/abs/1701.06967
- Size: 85.9 KB
- Stars: 1
- Watchers: 5
- Forks: 1
- Open Issues: 0
- Metadata Files:
- Readme: README.md
- License: LICENSE
README
# SparseStep R Package
SparseStep is an R package for sparse regularized regression and provides an
alternative to methods such as best subset selection, elastic net, lasso, and
lars. The SparseStep method is introduced in the following paper:
[SparseStep: Approximating the Counting Norm for Sparse
Regularization](https://arxiv.org/abs/1701.06967) by G.J.J. van den Burg,
P.J.F. Groenen, and A. Alfons (*arXiv preprint arXiv:1701.06967 [stat.ME]*,
2017).

This R package can be easily installed by running
``install.packages('sparsestep')`` in R. If you use the package in your work,
please cite the above reference using, for instance, the following BibTeX
entry:

```bibtex
@article{vandenburg2017sparsestep,
title = {{SparseStep}: Approximating the Counting Norm for Sparse Regularization},
author = {{Van den Burg}, G. J. J. and Groenen, P. J. F. and Alfons, A.},
journal = {arXiv preprint arXiv:1701.06967},
year = {2017}
}
```

## Introduction
The SparseStep method solves the regression problem regularized with the
[`l_0` norm](https://en.wikipedia.org/wiki/Lp_space#When_p_=_0). Since the
`l_0` term is highly non-convex and therefore difficult to optimize, this
non-convexity is introduced gradually in SparseStep during optimization. As in
other regularized regression methods such as ridge regression and lasso, a
regularization parameter ``lambda`` can be specified to control the amount of
regularization. The choice of regularization parameter affects how many
non-zero variables remain in the final model.
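Concretely, the problem SparseStep targets and the smooth surrogate it uses in
place of the counting norm can be written as follows (a sketch in the paper's
notation; the surrogate sharpens as ``gamma`` shrinks towards zero):

```latex
% Exact l_0-regularized least squares problem
\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_0
% SparseStep surrogate: recovers the counting norm in the limit gamma -> 0
\qquad \text{with} \qquad
\lVert \beta \rVert_0 \approx \sum_{j} \frac{\beta_j^2}{\beta_j^2 + \gamma^2}
```

For large ``gamma`` the surrogate is nearly quadratic and easy to optimize; as
``gamma`` decreases it approaches the exact counting norm, which is how the
non-convexity is introduced gradually during optimization.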
We will give a quick guide to SparseStep using the Prostate dataset from the
book [Elements of Statistical
Learning](https://web.stanford.edu/~hastie/ElemStatLearn/). We will show a few
examples of running SparseStep on the Prostate dataset from the
[lasso2](https://cran.r-project.org/web/packages/lasso2/index.html)
package. First we load the data and create a data matrix and outcome vector:

```r
> # Load the Prostate data and keep only the training observations
> prostate <- read.table("http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/prostate.data")
> # Predictors: every column except the outcome (column 1, lcavol) and the train flag (column 10)
> X <- prostate[prostate$train == T, c(-1, -10)]
> X <- as.matrix(X)
> # Outcome: lcavol, the first column
> y <- prostate[prostate$train == T, 1]
> y <- as.vector(y)
```
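If you already know which values of ``lambda`` you want to try, you can also
call the ``sparsestep`` function directly instead of estimating a full path.
A minimal sketch with arbitrarily chosen penalty values (see ``?sparsestep``
for the full set of arguments):

```r
> # Fit SparseStep at a few fixed, arbitrarily chosen penalty values
> fit <- sparsestep(X, y, lambda = c(0.1, 1, 10))
> # One column of coefficients per value of lambda
> coef(fit)
```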
The easiest way to fit a SparseStep model is to use the ``path.sparsestep``
function. This estimates the entire path of solutions for the SparseStep model
for different values of the regularization parameter using a [golden-section
search](https://en.wikipedia.org/wiki/Golden-section_search) algorithm.

```r
> path <- path.sparsestep(X, y)
Found maximum value of lambda: 2^( 7 )
Found minimum value of lambda: 2^( -3 )
Running search in interval [ -3 , 7 ] ...
Running search in interval [ -3 , 2 ] ...
Running search in interval [ -3 , -0.5 ] ...
Running search in interval [ -3 , -1.75 ] ...
Running search in interval [ -0.5 , 2 ] ...
Running search in interval [ -0.5 , 0.75 ] ...
Running search in interval [ 0.125 , 0.75 ] ...
Running search in interval [ 2 , 7 ] ...
> plot(path, col=1:nrow(path$beta))  # col specifies colors to matplot
> legend('topleft', legend=rownames(path$beta), lty=1, col=1:nrow(path$beta))
```

In the resulting plot we can see the coefficients of the features that are
included in the model at different values of ``lambda``:

![SparseStep regression on Prostate dataset](./.github/images/sparsestep_prostate_1.png)
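The same information can be read off numerically by counting the non-zero
coefficients of each solution on the path (a small sketch; the intercept row
is counted as well):

```r
> # Number of non-zero coefficients for each solution on the path
> colSums(as.matrix(coef(path)) != 0)
```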
The coefficients of the model can be obtained using ``coef(path)``, which
returns a sparse matrix:

```r
> coef(path)
9 x 9 sparse Matrix of class "dgCMatrix"
s0 s1 s2 s3 s4 s5 s6 s7
Intercept 1.31349155 1.313491553 1.313491553 1.31349155 1.313491553 1.31349155 1.3134916 1.313492
lweight -0.11336968 -0.113485291 . . . . . .
age 0.02010188 0.020182049 0.018605327 0.01491472 0.018704172 0.01623212 . .
lbph -0.05698125 -0.059026246 -0.069116923 . . . . .
svi 0.03511645 . . . . . . .
lcp 0.41845469 0.423398063 0.420516410 0.43806447 0.433449263 0.38174743 0.3887863 .
gleason 0.22438690 0.222333394 0.236944796 0.23503609 . . . .
pgg45 -0.00911273 -0.009084031 -0.008949463 -0.00853420 -0.004328518 . . .
lpsa 0.57545508 0.580111724 0.561063637 0.53017309 0.528953966 0.51473225 0.5336907 0.754266
s8
Intercept 1.313492
lweight .
age .
lbph .
svi .
lcp .
gleason .
pgg45 .
lpsa .
```

Note that the final model included in ``coef(path)`` is an intercept-only
model, which is generally not very useful. Predicting out-of-sample data can
be done easily using the ``predict`` function.
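For example, the held-out observations (those with ``train == FALSE``) can be
scored against every solution on the path. A minimal sketch, assuming
``predict`` accepts the new data matrix as its second argument (see
``?predict.sparsestep``):

```r
> # Build the test-set design matrix the same way as the training one
> X.test <- as.matrix(prostate[prostate$train == F, c(-1, -10)])
> # One column of predictions per solution on the path
> yhat <- predict(path, X.test)
```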
By default SparseStep centers the regressors and the outcome variable ``y`` and
normalizes the regressors ``X`` to ensure that the regularization is applied
evenly among them and the intercept is not penalized. If you prefer to use a
constant term in the regression and penalize this as well, you'll have to
transform the input data and disable the intercept:

```r
> Z <- cbind(constant=1, X)
> path <- path.sparsestep(Z, y, intercept=F)
...
> plot(path, col=1:nrow(path$beta))
> legend('bottomright', legend=rownames(path$beta), lty=1, col=1:nrow(path$beta))
```

Note that since we add the constant through the data matrix, it is subject to
regularization like any other coefficient and can therefore be set to zero:

![SparseStep regression on Prostate dataset (with constant)](./.github/images/sparsestep_prostate_2.png)
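You can confirm this by pulling out the row of the coefficient matrix that
corresponds to the added constant (a sketch, relying on the column name
``constant`` we used when building ``Z``):

```r
> # Coefficient of the penalized constant at each lambda on the path;
> # it is set to zero once the penalty is large enough
> coef(path)["constant", ]
```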
For more information and examples, please see the documentation included with
the package. In particular, the following pages are good places to start:

```r
> ?'sparsestep-package'
> ?sparsestep
> ?path.sparsestep
```

## Reference
If you use SparseStep in any of your projects, please cite the paper using the
information available through the R command ``citation('sparsestep')``, or use
the following BibTeX code:

```bibtex
@article{van2017sparsestep,
title = {{SparseStep}: Approximating the Counting Norm for Sparse Regularization},
author = {Gerrit J.J. {van den Burg} and Patrick J.F. Groenen and Andreas Alfons},
journal = {arXiv preprint arXiv:1701.06967},
archiveprefix = {arXiv},
year = {2017},
eprint = {1701.06967},
url = {https://arxiv.org/abs/1701.06967},
primaryclass = {stat.ME},
keywords = {Statistics - Methodology, 62J05, 62J07},
}
```

## Notes
This package is licensed under GPLv3. Please see the LICENSE file for more
information. If you have any questions or comments about this package, please
open an issue [on GitHub](https://github.com/GjjvdBurg/sparsestep) (don't
hesitate, you're helping to make this project better for everyone!). If you
prefer to use email, please write to ``gertjanvandenburg at gmail dot com``.