---
title: "FastKNN [![Travis-CI Build Status](https://travis-ci.org/davpinto/fastknn.svg?branch=master)](https://travis-ci.org/davpinto/fastknn) [![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/davpinto/fastknn?branch=master&svg=true)](https://ci.appveyor.com/project/davpinto/fastknn) [![codecov](https://codecov.io/github/davpinto/fastknn/branch/master/graphs/badge.svg)](https://codecov.io/github/davpinto/fastknn) [![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/fastknn?color=blue)](https://cran.r-project.org/package=fastknn)"
output: github_document
urlcolor: magenta
---

> Fast k-Nearest Neighbors Classifier with shrinkage estimator for the class membership probabilities

```{r setup, include=FALSE}
## rmarkdown::render('README.Rmd', 'github_document')
## rmarkdown::render(input = 'README.Rmd', output_format = 'html_document', output_file = "index.html", output_dir = "./docs")
## pkgdown::build_site()

knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE,
                      fig.align = "center", fig.height = 4, dpi = 120,
                      cache = TRUE)
```

```{r, echo=FALSE, out.width='156px', fig.align='center'}
knitr::include_graphics('fastknn_logo.png')
```

### Why `fastknn`?

------

`fastknn` is now available on [Kaggle](https://github.com/Kaggle/docker-rstats). Take a look at this [kernel](https://www.kaggle.com/davidpinto/d/uciml/forest-cover-type-dataset/fastknn-show-to-glm-what-knn-see-0-96) for an example of how to use `fastknn` to improve your performance on **Kaggle** competitions.

------

1. Build KNN classifiers with **large datasets** (> 100k rows) in a few seconds.
1. Predict more **calibrated probabilities** and reduce log-loss with the `"dist"` estimator.
1. Find the **best k** parameter according to a variety of loss functions, using n-fold cross-validation.
1. Plot beautiful classification **decision boundaries** for your dataset.
1. Do **feature engineering** and extract highly informative features from your dataset.
1. Compete in **Kaggle**.

Give it a try and let me know what you think!

## Fast Nearest Neighbor Searching

The `fastknn` method implements a k-Nearest Neighbor (KNN) classifier based on the [ANN](https://www.cs.umd.edu/~mount/ANN) library. ANN is written in `C++` and is able to find the k nearest neighbors for every point in a given dataset in `O(N log N)` time. The package [RANN](https://github.com/jefferis/RANN) provides an easy interface to the ANN library in `R`.
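
As a quick illustration of the underlying search (a sketch that uses `RANN` directly and is not part of `fastknn`'s own API), `RANN::nn2()` returns the indices of and distances to the k nearest training points for each query point:

```{r, eval=FALSE}
## Illustrative only: the kd-tree search that fastknn builds on
library("RANN")

set.seed(123)
train <- matrix(rnorm(1e5 * 10), ncol = 10)   # 100k training points, 10 features
query <- matrix(rnorm(1e3 * 10), ncol = 10)   # 1k query points

## Indices and distances of the 10 nearest training points for each query point
nn <- RANN::nn2(data = train, query = query, k = 10)
str(nn)  # list with components nn.idx and nn.dists
```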

## The FastKNN Classifier

`fastknn` was developed to deal with very large datasets (> 100k rows) and is ideal for [Kaggle](https://www.kaggle.com) competitions. It can be about 50x faster than the popular `knn` method from the `R` package [class](https://cran.r-project.org/web/packages/class) for large datasets. Moreover, `fastknn` provides a shrinkage estimator for the class membership probabilities, based on the inverse distances of the nearest neighbors (see the equations on the [fastknn website](https://davpinto.github.io/fastknn/)):

$$
P(x_i \in y_j) = \displaystyle\frac{\displaystyle\sum\limits_{k=1}^K \left( \frac{1}{d_{ik}}\cdot(n_{ik} \in y_j) \right)}{\displaystyle\sum\limits_{k=1}^K \left( \frac{1}{d_{ik}} \right)}
$$

where $x_i$ is the $i^{\text{th}}$ test instance, $y_j$ is the $j^{\text{th}}$ unique class label, $n_{ik}$ is the $k^{\text{th}}$ nearest neighbor of $x_i$, and $d_{ik}$ is the distance between $x_i$ and $n_{ik}$. This estimator can be thought of as a weighted voting rule, where neighbors that are closer to $x_i$ have more influence on the prediction of $x_i$'s label.

In general, the weighted estimator provides more **calibrated probabilities** when compared with the traditional estimator based on the label proportions of the nearest neighbors, and reduces **logarithmic loss** (log-loss).
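
To make the estimator concrete, here is a minimal sketch that mirrors the equation above for a single test instance (it is illustrative only, not `fastknn`'s internal implementation):

```{r, eval=FALSE}
## Weighted voting rule for one test point, computed from its k nearest
## neighbors' distances and class labels (illustrative sketch)
weighted_vote <- function(nn.dist, nn.label, class.levels, eps = 1e-8) {
   w <- 1 / pmax(nn.dist, eps)             # inverse-distance weights
   sapply(class.levels, function(lbl) {
      sum(w * (nn.label == lbl)) / sum(w)  # shrinkage estimate for class lbl
   })
}

## Example: 5 neighbors, two classes
d <- c(0.1, 0.2, 0.4, 0.5, 0.9)
lab <- c("A", "A", "B", "A", "B")
weighted_vote(d, lab, class.levels = c("A", "B"))
```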

### How to install `fastknn`?

The package `fastknn` is not on CRAN, so you need to install it directly from GitHub:

```{r, eval=FALSE}
library("devtools")
install_github("davpinto/fastknn")
```

### Required Packages

The core of `fastknn` is the `RANN` package, but other packages are required for `fastknn` to work properly. All of them are installed automatically when you install `fastknn`.

* `RANN` for fast nearest neighbors searching,
* `foreach` and `doSNOW` to do parallelized cross-validation,
* `Metrics` to measure classification performance,
* `matrixStats` for fast matrix column-wise and row-wise statistics,
* `ggplot2` to plot classification decision boundaries,
* `viridis` for modern color palettes.

### Getting Started

Using `fastknn` is as simple as:

```{r}
## Load packages
library("fastknn")
library("caTools")

## Load toy data
data("chess", package = "fastknn")

## Split data for training and test
set.seed(123)
tr.idx <- which(caTools::sample.split(Y = chess$y, SplitRatio = 0.7))
x.tr <- chess$x[tr.idx, ]
x.te <- chess$x[-tr.idx, ]
y.tr <- chess$y[tr.idx]
y.te <- chess$y[-tr.idx]

## Fit KNN
yhat <- fastknn(x.tr, y.tr, x.te, k = 10)

## Evaluate model on test set
sprintf("Accuracy: %.2f", 100 * (1 - classLoss(actual = y.te, predicted = yhat$class)))
```

## Find the Best k

`fastknn` provides an interface to select the best `k` using n-fold cross-validation. There are 4 possible **loss functions**:

* Overall classification error rate: `eval.metric = "overall_error"`
* Mean per-class classification error rate: `eval.metric = "mean_error"`
* Mean per-class AUC: `eval.metric = "auc"`
* Cross-entropy / logarithmic loss: `eval.metric = "logloss"`

Cross-validation using the **voting** probability estimator:

```{r, results='hide'}
## Load dataset
library("mlbench")
data("Sonar", package = "mlbench")
x <- data.matrix(Sonar[, -61])
y <- Sonar$Class

## 5-fold CV using log-loss as evaluation metric
set.seed(123)
cv.out <- fastknnCV(x, y, k = 3:15, method = "vote", folds = 5, eval.metric = "logloss")
cv.out$cv_table
```

```{r, echo=FALSE}
pander::pander(cv.out$cv_table)
```

Cross-validation using the **weighted voting** probability estimator:

```{r, results='hide'}
## 5-fold CV using log-loss as evaluation metric
set.seed(123)
cv.out <- fastknnCV(x, y, k = 3:15, method = "dist", folds = 5, eval.metric = "logloss")
cv.out$cv_table
```

```{r, echo=FALSE}
pander::pander(cv.out$cv_table)
```

Note that the mean **log-loss** for the **weighted voting** estimator is lower for every `k` evaluated.

Parallelization is available. You can specify the number of threads via the `nthread` parameter.
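
For example, the cross-validation above could be distributed over 4 threads (a sketch assuming `nthread` is passed directly to `fastknnCV()`, as described above):

```{r, eval=FALSE}
## Same 5-fold CV as before, run on 4 threads (nthread usage is a sketch)
set.seed(123)
cv.out <- fastknnCV(x, y, k = 3:15, method = "dist", folds = 5,
                    eval.metric = "logloss", nthread = 4)
cv.out$cv_table
```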

## Plot Classification Decision Boundary

`fastknn` provides a plotting function, based on `ggplot2`, to draw two-dimensional decision boundaries. If your dataset has more than 2 variables, only the first two are considered. In future versions of `fastknn`, the most descriptive variables will be selected automatically beforehand, using a **feature ranking** technique.

### Two-class Problem

```{r}
## Load toy data
data("spirals", package = "fastknn")

## Split data for training and test
set.seed(123)
tr.idx <- which(caTools::sample.split(Y = spirals$y, SplitRatio = 0.7))
x.tr <- spirals$x[tr.idx, ]
x.te <- spirals$x[-tr.idx, ]
y.tr <- spirals$y[tr.idx]
y.te <- spirals$y[-tr.idx]

## Plot decision boundary
knnDecision(x.tr, y.tr, x.te, y.te, k = 15)
```

### Multi-class Problem

```{r}
## Load toy data
data("multi_spirals", package = "fastknn")

## Split data for training and test
set.seed(123)
tr.idx <- which(caTools::sample.split(Y = multi_spirals$y, SplitRatio = 0.7))
x.tr <- multi_spirals$x[tr.idx, ]
x.te <- multi_spirals$x[-tr.idx, ]
y.tr <- multi_spirals$y[tr.idx]
y.te <- multi_spirals$y[-tr.idx]

## Plot decision boundary
knnDecision(x.tr, y.tr, x.te, y.te, k = 15)
```

## Performance Test

Here we test the performance of `fastknn` on the [Covertype](https://archive.ics.uci.edu/ml/datasets/Covertype) dataset. It is hosted on the [UCI](https://archive.ics.uci.edu/ml/) repository and has already been used in a **Kaggle** [competition](https://www.kaggle.com/c/forest-cover-type-prediction). The dataset contains 581012 observations on 54 numeric features, classified into 7 different categories.

All experiments were conducted on a **64-bit Ubuntu 16.04 with Intel Core i7-6700HQ 2.60GHz and 16GB RAM DDR4**.

### Computing Time

Here `fastknn` is compared with the `knn` method from the package `class`. We had to use small samples from the Covertype data because `knn` takes too much time (> 1500s) to fit the entire dataset.

```{r, results='hide'}
#### Load packages
library('class')
library('fastknn')
library('caTools')

#### Load data
data("covertype", package = "fastknn")
covertype$Target <- as.factor(covertype$Target)

#### Test with different sample sizes
N <- nrow(covertype)
sample.frac <- c(10e3, 15e3, 20e3)/N
res <- lapply(sample.frac, function(frac, dt) {
   ## Reduce dataset
   set.seed(123)
   sample.idx <- which(sample.split(dt$Target, SplitRatio = frac))
   x <- as.matrix(dt[sample.idx, -55])
   y <- dt$Target[sample.idx]

   ## Split data
   set.seed(123)
   tr.idx <- which(sample.split(y, SplitRatio = 0.7))
   x.tr <- x[tr.idx, ]
   x.te <- x[-tr.idx, ]
   y.tr <- y[tr.idx]
   y.te <- y[-tr.idx]

   ## Measure time
   t1 <- system.time({
      yhat1 <- knn(train = x.tr, test = x.te, cl = y.tr, k = 10, prob = TRUE)
   })
   t2 <- system.time({
      yhat2 <- fastknn(xtr = x.tr, ytr = y.tr, xte = x.te, k = 10, method = "dist")
   })

   ## Return
   list(
      method = c('knn', 'fastknn'),
      nobs = as.integer(rep(N * frac, 2)),
      time_sec = c(t1[3], t2[3]),
      accuracy = round(100 * c(sum(yhat1 == y.te), sum(yhat2$class == y.te)) / length(y.te), 2)
   )
}, dt = covertype)
res <- do.call('rbind.data.frame', res)
res
```

```{r, echo=FALSE}
pander::pander(res)
```

`fastknn` takes **about 5s** to fit the entire dataset.
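
A sketch of how that timing can be reproduced is shown below (the 70/30 split is an assumption for illustration, not necessarily the exact benchmark setup):

```{r, eval=FALSE}
## Fit fastknn on the full Covertype data and record the elapsed time
set.seed(123)
tr.idx <- which(sample.split(covertype$Target, SplitRatio = 0.7))
t.full <- system.time({
   yhat <- fastknn(xtr = as.matrix(covertype[tr.idx, -55]),
                   ytr = covertype$Target[tr.idx],
                   xte = as.matrix(covertype[-tr.idx, -55]),
                   k = 10, method = "dist")
})
t.full[3]  # elapsed time in seconds
```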

### Probability Prediction

We compared the `voting` estimator with the `weighted voting` estimator:

**Voting**

```{r, results='hide'}
#### Extract input variables and response variable
x <- as.matrix(covertype[, -55])
y <- as.factor(covertype$Target)

#### 5-fold cross-validation
set.seed(123)
res <- fastknnCV(x, y, k = 10, method = "vote", folds = 5, eval.metric = "logloss")
res$cv_table
```

```{r, echo=FALSE}
pander::pander(res$cv_table)
```

**Weighted Voting**

```{r, results='hide'}
#### 5-fold cross-validation
set.seed(123)
res <- fastknnCV(x, y, k = 10, method = "dist", folds = 5, eval.metric = "logloss")
res$cv_table
```

```{r, echo=FALSE}
pander::pander(res$cv_table)
```

## Feature Engineering

**fastknn** provides a function to do **feature extraction** using KNN. It
generates `k * c` new features, where `c` is the number of class labels. The new
features are computed from the distances between the observations and their `k`
nearest neighbors inside each class. The following example shows that the
**KNN features** carry information that cannot be extracted from the data by a
linear learner, like a GLM model:

```{r, results='hide'}
library("mlbench")
library("caTools")
library("fastknn")
library("glmnet")

#### Load data
data("Ionosphere", package = "mlbench")
x <- data.matrix(subset(Ionosphere, select = -Class))
y <- Ionosphere$Class

#### Remove near zero variance columns
x <- x[, -c(1,2)]

#### Split data
set.seed(123)
tr.idx <- which(sample.split(Y = y, SplitRatio = 0.7))
x.tr <- x[tr.idx,]
x.te <- x[-tr.idx,]
y.tr <- y[tr.idx]
y.te <- y[-tr.idx]

#### GLM with original features
glm <- glmnet(x = x.tr, y = y.tr, family = "binomial", lambda = 0)
yhat <- drop(predict(glm, x.te, type = "class"))
yhat1 <- factor(yhat, levels = levels(y.tr))

#### Generate KNN features
set.seed(123)
new.data <- knnExtract(xtr = x.tr, ytr = y.tr, xte = x.te, k = 3)

#### GLM with KNN features
glm <- glmnet(x = new.data$new.tr, y = y.tr, family = "binomial", lambda = 0)
yhat <- drop(predict(glm, new.data$new.te, type = "class"))
yhat2 <- factor(yhat, levels = levels(y.tr))

#### Performance
sprintf("Accuracy: %.2f", 100 * (1 - classLoss(actual = y.te, predicted = yhat1)))
sprintf("Accuracy: %.2f", 100 * (1 - classLoss(actual = y.te, predicted = yhat2)))
```

```{r, echo=FALSE}
sprintf("Accuracy: %.2f", 100 * (1 - classLoss(actual = y.te, predicted = yhat1)))
sprintf("Accuracy: %.2f", 100 * (1 - classLoss(actual = y.te, predicted = yhat2)))
```

We can see that the **KNN features** considerably improved the classification
performance of the GLM model.

The `knnExtract()` function is based on the ideas presented in the
[winner solution](https://www.kaggle.com/c/otto-group-product-classification-challenge/forums/t/14335/1st-place-winner-solution-gilberto-titericz-stanislav-semenov) of the [Otto Group Product Classification Challenge](https://www.kaggle.com/c/otto-group-product-classification-challenge) on **Kaggle**.

Parallelization is available. You can specify the number of threads via the `nthread` parameter.

### Understanding the KNN Features

KNN makes a nonlinear mapping of the original space and projects it into a new
feature space in which the classes are linearly separable.

**Mapping the *chess* dataset**

```{r, results='hide'}
library("caTools")
library("fastknn")
library("ggplot2")
library("gridExtra")

## Load data
data("chess")
x <- data.matrix(chess$x)
y <- chess$y

## Split data
set.seed(123)
tr.idx <- which(sample.split(Y = y, SplitRatio = 0.7))
x.tr <- x[tr.idx,]
x.te <- x[-tr.idx,]
y.tr <- y[tr.idx]
y.te <- y[-tr.idx]

## Feature extraction with KNN
set.seed(123)
new.data <- knnExtract(x.tr, y.tr, x.te, k = 1)

## Decision boundaries
g1 <- knnDecision(x.tr, y.tr, x.te, y.te, k = 10) +
   labs(title = "Original Features")
g2 <- knnDecision(new.data$new.tr, y.tr, new.data$new.te, y.te, k = 10) +
   labs(title = "KNN Features")
grid.arrange(g1, g2, ncol = 2)
```

**Mapping the *spirals* dataset**

```{r, results='hide'}
## Load data
data("spirals")
x <- data.matrix(spirals$x)
y <- spirals$y

## Split data
set.seed(123)
tr.idx <- which(sample.split(Y = y, SplitRatio = 0.7))
x.tr <- x[tr.idx,]
x.te <- x[-tr.idx,]
y.tr <- y[tr.idx]
y.te <- y[-tr.idx]

## Feature extraction with KNN
set.seed(123)
new.data <- knnExtract(x.tr, y.tr, x.te, k = 1)

## Decision boundaries
g1 <- knnDecision(x.tr, y.tr, x.te, y.te, k = 10) +
   labs(title = "Original Features")
g2 <- knnDecision(new.data$new.tr, y.tr, new.data$new.te, y.te, k = 10) +
   labs(title = "KNN Features")
grid.arrange(g1, g2, ncol = 2)
```