Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/akoyabio/rtree

R-Trees for point data in R
https://github.com/akoyabio/rtree

Last synced: 18 days ago
JSON representation

R-Trees for point data in R

Awesome Lists containing this project

README

        

---
output: github_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
echo = TRUE,
fig.path = "man/figures/README-"
)
```

The `rtree` package offers fast Euclidean within-distance checks and
KNN calculations for points in 2D space.
It offers significant speed-ups vis-a-vis simple implementations
by relying on the [R-tree data structure](https://en.wikipedia.org/wiki/R-tree)
implemented by the
[Boost geometry](https://www.boost.org/doc/libs/1_75_0/libs/geometry/doc/html/geometry/spatial_indexes/introduction.html)
library.

`rtree` was inspired by
[this](https://gallery.rcpp.org/articles/Rtree-examples/)
example in the Rcpp gallery.

## Installation

### From CRAN

```{r eval=FALSE}
install.packages("rtree")
```

### Development version

```{r eval=FALSE}
# install.packages("remotes") # Install if needed
remotes::install_github("hunzikp/rtree")
```

Note: As of version 0.2.0, `rtree` requires R version 4.0.0 or higher.
This is because version 1.75 of `boost::geometry` requires C++14 which is not
well supported in Windows R versions before 4.0.0.

## Usage

Say we have two large sets of points, A and B, stored as 2-column matrices of
Cartesian coordinates:
```{r simulate}
## Simulate point coordinates
set.seed(0)
A_n <- 10^4
A <- cbind(runif(A_n), runif(A_n))
B_n <- 10^4
B <- cbind(runif(B_n), runif(B_n))
colnames(A) <- colnames(B) <- c('x', 'y')
```

### Within-Distance Calculation

For each point of set $A$, $a_i$, we want to know all points of set $B$ that
are within distance $d$ of $a_i$.
To compute this, we first create an R-Tree index on $B$:
```{r index}
library(rtree)

## Set index
B_rtree <- RTree(B)
```

The `RTree()` function creates an S3 object of class `RTree`,
```{r class}
inherits(B_rtree, 'RTree')
```
which essentially just points to a C++ object of class `RTreeCpp`.

Using the `RTree` object, we can now perform our query efficiently:
```{r within}
## Within distance calculation
d <- 0.05
wd_ls <- withinDistance(B_rtree, A, d)
```
`wd_ls` is a list of length `nrow(A)`...
```{r check}
nrow(A)==length(wd_ls)
```
...whereby the $i$th list element contains the row-indices of the points
in $B$ that are within distance $d$ of point $a_i$:
```{r check2}
print(wd_ls[[1]])
```

We can also check the sanity of the result visually:
```{r checkplot, fig.cap = "Within distance sanity check.", message=FALSE, warning=FALSE}
## Plot points in B within distance d of point a_1
a_1 <- A[1,] # Get coords of a_1
plot(a_1[1], a_1[2], xlim=c(a_1[1]-d, a_1[1]+d), ylim=c(a_1[2]-d, a_1[2]+d),
col='black', asp=1, pch=20, xlab='x', ylab='y') # Plot a_1
points(B[,1], B[,2], col='grey') # Plot B in grey
symbols(a_1[1], a_1[2], circles=d, add=TRUE, inches=FALSE) # Draw circle of radius d
b_wd <- B[wd_ls[[1]],] # Get relevant points in B
points(b_wd[,1], b_wd[,2], col='red', pch=20) # Plot relevant points in red
```

### Nearest Neighbor Calculation

For each point of set $A$, $a_i$, we want to know the $k$ points in B
closest to $a_i$.
Recycling the `RTree` object created above, we perform the knn computation...
```{r knn}
## KNN calculation
k <- 10L
knn_ls <- knn(B_rtree, A, k)
```
...which returns a list of the same format as above, with the exception that
each element of `knn_ls` is exactly of length $k$.

Again, we may plot the result to inspect its veracity:
```{r checkplot2, fig.cap = "KNN sanity check.", message=FALSE, warning=FALSE}

## Plot points in B within distance d of point a_1
a_1 <- A[1,] # Get coords of a_1
plot(a_1[1], a_1[2], xlim=c(a_1[1]-d, a_1[1]+d), ylim=c(a_1[2]-d, a_1[2]+d),
col='black', asp=1, pch=20, xlab='x', ylab='y') # Plot a_1
points(B[,1], B[,2], col='grey') # Plot B in grey
b_knn <- B[knn_ls[[1]],] # Get relevant points in B
points(b_knn[,1], b_knn[,2], col='red', pch=20) # Plot relevant points in red
```

## Benchmarking

### Within-Distance Benchmarks

We first compare the within-distance functionality to the `gWithinDistance()`
function offered in [rgeos](https://cran.r-project.org/package=rgeos)
(version `r packageVersion('rgeos')`).
```{r wd_bench, fig.cap = "", message=FALSE, warning=FALSE}
## Load packages
library(sp)
library(rgeos)
library(rbenchmark)

## Simulate data
set.seed(0)
A_n <- 10^3
A <- cbind(runif(A_n), runif(A_n))
B_n <- 10^3
B <- cbind(runif(B_n), runif(B_n))
d <- 0.05

## Encapsulate wd operations in functions, then benchmark
rgeos.wd <- function() {
wd_mat <- gWithinDistance(spgeom1=SpatialPoints(A), spgeom2=SpatialPoints(B),
dist=d, byid=TRUE)
}
rtree.wd <- function() {
wd_ls <- withinDistance(RTree(B), A, d)
}
bm.wd <- benchmark(rtree=rtree.wd(),
rgeos=rgeos.wd(),
replications=10,
columns=c("test", "replications", "elapsed", "relative"))

## Print output
print(bm.wd)

## Plot
barplot(bm.wd$relative, names.arg=bm.wd$test,
ylab="Relative Time Elapsed", cex.main=1.5)
mtext("within distance", line=3, cex=1.5, font=2)
speedup <- round(bm.wd$relative[bm.wd$test=="rgeos"], 1)
mtext(paste("rtree ", speedup, "x faster than rgeos", sep=""),
line=1.5, cex=1.25)
```

### KNN Benchmarks

Next we compare the KNN functionality with the KNN implementation based on
d-trees offered in the [FNN](https://cran.r-project.org/package=FNN)
package (version 1.1).
We don't offer benchmarking statistics against a linear search KNN
implementation, which would obviously be much, much slower.
```{r knn_bench, fig.cap = "", message=FALSE, warning=FALSE}
## Load packages
library(FNN)

## Simulate data
set.seed(0)
A_n <- 10^4
A <- cbind(runif(A_n), runif(A_n))
B_n <- 10^4
B <- cbind(runif(B_n), runif(B_n))
k <- 100L

## Encapsulate knn operations in functions, then benchmark
kdtree.knn <- function() {
nn.idx <- get.knnx(data=B, query=A, k=k, algorithm=c("kd_tree"))
}
rtree.knn <- function() {
nn_ls <- rtree::knn(RTree(B), A, k)
}
bm.knn <- benchmark(rtree=rtree.knn(),
kdtree=kdtree.knn(),
replications=10,
columns=c("test", "replications", "elapsed", "relative"))

## Print output
print(bm.knn)

## Plot
barplot(bm.knn$relative, names.arg=bm.knn$test,
ylab="Relative Time Elapsed", cex.main=1.5)
mtext("KNN", line=3, cex=1.5, font=2)
speedup <- round(bm.knn$relative[bm.knn$test=="kdtree"], 1)
mtext(paste("rtree ", speedup, "x faster than FNN (kd-tree)", sep=""),
line=1.5, cex=1.25)
```