An open API service indexing awesome lists of open source software.

https://github.com/davetang/learning_r

R notes
https://github.com/davetang/learning_r

Last synced: 4 months ago
JSON representation

R notes

Awesome Lists containing this project

README

        

---
title: "Learning R"
output:
github_document:
toc: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(cache = FALSE)
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(fig.path = "img/")
```

```{r install_packages, include=FALSE}
my_packages <- c("ggbeeswarm", "remotes", "git2r")

sapply(my_packages, function(x){
if(!require(x, character.only = TRUE)){
install.packages(x, quiet = TRUE)
}
})
```

# Introduction

The three core features of R are object-orientation, vectorisation, and its
functional programming style.

>"To understand computations in R, two slogans are helpful:
>
>* Everything that exists is an object.
>* Everything that happens is a function call."
>
> --John Chambers

## Functional programming

See [Advanced
R](https://adv-r.hadley.nz/fp.html#functional-programming-languages).

Functional languages have **first-class functions**, which are functions that
behave like any other data structure. This means that you can do many of the
things with a function that you can do with a vector:

* You can assigned them to variables,
* Store them in lists,
* Pass them as arguments to other functions,
* Create them inside functions, and
* Return them as the result of a function.

Secondly, many functional languages requires functions to be **pure**. A
function is pure if it satisfies two properties:

1. The output only depends on the inputs, i.e. if you call the function again
with the same inputs, you will get the same outputs.

2. The function has no side-effects, like changing the value of a global
variable, writing to disk, or displaying to the screen.

Strictly speaking, R is not a functional programming language because it does
not require you to write pure functions.

Generally a functional style is one where a big problem is decomposed into
smaller pieces, where each piece is solved with a function or combinations of
functions. When using a functional style, you strive to decompose components of
the problem into isolated functions that operate independently. Each function
taken by itself is simple and straightforward to understand; complexity is
handled by composing functions in various ways.

## Object-Oriented Programming

See [Advanced R](https://adv-r.hadley.nz/oo.html).

Object-Oriented Programming (OOP) is a little more challenging in R because
there are multiple OOP systems to choose from and there is disagreement about
the relative importance of the systems. Hadley Wickham believes that the most
important are S3, R6, and S4 in that order. S3 and S4 are provided by base R.
R6 is provided by the R6 package, and is similar to the Reference Classes, or
RC for short, from base R. S3 and S4 use generic function OOP which is rather
different from the encapsulated OOP used by most languages popular today.

The main reason to use OOP is **polymorphism**. Polymorphism means that a
developer can consider a function's interface separately from its
implementation, making it possible to use the same function form for different
types of input. This is closely related to the idea of **encapsulation**: the
user doesn't need to worry about details of an object because they are
encapsulated behind a standard interface.

Polymorphism is what allows summary() to produce different outputs for numeric
and factor variables.

```{r summary_polymorphism}
diamonds <- ggplot2::diamonds
summary(diamonds$carat)
summary(diamonds$cut)
```

OO systems call the type of an object its **class**, and an implementation for
a specific class is called a **method**; a class defines the object and methods
describe what the object can do. Classes are organised in a hierarchy so that
if a method does not exist for one class, its parent's method is used, and the
child is said to **inherit** behaviour. For example, an ordered factor inherits
from a regular factor, and a generalised linear model inherits from a linear
model. The process of finding the correct method given a class is called
**method dispatch**.

There are two main paradigms of OOP which differ in how methods and classes are
related. We can call these paradigms encapsulated and functional:

* In **encapsulated** OOP, methods belong to objects or classes, and method
calls typically look like `object.method(arg1, arg2)`. This is called
encapsulated because the object encapsulates both data (with fields) and
behaviour (with methods), and is the paradigm found in most popular
languages.

* In **functional** OOP, methods belong to **generic** functions, and method
calls look like ordinary function calls: `generic(object, arg2, arg3)`. This
is called functional because from the outside it looks like a regular
function call, and internally the components are also functions.

### OOP in R

Base R provides three OOP systems: S3, S4, and reference classes (RC):

* **S3** is R's first OOP system and is an informal implementation of
functional OOP. It relies on common conventions rather than ironclad
guarantees. This makes it easy to get started with, providing a low cost way
of solving many simple problems.

* **S4** is a formal and rigorous rewrite of S3. It requires more upfront work
than S3, but in return provides more guarantees and greater encapsulation. S4
is implemented in the base **methods** package, which is always installed
with R.

(There is no S1 or S2 because S3 and S4 were named according to the versions of
S that they accompanied. The first two versions of S did not have any OOP
framework.)

* **RC** implements encapsulated OO. RC objects are a special type of S4
objects that are also **mutable**, i.e., instead of using R's usual
copy-on-modify semantics, they can be modified in place. This makes them
harder to reason about, but allows them to solve problems that are difficult
to solve in the functional OOP style of S3 and S4.

A number of other OOP systems are provided by CRAN packages:

* **R6** implements encapsulated OOP like RC, but resolves some important
issues.

* **R.oo** provides some formalism on top of S3, and makes it possible to have
mutable S3 objects.

* **proto** implements another style of OOP based on the idea of
**prototypes**, which blur the distinctions between classes and instances of
classes (objects).

## Packages

If you look up the help page for `require` or `library`, you will be provided
with the "Loading/Attaching and Listing of Packages" help page. The description
is that `library` and `require` load and attach add-on packages. When you type
`library(tidyverse)` the `tidyverse` package is attached to the search path. It
works in the same way as when you add another directory to your `PATH`
environment variable in Linux.

We can use `base::search()` to get the names of environments attached to the
search path.

```{r base_search_before}
base::search()
```

The code below will attach the `tidyverse` and `modeldata` packages (installing
them first if they haven't been installed yet).

```{r load_package, message=FALSE, warning=FALSE}
.libPaths('/packages')
my_packages <- c('tidyverse', 'modeldata', 'bench')

using<-function(...) {
# https://stackoverflow.com/a/44660688
libs<-unlist(list(...))
req<-unlist(lapply(libs,require,character.only=TRUE))
need<-libs[req==FALSE]
if(length(need)>0){
install.packages(need)
lapply(need,require,character.only=TRUE)
}
}
using(my_packages)

theme_set(theme_bw())
```

If we run `base::search()` again, we will see the additional packages attached
to the search path.

```{r base_search_after}
base::search()
```

As for the difference between `library` and `require`?

* `library` returns an error by default if the package is not installed
* `require` returns a logical depending on whether a package is attached or not

This is the reason why `require` was used above to check whether a package was
installed or not.

The difference between attaching and loading is a bit technical and you can
read about it on [SO](https://stackoverflow.com/a/56538266).

### Installing a GitHub R package locally

First clone R package.

```{r clone_repo}
repo_url <- "https://github.com/davetang/importbio.git"
repo <- git2r::clone(url = repo_url, local_path = "/home/rstudio/importbio")
```

Install using `remotes::install_local()`.

```{r install_local}
remotes::install_local("/home/rstudio/importbio")
```

Check if you can load package.

```{r importbio}
library("importbio")
packageVersion("importbio")
```

## Vectors

R is a vectorised language and what this means is that you can perform
operations on vectors without having to iterate through each element. A vector
is simply "a single entity consisting of a collection of things" but these
items must all belong to the same class. If you try to create a vector with
different classes, the vector will be coerced in the following order: `logical`
< `integer` < `numeric` < `character`.

```{r vector_coercion}
my_char <- c(1, 3.14, TRUE, 'str')
class(my_char)

my_num <- c(1, 2, 3.14, 4)
class(my_num)
```

You can easily square each number in a vector by applying a exponential
function.

```{r my_int_squared}
my_int <- 1:6
my_int^2
```

You can operate on a vector using another vector too. If the right vector isn't
the same length as the left vector but is a multiple, R performs a procedure
called "recycling" that will re-use the right vector on the next set of values.

```{r vec_recycling}
my_int + c(1, 2)
```

Named vectors (see [Attributes](#attributes)) can be used as simple lookup
tables. Make sure that the names are unique and non-missing.

```{r lookup_table}
my_lookup <- c(
"HKG" = "Hong Kong",
"PNG" = "Papua New Guinea",
"AUS" = "Australia",
"JPN" = "Japan"
)

my_lookup["PNG"]
```

### Attributes

Attributes can be considered as name-value pairs that attach metadata to an
object. Individual attributes can be retrieved and modified with `attr()` or
retrieved en masse with `attributes()`, and set en masse with `structure()`.

```{r attributes}
a <- 1:3
attr(a, "x") <- "abcdef"
attr(a, "x")

attr(a, "y") <- 4:6
str(attributes(a))

a <- structure(
1:3,
x = "abcdef",
y = 4:6
)
str(attributes(a))
```

Attributes should generally be thought of as ephemeral. There are only two
attributes that are routinely preserved:

* **names**, a character vector giving each element a name.
* **dim**, short for dimensions, an integer vector, used to turn vectors into
matrices or arrays.

To preserve other attributes, you will need to create your own S3 class.

Vectors can be named in three ways.

1. When creating it.

```{r name_when_creating}
x <- c(a = 1, b = 2, c = 3)
```

2. By assigning a character vector to `names()`.

```{r name_using_names}
x <- 1:3
names(x) <- c('a', 'b', 'c')
```

3. Using `setNames()`.

```{r name_using_set_names}
x <- setNames(1:3, c('a', 'b', 'c'))
```

Avoid using `attr(x, 'names')` as it requires more typing and is less readable
than `names(x)`. Names can be removed using `x <- unname(x)` or `names(x) <-
NULL`.

Adding a `dim` attribute to a vector allows it to behave like a 2-dimensional
**matrix** or a multi-dimensional **array**. However, matrices and arrays are
primarily mathematical and statistical tools, not programming tools.

Matrices and arrays can be created using `matrix()` and `array()` or by using
the assignment form of `dim()`.

```{r matrix}
x <- matrix(1:6, nrow = 2, ncol = 3)
x
```
```{r array}
y <- array(1:12, c(2, 3, 2))
y
```
```{r dim_assignment}
z <- 1:6
dim(z) <- c(2, 3)
z
```

### S3 atomic vectors

One of the most important vector attributes is `class`, which underlies the S3
object system. Having a class attribute turns an object into an **S3 object**,
which means it will behave differently from a regular vector when passed to a
**generic** function. Every S3 object is built on top of a base type, and often
stores additional information in other attributes.

There are four important S3 vectors used in base R:

1. Categorical data, where values come from a fixed set of levels recorded in
**factor** vectors.

2. Dates (with day resolution), which are recorded in **Date** vectors.

3. Date-times (with second or sub-second resolution), which are stored in
**POSIXct** vectors.

4. Durations, which are stored in **difftime** vectors.

#### Factors

A factor is a vector that can contain only predefined values. It is used to
store categorical data. Factors are built on top of an integer vector with two
attributes: a `class`, "factor", which makes it behave differently from regular
integer vectors, and `levels`, which defines the set of allowed values.

```{r factor_attributes}
x <- factor(c("a", "b", "b", "a"))
x

typeof(x)
attributes(x)
```

Factors are useful when you know the set of possible values even when they are
not present in the data. In contrast to a character vector, when you tabulate a
factor, you will get counts of all categories.

```{r factor_table}
sex_char <- c("m", "m", "m")
sex_factor <- factor(sex_char, levels = c("m", "f"))

table(sex_char)
table(sex_factor)
```

Use `ordered()` to create **ordered** factors, where the order of the levels is
meaningful.

#### Dates

Date vectors are built on top of double vectors; they have class "Date" and no
other attributes.

```{r date_attributes}
today <- Sys.Date()

typeof(today)
attributes(today)
```

The value of the double, which can be retrieved by stripping the class,
represents the number of days since 1970-01-01.

```{r date_unclass}
date <- as.Date("1970-02-01")
unclass(date)
```

#### Date-times

Base R provides two ways of storing date-time information:

1. `POSIXct` - POSIX stands for Portable Operating System Interface, which is a
family of cross-platform standards. The "ct" here stands for calendar time.
2. `POSIXlt` - The "lt" here stands for local time.

`POSIXct` is built on top of an atomic vector and is most appropriate for use
in data frames. `POSIXct` vectors are are built on top of double vectors, where
the value represents the number of seconds since 1970-01-01.

```{r posixct_attributes}
now_ct <- as.POSIXct("2018-08-01 22:00", tz = "UTC")
now_ct

typeof(now_ct)
attributes(now_ct)
```

The `tzone` attribute controls only how the date-time is formatted; it does not
control the instant of time represented by the vector. Note that time is not
printed if it is midnight (to prevent werewolves).

```{r posixct_structure}
structure(now_ct, tzone = "Asia/Tokyo")
structure(now_ct, tzone = "America/New_York")
structure(now_ct, tzone = "Australia/Lord_Howe")
structure(now_ct, tzone = "Europe/Paris")
```

#### Durations

Durations, which represent the amount of time between pairs of dates or
date-times, are stored in `difftimes`. `difftimes` are built on top of doubles,
and have a `units` attribute that determines how the integer should be
interpreted.

```{r difftime_attributes}
one_week_1 <- as.difftime(1, units = "weeks")
one_week_1

typeof(one_week_1)
attributes(one_week_1)

one_week_2 <- as.difftime(7, units = "days")
one_week_2

typeof(one_week_2)
attributes(one_week_2)
```

## Lists

Unlike vectors, lists can be used to store heterogeneous things.

```{r my_list}
my_list <- list(
my_func = function(x){x^2},
my_df = data.frame(a = 1:3),
my_vec = 1:6
)

my_list$my_func(my_list$my_vec)
```

`lapply` can be used to apply a function to each item in a list and will return
a list.

```{r lapply}
my_list <- list(
a = 1:3,
b = 4:10,
c = 11:20
)

lapply(my_list, sum)
```

Another handy function is the `do.call` function, which constructs and executes
a function call on a list. The example below is useful for converting a list
into a matrix.

```{r do_call}
my_list <- list(
a = 1:3,
b = 4:6,
c = 7:9
)

# returns a matrix
do.call(what = rbind, args = my_list)
```

Use `utils::stack` to create a data frame from a named list; the vectors are
concatenated and the names are used to create factors.

```{r utils_stack}
utils::stack(my_list)
```

There is also the [purrr::map](https://purrr.tidyverse.org/reference/map.html)
function, that is
[similar](https://jennybc.github.io/purrr-tutorial/ls01_map-name-position-shortcuts.html)
to the apply functions in base R, but you explicitly specify the output type.
The `map_lgl` function will return logicals, i.e. Booleans.

```{r map_lgl}
map_lgl(.x = 1:10, .f = function(x) x > 5)
```

## Functions

Notes from [Advanced R](https://adv-r.hadley.nz/functions.html).

Two important ideas about functions in R need to be understood:

1. Functions can be broken down into three components: arguments, body, and
environment

2. Functions are objects, just as vectors are objects

R functions are objects in their own right, a language property often called
"first-class functions".

### Function components

A function has three parts:

1. The `formals()`, which are the list of arguments that control how you call
the function.

2. The `body()`, which is the code inside the function.

3. The `environment()`, which is the data structure that determines the
namespace.

The environment is based on where you define the function.

```{r f02}
f02 <- function(x, y){
# comment
x + y
}

formals(f02)
body(f02)
environment(f02)
```

Functions contain any number of additional `attributes()` as with all objects
in R. One attribute used by base R is `srcref`, which points to the source code
used to create the function.

```{r f02_srcref}
attr(f02, "srcref")
```

However primitive functions do not have the three components (they return
`NULL`) and call C code directly.

```{r sum}
sum
```

Primitive functions have either type `builtin` or type `special` but have class
`function`.

```{r primitive_type}
class(f02)
class(sum)
class(`[`)

typeof(f02)
typeof(sum)
typeof(`[`)
```

### Lexical scoping

Scoping is the act of finding the value associated with a name. R uses lexical
scoping, which means it looks up the values of names based on how a function is
defined and not by how it is called. Lexical in this context means that the
scoping rules use a parse-time, rather than a run-time, structure. R's lexical
scoping follows four primary rules:

1. Name masking
2. Functions versus variables
3. A fresh start - each time you invoke a function, it starts fresh
4. Dynamic lookup

The basic principle of lexical scoping is that names defined inside a function
mask names defined outside a function. If a name is not defined inside a
function, R looks one level up (all the way up to the global environment).
Lexical scoping determines where, but not when to look for values. _R looks for
values when the function is run and not when the function is created_.

### Lazy evaluation

Function arguments are only evaluated if accessed, i.e. lazily evaluated. This
is a nice feature because it allows the inclusion of potentially expensive
computations in function arguments that will only be evaluated if necessary.

Lazy evaluation is powered by a data structure called a **promise**, which has
three components:

1. An expression, like `x + y`, which gives rise to the delayed computation.

2. An environment where the expression should be evaluated, i.e. the
environment where the function is called.

3. A value, which is computed and cached the first time a promise is accessed
when the expression is evaluated in the specified environment.

### dot-dot-dot

Functions can have a special argument `...`, which is pronounced dot-dot-dot.
In other programming languages, this type of argument is often called `varargs`
(variable arguments) and a function that uses it is said to be variadic.

```{r dot_dot_dot_eg}
i01 <- function(y, z) {
list(y = y, z = z)
}

i02 <- function(x, ...) {
i01(...)
}

str(i02(x = 1, y = 2, z = 3))
```

The `...` is useful when a function takes a function as an argument: you can
pass additional arguments to that function. The downside of using `...` is that
when arguments are used to pass arguments to another function, it is sometimes
not clear to the user. Also a misspelled argument will not raise an error and
this makes it easy for typos to go unnoticed.

### Exiting a function

Most functions exit in one of two ways:

1. They either return a value, indicating success. There are two ways a
function can return a value:
1. Implicitly, where the last evaluated expression is returned.
2. Explicitly by using `return()`.
2. They throw an error, indicating failure.

Most functions return visibly, meaning that the result is printed when
evaluated in an interactive context. The automatic printing can be prevented by
using `invisible()` but the return value still exists.

If a function cannot complete its assigned task, it should throw an error with
`stop()`, which immediately terminates the execution of the function.
`on.exit()` can be used to run some code regardless of how a function exits;
always set `add = TRUE` when using `on.exit()` because if you don't each call
to `on.exit()` will overwrite the previous exit handler.

### Function forms

Function calls come in four varieties:

1. **prefix**: the function name comes before its arguments, e.g. `sum(1:5)`.
These constitute the majority of function calls in R

2. **infix**: the function name comes in between its arguments, e.g. `x + y`.
Infix forms are used for many mathematical operators and for user-defined
functions that begin and end with `%`.

3. **replacement**: functions that replace values by assignment, e.g.
`names(my_df) <- c('a', 'b', 'c')`.

4. **special**: functions like `[[`, `if`, and `for` that do not have a
consistent structure.

All functions can be written in prefix form.

```{r function_forms}
x <- 1900
y <- 84
x + y
`+`(x, y)

df <- data.frame(a = 1, b = 2, c= 3)
`names<-`(df, c("x", "y", "z"))

for(i in 1:3) print(i)
`for`(i, 1:3, print(i))
```

## Objects

[Base types](https://adv-r.hadley.nz/base-types.html).

## Modeling example

Notes from [Tidy Modeling with R](https://www.tmwr.org/base-r.html).

Load `crickets` data set that contains the relationship between the ambient
temperature and the rate of cricket chirps per minute for two species.

```{r crickets}
data(crickets, package = "modeldata")
head(crickets)
```

Plot.

```{r crickets_plot}
ggplot(
crickets,
aes(x = temp, y = rate, color = species, pch = species, lty = species)
) +
geom_point(size = 2) +
geom_smooth(method = lm, se = FALSE, alpha = 0.5) +
scale_color_brewer(palette = "Paired") +
labs(x = "Temperature (C)", y = "Chirp Rate (per minute)")
```

For an inferential model, we might have specified the following null hypotheses
prior to seeing the data:

* Temperature has no effect on the chirp rate
* There are no differences between the species' chirp rate.

The `lm()` function is commonly used to fit an ordinary linear model. Arguments
to this function are a model formula and the data frame that contains the data.
The formula is _symbolic_; the simple formula below specifies that the chirp
rate is the outcome and the temperature is the predictor.

```{r rate_temp}
rate ~ temp
```

If the time of day was also recorded in a column called `time`, the following
formula does not add the time and temperature values together but the formula
symbolically represents that temperature and time should be added as separate
_main effects_ to the model. A main effect is a model term that contains a
single predictor variable.

```{r rate_temp_time}
rate ~ temp + time
```

We can add the species to the model in the same way but since species is not a
quantitative variable, an _indicator variable_ (also known as a dummy variable)
is used in place of the original qualitative value. The model formula will
automatically encode `species` as a numeric by adding a new column that has a
value of zero and one for the two species.

```{r rate_temp_species}
rate ~ temp + species
```

The model formula `rate ~ temp + species` creates a model with different
y-intercepts for each species; the slopes of the regression lines could be
different for each species as well. To accommodate this structure, an
interaction term can be added to the model. This can be specified in a few
different ways and the most basic uses the colon.

```{r interaction_term}
rate ~ temp + species + temp:species

# A shortcut can be used to expand all interactions containing
# interactions with two variables:
rate ~ (temp + species)^2

# Another shortcut to expand factors to include all possible
# interactions (equivalent for this example):
rate ~ temp * species
```

The model formula also has other nice features:

* _In-line_ functions can be used, e.g. to use the natural log of the
temperature, we can use the formula `rate ~ log(temp)`

* R has many functions that are useful inside formulas, e.g. `poly(x, 3)` adds
linear, quadratic, and cubic terms for `x` to the model as main effects.

* The period shortcut is available for data sets with many predictors. The
period represents main effects for all of the columns that are not on the
left-hand side of the tilde.

Use a two-way interaction model.

```{r interaction_fit}
interaction_fit <- lm(rate ~ (temp + species)^2, data = crickets)

interaction_fit
```

Now we will recompute the model without the interaction term to assess whether
the interaction term is necessary using the `anova()` method.

```{r main_effect_fit}
main_effect_fit <- lm(rate ~ temp + species, data = crickets)

anova(main_effect_fit, interaction_fit)
```

This statistical test generates a p-value of 0.25, which implies that there is
a lack of evidence against the null hypothesis that the interaction term is not
needed by the model.

The `summary()` method can be used to inspect the coefficients, standard
errors, and p-values of each model term.

```{r summary_main_effect}
summary(main_effect_fit)
```

The chirp rate for each species increases by 3.6 chirps as the temperature
increases by a single degree. This term shows strong statistical sifnificance
as evidenced by the p-value. The species term has a value of -10.07, which
indicates that, across all temperature values, _O. niveus_ has a chirp rate
that is about 10 fewer chirps per minute than _O. exclamationis_. The species
effect is also associated with a very small p-value.

The only issue in this analysis is the intercept value that indicates that at 0
degrees Celsius, there are negative chirps per minute for both species. The
data only goes as low as 17.2 degrees Celsius and therefore the conclusions
should be limited to the observed temperature range.

If we needed to estimate the chirp rate at a temperature that was not observed
in the experiment, we could use the `predict()` method, which takes the model
object and a data frame of new values for prediction.

```{r predict_main_effect}
new_values <- data.frame(species = "O. exclamationis", temp = 15:20)
predict(main_effect_fit, new_values)
```

### R formula

The R model formula is used by many modeling packages and it usually serves
multiple purposes:

* The formula defines the columns that the model uses.

* The standard R machinery uses the formula to encode the columns into an
appropriate format, e.g. create indicator variables.

* The roles of the columns are defined by the formula.

For example, the following formula indicates that there are two predictors and
the model should contain their main effects and the two-way interactions.

```{r r_formula_eg}
rate ~ (temp + species)^2
```

## Exceptions

Notes from [Debugging, condition handling, and defensive
programming](http://adv-r.had.co.nz/Exceptions-Debugging.html).

`try()` gives you the ability to continue execution even when an error occurs.

```{r try_log}
try_log <- function(x) {
try(log(x))
10
}

try_log('a')
```

`tryCatch()` is a general tool for handling conditions: in addition to errors,
you can take different actions for warnings, messages, and interrupts.

```{r try_catch}
show_condition <- function(code) {
tryCatch(
code,
error = function(x) "error",
warning = function(x) "warning",
message = function(x) "message"
)
}

show_condition(log('a'))
show_condition(stop("!"))
show_condition(stopifnot(2 + 2 == 5))
show_condition(stopifnot(4 + 1980 == 1984))
show_condition(warning("?!"))
show_condition(message("?"))
show_condition(10)
```

## Measuring performance

Notes from [Measuring performance](https://adv-r.hadley.nz/perf-measure.html).

### Microbenchmarking

A [microbenchmark](https://adv-r.hadley.nz/perf-measure.html#microbenchmarking)
is a measurement of the performance of a very small piece of code, something
that might take milliseconds, microseconds, or nanoseconds to run.

```{r bench_mark}
set.seed(1984)
x <- runif(100)
(lb <- bench::mark(
sqrt(x),
x ^ 0.5
))
```

`for` versus `map_int` versus `sapply`.

```{r map_vs_sapply}
my_num <- 1:10000
for_loop <- function(n){
v <- vector(mode = "integer")
for(i in n){
v[i] <- i^2
}
return(v)
}

(ms <- bench::mark(
for_loop(my_num),
map_int(my_num, function(x) x^2),
sapply(my_num, function(x) x^2)
))
```

* `min` - The minimum execution time.
* `median` - The sample median of execution time.
* `itr/sec` - The estimated number of executions performed per second.
* `mem_alloc` - Total amount of memory allocated by R while running the
expression.
* `gc/sec` - The number of garbage collections per second.
* `n_itr` - Total number of iterations after filtering garbage collections (if
filter_gc == TRUE).
* `n_gc` - Total number of garbage collections performed over all iterations.
* `total_time` - The total time to perform the benchmarks.
* `result` - A list column of the object(s) returned by the evaluated
expression(s).
* `memory` - A list column with results from Rprofmem().
* `time` - A list column of bench_time vectors for each evaluated expression.
* `gc` - A list column with tibbles containing the level of garbage collection
(0-2, columns) for each iteration (rows).

```{r plot_ms}
plot(ms)
```

Use `system.time` to measure CPU time.

```{r with_replicate}
system.time(
x <- replicate(1000, for_loop(my_num))
)
system.time(
y <- replicate(1000, map_int(my_num, function(x) x^2))
)
system.time(
z <- replicate(1000, sapply(my_num, function(x) x^2))
)

all.equal(x, y)
all.equal(x, z)
```

## General

Assign a data frame column `NULL` to delete it.

```{r exorcise}
my_df <- data.frame(
a = 1:3,
b = 4:6,
c = c(6, 6, 6)
)

my_df$c <- NULL

my_df
```

Include an additional directory (`/packages`) to look for and install R
packages.

```{r lib_paths}
.libPaths('/packages')
```

Use `identical` to check whether two objects are exactly equal. Most times it
should suffice to just use `all.equal`.

```{r identical}
first <- 1:5
second <- c(1, 2, 3, 4, 5)

# this is false because first is a vector of integers
# and second is a vector of numerics
identical(first, second)

all.equal(first, second)
```

Set `scipen` (default is 0), which is a penalty to be applied when deciding to
print numeric values in fixed or exponential notation, to determine when to
print in exponential notation. (`.Options` contains all other options
settings.)

```{r scipen}
options(scipen=0)
10e4
options(scipen=1)
10e4
```

Use `system.time()` to measure how long a block of codes takes to execute.

```{r system_time}
system.time(
for (i in 1:100000000){}
)
```

The `with` function evaluates an expression with data.

```{r with_example}
my_df <- data.frame(
a=1:10,
b=11:20,
c=21:30
)
wanted <- with(my_df, a > 5 & c > 27)
my_df[wanted, ]
```

The `which` function is a very useful for returning indicates that are `TRUE`
and works with matrices.

```{r which}
my_mat <- matrix(1:9, nrow=3, byrow = TRUE)

# note that the results are ordered by col
which(my_mat > 5, arr.ind = TRUE)
```

The `match` function can be used with vectors to return the indexes of matching
items and an `NA` is no match was found.

```{r match_xy}
x <- c('b', 'c', 'a', 'd')
y <- letters[1:3]

match(x, y)
```

You can use `match` to subset and order a data frame.

```{r match_subset}
my_df <- data.frame(
a = 1:10,
b = letters[1:10]
)

x <- c(2, 10, 5, 6)
x_match <- match(x, my_df$a)

my_df[x_match, ]
```

Use the `complete.cases` function to list observations that have no missing
values, i.e. NA values.

```{r complete_cases}
my_df <- data.frame(
a = 1:3,
b = c(4, NA, 6),
c = 7:9
)

complete.cases(my_df)
```

Use `commandArgs` to accept command line arguments without having to install an
external package like `optparse`.

```{r command_args}
args <- commandArgs(TRUE)
```

## Useful plots

Visualise a table.

```{r mosaicplot}
mosaicplot(table(ChickWeight$Time, ChickWeight$Diet), main = "Timepoint versus diet")
```

### Getting help

Get help on a class.

```{r numeric_class_help, eval=FALSE}
?"numeric-class"
```

Get information on a package.

```{r stringr_help, eval=FALSE}
library(help="stringr")
```

Finding out what methods are available for a class.

```{r lm_methods}
methods(class="lm")
```

Search the help pages.

```{r help_search, eval=FALSE}
help.search("cross tabulate")
```

Search for function containing keyword.

```{r apropos}
apropos("mutate")
```

## Hacks

There are probably better ways to do the following, which is why I have
labelled them as hacks, so follow at your own peril.

### Makevars

You can set/modify preprocessor options (for example include paths and
definitions) for C/C++ files using a
[Makevars](https://cran.r-project.org/doc/manuals/r-devel/R-exts.html#Using-Makevars)
file. I had a problem [installing the ranger
package](https://github.com/imbs-hl/ranger/issues/669) because R (my version is
4.3.0) sets the default C++
[standard](https://gcc.gnu.org/projects/cxx-status.html) to C++11 and `ranger`
requires C++14. I wrote this Bash snippet (I use a Bash script to install my
packages) to overcome this problem.

```bash
makevars=${HOME}/.R/Makevars
if [[ ! -e ${makevars} ]]; then
mkdir -p ${HOME}/.R
touch ${makevars} && echo "CXX = g++ -std=gnu++14" >> ${makevars}
install_ranger
rm ${makevars}
else
if [[ $(grep -c "^CXX = g++ -std=gnu++14\b") == 0 ]]; then
cp ${makevars} ${HOME}/.R/Makevars_bk
echo "CXX = g++ -std=gnu++14" >> ${makevars}
install_ranger
mv -f ${HOME}/.R/Makevars_bk ${makevars}
fi
fi
```

### Library paths

Some R packages require libraries not included in the default library path. Use
`Sys.setenv` to include additional library paths. First we'll get the default
path.

```{r get_ld_library_path}
Sys.getenv("LD_LIBRARY_PATH")
```

Now we will add `/usr/include` to `LD_LIBRARY_PATH` and get the updated library
path.

```{r set_ld_library_path}
new_path <- paste0(Sys.getenv("LD_LIBRARY_PATH"), ":", "/usr/include")
Sys.setenv("LD_LIBRARY_PATH" = new_path)
Sys.getenv("LD_LIBRARY_PATH")
```

### Variables and objects

Sometimes you want to create objects with values stored in variables. This can
be achieved using `assign()`.

```{r assign}
my_varname <- 'one_to_ten'
my_values <- 1:10
assign(my_varname, my_values)

one_to_ten
```

Likewise, sometimes you want to store an object's name into a variable. This
can be achieved using `substitute` (returns the parse tree for an unevaluated
expression) and `deparse` (turns unevaluated expressions into character
strings).

```{r obj_to_string}
obj_to_string <- function(x){
deparse(substitute(x))
}

my_obj_name <- 1984
my_var <- obj_to_string(my_obj_name)

my_var
```

To evaluating a string, use `parse` (returns an unevaluated expression) with a
`text` argument specifying the character vector and `eval` (evaluates an
unevaluated expression).

```{r eval}
eval(parse(text = my_var))
```

## Useful tips

A lot of R books are free to read; check out the [bookdown](https://bookdown.org/) page to see some of the best R books.

R has four special values:

1. `NA` - used for representing missing data.
2. `NULL` - represents not having a value and unlike `NA`, it is its own object
and cannot be used in a vector.
3. `Inf`/`-Inf` - used for representing numbers too big for R (see below).
4. `NaN` - used for storing results that are not a number.

Check the `.Machine` variable to find out the numerical characteristics of the
machine R is running on, such as the largest double or integer and the
machine's precision.

```{r machine}
noquote(unlist(format(.Machine)))
```

When asking for help online, it is useful to include a minimal example that
includes some data specific to your question. To easily convert data into code,
use the `dput()` function. The example below is just for illustrative purposes
since the `women` dataset is included with R, so you would not need to generate
code for it.

```{r dput}
dput(women)
```

Show all the functions of a package.

```{r stringr_functions}
ls("package:stringr")
```

Search is useful to list the search path, i.e. where R will look, for R objects such as functions.

```{r search_path}
search()
```

Save all functions in the global environment into a file (that you can source later)!

```{r dump_functions}
dump(list = lsf.str(), file = "functions.R")
unlink('functions.R')
```

## Session info

This README was generated by running `rmd_to_md.sh` with `readme.Rmd`.

```{r time, echo=FALSE}
Sys.time()
```

Session info.

```{r session_info, echo=FALSE}
sessionInfo()
```