https://github.com/davetang/learning_r

R notes
https://github.com/davetang/learning_r
Last synced: 6 months ago
JSON representation
R notes
Host: GitHub
URL: https://github.com/davetang/learning_r
Owner: davetang
License: mit
Created: 2022-12-27T00:05:36.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2024-11-03T13:39:36.000Z (7 months ago)
Last Synced: 2024-12-01T02:49:18.424Z (6 months ago)
Language: R
Size: 423 KB
Stars: 6
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.Rmd
- License: LICENSE
Awesome Lists containing this project

jimsghstars - davetang/learning_r - R notes (R)
README

        ---

title: "Learning R"

output:

   github_document:

      toc: true

---

```{r setup, include=FALSE}

knitr::opts_chunk$set(cache = FALSE)

knitr::opts_chunk$set(echo = TRUE)

knitr::opts_chunk$set(fig.path = "img/")

```

```{r install_packages, include=FALSE}

my_packages <- c("ggbeeswarm", "remotes", "git2r")

sapply(my_packages, function(x){

  if(!require(x, character.only = TRUE)){

    install.packages(x, quiet = TRUE)

  }

})

```

# Introduction

The three core features of R are object-orientation, vectorisation, and its

functional programming style.

>"To understand computations in R, two slogans are helpful:

>

>* Everything that exists is an object.

>* Everything that happens is a function call."

>

> --John Chambers

## Functional programming

See [Advanced

R](https://adv-r.hadley.nz/fp.html#functional-programming-languages).

Functional languages have **first-class functions**, which are functions that

behave like any other data structure. This means that you can do many of the

things with a function that you can do with a vector:

* You can assigned them to variables,

* Store them in lists,

* Pass them as arguments to other functions,

* Create them inside functions, and

* Return them as the result of a function.

Secondly, many functional languages requires functions to be **pure**. A

function is pure if it satisfies two properties:

1. The output only depends on the inputs, i.e. if you call the function again

   with the same inputs, you will get the same outputs.

2. The function has no side-effects, like changing the value of a global

   variable, writing to disk, or displaying to the screen.

Strictly speaking, R is not a functional programming language because it does

not require you to write pure functions.

Generally a functional style is one where a big problem is decomposed into

smaller pieces, where each piece is solved with a function or combinations of

functions. When using a functional style, you strive to decompose components of

the problem into isolated functions that operate independently. Each function

taken by itself is simple and straightforward to understand; complexity is

handled by composing functions in various ways.

## Object-Oriented Programming

See [Advanced R](https://adv-r.hadley.nz/oo.html).

Object-Oriented Programming (OOP) is a little more challenging in R because

there are multiple OOP systems to choose from and there is disagreement about

the relative importance of the systems. Hadley Wickham believes that the most

important are S3, R6, and S4 in that order. S3 and S4 are provided by base R.

R6 is provided by the R6 package, and is similar to the Reference Classes, or

RC for short, from base R. S3 and S4 use generic function OOP which is rather

different from the encapsulated OOP used by most languages popular today.

The main reason to use OOP is **polymorphism**. Polymorphism means that a

developer can consider a function's interface separately from its

implementation, making it possible to use the same function form for different

types of input. This is closely related to the idea of **encapsulation**: the

user doesn't need to worry about details of an object because they are

encapsulated behind a standard interface.

Polymorphism is what allows summary() to produce different outputs for numeric

and factor variables.

```{r summary_polymorphism}

diamonds <- ggplot2::diamonds

summary(diamonds$carat)

summary(diamonds$cut)

```

OO systems call the type of an object its **class**, and an implementation for

a specific class is called a **method**; a class defines the object and methods

describe what the object can do. Classes are organised in a hierarchy so that

if a method does not exist for one class, its parent's method is used, and the

child is said to **inherit** behaviour. For example, an ordered factor inherits

from a regular factor, and a generalised linear model inherits from a linear

model. The process of finding the correct method given a class is called

**method dispatch**.

There are two main paradigms of OOP which differ in how methods and classes are

related. We can call these paradigms encapsulated and functional:

* In **encapsulated** OOP, methods belong to objects or classes, and method

  calls typically look like `object.method(arg1, arg2)`. This is called

  encapsulated because the object encapsulates both data (with fields) and

  behaviour (with methods), and is the paradigm found in most popular

  languages.

* In **functional** OOP, methods belong to **generic** functions, and method

  calls look like ordinary function calls: `generic(object, arg2, arg3)`. This

  is called functional because from the outside it looks like a regular

  function call, and internally the components are also functions.

### OOP in R

Base R provides three OOP systems: S3, S4, and reference classes (RC):

* **S3** is R's first OOP system and is an informal implementation of

  functional OOP. It relies on common conventions rather than ironclad

  guarantees. This makes it easy to get started with, providing a low cost way

  of solving many simple problems.

* **S4** is a formal and rigorous rewrite of S3. It requires more upfront work

  than S3, but in return provides more guarantees and greater encapsulation. S4

  is implemented in the base **methods** package, which is always installed

  with R.

(There is no S1 or S2 because S3 and S4 were named according to the versions of

S that they accompanied. The first two versions of S did not have any OOP

framework.)

* **RC** implements encapsulated OO. RC objects are a special type of S4

  objects that are also **mutable**, i.e., instead of using R's usual

  copy-on-modify semantics, they can be modified in place. This makes them

  harder to reason about, but allows them to solve problems that are difficult

  to solve in the functional OOP style of S3 and S4.

A number of other OOP systems are provided by CRAN packages:

* **R6** implements encapsulated OOP like RC, but resolves some important

  issues.

* **R.oo** provides some formalism on top of S3, and makes it possible to have

  mutable S3 objects.

* **proto** implements another style of OOP based on the idea of

  **prototypes**, which blur the distinctions between classes and instances of

  classes (objects).

## Packages

If you look up the help page for `require` or `library`, you will be provided

with the "Loading/Attaching and Listing of Packages" help page. The description

is that `library` and `require` load and attach add-on packages. When you type

`library(tidyverse)` the `tidyverse` package is attached to the search path. It

works in the same way as when you add another directory to your `PATH`

environment variable in Linux.

We can use `base::search()` to get the names of environments attached to the

search path.

```{r base_search_before}

base::search()

```

The code below will attach the `tidyverse` and `modeldata` packages (installing

them first if they haven't been installed yet).

```{r load_package, message=FALSE, warning=FALSE}

.libPaths('/packages')

my_packages <- c('tidyverse', 'modeldata', 'bench')

using<-function(...) {

   # https://stackoverflow.com/a/44660688

	libs<-unlist(list(...))

	req<-unlist(lapply(libs,require,character.only=TRUE))

	need<-libs[req==FALSE]

	if(length(need)>0){

		install.packages(need)

		lapply(need,require,character.only=TRUE)

	}

}

using(my_packages)

theme_set(theme_bw())

```

If we run `base::search()` again, we will see the additional packages attached

to the search path.

```{r base_search_after}

base::search()

```

As for the difference between `library` and `require`?

* `library` returns an error by default if the package is not installed

* `require` returns a logical depending on whether a package is attached or not

This is the reason why `require` was used above to check whether a package was

installed or not.

The difference between attaching and loading is a bit technical and you can

read about it on [SO](https://stackoverflow.com/a/56538266).

### Installing a GitHub R package locally

First clone R package.

```{r clone_repo}

repo_url <- "https://github.com/davetang/importbio.git"

repo <- git2r::clone(url = repo_url, local_path = "/home/rstudio/importbio")

```

Install using `remotes::install_local()`.

```{r install_local}

remotes::install_local("/home/rstudio/importbio")

```

Check if you can load package.

```{r importbio}

library("importbio")

packageVersion("importbio")

```

## Vectors

R is a vectorised language and what this means is that you can perform

operations on vectors without having to iterate through each element. A vector

is simply "a single entity consisting of a collection of things" but these

items must all belong to the same class. If you try to create a vector with

different classes, the vector will be coerced in the following order: `logical`

< `integer` < `numeric` < `character`.

```{r vector_coercion}

my_char <- c(1, 3.14, TRUE, 'str')

class(my_char)

my_num <- c(1, 2, 3.14, 4)

class(my_num)

```

You can easily square each number in a vector by applying a exponential

function.

```{r my_int_squared}

my_int <- 1:6

my_int^2

```

You can operate on a vector using another vector too. If the right vector isn't

the same length as the left vector but is a multiple, R performs a procedure

called "recycling" that will re-use the right vector on the next set of values.

```{r vec_recycling}

my_int + c(1, 2)

```

Named vectors (see [Attributes](#attributes)) can be used as simple lookup

tables. Make sure that the names are unique and non-missing.

```{r lookup_table}

my_lookup <- c(

  "HKG" = "Hong Kong",

  "PNG" = "Papua New Guinea",

  "AUS" = "Australia",

  "JPN" = "Japan"

)

my_lookup["PNG"]

```

### Attributes

Attributes can be considered as name-value pairs that attach metadata to an

object. Individual attributes can be retrieved and modified with `attr()` or

retrieved en masse with `attributes()`, and set en masse with `structure()`.

```{r attributes}

a <- 1:3

attr(a, "x") <- "abcdef"

attr(a, "x")

attr(a, "y") <- 4:6

str(attributes(a))

a <- structure(

  1:3,

  x = "abcdef",

  y = 4:6

)

str(attributes(a))

```

Attributes should generally be thought of as ephemeral. There are only two

attributes that are routinely preserved:

* **names**, a character vector giving each element a name.

* **dim**, short for dimensions, an integer vector, used to turn vectors into

  matrices or arrays.

To preserve other attributes, you will need to create your own S3 class.

Vectors can be named in three ways.

1. When creating it.

```{r name_when_creating}

x <- c(a = 1, b = 2, c = 3)

```

2. By assigning a character vector to `names()`.

```{r name_using_names}

x <- 1:3

names(x) <- c('a', 'b', 'c')

```

3. Using `setNames()`.

```{r name_using_set_names}

x <- setNames(1:3, c('a', 'b', 'c'))

```

Avoid using `attr(x, 'names')` as it requires more typing and is less readable

than `names(x)`. Names can be removed using `x <- unname(x)` or `names(x) <-

NULL`.

Adding a `dim` attribute to a vector allows it to behave like a 2-dimensional

**matrix** or a multi-dimensional **array**. However, matrices and arrays are

primarily mathematical and statistical tools, not programming tools.

Matrices and arrays can be created using `matrix()` and `array()` or by using

the assignment form of `dim()`.

```{r matrix}

x <- matrix(1:6, nrow = 2, ncol = 3)

x

```

```{r array}

y <- array(1:12, c(2, 3, 2))

y

```

```{r dim_assignment}

z <- 1:6

dim(z) <- c(2, 3)

z

```

### S3 atomic vectors

One of the most important vector attributes is `class`, which underlies the S3

object system. Having a class attribute turns an object into an **S3 object**,

which means it will behave differently from a regular vector when passed to a

**generic** function. Every S3 object is built on top of a base type, and often

stores additional information in other attributes.

There are four important S3 vectors used in base R:

1. Categorical data, where values come from a fixed set of levels recorded in

   **factor** vectors.

2. Dates (with day resolution), which are recorded in **Date** vectors.

3. Date-times (with second or sub-second resolution), which are stored in

   **POSIXct** vectors.

4. Durations, which are stored in **difftime** vectors.

#### Factors

A factor is a vector that can contain only predefined values. It is used to

store categorical data. Factors are built on top of an integer vector with two

attributes: a `class`, "factor", which makes it behave differently from regular

integer vectors, and `levels`, which defines the set of allowed values.

```{r factor_attributes}

x <- factor(c("a", "b", "b", "a"))

x

typeof(x)

attributes(x)

```

Factors are useful when you know the set of possible values even when they are

not present in the data. In contrast to a character vector, when you tabulate a

factor, you will get counts of all categories.

```{r factor_table}

sex_char <- c("m", "m", "m")

sex_factor <- factor(sex_char, levels = c("m", "f"))

table(sex_char)

table(sex_factor)

```

Use `ordered()` to create **ordered** factors, where the order of the levels is

meaningful.

#### Dates

Date vectors are built on top of double vectors; they have class "Date" and no

other attributes.

```{r date_attributes}

today <- Sys.Date()

typeof(today)

attributes(today)

```

The value of the double, which can be retrieved by stripping the class,

represents the number of days since 1970-01-01.

```{r date_unclass}

date <- as.Date("1970-02-01")

unclass(date)

```

#### Date-times

Base R provides two ways of storing date-time information:

1. `POSIXct` - POSIX stands for Portable Operating System Interface, which is a

   family of cross-platform standards. The "ct" here stands for calendar time.

2. `POSIXlt` - The "lt" here stands for local time.

`POSIXct` is built on top of an atomic vector and is most appropriate for use

in data frames. `POSIXct` vectors are are built on top of double vectors, where

the value represents the number of seconds since 1970-01-01.

```{r posixct_attributes}

now_ct <- as.POSIXct("2018-08-01 22:00", tz = "UTC")

now_ct

typeof(now_ct)

attributes(now_ct)

```

The `tzone` attribute controls only how the date-time is formatted; it does not

control the instant of time represented by the vector. Note that time is not

printed if it is midnight (to prevent werewolves).

```{r posixct_structure}

structure(now_ct, tzone = "Asia/Tokyo")

structure(now_ct, tzone = "America/New_York")

structure(now_ct, tzone = "Australia/Lord_Howe")

structure(now_ct, tzone = "Europe/Paris")

```

#### Durations

Durations, which represent the amount of time between pairs of dates or

date-times, are stored in `difftimes`. `difftimes` are built on top of doubles,

and have a `units` attribute that determines how the integer should be

interpreted.

```{r difftime_attributes}

one_week_1 <- as.difftime(1, units = "weeks")

one_week_1

typeof(one_week_1)

attributes(one_week_1)

one_week_2 <- as.difftime(7, units = "days")

one_week_2

typeof(one_week_2)

attributes(one_week_2)

```

## Lists

Unlike vectors, lists can be used to store heterogeneous things.

```{r my_list}

my_list <- list(

  my_func = function(x){x^2},

  my_df = data.frame(a = 1:3),

  my_vec = 1:6

)

my_list$my_func(my_list$my_vec)

```

`lapply` can be used to apply a function to each item in a list and will return

a list.

```{r lapply}

my_list <- list(

  a = 1:3,

  b = 4:10,

  c = 11:20

)

lapply(my_list, sum)

```

Another handy function is the `do.call` function, which constructs and executes

a function call on a list. The example below is useful for converting a list

into a matrix.

```{r do_call}

my_list <- list(

  a = 1:3,

  b = 4:6,

  c = 7:9

)

# returns a matrix

do.call(what = rbind, args = my_list)

```

Use `utils::stack` to create a data frame from a named list; the vectors are

concatenated and the names are used to create factors.

```{r utils_stack}

utils::stack(my_list)

```

There is also the [purrr::map](https://purrr.tidyverse.org/reference/map.html)

function, that is

[similar](https://jennybc.github.io/purrr-tutorial/ls01_map-name-position-shortcuts.html)

to the apply functions in base R, but you explicitly specify the output type.

The `map_lgl` function will return logicals, i.e. Booleans.

```{r map_lgl}

map_lgl(.x = 1:10, .f = function(x) x > 5)

```

## Functions

Notes from [Advanced R](https://adv-r.hadley.nz/functions.html).

Two important ideas about functions in R need to be understood:

1. Functions can be broken down into three components: arguments, body, and

   environment

2. Functions are objects, just as vectors are objects

R functions are objects in their own right, a language property often called

"first-class functions". 

### Function components

A function has three parts:

1. The `formals()`, which are the list of arguments that control how you call

   the function.

2. The `body()`, which is the code inside the function.

3. The `environment()`, which is the data structure that determines the

   namespace.

The environment is based on where you define the function.

```{r f02}

f02 <- function(x, y){

  # comment

  x + y

}

formals(f02)

body(f02)

environment(f02)

```

Functions contain any number of additional `attributes()` as with all objects

in R. One attribute used by base R is `srcref`, which points to the source code

used to create the function.

```{r f02_srcref}

attr(f02, "srcref")

```

However primitive functions do not have the three components (they return

`NULL`) and call C code directly.

```{r sum}

sum

```

Primitive functions have either type `builtin` or type `special` but have class

`function`.

```{r primitive_type}

class(f02)

class(sum)

class(`[`)

typeof(f02)

typeof(sum)

typeof(`[`)

```

### Lexical scoping

Scoping is the act of finding the value associated with a name. R uses lexical

scoping, which means it looks up the values of names based on how a function is

defined and not by how it is called. Lexical in this context means that the

scoping rules use a parse-time, rather than a run-time, structure. R's lexical

scoping follows four primary rules:

1. Name masking

2. Functions versus variables

3. A fresh start - each time you invoke a function, it starts fresh

4. Dynamic lookup

The basic principle of lexical scoping is that names defined inside a function

mask names defined outside a function. If a name is not defined inside a

function, R looks one level up (all the way up to the global environment).

Lexical scoping determines where, but not when to look for values. _R looks for

values when the function is run and not when the function is created_.

### Lazy evaluation

Function arguments are only evaluated if accessed, i.e. lazily evaluated. This

is a nice feature because it allows the inclusion of potentially expensive

computations in function arguments that will only be evaluated if necessary.

Lazy evaluation is powered by a data structure called a **promise**, which has

three components:

1. An expression, like `x + y`, which gives rise to the delayed computation.

2. An environment where the expression should be evaluated, i.e. the

   environment where the function is called.

3. A value, which is computed and cached the first time a promise is accessed

   when the expression is evaluated in the specified environment.

### dot-dot-dot

Functions can have a special argument `...`, which is pronounced dot-dot-dot.

In other programming languages, this type of argument is often called `varargs`

(variable arguments) and a function that uses it is said to be variadic.

```{r dot_dot_dot_eg}

i01 <- function(y, z) {

  list(y = y, z = z)

}

i02 <- function(x, ...) {

  i01(...)

}

str(i02(x = 1, y = 2, z = 3))

```

The `...` is useful when a function takes a function as an argument: you can

pass additional arguments to that function. The downside of using `...` is that

when arguments are used to pass arguments to another function, it is sometimes

not clear to the user. Also a misspelled argument will not raise an error and

this makes it easy for typos to go unnoticed.

### Exiting a function

Most functions exit in one of two ways:

1. They either return a value, indicating success. There are two ways a

   function can return a value:

    1. Implicitly, where the last evaluated expression is returned.

    2. Explicitly by using `return()`.

2. They throw an error, indicating failure.

Most functions return visibly, meaning that the result is printed when

evaluated in an interactive context. The automatic printing can be prevented by

using `invisible()` but the return value still exists.

If a function cannot complete its assigned task, it should throw an error with

`stop()`, which immediately terminates the execution of the function.

`on.exit()` can be used to run some code regardless of how a function exits;

always set `add = TRUE` when using `on.exit()` because if you don't each call

to `on.exit()` will overwrite the previous exit handler.

### Function forms

Function calls come in four varieties:

1. **prefix**: the function name comes before its arguments, e.g. `sum(1:5)`.

   These constitute the majority of function calls in R

2. **infix**: the function name comes in between its arguments, e.g. `x + y`.

   Infix forms are used for many mathematical operators and for user-defined

   functions that begin and end with `%`.

3. **replacement**: functions that replace values by assignment, e.g.

   `names(my_df) <- c('a', 'b', 'c')`.

4. **special**: functions like `[[`, `if`, and `for` that do not have a

   consistent structure.

All functions can be written in prefix form.

```{r function_forms}

x <- 1900

y <- 84

x + y

`+`(x, y)

df <- data.frame(a = 1, b = 2, c= 3)

`names<-`(df, c("x", "y", "z"))

for(i in 1:3) print(i)

`for`(i, 1:3, print(i))

```

## Objects

[Base types](https://adv-r.hadley.nz/base-types.html).

## Modeling example

Notes from [Tidy Modeling with R](https://www.tmwr.org/base-r.html).

Load `crickets` data set that contains the relationship between the ambient

temperature and the rate of cricket chirps per minute for two species.

```{r crickets}

data(crickets, package = "modeldata")

head(crickets)

```

Plot.

```{r crickets_plot}

ggplot(

  crickets, 

  aes(x = temp, y = rate, color = species, pch = species, lty = species)

) + 

  geom_point(size = 2) + 

  geom_smooth(method = lm, se = FALSE, alpha = 0.5) + 

  scale_color_brewer(palette = "Paired") +

  labs(x = "Temperature (C)", y = "Chirp Rate (per minute)")

```

For an inferential model, we might have specified the following null hypotheses

prior to seeing the data:

* Temperature has no effect on the chirp rate

* There are no differences between the species' chirp rate.

The `lm()` function is commonly used to fit an ordinary linear model. Arguments

to this function are a model formula and the data frame that contains the data.

The formula is _symbolic_; the simple formula below specifies that the chirp

rate is the outcome and the temperature is the predictor.

```{r rate_temp}

rate ~ temp

```

If the time of day was also recorded in a column called `time`, the following

formula does not add the time and temperature values together but the formula

symbolically represents that temperature and time should be added as separate

_main effects_ to the model. A main effect is a model term that contains a

single predictor variable.

```{r rate_temp_time}

rate ~ temp + time

```

We can add the species to the model in the same way but since species is not a

quantitative variable, an _indicator variable_ (also known as a dummy variable)

is used in place of the original qualitative value. The model formula will

automatically encode `species` as a numeric by adding a new column that has a

value of zero and one for the two species.

```{r rate_temp_species}

rate ~ temp + species

```

The model formula `rate ~ temp + species` creates a model with different

y-intercepts for each species; the slopes of the regression lines could be

different for each species as well. To accommodate this structure, an

interaction term can be added to the model. This can be specified in a few

different ways and the most basic uses the colon.

```{r interaction_term}

rate ~ temp + species + temp:species

# A shortcut can be used to expand all interactions containing

# interactions with two variables:

rate ~ (temp + species)^2

# Another shortcut to expand factors to include all possible

# interactions (equivalent for this example):

rate ~ temp * species

```

The model formula also has other nice features:

* _In-line_ functions can be used, e.g. to use the natural log of the

  temperature, we can use the formula `rate ~ log(temp)`

* R has many functions that are useful inside formulas, e.g. `poly(x, 3)` adds

  linear, quadratic, and cubic terms for `x` to the model as main effects.

* The period shortcut is available for data sets with many predictors. The

  period represents main effects for all of the columns that are not on the

  left-hand side of the tilde.

Use a two-way interaction model.

```{r interaction_fit}

interaction_fit <- lm(rate ~ (temp + species)^2, data = crickets)

interaction_fit

```

Now we will recompute the model without the interaction term to assess whether

the interaction term is necessary using the `anova()` method.

```{r main_effect_fit}

main_effect_fit <- lm(rate ~ temp + species, data = crickets)

anova(main_effect_fit, interaction_fit)

```

This statistical test generates a p-value of 0.25, which implies that there is

a lack of evidence against the null hypothesis that the interaction term is not

needed by the model.

The `summary()` method can be used to inspect the coefficients, standard

errors, and p-values of each model term.

```{r summary_main_effect}

summary(main_effect_fit)

```

The chirp rate for each species increases by 3.6 chirps as the temperature

increases by a single degree. This term shows strong statistical sifnificance

as evidenced by the p-value. The species term has a value of -10.07, which

indicates that, across all temperature values, _O. niveus_ has a chirp rate

that is about 10 fewer chirps per minute than _O. exclamationis_. The species

effect is also associated with a very small p-value.

The only issue in this analysis is the intercept value that indicates that at 0

degrees Celsius, there are negative chirps per minute for both species. The

data only goes as low as 17.2 degrees Celsius and therefore the conclusions

should be limited to the observed temperature range.

If we needed to estimate the chirp rate at a temperature that was not observed

in the experiment, we could use the `predict()` method, which takes the model

object and a data frame of new values for prediction.

```{r predict_main_effect}

new_values <- data.frame(species = "O. exclamationis", temp = 15:20)

predict(main_effect_fit, new_values)

```

### R formula

The R model formula is used by many modeling packages and it usually serves

multiple purposes:

* The formula defines the columns that the model uses.

* The standard R machinery uses the formula to encode the columns into an

  appropriate format, e.g. create indicator variables.

* The roles of the columns are defined by the formula.

For example, the following formula indicates that there are two predictors and

the model should contain their main effects and the two-way interactions.

```{r r_formula_eg}

rate ~ (temp + species)^2

```

## Exceptions

Notes from [Debugging, condition handling, and defensive

programming](http://adv-r.had.co.nz/Exceptions-Debugging.html).

`try()` gives you the ability to continue execution even when an error occurs.

```{r try_log}

try_log <- function(x) {

  try(log(x))

  10

}

try_log('a')

```

`tryCatch()` is a general tool for handling conditions: in addition to errors,

you can take different actions for warnings, messages, and interrupts.

```{r try_catch}

show_condition <- function(code) {

  tryCatch(

    code,

    error = function(x) "error",

    warning = function(x) "warning",

    message = function(x) "message"

  )

}

show_condition(log('a'))

show_condition(stop("!"))

show_condition(stopifnot(2 + 2 == 5))

show_condition(stopifnot(4 + 1980 == 1984))

show_condition(warning("?!"))

show_condition(message("?"))

show_condition(10)

```

## Measuring performance

Notes from [Measuring performance](https://adv-r.hadley.nz/perf-measure.html).

### Microbenchmarking

A [microbenchmark](https://adv-r.hadley.nz/perf-measure.html#microbenchmarking)

is a measurement of the performance of a very small piece of code, something

that might take milliseconds, microseconds, or nanoseconds to run.

```{r bench_mark}

set.seed(1984)

x <- runif(100)

(lb <- bench::mark(

  sqrt(x),

  x ^ 0.5

))

```

`for` versus `map_int` versus `sapply`.

```{r map_vs_sapply}

my_num <- 1:10000

for_loop <- function(n){

  v <- vector(mode = "integer")

  for(i in n){

    v[i] <- i^2

  }

  return(v)

}

(ms <- bench::mark(

  for_loop(my_num),

  map_int(my_num, function(x) x^2),

  sapply(my_num, function(x) x^2)

))

```

* `min` - The minimum execution time.

* `median` - The sample median of execution time.

* `itr/sec` - The estimated number of executions performed per second.

* `mem_alloc` - Total amount of memory allocated by R while running the

  expression.

* `gc/sec` - The number of garbage collections per second.

* `n_itr` - Total number of iterations after filtering garbage collections (if

  filter_gc == TRUE).

* `n_gc` - Total number of garbage collections performed over all iterations.

* `total_time` - The total time to perform the benchmarks.

* `result` - A list column of the object(s) returned by the evaluated

  expression(s).

* `memory` - A list column with results from Rprofmem().

* `time` - A list column of bench_time vectors for each evaluated expression.

* `gc` - A list column with tibbles containing the level of garbage collection

  (0-2, columns) for each iteration (rows).

```{r plot_ms}

plot(ms)

```

Use `system.time` to measure CPU time.

```{r with_replicate}

system.time(

  x <- replicate(1000, for_loop(my_num))

)

system.time(

  y <- replicate(1000, map_int(my_num, function(x) x^2))

)

system.time(

  z <- replicate(1000, sapply(my_num, function(x) x^2))

)

all.equal(x, y)

all.equal(x, z)

```

## General

Assign a data frame column `NULL` to delete it.

```{r exorcise}

my_df <- data.frame(

  a = 1:3,

  b = 4:6,

  c = c(6, 6, 6)

)

my_df$c <- NULL

my_df

```

Include an additional directory (`/packages`) to look for and install R

packages.

```{r lib_paths}

.libPaths('/packages')

```

Use `identical` to check whether two objects are exactly equal. Most times it

should suffice to just use `all.equal`.

```{r identical}

first <- 1:5

second <- c(1, 2, 3, 4, 5)

# this is false because first is a vector of integers

# and second is a vector of numerics

identical(first, second)

all.equal(first, second)

```

Set `scipen` (default is 0), which is a penalty to be applied when deciding to

print numeric values in fixed or exponential notation, to determine when to

print in exponential notation. (`.Options` contains all other options

settings.)

```{r scipen}

options(scipen=0)

10e4

options(scipen=1)

10e4

```

Use `system.time()` to measure how long a block of codes takes to execute.

```{r system_time}

system.time(

  for (i in 1:100000000){}

)

```

The `with` function evaluates an expression with data.

```{r with_example}

my_df <- data.frame(

  a=1:10,

  b=11:20,

  c=21:30

)

wanted <- with(my_df, a > 5 & c > 27)

my_df[wanted, ]

```

The `which` function is a very useful for returning indicates that are `TRUE`

and works with matrices.

```{r which}

my_mat <- matrix(1:9, nrow=3, byrow = TRUE)

# note that the results are ordered by col

which(my_mat > 5, arr.ind = TRUE)

```

The `match` function can be used with vectors to return the indexes of matching

items and an `NA` is no match was found.

```{r match_xy}

x <- c('b', 'c', 'a', 'd')

y <- letters[1:3]

match(x, y)

```

You can use `match` to subset and order a data frame.

```{r match_subset}

my_df <- data.frame(

  a = 1:10,

  b = letters[1:10]

)

x <- c(2, 10, 5, 6)

x_match <- match(x, my_df$a)

my_df[x_match, ]

```

Use the `complete.cases` function to list observations that have no missing

values, i.e. NA values.

```{r complete_cases}

my_df <- data.frame(

  a = 1:3,

  b = c(4, NA, 6),

  c = 7:9

)

complete.cases(my_df)

```

Use `commandArgs` to accept command line arguments without having to install an

external package like `optparse`.

```{r command_args}

args <- commandArgs(TRUE)

```

## Useful plots

Visualise a table.

```{r mosaicplot}

mosaicplot(table(ChickWeight$Time, ChickWeight$Diet), main = "Timepoint versus diet")

```

### Getting help

Get help on a class.

```{r numeric_class_help, eval=FALSE}

?"numeric-class"

```

Get information on a package.

```{r stringr_help, eval=FALSE}

library(help="stringr")

```

Finding out what methods are available for a class.

```{r lm_methods}

methods(class="lm")

```

Search the help pages.

```{r help_search, eval=FALSE}

help.search("cross tabulate")

```

Search for function containing keyword.

```{r apropos}

apropos("mutate")

```

## Hacks

There are probably better ways to do the following, which is why I have

labelled them as hacks, so follow at your own peril.

### Makevars

You can set/modify preprocessor options (for example include paths and

definitions) for C/C++ files using a

[Makevars](https://cran.r-project.org/doc/manuals/r-devel/R-exts.html#Using-Makevars)

file. I had a problem [installing the ranger

package](https://github.com/imbs-hl/ranger/issues/669) because R (my version is

4.3.0) sets the default C++

[standard](https://gcc.gnu.org/projects/cxx-status.html) to C++11 and `ranger`

requires C++14. I wrote this Bash snippet (I use a Bash script to install my

packages) to overcome this problem.

```bash

makevars=${HOME}/.R/Makevars

if [[ ! -e ${makevars} ]]; then

   mkdir -p ${HOME}/.R

   touch ${makevars} && echo "CXX = g++ -std=gnu++14" >> ${makevars}

   install_ranger

   rm ${makevars}

else

   if [[ $(grep -c "^CXX = g++ -std=gnu++14\b") == 0 ]]; then

      cp ${makevars} ${HOME}/.R/Makevars_bk

      echo "CXX = g++ -std=gnu++14" >> ${makevars}

      install_ranger

      mv -f ${HOME}/.R/Makevars_bk ${makevars}

   fi

fi

```

### Library paths

Some R packages require libraries not included in the default library path. Use

`Sys.setenv` to include additional library paths. First we'll get the default

path.

```{r get_ld_library_path}

Sys.getenv("LD_LIBRARY_PATH")

```

Now we will add `/usr/include` to `LD_LIBRARY_PATH` and get the updated library

path.

```{r set_ld_library_path}

new_path <- paste0(Sys.getenv("LD_LIBRARY_PATH"), ":", "/usr/include")

Sys.setenv("LD_LIBRARY_PATH" = new_path)

Sys.getenv("LD_LIBRARY_PATH")

```

### Variables and objects

Sometimes you want to create objects with values stored in variables. This can

be achieved using `assign()`.

```{r assign}

my_varname <- 'one_to_ten'

my_values <- 1:10

assign(my_varname, my_values)

one_to_ten

```

Likewise, sometimes you want to store an object's name into a variable. This

can be achieved using `substitute` (returns the parse tree for an unevaluated

expression) and `deparse` (turns unevaluated expressions into character

strings).

```{r obj_to_string}

obj_to_string <- function(x){

   deparse(substitute(x))

}

my_obj_name <- 1984

my_var <- obj_to_string(my_obj_name)

my_var

```

To evaluating a string, use `parse` (returns an unevaluated expression) with a

`text` argument specifying the character vector and `eval` (evaluates an

unevaluated expression).

```{r eval}

eval(parse(text = my_var))

```

## Useful tips

A lot of R books are free to read; check out the [bookdown](https://bookdown.org/) page to see some of the best R books.

R has four special values:

1. `NA` - used for representing missing data.

2. `NULL` - represents not having a value and unlike `NA`, it is its own object

   and cannot be used in a vector.

3. `Inf`/`-Inf` - used for representing numbers too big for R (see below).

4. `NaN` - used for storing results that are not a number.

Check the `.Machine` variable to find out the numerical characteristics of the

machine R is running on, such as the largest double or integer and the

machine's precision.

```{r machine}

noquote(unlist(format(.Machine)))

```

When asking for help online, it is useful to include a minimal example that

includes some data specific to your question. To easily convert data into code,

use the `dput()` function. The example below is just for illustrative purposes

since the `women` dataset is included with R, so you would not need to generate

code for it.

```{r dput}

dput(women)

```

Show all the functions of a package.

```{r stringr_functions}

ls("package:stringr")

```

Search is useful to list the search path, i.e. where R will look, for R objects such as functions.

```{r search_path}

search()

```

Save all functions in the global environment into a file (that you can source later)!

```{r dump_functions}

dump(list = lsf.str(), file = "functions.R")

unlink('functions.R')

```

## Session info

This README was generated by running `rmd_to_md.sh` with `readme.Rmd`.

```{r time, echo=FALSE}

Sys.time()

```

Session info.

```{r session_info, echo=FALSE}

sessionInfo()

```
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/davetang/learning_r

Awesome Lists containing this project

README