https://github.com/RLesur/crrri

A Chrome Remote Interface written in R
Host: GitHub
URL: https://github.com/RLesur/crrri
Owner: RLesur
License: other
Created: 2018-11-16T18:00:50.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2022-08-22T15:48:30.000Z (over 2 years ago)
Last Synced: 2024-11-19T23:03:51.577Z (5 months ago)
Topics: chrome-devtools, chrome-headless, r, r-package, rstats
Language: R
Homepage: https://rlesur.github.io/crrri/
Size: 1.16 MB
Stars: 157
Watchers: 10
Forks: 12
Open Issues: 34
Metadata Files:
- Readme: README.Rmd
- License: LICENSE
Awesome Lists containing this project

jimsghstars - RLesur/crrri - A Chrome Remote Interface written in R (R)
README

---
output: github_document
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  eval = FALSE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
Sys.unsetenv("DEBUGME")
```

# crrri

[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://www.tidyverse.org/lifecycle/#experimental)

[![Codecov test coverage](https://codecov.io/gh/RLesur/crrri/branch/master/graph/badge.svg)](https://codecov.io/gh/RLesur/crrri?branch=master)

[![CRAN status](https://www.r-pkg.org/badges/version/crrri)](https://cran.r-project.org/package=crrri)

[![R build status](https://github.com/RLesur/crrri/workflows/R-CMD-check/badge.svg)](https://github.com/RLesur/crrri/actions)

**Work in progress**

The goal of `crrri` is to provide a native Chrome Remote Interface in R using the [Chrome Debugging Protocol](https://chromedevtools.github.io/devtools-protocol/). This is a low-level implementation of the protocol heavily inspired by the [`chrome-remote-interface`](https://github.com/cyrus-and/chrome-remote-interface) JavaScript library written by [Andrea Cardaci](https://github.com/cyrus-and).

This package is intended for R package developers who need to orchestrate Chromium/Chrome: **with `crrri`, you can easily interact with (headless) Chromium/Chrome using R**. We worked hard to keep the API as simple as possible. However, the bulk of the work is still yours, and you will have to learn how the Chrome DevTools Protocol works. Interacting with Chromium/Chrome through the DevTools Protocol is a highly technical and error-prone task: you will be close to the metal and have full power (be cautious!).

This package is built on top of the [`websocket`](https://github.com/rstudio/websocket) and [`promises`](https://cran.r-project.org/package=promises) packages. The default design of the `crrri` functions is asynchronous: they return promises. You can also use `crrri` with callbacks if you prefer. 

We are highly indebted to [Miles McBain](https://github.com/milesmcbain) for his seminal work on [`chradle`](https://github.com/milesmcbain/chradle) that inspired us. Many thanks!

## System requirements

First of all, you **do not need a `node.js` configuration** because **`crrri` is fully written in R**.

You only need a recent version of Chromium or Chrome. A standalone version works perfectly well on Windows. By default, `crrri` tries to find a Chrome binary on your system with the `find_chrome_binary()` function. You can tell `crrri` to use a specific version by setting the `HEADLESS_CHROME` environment variable to the path of Chromium or Chrome (this is the same environment variable that is used in [`decapitated`](https://github.com/hrbrmstr/decapitated)). You can check that it is set correctly by executing `Sys.getenv("HEADLESS_CHROME")` in your R console.

Otherwise, you can also use the `bin` argument of the `Chrome` class `new()` method to provide the path directly.

```r

chrome <- Chrome$new(bin = "")

```

Note that if you do not know where your binary is, you can call the `find_chrome_binary()` function directly; it will try to guess the location (you might need to install the package).

These two calls are equivalent:

```r
chrome <- Chrome$new(bin = find_chrome_binary())
# the default
chrome <- Chrome$new(bin = NULL)
```
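
As a minimal sketch of the environment-variable route, the snippet below sets `HEADLESS_CHROME` for the current session and checks the result. The path `/usr/bin/chromium` is only an assumed example; replace it with the location of your own binary.

```r
# Hypothetical path: replace it with the location of your own binary.
chrome_path <- "/usr/bin/chromium"

# Set the variable for the current R session only.
Sys.setenv(HEADLESS_CHROME = chrome_path)

# Check the value crrri will read, and whether a file exists at that path.
Sys.getenv("HEADLESS_CHROME")
file.exists(Sys.getenv("HEADLESS_CHROME"))
```

To make the setting persistent across sessions, the same variable can be defined in your `.Renviron` file.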

## Installation

You can install the development version of `crrri` from GitHub with:

```{r, eval=FALSE}

remotes::install_github('rlesur/crrri')

```

## Using `crrri` interactively

The `crrri` package is a low-level interface and **is not intended to be used interactively**: the goal of `crrri` is to provide R developers with a set of classes and helper functions to build higher-level functions.

**However, you can discover headless Chrome automation interactively in your R session using `crrri`**. This will help you learn the [Chrome DevTools Protocol](https://chromedevtools.github.io/devtools-protocol), understand the `crrri` design, and develop higher-level functions.

### A short tour

Assuming that you have configured the `HEADLESS_CHROME` environment variable (see [above](#system-requirements)), you can start headless Chrome:

```{r}
library(crrri)
chrome <- Chrome$new()
```

The `Chrome` class constructor is a **synchronous function**. That means the R session is on hold until the command terminates.

The `$connect()` method of the `Chrome` class connects the R session to headless Chrome. As the connection process can take some time, the R session does not block^[most R users will find this behavior weird, but it is extremely powerful!]: this is an **asynchronous function**. It returns a promise which is fulfilled when R is connected to Chrome.

However, you can pass a callback function to the `$connect()` method using its `callback` argument. In this case, the returned object will be a connection object:

```{r}
client <- chrome$connect(callback = function(client) {
  client$inspect()
})
```

The `$inspect()` method of the connection object opens the Chrome DevTools Inspector in RStudio (>= 1.2.1335) or in your default web browser (you may run into trouble if the inspector is not opened in Chromium/Chrome). This is convenient when you need to inspect the content of a web page, because everything you need is in RStudio.

![DevTools Inspector in RStudio viewer](https://user-images.githubusercontent.com/19177171/56867255-861c3900-69e3-11e9-88cd-2ef29075070f.png)

In order to discover the [Chrome DevTools Protocol](https://chromedevtools.github.io/devtools-protocol) commands and event listeners, it is recommended to extract one of the domains^[a domain is a set of commands, event listeners and types.] from the connection object:

```{r}

Page <- client$Page

```

The `Page` object represents the [`Page` domain](https://chromedevtools.github.io/devtools-protocol/tot/Page). It possesses methods to send commands or listen to specific events.

For instance, you can send to Chromium/Chrome the [`Page.navigate`](https://chromedevtools.github.io/devtools-protocol/tot/Page#method-navigate) command as follows:

```{r}

Page$navigate(url = "http://r-project.org")

```

Once the page is loaded by headless Chrome, RStudio looks like this:

![R Project website in headless Chrome](https://user-images.githubusercontent.com/19177171/56867262-8f0d0a80-69e3-11e9-828f-4dddb0bcd492.png)

You will see in the R console:

```
#> <Promise [pending]>
```

This is a promise object that is fulfilled when Chromium/Chrome sends back to R a message confirming that the command was received. This comes from the fact that the `Page$navigate()` function is also asynchronous. All the asynchronous methods have a `callback` argument. When the R session receives the result of the command from Chrome, R executes this callback function, passing it the result object. For instance, you can execute:

```{r}
Page$navigate(url = "https://ropensci.org/", callback = function(result) {
  cat("The R session has received this result from Chrome!\n")
  print(result)
})
```

Once the page is loaded, you will see both the web page and the result object in RStudio:

![rOpenSci website in headless Chrome](https://user-images.githubusercontent.com/19177171/56867269-9cc29000-69e3-11e9-8fa4-ca238d3b3566.png)

To inspect the result of a command you can pass the `print` function to the `callback` argument:

```{r}

Page$navigate(url = "https://ropensci.org/", callback = print)

```

```
#> $frameId
#> [1] "3BB38B10082F28A946332100964486EC"
#> 
#> $loaderId
#> [1] "9DCF07625678433563CB03FFF1E8A6AB"
```

The result object sent back from Chrome is also the value of the promise once it is fulfilled. Recall that if you do not use a callback function, you get a promise:

```{r}

async_result <- Page$navigate(url = "http://r-project.org")

```

You can print the value of this promise once fulfilled with:

```{r}

async_result %...>% print()

```

```
#> $frameId
#> [1] "3BB38B10082F28A946332100964486EC"
#> 
#> $loaderId
#> [1] "7B2383E8F2F39273E18E4D918F1852A0"
```

As you can see, this leads to the same result as with a callback function.

To sum up, these two forms perform the same actions:

```{r}
Page$navigate(url = "http://r-project.org", callback = print)
Page$navigate(url = "http://r-project.org") %...>% print()
```

If you interact with headless Chrome in the R console using `crrri`, these two forms are equivalent.  

**However, if you want to use `crrri` to develop higher level functions, the most reliable way is to use promises.**
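
One reason promises hold up better in larger functions is that a failure anywhere in a chain skips the remaining steps and lands in a single error handler. The sketch below illustrates this with the **promises** package alone: `fake_command()` is a hypothetical stand-in for an asynchronous `crrri` command and never talks to Chrome.

```r
library(promises)

# Hypothetical stand-in for an asynchronous crrri command: it returns a
# promise that is fulfilled with `value`, or rejected when `fail = TRUE`.
fake_command <- function(value, fail = FALSE) {
  promise(function(resolve, reject) {
    if (fail) reject(simpleError("command failed")) else resolve(value)
  })
}

events <- character()

fake_command("step 1") %...>% (function(x) {
  events <<- c(events, x)
  fake_command("step 2", fail = TRUE)  # returning a promise chains it
}) %...>% (function(x) {
  events <<- c(events, x)              # skipped: the previous step failed
}) %...!% (function(err) {
  events <<- c(events, paste("error:", conditionMessage(err)))
})

# In a script, run the pending callbacks explicitly.
while (!later::loop_empty()) later::run_now()
events
```

Here `events` ends up holding `"step 1"` followed by the error message; the second step never runs. In an interactive session the event loop runs whenever the console is idle, so the `later::run_now()` loop is only needed in scripts.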

Do not forget to close headless Chrome with:

```{r}

chrome$close()

```

Since the RStudio viewer has lost the connection, you will see this screen in RStudio:

![headless Chrome closed](https://user-images.githubusercontent.com/19177171/56867276-a4823480-69e3-11e9-8530-831ac4dd144e.png)

Now, you can take some time to discover all the commands and events of the [Chrome DevTools Protocol](https://chromedevtools.github.io/devtools-protocol/). The following examples will introduce some of them.

### Domains, commands and events listeners

While working interactively, you can obtain the list of available domains in your version of Chromium/Chrome.  

First, launch Chromium/Chrome and connect the R session to headless Chromium/Chrome:

```r
chrome <- Chrome$new()
client <- chrome$connect(~ .x$inspect())
```

Once connected, you just have to print the connection object to get information about the connection and the available domains:

```r

client

```

```
#> 
#> connected to: http://localhost:9222/
#>  target type: "page"
#>    target ID: "9A576420CADEA9A514C5F027D30B410D"
#> 
#> 
#> Accessibility (experimental)
#> 
#> Animation (experimental)
#> 
#> ApplicationCache (experimental)
#> 
#> Audits (experimental): Audits domain allows investigation of page violations and possible improvements.
#> 
#> Browser: The Browser domain defines methods and events for browser managing.
#> 
#> CacheStorage (experimental)
#> 
#> Cast (experimental): A domain for interacting with Cast, Presentation API, and Remote Playback API functionalities.
...
```

This information is retrieved directly from Chromium/Chrome: you may obtain different information depending on the Chromium/Chrome version.

In the most recent versions of the Chrome DevTools Protocol, more than 40 domains are available. A domain is a set of commands and event listeners.

In order to work with a domain, it is recommended to extract it from the connection object. For instance, if you want to access the `Runtime` domain, execute:

```r

Runtime <- client$Runtime

```

If you print this object, this will open the online documentation about this domain in your browser:

```r

Runtime # opens the online documentation in a browser

```

## Presentations about crrri

* **uros2019** - 20/05/2019 ([slides](https://speakerdeck.com/rlesur/headless-chrome-automation-with-r-the-crrri-package))

* **useR!2019** - 12/07/2019 ([slides](https://cderv.gitlab.io/user2019-crrri/))

## Examples

### Generate a PDF

Here is an example that produces a PDF of the [R Project website](https://www.r-project.org/): 

```{r, results='hide'}
library(promises)
library(crrri)
library(jsonlite)

perform_with_chrome(function(client) {
  Page <- client$Page

  Page$enable() %...>% { # await enablement of the Page domain
    Page$navigate(url = "https://www.r-project.org/")
    Page$loadEventFired() # await the load event
  } %...>% {
    Page$printToPDF()
  } %...>% { # await PDF reception
    .$data %>% base64_dec() %>% writeBin("r_project.pdf")
  }
})
```
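
The last step of this chain decodes the base64 `data` field and writes the bytes to disk. As a minimal sketch of that step in isolation, without Chrome, the snippet below uses a made-up payload in place of a real `Page.printToPDF` result; `payload` and the temporary output file are assumptions introduced for the illustration.

```r
library(jsonlite)

# Made-up stand-in for the `data` field of a printToPDF result:
# a base64 string wrapping a few placeholder bytes.
payload <- base64_enc(charToRaw("%PDF-1.4 placeholder"))

# Decode the base64 string into a raw vector.
bytes <- base64_dec(payload)

# Write the raw bytes to a temporary file (hypothetical output location).
out <- tempfile(fileext = ".pdf")
writeBin(bytes, out)

# The file holds exactly the decoded bytes.
file.size(out) == length(bytes)
```

Because `writeBin()` receives a raw vector, the file is written byte for byte, which is what a binary format such as PDF requires.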

All the functions of the `crrri` package (commands and event listeners) return promises (as defined in the **promises** package) by default. When building higher-level functions, do not forget that you have to deal with promises (they will keep you from falling into _callback hell_).

For instance, you can write a `save_url_as_pdf()` function as follows:

```{r}
save_url_as_pdf <- function(url) {
  function(client) {
    Page <- client$Page

    Page$enable() %...>% {
      Page$navigate(url = url)
      Page$loadEventFired()
    } %...>% {
      Page$printToPDF()
    } %...>% {
      .$data %>%
        jsonlite::base64_dec() %>%
        writeBin(paste0(httr::parse_url(url)$hostname, ".pdf"))
    }
  }
}
```

You can pass several functions to `perform_with_chrome()`:

```{r}
save_as_pdf <- function(...) {
  list(...) %>%
    purrr::map(save_url_as_pdf) %>%
    perform_with_chrome(.list = .)
}
```

You have created a `save_as_pdf()` function that can handle multiple URLs:

```{r}

save_as_pdf("http://r-project.org", "https://ropensci.org/", "https://rstudio.com")

```
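
Each PDF is named after the host of its URL, because `save_url_as_pdf()` builds the file name with `httr::parse_url()`. A quick way to preview the names without starting Chrome is sketched below; `pdf_name()` is a helper introduced only for this illustration.

```r
# Hypothetical helper: mirrors the file-name expression used in
# save_url_as_pdf(), so the names can be previewed without Chrome.
pdf_name <- function(url) {
  paste0(httr::parse_url(url)$hostname, ".pdf")
}

urls <- c("http://r-project.org", "https://ropensci.org/", "https://rstudio.com")
pdf_names <- vapply(urls, pdf_name, character(1), USE.NAMES = FALSE)
pdf_names
```

Note that two URLs on the same host would map to the same file name, so the second PDF would overwrite the first.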

### Transpose `chrome-remote-interface` JS scripts: dump the DOM

With `crrri`, you should be able to transpose, with minimal effort, JS scripts written with the [`chrome-remote-interface`](https://github.com/cyrus-and/chrome-remote-interface) node.js module.

For instance, take [this JS script](https://github.com/cyrus-and/chrome-remote-interface/wiki/Dump-HTML-after-page-load) that dumps the DOM:

```js
const CDP = require('chrome-remote-interface');

CDP(async(client) => {
    const {Network, Page, Runtime} = client;
    try {
        await Network.enable();
        await Page.enable();
        await Network.setCacheDisabled({cacheDisabled: true});
        await Page.navigate({url: 'https://github.com'});
        await Page.loadEventFired();
        const result = await Runtime.evaluate({
            expression: 'document.documentElement.outerHTML'
        });
        const html = result.result.value;
        console.log(html);
    } catch (err) {
        console.error(err);
    } finally {
        client.close();
    }
}).on('error', (err) => {
    console.error(err);
});
```

Using `crrri`, you can write:

```{r, results='hide'}
library(promises)
library(crrri)

async_dump_DOM <- function(client) {
  Network <- client$Network
  Page <- client$Page
  Runtime <- client$Runtime

  Network$enable() %...>% {
    Page$enable()
  } %...>% {
    Network$setCacheDisabled(cacheDisabled = TRUE)
  } %...>% {
    Page$navigate(url = 'https://github.com')
  } %...>% {
    Page$loadEventFired()
  } %...>% {
    Runtime$evaluate(
      expression = 'document.documentElement.outerHTML'
    )
  } %...>% (function(result) {
    html <- result$result$value
    cat(html, "\n")
  })
}

perform_with_chrome(async_dump_DOM)
```

If you want to write a higher-level function that dumps the DOM, you can embed the main part of this script in a function:

```{r}
dump_DOM <- function(url) {
  perform_with_chrome(function(client) {
    Network <- client$Network
    Page <- client$Page
    Runtime <- client$Runtime

    Network$enable() %...>% {
      Page$enable()
    } %...>% {
      Network$setCacheDisabled(cacheDisabled = TRUE)
    } %...>% {
      Page$navigate(url = url)
    } %...>% {
      Page$loadEventFired()
    } %...>% {
      Runtime$evaluate(
        expression = 'document.documentElement.outerHTML'
      )
    } %...>% (function(result) {
      html <- result$result$value
      cat(html, "\n")
    })
  })
}
```

Now, you can use it for dumping [David Gohel](https://github.com/davidgohel)'s [blog](http://www.ardata.fr/blog/):

```{r, results='hide'}
# call the function under the name it was defined with above
dump_DOM(url = "http://www.ardata.fr/blog/")
```

You can find many other examples in the [wiki](https://github.com/cyrus-and/chrome-remote-interface/wiki) of the `chrome-remote-interface` module.

## Development

### Logging Messages

In `crrri`, there are two types of messages:

- Those sent during connection/disconnection (mainly for `crrri` debugging).
- Those tracking the exchanges between the R websocket client and the remote websocket server. The latter are essential for R developers building higher-level packages, both during development and for debugging.

`crrri` uses [`debugme`](https://github.com/r-lib/debugme) for printing those messages. It is disabled by default, so you won't see any messages; as a user, we think that is fine. However, if you are a developer, you will want some information on what is going on.

To activate messaging, add `"crrri"` to the `DEBUGME` environment variable before loading the package. Currently, `crrri` has only one level of message. Also, `debugme` is a suggested dependency, so you may need to install it manually if it is not already installed.
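
As a minimal sketch, assuming `DEBUGME` is not already used for other packages in your session, the variable can be set from R itself as long as this happens before `crrri` is loaded:

```r
# Set the variable first: it has to be in place before crrri is loaded.
# Note that this overwrites any existing DEBUGME value.
Sys.setenv(DEBUGME = "crrri")

# Load crrri afterwards (guarded so the snippet also runs where crrri is absent).
if (requireNamespace("crrri", quietly = TRUE)) {
  library(crrri)
}

# Confirm the value seen by the session.
Sys.getenv("DEBUGME")
```

If the package was already loaded when the variable was set, restart the R session and run the snippet again.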

## Credits

Andrea Cardaci for `chrome-remote-interface`.

Miles McBain for `chradle`.

Bob Rudis for `decapitated`.