---
output: rmarkdown::github_document
editor_options:
  chunk_output_type: console
---
```{r pkg-knitr-opts, include=FALSE}
hrbrpkghelpr::global_opts()
```

```{r badges, results='asis', echo=FALSE, cache=FALSE}
hrbrpkghelpr::stinking_badges()
```

```{r description, results='asis', echo=FALSE, cache=FALSE}
hrbrpkghelpr::yank_title_and_description()
```

## What's Inside The Tin

The following functions are implemented:

### DSL

- `web_client`/`webclient`: Create a new HtmlUnit WebClient instance

- `wc_go`: Visit a URL

- `wc_html_nodes`: Select nodes from the web client's active page HTML content
- `wc_html_text`: Extract text from web client page HTML content
- `wc_html_attr`: Extract attributes from web client page HTML content
- `wc_html_name`: Extract tag names from web client page HTML content

- `wc_headers`: Return response headers of the last web request for the current page
- `wc_browser_info`: Retrieve information about the browser used to create the web client
- `wc_content_length`: Return content length of the last web request for the current page
- `wc_content_type`: Return content type of the last web request for the current page
- `wc_load_time`: Return load time of the last web request for the current page
- `wc_status`: Return status code of the last web request for the current page
- `wc_title`: Return the page title for the current page
- `wc_url`: Return the URL of the last web request for the current page
  (these metadata getters are sketched just after this list)

- `wc_render`: Retrieve current page contents

- `wc_css`: Enable/Disable CSS support
- `wc_dnt`: Enable/Disable Do-Not-Track
- `wc_geo`: Enable/Disable Geolocation
- `wc_img_dl`: Enable/Disable Image Downloading
- `wc_resize`: Resize the virtual browser window
- `wc_timeout`: Change the default request timeout
- `wc_use_insecure_ssl`: Enable/Disable ignoring SSL validation issues
- `wc_wait`: Block until all background JavaScript tasks have finished executing (i.e. wait for HtmlUnit to finish final rendering)
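
Most of the response-metadata helpers are simple getters on the client.
A minimal sketch (left unevaluated; it assumes each getter takes the
client as its only argument, the same pattern `wc_browser_info()` uses
in the Usage section below):

```{r dsl-meta-sketch, eval=FALSE}
wc <- web_client()
wc %>% wc_go("https://usa.gov/")

wc %>% wc_title()        # page title
wc %>% wc_status()       # HTTP status code of the last request
wc %>% wc_content_type() # content type of the last request
wc %>% wc_headers()      # response headers of the last request
```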

### Just the Content (pls)

- `hu_read_html`: Read HTML from a URL with Browser Emulation & in a JavaScript Context

### Content++

- `wc_inspect`: Perform a "Developer Tools"-like Network Inspection of a URL

## Installation

```{r install-ex, results='asis', echo=FALSE, cache=FALSE}
hrbrpkghelpr::install_block()
```

## Usage

```{r cache=FALSE}
library(htmlunit)
library(rvest)     # for html_table(); not req'd for pkg
library(tidyverse) # for some data ops; not req'd for pkg

# current version
packageVersion("htmlunit")
```

Something `xml2::read_html()` cannot do: read the table from <https://hrbrmstr.github.io/htmlunitjars/index.html>:

![](man/figures/test-url-table.png)

```{r ex1}
test_url <- "https://hrbrmstr.github.io/htmlunitjars/index.html"

pg <- xml2::read_html(test_url)

html_table(pg)
```

☹️

But, `hu_read_html()` can!

```{r ex2}
pg <- hu_read_html(test_url)

html_table(pg)
```

All without needing a separate Selenium or Splash server instance.

### Content++

We can also get a HAR-like content + metadata dump:

```{r ex3}
xdf <- wc_inspect("https://rstudio.com")

colnames(xdf)

select(xdf, method, url, status_code, content_length, load_time)

group_by(xdf, content_type) %>%
  summarise(
    total_size = sum(content_length),
    total_load_time = sum(load_time) / 1000
  )
```
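
Since `xdf` is a plain data frame, the usual dplyr verbs apply; e.g., a
quick look at the five heaviest resources on the page:

```{r ex3-heaviest}
arrange(xdf, desc(content_length)) %>%
  select(url, content_type, content_length, load_time) %>%
  head(5)
```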

### DSL

```{r ex4}
wc <- web_client(emulate = "chrome")

wc %>% wc_browser_info()

wc <- web_client()

wc %>% wc_go("https://usa.gov/")

# to use purrr::map_*() functions, pass the result of wc_html_nodes() to as.list()
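# e.g. (purrr is attached above via tidyverse):
wc %>%
  wc_html_nodes("a") %>%
  as.list() %>%
  purrr::map_chr(wc_html_text, trim = TRUE) %>%
  head(10)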

wc %>%
  wc_html_nodes("a") %>%
  sapply(wc_html_text, trim = TRUE) %>%
  head(10)

wc %>%
  wc_html_nodes(xpath = ".//a") %>%
  sapply(wc_html_text, trim = TRUE) %>%
  head(10)

wc %>%
  wc_html_nodes(xpath = ".//a") %>%
  sapply(wc_html_attr, "href") %>%
  head(10)
```

Handy function to get rendered plain text for text mining:

```{r ex5}
wc %>%
  wc_render("text") %>%
  substr(1, 300) %>%
  cat()
```
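
As a deliberately naive sketch of the text-mining angle (whitespace
tokenization and term counts, base R only):

```{r ex5-terms}
wc %>%
  wc_render("text") %>%
  strsplit("\\s+") %>%  # split rendered text on whitespace
  unlist() %>%
  tolower() %>%
  table() %>%           # term frequencies
  sort(decreasing = TRUE) %>%
  head(10)
```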

### htmlunit Metrics

```{r echo=FALSE}
cloc::cloc_pkg_md()
```

## Code of Conduct

Please note that this project is released with a Contributor Code of Conduct.
By participating in this project you agree to abide by its terms.