https://github.com/hrbrmstr/htmlunit
🕸🧰☕️Tools to Scrape Dynamic Web Content via the 'HtmlUnit' Java Library
- Host: GitHub
- URL: https://github.com/hrbrmstr/htmlunit
- Owner: hrbrmstr
- License: apache-2.0
- Created: 2018-12-16T18:54:39.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2020-08-19T12:55:13.000Z (over 4 years ago)
- Last Synced: 2024-10-12T21:24:15.431Z (3 months ago)
- Topics: htmlunit, javascript, r, r-cyber, rstats, web-scraping
- Language: R
- Homepage:
- Size: 29.2 MB
- Stars: 37
- Watchers: 5
- Forks: 6
- Open Issues: 5
Metadata Files:
- Readme: README.Rmd
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
README
---
output: rmarkdown::github_document
editor_options:
  chunk_output_type: console
---
```{r pkg-knitr-opts, include=FALSE}
hrbrpkghelpr::global_opts()
```

```{r badges, results='asis', echo=FALSE, cache=FALSE}
hrbrpkghelpr::stinking_badges()
```

```{r description, results='asis', echo=FALSE, cache=FALSE}
hrbrpkghelpr::yank_title_and_description()
```

## What's Inside The Tin
The following functions are implemented:
### DSL
- `web_client`/`webclient`: Create a new HtmlUnit WebClient instance
- `wc_go`: Visit a URL
- `wc_html_nodes`: Select nodes from web client active page html content
- `wc_html_text`: Extract attributes, text and tag name from webclient page html content
- `wc_html_attr`: Extract attributes, text and tag name from webclient page html content
- `wc_html_name`: Extract attributes, text and tag name from webclient page html content
- `wc_headers`: Return response headers of the last web request for current page
- `wc_browser_info`: Retrieve information about the browser used to create the 'webclient'
- `wc_content_length`: Return content length of the last web request for current page
- `wc_content_type`: Return content type of web request for current page
- `wc_render`: Retrieve current page contents
- `wc_css`: Enable/Disable CSS support
- `wc_dnt`: Enable/Disable Do-Not-Track
- `wc_geo`: Enable/Disable Geolocation
- `wc_img_dl`: Enable/Disable Image Downloading
- `wc_load_time`: Return load time of the last web request for current page
- `wc_resize`: Resize the virtual browser window
- `wc_status`: Return status code of web request for current page
- `wc_timeout`: Change default request timeout
- `wc_title`: Return page title for current page
- `wc_url`: Return the URL of the current page
- `wc_use_insecure_ssl`: Enable/Disable Ignoring SSL Validation Issues
- `wc_wait`: Block HtmlUnit final rendering until all background JavaScript tasks have finished executing

### Just the Content (pls)
- `hu_read_html`: Read HTML from a URL with Browser Emulation & in a JavaScript Context
### Content++
- `wc_inspect`: Perform a "Developer Tools"-like Network Inspection of a URL
## Installation
```{r install-ex, results='asis', echo=FALSE, cache=FALSE}
hrbrpkghelpr::install_block()
```

## Usage
```{r cache=FALSE}
library(htmlunit)
library(tidyverse) # for some data ops; not req'd for pkg

# current version
packageVersion("htmlunit")
```
Something `xml2::read_html()` cannot do: read the table from the test page (the `test_url` used in the next chunk):
![](man/figures/test-url-table.png)
```{r ex1}
test_url <- "https://hrbrmstr.github.io/htmlunitjars/index.html"

pg <- xml2::read_html(test_url)
html_table(pg)
```

☹️
But, `hu_read_html()` can!
```{r ex2}
pg <- hu_read_html(test_url)

html_table(pg)
```

All without needing a separate Selenium or Splash server instance.
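
To see the difference directly, one option is to count the table rows each reader finds on the same page. This is only a minimal sketch: it assumes the test page is reachable and that `hu_read_html()` returns an `xml2` document by default (the `html_table(pg)` call above suggests it does). The `count_rows()` helper is hypothetical and not part of htmlunit, and no particular counts are promised here.

```r
library(htmlunit)

test_url <- "https://hrbrmstr.github.io/htmlunitjars/index.html"

# hypothetical helper (not part of htmlunit): number of <tr> elements in a parsed document
count_rows <- function(doc) length(xml2::xml_find_all(doc, ".//tr"))

static_pg <- xml2::read_html(test_url)  # parses the HTML as served; no JavaScript is executed
rendered_pg <- hu_read_html(test_url)   # parses the page after HtmlUnit has processed it

# named vector with the row count seen by each reader
c(static = count_rows(static_pg), rendered = count_rows(rendered_pg))
```

If the two counts differ, the extra rows were added by JavaScript after the initial page load.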
### Content++
We can also get a HAR-like content + metadata dump:
```{r ex3}
xdf <- wc_inspect("https://rstudio.com")

colnames(xdf)

select(xdf, method, url, status_code, content_length, load_time)

group_by(xdf, content_type) %>%
  summarise(
    total_size = sum(content_length),
    total_load_time = sum(load_time)/1000
  )
```

### DSL
```{r ex4}
wc <- web_client(emulate = "chrome")

wc %>% wc_browser_info()

wc <- web_client()

wc %>% wc_go("https://usa.gov/")

# if you want to use purrr::map_ functions the result of wc_html_nodes() needs to be passed to as.list()
wc %>%
  wc_html_nodes("a") %>%
  sapply(wc_html_text, trim = TRUE) %>%
  head(10)

wc %>%
  wc_html_nodes(xpath=".//a") %>%
  sapply(wc_html_text, trim = TRUE) %>%
  head(10)

wc %>%
  wc_html_nodes(xpath=".//a") %>%
  sapply(wc_html_attr, "href") %>%
  head(10)
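
# A sketch of the purrr route described in the comment earlier in this chunk.
# Assumptions: purrr is attached (e.g. via library(tidyverse)) and
# wc_html_text() returns a single string per node; `link_text` is just an
# illustrative variable name.
link_text <- wc %>%
  wc_html_nodes("a") %>%
  as.list() %>%
  map_chr(~ wc_html_text(.x, trim = TRUE))

head(link_text, 10)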
```

Handy function to get rendered plain text for text mining:
```{r ex5}
wc %>%
  wc_render("text") %>%
  substr(1, 300) %>%
  cat()
```

### htmlunit Metrics
```{r echo=FALSE}
cloc::cloc_pkg_md()
```

## Code of Conduct
Please note that this project is released with a Contributor Code of Conduct.
By participating in this project you agree to abide by its terms.