{"id":13400888,"url":"https://github.com/hrbrmstr/htmlunit","last_synced_at":"2025-03-21T12:31:00.871Z","repository":{"id":141238320,"uuid":"162031125","full_name":"hrbrmstr/htmlunit","owner":"hrbrmstr","description":"🕸🧰☕️Tools to Scrape Dynamic Web Content via the 'HtmlUnit' Java Library","archived":false,"fork":false,"pushed_at":"2020-08-19T12:55:13.000Z","size":30630,"stargazers_count":37,"open_issues_count":5,"forks_count":6,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-03-01T06:11:12.430Z","etag":null,"topics":["htmlunit","javascript","r","r-cyber","rstats","web-scraping"],"latest_commit_sha":null,"homepage":"","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hrbrmstr.png","metadata":{"files":{"readme":"README.Rmd","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2018-12-16T18:54:39.000Z","updated_at":"2023-09-08T17:48:18.000Z","dependencies_parsed_at":"2024-01-18T11:16:18.443Z","dependency_job_id":null,"html_url":"https://github.com/hrbrmstr/htmlunit","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hrbrmstr%2Fhtmlunit","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hrbrmstr%2Fhtmlunit/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hrbrmstr%2Fhtmlunit/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hrbrmstr%2Fhtmlunit/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hrbrmstr","download_url":"https://codeload.gi
thub.com/hrbrmstr/htmlunit/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244135909,"owners_count":20403798,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["htmlunit","javascript","r","r-cyber","rstats","web-scraping"],"created_at":"2024-07-30T19:00:56.687Z","updated_at":"2025-03-21T12:30:58.603Z","avatar_url":"https://github.com/hrbrmstr.png","language":"R","readme":"---\noutput: \n  rmarkdown::github_document\neditor_options: \n  chunk_output_type: console\n---\n```{r pkg-knitr-opts, include=FALSE}\nhrbrpkghelpr::global_opts()\n```\n\n```{r badges, results='asis', echo=FALSE, cache=FALSE}\nhrbrpkghelpr::stinking_badges()\n```\n\n```{r description, results='asis', echo=FALSE, cache=FALSE}\nhrbrpkghelpr::yank_title_and_description()\n```\n\n## What's Inside The Tin\n\nThe following functions are implemented:\n\n### DSL\n\n- `web_client`/`webclient`:\tCreate a new HtmlUnit WebClient instance\u003cbr/\u003e\u003cbr/\u003e\n\n- `wc_go`:\tVisit a URL\u003cbr/\u003e\n\n- `wc_html_nodes`:\tSelect nodes from web client active page html content\n- `wc_html_text`:\tExtract attributes, text and tag name from webclient page html content\u003cbr/\u003e\u003cbr/\u003e\n- `wc_html_attr`:\tExtract attributes, text and tag name from webclient page html content\n- `wc_html_name`:\tExtract attributes, text and tag name from webclient page html content\n\n- `wc_headers`:\tReturn response headers of the last web request for current page\n- `wc_browser_info`:\tRetrieve information about the browser used to create the 
'webclient'\n- `wc_content_length`:\tReturn content length of the last web request for current page\n- `wc_content_type`:\tReturn content type of web request for current page\u003cbr/\u003e\u003cbr/\u003e\n\n- `wc_render`:\tRetrieve current page contents\u003cbr/\u003e\u003cbr/\u003e\n\n- `wc_css`:\tEnable/Disable CSS support\n- `wc_dnt`:\tEnable/Disable Do-Not-Track\n- `wc_geo`:\tEnable/Disable Geolocation\n- `wc_img_dl`:\tEnable/Disable Image Downloading\n- `wc_load_time`:\tReturn load time of the last web request for current page\n- `wc_resize`:\tResize the virtual browser window\n- `wc_status`:\tReturn status code of web request for current page\n- `wc_timeout`:\tChange default request timeout\n- `wc_title`:\tReturn page title for current page\n- `wc_url`:\tReturn load time of the last web request for current page\n- `wc_use_insecure_ssl`:\tEnable/Disable Ignoring SSL Validation Issues\n- `wc_wait`:\tBlock HtmlUnit final rendering until all background JavaScript tasks have finished executing\n\n### Just the Content (pls)\n\n- `hu_read_html`:\tRead HTML from a URL with Browser Emulation \u0026 in a JavaScript Context\n\n### Content++\n\n- `wc_inspect`:  Perform a \"Developer Tools\"-like Network Inspection of a URL\n\n## Installation\n\n```{r install-ex, results='asis', echo=FALSE, cache=FALSE}\nhrbrpkghelpr::install_block()\n```\n\n## Usage\n\n```{r cache=FALSE}\nlibrary(htmlunit)\nlibrary(tidyverse) # for some data ops; not req'd for pkg\n\n# current version\npackageVersion(\"htmlunit\")\n\n```\n\nSomething `xml2::read_html()` cannot do, read the table from \u003chttps://hrbrmstr.github.io/htmlunitjars/index.html\u003e:\n\n![](man/figures/test-url-table.png)\n\n```{r ex1}\ntest_url \u003c- \"https://hrbrmstr.github.io/htmlunitjars/index.html\"\n\npg \u003c- xml2::read_html(test_url)\n\nhtml_table(pg)\n```\n\n☹️\n\nBut, `hu_read_html()` can!\n\n```{r ex2}\npg \u003c- hu_read_html(test_url)\n\nhtml_table(pg)\n```\n\nAll without needing a separate Selenium 
or Splash server instance.\n\n### Content++\n\nWe can also get a HAR-like content + metadata dump:\n\n```{r ex3}\nxdf \u003c- wc_inspect(\"https://rstudio.com\")\n\ncolnames(xdf)\n\nselect(xdf, method, url, status_code, content_length, load_time)\n\ngroup_by(xdf, content_type) %\u003e% \n  summarise(\n    total_size = sum(content_length), \n    total_load_time = sum(load_time)/1000\n  )\n```\n\n### DSL\n\n```{r ex4}\nwc \u003c- web_client(emulate = \"chrome\")\n\nwc %\u003e% wc_browser_info()\n\nwc \u003c- web_client()\n\nwc %\u003e% wc_go(\"https://usa.gov/\")\n\n# if you want to use purrr::map_ functions the result of wc_html_nodes() needs to be passed to as.list()\n\nwc %\u003e%\n  wc_html_nodes(\"a\") %\u003e%\n  sapply(wc_html_text, trim = TRUE) %\u003e% \n  head(10)\n\nwc %\u003e%\n  wc_html_nodes(xpath=\".//a\") %\u003e%\n  sapply(wc_html_text, trim = TRUE) %\u003e% \n  head(10)\n\nwc %\u003e%\n  wc_html_nodes(xpath=\".//a\") %\u003e%\n  sapply(wc_html_attr, \"href\") %\u003e% \n  head(10)\n```\n\nHandy function to get rendered plain text for text mining:\n\n```{r ex5}\nwc %\u003e% \n  wc_render(\"text\") %\u003e% \n  substr(1, 300) %\u003e% \n  cat()\n```\n\n### htmlunit Metrics\n\n```{r echo=FALSE}\ncloc::cloc_pkg_md()\n```\n\n## Code of Conduct\n\nPlease note that this project is released with a Contributor Code of Conduct.\nBy participating in this project you agree to abide by its terms.\n","funding_links":[],"categories":["R"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhrbrmstr%2Fhtmlunit","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhrbrmstr%2Fhtmlunit","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhrbrmstr%2Fhtmlunit/lists"}