https://github.com/alastairrushworth/htmldf
🖥 ✂️ 📁 Simple scraping and tidy webpage summaries
- Host: GitHub
- URL: https://github.com/alastairrushworth/htmldf
- Owner: alastairrushworth
- Created: 2020-03-01T11:13:45.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2022-08-07T06:33:41.000Z (over 2 years ago)
- Last Synced: 2024-08-13T07:15:11.400Z (4 months ago)
- Language: R
- Homepage:
- Size: 27.8 MB
- Stars: 82
- Watchers: 5
- Forks: 5
- Open Issues: 0
Metadata Files:
- Readme: README.Rmd
Awesome Lists containing this project
- jimsghstars - alastairrushworth/htmldf - 🖥 ✂️ 📁 Simple scraping and tidy webpage summaries (R)
README
---
output: github_document
---

# htmldf
![build](https://github.com/alastairrushworth/htmldf/workflows/R-CMD-check/badge.svg)
[![codecov](https://codecov.io/gh/alastairrushworth/htmldf/branch/master/graph/badge.svg)](https://app.codecov.io/gh/alastairrushworth/htmldf)
[![CRAN status](https://www.r-pkg.org/badges/version/htmldf)](https://CRAN.R-project.org/package=htmldf)
[![](https://cranlogs.r-pkg.org/badges/htmldf)](https://CRAN.R-project.org/package=htmldf)
[![cran checks](https://cranchecks.info/badges/summary/htmldf)](https://cran.r-project.org/web/checks/check_results_htmldf.html)

Overview
---

The package `htmldf` contains a single function, `html_df()`, which accepts a vector of URLs as input and, for each one, attempts to download the page, then extract and parse the HTML. The result is returned as a `tibble` in which each row corresponds to a document and the columns contain page attributes and metadata extracted from the HTML, including:

+ page title
+ inferred language (uses Google's compact language detector)
+ RSS feeds
+ tables coerced to tibbles, where possible
+ hyperlinks
+ image links
+ social media profiles
+ the inferred programming language of any text with code tags
+ page size, generator and server
+ page accessed date
+ page published or last updated dates
+ HTTP status code
+ full page source html

Installation
---

To install the CRAN version of the package:

```{r, eval=FALSE}
install.packages('htmldf')
```

To install the development version of the package:

```{r, eval=FALSE}
remotes::install_github('alastairrushworth/htmldf')
```

Usage
---

First define a vector of URLs you want to gather information from. The function `html_df()` returns a `tibble` where each row corresponds to a webpage, and each column corresponds to an attribute of that webpage:

```{r, message=FALSE, warning=FALSE}
library(htmldf)
library(dplyr)

# An example vector of URLs to fetch data for
urlx <- c("https://alastairrushworth.github.io/Visualising-Tour-de-France-data-in-R/",
          "https://medium.com/dair-ai/pytorch-1-2-introduction-guide-f6fa9bb7597c",
          "https://www.tensorflow.org/tutorials/images/cnn",
          "https://www.analyticsvidhya.com/blog/2019/09/introduction-to-pytorch-from-scratch/")

# use html_df() to gather data
z <- html_df(urlx, show_progress = FALSE)

# have a quick look at the first page
glimpse(z[1, ])
```

To see the page titles, look at the `title` column.

```{r}
z %>% select(title, url2)
```

Where a page contains tables embedded in `<table>` tags, these are gathered into the list column `tables`. `html_df()` will attempt to coerce each table to a `tibble`; where that isn't possible, the raw HTML is returned instead.

```{r}
z$tables
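
# A hedged sketch, not part of htmldf: count how many tables on each page
# were successfully coerced to a data frame. `count_coerced_tables()` is a
# hypothetical helper, and it assumes each element of `z$tables` is either
# NA or a list of tables.
count_coerced_tables <- function(tables_col) {
  vapply(
    tables_col,
    function(page_tables) sum(vapply(page_tables, is.data.frame, logical(1))),
    integer(1)
  )
}
count_coerced_tables(z$tables)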
```

`html_df()` does its best to find RSS feeds embedded in the page:

```{r}
z$rss
```

`html_df()` will try to parse out any social profiles embedded or mentioned on the page. Currently, this includes profiles for the sites
`bitbucket`, `dev.to`, `discord`, `facebook`, `github`, `gitlab`, `instagram`, `kakao`, `keybase`, `linkedin`, `mastodon`, `medium`, `orcid`, `patreon`, `researchgate`, `stackoverflow`, `reddit`, `telegram`, `twitter`, `youtube`
```{r}
z$social
```

Code language is inferred from `<code>` chunks using a predictive model. The `code_lang` column contains a numeric score where values near 1 indicate mostly R code and values near -1 indicate mostly Python code:

```{r}
z %>% select(code_lang, url2)
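
# A hedged sketch, not part of htmldf: turn the numeric score into a coarse
# label. `label_code_lang()` is a hypothetical helper and the 0.5 cut-off is
# an arbitrary assumption chosen only for illustration.
label_code_lang <- function(score, cutoff = 0.5) {
  label <- rep("mixed or unclear", length(score))
  label[!is.na(score) & score >= cutoff] <- "mostly R"
  label[!is.na(score) & score <= -cutoff] <- "mostly Python"
  label[is.na(score)] <- NA_character_  # no score available for this page
  label
}
z %>%
  mutate(code_guess = label_code_lang(code_lang)) %>%
  select(code_guess, code_lang, url2)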
```

Publication dates

```{r}
z %>% select(published, url2)
```

Comments? Suggestions? Issues?
---

Any feedback is welcome! Feel free to open a GitHub issue or send me a message on [twitter](https://twitter.com/rushworth_a).