---
output: github_document
---

# htmldf
![build](https://github.com/alastairrushworth/htmldf/workflows/R-CMD-check/badge.svg)
[![codecov](https://codecov.io/gh/alastairrushworth/htmldf/branch/master/graph/badge.svg)](https://app.codecov.io/gh/alastairrushworth/htmldf)
[![CRAN status](https://www.r-pkg.org/badges/version/htmldf)](https://CRAN.R-project.org/package=htmldf)
[![](https://cranlogs.r-pkg.org/badges/htmldf)](https://CRAN.R-project.org/package=htmldf)
[![cran checks](https://cranchecks.info/badges/summary/htmldf)](https://cran.r-project.org/web/checks/check_results_htmldf.html)

Overview
---

The package `htmldf` contains a single function, `html_df()`, which accepts a vector of URLs, attempts to download each page, and extracts and parses the HTML. The result is returned as a `tibble` in which each row corresponds to a document and the columns contain page attributes and metadata extracted from the HTML, including:

+ page title
+ inferred language (uses Google's compact language detector)
+ RSS feeds
+ tables coerced to tibbles, where possible
+ hyperlinks
+ image links
+ social media profiles
+ the inferred programming language of any text with code tags
+ page size, generator and server
+ page accessed date
+ page published or last updated dates
+ HTTP status code
+ full page source html
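
For example, a minimal call might look like the following (the URL is illustrative; a single URL is just a length-one vector):

```{r, eval=FALSE}
library(htmldf)

# fetch a single page; the result is a one-row tibble
page <- html_df("https://www.r-project.org/")
nrow(page)   # one row per document
names(page)  # one column per attribute listed above
```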

Installation
---

To install the CRAN version of the package:

```{r, eval=FALSE}
install.packages('htmldf')
```

To install the development version of the package:

```{r, eval=FALSE}
remotes::install_github('alastairrushworth/htmldf')
```

Usage
---

First define a vector of URLs you want to gather information from. The function `html_df()` returns a `tibble` where each row corresponds to a webpage, and each column corresponds to an attribute of that webpage:
```{r, message=FALSE, warning=FALSE}
library(htmldf)
library(dplyr)

# An example vector of URLs to fetch data for
urlx <- c("https://alastairrushworth.github.io/Visualising-Tour-de-France-data-in-R/",
          "https://medium.com/dair-ai/pytorch-1-2-introduction-guide-f6fa9bb7597c",
          "https://www.tensorflow.org/tutorials/images/cnn",
          "https://www.analyticsvidhya.com/blog/2019/09/introduction-to-pytorch-from-scratch/")

# use html_df() to gather data
z <- html_df(urlx, show_progress = FALSE)

# have a quick look at the first page
glimpse(z[1, ])
```

To see the page titles, look at the `title` column.
```{r}
z %>% select(title, url2)
```

Where a page has tables embedded in `<table>` tags, these will be gathered into the list column `tables`. `html_df()` will attempt to coerce each table to a `tibble`; where that isn't possible, the raw html is returned instead.

```{r}
z$tables
```
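
Because coercion can fail, entries in `tables` may mix tibbles and raw html. A sketch for keeping only the successfully parsed tables, assuming each element of `tables` is itself a list of candidate tables:

```{r, eval=FALSE}
library(purrr)

# drop anything that did not coerce to a data frame
parsed_tables <- map(z$tables, ~ keep(.x, is.data.frame))
```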

`html_df()` does its best to find RSS feeds embedded in the page:
```{r}
z$rss
```
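
To get one row per feed rather than one row per page, something like this should work (a sketch, assuming `rss` is a list column that `tidyr::unnest()` can expand):

```{r, eval=FALSE}
library(tidyr)

z %>%
  select(url2, rss) %>%
  unnest(rss)
```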

`html_df()` will try to parse out any social media profiles embedded or mentioned on the page. Currently, this includes profiles from the following sites:

`bitbucket`, `dev.to`, `discord`, `facebook`, `github`, `gitlab`, `instagram`, `kakao`, `keybase`, `linkedin`, `mastodon`, `medium`, `orcid`, `patreon`, `researchgate`, `stackoverflow`, `reddit`, `telegram`, `twitter`, `youtube`

```{r}
z$social
```
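
For example, to keep only the pages that mention a GitHub profile (a sketch, assuming `social` is a list column of character vectors of profile URLs):

```{r, eval=FALSE}
library(purrr)

z %>%
  filter(map_lgl(social, ~ any(grepl("github", .x))))
```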

Code language is inferred from `<code>` chunks using a predictive model. The `code_lang` column contains a numeric score, where values near 1 indicate mostly R code and values near -1 indicate mostly Python code:
```{r}
z %>% select(code_lang, url2)
```
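
One way to turn the score into a label, using the sign convention described above (a sketch; the cutoff at 0 is an assumption, not part of the package):

```{r, eval=FALSE}
z %>%
  mutate(lang_guess = case_when(
    code_lang > 0 ~ "R",
    code_lang < 0 ~ "Python",
    TRUE          ~ NA_character_
  )) %>%
  select(lang_guess, code_lang, url2)
```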

Publication or last-updated dates, where they can be detected, are returned in the `published` column:

```{r}
z %>% select(published, url2)
```
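
To order pages by recency (a sketch, assuming `published` parses to a date or datetime; `arrange()` sorts missing dates last):

```{r, eval=FALSE}
z %>%
  select(published, url2) %>%
  arrange(desc(published))
```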

Comments? Suggestions? Issues?
---

Any feedback is welcome! Feel free to open a GitHub issue or send me a message on [twitter](https://twitter.com/rushworth_a).