Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/JBGruber/paperboy

A comprehensive (eventually) collection of webscraping scripts for news media sites
https://github.com/JBGruber/paperboy

r r-package rstats webscraping

Last synced: 3 months ago
JSON representation

A comprehensive (eventually) collection of webscraping scripts for news media sites

Awesome Lists containing this project

README

        

---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)

knitr::knit_hooks$set(document = function(x, options)
gsub("[\\#", "[#", x, fixed = TRUE))

options(paperboy_verbose = FALSE)

trim <- function(x, n, e = "...") {
ifelse(nchar(x) > n,
paste0(gsub("\\s+$", "",
gsub("\\w+$", "",
strtrim(x, width = n)
)),
e),
x)
}

knit_print.tbl_df = function(x, ...) {
res = paste(c("", "", knitr::kable(x)), collapse = "\n")
knitr::asis_output(res)
}

registerS3method(
"knit_print", "tbl_df", knit_print.tbl_df,
envir = asNamespace("knitr")
)
```

# paperboy

[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)
[![R-CMD-check](https://github.com/JBGruber/paperboy/workflows/R-CMD-check/badge.svg)](https://github.com/JBGruber/paperboy/actions)
[![Codecov test coverage](https://codecov.io/gh/JBGruber/paperboy/branch/main/graph/badge.svg)](https://codecov.io/gh/JBGruber/paperboy?branch=main)

The philosophy of `paperboy` is that the package is a comprehensive collection of webscraping scripts for news media sites.
Many data scientists and researchers write their own code when they have to retrieve news media content from websites.
At the end of research projects, this code is often collecting digital dust on researchers hard drives instead of being made public for others to employ.
`paperboy` offers writers of webscraping scripts a clear path to publish their code and earn co-authorship on the package (see [For developers](#for-developers) Section).
For users, the promise is simple: `paperboy` delivers news media data from many websites in a consistent format.
Check which domains are already supported in [the table below](#available-scrapers) or with the command `pb_available()`.

## Installation

`paperboy` is not on [CRAN](https://CRAN.R-project.org) yet.
Install via `remotes` (first install `remotes` via `install.packages("remotes")`:

``` r
remotes::install_github("JBGruber/paperboy")
```

## For Users

Say you have a link to a news media article, for example, from [mediacloud.org](https://mediacloud.org/).
Simply supply one or multiple links to a media article to the main function, `pb_deliver`:

```{r example, eval=FALSE}
library(paperboy)
df <- pb_deliver("https://tinyurl.com/386e98k5")
df
```

```{r exampleback, echo=FALSE, message=FALSE}
library(paperboy)
library(dplyr)
df <- pb_deliver("https://tinyurl.com/386e98k5")
df %>%
mutate(across(headline:text, ~ trim(.x, 30)))
```

The returned `data.frame` contains important meta information about the news items and their full text.
Notice, that the function had no problem reading the link, even though it was shortened.
***`paperboy` is an unfinished and highly experimental package at the moment.***
You will therefore often encounter this warning:

```{r nomethod}
pb_deliver("google.com")
```

The function still returns a data.frame, but important information is missing --- in this case because it isn't there.
The other URLs will be processed normally though.
If you have a dead link in your `url` vector, the `status` column will be different from `200` and contain `NA`s.

If you are unhappy with results from the generic approach, you can still use the second function from the package to download raw html code and later parse it yourself:

```{r examplecollect, eval=FALSE}
pb_collect("google.com")
```

```{r examplecollectback, echo=FALSE, message=FALSE}
df <- pb_collect("google.com")
df %>%
mutate(across(expanded_url:status, ~ trim(.x, 30)),
content_raw = paste0(strtrim(content_raw, width = 30), "..."))
```

`pb_collect` uses concurrent requests to download many pages at the same time, making the function very quick to collect large amounts of data.
You can then experiment with `rvest` or another package to extract the information you want from `df$content_raw`.

## For developers

If there is no scraper for a news site and you want to contribute one to this project, you can become a co-author of this package by adding it via a pull request.
First check [available scrapers](#available-scrapers) and open [issues](https://github.com/JBGruber/paperboy/issues) and [pull requests](https://github.com/JBGruber/paperboy/pulls).
Open a new issue or comment on an existing one to communicate that you are working on a scraper (so that work isn't done twice).
Then start by pulling a few articles with `pb_collect` and start to parse the html code in the `content_raw` column (preferably with `rvest`).

Every webscraper should retrieve a `tibble` with the following format:

```{r format, echo=FALSE}
tibble::tribble(
~url, ~expanded_url, ~domain, ~status, ~datetime, ~headline, ~author, ~text, ~misc,
"character", "character", "character", "integer", "as.POSIXct", "character", "character", "character", "list",
"the original url fed to the scraper", "the full url", "the domain", "http status code", "publication datetime", "the headline", "the author", "the full text", "all other information that can be consistently found on a specific outlet"
)
```

Since some outlets will give you additional information, the `misc` column was included so these can be retained.

## Available Scrapers

```{r available, echo=FALSE}
read.csv("./inst/status.csv") %>%
arrange(domain) %>%
select(-rss) %>%
knitr::kable(escape = FALSE)
```

- ![](https://img.shields.io/badge/status-gold-%23ffd700.svg): Runs without known issues
- ![](https://img.shields.io/badge/status-silver-%23C0C0C0.svg): Runs with some issues
- ![](https://img.shields.io/badge/status-broken-%23D8634C)![](https://img.shields.io/badge/status-requested-lightgrey): Currently not working, fix has been requested