Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/koheiw/newspapers

R package to import articles from newspaper databases
https://github.com/koheiw/newspapers

r text-analysis

Last synced: 6 days ago
JSON representation

R package to import articles from newspaper databases

Awesome Lists containing this project

README

        

---
output: github_document
---

```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "##",
fig.path = "man/images/",
fig.width = 6,
fig.height = 2,
dpi = 150
)
```

# newspapers

Import files downloaded from various newspaper databases

## Supported databases and formats

Database | File type | Function
-- | -- | --
Asahi Shimbun Kikuzo | .html | `import_kikuzo()`
Yomiuri Shimbun Yomidasu | .html | `import_yomidasu()`
Dow Jones Faciva | .html | `import_factiva()`
Nexis UK | .html | `import_nexis()`
Lexis Advance | .docx | `import_lexis()`
Integrum | .doc .xlsx .html | `import_integrum()`

## Install the package

```{r eval=FALSE}
install.packages("devtools")
devtools::install_github("koheiw/newspapers")
```

## Load files

```{r}
require(newspapers, quietly = TRUE)
dat <- import_nexis("tests/data/nexis/guardian_1986-01-01_0001.html")
```

```{r, asis=TRUE, echo=FALSE}
knitr::kable(head(dat[c("pub", "edition", "date", "byline", "length", "section", "head")]))
```

## Parse date

The functions return datelines as character, so users should parse them using `stri_datetime_parse()` with appropriate locale and time-zone settings.

```{r}
require(stringi)
dat_en <- import_nexis("tests/data/nexis/guardian_1986-01-01_0001.html")
dat_en$date2 <- as.Date(stri_datetime_parse(dat_en$date, "MM, dd, uuuu", tz = "GMT"))
head(dat_en[c("date", "date2")])

dat_de <- import_nexis('tests/data/nexis/spiegel_2012-02-01_0001.html')
dat_de$date2 <- as.Date(stri_datetime_parse(dat_de$date, "dd. MM uuuu", tz = "CET",
locale = "de_DE"))
head(dat_de[c("date", "date2")])
```

## Download files

### Asahi Shimbun Kikuzo

In the Kikuzo database, you can display up to 100 full-texts news articles in your browser. However, you cannot download the articles as an HTML file in a format this package can read, becasue news articles are in frames. To save the articles, you should use Firefox's context menu "Save Frame As" (or other browsers' equivalent) to save pages in a frame.

![](images/kikuzo.png)

### Dow Jones Faciva

The Factiva database offers a button to display full-texts in a pop-up window: check all boxes, click "Format for Saving" (Floppy Disk icon) and select "Article Format". You only need to save the HTML file in the pop-up window by "Save Page" (or press Ctrl + S).

![](images/factiva.png)