Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/koheiw/newspapers
R package to import articles from newspaper databases
https://github.com/koheiw/newspapers
r text-analysis
Last synced: 6 days ago
JSON representation
R package to import articles from newspaper databases
- Host: GitHub
- URL: https://github.com/koheiw/newspapers
- Owner: koheiw
- Created: 2018-03-22T12:07:43.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2024-02-29T21:33:55.000Z (9 months ago)
- Last Synced: 2024-10-28T03:40:38.491Z (15 days ago)
- Topics: r, text-analysis
- Language: HTML
- Homepage:
- Size: 5.99 MB
- Stars: 14
- Watchers: 3
- Forks: 4
- Open Issues: 1
-
Metadata Files:
- Readme: README.Rmd
Awesome Lists containing this project
README
---
output: github_document
---```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "##",
fig.path = "man/images/",
fig.width = 6,
fig.height = 2,
dpi = 150
)
```# newspapers
Import files downloaded from various newspaper databases
## Supported databases and formats
Database | File type | Function
-- | -- | --
Asahi Shimbun Kikuzo | .html | `import_kikuzo()`
Yomiuri Shimbun Yomidasu | .html | `import_yomidasu()`
Dow Jones Faciva | .html | `import_factiva()`
Nexis UK | .html | `import_nexis()`
Lexis Advance | .docx | `import_lexis()`
Integrum | .doc .xlsx .html | `import_integrum()`## Install the package
```{r eval=FALSE}
install.packages("devtools")
devtools::install_github("koheiw/newspapers")
```## Load files
```{r}
require(newspapers, quietly = TRUE)
dat <- import_nexis("tests/data/nexis/guardian_1986-01-01_0001.html")
``````{r, asis=TRUE, echo=FALSE}
knitr::kable(head(dat[c("pub", "edition", "date", "byline", "length", "section", "head")]))
```## Parse date
The functions return datelines as character, so users should parse them using `stri_datetime_parse()` with appropriate locale and time-zone settings.
```{r}
require(stringi)
dat_en <- import_nexis("tests/data/nexis/guardian_1986-01-01_0001.html")
dat_en$date2 <- as.Date(stri_datetime_parse(dat_en$date, "MM, dd, uuuu", tz = "GMT"))
head(dat_en[c("date", "date2")])dat_de <- import_nexis('tests/data/nexis/spiegel_2012-02-01_0001.html')
dat_de$date2 <- as.Date(stri_datetime_parse(dat_de$date, "dd. MM uuuu", tz = "CET",
locale = "de_DE"))
head(dat_de[c("date", "date2")])
```## Download files
### Asahi Shimbun Kikuzo
In the Kikuzo database, you can display up to 100 full-texts news articles in your browser. However, you cannot download the articles as an HTML file in a format this package can read, becasue news articles are in frames. To save the articles, you should use Firefox's context menu "Save Frame As" (or other browsers' equivalent) to save pages in a frame.
![](images/kikuzo.png)
### Dow Jones Faciva
The Factiva database offers a button to display full-texts in a pop-up window: check all boxes, click "Format for Saving" (Floppy Disk icon) and select "Article Format". You only need to save the HTML file in the pop-up window by "Save Page" (or press Ctrl + S).
![](images/factiva.png)