Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/hrbrmstr/newsflash
Tools to Work with the Internet Archive and GDELT Television Explorer in R
https://github.com/hrbrmstr/newsflash
gdelt-television-explorer internet-archive r r-cyber rstats
Last synced: 3 months ago
JSON representation
Tools to Work with the Internet Archive and GDELT Television Explorer in R
- Host: GitHub
- URL: https://github.com/hrbrmstr/newsflash
- Owner: hrbrmstr
- Created: 2017-01-26T03:57:28.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2022-12-01T14:40:11.000Z (about 2 years ago)
- Last Synced: 2024-10-12T21:24:02.209Z (4 months ago)
- Topics: gdelt-television-explorer, internet-archive, r, r-cyber, rstats
- Language: R
- Size: 3.59 MB
- Stars: 90
- Watchers: 8
- Forks: 9
- Open Issues: 8
-
Metadata Files:
- Readme: README.Rmd
Awesome Lists containing this project
README
---
output: rmarkdown::github_document
editor_options:
chunk_output_type: console
---*** BREAKING CHANGES ***
# newsflash
Tools to Work with the Internet Archive and GDELT Television Explorer
## Description
Ref:
-
-TV Explorer:
>_"In collaboration with the Internet Archive's Television News Archive, GDELT's Television Explorer allows you to keyword search the closed captioning streams of the Archive's 6 years of American television news and explore macro-level trends in how America's television news is shaping the conversation around key societal issues. Unlike the Archive's primary Television News interface, which returns results at the level of an hour or half-hour "show," the interface here reaches inside of those six years of programming and breaks the more than one million shows into individual sentences and counts how many of those sentences contain your keyword of interest. Instead of reporting that CNN had 24 hour-long shows yesterday that mentioned Donald Trump, the interface here will count how many sentences uttered on CNN yesterday mentioned his name - a vastly more accurate metric for assessing media attention."_Third Eye:
>_The TV News Archive's Third Eye project captures the chyrons–or narrative text–that appear on the lower third of TV news screens and turns them into downloadable data and a Twitter feed for research, journalism, online tools, and other projects. At project launch (September 2017) we are collecting chyrons from BBC News, CNN, Fox News, and MSNBC–more than four million collected over just two weeks."_An advantage of using this over the TV Explorer interactive selector & downloader or Third Eye API is that you get tidy tibbles with this package, ready to use in R.
NOTE: While I don't claim that this alpha-package is anywhere near perfect, the IA/GDELT TV API hiccups every so often so when there are critical errors run the same query in their web interface before submitting an issue. I kept getting errors when searching all affiliate markets for the "mexican president" query that also generate errors on the web site when JSON is selected as output (it's fine on the web site if the choice is interactive browser visualizations). Submit those errors to them, not here.
## What's Inside The Tin
The following functions are implemented:
- `list_chyrons`: Retrieve Third Eye chyron index
- `list_networks`: Helper function to identify station/network keyword and corpus date range for said market
- `newsflash`: Tools to Work with the Internet Archive and GDELT Television Explorer
- `query_tv`: Issue a query to the TV Explorer
- `read_chyrons`: Retrieve TV News Archive chyrons from the Internet Archive's Third Eye project
- `gd_top_trending`: Top Trending (GDELT)
- `iatv_top_trending: Top Trending Topics (Internet Archive TV Archive)
- `word_cloud`: Retrieve top words that appear most frequently in clips matching your search## Installation
```{r eval=FALSE}
devtools::install_github("hrbrmstr/newsflash")
``````{r message=FALSE, warning=FALSE, error=FALSE}
options(width=120)
```## Usage
```{r message=FALSE, warning=FALSE, error=FALSE}
library(newsflash)
library(ggalt)
library(hrbrthemes)
library(tidyverse)# current verison
packageVersion("newsflash")
```### "Third Eye" Chyrons are simpler so we'll start with them first:
```{r fig.width=8, fig.height=5, cache=TRUE}
list_chyrons()ch <- read_chyrons("2018-04-13")
mutate(
ch,
hour = lubridate::hour(ts),
text = tolower(text),
mention = grepl("comey", text)
) %>%
filter(mention) %>%
count(hour, channel) %>%
ggplot(aes(hour, n)) +
geom_segment(aes(xend=hour, yend=0), color = "lightslategray", size=1) +
scale_x_continuous(name="Hour (GMT)", breaks=seq(0, 23, 6),
labels=sprintf("%02d:00", seq(0, 23, 6))) +
scale_y_continuous(name="# Chyrons", limits=c(0,20)) +
facet_wrap(~channel, scales="free") +
labs(title="Chyrons mentioning 'Comey' per hour per channel",
caption="Source: Internet Archive Third Eye project & ") +
theme_ipsum_rc(grid="Y")
```## Now for the TV Explorer:
### See what networks & associated corpus date ranges are available:
```{r}
list_networks(widget=FALSE)
```### Basic search:
```{r fig.width=8, fig.height=7, cache=TRUE}
comey <- query_tv('comey', start_date = "2018-04-01")comey
query_tv('comey', start_date = "2018-04-01") %>%
arrange(date) %>%
ggplot(aes(date, value, group=network)) +
ggalt::geom_xspline(aes(color=network)) +
ggthemes::scale_color_tableau(name=NULL) +
labs(x=NULL, y="Volume Metric", title="'Comey' Trends Across National Networks") +
facet_wrap(~network) +
theme_ipsum_rc(grid="XY") +
theme(legend.position="none")
``````{r cache=TRUE}
query_tv("comey Network:CNN", mode = "TimelineVol", start_date = "2018-01-01") %>%
arrange(date) %>%
ggplot(aes(date, value, group=network)) +
ggalt::geom_xspline(color="lightslategray") +
ggthemes::scale_color_tableau(name=NULL) +
labs(x=NULL, y="Volume Metric", title="'Comey' Trend on CNN") +
theme_ipsum_rc(grid="XY")
```### Relative Network Attention To Syria since January 1, 2018
```{r cache=TRUE}
query_tv('syria Market:"National"', mode = "StationChart", start_date = "2018-01-01") %>%
arrange(desc(count)) %>%
knitr::kable("markdown")
```### Video Clips
```{r cache=TRUE}
clips <- query_tv('comey Market:"National"', mode = "ClipGallery", start_date = "2018-01-01")clips
````r clips$show_date[1]` | `r clips$station[1]` | `r clips$show[1]`
`r clips$snippet[1]`
### "Word Cloud" (top associated words to the query)
```{r fig.height=8, fig.width=8, cache=TRUE}
wc <- query_tv('hannity Market:"National"', mode = "WordCloud", start_date = "2018-04-13")ggplot(wc, aes(x=1, y=1)) +
ggrepel::geom_label_repel(aes(label=label, size=count), segment.colour="#00000000", segment.size=0) +
scale_size_continuous(trans="sqrt") +
labs(x=NULL, y=NULL) +
theme_ipsum_rc(grid="") +
theme(axis.text=element_blank()) +
theme(legend.position="none")
```### Last 15 Minutes Top Trending
```{r}
gd_top_trending()
```### Top Overall Trending from the Internet Archive TV Archive (2017 and earlier)
```{r}
iatv_top_trending("2017-12-01 18:00", "2017-12-02 06:00")
```