https://github.com/jimjam-slam/drumguests

Tidy appearance frequency data for ABC's The Drum.
https://github.com/jimjam-slam/drumguests

news open-data politics r rstats tidy-data

Last synced: 7 months ago
JSON representation

Tidy appearance frequency data for ABC's The Drum.

Host: GitHub
URL: https://github.com/jimjam-slam/drumguests
Owner: jimjam-slam
Created: 2018-07-30T08:47:36.000Z (almost 8 years ago)
Default Branch: master
Last Pushed: 2019-05-14T00:22:50.000Z (about 7 years ago)
Last Synced: 2024-01-27T06:38:23.071Z (over 2 years ago)
Topics: news, open-data, politics, r, rstats, tidy-data
Language: HTML
Homepage:
Size: 474 KB
Stars: 1
Watchers: 2
Forks: 1
Open Issues: 1
Metadata Files:
- Readme: README.Rmd

Awesome Lists containing this project

README

          ---

output: github_document

---

```{r, include = FALSE}

knitr::opts_chunk$set(

  collapse = TRUE,

  comment = "#>"

)

library(tidyverse)

library(magrittr)

library(lubridate)

library(rvest)

library(glue)

library(stringr)

filter = dplyr::filter

select = dplyr::select

```

# Analysis of The Drum hosts, panellists and guests

This script scrapes data on the hosts, panellists and guests of [The Drum](www.abc.net.au/news/programs/the-drum) from the ABC website. If you just want to grab some tidy data, it's currently in [`drum_tidy.csv`](drum_tidy.csv). It goes back to 27 April 2018 (as at `r Sys.Date()`).

**Note:** the formatted datetimes in the `dt` column are in UTC! You'll need to convert them to `"Australia/Sydney"` before using them.

To grab the data from the ABC site yourself, run this notebook!

Let's scrape data from the ABC website and find out how often people appear!

```{r}

drum_url = 'http://www.abc.net.au/news/programs/the-drum/'

pages = 1:10

episodes_id = 'collectionId-4'

# download data

episodes =

  map_dfr(pages, function(x) {

    episode_page =

      read_html(glue('{drum_url}?page={x}')) %>%

      html_nodes(glue('#{episodes_id} article'))

    tibble(

      title       = episode_page %>% html_nodes('h3') %>% html_text(),

      description = episode_page %>% html_nodes('p') %>% html_text())

  }) %>%

  print()

  

```

Okay, let's tidy it up and get the good bits out (regex makes me cry).

```{r}

episodes %<>%

  # format the date

  mutate(

    ep_date = str_replace_all(title, c("\n\n The Drum " = "", " \n" = "", "- " = "", "\\s$" = "")),

    dt = parse_date_time(ep_date, orders = "A, B d", tz = "Australia/Sydney")) %>%

  # isolate the host and people

  mutate(

    host = str_extract(description, regex("(?<=Host: )(.*)(?= Panel:)",

      dotall = TRUE)),

    panel = str_extract(description,

      regex(paste0("(?<=Panel: )(.*)(?=( Guest:| Guests:| Interview with:|",

        "The panel|We have))"),

      ignore_case = TRUE, dotall = TRUE))) %>%

  # separate guest and/or interviewees...

  mutate(

    guest = str_extract(panel, regex("(?<=Guest: )(.*)$", dotall = TRUE, ignore_case = TRUE)),

    interviewee = str_extract(panel, regex("(?<=Interview with: )(.*)$", dotall = TRUE, ignore_case = TRUE))) %>%

  # ... and remove them from the panel

  mutate(

    panel = str_replace(panel, regex("Guest: (.*)$"), ""),

    panel = str_replace(panel, regex("Interview with: (.*)$"), ""),

    panel = str_replace(panel, "\\.$", "")) %>%

  select(ep_date, dt, host, panel, guest, interviewee) %>%

  print()

```

Okay, now let's break these names up:

```{r}

episodes %<>%

  gather(key = "role", value = "name", host, panel, guest, interviewee) %>%

  separate_rows(name, sep = ", and |, | and ") %>%

  # remove any trailing spaces that snuck in

  mutate(name = str_replace_all(name, "\\s$", "")) %T>%

  write_csv('drum_tidy.csv') %T>%

  print()

```

Nowe we can visualise. For example, here are hosts by frequency:

```{r vis-hosts-freq}

episodes %>%

  filter(role == "host") %>%

  group_by(name) %>%

  summarise(n = n()) %>%

  ungroup() %>%

  drop_na(name) %T>%

  print() %>%

  {

    ggplot(., aes(x = reorder(name, n), y = n)) +

      geom_col() +

      coord_flip() +

      theme_minimal() +

      labs(

        x = 'Host',

        y = 'Number of appearances',

        title = 'The Drum hosts by appearance over the last year')

  }

```

And here's guests, panellists and interviewees:

```{r vis-others-freq, fig.height = 12, fig.width = 6}

episodes %>%

  filter(role != "host") %>%

  group_by(name, role) %>%

  summarise(n = n()) %>%

  ungroup() %>%

  drop_na(name) %>%

  top_n(30, n) %T>%

  print() %>%

  {

    ggplot(., aes(x = reorder(name, n), y = n)) +

      geom_col() +

      coord_flip() +

      theme_minimal(base_size = 8) +

      labs(

        x = 'Host',

        y = 'Number of appearances',

        title = 'Top 30 non-host appearance on THe Drum over the last year')

  }

```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/jimjam-slam/drumguests

Awesome Lists containing this project

README