---
output:
  github_document:
    keep_html: false
    html_preview: false
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(tidyverse)
```

# subtitlefreq

The goal of subtitlefreq is to provide alternative word frequency counts from
the SUBTLEX-US corpus. My problem with the SUBTLEX-US corpus is that it splits
contractions into a stem plus "t", so that "isn't", "aren't", "don't", etc.
each count as two words ("isn" + "t", and so on). This project is a naive
attempt to rebuild some frequency counts from scratch.

⚠️ This is not a drop-in replacement for the SUBTLEX-US corpus. I just wanted to
estimate how frequent "don't" is so I could compute some neighborhood frequency
measures.

## obtaining the raw SUBTLEX-US data

As a matter of caution, I won't provide the original subtitle corpus. But you
can download it the same way that I did.

- Go to the following URL: and download the
corpus.
- Move `Subtlex US.rar` into the `data/` folder.

We can test the download by reading the readme file embedded in the archive.

```{r}
readme <- archive::archive_read(
  "data/Subtlex US.rar",
  "Subtlex US/readme.txt"
)

writeLines(readLines(readme))
```

That last line is why I don't redistribute the corpus.

## then run targets

Assuming the data are downloaded and the appropriate packages are
installed, running `targets::tar_make()` should count the words in
the corpus. The targets pipeline also [downloads](https://osf.io/djpqz/) and
prepares the published version of the SUBTLEX-US frequency counts.
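
For readers unfamiliar with targets: a `_targets.R` plan has roughly the shape
sketched below. This is not the repo's actual pipeline; the target names are
the ones read with `tar_read()` in this README, but the functions on the right
are hypothetical placeholders.

```{r, eval = FALSE}
# A sketch of the general shape of the plan, not the repo's actual `_targets.R`.
# The functions named here are hypothetical placeholders.
library(targets)

list(
  tar_target(data_raw_corpus, read_subtitle_corpus("data/Subtlex US.rar")),
  tar_target(data_counts_pooled, count_words_in_batches(data_raw_corpus)),
  tar_target(data_counts_patched, apply_patches(data_counts_pooled)),
  tar_target(data_subtlexus, download_published_counts("https://osf.io/djpqz/"))
)
```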

```{r}
data_subtlexus <- targets::tar_read(data_subtlexus)
data_subtlexus
```

Data processing here works in three stages. First, there is a raw tibble of
lines with one row per subtitle line. The `batch` column is used for splitting
the corpus into batches so that words can be counted in parallel.

```{r}
library(tidyverse)
data_raw <- targets::tar_read(data_raw_corpus)
data_raw
```

These lines are patched (to remove some garbage I discovered where markup was
included in the dialogue), and then the words in each batch are counted.
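
The counting step might look something like the sketch below. This is
illustrative only and the repo's actual tokenizer may differ: lowercase a
batch of lines, split on anything that is not a letter or apostrophe, and
tally.

```{r}
# Illustrative sketch of per-batch counting, not the repo's actual code:
# lowercase the lines, split on non-letter/non-apostrophe characters, tally.
count_batch <- function(lines) {
  tibble(word = unlist(str_split(str_to_lower(lines), "[^a-z']+"))) |>
    filter(word != "") |>
    count(word)
}

count_batch(c("I don't know.", "Don't you?"))
```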

The counts from each batch are pooled together to give the following frequency
counts:

```{r}
data_pooled <- targets::tar_read(data_counts_pooled)
data_pooled |> print(n = 20)
```

You might notice that there are around 200,000 words here, instead of
the 60,000-70,000 words you might see elsewhere. This is probably because we
did not lemmatize or otherwise lump inflected forms together.

If I ever see any typos in the counts, I try to store a correction in the csv
file `data/patches.csv`. This is probably a futile, never-ending endeavor, but
hey, at least anyone can add to it.

```{r}
data_patches <- targets::tar_read("data_patches")
data_patches
```

These individual patches are followed by regular-expression-based
patches. These are stored in `data/patches-regex.csv` so that they can
have a `comment` column that describes their behavior.

```{r}
data_patches_regex <- targets::tar_read("data_patches_regex")
data_patches_regex
```
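
Here is a sketch of how such patch tables might be applied before recounting.
The column names below (`old`, `new`, `pattern`, `replacement`) are made up
for illustration and may not match the actual csv files.

```{r}
# A sketch with made-up column names, not the repo's actual patching code:
# exact word replacements first, then regex replacements, then re-pool counts.
toy_counts <- tibble(word = c("teh", "playlng", "cat"), n = c(2, 3, 5))
toy_patches <- tibble(old = "teh", new = "the")
toy_patches_regex <- tibble(pattern = "lng", replacement = "ing")

toy_counts |>
  left_join(toy_patches, by = c("word" = "old")) |>
  mutate(word = coalesce(new, word), new = NULL) |>
  mutate(
    word = str_replace_all(
      word,
      set_names(toy_patches_regex$replacement, toy_patches_regex$pattern)
    )
  ) |>
  count(word, wt = n)
```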

After applying the patches, we obtain the following counts:

```{r}
data <- targets::tar_read("data_counts_patched")
data
```

We can rationalize our patching activity by looking at how many words are
affected:

```{r}
data_pooled |>
  anti_join(data, by = "word") |>
  print(n = 20)
```

## comparisons

What text cleaning did the SUBTLEX-US authors originally do?

> We rejected all files with more than 2.5% type errors according to the
> spelling checker Aspell. In the end, 8,388 films and television
> episodes were retained, with a total of 51.0 million words (16.1
> million from television series, 14.3 million from films before 1990,
> and 20.6 million from films after 1990).

I seem to be missing around 3 million tokens.

```{r}
sum(data$n)
51000000 - sum(data$n)
```

Or perhaps I am missing just 2 million words, based on the published
frequencies:

```{r}
sum(data_subtlexus$FREQcount)
sum(data_subtlexus$FREQcount) - sum(data$n)
```

Our raw text has lots of segmentation errors where multiple words are
run together. For example, here are types with "in'" followed by at least
two more characters. If I split "tryin'to" into "tryin' to", I get 36 new words.
Would systematically fixing lots of these segmentation mistakes
rediscover 2,000,000 missing words?

```{r}
data_pooled |>
  filter(str_detect(word, "in'..+$")) |>
  mutate(sum = sum(n))
```
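
One possible resegmentation rule, as a sketch (this is not applied in the
pipeline): insert a space after "in'" whenever it runs straight into a
following word.

```{r}
# A sketch of one possible resegmentation fix (not applied in the pipeline):
# put a space after "in'" when it is glued to the next word.
str_replace_all(c("tryin'to", "goin'home", "nothin'"), "in'(?=[a-z])", "in' ")
```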

> For instance, there are 18,081 occurrences of the word *play* in
> SUBTLEXUS, 1,521 of the word *plays*, 2,870 of the word *played*,
> and 7,515 of the word *playing*.

```{r}
data |>
  filter(word %in% c("playing", "plays", "play", "played"))
```

This is a pretty good match, but we have a few more *playing* tokens
because we patched `"lng"` words (OCR misreadings of `"ing"`, as in `"playlng"`).

The published counts fortunately do not have "lng" words.

```{r}
data_subtlexus |>
  filter(str_detect(Word, "lng"))
```

But the combination of treating contractions as separate words and the
l-\>i conversion means that there are a few thousand spurious tokens of
"il":

```{r}
# raw, unpatched
data_pooled |>
  filter(str_detect(word, "^il$|'il")) |>
  mutate(sum(n))

# published
data_subtlexus |>
  filter(str_detect(Word, "^il$")) |>
  select(1:2)
```

Because OCR converts uppercase "I" to lowercase "l", I patched the
corpus to replace lowercase "l" with uppercase "I" when it appeared inside
an all-caps word. We can see big differences between my counts and the
published ones for certain initialisms.
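
A sketch of that kind of repair is below. This is illustrative, not the
repo's actual patching rule: if a token is otherwise upper-case but contains
a lowercase "l", treat the "l" as a misread "I".

```{r}
# Illustrative sketch, not the repo's actual patch: in otherwise all-caps
# tokens, a lowercase "l" is almost certainly a misread "I".
fix_caps_l <- function(words) {
  looks_allcaps <- str_detect(words, "^[A-Z0-9l']+$") & str_detect(words, "[A-Z]")
  words[looks_allcaps] <- str_replace_all(words[looks_allcaps], "l", "I")
  words
}

fix_caps_l(c("HlV", "FBl", "lN", "CIA", "hello"))
```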

```{r}
# my counts
data_pooled |>
  filter(word %in% c("fbi", "irs", "cia")) |>
  arrange(word)

# published
data_subtlexus |>
  filter(tolower(Word) %in% c("fbi", "irs", "cia")) |>
  arrange(Word) |>
  select(1:2)
```

## let's try something

I saw the textstem package as a solution for lemmatizing words.

```{r}
data |>
  head(1000) |>
  mutate(
    lemma = textstem::lemmatize_words(word)
  ) |>
  group_by(lemma) |>
  mutate(lemma_n = sum(n)) |>
  arrange(desc(lemma_n)) |>
  print(n = 20)
```

Hmm, I wish I could exempt irregular forms from being lemmatized. I am
also not a fan of "being" getting reduced down to "be".
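
One possible workaround, sketched below and not something this repo does:
keep a hand-picked list of forms that should be left alone and only
lemmatize the rest.

```{r}
# A sketch of a workaround (not used in this repo): lemmatize everything
# except a hand-picked set of forms that should stay distinct.
keep_as_is <- c("being", "been", "was", "were")

lemmatize_except <- function(words, keep = keep_as_is) {
  out <- textstem::lemmatize_words(words)
  out[words %in% keep] <- words[words %in% keep]
  out
}

lemmatize_except(c("being", "plays", "played", "was"))
```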

## some comparisons for my own interest

There are two layers of patching. One edits the lines in the corpus and
the other edits the words and recounts them. I want to compare counts from
three different stages.

```{r}
a <- targets::tar_read(data_counts_pooled_raw_lines) |>
  rename(n_a = n)
b <- targets::tar_read(data_counts_pooled) |>
  rename(n_b = n)
c <- targets::tar_read(data_counts_patched) |>
  rename(n_c = n)

abc <- a |>
  full_join(b) |>
  full_join(c)
```

Line-patching tries to correct OCR errors and contractions with extra
spacing ("isn 't "). The following words would not exist if it were not
for the line patches. For example, *HIV* appears because OCR reads the
word as `"HlV"`.
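
The extra-spacing repair might look something like this sketch (illustrative
only, not the repo's actual line patch):

```{r}
# Illustrative sketch of the extra-spacing repair: reattach a detached "'t".
str_replace_all("it isn 't here, isn 't it?", "\\s+'t\\b", "'t")
```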

```{r}
abc |>
  filter(is.na(n_a)) |>
  mutate(
    n_b_tokens = sum(n_b, na.rm = TRUE),
    n_c_tokens = sum(n_c, na.rm = TRUE)
  )
```

Words that were lost in the line patching. Some subtitles would have
annotations like `"(SlNGlNG)"` or `"(lN ENGLlSH)"`, so the OCR correction
repaired these.

```{r}
abc |>
  filter(!is.na(n_a), is.na(n_b)) |>
  mutate(
    n_a_tokens = sum(n_a, na.rm = TRUE)
  )
```

Words that increased during line-patching. "Kitt", by the way, is the
car from *Knight Rider*.

```{r}
diffs <- abc |>
  tidyr::replace_na(list(n_a = 0, n_b = 0, n_c = 0)) |>
  mutate(
    b_vs_a = n_b - n_a,
    # proportion change
    b_vs_a_prop = b_vs_a / (n_a + 1),
    c_vs_b = n_c - n_b,
    # proportion change
    c_vs_b_prop = c_vs_b / (n_b + 1),
  )

diffs |>
  select(-n_c, -starts_with("c_vs")) |>
  arrange(desc(b_vs_a))

diffs |>
  select(-n_c, -starts_with("c_vs")) |>
  arrange(desc(b_vs_a_prop))
```

Words that decreased during line-patching.

```{r}
diffs |>
  select(-n_c, -starts_with("c_vs")) |>
  arrange(b_vs_a)

diffs |>
  select(-n_c, -starts_with("c_vs")) |>
  arrange(b_vs_a_prop)
```

Finally, these are the words that increased with word-level patching.

```{r}
diffs |>
  select(-n_a, -starts_with("b_vs")) |>
  arrange(desc(c_vs_b))

diffs |>
  select(-n_a, -starts_with("b_vs")) |>
  arrange(desc(c_vs_b_prop))
```

Words that decreased from word-patching.

```{r}
diffs |>
  select(-n_a, -starts_with("b_vs")) |>
  arrange(c_vs_b)
```

### contraction counts

This was the motivation for this whole exercise. I can find some 715K
`n't` contractions. The published frequencies have around 733K `t` tokens.
That's a pretty good match. But if I sum the contraction stems in the published
counts, I get 900K tokens. That's because the stem of `can't` is itself the
common word `can`.

```{r}
contraction_stems <- c(
  "isn", "aren", "ain", "wasn",
  "can",
  "don", "doesn", "didn",
  "couldn", "shouldn", "wouldn",
  "hasn", "haven", "hadn",
  "won",
  "mustn", "needn", "shan"
)

data_my_counts <- data %>%
  filter(
    word %in% paste0(contraction_stems, "'t")
  ) |>
  mutate(total = sum(n))

data_subtlexus_counts <- data_subtlexus |>
  select(
    subtlexus_word = Word,
    subtlexus_count = FREQcount
  ) |>
  filter(subtlexus_word %in% c("t", contraction_stems)) |>
  mutate(
    word = ifelse(
      subtlexus_word != "t",
      paste0(subtlexus_word, "'t"),
      subtlexus_word
    ),
    sum_of_non_t_items = sum(subtlexus_count) - 733338
  )

data_my_counts |>
  full_join(data_subtlexus_counts, by = "word")
```

If I count the bare stems in my counts, I can see roughly how many
contractions I have missed. With some additional patching, we would recover
around 300 more contractions.

```{r}
data %>%
  filter(word %in% contraction_stems) |>
  group_by(
    could_be_false_positive = word %in% c("can", "won", "haven", "don")
  ) |>
  mutate(total = sum(n))
```

A quick check of what these remaining errors look like:

```{r}
data_patched_lines <- targets::tar_read("data_raw_corpus_patched")
didn_lines <- data_patched_lines |>
  filter(str_detect(line, "didn[^'t]"))

data_patched_lines |>
  filter(str_detect(line, "\\w\\\"\\w")) |>
  tail()
```

## how to spot errors

My "process" for screening the corpus is to randomly sample words and
see what sticks out. When I noticed `"cla"` in one of these samples, I
figured out the `"I"` to `"l"` OCR problem in all-caps words.

```{r}
data_pooled$word |>
  sample(size = 10)
```

To feel better about things, you could weight by frequency:

```{r}
data_pooled$word |>
  sample(size = 10, prob = data_pooled$n)
```

## open questions (so far)

I'm not sure what's going on with the encoding and whether that matters. If you
find all lines with the string "Zellweger", you will find some `"Ren\xe9e
Zellweger"` and some `"Renee Zellweger"` tokens.
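
The stray bytes look like Latin-1, so one possible normalization (a sketch,
not something the pipeline currently does) is to convert them before counting:

```{r}
# A sketch: treat the stray bytes as Latin-1 and convert to UTF-8.
iconv("Ren\xe9e Zellweger", from = "latin1", to = "UTF-8")
```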

Numerals show up. I imagine that one would want to decompose them into subwords
so that, e.g., "1993" is "nineteen", "ninety", and "three".
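
One starting point, as a sketch (the {english} package is an assumption here,
not something this repo uses), would be to spell numbers out. Note that this
gives the cardinal reading rather than the "nineteen ninety-three" year
reading, which would need its own rule.

```{r, eval = FALSE}
# A sketch assuming the {english} package is installed (not used in this repo).
# Gives the cardinal reading, not the "nineteen ninety-three" year reading.
english::as.english(1993)
```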

Stray observations:

- There are lots of URLs/usernames in the corpus because the subtitles can
be signed or sourced.

- If a TV show has a theme song, then you can find many repeated lines.

- other OCR errors to consider: i/l to 1

```{r, echo = FALSE, eval = FALSE}
data_raw$line |>
  stringr::str_subset(
    stringr::fixed("violence in movies", ignore_case = TRUE)
  ) |>
  substr(1, 30)
```

```{r, echo = FALSE}
# ther eyou
# mayb eyou
# Than kyou
# YourX-rays
# offt he
# ofJ ohn
# Splash ofJ &B? J ust
# ofk nife
# god ofd arkness
# de fi init
# insec't
```