---
output:
  github_document:
    keep_html: false
    html_preview: false
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(tidyverse)
```

# subtitlefreq
The goal of subtitlefreq is to provide alternative word frequency counts from
the SUBTLEX-US corpus. My problem with the SUBTLEX-US corpus is that it
separates contractions like "isn't" into "isn" + "t", so that "isn't",
"aren't", "don't", etc. each count as two words. This project is a naive
attempt to rebuild some frequency counts from scratch.

⚠️ This is not a drop-in replacement for the SUBTLEX-US corpus. I just wanted
to estimate how frequent "don't" is so I could compute some neighborhood
frequency measures.

## obtaining the raw SUBTLEX-US data
As a matter of caution, I won't provide the original subtitle corpus. But you
can download it the same way that I did.

- Go to the following URL: and download the corpus.
- Move `Subtlex US.rar` into the `data` folder.

We can test our download by reading the embedded readme file from it.
```{r}
readme <- archive::archive_read(
  "data/Subtlex US.rar",
  "Subtlex US/readme.txt"
)
writeLines(readLines(readme))
```

That last line is why I don't redistribute the corpus.
## then run targets
Assuming the data are downloaded and the appropriate packages are installed,
running `targets::tar_make()` should count the words in the corpus. targets
also [downloads](https://osf.io/djpqz/) and prepares the published version
of the SUBTLEX-US frequency counts.
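
That is, from the project root (assuming the dependencies are installed):

```r
# Build all targets: count the corpus words and prepare the published counts
targets::tar_make()
```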

```{r}
data_subtlexus <- targets::tar_read(data_subtlexus)
data_subtlexus
```

Data processing here works in three stages. First, there is a raw tibble of
lines, with one row per subtitle line. The `batch` column is used for
splitting the corpus into batches so that words can be counted in parallel.

```{r}
library(tidyverse)
data_raw <- targets::tar_read(data_raw_corpus)
data_raw
```

These lines are patched (to remove some garbage I discovered where markup
was included in the dialogue), and then the words in each batch are counted.
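
The count-then-pool idea is roughly this (a toy sketch with made-up lines,
not the pipeline's actual code):

```r
# Toy illustration: count words within each batch, then pool the batch counts
toy <- tibble(
  batch = c(1, 1, 2),
  line  = c("i don't know", "i know", "don't do that")
)
toy |>
  mutate(word = str_split(line, " ")) |>
  unnest(word) |>
  count(batch, word) |>             # per-batch word counts
  count(word, wt = n, name = "n")   # pooled counts across batches
```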

The counts from each batch are pooled together to give the following
frequency counts:

```{r}
data_pooled <- targets::tar_read(data_counts_pooled)
data_pooled |> print(n = 20)
```

You might notice that there are around 200,000 words here, instead of the
60,000-70,000 words you might see elsewhere. This is probably because we did
not lump any inflections together or anything.

If I ever see any typos in the counts, I try to store a correction in the CSV
file `data/patches.csv`. This is probably a futile, never-ending endeavor,
but hey, at least anyone can add to it.

```{r}
data_patches <- targets::tar_read("data_patches")
data_patches
```

These individual patches are followed by regular-expression-based patches.
These are stored in `data/patches-regex.csv` so that they can have a
`comment` column that describes their behavior.

```{r}
data_patches_regex <- targets::tar_read("data_patches_regex")
data_patches_regex
```

After applying the patches, we obtain the following counts:

```{r}
data <- targets::tar_read("data_counts_patched")
data
```

We can rationalize our patching activity by looking at how many words are
affected:

```{r}
data_pooled |>
  anti_join(data, by = "word") |>
  print(n = 20)
```

## comparisons

What text-cleaning did they originally do?
> We rejected all files with more than 2.5% type errors according to the
> spelling checker Aspell. In the end, 8,388 films and television
> episodes were retained, with a total of 51.0 million words (16.1
> million from television series, 14.3 million from films before 1990,
> and 20.6 million from films after 1990).

I seem to be missing around 3 million tokens.

```{r}
sum(data$n)
51000000 - sum(data$n)
```

Or perhaps I am missing just 2 million words, based on the published
frequencies:

```{r}
sum(data_subtlexus$FREQcount)
sum(data_subtlexus$FREQcount) - sum(data$n)
```

Our raw text has lots of segmentation errors where multiple words are
combined together. For example, here are types with "in'" followed by at
least 2 characters. If I split "tryin'to" into "tryin' to", I get 36 new
words.
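
A fix could be as simple as reinserting the missing space (a rough sketch,
not what the pipeline actually does):

```r
# Hypothetical repair: add a space after "in'" when a letter follows it
str_replace_all("you keep tryin'to help", "in'(?=\\w)", "in' ")
#> [1] "you keep tryin' to help"
```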

Would systematically fixing lots of segmentation mistakes rediscover
2,000,000 missing words?

```{r}
data_pooled |>
  filter(str_detect(word, "in'..+$")) |>
  mutate(sum = sum(n))
```

> For instance, there are 18,081 occurrences of the word *play* in
> SUBTLEXUS, 1,521 of the word *plays*, 2,870 of the word *played*,
> and 7,515 of the word *playing*.

```{r}
data |>
  filter(word %in% c("playing", "plays", "play", "played"))
```

This is a pretty good match, but we have a few more *playing* tokens because
we patched `"lng"` words. The published counts fortunately do not have "lng"
words.

```{r}
data_subtlexus |>
  filter(str_detect(Word, "lng"))
```

But the combination of treating contractions as separate words and the
l -> i conversion means that there are a few thousand spurious tokens of
"il":

```{r}
# raw, unpatched
data_pooled |>
  filter(str_detect(word, "^il$|'il")) |>
  mutate(sum(n))

# published
data_subtlexus |>
  filter(str_detect(Word, "^il$")) |>
  select(1:2)
```

Because of OCR converting uppercase "I" to lowercase "l", I patched the
corpus to replace lowercase "l" with uppercase "I" when it was inside of an
all-caps word.
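
The patch works roughly in this spirit (a simplified sketch with an assumed
regex, not the project's actual patching code):

```r
# Sketch: in words that are otherwise all-caps, swap a stray "l" for "I"
fix_caps_l <- function(lines) {
  pattern <- "\\b[A-Z]+l[A-Zl]*\\b|\\b[A-Zl]*l[A-Z]+\\b"
  m <- gregexpr(pattern, lines, perl = TRUE)
  regmatches(lines, m) <- lapply(
    regmatches(lines, m),
    function(words) gsub("l", "I", words, fixed = TRUE)
  )
  lines
}

fix_caps_l(c("THE FBl CALLED", "(SlNGlNG lN ENGLlSH)"))
#> [1] "THE FBI CALLED"       "(SINGING IN ENGLISH)"
```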

We can see big differences between my counts and the published ones for
certain initialisms.

```{r}
# my counts
data_pooled |>
  filter(word %in% c("fbi", "irs", "cia")) |>
  arrange(word)

# published
data_subtlexus |>
  filter(tolower(Word) %in% c("fbi", "irs", "cia")) |>
  arrange(Word) |>
  select(1:2)
```

## let's try something

I saw the textstem package as a solution for lemmatizing words.
```{r}
data |>
  head(1000) |>
  mutate(
    lemma = textstem::lemmatize_words(word)
  ) |>
  group_by(lemma) |>
  mutate(lemma_n = sum(n)) |>
  arrange(desc(lemma_n)) |>
  print(n = 20)
```

Hmm, I wish I could skip irregular forms from being lemmatized. I am also
not a fan of "being" being reduced down to "be".
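
One workaround might be a keep-list of forms to leave alone (a sketch; the
`keep_as_is` vector here is made up):

```r
# Hypothetical keep-list approach: only lemmatize words not on the list
keep_as_is <- c("was", "were", "being", "been")
data |>
  head(1000) |>
  mutate(
    lemma = ifelse(
      word %in% keep_as_is,
      word,
      textstem::lemmatize_words(word)
    )
  )
```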

## some comparisons for my own interest

There are two layers of patching. One edits the lines in the corpus and
the other edits the words and recounts the words. I want to compare
counts from three different stages.

```{r}
a <- targets::tar_read(data_counts_pooled_raw_lines) |>
  rename(n_a = n)
b <- targets::tar_read(data_counts_pooled) |>
  rename(n_b = n)
c <- targets::tar_read(data_counts_patched) |>
  rename(n_c = n)

abc <- a |>
  full_join(b) |>
  full_join(c)
```

Line-patching tries to correct OCR errors and contractions with extra
spacing ("isn 't ").
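
The spacing repair might look something like this (a sketch, not the
project's actual patch):

```r
# Hypothetical repair: pull a detached "'t" (or similar) back onto its stem
str_replace_all("it isn 't fair", " '([a-z])", "'\\1")
#> [1] "it isn't fair"
```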

These are words that would not exist if it were not for the line patches.
For example, *HIV* appears because OCR reads the word as `"HlV"`.

```{r}
abc |>
  filter(is.na(n_a)) |>
  mutate(
    n_b_tokens = sum(n_b, na.rm = TRUE),
    n_c_tokens = sum(n_c, na.rm = TRUE)
  )
```

Words that were lost in the line patching. Some subtitles would have
annotations like `"(SlNGlNG)"` or `"(lN ENGLlSH)"`, so the OCR correction
repaired these.

```{r}
abc |>
  filter(!is.na(n_a), is.na(n_b)) |>
  mutate(
    n_a_tokens = sum(n_a, na.rm = TRUE)
  )
```

Words that increased during line-patching. "Kitt", by the way, is the
car from *Knight Rider*.```{r}
diffs <- abc |>
  tidyr::replace_na(list(n_a = 0, n_b = 0, n_c = 0)) |>
  mutate(
    b_vs_a = n_b - n_a,
    # proportion change
    b_vs_a_prop = b_vs_a / (n_a + 1),
    c_vs_b = n_c - n_b,
    # proportion change
    c_vs_b_prop = c_vs_b / (n_b + 1),
  )

diffs |>
  select(-n_c, -starts_with("c_vs")) |>
  arrange(desc(b_vs_a))

diffs |>
  select(-n_c, -starts_with("c_vs")) |>
  arrange(desc(b_vs_a_prop))
```

Words that decreased during line-patching.

```{r}
diffs |>
  select(-n_c, -starts_with("c_vs")) |>
  arrange(b_vs_a)

diffs |>
  select(-n_c, -starts_with("c_vs")) |>
  arrange(b_vs_a_prop)
```

Finally, these are words that changed with word-level patching.

```{r}
diffs |>
  select(-n_a, -starts_with("b_vs")) |>
  arrange(desc(c_vs_b))

diffs |>
  select(-n_a, -starts_with("b_vs")) |>
  arrange(desc(c_vs_b_prop))
```

Words that decreased from word-patching.

```{r}
diffs |>
  select(-n_a, -starts_with("b_vs")) |>
  arrange(c_vs_b)
```

### contraction counts

This was the motivation for this whole exercise. I can find some 715K
`n't` contractions. The published frequencies have around 733K `t` tokens.
That's a pretty good match. But if I sum the contraction stems in the
published counts, I get 900K tokens. That's because the stem of `can't` is
itself a common word, `can`.

```{r}
contraction_stems <- c(
"isn", "aren", "ain", "wasn",
"can",
"don", "doesn", "didn",
"couldn", "shouldn", "wouldn",
"hasn", "haven", "hadn",
"won",
"mustn", "needn", "shan"
)data_my_counts <- data %>%
filter(
word %in% paste0(contraction_stems, "'t")
) |>
mutate(total = sum(n))
data_subtlexus_counts <- data_subtlexus |>
select(
subtlexus_word = Word,
subtlexus_count = FREQcount
) |>
filter(subtlexus_word %in% c("t", contraction_stems)) |>
mutate(
word = ifelse(
subtlexus_word != "t",
paste0(subtlexus_word, "'t"),
subtlexus_word
),
sum_of_non_t_items = sum(subtlexus_count) - 733338
)data_my_counts |>
full_join(data_subtlexus_counts, by = "word")
```

If I count the stems in my counts, I can see about how many contractions I
have missed. We would get around 300 contractions with some patching.

```{r}
data %>%
  filter(
    word %in% paste0(contraction_stems),
  ) |>
  group_by(
    could_be_false_positive = word %in% c("can", "won", "haven", "don")
  ) |>
  mutate(total = sum(n))
```

A quick check of what these remaining errors look like:

```{r}
data_patched_lines <- targets::tar_read("data_raw_corpus_patched")
didn_lines <- data_patched_lines |>
  filter(str_detect(line, "didn[^'t]"))

data_patched_lines |>
  filter(str_detect(line, "\\w\\\"\\w")) |>
  tail()
```

## how to spot errors

My "process" for screening the corpus is to randomly sample words and
see what sticks out. When I noticed `"cla"` in one of these samples, I
figured out the `"I"` to `"l"` OCR problem in all-caps words.

```{r}
data_pooled$word |>
  sample(size = 10)
```

To feel better about things, you could weight by frequency:

```{r}
data_pooled$word |>
  sample(size = 10, prob = data_pooled$n)
```

## open questions (so far)

I'm not sure what's going on with the encoding and whether that matters. If
you find all lines with the string "Zellweger", you will find some
`"Ren\xe9e Zellweger"` and some `"Renee Zellweger"` tokens.
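
For example (a sketch; `data_raw` is the raw-corpus target read earlier):

```r
# Peek at the lines that mention Zellweger to see both spellings/encodings
data_raw |>
  filter(str_detect(line, "Zellweger")) |>
  head()
```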

Numerals show up. I imagine that one would want to decompose them into
subwords so that, e.g., "1993" is "nineteen", "ninety", and "three".

Stray observations:

- There are lots of URLs/usernames in the corpus because the subtitles can
  be signed or sourced.
- If a TV show has a theme song, then you can find many repeated lines.
- Other OCR errors to consider: i/l to 1.
```{r, echo = FALSE, eval = FALSE}
data_raw$line |>
  stringr::str_subset(
    stringr::fixed("violence in movies", ignore_case = TRUE)
  ) |>
  substr(1, 30)
```

```{r, echo = FALSE}
# ther eyou
# mayb eyou
# Than kyou
# YourX-rays
# offt he
# ofJ ohn
# Splash ofJ &B? J ust
# ofk nife
# god ofd arkness
# de fi init
# insec't
```