Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/tjmahr/mfa-dict


https://github.com/tjmahr/mfa-dict

Last synced: about 1 month ago
JSON representation

Awesome Lists containing this project

README

        

---
output: github_document
---

```{r}
#| include: false
library(tidyverse)
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
dpi = 300,
dev = "ragg_png"
)
```

# mfa-dict

This repository hosts the words we have had to add to the English
pronunciation dictionary for the Montreal Forced Aligner
([Docs][mfa-docs], [GitHub][mfa-github].

## contents

Here are the contents of the custom dictionary, as an R data-frame,
where I have named unimportant columns `"ignoreX"`.

```{r}
"english_us_arpa_extras.dict" |>
readr::read_tsv(
col_names = c("word", "prob", "ignore0", "ignore1", "ignore2", "phones"),
col_types = "cddddc"
) |>
print(n = Inf)
```

To use the dictionary, copy and paste the contents of file
`english_us_arpa_extras.dict` onto the end of the dictionary file used
by the MFA.

## updating/modifying the dictionary

I used the following to force an update to my English models.

```
mfa model download dictionary english_us_arpa --ignore_cache
mfa model download acoustic english_us_arpa --ignore_cache
```

So, for my Windows computer, the dictionary was saved to:

`C:/Users/Tristan/Documents/MFA/pretrained_models/dictionary/english_us_arpa.dict`

Then in R I can read in the MFA dictionary, our custom one, and combine them and
overwrite the MFA one.

```{r, eval = FALSE}
# Using the convention that ~ points to a user's `Documents` folder on Windows
path_mfa_dict <- "~/MFA/pretrained_models/dictionary/english_us_arpa.dict"

current <- readr::read_tsv(
path_mfa_dict,
col_names = FALSE,
col_types = "cddddc"
)

new <- readr::read_tsv(
"english_us_arpa_extras.dict",
col_names = FALSE,
col_types = "cddddc"
)

# Add or overwrite rows in `current` with rows from `new`
dplyr::rows_upsert(current, new, by = c("X1", "X6")) |>
readr::write_tsv(path_mfa_dict, col_names = FALSE)
```

### the dictionary format

The format for the probabilistic pronunciation dictionaries is
[described here][mfa-pron-prob]. Consider the word *the*:

```
the 0.99 0.01 1.21 0.97 DH AH0
the 0.01 0.12 1.87 0.84 DH AH1
the 0.17 0.11 1.27 0.96 DH IY0
```

The docs tell us:

> The first float column is the probability of the pronunciation, the
> next float is the probability of silence following the pronunciation,
> and the final two floats are correction terms for preceding silence
> and non-silence. Given that each entry in a dictionary is independent
> and there is no way to encode information about the preceding context,
> the correction terms are calculated as how much more common was
> silence or non-silence compared to what we would expect factoring out
> the likelihood of silence from the previous word.

I am not sure what this means, but the first column seems to be the most
important (marginal) probability of pronunciation.

[mfa-docs]: https://montreal-forced-aligner.readthedocs.io/en/latest/index.html "Read The Docs page for the MFA"

[mfa-github]: https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner "MFA GitHub page"

[mfa-phon-prob]: https://montreal-forced-aligner.readthedocs.io/en/latest/user_guide/dictionary.html#silence-probabilities "MFA Dictionary Format"