Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/tjmahr/readtextgrid
Read in a 'Praat' 'TextGrid' File
https://github.com/tjmahr/readtextgrid
Last synced: about 1 month ago
JSON representation
Read in a 'Praat' 'TextGrid' File
- Host: GitHub
- URL: https://github.com/tjmahr/readtextgrid
- Owner: tjmahr
- License: gpl-3.0
- Created: 2019-09-05T19:22:02.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2024-04-26T20:52:18.000Z (8 months ago)
- Last Synced: 2024-04-27T20:50:46.386Z (8 months ago)
- Language: R
- Homepage:
- Size: 326 KB
- Stars: 12
- Watchers: 4
- Forks: 3
- Open Issues: 4
-
Metadata Files:
- Readme: README.Rmd
- License: LICENSE.md
- Code of conduct: .github/CODE_OF_CONDUCT.html
Awesome Lists containing this project
README
---
output: github_document
---```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```# readtextgrid
[![CRAN status](https://www.r-pkg.org/badges/version/readtextgrid)](https://CRAN.R-project.org/package=readtextgrid)
[![R-CMD-check](https://github.com/tjmahr/readtextgrid/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/tjmahr/readtextgrid/actions/workflows/R-CMD-check.yaml)readtextgrid parses Praat textgrids into R dataframes.
## Installation
Install from CRAN:
``` r
install.packages("readtextgrid")
```Install the development version from Github:
``` r
install.packages("remotes")
remotes::install_github("tjmahr/readtextgrid")
```## Basic example
Here is the example textgrid created by Praat. It was created using
`New -> Create TextGrid...` with default settings in Praat.
This textgrid is bundled with this R package. We can locate the file with
`example_textgrid()`. We read in the textgrid with `read_textgrid()`.```{r example, R.options = list(tibble.width = 100)}
library(readtextgrid)# Locates path to an example textgrid bundled with this package
tg <- example_textgrid()read_textgrid(path = tg)
```The dataframe contains one row per annotation: one row for each interval on an
interval tier and one row for each point on a point tier. If a point tier has no
points, it is represented with single row with `NA` values.The columns encode the following information:
- `file` filename of the textgrid. By default this column uses the filename in
`path`. A user can override this value by setting the `file` argument in
`read_textgrid(path, file)`, which can be useful if textgrids are stored in
speaker-specific folders.
- `tier_num` the number of the tier (as in the left margin of the textgrid
editor)
- `tier_name` the name of the tier (as in the right margin of the textgrid
editor)
- `tier_type` the type of the tier. `"IntervalTier"` for interval tiers and
`"TextTier"` for point tiers (this is the terminology used inside of the
textgrid file format).
- `tier_xmin`, `tier_xmax` start and end times of the tier in seconds
- `xmin`, `xmax` start and end times of the textgrid interval or point tier
annotation in seconds
- `text` the text in the annotation
- `annotation_num` the number of the annotation in that tier (1 for the first
annotation, etc.)## Reading in directories of textgrids
Suppose you have data on multiple speakers with one folder of textgrids per
speaker. As an example, this package has a folder called `speaker_data` bundled
with it representing 5 five textgrids from 2 speakers.```
speaker-data
+-- speaker001
| +-- s2T01.TextGrid
| +-- s2T02.TextGrid
| +-- s2T03.TextGrid
| +-- s2T04.TextGrid
| \-- s2T05.TextGrid
\-- speaker002
+-- s2T01.TextGrid
+-- s2T02.TextGrid
+-- s2T03.TextGrid
+-- s2T04.TextGrid
\-- s2T05.TextGrid
```First, we create a vector of file-paths to read into R.
```{r}
# Get the path of the folder bundled with the package
data_dir <- system.file(package = "readtextgrid", "speaker-data")# Get the full paths to all the textgrids
paths <- list.files(
path = data_dir,
pattern = "TextGrid$",
full.names = TRUE,
recursive = TRUE
)
```We can use `purrr::map_dfr()`--*map* the `read_textgrid` function over the
`paths` and combine the dataframes (`_dfr`)---to read all these textgrids into
R. But note that this way loses the speaker information.```{r, R.options = list(tibble.width = 100)}
library(purrr)map_dfr(paths, read_textgrid)
```We can use `purrr::map2_dfr()` and some dataframe manipulation to add the
speaker information.```{r, R.options = list(tibble.width = 100), message = FALSE, warning = FALSE}
library(dplyr)# This tells read_textgrid() to set the file column to the full path
data <- map2_dfr(paths, paths, read_textgrid) |>
mutate(
# basename() removes the folder part from a path,
# dirname() removes the file part from a path
speaker = basename(dirname(file)),
file = basename(file),
) |>
select(
speaker, everything()
)data
```Another strategy would be to read the textgrid dataframes into a list column and
`unnest()` them.```{r}
# Read dataframes into a list column
data_nested <- tibble(
speaker = basename(dirname(paths)),
data = map(paths, read_textgrid)
)# We have one row per textgrid dataframe because `data` is a list column
data_nested# promote the nested dataframes into the main dataframe
tidyr::unnest(data_nested, "data")
```## Pivoting textgrids [dev version]
In the textgrids above, there is a natural nesting or hierarchy to the tiers.
Intervals in `words` tier contain intervals in the `phones` tier. It is often
necessary to group intervals by their parent intervals (group phones by words).
This package provides the `pivot_textgrid_tiers()` function to convert textgrids
into a wide format in a way that respect the nesting/hierarchy of tiers.```{r}
data_wide <- pivot_textgrid_tiers(
data,
tiers = c("words", "phones"),
join_cols = c("speaker", "file")
)data_wide
# more clearly,
data_wide |>
select(
speaker, file, words, phones,
words_xmin, words_xmax, phones_xmin, phones_xmax
)
```Some remarks:
- Each tier in `tiers` becomes a batch of columns. For the rows for `words`
become `words` (the original `text` value), `words_xmin`, `words_xmax`, etc.
- The columns in `join_cols` should uniquely identify a textgrid file, so the
combination of `speaker` and `file` is needed in the case where different
speakers have the same file.
- The tier names in `tiers` should be given in the order of their nesting from
outside to inside (e.g., `words` contain `phones`). Behind the scenes,
`dplyr::left_join(..., relationship = "one-to-many")` is used to constrain
how intervals are combined.This function also works on a single `tiers` value. In this case, the function
returns just the intervals in that tier with the columns renamed and prefixed.```{r}
data |>
pivot_textgrid_tiers(
tiers = "words",
join_cols = c("speaker", "file")
)
```## Other tips
### Speeding things up
Do you have thousands of textgrids to read? The following workflow can speed
things up. We are going to **read the textgrids in parallel**. We use the future
package to manage the parallel computation. We use the furrr package to get
future-friendly versions of the purrr functions. We tell future to use a
`multisession` `plan` for parallelism: Do the extra computation on separate R
sessions in the background. Then everything else is the same. Just replace
`map()` with `future_map()`.```{r, warning = FALSE}
library(future)
library(furrr)
plan(multisession, workers = 4)data_nested <- tibble(
speaker = basename(dirname(paths)),
data = future_map(paths, read_textgrid)
)
```By default, readtextgrid uses `readr::guess_encoding()` to determine the
encoding of the textgrid before reading it in. But if you know the encoding
beforehand, you can skip this guessing. In my limited testing, I found
that **setting the encoding** could reduce benchmark times by 3--4% compared
to guessing the encoding.Here, we read 100 textgrids using different approaches to benchmark the
results.```{r}
paths_bench <- sample(paths, 100, replace = TRUE)
bench::mark(
lapply_guess = lapply(paths_bench, read_textgrid),
lapply_set = lapply(paths_bench, read_textgrid, encoding = "UTF-8"),
future_guess = future_map(paths_bench, read_textgrid),
future_set = future_map(paths_bench, read_textgrid, encoding = "UTF-8"),
min_iterations = 5,
check = TRUE
)
```### Helpful columns
The following columns are often helpful:
- `duration` of an interval
- `xmid` midpoint of an interval
- `total_annotations` total number of annotations on a tierHere is how to create them:
```{r}
data |>
# grouping needed for counting annotations per tier per file per speaker
group_by(speaker, file, tier_num) |>
mutate(
duration = xmax - xmin,
xmid = xmin + (xmax - xmin) / 2,
total_annotations = sum(!is.na(annotation_num))
) |>
ungroup() |>
glimpse()
```### Launching Praat
*This tip is written from the perspective of a Windows user who uses git bash
for a terminal*.To open textgrids in Praat, you can tell R to call Praat from
the command line. You have to know where the location of the Praat binary is
though. I like to keep a copy in my project directories. So, assuming that
Praat.exe in my working folder, the following would open the 10 textgrids in
`paths` in Praat.```{r, eval = FALSE}
system2(
command = "./Praat.exe",
args = c("--open", paths),
wait = FALSE
)
```## Limitations
readtextgrid supports textgrids created by Praat by using `Save as text
file...`. It uses a parsing strategy based on regular expressions targeting
indentation patterns and text flags in the file format. The [formal
specification of the textgrid
format](https://www.fon.hum.uva.nl/praat/manual/TextGrid_file_formats.html),
however, is much more flexible. As a result, not every textgrid that Praat can
open---especially the minimal "short text" files---is compatible with this
package.## Acknowledgments
readtextgrid was created to process data from the [WISC Lab
project](https://kidspeech.wisc.edu/). Thus, development of this package was
supported by NIH R01DC009411 and NIH R01DC015653.***
Please note that the 'readtextgrid' project is released with a
[Contributor Code of Conduct](https://www.contributor-covenant.org/version/1/0/0/code-of-conduct.html).
By contributing to this project, you agree to abide by its terms.