https://github.com/montrealcorpustools/benchmarking
Benchmarking suites for PolyglotDB
- Host: GitHub
- URL: https://github.com/montrealcorpustools/benchmarking
- Owner: MontrealCorpusTools
- License: MIT
- Created: 2016-06-06T15:25:14.000Z (about 10 years ago)
- Default Branch: main
- Last Pushed: 2021-06-22T17:09:09.000Z (almost 5 years ago)
- Last Synced: 2025-12-13T01:32:44.010Z (6 months ago)
- Language: Python
- Size: 125 KB
- Stars: 0
- Watchers: 5
- Forks: 2
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# benchmarking
Benchmarking scripts for various applications of Montreal Corpus Tools
The MFA folder contains several scripts whose names begin with benchmark\_aligner, one per dataset. There are currently scripts to align the LibriSpeech corpus and the lab datasets for Quebec French, English, and Tagalog. If dict\_path is set to None, the --nodict option is used (as in the Tagalog script). The paths to the relevant directories, as well as the number of jobs, can be changed at the top of each script. The models produced by alignment are stored as zip archives.
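As a rough illustration of the dict\_path behaviour described above, the following minimal sketch builds an aligner command line from settings defined at the top of a script. Only dict\_path and the --nodict option come from this README. The function name, the placeholder executable name, the argument order, the other variable names, and the -j flag for the number of jobs are all assumptions made for illustration and may not match the actual benchmark scripts.

```python
# Hypothetical settings, mirroring "paths at the top of the script".
corpus_dir = "/path/to/corpus"   # assumed variable name
dict_path = None                 # None means: align without a dictionary
output_dir = "/path/to/output"   # assumed variable name
num_jobs = 4                     # assumed variable name


def build_align_command(executable, corpus_dir, dict_path, output_dir, num_jobs):
    """Return the argument list for a hypothetical aligner invocation.

    When dict_path is None, the --nodict option replaces the dictionary path.
    The argument order and the -j flag are assumptions for this sketch.
    """
    command = [executable, corpus_dir]
    if dict_path is None:
        command.append("--nodict")
    else:
        command.append(dict_path)
    command.extend([output_dir, "-j", str(num_jobs)])
    return command


# "aligner_executable" is a placeholder, not a real program name.
command = build_align_command(
    "aligner_executable", corpus_dir, dict_path, output_dir, num_jobs
)
print(command)
```

The resulting list could be passed to subprocess.run once the placeholder executable and flags are replaced with those the installed aligner actually accepts.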
The reorganize\_french\_corpus.py script restructures the Quebec French dataset into a usable format for alignment.
The librispeech\_to\_chapters.py script organizes the LibriSpeech corpus into speaker folders that contain textgrids for each chapter.
The comparetextgrids.py script takes the paths to two aligned corpora as command-line arguments and writes a CSV file showing the average differences in word, phone, and segment-of-interest (SOI) alignment, as well as the difference in the counts of 'sil' segments. If a TextGrid in one dataset has no corresponding TextGrid in the other, no output is produced for it. If segments of interest are not indicated in a TextGrid, the SOI column of the CSV is left blank. When the two alignments have different phone counts, both counts are listed and no average difference is given.
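The comparison logic described above can be sketched as follows. This is not the code in comparetextgrids.py: it assumes each tier has already been read into a list of (label, start, end) tuples, and it assumes that "average difference" means the mean absolute difference between corresponding start and end boundaries. The function names are hypothetical.

```python
from statistics import mean


def average_boundary_difference(intervals_a, intervals_b):
    """Compare two tiers given as lists of (label, start, end) tuples.

    Returns (count_a, count_b, average). The average is None when the
    counts differ or the tiers are empty, so only the counts are reported.
    """
    count_a, count_b = len(intervals_a), len(intervals_b)
    if count_a != count_b or count_a == 0:
        return count_a, count_b, None
    diffs = []
    for (_, start_a, end_a), (_, start_b, end_b) in zip(intervals_a, intervals_b):
        diffs.append(abs(start_a - start_b))
        diffs.append(abs(end_a - end_b))
    return count_a, count_b, mean(diffs)


def sil_count_difference(intervals_a, intervals_b):
    """Return the number of 'sil' intervals in A minus the number in B."""
    sil_a = sum(1 for label, _, _ in intervals_a if label == "sil")
    sil_b = sum(1 for label, _, _ in intervals_b if label == "sil")
    return sil_a - sil_b


# Toy phone tiers from two hypothetical alignments of the same utterance.
phones_a = [("sil", 0.0, 0.5), ("k", 0.5, 0.75), ("ae", 0.75, 1.0)]
phones_b = [("sil", 0.0, 0.25), ("k", 0.25, 0.75), ("ae", 0.75, 1.0)]

print(average_boundary_difference(phones_a, phones_b))
print(average_boundary_difference(phones_a, phones_b[:2]))  # counts differ
print(sil_count_difference(phones_a, phones_b))
```

Returning None for the average when the counts differ mirrors the behaviour described in the README, where both counts are listed and no average is given.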