https://github.com/brawer/pronunbot
Tools for uploading recorded pronunciations to Wikimedia Commons and Wikidata
- Host: GitHub
- URL: https://github.com/brawer/pronunbot
- Owner: brawer
- License: mit
- Created: 2018-11-27T07:15:13.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2018-12-01T09:28:35.000Z (about 6 years ago)
- Last Synced: 2024-10-13T14:14:46.927Z (2 months ago)
- Language: Python
- Size: 126 KB
- Stars: 1
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# PronunBot
PronunBot is a tool for uploading a batch of recorded pronunciations
to [Wikimedia Commons](https://commons.wikimedia.org/) and
[Wikidata](https://www.wikidata.org).

## Background
We built this tool at the [Plurilinguism
Hackathon](https://forum-helveticum.ch/en/hackathon/) in November
2018. [Lia Rumantscha](http://www.liarumantscha.ch/?changeLang=_en)
contributed recorded pronunciations of about 5000 phrases in the [Sursilvan
variant](https://en.wikipedia.org/wiki/Sursilvan_dialects_(Romansh))
of the [Romansh
language](https://en.wikipedia.org/wiki/Romansh_language) to the
hackathon. Back in March 2007, the pronunciations had been recorded
as language training material; at the 2018 hackathon, Lia Rumantscha kindly
gave permission to upload them to Wikidata under the Creative Commons Zero
license.

## Setup
We’ve used a Macintosh laptop with
[Docker](https://docs.docker.com/docker-for-mac/install/) running a
Linux container. For setup instructions, see the comments in `Dockerfile`.

## Splitting multi-word phrases
Many of the original recordings are multi-word phrases.
An example is the phrase [“jeu savess prender” 🔉](https://cdn.jsdelivr.net/gh/brawer/PronunBot/testdata/split_phrases/jeu%20savess%20prender.mp3). Because
the initial recording was done for language training, the words are often
separated by spans of silence; this is rather unusual in recorded
speech. Also, the original recordings often contain a few seconds of silence
before and after the spoken phrase.

To use the sound snippets in Wikidata lexemes, however, we need a
separate sound snippet for every word, without surrounding silence.
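
As a rough illustration of how silent spans can be located, the sketch below parses the log that FFmpeg's `silencedetect` filter writes to stderr and derives the non-silent spans in between. It is not the code from `split_phrases.py`: the function names, the `-30dB` noise threshold, the 0.3-second minimum silence duration, and the sample log are all assumptions made for this example.

```python
import re
import subprocess

# Patterns for the two kinds of lines the silencedetect filter logs.
SILENCE_START = re.compile(r"silence_start:\s*(-?\d+(?:\.\d+)?)")
SILENCE_END = re.compile(r"silence_end:\s*(-?\d+(?:\.\d+)?)")

# Hypothetical log excerpt, shaped like silencedetect output.
SAMPLE_LOG = """\
[silencedetect @ 0x1] silence_start: 0
[silencedetect @ 0x1] silence_end: 1.5 | silence_duration: 1.5
[silencedetect @ 0x1] silence_start: 2.1
[silencedetect @ 0x1] silence_end: 2.9 | silence_duration: 0.8
[silencedetect @ 0x1] silence_start: 3.4
[silencedetect @ 0x1] silence_end: 5 | silence_duration: 1.6
"""


def run_silencedetect(path, noise="-30dB", min_duration=0.3):
    """Run FFmpeg's silencedetect filter on a file and return its stderr log.

    The threshold and duration defaults are illustrative assumptions.
    """
    cmd = [
        "ffmpeg", "-hide_banner", "-nostats", "-i", path,
        "-af", f"silencedetect=noise={noise}:d={min_duration}",
        "-f", "null", "-",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stderr


def parse_silences(log):
    """Extract (start, end) pairs of silent spans, in seconds, from a log."""
    silences = []
    start = None
    for line in log.splitlines():
        match = SILENCE_START.search(line)
        if match:
            start = float(match.group(1))
            continue
        match = SILENCE_END.search(line)
        if match and start is not None:
            silences.append((start, float(match.group(1))))
            start = None
    return silences


def speech_segments(silences, total_duration):
    """Return the non-silent (start, end) spans between the given silences."""
    segments = []
    cursor = 0.0
    for start, end in silences:
        if start > cursor:
            segments.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < total_duration:
        segments.append((cursor, total_duration))
    return segments


if __name__ == "__main__":
    print(speech_segments(parse_silences(SAMPLE_LOG), total_duration=5.0))
```

With the sample log, the two gaps between the three silent spans come out as candidate word snippets. A recording that yields fewer non-silent spans than the phrase has words cannot be split this way.
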
The tool `split_phrases.py` helps to solve this problem: it goes over the
input files, calls [FFmpeg](https://www.ffmpeg.org/) to detect
silences, and then applies a simple heuristic to split the sound file
into single words. Finally, the tool will tag each snippet with
metadata (such as license, performer, or language) and compress the
sound in the lossless [FLAC format](https://en.wikipedia.org/wiki/FLAC).

To run the splitting script, we’ve used the following command inside
the Linux container:

```
python split_phrases.py -o split \
--language=rm-sursilv --date=2007-03-09 \
--performer="Erwin Ardüser" \
--organization="Lia Rumantscha / Conradin Klaiss, 7001 Chur, Switzerland" \
--copyright="2007 Lia Rumantscha" \
--license="Creative Commons Zero v1.0 Universal" \
/recordings
```

Some input files, for example the recorded phrase [“bien
di” 🔉](https://cdn.jsdelivr.net/gh/brawer/PronunBot/testdata/split_phrases/bien%20di.mp3),
do not have enough silent spans for splitting the phrase into
words. The tool logs the problem cases into `split-failures.txt`
next to the output files.

## Quality assessment
To check the quality of recorded phrases, run `python3 assess_quality.py split`
on the Mac command line. For each phrase or word, the tool plays all available
recordings. The user then picks the best variant, or `0` if they’re all bad.
The quality assessment gets recorded into a file `qa.txt`.

## Uploading sound files to Wikimedia Commons
To upload the recordings to Wikimedia Commons, run this in the Linux container:
```
PYTHONPATH=/pywikibot:$PYTHONPATH python upload_to_commons.py split
```

TODO: Find out why `pywikibot` cannot be installed during
container creation. There is a pip package for pywikibot, but it does
not seem to work properly on Python 3; perhaps it just needs to be
updated.

## Uploading to Wikidata
TODO
## License
The code in this repository is copyright 2018 by [Sascha
Brawer](http://www.brawer.ch), and has been released as free software
under the [MIT license](https://spdx.org/licenses/MIT.html).