HTS-style full-context labels for JSUT v1.1
- Host: GitHub
- URL: https://github.com/r9y9/jsut-lab
- Owner: r9y9
- License: MIT
- Created: 2019-09-26T13:09:46.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2021-04-16T05:59:59.000Z (over 3 years ago)
- Last Synced: 2024-12-03T21:13:27.484Z (about 1 month ago)
- Topics: dataset, hts, jsut, speech-synthesis, text-to-speech, tts, voice-conversion
- Homepage: https://sites.google.com/site/shinnosuketakamichi/publication/jsut
- Size: 21.5 MB
- Stars: 46
- Watchers: 5
- Forks: 2
- Open Issues: 1
- Metadata Files:
  - Readme: README.md
  - License: LICENSE.md
# jsut-lab
[![DOI](https://zenodo.org/badge/211091946.svg)](https://zenodo.org/badge/latestdoi/211091946)
This repository provides HTK/HTS-style alignment files with additional full-context labels for the [JSUT (Japanese speech corpus of Saruwatari-lab., University of Tokyo)](https://sites.google.com/site/shinnosuketakamichi/publication/jsut) corpus (v1.1). All alignment files (.lab) were extracted by forced alignment using [Julius](https://github.com/julius-speech/julius), and the full-context labels were generated with [OpenJTalk](http://open-jtalk.sp.nitech.ac.jp/).
The label files are intended for speech research, e.g., text-to-speech and voice conversion.
The directory structure is exactly the same as that of JSUT, so you can place the label files directly in the JSUT data directory if you want:
```
tree ~/data/jsut_ver1.1/ -d -L 2
/home/ryuichi/data/jsut_ver1.1/
├── basic5000
│ ├── lab
│ └── wav
├── countersuffix26
│ ├── lab
│ └── wav
├── loanword128
│ ├── lab
│ └── wav
├── onomatopee300
│ ├── lab
│ └── wav
├── precedent130
│ ├── lab
│ └── wav
├── repeat500
│ ├── lab
│ └── wav
├── travel1000
│ ├── lab
│ └── wav
├── utparaphrase512
│ ├── lab
│ └── wav
└── voiceactress100
├── lab
└── wav
```

## Label format
Each line has three fields: start time, end time, and a full-context label. Times are in 100 ns units, the same as HTK labels.
```
$ cat basic5000/lab/BASIC5000_0773.lab | head
0 2525000 xx^xx-sil+s=a/A:xx+xx+xx/B:xx-xx_xx/C:xx_xx+xx/D:18+xx_xx/E:xx_xx!xx_xx-xx/F:xx_xx#xx_xx@xx_xx|xx_xx/G:6_3%0_xx_xx/H:xx_xx/I:xx-xx@xx+xx&xx-xx|xx+xx/J:1_6/K:3+6-32
2525000 3825000 xx^sil-s+a=N/A:-2+1+6/B:xx-xx_xx/C:18_xx+xx/D:24+xx_xx/E:xx_xx!xx_xx-xx/F:6_3#0_xx@1_1|1_6/G:3_1%0_xx_0/H:xx_xx/I:1-6@1+3&1-6|1+32/J:2_10/K:3+6-32
3825000 4825000 sil^s-a+N=g/A:-2+1+6/B:xx-xx_xx/C:18_xx+xx/D:24+xx_xx/E:xx_xx!xx_xx-xx/F:6_3#0_xx@1_1|1_6/G:3_1%0_xx_0/H:xx_xx/I:1-6@1+3&1-6|1+32/J:2_10/K:3+6-32
4825000 5825000 s^a-N+g=i/A:-1+2+5/B:xx-xx_xx/C:18_xx+xx/D:24+xx_xx/E:xx_xx!xx_xx-xx/F:6_3#0_xx@1_1|1_6/G:3_1%0_xx_0/H:xx_xx/I:1-6@1+3&1-6|1+32/J:2_10/K:3+6-32
5825000 6125000 a^N-g+i=i/A:0+3+4/B:xx-xx_xx/C:18_xx+xx/D:24+xx_xx/E:xx_xx!xx_xx-xx/F:6_3#0_xx@1_1|1_6/G:3_1%0_xx_0/H:xx_xx/I:1-6@1+3&1-6|1+32/J:2_10/K:3+6-32
6125000 7524999 N^g-i+i=N/A:0+3+4/B:xx-xx_xx/C:18_xx+xx/D:24+xx_xx/E:xx_xx!xx_xx-xx/F:6_3#0_xx@1_1|1_6/G:3_1%0_xx_0/H:xx_xx/I:1-6@1+3&1-6|1+32/J:2_10/K:3+6-32
7524999 8125000 g^i-i+N=w/A:1+4+3/B:xx-xx_xx/C:18_xx+xx/D:24+xx_xx/E:xx_xx!xx_xx-xx/F:6_3#0_xx@1_1|1_6/G:3_1%0_xx_0/H:xx_xx/I:1-6@1+3&1-6|1+32/J:2_10/K:3+6-32
8125000 8425000 i^i-N+w=a/A:2+5+2/B:xx-xx_xx/C:18_xx+xx/D:24+xx_xx/E:xx_xx!xx_xx-xx/F:6_3#0_xx@1_1|1_6/G:3_1%0_xx_0/H:xx_xx/I:1-6@1+3&1-6|1+32/J:2_10/K:3+6-32
8425000 10125000 i^N-w+a=pau/A:3+6+1/B:18-xx_xx/C:24_xx+xx/D:07+xx_xx/E:xx_xx!xx_xx-xx/F:6_3#0_xx@1_1|1_6/G:3_1%0_xx_0/H:xx_xx/I:1-6@1+3&1-6|1+32/J:2_10/K:3+6-32
10125000 11325000 N^w-a+pau=d/A:3+6+1/B:18-xx_xx/C:24_xx+xx/D:07+xx_xx/E:xx_xx!xx_xx-xx/F:6_3#0_xx@1_1|1_6/G:3_1%0_xx_0/H:xx_xx/I:1-6@1+3&1-6|1+32/J:2_10/K:3+6-32
```

For details, please refer to the HTS documentation: http://hts.sp.nitech.ac.jp
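As a quick illustration of the format above, here is a minimal pure-Python sketch (function names are my own, not part of this repository) that splits a label line into its three fields and pulls the center phoneme out of the quinphone part of the full-context label:

```python
import re

def parse_lab_line(line):
    """Split one HTS-style label line into (start, end, label).

    Times are integers in 100 ns units; divide by 1e7 for seconds.
    """
    start, end, label = line.split(maxsplit=2)
    return int(start), int(end), label

def current_phoneme(label):
    """Extract the center phoneme from the quinphone prefix,
    e.g. 'xx^xx-sil+s=a/...' -> 'sil'."""
    m = re.match(r"[^\^]+\^[^-]+-([^+]+)\+", label)
    return m.group(1) if m else None

# First line of BASIC5000_0773.lab from the example above (truncated tail).
line = "0 2525000 xx^xx-sil+s=a/A:xx+xx+xx/B:xx-xx_xx/C:xx_xx+xx/D:18+xx_xx"
start, end, label = parse_lab_line(line)
print(start / 1e7, end / 1e7, current_phoneme(label))  # 0.0 0.2525 sil
```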
## What can I do with this?
If you want to build traditional DNN-based TTS systems, check out the tutorials at https://r9y9.github.io/nnmnkwii/latest/. You can use the alignments and full-context labels to generate linguistic features.
If you are interested in end-to-end approaches, have a look at https://github.com/espnet/espnet. The labels are used in the preprocessing stage of the JSUT recipe (see also https://r9y9.github.io/blog/2017/11/12/jsut_ver1/ for why alignments are needed for end-to-end TTS).
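One concrete use of the alignments is extracting per-phoneme durations (e.g. as duration targets for acoustic modeling). A small sketch under the same assumptions as above (helper name is hypothetical, times in 100 ns units):

```python
import re

def phoneme_durations(lab_lines):
    """Return (phoneme, duration_in_seconds) pairs from
    HTS-style label lines."""
    durs = []
    for line in lab_lines:
        start, end, label = line.split(maxsplit=2)
        # Center phoneme of the quinphone, e.g. 'xx^xx-sil+s=a' -> 'sil'.
        m = re.match(r"[^\^]+\^[^-]+-([^+]+)\+", label)
        phoneme = m.group(1) if m else label
        durs.append((phoneme, (int(end) - int(start)) / 1e7))
    return durs

lines = [
    "0 2525000 xx^xx-sil+s=a/A:xx",
    "2525000 3825000 xx^sil-s+a=N/A:-2",
    "3825000 4825000 sil^s-a+N=g/A:-2",
]
print(phoneme_durations(lines))
# [('sil', 0.2525), ('s', 0.13), ('a', 0.1)]
```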
Happy speech hacking!
## Source code to generate labels
https://github.com/r9y9/segmentation-kit/tree/jsut3
## Notice
- Alignments may contain errors because they were generated automatically by Julius; they are not hand-annotated labels.
## References
- [JSUT (Japanese speech corpus of Saruwatari-lab., University of Tokyo)](https://sites.google.com/site/shinnosuketakamichi/publication/jsut)
- [HTS](http://hts.sp.nitech.ac.jp)
- [Julius](https://github.com/julius-speech/julius)
- [OpenJTalk](http://open-jtalk.sp.nitech.ac.jp/)
- [Preprocessing the JSUT corpus for Japanese end-to-end speech synthesis [arXiv:1711.00354]](https://r9y9.github.io/blog/2017/11/12/jsut_ver1/)
- [pyopenjtalk](https://github.com/r9y9/pyopenjtalk)
- [nnmnkwii](https://github.com/r9y9/nnmnkwii)
- [sarulab-speech/jsut-label](https://github.com/sarulab-speech/jsut-label) Hand-annotated phonetic and prosodic information from Saruwatari-lab.