Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/aflorithmic/viseme-to-video
Creates video from TTS output and viseme images.
- Host: GitHub
- URL: https://github.com/aflorithmic/viseme-to-video
- Owner: aflorithmic
- License: mit
- Created: 2022-05-31T17:33:00.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2022-06-18T23:30:06.000Z (over 2 years ago)
- Last Synced: 2024-06-28T08:38:15.266Z (5 months ago)
- Language: Python
- Size: 1.41 MB
- Stars: 11
- Watchers: 1
- Forks: 4
- Open Issues: 1
- Metadata Files:
  - Readme: README.md
  - License: LICENSE
README
![viseme-to-video](https://user-images.githubusercontent.com/60350867/171479529-1d754e88-0934-45cd-a9ce-796e7aaa6534.png)
[![](https://circleci.com/gh/aflorithmic/viseme-to-video.svg?style=svg)](https://app.circleci.com/pipelines/github/aflorithmic/viseme-to-video?branch=main&filter=all) ![contributions-welcome](https://img.shields.io/badge/contributions-welcome-ff69b4) ![GitHub](https://img.shields.io/github/license/aflorithmic/viseme-to-video)
This Python module creates video from viseme images and TTS audio output. I created it to test the sync accuracy between synthesised audio and duration predictions extracted from FastSpeech2 hidden states.
https://user-images.githubusercontent.com/60350867/172639184-0696ffbc-ca98-49b5-9831-33c420b0a5d9.mp4
https://user-images.githubusercontent.com/60350867/172639478-d795896e-88d1-4581-84dc-3ad01e7dfd7e.mp4
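The clips above are driven by per-viseme durations. As a rough, minimal sketch of that general idea (an illustration only, not this repo's implementation; it assumes the third-party `moviepy` package and made-up file names), each viseme image is held on screen for its predicted duration and the resulting frame sequence is muxed with the TTS audio:

```python
# Illustrative sketch only -- not this repo's actual code.
# Assumes the third-party moviepy package (pip install moviepy); file names are made up.
from moviepy.editor import AudioFileClip, ImageSequenceClip

# Hypothetical (viseme, duration-in-milliseconds) timeline, e.g. parsed from a metadata JSON file
timeline = [("sil", 120), ("DD", 95), ("E", 180), ("sil", 150)]

# Pick an image per viseme and convert each duration to seconds
image_paths = [f"image/mouth1/{viseme}.jpeg" for viseme, _ in timeline]  # image names assumed
durations_s = [ms / 1000.0 for _, ms in timeline]

# Hold each image for its predicted duration, attach the TTS audio, and render
clip = ImageSequenceClip(image_paths, durations=durations_s)
clip = clip.set_audio(AudioFileClip("audio/24.wav"))
clip.write_videofile("output.mp4", fps=25)
```

The actual tool drives the same process from the JSON metadata, image and mapping files described in the sections below.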
## Running viseme-to-video
To use this module, first install the dependencies by running:
`pip install -r requirements.txt`
The tool can be run directly from the command line using the command: `python viseme_to_video.py`
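For example, to point the tool at a different image set and render without an audio track (both flags are described in the sections below; the path here is only an illustration): `python viseme_to_video.py --im_dir image/mouth1 --no_audio`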
## Repo contents
This repo contains the following resources:
### **image/**
Two image sets:
- **speaker1/** from the [Oculus developer doc 'Viseme reference'](https://developer.oculus.com/documentation/unity/audio-ovrlipsync-viseme-reference/)
- **mouth1/** adapted from the icSpeech guide ['Mouth positions for English pronunciation'](https://icspeech.com/mouth-positions.html)

A different viseme image directory can be specified on the command line using the flag `--im_dir`.
### **metadata/**
**24.json**: A viseme metadata JSON file we produced during FastSpeech2 inference by:
- extracting the phoneme sequence produced by the text normalisation frontend module
- mapping this to a sequence of visemes
- extracting hidden-state durations (as a number of frames) from FS2
- converting durations from frames to milliseconds (a frame-to-millisecond conversion is sketched below)
- writing this information (phoneme, viseme, duration, offset)

The tool will automatically generate video for all JSON metadata files stored in the `metadata/` folder.
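As a rough illustration of the conversion step, the sketch below turns FastSpeech2 per-phoneme frame counts into millisecond durations and running offsets, then writes records shaped like the ones listed above. The hop length (256 samples) and sampling rate (22050 Hz) are typical FastSpeech2/vocoder settings and are assumptions here, as are the phoneme/viseme labels and the exact JSON layout of `24.json`:

```python
import json

# Assumed vocoder frame parameters -- typical FastSpeech2 settings, not taken from this repo
HOP_LENGTH = 256     # samples per spectrogram frame
SAMPLE_RATE = 22050  # samples per second

def frames_to_ms(n_frames: int) -> float:
    """Convert a duration in spectrogram frames to milliseconds."""
    return n_frames * HOP_LENGTH / SAMPLE_RATE * 1000.0

# Hypothetical per-phoneme FS2 output: (phoneme, viseme, n_frames)
predictions = [("DH", "TH", 5), ("AH0", "aa", 7), ("B", "PP", 6)]

records, offset_ms = [], 0.0
for phoneme, viseme, n_frames in predictions:
    duration_ms = frames_to_ms(n_frames)
    records.append({"phoneme": phoneme, "viseme": viseme,
                    "duration": duration_ms, "offset": offset_ms})
    offset_ms += duration_ms

with open("metadata/example.json", "w") as f:  # file name assumed
    json.dump(records, f, indent=2)
```

With these assumed settings, five frames come out to roughly 58 ms.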
### **map/**
**viseme_map.json**: A JSON file containing mappings between the visemes in viseme metadata files and the image filenames. Mapping visemes was necessary since the viseme set we use to generate our metadata files contained upper/lower-case distinctions, which file naming doesn't support. (I.e. you can't have two files named 't.jpeg' and 'T.jpeg' stored in the same folder.)

A different mapping file can be specified on the command line using the flag `--map`.
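For illustration, a mapping of this kind might look something like the sketch below, applied when resolving image paths. The viseme labels and file names here are made up, not copied from `viseme_map.json`:

```python
import json
import os

# Hypothetical mapping: case-sensitive viseme labels -> case-safe image file names
viseme_map = {"t": "t_lower.jpeg", "T": "t_upper.jpeg", "aa": "aa.jpeg"}

with open("map/example_map.json", "w") as f:  # file name assumed
    json.dump(viseme_map, f, indent=2)

def image_for(viseme: str, im_dir: str = "image/mouth1") -> str:
    """Resolve the image file to show for a given viseme label via the mapping."""
    return os.path.join(im_dir, viseme_map[viseme])

print(image_for("T"))  # image/mouth1/t_upper.jpeg
```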
### **audio/**
**24.wav** - An audio sample generated from FastSpeech2 ([using kan-bayashi's ESPnet framework](https://github.com/espnet/espnet)). This sample uses a [Harvard sentence](https://harvardsentences.com/) as text input (list 3, sentence 5: 'The beauty of the view stunned the young boy').

Audio can be toggled on/off with the argument `--no_audio`.