https://github.com/petercorke/vtt-clean
Python script to clean VTT files generated by Microsoft Stream
captioning microsoft-stream timecode vtt vtt-subtitles
- Host: GitHub
- URL: https://github.com/petercorke/vtt-clean
- Owner: petercorke
- Created: 2021-01-15T05:40:52.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2021-01-15T05:52:16.000Z (over 4 years ago)
- Last Synced: 2025-02-01T08:45:43.616Z (3 months ago)
- Topics: captioning, microsoft-stream, timecode, vtt, vtt-subtitles
- Language: Python
- Homepage:
- Size: 1.95 KB
- Stars: 0
- Watchers: 3
- Forks: 1
- Open Issues: 3
Metadata Files:
- Readme: README.md
README
# VTT-clean
Remove cruft from a Microsoft Stream generated transcript file.
Turns this
```
WEBVTT

NOTE duration:"00:09:50.5320000"

NOTE language:en-us

NOTE Confidence: 0.892255544662476

6c478ecc-cbcd-4c6b-9a25-df9adba3799f
00:00:05.510 --> 00:00:08.516
Let's discuss intelligence here.
Examples of two organisms, both

NOTE Confidence: 0.892255544662476

de8e238d-692c-472c-9200-e502e9bd2825
00:00:08.516 --> 00:00:12.190
of which are intelligent but in
very different ways. We use

NOTE Confidence: 0.892255544662476

aa265655-5aef-4fb2-987f-1da913dda503
00:00:12.190 --> 00:00:15.530
Albert Einstein as an example
for all human beings, and

NOTE Confidence: 0.892255544662476
```

into this
```
WEBVTT

00:00:05.510 --> 00:00:08.516
Let's discuss intelligence here. Examples of two organisms, both

00:00:08.516 --> 00:00:12.190
of which are intelligent but in very different ways. We use

00:00:12.190 --> 00:00:15.530
Albert Einstein as an example for all human beings, and
```

* removes the `NOTE` lines
* removes duplicated blank lines
* removes the long hex strings, which I think are autogenerated cue identifiers (labels) for the time-coded text that follows
* merges the short lines into longer ones with proper text wrapping

## Run it
```
python original.vtt > vtt_clean.vtt
```
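
The cleaning steps listed above could be sketched roughly as follows. This is not the repository's actual script, only a minimal illustration of the same idea. The function name `clean_vtt`, the `width` parameter and both regular expressions are assumptions made for this example.

```python
import re
import textwrap

# Assumed pattern for a cue timing line, e.g. "00:00:05.510 --> 00:00:08.516".
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2}\.\d{3} --> \d{2}:\d{2}:\d{2}\.\d{3}")
# Assumed pattern for the hex identifier lines, e.g.
# "6c478ecc-cbcd-4c6b-9a25-df9adba3799f".
UUID = re.compile(r"^[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$")


def clean_vtt(text, width=80):
    """Return a cleaned copy of a Stream-style VTT transcript (hypothetical helper)."""
    out = ["WEBVTT"]
    cue_text = []  # short text lines collected for the current cue

    def flush():
        # Merge the short lines of the current cue and re-wrap them.
        if cue_text:
            out.append(textwrap.fill(" ".join(cue_text), width=width))
            cue_text.clear()

    # Drop a leading byte-order mark, if the file has one.
    for line in text.lstrip("\ufeff").splitlines():
        line = line.strip()
        # Skip blank lines, the header (re-added above), NOTE lines and identifiers.
        # Caveat: caption text that itself starts with "NOTE" would also be dropped.
        if not line or line == "WEBVTT" or line.startswith("NOTE") or UUID.match(line):
            continue
        if TIMING.match(line):
            flush()
            out.append("")  # exactly one blank line before each cue
            out.append(line)
        else:
            cue_text.append(line)
    flush()
    return "\n".join(out) + "\n"
```

Calling `clean_vtt(open("original.vtt", encoding="utf-8-sig").read())` returns the cleaned transcript as a string, which can then be printed or written to a new file. Re-wrapping with `textwrap.fill` keeps each cue's text together while letting the line length be chosen independently of how Stream split the lines.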