https://github.com/petercorke/vtt-clean

Python script to clean VTT files generated by Microsoft Stream

captioning microsoft-stream timecode vtt vtt-subtitles

Host: GitHub
URL: https://github.com/petercorke/vtt-clean
Owner: petercorke
Created: 2021-01-15T05:40:52.000Z (over 4 years ago)
Default Branch: main
Last Pushed: 2021-01-15T05:52:16.000Z (over 4 years ago)
Last Synced: 2025-02-01T08:45:43.616Z (3 months ago)
Topics: captioning, microsoft-stream, timecode, vtt, vtt-subtitles
Language: Python
Homepage:
Size: 1.95 KB
Stars: 0
Watchers: 3
Forks: 1
Open Issues: 3
Metadata Files:
- Readme: README.md

# VTT-clean

Remove cruft from a Microsoft Stream generated transcript file.

Turns this

```
WEBVTT

NOTE duration:"00:09:50.5320000"

NOTE language:en-us

NOTE Confidence: 0.892255544662476

6c478ecc-cbcd-4c6b-9a25-df9adba3799f
00:00:05.510 --> 00:00:08.516
Let's discuss intelligence here.
Examples of two organisms, both

NOTE Confidence: 0.892255544662476

de8e238d-692c-472c-9200-e502e9bd2825
00:00:08.516 --> 00:00:12.190
of which are intelligent but in
very different ways. We use

NOTE Confidence: 0.892255544662476

aa265655-5aef-4fb2-987f-1da913dda503
00:00:12.190 --> 00:00:15.530
Albert Einstein as an example
for all human beings, and

NOTE Confidence: 0.892255544662476
```

into this

```
WEBVTT

00:00:05.510 --> 00:00:08.516
Let's discuss intelligence here. Examples of two organisms, both

00:00:08.516 --> 00:00:12.190
of which are intelligent but in very different ways. We use

00:00:12.190 --> 00:00:15.530
Albert Einstein as an example for all human beings, and
```

* removes the `NOTE` lines
* removes duplicated blank lines
* removes the long hex strings (they look like UUIDs), which I think are autogenerated cue identifiers for the time-coded text that follows
* merges the short lines of each cue into a single longer line
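
The steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the script shipped in this repository: the function name `clean_vtt` and the two regular expressions are assumptions, and it assumes each `NOTE` comment fits on a single line, as in the sample above.

```python
import re

# Timing lines look like "00:00:05.510 --> 00:00:08.516".
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2}\.\d{3} --> \d{2}:\d{2}:\d{2}\.\d{3}")
# Assumption: the cue identifiers emitted by Microsoft Stream are UUID-shaped.
CUE_ID = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)


def clean_vtt(text):
    """Return a cleaned copy of a Stream-style VTT transcript (hypothetical helper)."""
    cues = []       # list of (timing_line, [caption lines])
    current = None  # the cue currently being filled
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines are re-inserted on output
        if current is None and line.startswith("WEBVTT"):
            continue  # file header, re-emitted on output
        if line == "NOTE" or line.startswith("NOTE ") or CUE_ID.match(line):
            continue  # drop comments and cue identifiers
        if TIMING.match(line):
            current = (line, [])
            cues.append(current)
        elif current is not None:
            current[1].append(line)  # caption text belonging to this cue

    out = ["WEBVTT"]
    for timing, captions in cues:
        out.extend(["", timing, " ".join(captions)])  # one blank line per cue
    return "\n".join(out) + "\n"
```

Reading a transcript into a string, passing it to `clean_vtt`, and printing the result should give output shaped like the second listing above. A caption line that itself begins with `NOTE ` would be dropped by this sketch, so treat it as a starting point only.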

## Run it

```
# SCRIPT.py stands for the cleaning script in this repository
python SCRIPT.py original.vtt > vtt_clean.vtt
```
