https://github.com/jina-ai/executor-subtitle-extractor


Host: GitHub
URL: https://github.com/jina-ai/executor-subtitle-extractor
Owner: jina-ai
Created: 2021-10-13T07:05:13.000Z (almost 4 years ago)
Default Branch: main
Last Pushed: 2021-11-11T07:32:23.000Z (almost 4 years ago)
Last Synced: 2025-03-07T03:46:31.603Z (7 months ago)
Language: Python
Size: 27.3 KB
Stars: 1
Watchers: 26
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md


# SubtitleExtractor

SubtitleExtractor extracts text from `.vtt` subtitle files and uses heuristics to remove duplicated text.
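To illustrate the idea, here is a minimal standard-library sketch of reading cues from WebVTT text and dropping lines that repeat the previous one, as often happens with rolling auto-generated captions. This is not the executor's actual implementation: the function name `parse_vtt` and the "skip a line identical to the last kept line" heuristic are assumptions made for illustration.

```python
import re

# Matches a WebVTT cue timing line, e.g. "00:00:01.040 --> 00:00:04.789"
TIMING = re.compile(
    r'(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})\s+-->\s+'
    r'(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})'
)
# Inline tags such as <c> or <00:00:01.500>
TAG = re.compile(r'<[^>]+>')


def _seconds(hours, minutes, seconds, millis):
    """Convert the captured timestamp parts to seconds (hours are optional)."""
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_vtt(text):
    """Hypothetical helper: return (begin_s, end_s, line) tuples without repeated lines."""
    cues, current, last_line = [], None, None
    for raw in text.splitlines():
        if not raw.strip():
            current = None  # a blank line ends the current cue
            continue
        match = TIMING.search(raw)
        if match:
            parts = match.groups()
            current = (_seconds(*parts[:4]), _seconds(*parts[4:]))
            continue
        if current is None:
            continue  # header, cue identifier, or note outside a cue
        line = TAG.sub('', raw).strip()
        # Assumed heuristic: skip a line identical to the last kept line
        if not line or line == last_line:
            continue
        cues.append((current[0], current[1], line))
        last_line = line
    return cues
```

Calling `parse_vtt(open(path, encoding='utf-8').read())` on a file with overlapping cues yields each distinct line once, paired with the timing of the cue where it first appeared.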

## Usage

Suppose you have already downloaded a `.vtt` file:

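```python
from pathlib import Path

from jina import Document, DocumentArray, Flow

# Replace PATH_OF_YOUR_VTT_FILE with the location of your downloaded .vtt file
data_fn = Path(PATH_OF_YOUR_VTT_FILE)
docs = DocumentArray([Document(uri=str(data_fn.absolute()))])

f = Flow().add(uses='jinahub://SubtitleExtractor')
with f:
    f.post(inputs=docs)
```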

The input should be `.vtt` files. Each document in `docs` is split into chunks, and each chunk carries its text and timing information, as in the following example:

```
{
    'id': '2b5d7ac6-3180-11ec-9f3a-acde48001122_0',
    'mime_type': 'text/plain',
    'tags': {'beg_in_seconds': 1.04, 'end_in_seconds': 4.789, 'vid': 'zvXkQkqd2I8.vtt'},
    'text': 'hi my name is han founder and ceo of jina ai',
    'granularity': 1,
    'parent_id': '2b5d7ac6-3180-11ec-9f3a-acde48001122',
    'location': [1040, 4789]
}
```
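As a small sketch of how such a chunk might be consumed, the snippet below works on a plain dictionary copied from the example output and formats its timing tags as a readable timestamp range. The helper names `format_timestamp` and `describe` are hypothetical and not part of the executor; with real results you would read the same fields from each chunk of the returned documents.

```python
# Chunk copied from the example output above, as a plain dict
chunk = {
    'id': '2b5d7ac6-3180-11ec-9f3a-acde48001122_0',
    'mime_type': 'text/plain',
    'tags': {'beg_in_seconds': 1.04, 'end_in_seconds': 4.789, 'vid': 'zvXkQkqd2I8.vtt'},
    'text': 'hi my name is han founder and ceo of jina ai',
    'granularity': 1,
    'parent_id': '2b5d7ac6-3180-11ec-9f3a-acde48001122',
    'location': [1040, 4789],
}


def format_timestamp(seconds):
    """Format a number of seconds as HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f'{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}'


def describe(chunk):
    """Hypothetical helper: render one chunk as '[begin --> end] text'."""
    tags = chunk['tags']
    begin = format_timestamp(tags['beg_in_seconds'])
    end = format_timestamp(tags['end_in_seconds'])
    return f"[{begin} --> {end}] {chunk['text']}"


print(describe(chunk))
```

In this example the `location` values equal the tag times expressed in milliseconds, so either field could drive the formatting; the sketch uses the tags because their units are stated in the key names.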

#### via Docker image (recommended)

```python
from jina import Flow

f = Flow().add(uses='jinahub+docker://SubtitleExtractor')
```

#### via source code

```python
from jina import Flow

f = Flow().add(uses='jinahub://SubtitleExtractor')
```

- To override `__init__` args & kwargs, use `.add(..., uses_with={'key': 'value'})`

- To override class metas, use `.add(..., uses_metas={'key': 'value'})`
