Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/elizagamedev/vobsubocr
Blazingly fast and accurate DVD VobSub to SRT subtitle conversion
- Host: GitHub
- URL: https://github.com/elizagamedev/vobsubocr
- Owner: elizagamedev
- License: gpl-3.0
- Created: 2021-11-07T09:31:02.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2024-01-04T19:18:36.000Z (10 months ago)
- Last Synced: 2024-10-30T03:52:43.628Z (15 days ago)
- Language: Rust
- Size: 116 KB
- Stars: 27
- Watchers: 8
- Forks: 6
- Open Issues: 10
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
Awesome Lists containing this project
- awesome-blazingly-fast - vobsubocr - Blazingly fast and accurate DVD VobSub to SRT subtitle conversion (Rust)
README
# vobsubocr
`vobsubocr` is a blazingly fast and accurate DVD VobSub to SRT subtitle conversion tool.
## Background
DVD subtitles are unfortunately encoded essentially as a series of images. This
presents problems when needing a text representation of the subtitle, e.g. for
language learning. `vobsubocr` can alleviate this problem by generating SRT
subtitles from an input VobSub file, leveraging the power of
[Tesseract](https://github.com/tesseract-ocr/tesseract).

## Installation

Install the latest release with cargo:
```sh
cargo install vobsubocr
```

Or alternatively, install the development version from git:

```sh
cargo install --git https://github.com/elizagamedev/vobsubocr
```

You will need to have Tesseract's development libraries installed; see the
[leptess readme](https://github.com/houqp/leptess) for more details. If you use
Nix, the included `shell.nix` sets up an environment with all of the necessary
dependencies.

## Usage

```sh
# Convert simplified Chinese vobsub subtitles and print them to stdout.
vobsubocr -l chi_sim shrek_chi.idx

# Convert English vobsub subtitles and write them to a file named "shrek_eng.srt".
vobsubocr -l eng -o shrek_eng.srt shrek_eng.idx
```

We can also specify more advanced configuration options for Tesseract with `-c`.

```sh
# Convert subtitles and blacklist the specified characters from being (mistakenly) recognized.
vobsubocr -l eng -c tessedit_char_blacklist='|\/`_~' shrek_eng.idx
```

## How does it work/compare to similar tools?

The most comparable tool to `vobsubocr` is
[VobSub2SRT](https://github.com/ruediger/VobSub2SRT), but `vobsubocr` has
significantly better output, especially for non-English languages, mainly
because `VobSub2SRT` does not do much preprocessing of the image at all before
sending it to Tesseract. For example, Tesseract 4.0 expects black text on a
white background, which `VobSub2SRT` does not guarantee, but `vobsubocr` does.
Additionally, `vobsubocr` splits each line into separate images to take
advantage of page segmentation method 7, which greatly improves accuracy of
non-English languages in particular.

Official documentation on how to improve accuracy of Tesseract output can be
viewed [here](https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html).

## Miscellaneous Notes

From my understanding, the `chi_sim` and `chi_tra` Tesseract models work on both
simplified and traditional Chinese text, but automatically convert said text to
their respective forms.
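
As a rough illustration of the two preprocessing ideas described under "How does it work/compare to similar tools?" (black text on a white background, and one image per text line), here is a minimal sketch. It is not the code `vobsubocr` actually uses: the `Image` type, the function names `to_black_on_white` and `split_lines`, the fixed threshold, and the "background is the majority of pixels" heuristic are all assumptions made for the example.

```rust
// Hypothetical grayscale image: rows of 8-bit pixels (0 = black, 255 = white).
type Image = Vec<Vec<u8>>;

/// Binarize the image so that text ends up black (0) on white (255).
/// Assumption: the background covers more pixels than the text, so if most
/// pixels are dark, the image is light-on-dark and must be inverted.
fn to_black_on_white(img: &Image, threshold: u8) -> Image {
    let total: usize = img.iter().map(|row| row.len()).sum();
    let dark = img.iter().flatten().filter(|&&p| p < threshold).count();
    let invert = dark * 2 > total;
    img.iter()
        .map(|row| {
            row.iter()
                .map(|&p| {
                    let is_dark = p < threshold;
                    // Text is the minority class; map it to black.
                    if is_dark != invert { 0 } else { 255 }
                })
                .collect()
        })
        .collect()
}

/// Split a black-on-white image into one sub-image per text line, using
/// rows without any black pixel as separators.
fn split_lines(img: &Image) -> Vec<Image> {
    let mut lines = Vec::new();
    let mut current: Image = Vec::new();
    for row in img {
        if row.iter().any(|&p| p == 0) {
            current.push(row.clone());
        } else if !current.is_empty() {
            // A blank row ends the line collected so far.
            lines.push(std::mem::take(&mut current));
        }
    }
    if !current.is_empty() {
        lines.push(current);
    }
    lines
}
```

Each image returned by `split_lines` would then be passed to Tesseract separately with page segmentation mode 7, which treats the input as a single line of text.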