https://github.com/milahu/translate-richtext
translate rich-text documents between human languages, online or offline
- Host: GitHub
- URL: https://github.com/milahu/translate-richtext
- Owner: milahu
- License: mit
- Created: 2024-03-04T12:01:30.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-03-16T13:59:47.000Z (about 1 year ago)
- Last Synced: 2025-02-06T09:37:11.500Z (3 months ago)
- Topics: argos-translate, deepl-translate, document-translator, google-translate, html-translator, human-languages, offline-translator, translate-html, translation, translator
- Language: JavaScript
- Homepage:
- Size: 143 KB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: readme.md
- License: license.txt
README
# translate richtext
translate rich-text documents between human languages, online or offline
solve two conflicting goals:
1. preserve the original structure of the document, including whitespace, newlines, indents
2. preserve sentences across structure boundaries like `hello world.`

## use cases
- translate text documents: html, epub, odt, docx, pdf, rtf, latex
- translate video subtitles: srt, vtt

## challenges
### preserve sentences
sentences may be broken by
- newlines in the source text
- markup tags

this is a problem, because when we feed sentence-parts to translators,
the translators return lower-quality translations
than when we feed them full sentences

### align similar texts
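to make the two kinds of translator input concrete, here is a minimal sketch. the helper names `sentenceParts` and `joinedSentence` and the sample markup are assumptions for illustration, not part of this project. the regex-based tag matching is a simplification that a real html parser would replace.

```javascript
// hypothetical helper: return the text between markup tags,
// also broken at newlines, as separate sentence-parts
function sentenceParts(source) {
  return source
    .split(/<[^>]*>/) // tags break the text
    .flatMap((text) => text.split("\n")) // newlines break the text too
    .map((text) => text.trim())
    .filter((text) => text !== "");
}

// hypothetical helper: remove the tags and collapse whitespace,
// so the translator sees the full sentence
function joinedSentence(source) {
  return source
    .replace(/<[^>]*>/g, "")
    .replace(/\s+/g, " ")
    .trim();
}

const source = "<p>the quick <b>brown\nfox</b> jumps.</p>";

console.log(sentenceParts(source));
// [ 'the quick', 'brown', 'fox', 'jumps.' ]

console.log(joinedSentence(source));
// the quick brown fox jumps.
```

the sentence-parts keep their positions in the document, but each part is a poor input for a translator. the full sentence is a good input, but it no longer says where the tags and newlines were.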
our solution is using two translations:
1. a "splitted" translation
2. a "joined" translation

the "splitted" translation serves as a "sourcemap",
it has the correct positions of sentence-parts,
but the translation has worse quality,
because sentences are broken into sentence-parts.

the "joined" translation provides the translated sentences,
with better quality than the "splitted" translation,
but the locations of sentence-parts are lost.

currently, we align the two translations with a "character diff":
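```sh
# character-level diff: git colors text found only in translation.splitted.txt green.
# the first sed expression drops these green spans (sed has no lazy ".*?", so match non-ESC characters),
# the second strips the remaining color codes. tail skips the 5-line diff header.
git diff --word-diff=color --word-diff-regex=. --no-index \
"$(readlink -f translation.joined.txt)" \
"$(readlink -f translation.splitted.txt)" |
sed -E $'s/\e\[32m[^\e]*\e\[m//g; s/\e\\[[0-9;:]*[a-zA-Z]//g' |
tail -n +6 >translation.aligned.txt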
```

## related
- [produce sourcemap of translation argos-translate#372](https://github.com/argosopentech/argos-translate/issues/372)
- [Prohibit the translation of pieces of text in Google Translate](https://webapps.stackexchange.com/questions/52668/prohibit-the-translation-of-pieces-of-text-in-google-translate/154694#154694)

### similar projects
- [argos-translate](https://github.com/argosopentech/argos-translate) - Open-source offline translation library written in Python
- [argos-translate-files](https://github.com/LibreTranslate/argos-translate-files)
- [argos-translate-html](https://github.com/argosopentech/translate-html) - too simple, no merging of "splitted" and "joined" translations
- [subtitlestranslator.com](https://subtitlestranslator.com/en/translate.php)