https://github.com/nicolay-r/bulk-translate
A tiny, no-strings-attached Python package for translating massive streams of texts, with native support for pre-annotated fixed spans that the translator leaves unchanged.
- Host: GitHub
- URL: https://github.com/nicolay-r/bulk-translate
- Owner: nicolay-r
- License: mit
- Created: 2024-11-10T13:44:50.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2025-03-10T13:19:28.000Z (over 1 year ago)
- Last Synced: 2025-04-25T22:17:48.133Z (about 1 year ago)
- Topics: arekit, framework, googletrans, iterator, pipeline, span, span-based, spreadsheet, spreadsheets, translate
- Language: Python
- Homepage:
- Size: 390 KB
- Stars: 5
- Watchers: 1
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# bulk-translate 0.25.2

[Open in Colab](https://colab.research.google.com/github/nicolay-r/bulk-translate/blob/master/bulk_translate_demo.ipynb)
[Post on X](https://x.com/nicolayr_/status/1871218031709323461)
[PyPI download statistics](https://pypistats.org/packages/bulk-translate)
Third-party providers hosting↗️
A tiny, no-strings-attached Python package for translating massive `CSV`/`JSONL` files,
with native support for pre-annotated **fixed spans** that the translator leaves unchanged.
## Description
### 📘 More on spans
### 📘 `bulk-translate` features
Out of the box, `bulk-translate` provides:
* ✅ Support for `spans`, used for annotation and optional translation.
* ✅ Native implementation of two translation modes:
  - `fast-mode`: joins all text parts into a single batch using extra separator characters, then splits the translated result back into parts.
  - `accurate`: translates each text part individually.
* ✅ No strings attached: you're free to adopt any LM / LLM backend.
  - Supports `googletrans` by default.
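
The difference between the two modes can be illustrated with a minimal sketch. This is not the package's actual implementation: the part representation (plain strings are translated, one-element lists are fixed spans), the `SEP` separator, and the function names `translate_accurate`, `translate_fast`, and `fake_translate` are all hypothetical, chosen only for illustration.

```python
from typing import Callable, List, Union

# Hypothetical representation: plain strings are translated, while
# parts wrapped in a one-element list are fixed spans kept as-is.
Part = Union[str, List[str]]

# Hypothetical separator, assumed to survive translation unchanged.
SEP = "\n"


def translate_accurate(parts: List[Part], translate: Callable[[str], str]) -> List[Part]:
    """Translate every plain-text part with its own backend call."""
    return [p if isinstance(p, list) else translate(p) for p in parts]


def translate_fast(parts: List[Part], translate: Callable[[str], str]) -> List[Part]:
    """Join plain-text parts into one batch, translate once, then split back."""
    plain = [p for p in parts if isinstance(p, str)]
    if not plain:
        return list(parts)
    translated = translate(SEP.join(plain)).split(SEP)
    if len(translated) != len(plain):
        # The backend altered the separators: fall back to per-part calls.
        return translate_accurate(parts, translate)
    translated_iter = iter(translated)
    return [p if isinstance(p, list) else next(translated_iter) for p in parts]


def fake_translate(text: str) -> str:
    """Stand-in for a real backend: upper-cases the text."""
    return text.upper()


parts = ["met ", ["Alice"], " in ", ["Paris"]]
print(translate_accurate(parts, fake_translate))
print(translate_fast(parts, fake_translate))
```

In this sketch the fast variant issues a single backend call per text but depends on the separator passing through the translator intact, which is why it falls back to per-part translation when the number of pieces does not match.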
## Installation
From PyPI:
```bash
pip install bulk-translate
```
or the latest version from this repository:
```bash
pip install git+https://github.com/nicolay-r/bulk-translate
```
## Usage
### API
### 👉 [Follow this notebook tutorial at `nlp-thirdgate`](https://github.com/nicolay-r/nlp-thirdgate/blob/master/tutorials/translate_texts_with_spans_via_googletrans.ipynb)
## Command Line / Shell
> **NOTE:** Spans are supported only in the JSON-lines format.
> **NOTE:** Requires the `source_iter` package to be installed.
To translate the [`test.tsv` example data](/test/data/test.tsv), in which annotated entities are enclosed in square brackets, run:
```bash
python -m bulk_translate.translate \
--src "test/data/test.tsv" \
--schema '{"translated":"{text}"}' \
--adapter "dynamic:models/googletrans_310a.py:GoogleTranslateModel" \
--output "test-translated.jsonl" \
--batch-size 10 \
%%m \
--src "auto" \
--dest "ru"
```
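
Once the command finishes, the resulting file can be inspected with a few lines of Python. The sketch below assumes that every line of `test-translated.jsonl` is a JSON object containing a `translated` field, matching the key given in `--schema`; the exact set of fields in the real output may differ. The helper name `iter_translations` is hypothetical, and the demo writes its own small sample file rather than relying on real output.

```python
import json
import os
import tempfile
from typing import Iterator


def iter_translations(path: str, key: str = "translated") -> Iterator[str]:
    """Yield the value stored under `key` from every JSON-lines record."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            if key in record:
                yield record[key]


# Demo on a made-up sample (assumed record layout, not real tool output).
sample_path = os.path.join(tempfile.mkdtemp(), "sample-translated.jsonl")
with open(sample_path, "w", encoding="utf-8") as f:
    f.write(json.dumps({"translated": "first text"}) + "\n")
    f.write("\n")
    f.write(json.dumps({"translated": "second text"}) + "\n")

for text in iter_translations(sample_path):
    print(text)
```

To read the actual output, pass `"test-translated.jsonl"` instead of the sample path.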
## Powered by
The pipeline construction components were taken from AREkit [[github]](https://github.com/nicolay-r/AREkit)