Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/bitextor/bleualign-cpp
Last synced: 2 months ago
- Host: GitHub
- URL: https://github.com/bitextor/bleualign-cpp
- Owner: bitextor
- License: gpl-3.0
- Created: 2019-08-01T11:07:45.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2023-03-10T21:02:57.000Z (almost 2 years ago)
- Last Synced: 2024-08-03T16:14:35.674Z (5 months ago)
- Language: C++
- Size: 126 KB
- Stars: 7
- Watchers: 11
- Forks: 2
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-machine-translation - Bleualign-cpp - A C++ sentence alignment tool based on Bleualign. Bleualign-cpp is expected to be used together with document-aligner. (Aligners 🌌)
README
# Bleualign-cpp
C++ sentence alignment tool based on [Bleualign](https://github.com/rsennrich/Bleualign).
Bleualign-cpp is expected to be used together with [document-aligner](https://github.com/bitextor/bitextor/tree/master/document-aligner).
### Requirements
- GCC, C++11 compiler
- [Boost](https://www.boost.org/) 1.58.0 or later
- [CMake](https://cmake.org/download/) 3.7.2 or later
- [GTest](https://github.com/google/googletest) (for tests)
- [kpu/preprocess](https://github.com/kpu/preprocess) (already included in this repository as a submodule)
### Compile with CMake
```bash
mkdir build
cd build
cmake .. -DBUILD_TEST=on -DCMAKE_BUILD_TYPE=Release
# To build against a different 'preprocess' folder, run this instead:
# cmake .. -DBUILD_TEST=on -DCMAKE_BUILD_TYPE=Release -DPREPROCESS_PATH=/home/user/preprocess/
make -j 4
tests/test_all
```
### Usage
Bleualign-cpp takes two texts in two different languages and aligns them to produce parallel sentences. To this end, it also needs a translation of one of these texts.
Each input line has the format `url1 url2 text1 text2 text1translated [ text2processed ] [ text1metadata text2metadata ]`. Every text column is base64-encoded; once decoded, it must contain one sentence per line. The translation (`text1translated`) must correspond line by line with the original text (`text1`).
The first line of the input is a header that follows the same layout. Its expected fields are `src_url trg_url src_text trg_text src_translated [ trg_translated ] [ src_metadata trg_metadata ]`.
The first line of the output is also a header. Depending on the arguments provided, its fields are `src_url trg_url src_text trg_text bleualign_score [ src_deferred_hash trg_deferred_hash ] [ src_metadata_field_1 trg_metadata_field_1 ... ]`.
Optionally, a processed version of `text2` can be provided as a sixth column. It should match the processing applied to `text1translated` more closely, which helps when calculating alignment scores. The output of Bleualign-cpp only ever contains text from `text1` and `text2`.
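The input layout described above can be sketched in C++ as follows. This is an illustration only, not code from the repository: the helper names (`base64_encode`, `encode_document`, `make_input_line`) are hypothetical, the tab column separator is an assumption the text above does not state, and terminating every sentence with a newline before encoding is one possible reading of "one sentence per line". Only the five mandatory columns are produced.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Standard base64 (RFC 4648) encoder with '=' padding.
std::string base64_encode(const std::string& in) {
    static const char kAlphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    out.reserve(((in.size() + 2) / 3) * 4);
    std::size_t i = 0;
    // Encode every complete group of three bytes as four characters.
    while (i + 2 < in.size()) {
        const std::uint32_t n =
            (static_cast<std::uint32_t>(static_cast<unsigned char>(in[i])) << 16) |
            (static_cast<std::uint32_t>(static_cast<unsigned char>(in[i + 1])) << 8) |
            static_cast<std::uint32_t>(static_cast<unsigned char>(in[i + 2]));
        out += kAlphabet[(n >> 18) & 63];
        out += kAlphabet[(n >> 12) & 63];
        out += kAlphabet[(n >> 6) & 63];
        out += kAlphabet[n & 63];
        i += 3;
    }
    if (i + 1 == in.size()) {
        // One trailing byte: two characters plus "==".
        const std::uint32_t n =
            static_cast<std::uint32_t>(static_cast<unsigned char>(in[i])) << 16;
        out += kAlphabet[(n >> 18) & 63];
        out += kAlphabet[(n >> 12) & 63];
        out += "==";
    } else if (i + 2 == in.size()) {
        // Two trailing bytes: three characters plus "=".
        const std::uint32_t n =
            (static_cast<std::uint32_t>(static_cast<unsigned char>(in[i])) << 16) |
            (static_cast<std::uint32_t>(static_cast<unsigned char>(in[i + 1])) << 8);
        out += kAlphabet[(n >> 18) & 63];
        out += kAlphabet[(n >> 12) & 63];
        out += kAlphabet[(n >> 6) & 63];
        out += '=';
    }
    return out;
}

// Write one sentence per line, then base64-encode the whole document.
std::string encode_document(const std::vector<std::string>& sentences) {
    std::string text;
    for (const std::string& s : sentences) {
        text += s;
        text += '\n';
    }
    return base64_encode(text);
}

// Build one input record with the five mandatory columns.
// ASSUMPTION: columns are separated by tabs.
std::string make_input_line(const std::string& url1, const std::string& url2,
                            const std::vector<std::string>& text1,
                            const std::vector<std::string>& text2,
                            const std::vector<std::string>& text1_translated) {
    // The translation must correspond line by line with text1.
    if (text1.size() != text1_translated.size()) {
        throw std::invalid_argument("text1 and text1translated differ in line count");
    }
    const char sep = '\t';  // assumed separator
    return url1 + sep + url2 + sep + encode_document(text1) + sep +
           encode_document(text2) + sep + encode_document(text1_translated);
}
```

A caller would write the header line first and then one such record per document pair to the standard input of the tool. The line-count check mirrors the requirement that `text1translated` matches `text1` line by line.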
Bleualign-cpp reads its input from standard input and writes the aligned sentences to standard output. With only the mandatory fields, each output line has the format `url1 url2 source_sentence target_sentence score`.
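Downstream code typically needs to split each output line back into its fields. The following is a minimal sketch of such a parser, under the assumption that fields are tab-separated (the text above does not name the separator). `AlignedPair`, `split_fields` and `parse_output_line` are hypothetical names, and the header line is expected to be skipped by the caller.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// One aligned sentence pair, using the mandatory output fields only.
struct AlignedPair {
    std::string src_url;
    std::string trg_url;
    std::string src_sentence;
    std::string trg_sentence;
    double score;
};

// Split a line on a delimiter, keeping empty fields.
// ASSUMPTION: the tool separates fields with tabs.
std::vector<std::string> split_fields(const std::string& line, char delim = '\t') {
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    while (true) {
        const std::string::size_type pos = line.find(delim, start);
        if (pos == std::string::npos) {
            fields.push_back(line.substr(start));
            break;
        }
        fields.push_back(line.substr(start, pos - start));
        start = pos + 1;
    }
    return fields;
}

// Parse one non-header output line; extra optional fields are ignored.
AlignedPair parse_output_line(const std::string& line) {
    const std::vector<std::string> f = split_fields(line);
    if (f.size() < 5) {
        throw std::runtime_error("expected at least 5 fields");
    }
    return AlignedPair{f[0], f[1], f[2], f[3], std::stod(f[4])};
}
```

With the pairs parsed this way, a caller can apply its own score cut-off after the fact, in addition to whatever threshold was passed to the tool.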
##### Optional Parameters
* **--help** - Print help dialog
* **--bleu_threshold** - Sentence-level BLEU score threshold (Default: 0.0)
* **--print-sent-hash** - Print hash for each sentence
* **--metadata-header-fields** - Language-agnostic, comma-separated list of metadata header fields (the prefixes `src_` and `trg_` are added to each field name afterwards)
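To illustrate how a value passed to `--metadata-header-fields` might map to header names, here is a small sketch that expands a comma-separated list into prefixed fields. It is not the tool's own code: `expand_metadata_fields` is a hypothetical helper, and the interleaved order (`src_` then `trg_` for each field) is an assumption inferred from the output header pattern `src_metadata_field_1 trg_metadata_field_1 ...` given earlier.

```cpp
#include <string>
#include <vector>

// Expand "a,b" into {"src_a", "trg_a", "src_b", "trg_b"}.
// ASSUMPTION: fields are interleaved per name, as the output header suggests.
std::vector<std::string> expand_metadata_fields(const std::string& csv) {
    std::vector<std::string> header;
    std::string::size_type start = 0;
    while (start <= csv.size()) {
        std::string::size_type end = csv.find(',', start);
        if (end == std::string::npos) {
            end = csv.size();
        }
        const std::string name = csv.substr(start, end - start);
        if (!name.empty()) {  // ignore empty entries such as a trailing comma
            header.push_back("src_" + name);
            header.push_back("trg_" + name);
        }
        start = end + 1;
    }
    return header;
}
```

For example, the hypothetical value `date,author` would yield the four names `src_date`, `trg_date`, `src_author` and `trg_author`.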