{"id":43954141,"url":"https://github.com/czcorpus/ictools","last_synced_at":"2026-02-07T04:06:23.667Z","repository":{"id":57533979,"uuid":"116274387","full_name":"czcorpus/ictools","owner":"czcorpus","description":"A program for calculating corpora alignments using a pivot language","archived":false,"fork":false,"pushed_at":"2024-03-21T14:18:35.000Z","size":248,"stargazers_count":1,"open_issues_count":2,"forks_count":1,"subscribers_count":5,"default_branch":"master","last_synced_at":"2024-06-20T10:13:59.102Z","etag":null,"topics":["cmd","corpora","corpus","linguistics","manatee-open","parallel-corpora","translation"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/czcorpus.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-01-04T15:11:41.000Z","updated_at":"2022-11-01T11:02:12.000Z","dependencies_parsed_at":"2024-06-20T09:22:53.786Z","dependency_job_id":null,"html_url":"https://github.com/czcorpus/ictools","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"purl":"pkg:github/czcorpus/ictools","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/czcorpus%2Fictools","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/czcorpus%2Fictools/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/czcorpus%2Fictools/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/czcorpus%2Fictools/manifests","owner_url":"https://repos.ecosyste.ms
/api/v1/hosts/GitHub/owners/czcorpus","download_url":"https://codeload.github.com/czcorpus/ictools/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/czcorpus%2Fictools/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29186091,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-07T03:35:06.566Z","status":"ssl_error","status_checked_at":"2026-02-07T03:34:57.604Z","response_time":63,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cmd","corpora","corpus","linguistics","manatee-open","parallel-corpora","translation"],"created_at":"2026-02-07T04:06:23.090Z","updated_at":"2026-02-07T04:06:23.658Z","avatar_url":"https://github.com/czcorpus.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ictools - a program for calculating corpora alignments using a pivot language\n\nThis is a faster, less memory-consuming, integrated replacement for legacy *calign.py*,\n*compressrng.py*, *fixgaps.py*, *transalign.py* scripts used to prepare corpora alignment\nnumeric data from lists of structural attribute values mapping between languages. 
It also fixes\nsome problems with missing ranges for unaligned structures you can encounter when using the scripts above.\nIn addition, it also provides an `export` function for performing reversed operations.\n\nNote: you still need *mkalign* tool distributed along with *Manatee-open* to enable corpora alignments\nin *KonText* (or NoSkE).\n\n## Contents\n\n* [Using ictools](#using_ictools)\n* [How to build ictools](#how_to_build_ictools)\n  * [build helper script](#how_to_build_ictools_helper_script)\n  * [manual variant](#how_to_build_ictools_manual_variant)\n* [Benchmark](#benchmark)\n* [For developers](#for_developers)\n  * [Setting up VSCode debugging/testing environment](#for_developers_setting_up_vscode)\n  * [running tests](#for_developers_running_tests)\n\n\u003ca name=\"using_ictools\"\u003e\u003c/a\u003e\n## Using ictools\n\n*Ictools* provide three operations - import, transalign and export:\n\n### import\n\nImport operation transforms an alignment XML file containing aligned string sentence IDs to a numeric form.\nIt is able to handle non-existing alignments, gaps between ranges (including the last row range where structure\nsize is always used to make sure the whole range is filled in).\n\nIn terms of the input format, a list of `\u0026lt;link\u0026gt;` elements is expected:\n\n```xml\n\u003clink type='0-3' xtargets=';cs:Adams-Holisticka_det_k:0:7:1 cs:Adams-Holisticka_det_k:0:7:2 cs:Adams-Holisticka_det_k:0:7:3' status='man'/\u003e\n```\n\nPlease note that the parser does not care about XML validity (e.g. there is no need for a root element or even\na proper nesting of elements).\n\nIn some cases you may want to *tweak line buffer size* (value is in bytes; by default *bufio.MaxScanTokenSize* = 64 * 1024 is used which may fail in case of some complex alignments and/or long text identifiers). 
If the buffer is too\nsmall, ictools logs a fatal error and exits with a non-zero status.\n\n```\nictools -line-buffer 250000 -registry-path /var/local/corpora/registry import ....etc...\n```\n\n**Example:**\n\nLet's say we have two files with mappings between Polish and Czech (*intercorp_pl2cs*) and between\nEnglish and Czech (*intercorp_en2cs*) where Czech is the pivot.\n\n```\nictools -registry-path /var/local/corpora/registry import intercorp_v10_pl intercorp_v10_cs s.id /var/local/corpora/aligndef/intercorp_pl2cs \u003e intercorp.pl2cs\n\nictools -registry-path /var/local/corpora/registry import intercorp_v10_en intercorp_v10_cs s.id /var/local/corpora/aligndef/intercorp_en2cs \u003e intercorp.en2cs\n```\n\n### transalign\n\nThe `transalign` operation takes two numeric alignments against a common pivot language and generates\na new alignment between the two non-pivot languages.\n\n**Example:**\n\n```\nictools transalign ./intercorp.pl2cs ./intercorp.en2cs \u003e intercorp.pl2en\n```\n\n### export\n\nThe `export` operation is able to reconstruct the XML-ish source used as an input\nfor the `import` operation using numeric alignment files as produced by\n`import -\u003e transalign` operations. Any grouped intervals are split back to the original\ntext groups.\n\n**Example:**\n\n```\nictools -export-type intercorp export /corpora/registry/intercorp_v12_cs /corpora/registry/intercorp_v12_en s.id /corpora/aligndef/intercorp.cs2en \u003e orig.xml\n```\n\n\n\u003ca name=\"how_to_build_ictools\"\u003e\u003c/a\u003e\n## How to build ictools\n\nICTools comes with [manabuild](https://github.com/czcorpus/manabuild) as a dependency. 
So in case you have\n `~/go/bin` in your `$PATH`, everything needed to build `ictools` is:\n\n```\nmanabuild\n```\n\nIn case Manabuild finds Manatee-open in a non-standard location where the system does not look for libraries,\nit produces `ictools.bin` with the actual ICTools binary and `ictools` which is a short Bash script\nto set `LD_LIBRARY_PATH` to the path Manabuild found Manatee in and to start the binary. So in this case,\ntwo files must be moved (or copied) to a target installation location (e.g. `/usr/local/bin`).\n\n\n\n\u003ca name=\"benchmark\"\u003e\u003c/a\u003e\n## Benchmark\n\nUsed data files:\n\n* intercorp_pl2cs (size 1.4GB)\n* intercorp_pl2en (size 1.5GB)\n\nUsed hardware:\n\n* A (a server)\n  * CPU: Intel Xeon E5-2640 v3 @ 2.60GHz\n  * 64GB RAM\n* B (a common Dell desktop)\n  * CPU: Intel Core i5-2400 @ 3.10GHz\n  * 8GB RAM\n\n| Setup | Used program    | calign+fixgaps+compress [sec] | transalign [sec] | total [sec]  |\n|-------|-----------------|------------------------------:|-----------------:|-------------:|\n| A     | classic scripts |  255                          | 191              | 446          |\n| A     | ictools         |  **164**                      | **55**           | **219**      |\n| B     | classic scripts |  312                          | DNF (RAM)        | DNF          |\n| B     | ictools         |  **175**                      | **63**           | **238**      |\n\nIctools are approximately **twice as fast** as the original Python scripts.\n\nIn terms of **memory usage**, there were no thorough measurements performed but according to the *top*\nutility the *transalign* function in *ictools* consumes about **30-40% of the memory** consumed\nby the classic scripts. The import function (i.e. 
calign+fixgaps+compress) in both programs\nconsumes only a little RAM because data read from an input file are (almost) immediately written\nto the output without any unnecessary memory allocation.\n\n\u003ca name=\"for_developers\"\u003e\u003c/a\u003e\n## For developers\n\n\u003ca name=\"for_developers_setting_up_vscode\"\u003e\u003c/a\u003e\n### Setting up VSCode debugging/testing environment\n\nRun\n\n```\nmanabuild -no-build\n```\n\nand copy `CGO_LDFLAGS=...`, `CGO_CPPFLAGS=...` and `CGO_CXXFLAGS=...`.\n\nOpen the *debug* environment (left column) and click the \"gear\" button to edit *launch.json*. Then\nset the environment variables to the values copied in the previous step.\n\n```json\n{\n  \"version\": \"0.2.0\",\n  \"configurations\": [\n    {\n      \" ....  parts are omitted here ... \" : \" ... \",\n      \"env\": {\n          \"CGO_LDFLAGS\": \"...\",\n          \"CGO_CPPFLAGS\": \"...\",\n          \"CGO_CXXFLAGS\": \"...\"\n      },\n      \" ....  parts are omitted here ... \" : \" ... \"\n    }\n  ]\n}\n```\n\nThe values in the `env` section are the ones copied in the previous step.\n\n\n\u003ca name=\"for_developers_running_tests\"\u003e\u003c/a\u003e\n### Running tests\n\n```\nmanabuild -test\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fczcorpus%2Fictools","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fczcorpus%2Fictools","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fczcorpus%2Fictools/lists"}