{"id":20259593,"url":"https://github.com/nrc-cnrc/portagetextprocessing","last_synced_at":"2025-10-26T15:47:44.811Z","repository":{"id":44752571,"uuid":"342366840","full_name":"nrc-cnrc/PortageTextProcessing","owner":"nrc-cnrc","description":"Text processing tools that came out of the Portage SMT project — Outils de traitement de texte issus du projet Portage de TAS","archived":false,"fork":false,"pushed_at":"2024-07-09T20:39:09.000Z","size":424,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":11,"default_branch":"main","last_synced_at":"2025-01-14T04:12:25.422Z","etag":null,"topics":["machine-translation","mt","natural-language-processing","neural-machine-translation","nlp","nmt","preprocessing","smt","statistical-machine-translation","text-processing"],"latest_commit_sha":null,"homepage":"","language":"Perl","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nrc-cnrc.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-02-25T20:12:04.000Z","updated_at":"2024-07-09T20:39:13.000Z","dependencies_parsed_at":"2024-04-03T20:00:58.155Z","dependency_job_id":"1c27f42b-5495-47a5-99a8-98344eb417e3","html_url":"https://github.com/nrc-cnrc/PortageTextProcessing","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nrc-cnrc%2FPortageTextProcessing","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nrc-cnrc%2FPortageTextProcessing/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nrc-cnrc%2FPortageTextProcessing/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nrc-cnrc%2FPortageTextProcessing/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nrc-cnrc","download_url":"https://codeload.github.com/nrc-cnrc/PortageTextProcessing/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241720002,"owners_count":20008898,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["machine-translation","mt","natural-language-processing","neural-machine-translation","nlp","nmt","preprocessing","smt","statistical-machine-translation","text-processing"],"created_at":"2024-11-14T11:15:31.839Z","updated_at":"2025-10-26T15:47:44.754Z","avatar_url":"https://github.com/nrc-cnrc.png","language":"Perl","funding_links":[],"categories":[],"sub_categories":[],"readme":"[Français](LISEZMOI.md)\n\n# Portage Text Processing\n\nThis repository contains a number of text pre- and post-processing utilities written in\nthe context of the Portage Statistical Machine Translation project.  Since they are\nfrequently useful outside that context, we have separated them into this repository that\nis designed to be trivial to install.\n\n## Installation\n\nClone this repo to the location of your choice and add this line to your .profile or .bashrc:\n\n`source /path/to/PortageTextProcessing/SETUP.bash`\n\n## Dependencies\n\nPortageTextProcessing requires:\n - Perl \u003e= 5.14, as `perl` on your PATH, with the packages listed in cpanfile;\n - any version of Python 3, as `python3` on your PATH, with the packages listed\n   in requirements.txt;\n - `/bin/bash`, `/bin/sh`, `/usr/bin/env`;\n - `xmllint` (comes with libxml2) and `xml_grep` (comes with Perl's XML::Twig).\n\nFirst, check if you already have these dependencies, since they are very common:\ngo to `tests/check-installation/` and run `./run-test.sh`. This test suite will\nflag any missing dependencies.\n\nInstall missing dependencies with the package manager of your choice, ideally\nthe OS's own distro manager, like apt, yum, or brew.\n\nCentOS 7 packages:\n\n    yum install perl-XML-Twig perl-XML-XPath perl-XML-Writer libxml2 python3\n\nUbuntu 20.04 packages:\n\n    apt-get install libxml-twig-perl libxml-xpath-perl libxml-writer-perl\n    apt-get install libxml2-utils xml-twig-tools python3\n\nFor the Python 3 dependencies, with any OS:\n\n    pip3 install -r requirements.txt\n\n## Testing\n\nFor more extensive testing, go to `tests/` and run `./run-all-tests.sh`.  Go into any\ndirectory showing errors and examine `_log.run-test` to see what went wrong, or run\n`./run-test.sh` interactively.\n\nSome test suites are parallelized to run faster. If you have difficulty figuring out\nwhich command caused the error, you can also run `make -B` interactively in any test\nsuite instead of `./run-test.sh`, to run all its test cases sequentially and stop at the\nfirst error.\n\nIf you have installed [PortageClusterUtils](https://github.com/nrc-cnrc/PortageClusterUtils),\nyou can also run all the test suites in parallel with `./run-all-tests.sh -j 12`.\n\n## Documentation\n\nEach script accepts the `-h` option to output its documentation to your terminal.\n\n## List of scripts\n\n| Script                          | Brief Description                                          |\n| ------------------------------- | ---------------------------------------------------------- |\n| `clean-utf8-text.pl`            | Clean up spaces, control chars, hyphen, etc. in utf8 text. |\n| `clean_utf8.py`                 | Yet another utf8 clean up script, now in Python 3.         |\n| `crlf2lf.sh`                    | Convert CRLF (DOS-style) line endings to LF (UNIX-style).  |\n| `diff-round.pl`                 | Like diff, but ignore rounding errors.                     |\n| `expand-auto.pl`                | Like expand, with automatically calculated tab stops.      |\n| `filter-long-lines.pl`          | Filter out long lines.                                     |\n| `filter-parallel.py`            | Filter parallel files by scores.                           |\n| `fix-slashes.pl`                | Separate slash-joined words.                               |\n| `lc-utf8.pl`                    | Map utf8 text to lowercase, regardless of your locale.     |\n| `lfl2tmx.pl`                    | Create a TMX file from plain text aligned files.           |\n| `li-sort.sh`                    | Locale-independent sort.                                   |\n| `lines.py`                      | Extract the given lines from a file.                       |\n| `map-chinese-punct.pl`          | Map Chinese wide punctuation marks to similar narrow ones. |\n| `normalize-iu-spelling.pl`      | Apply Inuktut syllabic character normalization rules.      |\n| `normalize-unicode.pl`          | Normalize unicode input into canonical representations.    |\n| `parallel-uniq.pl`              | Like uniq, but take into consideration parallel files.     |\n| `ridbom.sh`                     | Remove the byte-order marker (BOM) from UTF8 input.        |\n| `second-to-hms.pl`              | Convert from seconds to HH:MM:SS or vice-versa.            |\n| `select-line`                   | Get a given line from a text file.                         |\n| `select-lines.py`               | Extract the given lines from a file.                       |\n| `select-random-chunks.py`       | Sample random chunks from a file or by indices.            |\n| `sort-by-length.pl`             | Sort a text file by line length.                           |\n| `stableuniq.pl`                 | Remove duplicates without sorting.                         |\n| `strip-parallel-blank-lines.py` | Strip parallel blank lines from two line-aligned files.    |\n| `strip-parallel-duplicates.py`  | Strip aligned lines that are the same in both files.       |\n| `tmx2lfl.pl`                    | Convert a TMX file to plain text aligned files.            |\n| `udetokenize.pl`                | Detokenize utf8 text, reversing utokenize.pl.              |\n| `utokenize.pl`                  | Tokenize utf8 text, e.g., for machine translation.         |\n| `which-test.sh`                 | Which-like program with reliable exit status.              |\n\n## Contributing\n\nIf you want to contribute scripts to this repo, please:\n - Make sure they require no compilation or installation (beyond sourcing `SETUP.bash`).\n - Add unit tests for your scripts under `tests/`.\n - Keep them relevant, which means pretty much anything related to text processing goes.\n\n## Citation\n\n```bib\n@misc{Portage_Text_Processing,\nauthor = {Larkin, Samuel and Joanis, Eric and Stewart, Darlene and Simard, Michel and Foster, George and Ueffing, Nicola and Tikuisis, Aaron},\nlicense = {MIT},\ntitle = {{Portage Text Processing}},\nurl = {https://github.com/nrc-cnrc/PortageTextProcessing},\nyear = {2022},\n}\n```\n\n## Copyright\n\nTraitement multilingue de textes / Multilingual Text Processing \\\nCentre de recherche en technologies numériques / Digital Technologies Research Centre \\\nConseil national de recherches Canada / National Research Council Canada \\\nCopyright 2022, Sa Majesté le Roi du Chef du Canada / His Majesty the King in Right of Canada \\\nPublished under the MIT License (see [LICENSE](LICENSE))\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnrc-cnrc%2Fportagetextprocessing","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnrc-cnrc%2Fportagetextprocessing","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnrc-cnrc%2Fportagetextprocessing/lists"}