{"id":18119084,"url":"https://github.com/zjaume/clean","last_synced_at":"2025-04-14T17:12:01.466Z","repository":{"id":52187517,"uuid":"370719471","full_name":"ZJaume/clean","owner":"ZJaume","description":"A tool for downloading and cleaning parallel corpora","archived":false,"fork":false,"pushed_at":"2024-02-22T11:21:15.000Z","size":48,"stargazers_count":3,"open_issues_count":1,"forks_count":2,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-14T17:11:53.096Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ZJaume.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-05-25T14:21:05.000Z","updated_at":"2024-02-22T11:21:19.000Z","dependencies_parsed_at":"2022-08-24T00:50:14.041Z","dependency_job_id":null,"html_url":"https://github.com/ZJaume/clean","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ZJaume%2Fclean","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ZJaume%2Fclean/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ZJaume%2Fclean/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ZJaume%2Fclean/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ZJaume","download_url":"https://codeload.github.com/ZJaume/clean/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248923765,"owners_count":21183953,"icon_url":"https:/
/github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-01T05:14:43.176Z","updated_at":"2025-04-14T17:12:01.438Z","avatar_url":"https://github.com/ZJaume.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# clean\n\nThis is a set of scripts for downloading parallel corpora, fixing and cleaning them.\nIt uses [MTData](https://github.com/thammegowda/mtdata) to download and then some rule-based filtering.\nIt started using the cleaning scripts from [Bergamot students](https://github.com/browsermt/students) repository but I added a few more things.\n\n## Installation\nClone the repository and install python modules:\n```\ngit clone --recursive https://github.com/ZJaume/clean\ncd clean\npip install -r requirements.txt\npip install -r tools/bifixer/requirements.txt\n```\n\nInstall zstd for compression of intermediate files:\n```\nsudo apt install zstd\n```\n\n## Usage\n```\nUsage: create-corpus.sh [options]\nOptions:\n      -B BLOCK        Block size of parallel\n      -c CORPORA      Comma-separated list of corpora\n                      from 'mtdata list -l SRC-TRG'\n      -C CACHE        mtdata cache directory\n                      Default: ./cache\n      -j JOBS         Number of jobs of parallel\n      -l SRC-TRG      Language pair\n      -s SIZE         Number of sentences\n```\n\nFirst of all look at the available corpora with `mtdata`:\n```\n$ mtdata list -l en-mt\n```\n```\nINFO:root:Loaded entries: Statmt.org:355  Paracrawl:59  Tilde:519  JoshuaIndianCoprus:29  GlobalVoices:812  UnitedNations:30  OPUS:53,321  OPUS_JW300:44,663  
OPUS100:302  WikiMatrix:1,617  Other:7  Neulab_TEDTalksv1:4,455  Total:106,169\nWARNING:root:Suggestion: Use ISO 639_3 codes eng-mlt instead of en-mt. Let's make a little space for all 7000+ languages of our planet 😢.\nINFO:root:Found 24\nparacrawl_v6    eng-mlt https://s3.amazonaws.com/web-language-models/paracrawl/release6/en-mt.txt.gz\nparacrawl_v7_1  eng-mlt https://s3.amazonaws.com/web-language-models/paracrawl/release7.1/en-mt.txt.gz\nEESC2017        eng-mlt https://tilde-model.s3-eu-west-1.amazonaws.com/EESC2017.en-mt.tmx.zip  *.tmx\nEMA2016 eng-mlt https://tilde-model.s3-eu-west-1.amazonaws.com/EMA2016.en-mt.tmx.zip    *.tmx\necb2017 eng-mlt https://tilde-model.s3-eu-west-1.amazonaws.com/ecb2017.en-mt.tmx.zip    *.tmx\nrapid2016       eng-mlt https://tilde-model.s3-eu-west-1.amazonaws.com/rapid2016.en-mt.tmx.zip  *.tmx\nOPUS_EUconst_v1 eng-mlt http://opus.nlpl.eu/download.php?f=EUconst/v1/moses/en-mt.txt.zip       *.en,*.mt\nOPUS_EUbookshop_v2      eng-mlt http://opus.nlpl.eu/download.php?f=EUbookshop/v2/moses/en-mt.txt.zip    *.en,*.mt\nOPUS_EMEA_v3    eng-mlt http://opus.nlpl.eu/download.php?f=EMEA/v3/moses/en-mt.txt.zip  *.en,*.mt\nOPUS_ECB_v1     eng-mlt http://opus.nlpl.eu/download.php?f=ECB/v1/moses/en-mt.txt.zip   *.en,*.mt\nOPUS_DGT_v2019  eng-mlt http://opus.nlpl.eu/download.php?f=DGT/v2019/moses/en-mt.txt.zip        *.en,*.mt\nOPUS_wikimedia_v20190628        eng-mlt http://opus.nlpl.eu/download.php?f=wikimedia/v20190628/moses/en-mt.txt.zip      *.en,*.mt\nOPUS_Ubuntu_v14_10      eng-mlt http://opus.nlpl.eu/download.php?f=Ubuntu/v14.10/moses/en-mt.txt.zip    *.en,*.mt\nOPUS_TildeMODEL_v2018   eng-mlt http://opus.nlpl.eu/download.php?f=TildeMODEL/v2018/moses/en-mt.txt.zip *.en,*.mt\nOPUS_Tatoeba_v20190709  eng-mlt http://opus.nlpl.eu/download.php?f=Tatoeba/v20190709/moses/en-mt.txt.zip        *.en,*.mt\nOPUS_QED_v2_0a  eng-mlt http://opus.nlpl.eu/download.php?f=QED/v2.0a/moses/en-mt.txt.zip        *.en,*.mt\nOPUS_ParaCrawl_v5       eng-mlt 
http://opus.nlpl.eu/download.php?f=ParaCrawl/v5/moses/en-mt.txt.zip     *.en,*.mt\nOPUS_KDE4_v2    eng-mlt http://opus.nlpl.eu/download.php?f=KDE4/v2/moses/en-mt.txt.zip  *.en,*.mt\nOPUS_JRC_Acquis eng-mlt http://opus.nlpl.eu/download.php?f=JRC-Acquis/en-mt.txt.zip     *.en,*.mt\nOPUS_GNOME_v1   eng-mlt http://opus.nlpl.eu/download.php?f=GNOME/v1/moses/en-mt.txt.zip *.en,*.mt\nJW300   eng-mlt http://opus.nlpl.eu/download.php?f=JW300/v1/xml/en-mt.xml.gz    http://opus.nlpl.eu/download.php?f=JW300/v1/xml/en.zip,http://opus.nlpl.eu/download.php?f=JW300/v1/xml/mt.zip\n```\n\nThen run the main script, passing the `mtdata` IDs of the desired corpora as a comma-separated list:\n```bash\n./create-corpus.sh -l en-mt -c OPUS_TildeMODEL_v2018,JW300,OPUS_Tatoeba_v20190709,OPUS_ECB_v1\n```\n\nThe script will download the corpora with `mtdata`, apply some fixes, clean them, concatenate them, and run near-deduplication.\n\n## Customization\nCorpus-specific fixes can be applied by adding custom executable scripts to the `fixes/` directory; the pipeline will call them.\nThese scripts must be named `fixes/corpus_id.sh` for processing parallel tab-separated input or `fixes/corpus_id.lang.sh` for processing monolingual data.\nFor example, `fixes/JW300.mt.sh` reads monolingual data and fixes some tokenization issues present in the Maltese side of the JW300 corpus:\n```bash\n#!/bin/bash\n\n# Fix Maltese tokenization in JW300 that detokenizer cannot fix\nsed \"s/ - $(echo -ne \\u200b) /-/g\" \\\n    | sed 's/ - /-/g'\n```\n\nOr `fixes/JW300.sh`, which reads the tab-separated input, detokenizes it, and prints the fixed result to stdout in the same tab-separated format:\n```bash\n#!/bin/bash\nset -e\n\n# Detokenize JW300\n\nSRC=$1\nTRG=$2\n\ntemp=$(mktemp -d)\n\ntee \u003e(cut -f1 | sacremoses -j 6 -l $SRC detokenize \u003e$temp/$SRC.detok) \\\n    \u003e(cut -f2 | sacremoses -j 6 -l $TRG detokenize \u003e$temp/$TRG.detok)\n\npaste $temp/$SRC.detok $temp/$TRG.detok\n\nrm -r 
$temp\n```\n\nNote that the scripts that process parallel data will be called with the language identifiers as arguments.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzjaume%2Fclean","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzjaume%2Fclean","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzjaume%2Fclean/lists"}