{"id":37077889,"url":"https://github.com/hplt-project/opuscleaner","last_synced_at":"2026-01-14T09:02:21.175Z","repository":{"id":37568239,"uuid":"505451534","full_name":"hplt-project/OpusCleaner","owner":"hplt-project","description":"OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.","archived":false,"fork":false,"pushed_at":"2025-07-09T12:52:44.000Z","size":8088,"stargazers_count":51,"open_issues_count":57,"forks_count":15,"subscribers_count":9,"default_branch":"main","last_synced_at":"2025-09-16T09:00:58.494Z","etag":null,"topics":["data-cleaning","machine-translation"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/opuscleaner/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hplt-project.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2022-06-20T13:24:45.000Z","updated_at":"2025-07-09T12:46:37.000Z","dependencies_parsed_at":"2023-09-23T10:38:56.371Z","dependency_job_id":"e07be62c-5091-4c93-8262-ec3b13ad4c25","html_url":"https://github.com/hplt-project/OpusCleaner","commit_stats":null,"previous_names":[],"tags_count":13,"template":false,"template_full_name":null,"purl":"pkg:github/hplt-project/OpusCleaner","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hplt-project%2FOpusCleaner","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hplt-project%2FOpusCleaner/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hplt-project%2FOpusCleaner/r
eleases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hplt-project%2FOpusCleaner/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hplt-project","download_url":"https://codeload.github.com/hplt-project/OpusCleaner/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hplt-project%2FOpusCleaner/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28414732,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-14T08:38:59.149Z","status":"ssl_error","status_checked_at":"2026-01-14T08:38:43.588Z","response_time":107,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-cleaning","machine-translation"],"created_at":"2026-01-14T09:02:20.503Z","updated_at":"2026-01-14T09:02:21.167Z","avatar_url":"https://github.com/hplt-project.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# OpusCleaner\nOpusCleaner is a machine translation/language model data cleaner and training scheduler. 
The training scheduler has moved to [OpusTrainer](https://github.com/hplt-project/OpusTrainer).\n\n## Cleaner\nThe cleaner takes care of downloading and cleaning multiple datasets and preparing them for translation.\n\n```sh\nopuscleaner-clean --parallel 4 data/train-parts/dataset.filter.json | gzip -c \u003e clean.gz\n```\n\n### Installation for cleaning\nIf you just want to use OpusCleaner for cleaning, you can install it from PyPI and then run it:\n\n```sh\npip3 install opuscleaner\nopuscleaner-server\n```\n\nThen go to http://127.0.0.1:8000/ to open the interface.\n\nYou can also install and run OpusCleaner on a remote machine, and use [SSH local forwarding](https://www.ssh.com/academy/ssh/tunneling-example) (e.g. `ssh -L 8000:localhost:8000 you@remote.machine`) to access the interface on your local machine.\n\n### Dependencies\n(Mainly listed as shortcuts to documentation)\n\n- [FastAPI](https://fastapi.tiangolo.com) as the base for the backend part.\n- [Pydantic](https://pydantic-docs.helpmanual.io/) for conversion of untyped JSON to typed objects. 
It is also supported natively by FastAPI, which gives you useful error messages if you mess things up.\n- [Vue](https://vuejs.org/guide/introduction.html) for the frontend.\n\n### Screenshots\n\nList and categorize the datasets you are going to use for training.\n[\u003cimg src=\"https://github.com/hplt-project/OpusCleaner/raw/main/.github/screenshots/list-datasets.png\" width=\"100%\"\u003e](https://github.com/hplt-project/OpusCleaner/blob/main/.github/screenshots/list-datasets.png)\n\nDownload more datasets right from the interface.\n[\u003cimg src=\"https://github.com/hplt-project/OpusCleaner/raw/main/.github/screenshots/add-datasets.png\" width=\"100%\"\u003e](https://github.com/hplt-project/OpusCleaner/blob/main/.github/screenshots/add-datasets.png)\n\nFilter each individual dataset and see the results immediately.\n[\u003cimg src=\"https://github.com/hplt-project/OpusCleaner/raw/main/.github/screenshots/filter-datasets.png\" width=\"100%\"\u003e](https://github.com/hplt-project/OpusCleaner/blob/main/.github/screenshots/filter-datasets.png)\n\nCompare the dataset at different stages of filtering to see the impact of each filter.\n[\u003cimg src=\"https://github.com/hplt-project/OpusCleaner/raw/main/.github/screenshots/diff-filter-output.png\" width=\"100%\"\u003e](https://github.com/hplt-project/OpusCleaner/blob/main/.github/screenshots/diff-filter-output.png)\n\n### Using your own data\nOpusCleaner scans for datasets and finds them automatically if they're in the right format. When you download OPUS data, it is converted to this format, and nothing stops you from adding your own data in the same format.\n\nBy default, it scans for files matching `data/train-parts/*.*.gz` and derives which files make up a dataset from the filenames: `name.en.gz` and `name.de.gz` together form a dataset called _name_. 
The files are in the standard Moses format: a single sentence per line, where the Nth line of the first file corresponds to the Nth line of the second file.\n\nWhen in doubt, just download one of the OPUS datasets through OpusCleaner, and replicate the format for your own dataset.\n\nIf you want to use another path, you can use the `DATA_PATH` environment variable to change it, e.g. run `DATA_PATH=\"./my-datasets/*.*.gz\" opuscleaner-server`.\n\n### Paths\n- `data/train-parts` is scanned for datasets. You can change this by setting the `DATA_PATH` environment variable; the default is `data/train-parts/*.*.gz`.\n- `filters` should contain filter JSON files. You can change this by setting the `FILTER_PATH` environment variable; the default is `\u003cPYTHON_PACKAGE\u003e/filters/*.json`.\n\n\n### Installation for development\nTo build from source (i.e. from git, not a package downloaded from PyPI), you'll need to have node and npm installed.\n\n```sh\npython3 -m venv .env\nbash --init-file .env/bin/activate\npip install -e .\n```\n\nFinally you can run `opuscleaner-server` as normal. The `--reload` option will cause it to restart when any of the Python files change.\n\n```sh\nopuscleaner-server serve --reload\n```\n\nThen go to http://127.0.0.1:8000/ for the \"interface\" or http://127.0.0.1:8000/docs for the API.\n\n### Frontend development\n\nIf you're doing frontend development, try also running:\n\n```sh\ncd frontend\nnpm run dev\n```\n\nThen go to http://127.0.0.1:5173/ for the \"interface\".\n\nThis will put Vite in hot-reloading mode for easier JavaScript development. 
All API requests will be proxied to the Python server running on port 8000, which is why you need to run both at the same time.\n\n## Filters\n\nIf you want to use LASER, you will also need to download its assets:\n\n```sh\npython -m laserembeddings download-models\n```\n\n## Packaging\n\nRun `npm run build` in the `frontend/` directory first, and then run `hatch build .` in the project directory to build the wheel and source distribution.\n\nTo push a new release to PyPI from GitHub, tag a commit with a `vX.Y.Z` version number (including the `v` prefix). Then publish a release on GitHub. This should trigger a workflow that pushes an sdist and a wheel to PyPI.\n\n# Acknowledgements\n\nThis project has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350 and from UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee [grant number 10052546].\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhplt-project%2Fopuscleaner","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhplt-project%2Fopuscleaner","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhplt-project%2Fopuscleaner/lists"}