{"id":18821393,"url":"https://github.com/marian-nmt/sotastream","last_synced_at":"2025-07-29T10:09:56.719Z","repository":{"id":184305124,"uuid":"655923167","full_name":"marian-nmt/sotastream","owner":"marian-nmt","description":"A library for data streaming and augmentation","archived":false,"fork":false,"pushed_at":"2025-05-05T19:37:23.000Z","size":553,"stargazers_count":20,"open_issues_count":1,"forks_count":3,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-07-23T00:51:08.290Z","etag":null,"topics":["data-augmentation","data-streaming","machine-learning","pretraining"],"latest_commit_sha":null,"homepage":"https://sotastream.readthedocs.io","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/marian-nmt.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-06-19T22:53:14.000Z","updated_at":"2025-05-05T19:37:27.000Z","dependencies_parsed_at":"2025-06-19T04:41:12.874Z","dependency_job_id":"e0a966f4-a6b2-4c4a-bcb3-25074cc745e0","html_url":"https://github.com/marian-nmt/sotastream","commit_stats":null,"previous_names":["marian-nmt/sotastream"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/marian-nmt/sotastream","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/marian-nmt%2Fsotastream","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/marian-nmt%2Fsotastream/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/marian-nmt%2Fsotastream/releases","manifests_url":"https://repos
.ecosyste.ms/api/v1/hosts/GitHub/repositories/marian-nmt%2Fsotastream/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/marian-nmt","download_url":"https://codeload.github.com/marian-nmt/sotastream/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/marian-nmt%2Fsotastream/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":267668757,"owners_count":24124967,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-29T02:00:12.549Z","response_time":2574,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-augmentation","data-streaming","machine-learning","pretraining"],"created_at":"2024-11-08T00:40:25.333Z","updated_at":"2025-07-29T10:09:56.684Z","avatar_url":"https://github.com/marian-nmt.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Sotastream\n[![image](http://img.shields.io/pypi/v/sotastream.svg)](https://pypi.python.org/pypi/sotastream/)\n[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](./LICENSE)\n[![Read the Docs](https://img.shields.io/readthedocs/sotastream.svg)](https://sotastream.readthedocs.io/)\n\n\nSotastream is a data augmentation tool for training\npipelines. 
It uses `infinibatch` internally to generate an infinite\nstream of shuffled training data and provides a means for on-the-fly\ndata manipulation, augmentation, mixing, and sampling.\n\n\n## Setup\n\nTo install from PyPI (https://pypi.org/project/sotastream/):\n```bash\npip install sotastream\n```\n\n*Developer Setup:*\n\n```bash\n# To begin, clone the repository:\ngit clone https://github.com/marian-nmt/sotastream\ncd sotastream\n# option 1:\npython -m pip install .\n# option 2: install in --editable mode\npython -m pip install -e .\n```\n\n*Entry points*\n* As a module:  `python -m sotastream`\n* As a bin in your $PATH: `sotastream`\n\n## Development\n\nInstall the development tools:\n```bash\npython -m pip install -e .[dev,test]   # editable mode\n```\nEditable mode (`-e / --editable`) is recommended for development: `pip` creates a symbolic link to your source code, so any edits you make are reflected directly in the installed package. `[dev,test]` installs the dependencies for development and testing, which include `black`, `pytest`, etc.\n\nWe use `black` to reformat code to a common code style.\n```bash\nmake reformat\n```\n\nBefore creating any pull requests, run\n```bash\nmake check          # runs reformatter and tests\n```\n\n## Running tests\n\n```bash\nmake test           # run unit tests\nmake regression     # run regression tests\n```\n\n See `Makefile` for more details.\n\n\n## Usage examples\n\nA folder like `split/parallel` contains training data in TSV format (`src\u003ctab\u003etgt`) split into \n`*.gz` files of around 100,000 lines for better shuffling. 
The command below outputs an infinite\nstream of data generated from the gzipped files in these folders, according to the \"wmt\" recipe \nfound in `sotastream/pipelines/example_pipeline.py`.\n\n```\npython -m sotastream example split/parallel split/backtrans\n```\nYou can also provide compressed TSV files directly, in which case sotastream will split them\ninto checksummed folders under `/tmp/sotastream/{checksum}`:\n\n```\npython -m sotastream example parallel.tsv.gz backtrans.tsv.gz\n```\n\nThere are currently two main pipelines: \"default\" and \"wmt\". These vary according to\nthe data sources they take as well as the other options available to them.\n\nThere are global options that control behavioral aspects such as splitting and parallelization,\nand also pipeline-specific arguments. You can see these by running\n\n```\n# see global options\npython -m sotastream -h\n\n# see default pipeline options\npython -m sotastream default -h\n\n# see wmt pipeline options\npython -m sotastream wmt -h\n```\n\n## Don't cross the streams!\n\nSotastream workflows build a directed acyclic graph (DAG)\nconsisting of cascades of generators that pass through mutable lines\nfrom the graph inputs to the pipeline output. Since each step\ntransforms and manipulates each input line, the only\nrequirement is that modifications along separate branches must not be\nmerged into a single node in the graph, or at least that great care \nshould be taken when doing so. An example is the Mixer, which \ndoes not actually merge modifications from alternate branches, but instead\nselects across multiple incoming branches using a provided probability\ndistribution.\n\n# Custom/private pipelines from your own (private) directory\n\nYou can create a custom pipeline by adding a file in the current (invocation)\ndirectory with a file name matching the pattern \"*_pipeline.py\". 
This file should\nfollow the interface defined in `sotastream/pipelines`, namely:\n\n* Apply the `@pipeline(\"name\")` decorator to give your pipeline a name. This name must not conflict with existing names.\n* Inherit from the `Pipeline` base class from `sotastream.pipeline`. For document pipelines, use `DocumentPipeline` as the base class.\n\nYou can find some examples in `test/dummy_pipeline.py`, as well as the real examples in `sotastream/pipelines`.\n\n# Authors\n\nSotastream is developed by _TextMT Team_ @ Microsoft Translator.\n\nIf you use this tool, please cite: \nPaper link: https://arxiv.org/abs/2308.07489  | https://aclanthology.org/2023.nlposs-1.13/\n\n\n```bibtex\n@inproceedings{post-etal-2023-sotastream,\n    title = \"{SOTASTREAM}: A Streaming Approach to Machine Translation Training\",\n    author = \"Post, Matt  and\n      Gowda, Thamme  and\n      Grundkiewicz, Roman  and\n      Khayrallah, Huda  and\n      Jain, Rohit  and\n      Junczys-Dowmunt, Marcin\",\n    editor = \"Tan, Liling  and\n      Milajevs, Dmitrijs  and\n      Chauhan, Geeticka  and\n      Gwinnup, Jeremy  and\n      Rippeth, Elijah\",\n    booktitle = \"Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)\",\n    month = dec,\n    year = \"2023\",\n    address = \"Singapore, Singapore\",\n    publisher = \"Empirical Methods in Natural Language Processing\",\n    url = \"https://aclanthology.org/2023.nlposs-1.13\",\n    pages = \"110--119\",\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmarian-nmt%2Fsotastream","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmarian-nmt%2Fsotastream","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmarian-nmt%2Fsotastream/lists"}