{"id":31959499,"url":"https://github.com/huggingface/chug","last_synced_at":"2025-10-14T15:32:53.520Z","repository":{"id":230291218,"uuid":"650363457","full_name":"huggingface/chug","owner":"huggingface","description":"Minimal sharded dataset loaders, decoders, and utils for multi-modal document, image, and text datasets.","archived":false,"fork":false,"pushed_at":"2024-04-03T19:54:09.000Z","size":149,"stargazers_count":159,"open_issues_count":1,"forks_count":11,"subscribers_count":10,"default_branch":"main","last_synced_at":"2025-09-30T18:02:26.749Z","etag":null,"topics":["computer-vision","dataloading","datasets","distributed-training","document-understanding","multi-modal-learning","pdf-document","webdataset"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/huggingface.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2023-06-06T23:11:49.000Z","updated_at":"2025-09-20T21:11:40.000Z","dependencies_parsed_at":"2024-04-02T22:30:47.699Z","dependency_job_id":"6b4a3ac3-c839-44a7-9c73-080ee40bc75a","html_url":"https://github.com/huggingface/chug","commit_stats":null,"previous_names":["huggingface/chug"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/huggingface/chug","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Fchug","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Fchug/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Fchug/releases","manifests_url":"h
ttps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Fchug/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/huggingface","download_url":"https://codeload.github.com/huggingface/chug/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Fchug/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279019322,"owners_count":26086711,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-14T02:00:06.444Z","response_time":60,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["computer-vision","dataloading","datasets","distributed-training","document-understanding","multi-modal-learning","pdf-document","webdataset"],"created_at":"2025-10-14T15:32:00.128Z","updated_at":"2025-10-14T15:32:53.515Z","avatar_url":"https://github.com/huggingface.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Chugging Data \n\nA library to help w/ efficient training for multi-modal data. Initially focused on image \u0026 document + text tasks.\n\n`chug` currently leverages `webdataset` and Hugging Face `datasets`.\n\n`webdataset` tar files and dataset pipelines are preferred for scalable pretraining. 
\n\nHugging Face `datasets` are supported and work great for exploration, validation, and fine-tune use cases.\n\n`chug` provides on the fly PDF decoding and rendering via either pypdfium2 (https://github.com/pypdfium2-team/pypdfium2) as a default, or fitz/pymupdf (https://github.com/pymupdf/PyMuPDF) if your use case is okay with their AGPL-3.0 license. `fitz` support must be manually enabled. The pdf handling is implemented at the webdataset level, so you can plug it in to other webdataset pipelines. This enables large scale sharded streaming of native .pdf files without needing to pre-render to .png/.tiff, etc.\n\n## Status\n\nThis library is still a WIP, consider this an alpha release (pre announcement). Major features should be working, the library has been tested with several PDF datasets we will shortly make public. However, do expect there will still be breaking changes, lots of improvements, etc.\n\n`pip install --pre chug` will install the current dev version.\n\n### TODOs\n\n### Nearish\n* Cleanup and refinement, codebase will change\n* Documentation \u0026 unit-tests\n* Support reading of info .json/.yaml files for automatic shard info resolution for webdatasets (like timm)\n\n### Mediumish\n* Option to output bbox annotations for lines (or word + word output) for tasks that leverage layout\n* Unified preprocessor functions for combined image + text tokenization (img+text token interleaving, etc.)\n* Image token (patch) packing ala NaViT. 
Online bin packing based algorithms integrated with image preprocessing and pipeline.\n\n### Longish \n* Increase range of task pipelines for other tasks, modelling needs\n* Support additional modalities \u0026 targets (video, audio, detection/dense pixel targets, image/video/audio targets)\n* Explore alternatives to .tar shards (array_record, arrow, etc)\n\n## Design\n\n### Submodule Hierarchy\n\nThe library has been designed so that functions, classes at different levels can be used independently.\n\nIf one wants to build a loader \u0026 pipeline with JSON/YAML serializable configs, use the top-level `chug.create_loader()` in `chug/loader.py`. Depending on dataset sources, one can easily switch this between webdataset, HF datasets (in the future, other sources).\n\nBypassing the highest level, one can also call `build_pipeline_*` methods in `task_pipeline` and then call `create_loader_wds` with a full array of args for `wds` only use cases.\n\nIf one doesn't want to use `chug` loaders and pipelines at all, `image`, `text`, and `wds` (especially decoder) functionality may be useful in other projects.\n\n#### Library modules (highest to lowest level)\n\nThe dependencies of modules within the library are intended to follow the hierarchy below. e.g. doc depends on wds, but wds should never depend on doc.\n\n```\napp\n|\nloader (chug/loader.py)\n|\ntask_pipeline\n|\ndoc\n|\nwds, hfds, image, text\n|\ncommon\n```\n\n### Submodules\n\n#### `common`\n\nConfigs, structures (dataclasses) for general use across the library\n\n#### `wds`\n\nWebdataset (`wds` for short) specific code. Extensions and alterations of webdataset functionality to fit covered use case and improve robustness.\n\nAll data pipelines in `chug` currently leverage `wds` pipelines, even when not using `wds` datasets. \n\nDocument oriented decoding (pdf decoder) is present in `chug/wds/decode.py`, it can be used with any webdataset pipeline as a decoder. e.g. 
`wds.decode(chug.wds.DecodeDoc('pill'), 'pill')`\n\n#### `hfds`\n\nHugging Face `datasets` support. A minimal wrapper that allows `datasets` to be used with chug processing pipelines. \n\nThe processing pipelines remain webdataset based when using `datasets`, they are invoked by a custom collate class.\n\n#### `image`\n\nImage processing, `torchvision` and `albumentations` based transform building code. A mix of generic image (imagenet, simclr) transforms and document specific transforms, including an implementation of `albumentations` based `nougat` transforms.\n\n#### `text`\n\nText processing, tokenization code.\n\n#### `doc`\n\nDocument processing code. Currently focused on processors that apply image/pdf decoders and process document OCR or VQA annotations.\n\n#### `task_pipeline`\n\nTask specific pipelines, where dataset formats meet modelling needs. \n\nInputs to task pipelines are sample dictionaries based on the dataset form, they are decoded and then processed into outputs that match model input requirements.\n\nTask specific pipelines that handle the data \u003c--\u003e model input interface are inserted into an encompassing data pipeline which handles shard lists, shuffle, wrapping, distributed worker, splitting, batching, etc.\n\n#### `chug.loader`\n\nThis lone top-level file includes the main factory methods for creating loaders w/ associated pipelines from config dataclasses.\n\n#### `app`\n\nMost applications using `chug` will exist outside of the lib in training libraries, etc. 
Some builtin utility / exploration apps will be included here.\n\n## Concepts\n\nWIP\n\n## Datasets\n\nDatasets that work well with this library can be found on the Hugging Face Hub under the `pixparse` organization (https://huggingface.co/pixparse).\n\nWe'll add links to other noteworthy datasets that can be used as we become aware of them.\n\n\n## Usage / Examples\n\n### Document Reading, Training w/ IDL\n```python\nimport chug\nimg_cfg = chug.ImageInputCfg(size=(1024, 768), transform_type='doc_better')\nimg_fn = chug.create_image_preprocessor(input_cfg=img_cfg, is_training=True)\ntxt_fn = chug.create_text_preprocessor(\n    'naver-clova-ix/donut-base',\n    prompt_end_token='\u003cs_idl\u003e',\n    task_start_token='\u003cs_idl\u003e',  # NOTE needs to be added to tokenizer\n)\n\ntask_cfg = chug.DataTaskDocReadCfg(\n    image_process_fn=img_fn,\n    text_process_fn=txt_fn,\n    page_sampling='random',\n    error_handler='dump_and_reraise',\n)\ndata_cfg = chug.DataCfg(\n    source='pipe:curl -s -f -L https://huggingface.co/datasets/pixparse/idl-wds/resolve/main/idl-train-0{0000..2999}.tar',\n    batch_size=8,\n    num_samples=3144726,\n    format='wds',\n)\nlb = chug.create_loader(\n    data_cfg,\n    task_cfg,\n    is_training=True,\n)\nii = iter(lb)\nsample = next(ii)\n```\n\n### Document Reading, Exploring IDL\n```python\nimport chug\ntask_cfg = chug.DataTaskDocReadCfg(page_sampling='all')\ndata_cfg = chug.DataCfg(\n    source='pixparse/idl-wds',\n    split='train',\n    batch_size=None,\n    format='hfids',\n    num_workers=0,    \n)\nlb = chug.create_loader(\n    data_cfg,\n    task_cfg,\n)\nii = iter(lb)\nsample = next(ii)\n```\n\n### Document Reading, Training with PDFA\n\n```python\nimport chug\nimg_cfg = chug.ImageInputCfg(size=(1024, 768), transform_type='doc_nougat')\nimg_fn = chug.create_image_preprocessor(input_cfg=img_cfg, is_training=True)\ntxt_fn = chug.create_text_preprocessor(\n    'naver-clova-ix/donut-base',\n    
prompt_end_token='\u003cs_pdfa\u003e',\n    task_start_token='\u003cs_pdfa\u003e',  # NOTE needs to be added to tokenizer\n)\n\ntask_cfg = chug.DataTaskDocReadCfg(\n    image_process_fn=img_fn,\n    text_process_fn=txt_fn,\n    page_sampling='random',\n)\ndata_cfg = chug.DataCfg(\n    source='pipe:curl -s -f -L https://huggingface.co/datasets/pixparse/pdfa-english-train/resolve/main/pdfa-eng-train-{000000..005000}.tar',\n    batch_size=8,\n    num_samples=1000000,  # FIXME replace with actual\n    format='wds',   \n)\nlb = chug.create_loader(\n    data_cfg,\n    task_cfg,\n    is_training=True,\n)\nii = iter(lb)\nsample = next(ii)\n```\n\n### Document Reading, Exploring PDFA\n\n```python\nimport chug\n\ntask_cfg = chug.DataTaskDocReadCfg(\n    page_sampling='all',\n)\ndata_cfg = chug.DataCfg(\n    source='pixparse/pdfa-eng-wds',\n    split='train',\n    batch_size=None,\n    format='hfids',\n    num_workers=0,\n)\nlb = chug.create_loader(\n    data_cfg,\n    task_cfg,\n)\nii = iter(lb)\nsample = next(ii)\n```\n\n\n### Image + Text\n\n#### Training\n\n```python\nimport chug\nimport transformers\nfrom functools import partial\nimg_cfg = chug.ImageInputCfg(size=(512, 512), transform_type='image_timm')\nimg_fn = chug.create_image_preprocessor(input_cfg=img_cfg, is_training=True)\ntokenizer = transformers.AutoTokenizer.from_pretrained('laion/CLIP-ViT-H-14-laion2B-s32B-b79K')\ntxt_fn = partial(chug.tokenize, max_length=1000, tokenizer=tokenizer)\ntask_cfg = chug.DataTaskImageTextCfg(\n    image_process_fn=img_fn,\n    text_process_fn=txt_fn,\n)\ndata_cfg = chug.DataCfg(\n    source='pipe:curl -s -f -L https://huggingface.co/datasets/pixparse/cc12m-wds/resolve/main/cc12m-train-{0000..2175}.tar',\n    batch_size=8,\n    num_samples=10968539,\n    format='wds',   \n)\nlb = chug.create_loader(\n    data_cfg,\n    task_cfg,\n    is_training=True,\n)\nii = iter(lb)\nsample = next(ii)\n```\n\n### Document VQA\n\n#### Training, Fine-tuning\n```python\nimport chug\nfrom 
chug.task_pipeline import create_task_pipeline\nimg_cfg = chug.ImageInputCfg(size=(1024, 768), transform_type='doc_basic')\nimg_fn = chug.create_image_preprocessor(img_cfg, is_training=True)\ntxt_fn = chug.create_text_preprocessor(\n    'naver-clova-ix/donut-base-finetuned-docvqa',\n    prompt_end_token='\u003cs_answer\u003e',\n    task_start_token='\u003cs_docvqa\u003e',\n)\n\ntask_cfg = chug.DataTaskDocVqaCfg(\n    image_process_fn=img_fn,\n    text_process_fn=txt_fn,\n)\ndata_cfg = chug.DataCfg(\n    source='pipe:curl -s -f -L https://huggingface.co/datasets/pixparse/docvqa-wds/resolve/main/docvqa-train-{000..383}.tar',\n    batch_size=8,\n    format='wds',\n    num_samples=39463,\n)\nlb = chug.create_loader(\n    data_cfg,\n    task_cfg,\n    is_training=True,\n)\nii = iter(lb)\nsample = next(ii)\n```\n\n#### Exploration\n\n```python\nimport chug\nfrom chug.task_pipeline import create_task_pipeline\ntask_cfg = chug.DataTaskDocVqaCfg(\n    question_prefix='Question: ',\n    question_suffix='',\n    answer_prefix='Answer: ',\n    answer_suffix=''\n)\ndata_cfg = chug.DataCfg(\n    source='pixparse/docvqa-single-page-questions',\n    split='validation',\n    batch_size=None,\n    format='hfids',\n    num_workers=0,\n)\nlb = chug.create_loader(\n    data_cfg,\n    task_cfg\n)\nii = iter(lb)\nsample = next(ii)\n```\n\n## Acknowledgement\n\n`chug` evolved from the `webdataset` data pipeline used successfully in the [OpenCLIP](https://github.com/mlfoundations/open_clip) project. Thanks to all the contributors in that project. Future work will likely involve closing the loop and leveraging `chug` in OpenCLIP for increased capability.\n\nThe image/document augmentations in `chug` rely on a number of external influences. 
Our document oriented `doc_better` torchvision augmentations are influenced by `nougat`, and the `doc_nougat` is a direct adaptation of the [`albumentations`](https://albumentations.ai/) + `cv2` document pipeline in [`nougat`](https://github.com/facebookresearch/nougat). Several image augmentations leverage existing work in the `timm` library.\n\nAlso, big thanks to the maintainers of [`webdataset`](https://github.com/webdataset/webdataset) and Hugging Face [`datasets`](https://github.com/huggingface/datasets).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhuggingface%2Fchug","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhuggingface%2Fchug","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhuggingface%2Fchug/lists"}