{"id":13717368,"url":"https://github.com/msamogh/nonechucks","last_synced_at":"2026-01-14T09:12:12.222Z","repository":{"id":38418148,"uuid":"151694183","full_name":"msamogh/nonechucks","owner":"msamogh","description":"Deal with bad samples in your dataset dynamically, use Transforms as Filters, and more!","archived":false,"fork":false,"pushed_at":"2022-09-22T23:03:31.000Z","size":26,"stargazers_count":378,"open_issues_count":20,"forks_count":27,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-12-01T14:56:25.715Z","etag":null,"topics":["data-cleaning","data-pipeline","data-preprocessing","data-processing","machine-learning","preprocessing","pytorch","torch"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/msamogh.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-10-05T08:48:46.000Z","updated_at":"2025-10-28T17:21:18.000Z","dependencies_parsed_at":"2022-09-02T22:20:16.233Z","dependency_job_id":null,"html_url":"https://github.com/msamogh/nonechucks","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/msamogh/nonechucks","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msamogh%2Fnonechucks","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msamogh%2Fnonechucks/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msamogh%2Fnonechucks/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msamogh%2Fnonechucks/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHu
b/owners/msamogh","download_url":"https://codeload.github.com/msamogh/nonechucks/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msamogh%2Fnonechucks/sbom","scorecard":{"id":665237,"data":{"date":"2025-08-11","repo":{"name":"github.com/msamogh/nonechucks","commit":"6e692d07c9e0c957c65726870b4e6fd00222a390"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":3,"checks":[{"name":"Token-Permissions","score":-1,"reason":"No tokens found","details":null,"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Code-Review","score":0,"reason":"Found 0/23 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Dangerous-Workflow","score":-1,"reason":"no workflows found","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the 
source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"Pinned-Dependencies","score":-1,"reason":"no dependencies found","details":null,"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses 
fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE:0","Info: FSF or OSI recognized license: MIT License: LICENSE:0"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Vulnerabilities","score":10,"reason":"0 existing vulnerabilities detected","details":null,"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'master'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"SAST","score":0,"reason":"SAST tool is not run on all commits -- score normalized to 0","details":["Warn: 0 commits out of 9 are checked with a SAST tool"],"documentation":{"short":"Determines if the project uses static code 
analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}}]},"last_synced_at":"2025-08-21T17:47:20.087Z","repository_id":38418148,"created_at":"2025-08-21T17:47:20.087Z","updated_at":"2025-08-21T17:47:20.087Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28414905,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-14T08:38:59.149Z","status":"ssl_error","status_checked_at":"2026-01-14T08:38:43.588Z","response_time":107,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-cleaning","data-pipeline","data-preprocessing","data-processing","machine-learning","preprocessing","pytorch","torch"],"created_at":"2024-08-03T00:01:21.303Z","updated_at":"2026-01-14T09:12:12.206Z","avatar_url":"https://github.com/msamogh.png","language":"Python","funding_links":[],"categories":["Pytorch \u0026 related libraries｜Pytorch \u0026 相关库","Pytorch \u0026 related libraries","Python"],"sub_categories":["Other libraries｜其他库:","Other libraries:"],"readme":"# nonechucks\n\n**nonechucks** is a library that provides wrappers for PyTorch's datasets, samplers, and transforms to allow for dropping unwanted or invalid samples dynamically.\n\n- [Introduction](#Introduction)\n- [Examples](#Examples)\n- [Installation](#Installation)\n- [Contributing](#Contributing)\n- 
[Licensing](#Licensing)\n\n---\n\n\n\u003ca name=\"Introduction\"/\u003e\n\n## Introduction\nWhat if you have a dataset of 1000s of images, out of which a few dozen images are unreadable because the image files are corrupted? Or what if your dataset is a folder full of scanned PDFs that you have to OCRize, and then run a language detector on the resulting text, because you want only the ones that are in English? Or maybe you have an `AlternateIndexSampler`, and you want to be able to move to `dataset[6]` after `dataset[4]` fails while attempting to load!\n\nPyTorch's data processing module expects you to rid your dataset of any unwanted or invalid samples before you feed them into its pipeline, and provides no easy way to define a \"fallback policy\" in case such samples are encountered during dataset iteration.    \n\n#### Why do I need it?\nYou might be wondering why this is such a big deal when you could simply `filter` out samples before sending them to your PyTorch dataset or sampler! Well, it turns out that it can be a huge deal in many cases:\n1. When you have a small fraction of undesirable samples in a large dataset, or\n2. When your sample-loading operation is expensive, or\n3. When you want to let downstream consumers know that a sample is undesirable (with nonechucks, transforms are not restricted to modifying samples; they can drop them as well), or\n4. When you want your dataset and sampler to be decoupled.\n\nIn such cases, it's either simply too expensive to have a separate step to weed out bad samples, or it's just plain impossible because you don't even know what constitutes \"bad\", or worse - both!\n\n**nonechucks** allows you to wrap your existing datasets and samplers with \"safe\" versions of them, which can fix all these problems for you.\n\n\n\n\u003ca name=\"Examples\"/\u003e\n\n## Examples\n\n### 1. 
Dealing with bad samples\nLet's start with the simplest use case, which involves wrapping an existing `Dataset` instance with `SafeDataset`.\n\n#### Create a dataset (the usual way)\nUsing something like torchvision's \u003ca href='https://pytorch.org/docs/stable/torchvision/datasets.html?highlight=folder#torchvision.datasets.ImageFolder'\u003eImageFolder\u003c/a\u003e dataset class, we can load an entire folder of labelled images for a typical supervised classification task.\n\n```python\nimport torchvision.datasets as datasets\nfruits_dataset = datasets.ImageFolder('fruits/')\n```\n#### Without nonechucks\nNow, if you have a sneaky `fruits/apple/143.jpg` (that is corrupted) sitting in your `fruits/` folder, to keep the entire pipeline from surprise-failing, you would have to resort to something like this:\n```python\nimport random\n\n# Shuffle dataset\nindices = list(range(len(fruits_dataset)))\nrandom.shuffle(indices)\n\nbatch_size = 4\nfor i in range(0, len(indices), batch_size):\n    try:\n        batch = [fruits_dataset[idx] for idx in indices[i:i + batch_size]]\n        # Do something with it\n        pass\n    except IOError:\n        # Skip the entire batch\n        continue\n```\nNot only do you have to put your code inside an extra `try-except` block, but you are also forced to use a for-loop, depriving yourself of PyTorch's built-in `DataLoader`, which means you can't use features like batching, shuffling, multiprocessing, and custom samplers for your dataset.\n\nI don't know about you, but not being able to do that kind of defeats the whole point of using a data processing module for me.\n\n\n#### With nonechucks\nYou can transform your dataset into a `SafeDataset` with a single line of code.\n```python\nimport nonechucks as nc\nfruits_dataset = nc.SafeDataset(fruits_dataset)\n```\nThat's it! Seriously.\n\nAnd that's not all. 
You can also use a `DataLoader` on top of this.\n```python\ndataloader = nc.SafeDataLoader(fruits_dataset, batch_size=4, shuffle=True)\nfor i_batch, sample_batched in enumerate(dataloader):\n    # Do something with it\n    pass\n```\nIn this case, `SafeDataset` will skip the erroneous image, and use the next one in its place (as opposed to dropping the entire batch).\n\n### 2. Use Transforms as Filters!\nThe function of transforms in PyTorch is restricted to *modifying* samples. With nonechucks, you can simply return `None` (or raise an exception) from the transform's `__call__` method, and nonechucks will drop the sample from the dataset for you, allowing you to use transforms as filters!\n\nFor this example, we'll assume a `PDFDocumentsDataset`, which reads PDF files from a folder, a `PlainTextTransform`, which transforms the files into raw text, and a `LanguageFilter`, which retains only documents of a particular language.\n```python\nclass LanguageFilter:\n    def __init__(self, language):\n        self.language = language\n        \n    def __call__(self, sample):\n        # Do machine learning magic\n        document_language = detect_language(sample)\n        if document_language != self.language:\n            return None\n        return sample\n\npipeline = transforms.Compose([\n                PlainTextTransform(),\n                LanguageFilter('en')\n            ])\nen_documents = PDFDocumentsDataset(data_dir='pdf_files/', transform=pipeline)\nen_documents = nc.SafeDataset(en_documents)\n```\n\n\n\n\n\u003ca name=\"Installation\" /\u003e\n\n## Installation\nTo install nonechucks, simply use pip:\n\n`$ pip install nonechucks`\n\nor clone this repo, and build from source with:\n\n`$ python setup.py install`.\n\n\u003ca name=\"Contributing\"/\u003e\n\n## Contributing\nAll PRs are welcome.\n\n\u003ca name=\"Licensing\"/\u003e\n\n## Licensing\n\n**nonechucks** is [MIT 
licensed](https://github.com/msamogh/nonechucks/blob/master/LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmsamogh%2Fnonechucks","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmsamogh%2Fnonechucks","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmsamogh%2Fnonechucks/lists"}