{"id":37062978,"url":"https://github.com/bodywork-ml/bodywork-pipeline-utils","last_synced_at":"2026-01-14T07:01:06.709Z","repository":{"id":41900174,"uuid":"369642150","full_name":"bodywork-ml/bodywork-pipeline-utils","owner":"bodywork-ml","description":"A package of utilities for engineering ML pipelines.","archived":false,"fork":false,"pushed_at":"2022-04-23T19:41:36.000Z","size":56,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2026-01-03T23:25:26.715Z","etag":null,"topics":["aws","machine-learning","ml-pipeline","mlops","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bodywork-ml.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-05-21T20:02:50.000Z","updated_at":"2022-08-15T16:44:09.000Z","dependencies_parsed_at":"2022-08-11T20:40:13.951Z","dependency_job_id":null,"html_url":"https://github.com/bodywork-ml/bodywork-pipeline-utils","commit_stats":null,"previous_names":[],"tags_count":9,"template":false,"template_full_name":null,"purl":"pkg:github/bodywork-ml/bodywork-pipeline-utils","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bodywork-ml%2Fbodywork-pipeline-utils","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bodywork-ml%2Fbodywork-pipeline-utils/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bodywork-ml%2Fbodywork-pipeline-utils/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bodywork-ml%2Fbodywork-pipeline-utils/manifests","owner_url":"ht
tps://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bodywork-ml","download_url":"https://codeload.github.com/bodywork-ml/bodywork-pipeline-utils/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bodywork-ml%2Fbodywork-pipeline-utils/sbom","scorecard":{"id":246790,"data":{"date":"2025-08-11","repo":{"name":"github.com/bodywork-ml/bodywork-pipeline-utils","commit":"6e60567a64a3e59d58a161236e7c403c3534988c"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":2.6,"checks":[{"name":"Dangerous-Workflow","score":-1,"reason":"no workflows found","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Code-Review","score":0,"reason":"Found 0/7 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Token-Permissions","score":-1,"reason":"No tokens found","details":null,"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Pinned-Dependencies","score":-1,"reason":"no dependencies found","details":null,"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab 
publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses 
fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE:0","Info: FSF or OSI recognized license: Apache License 2.0: LICENSE:0"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Branch-Protection","score":-1,"reason":"internal error: error during branchesHandler.setup: internal error: githubv4.Query: Resource not accessible by integration","details":null,"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"Vulnerabilities","score":4,"reason":"6 existing vulnerabilities detected","details":["Warn: Project is vulnerable to: PYSEC-2024-48 / GHSA-fj7x-q9j7-g6q6","Warn: Project is vulnerable to: PYSEC-2024-110 / GHSA-jw8x-6495-233v","Warn: Project is vulnerable to: GHSA-jxfp-4rvq-9h9m","Warn: Project is vulnerable to: PYSEC-2022-43017 / GHSA-qwmp-2cf2-g9g6","Warn: Project is vulnerable to: PYSEC-2023-238 / GHSA-5wvp-7f3h-6wmm","Warn: Project is vulnerable to: PYSEC-2024-161"],"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}},{"name":"SAST","score":0,"reason":"SAST tool is not run on 
all commits -- score normalized to 0","details":["Warn: 0 commits out of 30 are checked with a SAST tool"],"documentation":{"short":"Determines if the project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}}]},"last_synced_at":"2025-08-17T07:44:53.362Z","repository_id":41900174,"created_at":"2025-08-17T07:44:53.362Z","updated_at":"2025-08-17T07:44:53.362Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28412480,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-14T05:26:33.345Z","status":"ssl_error","status_checked_at":"2026-01-14T05:21:57.251Z","response_time":107,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aws","machine-learning","ml-pipeline","mlops","python"],"created_at":"2026-01-14T07:01:05.686Z","updated_at":"2026-01-14T07:01:06.698Z","avatar_url":"https://github.com/bodywork-ml.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Bodywork Pipeline Utilities\n\nUtilities for helping with pipeline development and integration with 3rd party MLOps services.\n\n```text\n|-- aws\n    |-- Dataset\n    |-- get_latest_csv_dataset_from_s3\n    |-- get_latest_parquet_dataset_from_s3\n    |-- put_csv_dataset_to_s3\n    |-- put_parquet_dataset_to_s3\n    |-- Model\n    |-- 
get_latest_pkl_model_from_s3\n|-- logging\n    |-- configure_logger\n```\n\n## AWS\n\nA simple dataset and model management framework built on S3 object storage.\n\n### Datasets\n\nTraining data files in CSV or Parquet format are saved to an S3 bucket using filenames with an ISO timestamp component:\n\n```text\nmy-s3-project-bucket/\n|\n|-- datasets/\n|    |-- ... \n|    |-- dataset_file_2021-07-10T07:42:23.csv\n|    |-- dataset_file_2021-07-11T07:45:12.csv\n|    |-- dataset_file_2021-07-12T07:41:02.csv\n```\n\nYou can use `put_csv_dataset_to_s3` to persist a Pandas DataFrame directly to S3 with a compatible filename, or handle this yourself independently. The latest training data file can be retrieved using `get_latest_csv_dataset_from_s3`, which returns a `Dataset` object with the following fields:\n\n```python\nclass Dataset(NamedTuple):\n    \"\"\"Container for downloaded datasets and associated metadata.\"\"\"\n\n    data: DataFrame\n    datetime: datetime\n    bucket: str\n    key: str\n    hash: str\n```\n\nAWS S3 assigns an Entity Tag (ETag) to every object uploaded to it (for single-part uploads that are not encrypted with SSE-KMS or SSE-C, this is the MD5 hash of the object's contents). This is retrieved from S3 together with other basic metadata about the object. For example,\n\n```python\nget_latest_csv_dataset_from_s3(\"my-s3-project-bucket\", \"datasets\")\n# Dataset(\n#     data=...,\n#     datetime=datetime(2021, 7, 12, 7, 41, 2),\n#     bucket=\"my-s3-project-bucket\",\n#     key=\"datasets/dataset_file_2021-07-12T07:41:02.csv\",\n#     hash=\"759eccda4ceb7a07cda66ad4ef7cdfbc\"\n# )\n```\n\nThis, together with S3 object versioning (if enabled), can be used to track the precise dataset used to train a model.\n\n### Models\n\nThe `Model` class is a simple wrapper for an ML model that adds basic model metadata and the ability to serialise the model directly to S3. 
It requires a `Dataset` object containing the data used to train the model, so that the model artefact can be explicitly linked to the precise version of the data used to train it. For example,\n\n```python\nfrom sklearn.tree import DecisionTreeRegressor\n\n\ndataset = get_latest_csv_dataset_from_s3(\"my-s3-project-bucket\", \"datasets\")\nmodel = Model(\"my-model\", DecisionTreeRegressor(), dataset, {\"features\": [\"x1\", \"x2\"], \"foo\": \"bar\"})\n\nmodel\n# name: my-model\n# model_type: \u003cclass 'sklearn.tree._classes.DecisionTreeRegressor'\u003e\n# model_timestamp: 2021-07-12 07:46:08\n# model_hash: ab6f998e0f5d8829fcb0017819c45020\n# train_dataset_key: datasets/dataset_file_2021-07-12T07:41:02.csv\n# train_dataset_hash: 759eccda4ceb7a07cda66ad4ef7cdfbc\n# pipeline_git_commit_hash: e585fd3\n```\n\nModel objects can be serialised directly to S3:\n\n```python\nmodel.put_model_to_s3(\"my-s3-project-bucket\", \"models\")\n```\n\nThis creates objects in the S3 bucket as follows:\n\n```text\nmy-s3-project-bucket/\n|\n|-- models/\n|    |-- ... \n|    |-- serialised_model_2021-07-10T07:47:33.pkl\n|    |-- serialised_model_2021-07-11T07:49:14.pkl\n|    |-- serialised_model_2021-07-12T07:46:08.pkl\n```\n\nThe `Model` class is intended as a base class, suitable for pickle-able models (e.g. from Scikit-Learn). More complex model types (e.g. PyTorch or PyMC3 models) should inherit from `Model` and override the appropriate methods.\n\n## Logging\n\nThe `configure_logger` function returns a Python logger configured to print logs using the Bodywork log format. 
For example,\n\n```python\nlog = configure_logger()\nlog.info(\"foo\")\n# 2021-07-14 07:57:10,854 - INFO - pipeline.train - foo\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbodywork-ml%2Fbodywork-pipeline-utils","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbodywork-ml%2Fbodywork-pipeline-utils","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbodywork-ml%2Fbodywork-pipeline-utils/lists"}