{"id":24679003,"url":"https://github.com/evinism/tinybaker","last_synced_at":"2025-07-04T15:37:41.579Z","repository":{"id":52440051,"uuid":"315545119","full_name":"evinism/TinyBaker","owner":"evinism","description":"Composable, first-order file-to-file transformations in Python","archived":false,"fork":false,"pushed_at":"2021-04-29T05:49:48.000Z","size":608,"stargazers_count":32,"open_issues_count":4,"forks_count":2,"subscribers_count":4,"default_branch":"main","last_synced_at":"2023-08-26T08:36:44.359Z","etag":null,"topics":["cli","multiple-filesystems","propagation","python","transformation","transformations"],"latest_commit_sha":null,"homepage":"https://tinybaker.readthedocs.io/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/evinism.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"contributing.md","funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-11-24T06:56:36.000Z","updated_at":"2023-08-26T08:36:44.360Z","dependencies_parsed_at":"2022-08-20T07:20:47.675Z","dependency_job_id":null,"html_url":"https://github.com/evinism/TinyBaker","commit_stats":null,"previous_names":[],"tags_count":6,"template":null,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/evinism%2FTinyBaker","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/evinism%2FTinyBaker/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/evinism%2FTinyBaker/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/evinism%2FTinyBaker/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/evinism","download_url":"https://codeload.github.com/evinism/TinyBaker/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":235706333,"owners_count":19032613,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cli","multiple-filesystems","propagation","python","transformation","transformations"],"created_at":"2025-01-26T13:19:42.831Z","updated_at":"2025-01-26T13:19:43.429Z","avatar_url":"https://github.com/evinism.png","language":"Python","readme":"# TinyBaker: Composable, first-order, file-to-file transformations in Python\n![Python Package](https://github.com/evinism/tinybaker/workflows/Python%20package/badge.svg)\n\n*TinyBaker is in beta release.*\n\nTinyBaker allows programmers to define first-order file-to-file transformations in a concise format and compose them together with clarity. \n\nInstallation via `pip install tinybaker`\n\n![TinyBaker Logo](_static/logo.png)\n\n## The Problem\n\nMany programs can be considered transformations between source and destination files. 
Training machine learning models, running predictions on dataframes, processing logs, concatenation, compilation, and many other tasks fundamentally amount to transformations from one set of files to another.

Since transforms aren't normally treated as a first-order concept, they become unwieldy to work with. Production workloads are configured separately from local transformations. Getting a local script working in production often requires lots of rework, mocking, testing, and iteration by a product team.

## The Solution

TinyBaker turns file-to-file transforms into a first-order concept.

TinyBaker transforms can be configured, run, composed, hosted, and tested, all independently of their specific implementations.

## The Model

The main component of TinyBaker is the base class `Transform`: a standalone mapping from one set of files to another.

```
                 ___________
---[ file1 ]--->|           |
                |           |->--[ file4 ]---
---[ file2 ]--->| Transform |
                |           |->--[ file5 ]---
---[ file3 ]--->|___________|
```

For example, let's say we were running predictions over a certain ML model. Such a transform might conceptually look like this:
```
                  ___________
---[ config ]--->|           |
                 |           |->--[ predictions ]---
---[ model ]---->|  Predict  |
                 |           |->--[ performance ]---
---[ data ]----->|___________|
```

TinyBaker calls the label associated with each input/output file a `tag`.
```
                  ___________
---[ config ]--->|           |
      ^ Tag      |           |->--[ predictions ]---
---[ model ]---->|  Predict  |       ^ Tag
      ^ Tag      |           |->--[ performance ]---
---[ data ]----->|___________|       ^ Tag
      ^ Tag
```

We might want to configure where we store input/output files, or have files come from separate filesystems entirely.
TinyBaker allows you to define the transform in terms of the tags alone, even when the files are spread across multiple filesystems.

```
                                        ___________
/path/to/config.json >----[ config ]-->|           |
                                       |           |->--[ predictions ]---> hdfs://outputs/predictions.csv
s3://path/to/model.pkl >--[ model ]--->|  Predict  |
                                       |           |->--[ performance ]---> ./performance.pkl
/path/to/data.csv >-------[ data ]---->|___________|
```

We can imagine a situation where we have file transformations that could theoretically compose:
```
                   ________________
                  |                |
---[ raw_logs ]-->| BuildDataFrame |->--[ df ]---
                  |________________|

             ____________
            |            |
---[ df ]-->| BuildModel |->--[ model ]---
            |____________|
```

TinyBaker allows you to compose these two transformations together:

```
                   ___________________________
                  |                           |
---[ raw_logs ]-->| BuildDataFrame+BuildModel |->--[ model ]---
                  |___________________________|
```

We now only need to specify the locations of two files -- TinyBaker handles linking the two steps together:

```
                                 ___________________________
                                |                           |
/raw/logs.txt ---[ raw_logs ]-->| BuildDataFrame+BuildModel |->--[ model ]--- /path/to/model.pkl
                                |___________________________|
```

Extra file dependencies are propagated to the top level of a sequence, ensuring you'll never miss a file dependency buried in step 5 of 17. For example:

```
                   ________________
                  |                |
---[ raw_logs ]-->| BuildDataFrame |->--[ df ]---
                  |________________|

                 ____________
---[ df ]------>|            |
                | BuildModel |->--[ model ]---
---[ config ]-->|____________|

# Goes to...

                   ___________________________
---[ raw_logs ]-->|                           |
                  | BuildDataFrame+BuildModel |->--[ model ]---
---[ config ]---->|___________________________|
```

### In-Code Anatomy of a single transform

The following describes a minimal transform one can define in TinyBaker:

```py
from tinybaker import Transform, InputTag, OutputTag

class SampleTransform(Transform):
  # 1 tag per input file
  first_input = InputTag("first_input")
  second_input = InputTag("second_input")
  some_output = OutputTag("some_output")

  # self.script describes what actually executes when the transform task runs
  def script(self):
    # Within scripts, one can operate on tags as if they're FileRefs
    with self.first_input.open() as f:
      do_something_with(f)
    with self.second_input.open() as f:
      do_something_else_with(f)

    # ...and write something to the output
    with self.some_output.open() as f:
      write_something_to(f)
```

This would then be executed via:

```py
SampleTransform(
  input_paths={"first_input": "path/to/input1", "second_input": "path/to/input2"},
  output_paths={"some_output": "path/to/write/output"}
).run()
```
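The paths aren't required to live on the same filesystem (see "Operating over multiple filesystems" below). As a hypothetical variation of the call above -- the bucket name is made up, and the relevant fsspec backend (e.g. s3fs for `s3://`) is assumed to be installed -- only the paths change:

```py
# Hypothetical paths: one input on S3, everything else local.
SampleTransform(
  input_paths={
    "first_input": "s3://some-bucket/path/to/input1",  # read from S3
    "second_input": "path/to/input2",                  # read from the local filesystem
  },
  output_paths={"some_output": "path/to/write/output"}
).run()
```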
\"path/to/write/output\"}\n).run()\n\n```\n\n### Real-world example of a single transform\n\nFor a real-world example, consider training an ML model. This is a transformation from the two files `some/path/train.csv` and `some/path/test.csv` to a pickled ML model `another/path/some_model.pkl` and statistics. With `tinybaker`, you can specify this individual configurable step as follows:\n\n```py\n# train_step.py\nfrom tinybaker import Transform, cli, InputTag, OutputTag\nimport pandas as pd\nfrom some_cool_ml_library import train_model, test_model\n\nclass TrainModelStep(Transform):\n  train_csv = InputTag(\"train_csv\")\n  test_csv = InputTag(\"test_csv\")\n  pickled_model = OutputTag(\"pickled_model\")\n  results = OutputTag(\"results\")\n\n  def script():\n    # Read from files\n    with self.train_csv.open() as f:\n      train_data = pd.read_csv(f)\n    with self.test_csv.open() as f:\n      test_data = pd.read_csv(f)\n\n    # Run computations\n    X = train_data.drop([\"label\"])\n    Y = train_data[[\"label\"]]\n    [model, train_results] = train_model(X, Y)\n    test_results = test_model(model, test_data)\n\n    # Write to output files\n    with self.results.open() as f:\n      results = train_results.formatted_summary() + test_results.formatted_summary()\n      f.write(results)\n    with self.pickled_model.openbin() as f:\n      pickle.dump(f, model)\n\nif __name__ == \"__main__\":\n  cli(SampleTransform)\n```\n\n### Operating over multiple filesystems\nSince TinyBaker uses [fsspec](https://filesystem-spec.readthedocs.io/en/latest/index.html/) as its filesystem, TinyBaker can use [any filesystem that fsspec supports](https://filesystem-spec.readthedocs.io/en/latest/api.html#built-in-implementations). For example, you can use s3 via setting the protocol of files to `s3://`\n\nThis makes building test suites for transforms very easy: test suites can operate off of local data, but production jobs can run off of s3 data.\n\n### Validation\n\nTinyBaker performs simple validation, such as raising early if input files are missing, or erroring if fully-qualified file paths form a cycle.\n\n\n## Combining several build steps\n\nWe can compose several build steps together using the methods `merge` and `sequence`.\n\n```py\nfrom tinybaker import Transform, sequence\n\nclass CleanLogs(Transform):\n  raw_logfile = InputTag(\"raw_logfile\")\n  cleaned_logfile = OutputTag(\"cleaned_logfile\")\n  # ...\n\nclass BuildDataframe(Transform):\n  cleaned_logfile = InputTag(\"cleaned_logfile\")\n  dataframe = OutputTag(\"dataframe\")\n  # ...\n\nclass BuildLabels(Transform):\n  cleaned_logfile = InputTag(\"cleaned_logfile\")\n  labels = OutputTag(\"labels\")\n  # ...\n\nclass TrainModelFromDataframe(Transform):\n  dataframe = InputTag(\"dataframe\")\n  labels = InputTag(\"labels\")\n  trained_model = OutputTag(\"trained_model\")\n  # ...\n\n\nTrainFromRawLogs = sequence(\n  CleanLogs,\n  merge(BuildDataframe, BuildLabels),\n  TrainModelFromDataframe\n)\n\ntask = TrainFromRawLogs(\n  input_paths={\"raw_logfile\": \"/path/to/raw.log\"},\n  output_paths={\"trained_model\": \"/path/to/model.pkl\"}\n)\n\ntask.run()\n```\n\nInputs and outputs are hooked up via tag names, e.g. 
### expose_intermediates
If you need to expose intermediate files within a sequence, you can use the keyword argument `expose_intermediates` to additionally output the listed intermediate tags, e.g.

```py
sequence([A, B, C], expose_intermediates={"some_intermediate", "some_other_intermediate"})
```

### Renaming Tags

Since the association of files from one step to the next is based on tags, we may end up in a situation where we want to rename tags. We can use `map_tags` to do so.

```py
from tinybaker import map_tags

MappedStep = map_tags(
  SomeStep,
  input_mapping={"old_input_name": "new_input_name"},
  output_mapping={"old_output_name": "new_output_name"})
```

## CLI
TinyBaker can instantly turn a transform into a CLI:

```py
from tinybaker import Transform, cli

class MNISTPipeline(Transform):
  # as defined in tests/slow/test_real_world.py
  # [...]

if __name__ == "__main__":
  cli(MNISTPipeline)
```

Running the above yields:

```
$ python ./mnist_pipeline_transform.py --help
usage: test_real_world.py [-h] --raw_train_images RAW_TRAIN_IMAGES --raw_test_images
                          RAW_TEST_IMAGES --accuracy ACCURACY --model MODEL
                          [--version] [--overwrite]

Execute a MNISTPipeline transform

optional arguments:
  -h, --help            show this help message and exit
  --raw_train_images RAW_TRAIN_IMAGES
                        Path for output tag raw_train_images
  --raw_test_images RAW_TEST_IMAGES
                        Path for output tag raw_test_images
  --accuracy ACCURACY   Path for output tag accuracy
  --model MODEL         Path for output tag model
  --version             show program's version number and exit
  --overwrite           Whether to overwrite any existing output files
```

No need to write argument parsers -- TinyBaker knows what arguments the transform needs and builds a CLI around it.


## Filesets

If a step operates over a dynamic set of files (e.g. logs from n different days), you can use the filesets interface to specify that. Tags that begin with the prefix `fileset::` are interpreted as filesets rather than individual files.

If a sequence includes a fileset as an intermediate, then TinyBaker expects the developer to specify the paths of the intermediate via `expose_intermediates`.
This is a relatively fundamental restriction of the platform, as TinyBaker expects that all paths are specified before script execution.

### Example

A concat task can be done as follows:

```py
from tinybaker import Transform, InputTag, OutputTag

class Concat(Transform):
    files = InputTag("fileset::files")
    concatted = OutputTag("concatted")

    def script(self):
        content = ""
        for ref in self.files:
            with ref.open() as f:
                content = content + f.read()

        with self.concatted.open() as f:
            f.write(content)

Concat(
    input_paths={
        "fileset::files": ["./tests/__data__/foo.txt", "./tests/__data__/bar.txt"],
    },
    output_paths={"concatted": "/tmp/concatted"},
    overwrite=True,
).run()
```

## Experimental API: File-style Transform Definitions

Transforms can be specified in a script-like format:

```py
# train_model.py
from tinybaker import InputTag, OutputTag, cli
import pandas as pd
import pickle
from some_cool_ml_library import train_model, test_model

train_csv = InputTag("train_csv")
test_csv = InputTag("test_csv")

results = OutputTag("results")
pickled_model = OutputTag("pickled_model")

def script():
  # Read from files
  with train_csv.open() as f:
    train_data = pd.read_csv(f)

  with test_csv.open() as f:
    test_data = pd.read_csv(f)

  # Run computations
  X = train_data.drop(columns=["label"])
  Y = train_data[["label"]]
  [model, train_results] = train_model(X, Y)
  test_results = test_model(model, test_data)

  # Write to output files
  with results.open() as f:
    summary = train_results.formatted_summary() + test_results.formatted_summary()
    f.write(summary)
  with pickled_model.openbin() as f:
    pickle.dump(model, f)


if __name__ == "__main__":
  # We can still define a cli under this format.
  cli(locals())
```

These can be converted to transforms via:

```py
from tinybaker import Transform
from . import train_model

TrainModelTransform = Transform.from_namespace(train_model)

# This can be consumed just like any other job.
job = TrainModelTransform(input_paths={...}, output_paths={...})
job.run()
```

## Contributing

[Please contribute!](contributing.md)