{"id":13737319,"url":"https://github.com/msaroufim/ml-design-patterns","last_synced_at":"2025-04-06T14:12:30.575Z","repository":{"id":37500880,"uuid":"376663985","full_name":"msaroufim/ml-design-patterns","owner":"msaroufim","description":"Software Architecture for ML engineers","archived":false,"fork":false,"pushed_at":"2022-06-29T00:50:58.000Z","size":38,"stargazers_count":399,"open_issues_count":1,"forks_count":31,"subscribers_count":10,"default_branch":"main","last_synced_at":"2025-03-30T13:08:55.241Z","etag":null,"topics":["deep-learning","design-patterns","python","pytorch","systems"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/msaroufim.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-06-14T00:06:36.000Z","updated_at":"2025-03-26T19:11:56.000Z","dependencies_parsed_at":"2022-07-13T13:50:35.786Z","dependency_job_id":null,"html_url":"https://github.com/msaroufim/ml-design-patterns","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msaroufim%2Fml-design-patterns","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msaroufim%2Fml-design-patterns/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msaroufim%2Fml-design-patterns/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msaroufim%2Fml-design-patterns/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/msaroufim","download_url":"https://codeload.github.com/msaroufim/ml-design-pat
terns/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247492557,"owners_count":20947545,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","design-patterns","python","pytorch","systems"],"created_at":"2024-08-03T03:01:41.632Z","updated_at":"2025-04-06T14:12:30.559Z","avatar_url":"https://github.com/msaroufim.png","language":null,"funding_links":[],"categories":["Others"],"sub_categories":[],"readme":"# Machine Learning Design patterns\n\n## Pipeline\n\nA pipeline is about processing some data sequentially using an arbitrary number of functions. 
It's useful for data preprocessing or within the context of an inference framework.\n\nFor example, you may want to run `preprocess -\u003e inference -\u003e postprocess`\n\n```python\n# Image, Video, Audio and Tensor are placeholder types; the import below\n# keeps their annotations from being evaluated\nfrom __future__ import annotations\nfrom typing import Union\n\ndef preprocess(input : Union[str, Image, Video, Audio]) -\u003e Tensor:\n    ... # implementation\n\ndef inference(input : Tensor) -\u003e Tensor:\n    ... # implementation\n\ndef postprocess(input : Tensor) -\u003e Union[str, Image, Video, Audio]:\n    ... # implementation\n```\n\nAnd then you'd run your pipeline by saying\n\n```python\npipeline = [preprocess, inference, postprocess]\n\ndef run(pipeline, input):\n    # Feed the output of each step into the next one\n    for step in pipeline:\n        input = step(input)\n    return input\n```\n\nAn important detail is that the output type of each function in a pipeline needs to match the input type of the next one.\n\nThis pattern isn't limited to inference frameworks. A framework like Keras explicitly has a concept of a layer, so if you were to implement it from scratch a grossly simplified version would look something like this.\n\n```python\nclass KerasModel():\n    def __init__(self):\n        self.layers = []\n\n    def add_layer(self, layer):\n        self.layers.append(layer)\n\n    def forward(self, input):\n        # Each layer consumes the output of the previous one\n        for layer in self.layers:\n            input = layer(input)\n        return input\n```\n\nAn exercise for the reader is to make the above work with a batch of examples.\n\n## Workflow\nA workflow is a more complex version of a pipeline. A pipeline only allows sequential behavior, while the more general pattern is a Directed Acyclic Graph (DAG) where a step can feed several downstream steps. This is what DAG providers like Airflow, Metaflow and the ensemble support in TorchServe do.\n\n\n```python\n# DAG example\ngraph = {\n    'input': ['a'],\n    'a': ['b', 'e'],\n    'b': ['c', 'd'],\n    'd': ['e']}\n```\n\nIn the example above we use a Python dictionary where the keys on the left-hand side are the nodes that arrows point out of and the values on the right-hand side are the nodes that arrows point into. 
If you don't like Python dictionaries you can also create a DAG using YAML or Python decorators.\n\nNow imagine every node is some Python function or even a PyTorch model. How would you go about executing this DAG?\n\n```python\nclass WorkflowEngine():\n    def __init__(self, dag):\n        self.dag = dag\n\n    def execute(self):\n        # Naive: visits nodes in dictionary order and marks no dependency as met\n        for key, value in self.dag.items():\n            Step(key, value, False).execute()\n\nclass Step():\n    def __init__(self, inputs, outputs, dependencies_met):\n        self.inputs = inputs\n        self.outputs = outputs\n        self.dependencies_met = dependencies_met\n        self.resources = {\"cpu\" : 2, \"gpu\" : 1}\n\n    def execute(self):\n        if self.dependencies_met:\n            # Execute steps\n            pass\n```\n\nA real world orchestrator would need to take care of dependency management, scheduling and resource allocation.\n\n## Function as data\nFunction as data is something LISP programmers talk a lot about. The main idea is you could have a function call like\n\n```lisp\n;; Add 1 and 2\n(+ 1 2)\n```\n\nBut if you add a quote at the beginning of it then it is no longer evaluated and becomes data, a list of three elements\n\n```lisp\n;; The list (+ 1 2), left unevaluated\n'(+ 1 2)\n```\n\nThis is powerful because now a separate program could analyze the expression `(+ 1 2)`, realize that the inputs never change and the function is pure so the output never changes, and replace the whole expression with `3`.\n\nPyTorch also has a similar idea but first let's define a very simple toy model.\n\n\n```python\nimport torch\n\nclass myModel(torch.nn.Module):\n    def __init__(self):\n        super().__init__()\n        # Linear needs both an input and an output size; 100 outputs is an arbitrary choice\n        self.linear = torch.nn.Linear(100, 100)\n\n    def forward(self, input):\n        output = self.linear(input)\n        return output\n```\n\nRun an inference with `model = myModel(); model(torch.randn(100))` so it's a function! But if you were to run `model.state_dict()` you would get the weights of the model so it's also data. 
So `function = data`.\n\nThis becomes even clearer if you've ever pickled a model. Pickling serializes Python objects to bytes that can be written to disk, so again `function = data`\n\n```python\nimport pickle\n\nmodel = myModel()\npickle.dumps(model)\n```\n\n## Iterator design pattern\n\nPython's `for` loop works on anything that can be iterated over, for example\n\n```python\nfor i in range(10):\n    print(i)\n\n```\n\nBut a more useful operation would be something like\n\n```python\nfor batch in dataset:\n    model(batch)\n```\n\nSo how do you make something like `for _ in _` available for your own classes? We do this by implementing the `__iter__()` and `__next__()` functions\n\n\n```python\nfrom typing import List\n\nclass Dataset:\n    def __init__(self, data : List[str], batch : int = 0):\n        self.data = data\n        self.batch = batch\n        self.elements = 0\n\n    def __iter__(self):\n        # Restart from the beginning; the dataset is its own iterator\n        self.elements = 0\n        return self\n\n    def __next__(self):\n        # Tell the for loop to stop once the data is exhausted\n        if self.elements \u003e= len(self.data):\n            raise StopIteration\n\n        # Return a batch of examples\n        if self.batch \u003e 0:\n            start = self.elements\n            self.elements = self.elements + self.batch\n            return self.data[start : start + self.batch]\n\n        # Return a single example\n        else:\n            example = self.data[self.elements]\n            self.elements = self.elements + 1\n            return example\n```\n\n## Job queues\n\nLet's say we have a service that needs to pick one of `n` PyTorch models to run on some input\n\n```python\n# Image, Audio and Video are placeholder types; the import below\n# keeps their annotations from being evaluated\nfrom __future__ import annotations\nfrom dataclasses import dataclass\nfrom typing import Callable, Dict, List, Tuple, Union\n\n@dataclass\nclass Job:\n    model : str # name of the model to run\n    input : Union[str, Image, Audio, Video]\n    endpoint : Tuple[str, int] # url : port\n\nclass JobProcessor():\n    def __init__(self, models : Dict[str, Callable]):\n        self.models = models # maps a model name to a loaded model\n        self.jobs : List[Job] = []\n\n    def process_job(self):\n        job = self.jobs.pop(0) # oldest job first\n        self.execute(job)\n\n    def execute(self, job):\n        output = self.models[job.model](job.input)\n        self.expose(output, job.endpoint)\n\n    def expose(self, output, endpoint):\n        # Use FastAPI or something else\n        ...\n```\n\nWith only a couple of lines of code we've 
designed a multi-model inference framework. If you're not using Python to write this job manager, you can still spawn a Python process, run the inference, write the output to disk or stdout and pick it back up from the other language.\n\n\n## Callbacks\nMany trainer loops implement callbacks where you can trigger some behavior if some condition is fulfilled, for example\n\n```python\n# Pseudocode: event -\u003e callback\n# on_training_end -\u003e do_something\n# on_epoch_end -\u003e do_something\n\ndef do_something():\n    save_logs_to_tensorboard()\n    change_learning_rate()\n```\n\nA callback is a particular case of something called the Observer pattern so let's implement that. Code paraphrased from https://refactoring.guru/design-patterns/observer/python/example#lang-features\n\nAn observer subscribes to a subject whose state changes over time\n\n```python\nfrom typing import List\n\nclass ModelSubject():\n    def __init__(self):\n        self.state : Trainer = None # A trainer includes a model, which epoch it's on, loss, model weights...\n        self.observers : List[Observer] = []\n\n    def attach(self, observer : \"Observer\"):\n        self.observers.append(observer)\n\n    def detach(self, observer : \"Observer\"):\n        self.observers.remove(observer)\n\n    def notify(self):\n        # Push the current state to every subscribed observer\n        for observer in self.observers:\n            observer.update(self.state)\n```\n\nThe observer is notified of all state changes of the subject and then needs to do something when that happens.\n\nAt a high level an Observer is an abstract class that declares a function called `update`\n\n```python\nfrom abc import abstractmethod, ABC\n\nclass Observer(ABC):\n\n    @abstractmethod\n    def update(self, new_state):\n        \"\"\"\n        Implement your own observer here\n        \"\"\"\n        pass\n```\n\nWe can then build specific kinds of observers by implementing the `update()` function. 
In the example below we build an observer that lowers the learning rate of a model when the loss increases\n\n```python\nclass ChangeLearningRateObserver(Observer):\n    def __init__(self):\n        self.state : TrainerState = None\n\n    def update(self, new_state):\n        # Do not use this in production code this is educational only\n        if self.state is not None and new_state.loss \u003e self.state.loss:\n            # The loss went up since the last update, so shrink the learning rate\n            new_state.lr = new_state.lr * 0.1\n        self.state = new_state\n\n```\n\nBut this is a powerful framework and we can also implement something like logging without changing the library code.\n\n```python\nimport os\nfrom typing import Dict\n\nclass LogObserver(Observer):\n    def __init__(self, log_dir='/logs/'):\n        self.state : Dict = None\n        self.log_dir : str = log_dir\n\n    def update(self, new_state : Dict): # Assume new state is a dictionary\n        # \"state.log\" is a placeholder file name\n        filename = os.path.join(self.log_dir, \"state.log\")\n        # Append so that earlier updates are kept\n        with open(filename, \"a\") as f:\n            for key, value in new_state.items():\n                f.write(f\"{key}:{value}\\n\")\n        self.state = new_state\n```\n\n\nThe benefit of this approach is that you can extend the functionality of a library without changing its core code. Changing the core would require getting a PR merged by the core team, and supporting every use case that people care about could make the core code unmaintainable. 
So the observer pattern is primarily a way to extend code, which is why it's very popular in training frameworks like fast.ai or PyTorch Lightning.\n\n## Learner pattern\n\nThe Learner pattern was popularized by frameworks like scikit-learn, which offered an approach to modeling that was as simple as\n\n`model.fit(data)`\n\nBut implementing code for this, at least within the context of neural networks, is something you already do if you've used vanilla PyTorch without a training framework.\n\n```python\n# data[0][0] means the first input example\n# data[1][5] means the label for the 6th input example\ndata = [inputs, labels]\n\nclass Model:\n    def __init__(self):\n        # nn_model and square_loss are placeholders for your network and loss\n        self.model = nn_model()\n        self.loss_function = square_loss # or subtract, l1, etc.\n\n    def fit(self, data):\n        inputs, labels = data\n\n        # 1. Compute forward function\n        output = self.model(inputs)\n\n        # 2. Get loss\n        loss = self.loss_function(output, labels)\n\n        # 3. Update model\n        self.update(loss)\n\n    def update(self, loss):\n        # 1. Compute gradients with autograd\n\n        self.model.weights = ...\n```\n\n## Batch processing\n\nSuppose you'd like to run `model.forward()` on two different inputs. 
The naive way of doing this is running\n\n```python\nmodel.forward(input_1)\nmodel.forward(input_2)\n```\n\nBut this becomes painfully slow if you start dealing with a large number of examples\n\n```python\n# model.forward is called once per input, so O(len(inputs)) times\nfor input in inputs:\n    model.forward(input)\n```\n\nGenerally in numerical code you should fear `for` loops like the plague and as much as possible try to replace them with batch operations.\n\nSo instead rewrite your code as\n\n```python\nimport torch\n\n# Stack the individual inputs along a new batch dimension\ntensor = torch.stack(inputs)\n\n# model.forward is called once\nmodel.forward(tensor)\n```\n\nRemember GPUs aren't that great at doing many small operations because there's an overhead to launching each one and sending data to the device, so as much as possible it's better to batch jobs into large ones to take advantage of speedups. (Technically this can be worked around with CUDA graphs but that's still a relatively new feature.)\n\nVectorization on the CPU is another technique to eliminate for loops, this time by operating on chunks of data with a single instruction. For example, some newer Intel CPUs treat matrices as long vectors and do the math on many elements at once using the wide AVX-512 instructions.\n\n## Decorator\nDecorators are a technique to add functionality to a function or class without modifying its code. 
You may have already heard of or used decorators like `@memoize`, `@lru_cache`, `@profile` or `@step`.\n\nAs an example let's take a look at how to implement a `@profile` decorator borrowing code from https://medium.com/uncountable-engineering/pythons-line-profiler-32df2b07b290\n\n```python\nimport functools\n\nfrom line_profiler import LineProfiler\n\nprofiler = LineProfiler()\n\n# A decorator is just a Python function that takes in a function\ndef profile(func):\n    # Inner function takes in positional and keyword arguments\n    @functools.wraps(func) # keep the name and docstring of func\n    def inner(*args, **kwargs):\n        # New code the decorator adds\n        profiler.add_function(func)\n        profiler.enable_by_count()\n\n        # Running the decorated function\n        return func(*args, **kwargs)\n    return inner\n```\n\nSo now you can just run\n\n```python\n@profile\ndef my_slow_func():\n    ... # some terrible code here\n```\n\nIn the above decorator we ran some commands before calling `func`, but we could also change `func` or its arguments, or do whatever we please. This is another one of those patterns, like callbacks, that let you extend some code without modifying it.\n\nOne of the most interesting decorators is the FastAPI one https://github.com/tiangolo/fastapi\n\n```python\nfrom fastapi import FastAPI\n\napp = FastAPI()\n\n@app.get(\"/\")\ndef read_root():\n    return {\"Hello\": \"World\"}\n```\n\nThe above application routes calls to `/` to the `read_root()` function. Digging into the code a bit you'll find a function called `get()` in `fastapi/applications.py` https://github.com/tiangolo/fastapi/blob/master/fastapi/applications.py#L425\n\nIt's a complicated function but what we care about is\n\n```python\ndef get(...) 
-\u003e Callable[[DecoratedCallable], DecoratedCallable]:\n    return self.router.get(...)\n```\n\nDigging through the code a bit more, we find that `add_api_route()` is called whenever a new `@app.get()` is used, and there we see `func` being returned in much the same way as it is in the plain profiling decorator https://github.com/tiangolo/fastapi/blob/87e29ec2c54ce3651939cc4d10e05d07a2f8b9ce/fastapi/applications.py#L378\n\nThe flipside of decorators is that they can lead you to a monolithic architecture where your infrastructure and deployment are tightly coupled to your implementation. This is generally fine if you're a startup but not so fine if multiple people are contributing code to the same place.\n\n## Strategy Pattern\n\nThe strategy pattern is classic Object Oriented programming and is generally useful when, as a library designer, you want to let users set some particular strategy for an object without constraining it too much.\n\nFor example, suppose you're creating a new Trainer class and don't have time to implement all the optimizers that people care about. 
So you start by adding support for an `SGDOptimizer`\n\n```python\nfrom abc import ABC, abstractmethod\n\nclass Trainer:\n    def __init__(self, optimizer=None):\n        # Any Optimizer can be plugged in; SGD is the default strategy\n        self.optimizer : Optimizer = optimizer or SGDOptimizer()\n        ...\n\n# Create an abstract optimizer class\nclass Optimizer(ABC):\n    # We don't want to constrain the input types for such a function\n    # Return type is a tensor because value in a tensor needs to be changed by a bit\n    @abstractmethod\n    def step(self, *args, **kwargs) -\u003e \"Tensor\":\n        pass\n\nclass SGDOptimizer(Optimizer):\n    def step(self, learn_rate : float, n_iter : int, tolerance : float):\n        ... # Your SGD implementation here\n```\n\nSo now someone else who doesn't understand how your whole trainer codebase works could create a new optimizer just by inheriting from `Optimizer`\n\n```python\nclass AdamOptimizer(Optimizer):\n    def step(self, beta_1 : float, beta_2 : float, epsilon : float):\n        ... # Out of core Adam implementation here\n```\n\n\n## TODO\n* Autograd - https://marksaroufim.medium.com/automatic-differentiation-step-by-step-24240f97a6e6 (Maybe I need to update this tutorial with some python code)\n* Matrix Multiplication\n    * http://supertech.csail.mit.edu/papers/Prokop99.pdf\n    * https://github.com/mitmath/18335/blob/spring21/notes/oblivious-matmul.pdf\n* Distributed patterns: good tutorial here https://huggingface.co/docs/transformers/parallelism\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmsaroufim%2Fml-design-patterns","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmsaroufim%2Fml-design-patterns","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmsaroufim%2Fml-design-patterns/lists"}