{"id":17260059,"url":"https://github.com/paulescu/ml-rest-api-caching","last_synced_at":"2025-08-25T16:12:44.986Z","repository":{"id":252323132,"uuid":"840094033","full_name":"Paulescu/ml-rest-api-caching","owner":"Paulescu","description":"How to serve ML predictions 100x faster","archived":false,"fork":false,"pushed_at":"2024-08-09T14:31:24.000Z","size":1303,"stargazers_count":53,"open_issues_count":2,"forks_count":14,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-03-31T19:21:18.291Z","etag":null,"topics":["cache","docker","docker-compose","ml","python","real-time","redis"],"latest_commit_sha":null,"homepage":"https://www.realworldml.net/subscribe","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Paulescu.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-09T00:49:22.000Z","updated_at":"2025-02-28T20:28:35.000Z","dependencies_parsed_at":"2025-02-10T03:01:08.465Z","dependency_job_id":"00683b96-40f6-49f0-8f47-21d562b80063","html_url":"https://github.com/Paulescu/ml-rest-api-caching","commit_stats":null,"previous_names":["paulescu/ml-rest-api-caching"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Paulescu%2Fml-rest-api-caching","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Paulescu%2Fml-rest-api-caching/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Paulescu%2Fml-rest-api-caching/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/Git
Hub/repositories/Paulescu%2Fml-rest-api-caching/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Paulescu","download_url":"https://codeload.github.com/Paulescu/ml-rest-api-caching/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253163487,"owners_count":21864086,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cache","docker","docker-compose","ml","python","real-time","redis"],"created_at":"2024-10-15T07:47:04.983Z","updated_at":"2025-05-08T23:30:53.187Z","avatar_url":"https://github.com/Paulescu.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n    \u003ch1\u003eHow to serve ML predictions 100x faster\u003c/h1\u003e\n    \u003cimg src=\"./media/with_caching_second_request.gif\" width='550' /\u003e\n\u003c/div\u003e\n\n#### Table of contents\n* [The problem](#the-problem)\n* [Solution](#solution)\n* [Run the whole thing in 5 minutes](#run-the-whole-thing-in-5-minutes)\n* [Wanna learn more real-world ML?](#wanna-learn-more-real-world-ml)\n\n\n## The problem\n\nA very common way to deploy an ML model, and make its predictions accessible to other services, is by using a REST API.\n\nIt works as follows:\n1. The client requests a prediction -\u003e *Give me the price of ETH/EUR in the next 5 minutes*\n2. The ML model generates the prediction,\n3. 
The prediction is sent back to the client -\u003e *predicted price = 2,300 USD*\n\n\u003cdiv align=\"center\"\u003e\n    \u003ch3\u003eREST API from your textbook 🐢\u003c/h3\u003e\n    \u003cimg src=\"./media/without_caching.gif\" width='550' /\u003e\n\u003c/div\u003e\n\nThis design works, but it can become terribly inefficient in many real-world scenarios.\n\n*Why?*\n\nBecause more often than not, your ML model will re-compute the exact same prediction it already computed for a previous request.\n\nSo you will be doing the same (costly) work more than once.\n\nThis becomes a serious bottleneck when the request volume grows and your model is large, like a Large Language Model.\n\nSo the question is:\n\n\u003e Is there a way to avoid re-computing costly predictions? 🤔\n\nAnd the answer is … YES!\n\n## Solution\n\nCaching is a standard technique to speed up API response time.\n\nThe idea is very simple. You add a fast key-value database to your system, for example Redis, and use it to store past predictions.\n\nWhen the first request hits the API, your cache is still empty, so you\n* generate a new prediction with your ML model\n* store it in the cache, as a key-value pair, and\n* return it to the client\n\n\u003cdiv align=\"center\"\u003e\n    \u003ch3\u003eREST API with a fast in-memory cache ⚡\u003c/h3\u003e\n    \u003cimg src=\"./media/with_caching_first_request.gif\" width='550' /\u003e\n\u003c/div\u003e\n\nNow, when the second request arrives, you can simply\n* load the prediction from the cache (which is super fast), and\n* return it to the client\n\n\u003cdiv align=\"center\"\u003e\n    \u003ch3\u003eREST API with a fast in-memory cache ⚡\u003c/h3\u003e\n    \u003cimg src=\"./media/with_caching_second_request.gif\" width='550' /\u003e\n\u003c/div\u003e\n\n\u003cbr\u003e\n\nTo ensure the predictions stored in your cache are still relevant, you can set an expiry date. 
Whenever a prediction in the cache gets too old, it is replaced by a newly generated prediction.\n\n\u003e **For example**\n\u003e \n\u003e If your underlying ML model is generating price predictions 5 minutes into the future, you can tolerate predictions that are up to, for example, 1-2 minutes old.\n\n\n## Run the whole thing in 5 minutes\n\n1. Install all project dependencies inside an isolated virtual env, using Python Poetry\n    ```\n    $ make install\n    ```\n\n2. Run the REST API without cache\n    ```\n    $ make api-without-cache\n    ```\n\n3. Open another terminal and run\n    ```\n    $ make requests\n    ```\n    to send 100 requests and check the response time\n    ```\n    Time taken: 1014.67ms\n    Time taken: 1027.10ms\n    Time taken: 1013.05ms\n    Time taken: 1011.15ms\n    Time taken: 1004.31ms\n    Time taken: 1017.23ms\n    Time taken: 1011.73ms\n    Time taken: 1009.76ms\n    Time taken: 1011.26ms\n    ...\n    ```\n\n4. Stop the API and restart it, this time enabling the cache\n    ```\n    $ make api-with-cache\n    ```\n    and resend the 100 requests from another terminal\n    ```\n    $ make requests\n    ```\n    The response time for the first request is still high, but most of the following requests are about 100x faster.\n    ```\n    Time taken: 1029.59ms \u003c-- new prediction\n    Time taken: 13.09ms \u003c-- very fast\n    Time taken: 8.47ms \u003c-- very fast\n    Time taken: 7.74ms \u003c-- very fast\n    Time taken: 12.98ms \u003c-- very fast\n    Time taken: 1020.92ms \u003c-- new prediction\n    Time taken: 8.40ms \u003c-- very fast\n    Time taken: 12.61ms \u003c-- very fast\n    Time taken: 10.55ms \u003c-- very fast\n    ```\n    \n    In the code I am setting the cache expiry to `5 seconds`.\n    ```\n    # src/api.py\n    cache = PredictorCache(seconds_to_invalidate_prediction=5)\n    ```\n    This is a parameter that you can tune based on how fast your ML model predictions become obsolete.\n\n## Wanna learn more 
real-world ML?\n\nJoin more than 18k builders on the **Real-World ML Newsletter**.\n\nEvery Saturday morning.\n\nFor **FREE**\n\n### [👉🏽 Subscribe for FREE](https://www.realworldml.net/subscribe)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpaulescu%2Fml-rest-api-caching","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpaulescu%2Fml-rest-api-caching","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpaulescu%2Fml-rest-api-caching/lists"}