{"id":27966518,"url":"https://github.com/elijas/token-throttle","last_synced_at":"2025-05-07T20:19:14.696Z","repository":{"id":289060121,"uuid":"969988416","full_name":"Elijas/token-throttle","owner":"Elijas","description":"Simple Multi-Resource Rate Limiting That Saves Unused Tokens.  Rate limit API requests across different resources and workers without wasting your quota. Reserve tokens upfront, get refunds for what you don't use, and avoid over-limiting.","archived":false,"fork":false,"pushed_at":"2025-04-30T03:26:56.000Z","size":240,"stargazers_count":4,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-05-07T20:19:11.708Z","etag":null,"topics":["ai","ai-agents","ai-engineering","llm","llm-token","llms","openai","openai-api","rate-limit","rate-limit-redis","rate-limiter","rate-limiting","throttle-requests","throttler","tokens"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Elijas.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-04-21T09:15:28.000Z","updated_at":"2025-05-03T07:51:13.000Z","dependencies_parsed_at":"2025-04-24T17:33:29.771Z","dependency_job_id":null,"html_url":"https://github.com/Elijas/token-throttle","commit_stats":null,"previous_names":["elijas/multi-resource-limiter","elijas/multi-resource-rate-limiter","elijas/token-throttle"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Elijas%2F
token-throttle","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Elijas%2Ftoken-throttle/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Elijas%2Ftoken-throttle/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Elijas%2Ftoken-throttle/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Elijas","download_url":"https://codeload.github.com/Elijas/token-throttle/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252949336,"owners_count":21830176,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","ai-agents","ai-engineering","llm","llm-token","llms","openai","openai-api","rate-limit","rate-limit-redis","rate-limiter","rate-limiting","throttle-requests","throttler","tokens"],"created_at":"2025-05-07T20:19:14.080Z","updated_at":"2025-05-07T20:19:14.664Z","avatar_url":"https://github.com/Elijas.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# token-throttle\n\n[![Status: Experimental](https://img.shields.io/badge/status-experimental-gold.svg?style=flat)](https://github.com/mkenney/software-guides/blob/master/STABILITY-BADGES.md#experimental)\n[![Maintained: yes](https://img.shields.io/badge/yes-43cd0f.svg?style=flat\u0026label=maintained)](https://github.com/Elijas/token-throttle/issues)\n[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-43cd0f.svg?style=flat\u0026label=license)](LICENSE)\n[![PyPI 
Version](https://img.shields.io/badge/v0.3.2-version?color=43cd0f\u0026style=flat\u0026label=pypi)](https://pypi.org/project/token-throttle)\n[![PyPI Downloads](https://img.shields.io/pypi/dm/token-throttle?color=43cd0f\u0026style=flat\u0026label=downloads)](https://pypistats.org/packages/token-throttle)\n[![Linter: Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)\n\n**Simple Multi-Resource Rate Limiting That Saves Unused Tokens.**\n\nRate limit API requests across different resources and workers without wasting your quota. Reserve tokens upfront, get refunds for what you don't use, and avoid over-limiting.\n\n- Limits requests across multiple services (like OpenAI, Anthropic)\n- Works across multiple servers/workers\n- Returns unused tokens to your quota automatically\n- Prevents hitting API rate limits while maximizing throughput\n- Robust against race conditions through the use of Redis-locked atomic operations. Note: you can bring your own backend if you don't want to use Redis.\n- Implements the [generic cell rate algorithm](https://en.wikipedia.org/wiki/Generic_cell_rate_algorithm), a variant of the leaky bucket pattern, with millisecond precision.\n\nNote: the API may change without notice in future minor versions, so pin the version when installing:\n\n```bash\npip install \"token-throttle[redis,tiktoken]\u003e=0.3.2,\u003c0.4.0\"\n```\n\nFound this useful? Star the repo on GitHub to show support and follow for updates. 
Also, find me on Discord if you have questions or would like to join a discussion!\n\n![GitHub Repo stars](https://img.shields.io/github/stars/elijas/token-throttle?style=flat\u0026color=fcfcfc\u0026labelColor=white\u0026logo=github\u0026logoColor=black\u0026label=stars)\n\u0026nbsp;\u003ca href=\"https://discord.gg/hCppPqm6\"\u003e\u003cimg alt=\"Discord server invite\" src=\"https://img.shields.io/discord/1119368998161752075?logo=discord\u0026logoColor=white\u0026style=flat\u0026color=fcfcfc\u0026labelColor=7289da\" height=\"20\"\u003e\u003c/a\u003e\n\n### Introduction\n\nThis is a tool I built as a rewrite of [openlimit](https://github.com/shobrook/openlimit/issues/20#issuecomment-2782677483) after [not finding any good Python solutions for rate limiting](https://gist.github.com/justinvanwinkle/d9f04950083c4554835c1a35f9d22dad), especially ones that are token-aware and can refund unused tokens.\n\n- Rate-limit multiple resources, such as requests, tokens, apples, and bananas, at the same time\n  - This is needed because different APIs have different resource rules, e.g., Anthropic counts request and completion tokens separately.\n  - While this was originally intended for LLM APIs, it's fully customizable: you can limit bananas per 32-second window and apples per 2-minute window simultaneously. 
You can also connect (through Dependency Injection) your own backend if you don't want Redis.\n- Rate-limit multiple resource consumers (such as LLM-calling applications that use the same API key and model).\n- Rate-limit the same resource across multiple time frames\n- Rate-limit each resource on its own set of quotas\n- Reserve usage while the request is being completed, and then refund/adjust according to actual usage after the request completes\n- Refund unused resources (such as unused tokens).\n\nTreat this as an early preview (no unit tests or extensive testing), although it has been stable and worked correctly for my use cases.\n\n### Illustrating use case\n\n- Imagine you have a single API key to your provider.\n- Only up to 10% of the key's throughput capacity is used by a continuously running production service.\n- You want to run a massively parallelized LLM data processing workflow with the same key.\n- But you need to do it without bringing down production (or your workflow) with 429 Too Many Requests errors.\n- Also, leaving throughput on the table is not a good solution; you want the batch results as soon as possible.\n\n\u003e [!NOTE]\n\u003e This example uses [BAML](https://github.com/BoundaryML/baml) to call LLMs, but you can use absolutely anything, because all you need is a way to retrieve the tokens used in the request and the actual tokens used in the response. 
Also note that you can optionally use the existing utilities in `token-throttle` to calculate/extract these two values automatically from OpenAI-compatible requests and responses.\n\n```python\nfrom baml_client import b\nfrom baml_py import Collector\ntoken_counter = Collector()\nb = b.with_options(token_counter)\n\nlimiter = create_limiter([\n    # Say production uses up to 10% of your capacity;\n    # then set these quotas to the remaining 90%\n    Quota(metric=\"requests\", limit=90_000, per_seconds=60),\n    Quota(metric=\"tokens\", limit=90_000_000, per_seconds=60),\n], backend=redis)\n\nasync def massively_parallelized():\n    input_tokens = inp_tok(await b.request.ExtractResume(...))\n    # e.g. max_tokens value of the request\n    # or the 95th percentile of usual b.ExtractResume() consumption\n    output_tokens = 10_000\n\n    # Safe against race conditions across many clients\n    # because it uses Redis locks and atomic operations\n    reservation = await limiter.acquire_capacity(\n        model=\"gpt-4.1\",\n        usage={\n            \"requests\": 1,\n            \"tokens\": input_tokens + output_tokens,\n\n            # Anthropic input and output tokens\n            # have separate rate limits:\n            #   \"input_tokens\": input_tokens\n            #   \"output_tokens\": output_tokens\n\n        },\n    )\n\n    # Execution continues here only after the capacity has been\n    # reserved, so no other LLM call can consume it\n    c = Collector()\n    # Use a new name: rebinding `b` here would make it local to the\n    # function and break the b.request call above\n    tracked_b = b.with_options(collector=c)\n    resume = await tracked_b.ExtractResume(...)\n    actual_usage = {\n        \"requests\": 1,\n        \"tokens\": get_total_tokens(c),\n    }\n    await limiter.refund_capacity(actual_usage, reservation)\n    # Now two things happened:\n    # 1. Actual usage recorded\n    #    (e.g. got a capacity refund for unused output tokens)\n    #    (e.g. a negative refund is also possible if actual usage exceeded the expected one)\n    # 2. 
The usage timestamp was moved to when the last token was generated\n```\n\n### Features\n\nHere are the key features of `token-throttle`, explained:\n\n- **Multi-Resource Limiting:**\n\n  - Simultaneously enforce limits on multiple distinct resource types for a single operation (e.g., limit both the number of API requests _and_ the number of tokens consumed within those requests).\n  - Define several quotas at once for different resources (e.g., 60 requests/minute AND 1,000 requests/day AND 1,000,000 tokens/minute).\n\n- **Accurate Capacity Management \u0026 Refunding:**\n\n  - Implements a reserve-then-adjust mechanism (`acquire_capacity` followed by `refund_capacity`).\n  - Initially reserves the maximum potential usage for an operation.\n  - Allows refunding unused capacity _or_ accounting for overuse if the actual usage differs from the reservation, ensuring limits are accurately enforced based on _actual_ consumption.\n\n- **Asyncio Native:**\n\n  - Built from the ground up using `asyncio` for non-blocking operation, ideal for high-throughput applications interacting with external APIs.\n\n- **Flexible Time Windows:**\n\n  - Define quotas over various time periods (e.g., per second, per minute, per hour, per day, or anything in between) using the `per_seconds` parameter in `Quota`.\n  - Enforce limits across multiple windows concurrently for the same resource (e.g., limit requests per minute _and_ requests per day).\n\n- **Correctness \u0026 Atomicity:**\n\n  - Designed to avoid common race conditions found in simpler rate limiters, especially when used with distributed backends like Redis.\n  - The provided Redis backend uses locks and appropriate commands to guarantee atomic updates to capacity across multiple workers/processes.\n\n- **Pluggable Backend Architecture:**\n\n  - Core logic is separated from the storage mechanism via `RateLimiterBackend` and `RateLimiterBackendBuilderInterface` interfaces.\n  - Ships with a robust `RedisBackend` for 
distributed rate limiting.\n  - Allows implementing custom backends (e.g., in-memory for single process, other databases) if needed.\n\n- **Configurable Per \"Model\" or Endpoint:**\n\n  - Apply different sets of `UsageQuotas` to different logical entities (referred to as `model` or `model_family` internally, e.g., specific API endpoints, different LLM versions sharing a quota).\n    - For example, gpt-4o and gpt-4o-mini automatically get separate quotas, while gpt-4o-20241203 and gpt-4o-20241024 are treated as aliases of each other and counted in the same quota bucket instance (i.e., they belong to the same `model_family`).\n  - Supports dynamic configuration lookups via a `PerModelConfigGetter` callable.\n\n- **Extensible Usage Counting:**\n\n  - Define custom logic (`UsageCounter`) to calculate the resource usage of a given request _before_ it happens (e.g., estimate token count for an LLM request based on input messages).\n\n- **Observability Hooks:**\n  - Provides callbacks (`RateLimiterCallbacks`) for monitoring key events like starting to wait for capacity, consuming capacity, refunding capacity, and detecting missing state in the backend. Includes `loguru` integration helpers.\n\n### Getting started\n\nFor an out-of-the-box experience, just call `limiter = create_openai_redis_rate_limiter()` and use it as in [example-1](https://github.com/shobrook/openlimit/issues/20#issuecomment-2782677483) or [example-2](https://gist.github.com/justinvanwinkle/d9f04950083c4554835c1a35f9d22dad). Otherwise, copy the function and customize it to your needs.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Felijas%2Ftoken-throttle","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Felijas%2Ftoken-throttle","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Felijas%2Ftoken-throttle/lists"}