{"id":28252830,"url":"https://github.com/phantie/python_io_parallel_processing_article","last_synced_at":"2026-02-25T09:33:19.666Z","repository":{"id":289149376,"uuid":"970249382","full_name":"phantie/python_io_parallel_processing_article","owner":"phantie","description":"Article about parallel processing in python using asyncio, showing the common pitfalls and handling strategies","archived":false,"fork":false,"pushed_at":"2025-06-06T00:35:17.000Z","size":28,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-10T09:18:34.516Z","etag":null,"topics":["asyncio","beginner-friendly","parallel-processing","producer-consumer","python","queue"],"latest_commit_sha":null,"homepage":"https://phantie.dev/articles/io_bound_parallel_processing_in_python","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/phantie.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-04-21T18:04:57.000Z","updated_at":"2025-06-06T00:35:18.000Z","dependencies_parsed_at":"2025-04-22T04:16:30.434Z","dependency_job_id":null,"html_url":"https://github.com/phantie/python_io_parallel_processing_article","commit_stats":null,"previous_names":["phantie/python_io_parallel_processing_article"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/phantie/python_io_parallel_processing_article","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/phantie%2Fpython_io_parallel_processing_article","tags_url":"https://repos.ecosyste.
ms/api/v1/hosts/GitHub/repositories/phantie%2Fpython_io_parallel_processing_article/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/phantie%2Fpython_io_parallel_processing_article/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/phantie%2Fpython_io_parallel_processing_article/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/phantie","download_url":"https://codeload.github.com/phantie/python_io_parallel_processing_article/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/phantie%2Fpython_io_parallel_processing_article/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":260045758,"owners_count":22950792,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["asyncio","beginner-friendly","parallel-processing","producer-consumer","python","queue"],"created_at":"2025-05-19T16:16:45.730Z","updated_at":"2026-02-25T09:33:14.639Z","avatar_url":"https://github.com/phantie.png","language":"Python","readme":"# I/O Bound Parallel Processing in Python\n\nWhen it comes to I/O, [*asyncio*](https://pypi.org/project/asyncio/) is a must.\n\nThe state of modern Python programming is to prefer an async library over sync if it satisfies your needs because:\n```txt\n- Efficient resource utilization for I/O tasks\n- A lot easier to get right than with *multiprocessing*\n- asyncio.to_thread can turn a sync function into async by running it in a thread pool\n```\nThe template for any async Python program is:\n\n```python\n# 
Entry point to your program.\n# It's an asynchronous function because it has \"async\" before \"def main\".\n# A called asynchronous function turns into a \"coroutine\".\n# A coroutine is a state machine that asyncio knows how to handle,\n# and nothing in the body runs until the asyncio runtime handles it.\n#\n# From a practical standpoint, it can be awaited with the \"await\" keyword inside an async function.\n#\n# An **important** thing to remember is that a blocking operation would block\n# the whole runtime - so no other work will be done (no other coroutines will make further progress in their executions)\n# until that operation finishes.\nasync def main():\n    ...\n\n# This part will be omitted in further sections.\n# We'll work on the *main* function.\nif __name__ == \"__main__\":\n    import asyncio\n\n    # Turning the function \"main\" into a coroutine.\n    coroutine = main()\n    # Letting asyncio runtime execute your coroutine.\n    asyncio.run(coroutine)\n```\n\nTo showcase the actual speed of further examples, let's introduce a *timer* function:\n\n```python\nfrom contextlib import contextmanager\nimport time\n\n@contextmanager\ndef timer():\n    \"\"\"\n    Usage:\n        \u003e\u003e\u003e with timer():\n        ...     # example: simulate a long-running operation\n        ...     
time.sleep(1)\n        ...\n        elapsed time: 1.00 seconds\n\n    \"\"\"\n    start_time = time.time()\n    try:\n        yield\n    finally:\n        end_time = time.time()\n        elapsed = end_time - start_time\n        print(f\"elapsed time: {elapsed:.2f} seconds\")\n```\n\n## The first task the examples will use:\n\n```python\nimport asyncio\nimport pydantic\n\nasync def normal_task(\n    task_number: pydantic.NonNegativeInt,\n    time_to_execute_in_seconds: pydantic.NonNegativeFloat,\n) -\u003e None:\n    await asyncio.sleep(time_to_execute_in_seconds)\n    print(\n        f\"processed task with {task_number=!r} {time_to_execute_in_seconds=!r} behavior=normal-sleep\"\n    )\n    return None\n```\n\nThe default behavior of a task is to simulate work by sleeping.\n\n## Let's solve problems\n\n### Process 1 task\n\n```python\nasync def main():\n    with timer():\n        await normal_task(task_number=0, time_to_execute_in_seconds=1)\n        # \u003e processed task with task_number=0 time_to_execute_in_seconds=1 behavior=normal-sleep\n\n    # \u003e elapsed time: 1.00 seconds\n```\n\n### Process 5 tasks\n\n```python\nasync def main():\n    with timer():\n        for task_number in range(5):\n            await normal_task(task_number=task_number, time_to_execute_in_seconds=1)\n            # processed task with task_number=0 time_to_execute_in_seconds=1 behavior=normal-sleep\n            # processed task with task_number=1 time_to_execute_in_seconds=1 behavior=normal-sleep\n            # processed task with task_number=2 time_to_execute_in_seconds=1 behavior=normal-sleep\n            # processed task with task_number=3 time_to_execute_in_seconds=1 behavior=normal-sleep\n            # processed task with task_number=4 time_to_execute_in_seconds=1 behavior=normal-sleep\n\n    # \u003e elapsed time: 5.01 seconds\n```\n\nOnly 5 tasks, but it's already getting annoying.\n\nLet's parallelize them. We'll use *asyncio.gather* for it. 
It takes coroutines (or other awaitables) as positional arguments and runs them concurrently on the event loop.\n\n```python\nasync def main():\n    with timer():\n        coroutines = [\n            normal_task(task_number=task_number, time_to_execute_in_seconds=1)\n            for task_number in range(5)\n        ]\n        await asyncio.gather(*coroutines)\n\n        # \u003e processed task with task_number=0 time_to_execute_in_seconds=1 behavior=normal-sleep\n        # \u003e processed task with task_number=1 time_to_execute_in_seconds=1 behavior=normal-sleep\n        # \u003e processed task with task_number=2 time_to_execute_in_seconds=1 behavior=normal-sleep\n        # \u003e processed task with task_number=3 time_to_execute_in_seconds=1 behavior=normal-sleep\n        # \u003e processed task with task_number=4 time_to_execute_in_seconds=1 behavior=normal-sleep\n\n    # \u003e elapsed time: 1.00 seconds\n```\n\n5 tasks done in just 1 second.\n\n### Process 1_000_000 tasks\n\nThere are several problems if we take the previous approach with *asyncio.gather*:\n\n```txt\n- I/O-bound tasks almost always involve side effects: firing a million requests at once would effectively be a DDoS attack on the services you interact with\n- A million coroutines is memory-demanding\n- Every running coroutine adds scheduling work for the asyncio event loop, so throughput degrades as the number of simultaneously running coroutines grows\n```\n\nSo there are problems to solve:\n```txt\n- Do not DDoS the services\n- Keep memory usage acceptable\n- Do not overload the asyncio runtime with too many simultaneous coroutines (it's not Erlang/Elixir)\n```\n\nThe approach we'll take puts no limit on the number of tasks to process; 1_000_000 * 1000 is okay too.\n\n#### The approach\n\nProducers put tasks in a queue. Consumers process the tasks. I'll demonstrate an example with 1 producer and 50 consumers.\n\nYou probably don't want to hold 1_000_000 parameters for tasks in memory simultaneously; you'd use lazy sequences instead. 
For example, you'd fetch results page by page (in subsets) from a database, an API, etc.\n\nOur first producer will produce tasks that don't fail, so our consumers can skip error handling for now.\n\n```python\n# Poison pill signifies that consumers should not wait for more tasks from a queue\nPOISON_PILL = object()\n\nasync def producer_of_normal_tasks(task_queue: asyncio.Queue, max_tasks: int) -\u003e None:\n    # Producer gets items from some source and puts coroutines in a queue\n    for task_number in range(max_tasks):\n        task = normal_task(task_number=task_number, time_to_execute_in_seconds=1)\n        # When the queue is filled, the producer awaits free space.\n        await task_queue.put(task)\n\n    # Usually you would use a logger with info/warning level for this message\n    print(\"poison pill put in queue\")\n    await task_queue.put(POISON_PILL)\n\n\nasync def consumer_of_normal_tasks(task_queue: asyncio.Queue):\n    # Consumer perpetually gets items to process from the queue\n    # and terminates upon a poison pill\n    while True:\n        task = await task_queue.get()\n\n        if task is POISON_PILL:\n            # Since the producer put only one instance of a poison pill\n            # (the producer has no knowledge of consumer count)\n            # each consumer will consume a poison pill\n            # and put a new one for the next (possible) consumer\n            await task_queue.put(POISON_PILL)\n            task_queue.task_done()\n            # No more tasks coming, so consumer must terminate\n            return\n\n        # Process the task\n        await task\n        # Mark the item as processed. get() has already removed it from the queue,\n        # so no other consumer can receive it; task_done() is what lets\n        # task_queue.join() know that the work on this item is finished\n        task_queue.task_done()\n```\n\n\u003e **Important:** In a JoinableQueue (or asyncio’s Queue when used with task_done/join semantics), each put() increments an internal “unfinished tasks” counter, and each task_done() decrements it. 
If an item is get()’d but never has task_done() called for it, then:\n\u003e\n\u003e 1. The unfinished tasks counter remains stuck at a higher value than it should.  \n\u003e 2. Any code awaiting queue.join() (which waits for the unfinished tasks counter to go back down to 0) will block indefinitely.  \n\u003e\n\u003e So effectively, the queue remains under the impression that one or more tasks are still being processed. Without the matching task_done() call, the “I’m finished with this item” signal is never given, leading to a permanent or long-lasting block on join.\n\n```txt\nSo:\n- an asyncio.Queue is not for every use case\n- for every task you get() from a queue, you *must* call asyncio.Queue.task_done\n```\n\n```python\nasync def main():\n    with timer():\n        # Let's process 1_000 tasks with this approach (1_000_000 would take too long to wait for)\n        TASKS_TO_PROCESS = 1000\n        # Let's have 50 consumers\n        CONSUMER_COUNT = 50\n\n        # So what is the expected execution time?\n        # 1 task = 1 second\n        # 50 consumers have the processing power of 50 tasks per second\n        # 1000 tasks / 50 tasks per second = 20 seconds\n        # so 20 seconds\n\n        task_queue = asyncio.Queue(\n            # The right size depends on the workload; it is not the central point here\n            maxsize=CONSUMER_COUNT * 2,\n        )\n\n        # Generate consumer coroutines\n        consumers = (\n            consumer_of_normal_tasks(task_queue) for _ in range(CONSUMER_COUNT)\n        )\n\n        the_producer = producer_of_normal_tasks(\n            task_queue=task_queue,\n            max_tasks=TASKS_TO_PROCESS,\n        )\n\n        # Start the produce-consume process\n        await asyncio.gather(\n            the_producer,\n            *consumers,\n        )\n        # \u003e processed task with task_number=0 time_to_execute_in_seconds=1 behavior=normal-sleep\n        # \u003e ...\n        # \u003e processed task with task_number=849 time_to_execute_in_seconds=1 
behavior=normal-sleep\n        # \u003e poison pill put in queue\n        # \u003e processed task with task_number=850 time_to_execute_in_seconds=1 behavior=normal-sleep\n        # \u003e ...\n        # \u003e processed task with task_number=999 time_to_execute_in_seconds=1 behavior=normal-sleep\n\n    # \u003e elapsed time: 20.04 seconds\n\n    # So we've got what we expected\n```\n\n### Process unreliable tasks\n\nCases to handle:\n```txt\n- Timeouts\n- Expected exceptions\n- Wildcard exceptions\n```\n\n```python\nimport asyncio\nimport pydantic\n\nPOISON_PILL = object()\n\n\nasync def producer_of_unusual_tasks(task_queue: asyncio.Queue) -\u003e None:\n    # Puts coroutines in the queue:\n    #   - normal_task\n    #   - unusually_long_to_execute_task\n    #   - task_that_raises_specified_exception\n    #   - task_that_raises_unspecified_exception\n\n    def get_sequence_number_generator():\n        number = 0\n        while True:\n            yield number\n            number += 1\n\n    sequence_number_generator = get_sequence_number_generator()\n    get_sequence_number = lambda: next(sequence_number_generator)\n\n    normal_task_coroutine = normal_task(\n        task_number=get_sequence_number(), time_to_execute_in_seconds=1\n    )\n    await task_queue.put(normal_task_coroutine)\n\n    # Pretend that it's a stuck task\n    unusually_long_to_execute_task_coroutine = normal_task(\n        task_number=get_sequence_number(), time_to_execute_in_seconds=1000\n    )\n    await task_queue.put(unusually_long_to_execute_task_coroutine)\n\n    # Exception is specified in the docstring\n    async def task_that_raises_specified_exception(\n        task_number: pydantic.NonNegativeInt,\n    ):\n        \"\"\"\n        Raises:\n            ValueError: Always raised when the function is called.\n        \"\"\"\n        raise ValueError\n\n    task_that_raises_specified_exception_coro = task_that_raises_specified_exception(\n        task_number=get_sequence_number()\n    )\n   
 await task_queue.put(task_that_raises_specified_exception_coro)\n\n    # To demonstrate/prove that wildcard exception handling is a must\n    # for consumer coroutine protection\n    async def task_that_raises_unspecified_exception(\n        task_number: pydantic.NonNegativeInt,\n    ) -\u003e None:\n        raise ZeroDivisionError(\"did not expect that?\")\n\n    task_that_raises_unspecified_exception_coro = (\n        task_that_raises_unspecified_exception(task_number=get_sequence_number())\n    )\n    await task_queue.put(task_that_raises_unspecified_exception_coro)\n\n    print(f\"poison pill put in queue\")\n    await task_queue.put(POISON_PILL)\n\n\nasync def consumer_of_unusual_tasks(task_queue: asyncio.Queue):\n    # The goal is to not let this consumer (worker) die or get stuck for too long\n\n    while True:\n        task = await task_queue.get()\n\n        if task is POISON_PILL:\n            await task_queue.put(POISON_PILL)\n            task_queue.task_done()\n            return\n\n        while True:\n            try:\n                # asyncio.wait_for takes a coroutine and a timeout value\n                # and raises asyncio.TimeoutError if the coroutine did not succeed during the given time,\n                # so it solves unusually_long_to_execute_task\n                await asyncio.wait_for(\n                    task, timeout=10\n                )  # In a real world scenario, 10 seconds might be too short\n                break\n            except asyncio.TimeoutError as e:\n                # Usually you would retry a few times with exponential backoff before giving up\n                break\n            except ValueError as e:\n                # We know that task_that_raises_specified_exception raises this exception\n                # so handle appropriately\n                break\n            except Exception as e:\n                # For such cases as task_that_raises_unspecified_exception_coro and\n                # generally wild protection 
is a must\n                #\n                # I'd retry a few times before giving up (each retry needs a fresh coroutine,\n                # because a coroutine object can only be awaited once)\n                break\n\n        # After we've tried everything we could, we mark it as done\n        task_queue.task_done()\n\n\nasync def main():\n    with timer():\n        task_queue = asyncio.Queue()\n        await asyncio.gather(\n            producer_of_unusual_tasks(task_queue),\n            consumer_of_unusual_tasks(task_queue),\n            return_exceptions=True,  # strongly recommended: exceptions are returned as results instead of propagating\n        )\n        # \u003e poison pill put in queue\n        # \u003e processed task with task_number=0 time_to_execute_in_seconds=1 behavior=normal-sleep\n\n    # \u003e elapsed time: 11.01 seconds\n\n    # As a result, we've handled common problems with event processing\n    # and protected the consumers from dying/being stuck\n```\n\n### Auto-adjust the number of consumers based on service availability/quotas\n\nChapter TODO\n\n## Further questions you might ask after implementing something like this:\n\n```txt\n- What if the process crashes?\n- What if the process restarts?\n- Observability/alerting/logging?\n```\n\n## Code repository [GitHub](https://github.com/phantie/python_io_parallel_processing_article)","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fphantie%2Fpython_io_parallel_processing_article","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fphantie%2Fpython_io_parallel_processing_article","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fphantie%2Fpython_io_parallel_processing_article/lists"}