{"id":13706258,"url":"https://github.com/jimblandy/context-switch","last_synced_at":"2025-05-15T12:06:36.065Z","repository":{"id":37627667,"uuid":"224944398","full_name":"jimblandy/context-switch","owner":"jimblandy","description":"Comparison of Rust async and Linux thread context switch time.","archived":false,"fork":false,"pushed_at":"2024-11-16T01:45:27.000Z","size":115,"stargazers_count":727,"open_issues_count":2,"forks_count":21,"subscribers_count":21,"default_branch":"master","last_synced_at":"2025-05-10T00:44:08.741Z","etag":null,"topics":["async","context-switches","linux","measure","rust-async","thread"],"latest_commit_sha":null,"homepage":null,"language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jimblandy.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-11-30T01:37:36.000Z","updated_at":"2025-04-29T08:59:54.000Z","dependencies_parsed_at":"2025-02-08T15:01:33.801Z","dependency_job_id":null,"html_url":"https://github.com/jimblandy/context-switch","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jimblandy%2Fcontext-switch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jimblandy%2Fcontext-switch/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jimblandy%2Fcontext-switch/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jimblandy%2Fcontext-switch/manifests","owner_url":"https://repos.ecosyste.ms/api/v
1/hosts/GitHub/owners/jimblandy","download_url":"https://codeload.github.com/jimblandy/context-switch/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254337613,"owners_count":22054253,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["async","context-switches","linux","measure","rust-async","thread"],"created_at":"2024-08-02T22:00:53.627Z","updated_at":"2025-05-15T12:06:31.049Z","avatar_url":"https://github.com/jimblandy.png","language":"Rust","readme":"# Comparison of Rust async and Linux thread context switch time and memory use\n\nThese are a few programs that try to measure context switch time and task memory\nuse in various ways. In summary:\n\n-   A context switch takes around 0.2µs between async tasks, versus 1.7µs\n    between kernel threads. But this advantage goes away if the context switch\n    is due to I/O readiness: both converge to 1.7µs. The async advantage also\n    goes away in our microbenchmark if the program is pinned to a single core.\n    So inter-core communication is something to watch out for.\n\n-   Creating a new task takes ~0.3µs for an async task, versus ~17µs for a new\n    kernel thread.\n\n-   Memory consumption per task (i.e. for a task that doesn't do much) starts at\n    around a few hundred bytes for an async task, versus around 20KiB (9.5KiB\n    user, 10KiB kernel) for a kernel thread. 
This is a minimum: more demanding\n    tasks will naturally use more.\n\n-   It's no problem to create 250,000 async tasks, but I was only able to get my\n    laptop to run 80,000 threads (4 core, two way HT, 32GiB), even after raising\n    every limit I could find. So I don't know what's imposing this limit. See\n    \"Running tests with large numbers of threads\", below.\n\nThese are probably not the limiting factors in your application, but it's nice\nto know that the headroom is there.\n\n## Measuring thread context switch time\n\nThe programs `thread-brigade` and `async-brigade` each create 500 tasks\nconnected by pipes (like a “bucket brigade”) and measure how long it takes to\npropagate a single byte from the first to the last. One is implemented with\nthreads, and the other is implemented with the Tokio crate's async I/O.\n\n    $ cd async-brigade/\n    $ /bin/time cargo run --release\n        Finished release [optimized] target(s) in 0.02s\n         Running `/home/jimb/rust/context-switch/target/release/async-brigade`\n    500 tasks, 10000 iterations:\n    mean 1.795ms per iteration, stddev 82.016µs (3.589µs per task per iter)\n    9.83user 8.33system 0:18.19elapsed 99%CPU (0avgtext+0avgdata 17144maxresident)k\n    0inputs+0outputs (0major+2283minor)pagefaults 0swaps\n    $\n\n    $ cd ../thread-brigade\n    $ /bin/time cargo run --release\n        Finished release [optimized] target(s) in 0.02s\n         Running `/home/jimb/rust/context-switch/target/release/thread-brigade`\n    500 tasks, 10000 iterations:\n    mean 2.657ms per iteration, stddev 231.822µs (5.313µs per task per iter)\n    9.14user 27.88system 0:26.91elapsed 137%CPU (0avgtext+0avgdata 16784maxresident)k\n    0inputs+0outputs (0major+3381minor)pagefaults 0swaps\n    $\n\nIn these runs, I'm seeing 18.19s / 26.91s ≅ 0.68 or a 30% speedup from going\nasync. 
However, if I pin the threaded version to a single core, the speed\nadvantage of async disappears:\n\n    $ taskset --cpu-list 1 /bin/time cargo run --release\n        Finished release [optimized] target(s) in 0.02s\n         Running `/home/jimb/rust/context-switch/target/release/thread-brigade`\n    500 tasks, 10000 iterations:\n    mean 1.709ms per iteration, stddev 102.926µs (3.417µs per task per iter)\n    4.81user 12.50system 0:17.37elapsed 99%CPU (0avgtext+0avgdata 16744maxresident)k\n    0inputs+0outputs (0major+3610minor)pagefaults 0swaps\n    $\n\nI don't know why.\n\nIt would be interesting to see whether/how the number of tasks in the brigade\naffects these numbers.\n\nPer-thread resident memory use in `thread-brigade` is about 9.5KiB, whereas\nper-async-task memory use in `async-brigade` is around 0.4KiB, a factor of ~20.\nSee 'Measuring memory use', below.\n\nThere are differences in the system calls performed by the two versions:\n\n- In `thread-brigade`, each task does a single `recvfrom` and a `write` per\n  iteration, taking 5.5µs.\n\n- In `async-brigade`, each task does one `recvfrom` and one `write`, neither of\n  which block, and then one more `recvfrom`, which returns `EAGAIN` and suspends\n  the task. Then control returns to the executor. The reactor thread calls\n  `epoll` to see which pipes are readable, and tells the executor which task to\n  run next. All this takes 3.6µs.\n\n- In `one-thread-brigade`, we build the pipes but just have a single thread loop\n  through them all and do the reads and writes. This gives us a baseline cost\n  for the I/O operations themselves, which we can subtract off from the times in\n  the other two programs, in hopes that the remainder reflects the cost of the\n  context switches alone.\n\nThe `async-brigade` performance isn't affected much if we switch from Tokio's\ndefault multi-thread executor to a single-threaded executor, so it's not\nspending much time in kernel context switches. 
`thread-brigade` does a kernel\ncontext switch from each task to the next. I think this means that context\nswitches are more expensive than a `recvfrom` and `epoll` system call.\n\nIf we run the test with 50000 tasks (and reduce the number of iterations to\n100), the speedup doesn't change much, but `thread-brigade` requires a 466MiB\nresident set, whereas `async-brigade` runs in around 21MiB. That's 10KiB of\nmemory being actively touched by each task, versus 0.4KiB, about a twentieth.\nThis isn't just the effect of pessimistically-sized thread stacks: we're looking\nat the resident set size, which shouldn't include pages allocated to the stack\nthat the thread never actually touches. So the way Rust right-sizes futures\nseems really effective.\n\nThis microbenchmark doesn't do much, but a real application would add to each\ntask's working set, and that difference might become less significant. But I was\nable to run async-brigade with 250,000 tasks; I wasn't able to get my laptop\nto run 250,000 threads at all.\n\nThe other programs are minor variations, or make other measurements:\n\n-   `async-mem-brigade` uses `tokio::sync::mpsc` channels to send `usize` values\n    from one async task to another. This performs the same number of\n    task-to-task switches, but avoids the overhead of the pipe I/O. 
It seems\n    that Tokio's channels do use futexes on Linux to signal readiness.\n\n-   `one-thread-brigade` attempts to measure the cost of the pipe I/O alone, by\n    creating all the pipes but having a single thread do all the reading and\n    writing to propagate the byte from the first to the last.\n\n-   `thread-creation` and `async-creation` attempt to measure the time\n    required to create a thread / async task.\n\n## Measuring memory use\n\nThe scripts `thread-brigade/rss-per-thread.sh` and\n`async-brigade/rss-per-task.sh` run their respective brigade microbenchmarks\nwith varying numbers of tasks, and measure the virtual and resident memory\nconsumption at each count. You can then do a linear regression to see the memory\nuse of a single task. Note that `async-brigade/rss-per-task.sh` runs 10x as many\ntasks, to keep the noise down.\n\nAs mentioned above, in my measurements, each thread costs around 9.5KiB, and\neach async task costs around 0.4KiB, so the async version uses about 1/20th as\nmuch memory as the threaded version.\n\nTo run this script, you'll need to have the Linux `pmap` utility installed; this\ngives an accurate measurement of resident set size. On Fedora, this is included\nin the `procps-ng` package. (Pull requests for info about other major\ndistributions welcome.)\n\n## Running tests with large numbers of threads\n\nIt's interesting to play with the number of tasks to see how that affects the\nrelative speed of the async and threaded bucket brigades. But in order to test\nlarge numbers of threads, you may need to remove some of your system's\nguardrails.\n\nOn Linux:\n\n-   You will run out of file descriptors. Each task needs two file descriptors,\n    one for the reading end of the upstream pipe, and one for the writing end of\n    the downstream pipe. The process also needs a few file descriptors for\n    miscellaneous purposes. For 50000 tasks, say:\n\n        $ ulimit -n 100010\n\n-   You may run out of process id numbers. 
Each thread needs its own pid. So,\n    perhaps something like:\n\n        $ sudo sysctl kernel.pid_max=4194304\n\n    This is overkill, but why worry about this? (The number above is the default\n    in Fedora 33, 4 × 1024 × 1024; apparently systemd was worried about pid\n    rollover.)\n\n-   You will run out of memory map areas. Each thread has its own stack, with an\n    unmapped guard page at the low end to catch stack overflows. There seem to\n    be other constraints as well. In practice, this seems to work for 50000\n    tasks:\n\n        $ sudo sysctl vm.max_map_count=200000\n\n-   Process ID numbers can also be limited by the `pids` cgroup controller.\n\n    A cgroup is a collection of processes on which you can impose system\n    resource limits as a group. Every process belongs to exactly one cgroup.\n    When one process creates another, the new process is placed in the same\n    cgroup as its parent.\n\n    Cgroups are arranged in a tree, where limits set on a cgroup apply to that\n    group and all its descendants. Only leaf cgroups actually contain\n    processes/threads. The cgroups in the hierarchy have names that look like\n    filesystem paths; the root cgroup is named `/`.\n\n    You can see which cgroup your shell belongs to like this:\n\n        $ cat /proc/$$/cgroup\n        0::/user.slice/user-1000.slice/gargle/howl.scope\n\n    This indicates that my shell is in a cgroup named\n    `/user.slice/user-1000.slice/gargle/howl.scope`. 
The names can get quite\n    long, so this example is simplified.\n\n    On Fedora, at least, the cgroup hierarchy is reflected in the ordinary\n    filesystem as a directory tree under `/sys/fs/cgroup`, so my shell's\n    cgroup appears as a directory here:\n\n        $ ls /sys/fs/cgroup/user.slice/user-1000.slice/gargle/howl.scope\n        cgroup.controllers\t    cpu.stat\t         memory.pressure\n        cgroup.events\t\t    io.pressure\t         memory.stat\n        cgroup.freeze\t\t    memory.current\t     memory.swap.current\n        cgroup.max.depth\t    memory.events\t     memory.swap.events\n        cgroup.max.descendants\tmemory.events.local  memory.swap.high\n        cgroup.procs\t\t    memory.high\t         memory.swap.max\n        cgroup.stat\t\t        memory.low\t         pids.current\n        cgroup.subtree_control\tmemory.max\t         pids.events\n        cgroup.threads\t\t    memory.min\t         pids.max\n        cgroup.type\t\t        memory.numa_stat\n        cpu.pressure\t\t    memory.oom.group\n        $\n\n    You can inspect and manipulate cgroups by looking at these files. Some\n    represent different resources that can be limited, while others relate to\n    the cgroup hierarchy itself.\n\n    In particular, the file `pids.max` shows the limit this cgroup imposes on my\n    shell:\n\n        $ cat /sys/fs/cgroup/user.slice/user-1000.slice/gargle/howl.scope/pids.max\n        max\n        $\n\n    A limit of `max` means that there's no limit. 
But limits set on parent\n    cgroups also apply to their descendants, so we need to check our ancestor\n    groups:\n\n        $ cat /sys/fs/cgroup/user.slice/user-1000.slice/gargle/pids.max\n        10813\n        $ cat /sys/fs/cgroup/user.slice/user-1000.slice/pids.max\n        84184\n        $ cat /sys/fs/cgroup/user.slice/pids.max\n        max\n        $ cat /sys/fs/cgroup/pids.max\n        cat: /sys/fs/cgroup/pids.max: No such file or directory\n        $\n\n    Apparently there's a limit of 10813 pids imposed by my shell's cgroup's\n    parent, and a higher limit of 84184 pids set for me as a user. (On Fedora,\n    these limits are established by systemd configuration files.) To raise that\n    limit, we can simply write another value to these files, as root:\n\n        $ sudo sh -c 'echo 100000 \u003e /sys/fs/cgroup/user.slice/user-1000.slice/pids.max'\n        $ sudo sh -c 'echo max    \u003e /sys/fs/cgroup/user.slice/user-1000.slice/gargle/pids.max'\n\n    The cgroup machinery seems to vary not only from one Linux distribution to\n    the next, but even from one version to another. So while I hope this is\n    helpful, you may need to consult other documentation. `man cgroups(7)` is a\n    good place to start, but beware, it makes my explanation here look short.\n\n-   The kernel parameter `kernel.threads-max` is a system-wide limit on the\n    number of threads. You probably won't run into this.\n\n        $ sysctl kernel.threads-max\n        kernel.threads-max = 255208\n        $\n\n-   There is a limit on the number of processes that can run under a given real\n    user ID:\n\n        $ ulimit -u\n        127604\n        $\n\n    At the system call level, this is the `getrlimit(2)` system call's\n    `RLIMIT_NPROC` resource. 
This, too, you're unlikely to run into.\n\n-   The default thread stack size is 8MiB:\n\n        $ ulimit -s\n        8192\n        $\n\n    You might expect this to limit a 32GiB (x86_64) machine to 4096 threads, but\n    the kernel only allocates physical memory to a stack as the thread touches\n    its pages, so the initial memory consumption of a thread in user space is\n    actually only around 8KiB. At this size, 32GiB could accommodate 4Mi\n    threads. Again, this is unlikely to be the limiting factor.\n\n    Although it doesn't matter, the `thread-brigade` program in this repository\n    requests a 1MiB stack for each thread, which is plenty for our purposes.\n\nWith these changes made, I was able to run `thread-brigade` with 80000 tasks. I\ntried to run more, but even after raising every limit I could identify, I still\ngot errors. So I don't know what imposes this limit.\n\n## Does any of this matter?\n\nIn GitHub issue #1, @spacejam raised a good point:\n\n\u003e overall, there are a lot of things here that really fade into insignificance\n\u003e when you consider the simple effort required to deserialize JSON or handle\n\u003e TLS. People often see that there's some theoretical benefit of async and then\n\u003e they accept far less ergonomic coding styles and the additional bug classes\n\u003e that only happen on async due to accidental blocking etc... despite the fact\n\u003e that when you consider a real-world deployed application, those \"benefits\"\n\u003e become indistinguishable from noise. 
However, due to the additional bug\n\u003e classes and worse ergonomics, there is now less energy for actually optimizing\n\u003e the business logic, which is where all of the cycles and resource use are\n\u003e anyway, so in-practice async implementations tend to be buggier and slower.\n\nBelow is my reply to them, lightly edited:\n\n\u003e I have a few responses to this.\n\u003e\n\u003e First of all, the reason I carried out the experiments in this repo in the\n\u003e first place was that I basically agreed with all of your points here. I think\n\u003e async is wildly oversold as \"faster\" without any real investigation into why\n\u003e that would be. It is hard to pin down exactly how the alleged advantages would\n\u003e arise. The same I/O operations have to be carried out either way (or worse);\n\u003e kernel context switches have been heavily optimized over the years (although\n\u003e the Spectre mitigations made them worse); and the whole story of the creation\n\u003e of NPTL was about it beating IBM's competing M-on-N thread implementation\n\u003e (which I see as analogous to async task systems) in the very microbenchmarks\n\u003e in which the M-on-N thread library was expected to have an advantage.\n\u003e\n\u003e However, in conversations that I sought out with people with experience\n\u003e implementing high-volume servers, both with threads and with async designs, my\n\u003e async skepticism met a lot of pushback. They consistently reported struggling\n\u003e with threaded designs and not being able to get performance under control until\n\u003e they went async. Big caveat: they were not using Rust - these were older designs\n\u003e in C++ and even C. 
But it jibes well with the other successful designs you see\n\u003e out there, like nginx and Elixir (which is used by WhatsApp, among others),\n\u003e which are all essentially async.\n\u003e\n\u003e So the purpose of these experiments was to see if I could isolate some of the\n\u003e sources of async's apparent advantages. It came down to memory consumption,\n\u003e creation time, and context switch time each having best-case\n\u003e order-of-magnitude advantages. Taken together, those advantages are beyond the\n\u003e point that I'm willing to call negligible. How often the best case actually\n\u003e arises is unclear, but one can argue that that, at least, is under the\n\u003e programmer's control, so the ceiling on how far implementation effort can get\n\u003e you is higher, in an async design.\n\u003e\n\u003e Ultimately, as far as this repo is concerned, you need to decide whether you\n\u003e trust your readers to understand both the value and the limitations of\n\u003e microbenchmarks. If you assume your readers are in Twitter mode---they're just\n\u003e going to glance at the headlines and come away with a binary, \"async good, two\n\u003e legs bad\" kind of conclusion---then maybe it's better not to publish\n\u003e microbenchmarks at all, because they're misleading. Reality is more sensitive to\n\u003e details. But I think the benefit of offering these microbenchmarks and the\n\u003e README's analysis to careful readers might(?) outweigh the harm done by the\n\u003e noise from careless readers, because I think the careful readers are more likely\n\u003e to use the material in a way that has lasting impact. The wind changes; the\n\u003e forest does not.\n\u003e\n\u003e The 2nd edition of Programming Rust (due out in June 2021) has a chapter on\n\u003e async that ends with a discussion of the rationale for async programming. 
It\n\u003e tries to dismiss some of the commonly heard bogus arguments, and present the\n\u003e advantages that async does have with the appropriate qualifications. It\n\u003e mentions tooling disadvantages. Generally, the chapter describes Rust's async\n\u003e implementation in a decent amount of detail, because we want our readers to be\n\u003e able to anticipate how it will perform and where it might help; the summary\n\u003e attempts to make clear what all that machinery can and cannot accomplish.\n\nThe only thing I'd add is that the measurements reported here for asynchronous\nperformance were taken of an implementation that uses `epoll`-style system\ncalls. The newer `io_uring`-style APIs seem radically different, and I'm curious\nto see whether these might change the story here.\n","funding_links":[],"categories":["Rust","Linux","⛓️🧰🦀 web3 toolkit - rust edition"],"sub_categories":["Performance"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjimblandy%2Fcontext-switch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjimblandy%2Fcontext-switch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjimblandy%2Fcontext-switch/lists"}