Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/jimblandy/context-switch

Comparison of Rust async and Linux thread context switch time.
https://github.com/jimblandy/context-switch

async context-switches linux measure rust-async thread

Last synced: 19 days ago
JSON representation

Comparison of Rust async and Linux thread context switch time.

Awesome Lists containing this project

README

        

# Comparison of Rust async and Linux thread context switch time and memory use

These are a few programs that try to measure context switch time and task memory
use in various ways. In summary:

- A context switch takes around 0.2µs between async tasks, versus 1.7µs
between kernel threads. But this advantage goes away if the context switch
is due to I/O readiness: both converge to 1.7µs. The async advantage also
goes away in our microbenchmark if the program is pinned to a single core.
So inter-core communication is something to watch out for.

- Creating a new task takes ~0.3µs for an async task, versus ~17µs for a new
kernel thread.

- Memory consumption per task (i.e. for a task that doesn't do much) starts at
around a few hundred bytes for an async task, versus around 20KiB (9.5KiB
user, 10KiB kernel) for a kernel thread. This is a minimum: more demanding
tasks will naturally use more.

- It's no problem to create 250,000 async tasks, but I was only able to get my
laptop to run 80,000 threads (4 core, two way HT, 32GiB), even after raising
every limit I could find. So I don't know what's imposing this limit. See
"Running tests with large numbers of threads", below.

These are probably not the limiting factors in your application, but it's nice
to know that the headroom is there.

## Measuring thread context switch time

The programs `thread-brigade` and `async-brigade` each create 500 tasks
connected by pipes (like a “bucket brigade”) and measure how long it takes to
propagate a single byte from the first to the last. One is implemented with
threads, and the other is implemented with the Tokio crate's async I/O.

$ cd async-brigade/
$ /bin/time cargo run --release
Finished release [optimized] target(s) in 0.02s
Running `/home/jimb/rust/context-switch/target/release/async-brigade`
500 tasks, 10000 iterations:
mean 1.795ms per iteration, stddev 82.016µs (3.589µs per task per iter)
9.83user 8.33system 0:18.19elapsed 99%CPU (0avgtext+0avgdata 17144maxresident)k
0inputs+0outputs (0major+2283minor)pagefaults 0swaps
$

$ cd ../thread-brigade
$ /bin/time cargo run --release
Finished release [optimized] target(s) in 0.02s
Running `/home/jimb/rust/context-switch/target/release/thread-brigade`
500 tasks, 10000 iterations:
mean 2.657ms per iteration, stddev 231.822µs (5.313µs per task per iter)
9.14user 27.88system 0:26.91elapsed 137%CPU (0avgtext+0avgdata 16784maxresident)k
0inputs+0outputs (0major+3381minor)pagefaults 0swaps
$

In these runs, I'm seeing 18.19s / 26.91s ≅ 0.68 or a 30% speedup from going
async. However, if I pin the threaded version to a single core, the speed
advantage of async disappears:

$ taskset --cpu-list 1 /bin/time cargo run --release
Finished release [optimized] target(s) in 0.02s
Running `/home/jimb/rust/context-switch/target/release/thread-brigade`
500 tasks, 10000 iterations:
mean 1.709ms per iteration, stddev 102.926µs (3.417µs per task per iter)
4.81user 12.50system 0:17.37elapsed 99%CPU (0avgtext+0avgdata 16744maxresident)k
0inputs+0outputs (0major+3610minor)pagefaults 0swaps
$

I don't know why.

It would be interesting to see whether/how the number of tasks in the brigade
affects these numbers.

Per-thread resident memory use in `thread-brigade` is about 9.5KiB, whereas
per-async-task memory use in `async-brigade` is around 0.4KiB, a factor of ~20.
See 'Measuring memory use', below.

There are differences in the system calls performed by the two versions:

- In `thread-brigade`, each task does a single `recvfrom` and a `write` per
iteration, taking 5.5µs.

- In `async-brigade`, each task does one `recvfrom` and one `write`, neither of
which block, and then one more `recvfrom`, which returns `EAGAIN` and suspends
the task. Then control returns to the executor. The reactor thread calls
`epoll` to see which pipes are readable, and tells the executor which task to
run next. All this takes 3.6µs.

- In `one-thread-brigade`, we build the pipes but just have a single thread loop
through them all and do the reads and writes. This gives us a baseline cost
for the I/O operations themselves, which we can subtract off from the times in
the other two programs, in hopes that the remainder reflects the cost of the
context switches alone.

The `async-brigade` performance isn't affected much if we switch from Tokio's
default multi-thread executor to a single-threaded executor, so it's not
spending much time in kernel context switches. `thread-brigade` does a kernel
context switch from each task to the next. I think this means that context
switches are more expensive than a `recvfrom` and `epoll` system call.

If we run the test with 50000 tasks (and reduce the number of iterations to
100), the speedup doesn't change much, but `thread-brigade` requires a 466MiB
resident set, whereas `async-brigade` runs in around 21MiB. That's 10kiB of
memory being actively touched by each task, versus 0.4kiB, about a twentieth.
This isn't just the effect of pessimistically-sized thread stacks: we're looking
at the resident set size, which shouldn't include pages allocated to the stack
that the thread never actually touches. So the way Rust right-sizes futures
seems really effective.

This microbenchmark doesn't do much, but a real application would add to each
task's working set, and that difference might become less significant. But I was
able to run async-brigade with 250,000 tasks; I wasn't able to get my laptop
to run 250,000 threads at all.

The other programs are minor variations, or make other measurements:

- `async-mem-brigade` uses `tokio:sync::mpsc` channels to send `usize` values
from one async channel to another. This performs the same number of
task-to-task switches, but avoids the overhead of the pipe I/O. It seems
that Tokio's channels do use futexes on Linux to signal readiness.

- `one-thread-brigade` attempts to measure the cost of the pipe I/O alone, by
creating all the pipes but having a single thread do all the reading and
writing to propagate the byte from the first to the last.

- `thread-creation` and `async-creation` attempt to measure the time
required to create a thread / async task.

## Measuring memory use

The scripts `thread-brigade/rss-per-thread.sh` and
`async-brigade/rss-per-task.sh` run their respective brigade microbenchmarks
with varying numbers of tasks, and measure the virtual and resident memory
consumption at each count. You can then do a linear regression to see the memory
use of a single task. Note that `async-brigade/rss-per-task.sh` runs 10x as many
tasks, to keep the noise down.

As mentioned above, in my measurements, each thread costs around 9.5KiB, and
each async task costs around 0.4KiB, so the async version uses about 1/20th as
much memory as the threaded version.

To run this script, you'll need to have the Linux `pmap` utility installed; this
gives an accurate measurement of resident set size. On Fedora, this is included
in the `procps-ng` package. (Pull requests for info about other major
distributions welcome.)

## Running tests with large numbers of threads

It's interesting to play with the number of tasks to see how that affects the
relative speed of the async and threaded bucket brigades. But in order to test
large numbers of threads, you may need to remove some of your system's
guardrails.

On Linux:

- You will run out of file descriptors. Each task needs two file descriptors,
one for the reading end of the upstream pipe, and one for the writing end of
the downstream pipe. The process also needs a few file descriptors for
miscellaneous purposes. For 50000 tasks, say:

$ ulimit -n 100010

- You may run out of process id numbers. Each thread needs its own pid. So,
perhaps something like:

$ sudo sysctl kernel.pid_max=4194304

This is overkill, but why worry about this? (The number above is the default
in Fedora 33, 4 × 1024 × 1024; apparently systemd was worried about pid
rollover.)

- You will run out of memory map areas. Each thread has its own stack, with an
unmapped guard page at the low end to catch stack overflows. There seem to
be other constraints as well. In practice, this seems to work for 50000
tasks:

$ sudo sysctl vm.max_map_count=200000

- Process ID numbers can also be limited by the `pids` cgroup controller.

A cgroup is a collection of processes on which you can impose system
resource limits as a group. Every process belongs to exactly one cgroup.
When one process creates another, the new process is placed in the same
cgroup as its parent.

Cgroups are arranged in a tree, where limits set on a cgroup apply to that
group and all its descendants. Only leaf cgroups actually contain
processes/threads. The cgroups in the hierarchy have names that look like
filesystem paths; the root cgroup is named `/`.

You can see which cgroup your shell belongs to like this:

$ cat /proc/$$/cgroup
0::/user.slice/user-1000.slice/gargle/howl.scope

This indicates that my shell is in a cgroup named
`/user.slice/user-1000.slice/gargle/howl.scope`. The names can get quite
long, so this example is simplified.

On Fedora, at least, the cgroup hierarchy is reflected in the ordinary
filesystem as a directory tree under `/sys/fs/cgroup`, so my shell's
cgroup appears as a directory here:

$ ls /sys/fs/cgroup/user.slice/user-1000.slice/gargle/howl.scope
cgroup.controllers cpu.stat memory.pressure
cgroup.events io.pressure memory.stat
cgroup.freeze memory.current memory.swap.current
cgroup.max.depth memory.events memory.swap.events
cgroup.max.descendants memory.events.local memory.swap.high
cgroup.procs memory.high memory.swap.max
cgroup.stat memory.low pids.current
cgroup.subtree_control memory.max pids.events
cgroup.threads memory.min pids.max
cgroup.type memory.numa_stat
cpu.pressure memory.oom.group
$

You can inspect and manipulate cgroups by looking at these files. Some
represent different resources that can be limited, while others relate to
the cgroup hierarchy itself.

In particular, the file `pids.max` shows the limit this cgroup imposes on my
shell:

$ cat /sys/fs/cgroup/user.slice/user-1000.slice/gargle/howl.scope/pids.max
max
$

A limit of `max` means that there's no limit. But limits set on parent
cgroups also apply to their descendants, so we need to check our ancestor
groups:

$ cat /sys/fs/cgroup/user.slice/user-1000.slice/gargle/pids.max
10813
$ cat /sys/fs/cgroup/user.slice/user-1000.slice/pids.max
84184
$ cat /sys/fs/cgroup/user.slice/pids.max
max
$ cat /sys/fs/cgroup/pids.max
cat: /sys/fs/cgroup/pids.max: No such file or directory
$

Apparently there's a limit of 10813 pids imposed by my shell's cgroup's
parent, and a higher limit of 84184 pids set for me as a user. (On Fedora,
these limits are established by systemd configuration files.) To raise that
limit, we can simply write another value to these files, as root:

$ sudo sh -c 'echo 100000 > /sys/fs/cgroup/user.slice/user-1000.slice/pids.max'
$ sudo sh -c 'echo max > /sys/fs/cgroup/user.slice/user-1000.slice/gargle/pids.max'

The cgroup machinery seems to vary not only from one Linux distribution to
the next, but even from one version to another. So while I hope this is
helpful, you may need to consult other documentation. `man cgroups(7)` is a
good place to start, but beware, it makes my explanation here look short.

- The kernel parameter `kernel.threads-max` is a system-wide limit on the
number of threads. You probably won't run into this.

$ sysctl kernel.threads-max
kernel.threads-max = 255208
$

- There is a limit on the number of processes that can run under a given real
user ID:

$ ulimit -u
127604
$

At the system call level, this is the `getrlimit(2)` system call's
`RLIMIT_NPROC` resource. This, too, you're unlikely to run into.

- The default thread stack size is 8MiB:

$ ulimit -s
8192
$

You might expect this to limit a 32GiB (x86_64) machine to 4096 threads, but
the kernel only allocates physical memory to a stack as the thread touches
its pages, so the initial memory consumption of a thread in user space is
actually only around 8kiB. At this size, 32GiB could accommodate 4Mi
threads. Again, this is unlikely to be the limiting factor.

Although it doesn't matter, `thread-brigade` program in this repository
requests a 1MiB stack for each thread, which is plenty for our purposes.

With these changes made, I was able to run `thread-brigade` with 80000 tasks. I
tried to run more, but even after raising every limit I could identify, I still
got errors. So I don't know what imposes this limit.

## Does any of this matter?

In GitHub issue #1, @spacejam raised a good point:

> overall, there are a lot of things here that really fade into insignificance
> when you consider the simple effort required to deserialize JSON or handle
> TLS. People often see that there's some theoretical benefit of async and then
> they accept far less ergonomic coding styles and the additional bug classes
> that only happen on async due to accidental blocking etc... despite the fact
> that when you consider a real-world deployed application, those "benefits"
> become indistinguishable from noise. However, due to the additional bug
> classes and worse ergonomics, there is now less energy for actually optimizing
> the business logic, which is where all of the cycles and resource use are
> anyway, so in-practice async implementations tend to be buggier and slower.

Below is my reply to them, lightly edited:

> I have a few responses to this.
>
> First of all, the reason I carried out the experiments in this repo in the
> first place was that I basically agreed with all of your points here. I think
> async is wildly oversold as "faster" without any real investigation into why
> that would be. It is hard to pin down exactly how the alleged advantages would
> arise. The same I/O operations have to be carried out either way (or worse);
> kernel context switches have been heavily optimized over the years (although
> the Spectre mitigations made them worse); and the whole story of the creation
> of NPTL was about it beating IBM's competing M-on-N thread implementation
> (which I see as analogous to async task systems) in the very microbenchmarks
> in which the M-on-N thread library was expected to have an advantage.
>
> However, in conversations that I sought out with people with experience
> implementing high-volume servers, both with threads and with async designs, my
> async skepticism met a lot of pushback. They consistently reported struggling
> with threaded designs and not being able to get performance under control until
> they went async. Big caveat: they were not using Rust - these were older designs
> in C++ and even C. But it jibes well with the other successful designs you see
> out there, like nginx and Elixir (which is used by WhatsApp, among others),
> which are all essentially async.
>
> So the purpose of these experiments was to see if I could isolate some of the
> sources of async's apparent advantages. It came down to memory consumption,
> creation time, and context switch time each having best-case
> order-of-magnitude advantages. Taken together, those advantages are beyond the
> point that I'm willing to call negligible. How often the best case actually
> arises is unclear, but one can argue that that, at least, is under the
> programmer's control, so the ceiling on how far implementation effort can get
> you is higher, in an async design.
>
> Ultimately, as far as this repo is concerned, you need to decide whether you
> trust your readers to understand both the value and the limitations of
> microbenchmarks. If you assume your readers are in Twitter mode---they're just
> going to glance at the headlines and come away with a binary, "async good, two
> legs bad" kind of conclusion---then maybe it's better not to publish
> microbenchmarks at all, because they're misleading. Reality is more sensitive to
> details. But I think the benefit of offering these microbenchmarks and the
> README's analysis to careful readers might(?) outweigh the harm done by the
> noise from careless readers, because I think the careful readers are more likely
> to use the material in a way that has lasting impact. The wind changes; the
> forest does not.
>
> The 2nd edition of Programming Rust (due out in June 2021) has a chapter on
> async that ends with a discussion of the rationale for async programming. It
> tries to dismiss some of the commonly heard bogus arguments, and present the
> advantages that async does have with the appropriate qualifications. It
> mentions tooling disadvantages. Generally, the chapter describes Rust's async
> implementation in a decent amount of detail, because we want our readers to be
> able to anticipate how it will perform and where it might help; the summary
> attempts to make clear what all that machinery can and cannot accomplish.

The only thing I'd add is that the measurements reported here for asynchronous
performance were taken of an implementation that uses `epoll`-style system
calls. The newer `io_uring`-style APIs seem radically different, and I'm curious
to see whether these might change the story here.