OpenBSD Journal

Unlocking UVM faults yields significant performance boost

Contributed by Peter N. M. Hansteen from the no fault of UVM dept.

In a recent message to tech@, Martin Pieuchot (mpi@) wrote about his analysis of kernel lock contention. We reproduce the message(s) here, reformatted with his permission.

Unlocking UVM [virtual memory - Ed.] faults makes build times decrease a lot and improves the overall latency of mixed userland workloads. In other words, it gives a smoother feeling for "desktop usage": it is now possible to do 'make -j17' and watch an HD video at the same time.

So what next? The 4 flame graphs below were captured with patrick@ during the weekend. We used his 16-core arm64 desktop machine with amdgpu(4). They all include the UVM unlocking diff and one also includes the poll(2)/select(2) diff + unlocked sowakeup(). Web browsing was done with Iridium.

  1. make-j17_arm64.svg

    Building a kernel with 17 jobs is hard and only 30% of CPU time is spent in userland.

    • Overall spinning time is ~40% (18% on KERNEL_LOCK(), 10% on SCHED_LOCK(), 12% on UVM's pageqlock)
      • The UVM unlocking diff made the contention shift from the KERNEL_LOCK() to the global pageqlock and the per-amap rwlock. Due to the high contention on shared amaps in this workload, many threads go to sleep at the same time, which makes some contention appear on the SCHED_LOCK().
      • The SCHED_LOCK() is not *yet* a problem. What is happening here shows that our rwlock implementation, which relies on a global sleep queue, is suboptimal. However, in the case of UVM's `vmobjlock' we should hopefully be able to turn many of the existing write locks into read locks. NetBSD is already doing that, and this should be good enough to prevent some threads from going to sleep, thus avoiding contention on the SCHED_LOCK() (or any global lock for the sleep queue).
      • Contention on the pageqlock could be reduced by revisiting/adding per-page locking in UVM.
    • 10% of CPU time is spent idle. It is hard to say how much of this is because of the scheduler and/or its interaction with the high spinning time. However, it is worth investigating.
    • Syscalls that need the KERNEL_LOCK() for this workload fall into 2 categories:
  2. 2ytHD+make-j17_arm64.svg

    The goal of this test was to generate enough load to leave no CPU idle and to expose where the contention is under "desktop" usage. Almost the same amount of CPU time is spent in userland, ~30-35%, which indicates that the OpenBSD kernel isn't yet scaling to 16 CPUs for such a use case.

    • Overall spinning time is also ~40% but with a different distribution (30% on KERNEL_LOCK(), 2% on SCHED_LOCK(), 8% on UVM's pageqlock).
    • syscalls that need the KERNEL_LOCK() for this workload are the same as above (for obvious reasons) but the following are, IMHO, the most important ones:
      • The kernel lock spinning time in futex(2) is there because sleeping with PCATCH still requires it.
      • Pipes, UNIX sockets and network sockets all use selwakeup() and spin there because poll(2) & select(2) still need it.
    • With the kqpoll diff (2ytHD+make-j17+kqpoll_unlocked_arm64.svg) the contention in sowakeup() disappears; the one in pipeselwakeup() could receive the same treatment.
      2ytHD+make-j17+kqpoll_unlocked_arm64.svg
  3. 2ytHD+googlemap_arm64.svg

    The intent of this test is to expose where the contention is for a heavily multi-threaded process workload. We didn't care much about idle time; it is much more about low latency, that is, how "smoothly" desktop apps can run, in other words what happens in the kernel.

    • UVM fault unlocking is "good enough" for such a workload and all the contention is due to syscalls.
    • If we look at time spent in the kernel, 37% is spent spinning on the KERNEL_LOCK() and 12% on the SCHED_LOCK(). So almost half of %sys time is spent spinning.
      • futex(2) for FUTEX_WAIT exposes most of it. It spins on the KERNEL_LOCK() because sleeping with PCATCH requires it, then it spins on the SCHED_LOCK() to put itself on the sleep queue.
      • kevent(2), poll(2), and DRM ioctl(2) are responsible for a lot of KERNEL_LOCK() contention in this workload
      • NET_LOCK() contention in poll(2) and kqueue(2) generates a lot of sleeps which, together with a lot of futex(2) calls, makes the SCHED_LOCK() contention bad.
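
The spinning-time figures quoted for the three flame graphs can be tallied with a few lines of C. This is just a bookkeeping sketch: the percentages are copied from the text above, while the struct, the table names and the two helper functions are made up for illustration. Note that the first two tables are shares of overall CPU time, whereas the third is a share of kernel (%sys) time, so the totals are not directly comparable.

```c
#include <stddef.h>

/* One lock's share of spinning time, in percent, as quoted in the text. */
struct spin_share {
	const char *lock;
	int percent;
};

/* Graph 1 (make -j17): share of overall CPU time. */
static const struct spin_share make_j17[] = {
	{ "KERNEL_LOCK()", 18 },
	{ "SCHED_LOCK()", 10 },
	{ "pageqlock", 12 },
};

/* Graph 2 (HD video + make -j17): share of overall CPU time. */
static const struct spin_share video_make_j17[] = {
	{ "KERNEL_LOCK()", 30 },
	{ "SCHED_LOCK()", 2 },
	{ "pageqlock", 8 },
};

/* Graph 3 (HD video + map browsing): share of kernel (%sys) time. */
static const struct spin_share video_googlemap[] = {
	{ "KERNEL_LOCK()", 37 },
	{ "SCHED_LOCK()", 12 },
};

/* Sum the percentages of the first n entries. */
int
total_spin(const struct spin_share *s, size_t n)
{
	int total = 0;

	for (size_t i = 0; i < n; i++)
		total += s[i].percent;
	return total;
}

/* Return the entry with the largest share, or NULL if n is 0. */
const struct spin_share *
worst_lock(const struct spin_share *s, size_t n)
{
	const struct spin_share *worst = NULL;

	for (size_t i = 0; i < n; i++) {
		if (worst == NULL || s[i].percent > worst->percent)
			worst = &s[i];
	}
	return worst;
}
```

The first two tables both sum to 40, matching the "~40%" overall figure, and the third sums to 49, the "almost half of %sys time". In all three, worst_lock() picks the KERNEL_LOCK() entry.
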

Conclusion

Unlocking UVM fault is the obvious next step and we are not finished with that yet.

Making poll(2) & select(2) work on top of the kqueue subsystem will allow us to unlock selwakeup() & friends. This will also help for workloads with network traffic going to userland (server, proxy, etc).

Completely unlocking poll(2), select(2) and kqueue(2) will require making rwsleep(9) w/ PCATCH work without KERNEL_LOCK(). This implies making signals work w/o KERNEL_LOCK(). This will also reduce the contention in futex(2).

Unlocking UVM fault will make it easier to unlock many UVM related syscalls. This will help for workloads that fork a lot.

Pushing the KERNEL_LOCK() at the VFS border in all other syscalls that matter can already be done and should already help, so I see no reason to wait.

Questions?

All in all, quite some scope for improvements. Read the entire thing (including any followups) from your favorite archive site or local mailbox.

This promises good things on the horizon.


Comments
  1. By Tristan (tristan) tristan@etheria.eu on

    Very interesting and really want to check the userland improvements on my gnome desktop. Any idea of timeline for hitting current?
