Coming soon to a -current system near you: parallel raw IP input

Contributed by Peter N. M. Hansteen on 2024-04-11 from the all the packets all at once dept.

The work to improve the capabilities of the network stack is about to take a noticeable step forward. In a message to tech@ titled "parallel raw IP input", Alexander Bluhm (bluhm@) posted a patch, which he describes as follows:

List:       openbsd-tech
Subject:    parallel raw IP input
From:       Alexander Bluhm <bluhm () openbsd ! org>
Date:       2024-04-11 20:24:39

Hi,

As mvs@ mentioned, running raw IP in parallel is easier as it is
less complex than UDP.  Especially there is no socket splicing.

So I fixed one race in rip_input() and reused my shared net lock
ip_deliver() loop.

The idea is that ip_deliver() may run with shared or exclusive net
lock.  The last parameter indicates the mode.  If it is running
with shared netlock and encounters a protocol that needs exclusive
lock, the packet is queued.

Before ip_ours() always queued the packet.  Now it tries to deliver
with shared net lock, and if that is not possible, it queues the
packet.

In case we have an IPv6 header chain that must switch from shared
to exclusive processing, the next protocol and mbuf offset are
stored in a mbuf tag.

The only drawback is that we have very limited test coverage for
raw IP.  The ip_deliver() shared locking change works also with
UDP, Hrvoje has tested it in 2022.

ok?

bluhm

Followed by the patch itself, which should apply to a then-recent -current checkout.

A little later, the patch was committed:

CVSROOT:	/cvs
Module name:	src
Changes by:	bluhm@cvs.openbsd.org	2024/04/14 14:46:27

Modified files:
	sys/net        : if_bridge.c 
	sys/netinet    : in_proto.c ip_input.c ip_var.h 
	sys/netinet6   : ip6_input.c 
	sys/sys        : mbuf.h protosw.h 

Log message:
Run raw IP input in parallel.

Running raw IPv4 input with shared net lock in parallel is less
complex than UDP.  Especially there is no socket splicing.

New ip_deliver() may run with shared or exclusive net lock.  The
last parameter indicates the mode.  If it is running with shared
netlock and encounters a protocol that needs exclusive lock, the
packet is queued.  Old ip_ours() always queued the packet.  Now it
calls ip_deliver() with shared net lock, and if that cannot handle
the packet completely, the packet is queued and later processed
with exclusive net lock.

In case of an IPv6 header chain, that switches from shared to
exclusive processing, the next protocol and mbuf offset are stored
in a mbuf tag.

OK mvs@
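
To make the dispatch logic in the commit message easier to follow, here is a toy userland model of it. This is a sketch under stated assumptions, not the kernel code: the names `model_ip_deliver`, `model_ip_ours`, `drain_queue`, the `SHARED_OK` table and the `resume_at` field are all invented for illustration. `resume_at` stands in for the mbuf tag that records the next protocol and offset, and the choice of which protocols are marked safe under the shared lock is an assumption that merely mirrors the text (raw IP yes, UDP and TCP not yet).

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional

SHARED, EXCLUSIVE = "shared", "exclusive"

# Hypothetical table: which handlers may run under the shared net lock.
SHARED_OK = {"hopopts": True, "dstopts": True, "rawip": True,
             "udp": False, "tcp": False}


@dataclass
class Packet:
    chain: list                       # header names; the last is the upper-layer protocol
    resume_at: Optional[int] = None   # stand-in for the mbuf tag (next protocol + offset)
    log: list = field(default_factory=list)


def model_ip_deliver(pkt, mode, queue):
    """Walk the header chain; return True if the packet was fully delivered."""
    start = pkt.resume_at or 0
    for i in range(start, len(pkt.chain)):
        proto = pkt.chain[i]
        if mode == SHARED and not SHARED_OK[proto]:
            # This protocol needs the exclusive lock: remember where to
            # continue and hand the packet to the queue.
            pkt.resume_at = i
            queue.append(pkt)
            return False
        pkt.log.append((proto, mode))
    pkt.resume_at = None
    return True


def model_ip_ours(pkt, queue):
    """Try shared delivery first; the packet is queued only if that fails."""
    return model_ip_deliver(pkt, SHARED, queue)


def drain_queue(queue):
    """Process deferred packets as if holding the exclusive net lock."""
    while queue:
        model_ip_deliver(queue.popleft(), EXCLUSIVE, queue)
```

In this model a packet with the chain `["rawip"]` is delivered entirely in shared mode, while `["dstopts", "udp"]` has its extension header handled in shared mode, is queued at the UDP header, and is finished by `drain_queue()` in exclusive mode.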

Via email, bluhm@ added some further explanation:

The commit from January is sending UDP in parallel.  Socket send,
when called from userland, uses shared net lock.  You need multiple
UDP sockets and threads writing to them to see an effect.

Now we are working on parallel input.  When traffic is directed to
different IP or ports, the network hardware can distribute flows
to different receive queues.  These queues are processed by one CPU
each.  Goal is to keep processing parallel until data reaches userland.

IP input and forward runs parallel for a while.  Until last week
all protocol input was single threaded.  Now raw IP can run in
parallel.

Next step is UDP input in parallel.  It kind of works, but locking
in socket splicing is wrong.  In my experiments I see increase in
UDP throughput of factor 4 to 7.  But the locking problems are quite
nasty.  I think we need more tests that aggressively splice and
unsplice sockets.

Advantage of UDP over raw IP would be that testing is much easier.

Final protocol will be TCP, but that is hardest of all.  Single
stream TCP performance already got a performance boost by hardware
offloading.

hardware receive => IP input -> protocol input => userland =>
protocol output => IP output => hardware transmit

bluhm@ is working on -> protocol input to make it parallel for more
protocols.  It is the final bottleneck.

mvs@ is looking at => to and from userland.  They behave differently
for UNIX domain, raw IP, UDP, and TCP, the latter is still single
threaded.
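
The point about hardware distributing flows to receive queues can be illustrated with a small sketch. Real network cards compute their own hash over the packet's addresses and ports; the CRC32 used below is only a stand-in chosen for illustration, and the function name `rx_queue` and its parameters are hypothetical. What the sketch shows is the property that matters for parallel input: packets of one flow always land in the same queue, while different flows spread over several queues, each of which can then be served by its own CPU.

```python
import zlib
from collections import Counter


def rx_queue(src_ip, src_port, dst_ip, dst_port, nqueues):
    """Map a flow to a receive queue index.

    CRC32 is a stand-in here; it is NOT the hash real hardware uses.
    """
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % nqueues


def queue_spread(flows, nqueues):
    """Count how many of the given flows land in each queue."""
    return Counter(rx_queue(*flow, nqueues) for flow in flows)
```

A single flow (one source, one destination, one port pair) therefore stays on one queue and one CPU no matter how fast it is, which is why traffic has to be directed at different addresses or ports before parallel input can help.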

This all boils down to faster packets, as the system becomes steadily better at using multiple cores to process network traffic.

Testing is of course still appreciated, but this code is in any case destined for the next release.
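
For readers who want to generate the kind of load bluhm@ describes for the January change (multiple UDP sockets, each written to by its own thread), a minimal userland sketch follows. Note that it exercises only the parallel UDP send path, not the new raw IP input path. The function names, thread count, datagram count and payload size are arbitrary choices for illustration, and Python is used for brevity; a serious benchmark would more likely be written in C or use an existing traffic generator.

```python
import socket
import threading


def send_burst(dest, count, payload, results, idx):
    """Send `count` datagrams to `dest` from a socket owned by this thread."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        sent = 0
        for _ in range(count):
            s.sendto(payload, dest)
            sent += 1
        results[idx] = sent


def parallel_udp_send(dest, nthreads=4, count=100, payload=b"x" * 64):
    """Start one sender thread per socket; return the total datagrams sent."""
    results = [0] * nthreads
    threads = [
        threading.Thread(target=send_burst,
                         args=(dest, count, payload, results, i))
        for i in range(nthreads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results)
```

Calling `parallel_udp_send(("192.0.2.10", 9000))` would send 400 small datagrams from four sockets; the address and port here are placeholders to be replaced with a host under test.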
